[GE users] Tools segfault for users in LDAP

shaas stefan.haas at sun.com
Tue Jan 5 14:49:32 GMT 2010


This is a known issue.

The NSS modules (Name Service Switch of glibc) do not support custom 
malloc implementations, because the library is always opened with the 
RTLD_DEEPBIND flag. As a result, the library always uses the standard 
malloc implementation and NOT the one defined by the main program (in 
our case, the jemalloc library).

To work around this, I interposed the glibc hooks in our jemalloc library 
in 6.2u5. Unfortunately, the free() hook is missing there.
So for a quick fix, either get the newest revision (1.6) of our jemalloc.c 
from maintrunk, or simply add

__free_hook = free;

to "static void sge_init_hook(void)" in 3rdparty/jemalloc/jemalloc.c.
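
To illustrate why a missing free() hook ends in "free(): invalid pointer", 
here is a toy model in C. It is only a sketch and does not touch glibc's 
real hook variables: model_malloc_hook, model_free_hook, the static arena 
(standing in for jemalloc) and simulate_library_call() are all hypothetical 
names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model only: these two pointers stand in for glibc's hook
 * variables; they are NOT the real __malloc_hook / __free_hook. */
static void *(*model_malloc_hook)(size_t) = NULL;
static void (*model_free_hook)(void *) = NULL;

/* Hypothetical stand-in for jemalloc: a bump allocator over a static arena. */
static _Alignas(16) unsigned char arena[1024];
static size_t arena_used = 0;
static int invalid_frees = 0;

/* True if p points into the arena, i.e. was handed out by arena_malloc(). */
static int arena_owns(const void *p)
{
    uintptr_t a = (uintptr_t)p, lo = (uintptr_t)arena;
    return a >= lo && a < lo + sizeof arena;
}

static void *arena_malloc(size_t n)
{
    if (n > sizeof arena)
        return NULL;
    size_t need = (n + 15u) & ~(size_t)15u;   /* round up to 16 bytes */
    if (need > sizeof arena - arena_used)
        return NULL;
    void *p = arena + arena_used;
    arena_used += need;
    return p;
}

static void arena_free(void *p)
{
    assert(p == NULL || arena_owns(p));
    /* A bump allocator has nothing to release per block. */
}

/* Stand-in for the default free(): the real one would abort on a
 * foreign pointer; the model just counts the event. */
static void default_free(void *p)
{
    if (arena_owns(p)) {
        invalid_frees++;
        return;
    }
    free(p);
}

/* What a library bound to the default allocator ends up calling. */
static void *lib_malloc(size_t n)
{
    return model_malloc_hook ? model_malloc_hook(n) : malloc(n);
}

static void lib_free(void *p)
{
    if (model_free_hook)
        model_free_hook(p);
    else
        default_free(p);
}

/* Runs one allocate/release pair as the library would and returns the
 * number of frees that reached the wrong allocator. */
int simulate_library_call(int install_free_hook)
{
    arena_used = 0;
    invalid_frees = 0;
    model_malloc_hook = arena_malloc;                        /* always hooked */
    model_free_hook = install_free_hook ? arena_free : NULL; /* the fix */

    void *p = lib_malloc(32);
    lib_free(p);
    return invalid_frees;
}
```

In this model, simulate_library_call(0) corresponds to the situation 
described above (malloc hooked, free not) and reports one misdirected free, 
while simulate_library_call(1) corresponds to the state after the one-line 
fix and reports none.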

Stefan

On 01/05/10 15:04, templedf wrote:
> Looking at that stack trace, the segfault is in libc, probably caused by 
> something in your libnss_ldap.  The routine that's triggering the 
> segfault (sge_gdi2_setup()) is called any time any command or daemon 
> tries to start a conversation with the qmaster, so if it were really a 
> Grid Engine issue, you would not have been the first to find it.  Have 
> you tried updating your libnss_ldap?
> 
> Digging a little deeper, it looks like the cfree() call at the top of 
> the stack is an oddity.  From the cfree(3) man page:
> 
>> This function should never be used. Use free(3) instead. 
> 
> Why the LDAP library is calling cfree() is beyond me.  I also just had a 
> look at the sge_gid2group() and sge_getgrgid() functions, and I can't 
> see anything there that could be the source of a bad pointer.
> 
> As a side note, this is exactly why I love Linux so much.  There's 
> nothing quite as much fun as debugging the kernel to get your 
> application working. :)
> 
> Daniel
> 
> kisielk wrote:
>> I'm trying to get SGE (6.2u4 or 6.2u5) up and running on an OpenSUSE 11.1 host. The daemon runs fine, and all the end-user tools work fine for local users. However, if a user that is defined in LDAP tries to use a tool such as qstat it segfaults. The backtrace output is as follows:
>>
>> *** glibc detected *** bin/lx24-amd64/qstat: free(): invalid pointer: 0x00007fc0af12b0e0 ***
>> ======= Backtrace: =========
>> /lib64/libc.so.6[0x7fc0af27d108]
>> /lib64/libc.so.6(cfree+0x76)[0x7fc0af27ec66]
>> /lib64/libnss_ldap.so.2[0x7fc0aece1c21]
>> /lib64/libnss_ldap.so.2[0x7fc0aecdd913]
>> /lib64/libnss_ldap.so.2[0x7fc0aecdbc22]
>> /lib64/libnss_ldap.so.2(_nss_ldap_getgrgid_r+0x53)[0x7fc0aecdc313]
>> /lib64/libc.so.6(getgrgid_r+0xec)[0x7fc0af2a848c]
>> bin/lx24-amd64/qstat(sge_getgrgid_r+0xc5)[0x53ee85]
>> bin/lx24-amd64/qstat(sge_gid2group+0x6a)[0x53e3aa]
>> bin/lx24-amd64/qstat(sge_setup2+0x1a1)[0x479911]
>> bin/lx24-amd64/qstat(sge_gdi2_setup+0x121)[0x479c71]
>> bin/lx24-amd64/qstat(main+0x125)[0x425ae5]
>> /lib64/libc.so.6(__libc_start_main+0xe6)[0x7fc0af227586]
>> bin/lx24-amd64/qstat(readdir_r+0xa2)[0x4258ea]
>>
>> I tried both the binaries as well as compiling it from source myself. Has anyone else seen this problem, and is there a known fix?
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=236395
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=236572
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=236580
