[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

beatrubi beat at 0x1b.ch
Tue Apr 6 13:35:02 BST 2010


Hello!

Quoting <ahaupt at ifh.de> (25.03.10 09:19):

> yesterday afternoon our SGE master started segfaulting again and again
> "out of the blue".

I saw this behaviour some time ago on a customer's system. It faded away for
months; now it's back. The sge_qmaster crashes every 5-15 minutes with
different symptoms:

Program terminated with signal 11, Segmentation fault.
#0  0x000000000059e983 in sge_htable_for_each ()
(gdb) bt
#0  0x000000000059e983 in sge_htable_for_each ()
#1  0x0000000000559b02 in cull_hash_free_descr ()
#2  0x00000000005582bf in lFreeList ()
#3  0x00000000005581f0 in lFreeElem ()
#4  0x0000000000558935 in lRemoveElem ()
#5  0x00000000005582ab in lFreeList ()
#6  0x0000000000431b25 in sge_scheduler_main ()
#7  0x00002b9925d37143 in start_thread () from /lib64/libpthread.so.0
#8  0x00002b9925f0b8cd in clone () from /lib64/libc.so.6
#9  0x0000000000000000 in ?? ()

Program terminated with signal 6, Aborted.
#0  0x00002add0d07bbb5 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00002add0d07bbb5 in raise () from /lib64/libc.so.6
#1  0x00002add0d07cfb0 in abort () from /lib64/libc.so.6
#2  0x00000000005589d8 in lRemoveElem ()
#3  0x00000000005582ab in lFreeList ()
#4  0x00000000005581f0 in lFreeElem ()
#5  0x0000000000558935 in lRemoveElem ()
#6  0x00000000005582ab in lFreeList ()
#7  0x0000000000431b25 in sge_scheduler_main ()
#8  0x00002add0cf38143 in start_thread () from /lib64/libpthread.so.0
#9  0x00002add0d10c8cd in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()

Program terminated with signal 11, Segmentation fault.
#0  0x0000000000557418 in lCopySwitchPack ()
(gdb) bt
#0  0x0000000000557418 in lCopySwitchPack ()
#1  0x0000000000557101 in lCopyElemHash ()
#2  0x000000000055708e in lCopyElem ()
#3  0x000000000055867b in lCopyListHash ()
#4  0x0000000000563109 in lSelectHashPack ()
#5  0x0000000000557433 in lCopySwitchPack ()
#6  0x0000000000557101 in lCopyElemHash ()
#7  0x000000000055708e in lCopyElem ()
#8  0x000000000055867b in lCopyListHash ()
#9  0x0000000000563109 in lSelectHashPack ()
#10 0x0000000000557433 in lCopySwitchPack ()
#11 0x0000000000557101 in lCopyElemHash ()
#12 0x000000000055708e in lCopyElem ()
#13 0x000000000055867b in lCopyListHash ()
#14 0x000000000055861e in lCopyList ()
#15 0x00000000004316e9 in sge_scheduler_main ()
#16 0x00002acac3250143 in start_thread () from /lib64/libpthread.so.0
#17 0x00002acac34248cd in clone () from /lib64/libc.so.6
#18 0x0000000000000000 in ?? ()
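
To compare the three backtraces mechanically, a small script along these
lines might help. It is only a sketch and not part of Grid Engine or gdb: the
names FRAME_RE, parse_backtrace and common_frames are made up for
illustration, and it assumes each backtrace is the plain text printed after
"(gdb) bt", as pasted above.

```python
import re

# Matches gdb frame lines such as
#   "#6  0x0000000000431b25 in sge_scheduler_main ()"
# The address part is optional because gdb omits it for some frames.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")


def parse_backtrace(text):
    """Return the function names of one backtrace, innermost frame first."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append(m.group(2))
    return frames


def common_frames(backtraces):
    """Return the resolved functions present in every backtrace, sorted."""
    sets = [set(parse_backtrace(bt)) - {"??"} for bt in backtraces]
    return sorted(set.intersection(*sets)) if sets else []


BT1 = """
#0  0x000000000059e983 in sge_htable_for_each ()
#1  0x0000000000559b02 in cull_hash_free_descr ()
#2  0x00000000005582bf in lFreeList ()
#3  0x00000000005581f0 in lFreeElem ()
#4  0x0000000000558935 in lRemoveElem ()
#5  0x00000000005582ab in lFreeList ()
#6  0x0000000000431b25 in sge_scheduler_main ()
#7  0x00002b9925d37143 in start_thread () from /lib64/libpthread.so.0
#8  0x00002b9925f0b8cd in clone () from /lib64/libc.so.6
#9  0x0000000000000000 in ?? ()
"""

BT2 = """
#0  0x00002add0d07bbb5 in raise () from /lib64/libc.so.6
#1  0x00002add0d07cfb0 in abort () from /lib64/libc.so.6
#2  0x00000000005589d8 in lRemoveElem ()
#3  0x00000000005582ab in lFreeList ()
#4  0x00000000005581f0 in lFreeElem ()
#5  0x0000000000558935 in lRemoveElem ()
#6  0x00000000005582ab in lFreeList ()
#7  0x0000000000431b25 in sge_scheduler_main ()
#8  0x00002add0cf38143 in start_thread () from /lib64/libpthread.so.0
#9  0x00002add0d10c8cd in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
"""

BT3 = """
#0  0x0000000000557418 in lCopySwitchPack ()
#1  0x0000000000557101 in lCopyElemHash ()
#2  0x000000000055708e in lCopyElem ()
#3  0x000000000055867b in lCopyListHash ()
#4  0x0000000000563109 in lSelectHashPack ()
#5  0x0000000000557433 in lCopySwitchPack ()
#6  0x0000000000557101 in lCopyElemHash ()
#7  0x000000000055708e in lCopyElem ()
#8  0x000000000055867b in lCopyListHash ()
#9  0x0000000000563109 in lSelectHashPack ()
#10 0x0000000000557433 in lCopySwitchPack ()
#11 0x0000000000557101 in lCopyElemHash ()
#12 0x000000000055708e in lCopyElem ()
#13 0x000000000055867b in lCopyListHash ()
#14 0x000000000055861e in lCopyList ()
#15 0x00000000004316e9 in sge_scheduler_main ()
#16 0x00002acac3250143 in start_thread () from /lib64/libpthread.so.0
#17 0x00002acac34248cd in clone () from /lib64/libc.so.6
#18 0x0000000000000000 in ?? ()
"""

if __name__ == "__main__":
    # prints ['clone', 'sge_scheduler_main', 'start_thread']
    print(common_frames([BT1, BT2, BT3]))
```

Apart from the thread start-up frames, the only function shared by all three
crashes is sge_scheduler_main; the first two die while freeing a list, the
third while copying one.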

GE 6.2u3, amd64 courtesy binaries, SLES10. A mix of serial and parallel jobs.

I tried to add those backtraces, and myself as CC, to issue #3251 [1]. Either
I'm too stupid or there is no way to do such things without being the
reporter of the issue.

    [1] http://gridengine.sunsource.net/issues/show_bug.cgi?id=3251

Feel free to ask for any additional feedback!

Beat

-- 
     \|/                           Beat Rubischon <beat at 0x1b.ch>
   ( 0^0 )                             http://www.0x1b.ch/~beat/
oOO--(_)--OOo---------------------------------------------------
My experiences, thoughts and dreams: http://www.0x1b.ch/blog/
