[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

dom marco.donauer at sun.com
Wed Apr 7 11:37:16 BST 2010


Hi,

It looks like it is always SLES systems that are affected by this issue.
Do you know whether anything happens just before the segfault appears?
Did anything change when the symptom disappeared or when it came back?

Marco


On 06.04.2010 14:35, beatrubi wrote:
> Hello!
>
> Quoting <ahaupt at ifh.de> (25.03.10 09:19):
>
>   
>> yesterday afternoon our SGE master started segfaulting again and again
>> "out of the blue".
>>     
> I saw this behaviour some time ago on a customer's system. It faded away for
> months, now it's back. The sge_qmaster crashes every 5-15 minutes with
> different symptoms:
>
> Program terminated with signal 11, Segmentation fault.
> #0  0x000000000059e983 in sge_htable_for_each ()
> (gdb) bt
> #0  0x000000000059e983 in sge_htable_for_each ()
> #1  0x0000000000559b02 in cull_hash_free_descr ()
> #2  0x00000000005582bf in lFreeList ()
> #3  0x00000000005581f0 in lFreeElem ()
> #4  0x0000000000558935 in lRemoveElem ()
> #5  0x00000000005582ab in lFreeList ()
> #6  0x0000000000431b25 in sge_scheduler_main ()
> #7  0x00002b9925d37143 in start_thread () from /lib64/libpthread.so.0
> #8  0x00002b9925f0b8cd in clone () from /lib64/libc.so.6
> #9  0x0000000000000000 in ?? ()
>
> Program terminated with signal 6, Aborted.
> #0  0x00002add0d07bbb5 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00002add0d07bbb5 in raise () from /lib64/libc.so.6
> #1  0x00002add0d07cfb0 in abort () from /lib64/libc.so.6
> #2  0x00000000005589d8 in lRemoveElem ()
> #3  0x00000000005582ab in lFreeList ()
> #4  0x00000000005581f0 in lFreeElem ()
> #5  0x0000000000558935 in lRemoveElem ()
> #6  0x00000000005582ab in lFreeList ()
> #7  0x0000000000431b25 in sge_scheduler_main ()
> #8  0x00002add0cf38143 in start_thread () from /lib64/libpthread.so.0
> #9  0x00002add0d10c8cd in clone () from /lib64/libc.so.6
> #10 0x0000000000000000 in ?? ()
>
> Program terminated with signal 11, Segmentation fault.
> #0  0x0000000000557418 in lCopySwitchPack ()
> (gdb) bt
> #0  0x0000000000557418 in lCopySwitchPack ()
> #1  0x0000000000557101 in lCopyElemHash ()
> #2  0x000000000055708e in lCopyElem ()
> #3  0x000000000055867b in lCopyListHash ()
> #4  0x0000000000563109 in lSelectHashPack ()
> #5  0x0000000000557433 in lCopySwitchPack ()
> #6  0x0000000000557101 in lCopyElemHash ()
> #7  0x000000000055708e in lCopyElem ()
> #8  0x000000000055867b in lCopyListHash ()
> #9  0x0000000000563109 in lSelectHashPack ()
> #10 0x0000000000557433 in lCopySwitchPack ()
> #11 0x0000000000557101 in lCopyElemHash ()
> #12 0x000000000055708e in lCopyElem ()
> #13 0x000000000055867b in lCopyListHash ()
> #14 0x000000000055861e in lCopyList ()
> #15 0x00000000004316e9 in sge_scheduler_main ()
> #16 0x00002acac3250143 in start_thread () from /lib64/libpthread.so.0
> #17 0x00002acac34248cd in clone () from /lib64/libc.so.6
> #18 0x0000000000000000 in ?? ()
>
> GE 6.2u3, amd64 courtesy binaries, SLES10. A mix of serial and parallel jobs.
>
> I tried to add those backtraces, and myself as CC, to issue #3251 [1]. Either
> I'm too stupid or there isn't a way to do such things without being the
> reporter of the issue.
>
>     [1] http://gridengine.sunsource.net/issues/show_bug.cgi?id=3251
>
> Feel free to ask for any additional feedback!
>
> Beat
>
>
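
All three backtraces quoted above pass through sge_scheduler_main, so the crashes happen in the scheduler thread. When many core files have to be compared, a small script can surface such shared frames. The following is only a minimal sketch in Python: the names parse_backtraces, common_frames and SAMPLE are hypothetical helpers for illustration and are not part of Grid Engine or gdb, and SAMPLE is an abridged excerpt of the traces above (some frames omitted).

```python
import re
from collections import Counter

# Matches gdb frame lines such as "#6  0x0000000000431b25 in sge_scheduler_main ()".
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-fA-F]+ in (\S+) \(\)")

# Abridged excerpt of two of the quoted backtraces (assumption: frames omitted for brevity).
SAMPLE = """
> Program terminated with signal 11, Segmentation fault.
> #0  0x000000000059e983 in sge_htable_for_each ()
> (gdb) bt
> #0  0x000000000059e983 in sge_htable_for_each ()
> #1  0x0000000000559b02 in cull_hash_free_descr ()
> #2  0x00000000005582bf in lFreeList ()
> #6  0x0000000000431b25 in sge_scheduler_main ()
>
> Program terminated with signal 6, Aborted.
> #0  0x00002add0d07bbb5 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00002add0d07bbb5 in raise () from /lib64/libc.so.6
> #2  0x00000000005589d8 in lRemoveElem ()
> #3  0x00000000005582ab in lFreeList ()
> #7  0x0000000000431b25 in sge_scheduler_main ()
"""


def parse_backtraces(text):
    """Return one list of function names (innermost first) per '(gdb) bt' block."""
    traces, current = [], None
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # drop e-mail quote markers
        if line.startswith("Program terminated"):
            current = None  # the frame gdb prints before 'bt' is ignored
            continue
        if line.startswith("(gdb) bt"):
            current = []
            traces.append(current)
            continue
        match = FRAME_RE.match(line)
        if match and current is not None:
            current.append(match.group(2))
    return traces


def common_frames(traces):
    """Return the sorted function names that occur in every backtrace."""
    counts = Counter(name for trace in traces for name in set(trace))
    return sorted(name for name, n in counts.items() if n == len(traces))


if __name__ == "__main__":
    print(common_frames(parse_backtraces(SAMPLE)))
```

Frames shared by every trace only narrow down where to look; they do not by themselves identify the root cause.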

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=252552
