[GE users] sge_shepherd crashes

Roger Herikstad roger.herikstad at gmail.com
Thu Nov 6 03:29:45 GMT 2008


Chris,
 Thanks! I just installed the scripts and they seem to do the trick.
I'll keep monitoring, though. Thanks for the advice!

~ Roger

On Thu, Nov 6, 2008 at 2:46 AM, Chris Dagdigian <dag at sonsorol.org> wrote:
>
> Roger,
>
> We had some maddening SGE stability issues with Mac OS X 10.5.x that all
> went away when we moved the SGE binaries under the new 10.5 'launchd'
> framework.
>
> The process, some utility scripts, and other notes were written up here:
> http://wiki.gridengine.info/wiki/index.php/GridEngine_launchd
>
> Since we moved our 10.5 Apple systems over to launchd, all stability issues
> have disappeared. We never identified a root cause, though.
>
> -Chris
>
> On Oct 30, 2008, at 10:13 PM, Roger Herikstad wrote:
>
>> Hi list,
>> I was hoping someone could help with a problem we are having. We are
>> running a cluster of 7 Mac machines, all running OS X 10.5.5, some
>> G5s, some Mac Pros. Recently, the sge_shepherd process has been
>> crashing on the PPCs almost immediately after a job starts running on
>> the machine. I was wondering whether there is a known issue with some
>> of the recent security updates from Apple, as the problem only
>> surfaced after we installed them. Below is the crash report from one
>> of the PPCs:
>>
>> Process:         sge_shepherd [36463]
>> Path:            /cluster/sge/bin/darwin-ppc/sge_shepherd
>> Identifier:      sge_shepherd
>> Version:         ??? (???)
>> Code Type:       PPC (Native)
>> Parent Process:  sge_execd [139]
>>
>> Date/Time:       2008-10-31 10:00:38.833 +0800
>> OS Version:      Mac OS X 10.5.5 (9F33)
>> Report Version:  6
>>
>> Exception Type:  EXC_BAD_ACCESS (SIGBUS)
>> Exception Codes: 0x000000000000000a, 0x000000000026a868
>> Crashed Thread:  0
>>
>> Thread 0 Crashed:
>> 0   dyld                                0x8fe16f4c ImageLoaderMachO::findExportedSymbol(char const*, void const*, bool, ImageLoader const**) const + 412
>> 1   dyld                                0x8fe13dec ImageLoaderMachO::resolveUndefined(ImageLoader::LinkContext const&, macho_nlist const*, bool, ImageLoader const**) + 992
>> 2   dyld                                0x8fe142e4 ImageLoaderMachO::doBindIndirectSymbolPointers(ImageLoader::LinkContext const&, bool, bool, bool) + 572
>> 3   dyld                                0x8fe0da14 ImageLoader::recursiveBind(ImageLoader::LinkContext const&, bool) + 140
>> 4   dyld                                0x8fe0d9e4 ImageLoader::recursiveBind(ImageLoader::LinkContext const&, bool) + 92
>> 5   dyld                                0x8fe1103c ImageLoader::link(ImageLoader::LinkContext const&, bool, bool, ImageLoader::RPathChain const&) + 336
>> 6   dyld                                0x8fe05250 dyld::link(ImageLoader*, bool, ImageLoader::RPathChain const&) + 372
>> 7   dyld                                0x8fe07fb4 dyld::_main(mach_header const*, unsigned long, int, char const**, char const**, char const**) + 3024
>> 8   dyld                                0x8fe01770 dyldbootstrap::start(mach_header const*, int, char const**, long) + 988
>> 9   dyld                                0x8fe01044 _dyld_start + 56
>>
>> Thread 0 crashed with PPC Thread State 32:
>>  srr0: 0x8fe16f4c  srr1: 0x0000d030   dar: 0x0026a868 dsisr: 0x40000000
>>   r0: 0x00000d40    r1: 0xbfffe380    r2: 0x00003500    r3: 0x0026f448
>>   r4: 0x00267368    r5: 0x0025a044    r6: 0x000007f2    r7: 0x000006a0
>>   r8: 0x0000054e    r9: 0x00278edf   r10: 0x0017ad55   r11: 0x0017ad55
>>  r12: 0x8fe16db0   r13: 0x00000001   r14: 0x00177180   r15: 0x8fe312cc
>>  r16: 0x00000000   r17: 0x00000001   r18: 0x8fe312dc   r19: 0x8fe327a0
>>  r20: 0x0000000c   r21: 0x00145400   r22: 0x00000001   r23: 0x8fe34800
>>  r24: 0x00000001   r25: 0x0017ad56   r26: 0xbfffe558   r27: 0x00000000
>>  r28: 0x8fe348cc   r29: 0xffffffed   r30: 0x00264aa8   r31: 0x8fe16dbc
>>   cr: 0x84002084   xer: 0x00000000    lr: 0x8fe16dbc   ctr: 0x8fe16db0
>> vrsave: 0x00000000
>>
>> Binary Images:
>>     0x1000 -   0x110ff3 +sge_shepherd ??? (???) /cluster/sge/bin/darwin-ppc/sge_shepherd
>>   0x145000 -   0x16fff7 +libssl.0.9.7.dylib ??? (???) <5dac2e94552ad76696c35bd6886f5a92> /cluster/sge/lib/darwin-ppc/libssl.0.9.7.dylib
>>   0x17e000 -   0x238fff +libcrypto.0.9.7.dylib ??? (???) <4ea3d7e9a1c28ac7b17ed80873fe6598> /cluster/sge/lib/darwin-ppc/libcrypto.0.9.7.dylib
>> 0x8fe00000 - 0x8fe30b23  dyld 96.2 (???) <39109181acbf30fed542e6c9abcf1798> /usr/lib/dyld
>> 0x901ea000 - 0x90383fe3  libSystem.B.dylib ??? (???) <787ea59c19201d04a507b13d2bb3f9ac> /usr/lib/libSystem.B.dylib
>> 0x907ce000 - 0x907d9ffb  libgcc_s.1.dylib ??? (???) <ea47fd375407f162c76d14d64ba246cd> /usr/lib/libgcc_s.1.dylib
>> 0x952bc000 - 0x952c1ff6  libmathCommon.A.dylib ??? (???) /usr/lib/system/libmathCommon.A.dylib
>> 0xffff8000 - 0xffff9703  libSystem.B.dylib ??? (???) /usr/lib/libSystem.B.dylib
>>
>> I would be very happy if anyone could offer some help, or point me in
>> the right direction on this issue. Thanks a lot!
>>
>> ~ Roger
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88151
