[GE users] sge_shepherd crashes

craffi dag at sonsorol.org
Wed Nov 5 18:46:32 GMT 2008


Roger,

We had some maddening SGE stability issues with Mac OS X 10.5.x that  
all went away when we moved the SGE binaries under the new 10.5  
'launchd' framework.

The process, some utility scripts, and other notes were written up here:
http://wiki.gridengine.info/wiki/index.php/GridEngine_launchd

Since we moved our 10.5 Apple systems over to launchd, all stability
issues have disappeared. We never identified a root cause, though.
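
For anyone unfamiliar with launchd, here is a minimal sketch of what a
job definition for sge_execd might look like. It is not the script from
the wiki page above. The label is made up, the sge_execd path is only
inferred from the sge_shepherd path in the crash report below, and the
use of SGE_ND to keep the daemon in the foreground (which launchd
expects) is an assumption to check against your SGE version's
documentation.

```python
import plistlib

# Assumed values: SGE_ROOT and ARCH are inferred from the crash report
# paths (/cluster/sge/bin/darwin-ppc/...); adjust for your site.
SGE_ROOT = "/cluster/sge"
ARCH = "darwin-ppc"


def build_execd_job(sge_root=SGE_ROOT, arch=ARCH, cell="default"):
    """Return a launchd job dictionary for sge_execd (illustrative only)."""
    return {
        # Hypothetical label; any unique reverse-DNS style string works.
        "Label": "net.sunsource.gridengine.sge_execd",
        "ProgramArguments": [f"{sge_root}/bin/{arch}/sge_execd"],
        "EnvironmentVariables": {
            "SGE_ROOT": sge_root,
            "SGE_CELL": cell,
            # Assumption: SGE_ND asks the daemon not to detach, so that
            # launchd can supervise the process directly.
            "SGE_ND": "1",
        },
        "RunAtLoad": True,   # start when the job is loaded
        "KeepAlive": True,   # restart the daemon if it exits
    }


def render_plist(job):
    """Serialize the job dictionary as an XML property list string."""
    return plistlib.dumps(job).decode("utf-8")


if __name__ == "__main__":
    print(render_plist(build_execd_job()))
```

The output would be saved as a .plist file under /Library/LaunchDaemons
and loaded with "launchctl load"; the file name is up to you.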

-Chris

On Oct 30, 2008, at 10:13 PM, Roger Herikstad wrote:

> Hi list,
> I was hoping someone could help with a problem we are having. We are
> running a cluster of 7 Mac machines, all running OS X 10.5.5, some
> G5s, some Mac Pros. Recently, the sge_shepherd process has been
> crashing on the PPCs almost immediately after a job starts running on
> the machine. I was wondering whether there is a known issue with some
> of the recent security updates from Apple, as the problem only
> surfaced after installing these updates. Below is the crash report
> from one of the PPCs:
>
> Process:         sge_shepherd [36463]
> Path:            /cluster/sge/bin/darwin-ppc/sge_shepherd
> Identifier:      sge_shepherd
> Version:         ??? (???)
> Code Type:       PPC (Native)
> Parent Process:  sge_execd [139]
>
> Date/Time:       2008-10-31 10:00:38.833 +0800
> OS Version:      Mac OS X 10.5.5 (9F33)
> Report Version:  6
>
> Exception Type:  EXC_BAD_ACCESS (SIGBUS)
> Exception Codes: 0x000000000000000a, 0x000000000026a868
> Crashed Thread:  0
>
> Thread 0 Crashed:
> 0   dyld   0x8fe16f4c ImageLoaderMachO::findExportedSymbol(char const*, void const*, bool, ImageLoader const**) const + 412
> 1   dyld   0x8fe13dec ImageLoaderMachO::resolveUndefined(ImageLoader::LinkContext const&, macho_nlist const*, bool, ImageLoader const**) + 992
> 2   dyld   0x8fe142e4 ImageLoaderMachO::doBindIndirectSymbolPointers(ImageLoader::LinkContext const&, bool, bool, bool) + 572
> 3   dyld   0x8fe0da14 ImageLoader::recursiveBind(ImageLoader::LinkContext const&, bool) + 140
> 4   dyld   0x8fe0d9e4 ImageLoader::recursiveBind(ImageLoader::LinkContext const&, bool) + 92
> 5   dyld   0x8fe1103c ImageLoader::link(ImageLoader::LinkContext const&, bool, bool, ImageLoader::RPathChain const&) + 336
> 6   dyld   0x8fe05250 dyld::link(ImageLoader*, bool, ImageLoader::RPathChain const&) + 372
> 7   dyld   0x8fe07fb4 dyld::_main(mach_header const*, unsigned long, int, char const**, char const**, char const**) + 3024
> 8   dyld   0x8fe01770 dyldbootstrap::start(mach_header const*, int, char const**, long) + 988
> 9   dyld   0x8fe01044 _dyld_start + 56
>
> Thread 0 crashed with PPC Thread State 32:
>  srr0: 0x8fe16f4c  srr1: 0x0000d030   dar: 0x0026a868 dsisr: 0x40000000
>    r0: 0x00000d40    r1: 0xbfffe380    r2: 0x00003500    r3: 0x0026f448
>    r4: 0x00267368    r5: 0x0025a044    r6: 0x000007f2    r7: 0x000006a0
>    r8: 0x0000054e    r9: 0x00278edf   r10: 0x0017ad55   r11: 0x0017ad55
>   r12: 0x8fe16db0   r13: 0x00000001   r14: 0x00177180   r15: 0x8fe312cc
>   r16: 0x00000000   r17: 0x00000001   r18: 0x8fe312dc   r19: 0x8fe327a0
>   r20: 0x0000000c   r21: 0x00145400   r22: 0x00000001   r23: 0x8fe34800
>   r24: 0x00000001   r25: 0x0017ad56   r26: 0xbfffe558   r27: 0x00000000
>   r28: 0x8fe348cc   r29: 0xffffffed   r30: 0x00264aa8   r31: 0x8fe16dbc
>    cr: 0x84002084   xer: 0x00000000    lr: 0x8fe16dbc   ctr: 0x8fe16db0
> vrsave: 0x00000000
>
> Binary Images:
>     0x1000 -   0x110ff3 +sge_shepherd ??? (???) /cluster/sge/bin/darwin-ppc/sge_shepherd
>   0x145000 -   0x16fff7 +libssl.0.9.7.dylib ??? (???) <5dac2e94552ad76696c35bd6886f5a92> /cluster/sge/lib/darwin-ppc/libssl.0.9.7.dylib
>   0x17e000 -   0x238fff +libcrypto.0.9.7.dylib ??? (???) <4ea3d7e9a1c28ac7b17ed80873fe6598> /cluster/sge/lib/darwin-ppc/libcrypto.0.9.7.dylib
> 0x8fe00000 - 0x8fe30b23  dyld 96.2 (???) <39109181acbf30fed542e6c9abcf1798> /usr/lib/dyld
> 0x901ea000 - 0x90383fe3  libSystem.B.dylib ??? (???) <787ea59c19201d04a507b13d2bb3f9ac> /usr/lib/libSystem.B.dylib
> 0x907ce000 - 0x907d9ffb  libgcc_s.1.dylib ??? (???) <ea47fd375407f162c76d14d64ba246cd> /usr/lib/libgcc_s.1.dylib
> 0x952bc000 - 0x952c1ff6  libmathCommon.A.dylib ??? (???) /usr/lib/system/libmathCommon.A.dylib
> 0xffff8000 - 0xffff9703  libSystem.B.dylib ??? (???) /usr/lib/libSystem.B.dylib
>
> I would be very happy if anyone could offer some help, or point me in
> the right direction on this issue. Thanks a lot!
>
> ~ Roger
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88135
