[GE users] shepherd of job 388421.1 died through signal = 11

Kelly Felkins kfelkins at lbl.gov
Wed Nov 16 01:50:14 GMT 2005


Rayson Ho wrote:

>Signal 11 is SEGV, do you have the core file sitting somewhere??
>
>  
>
I searched and did not find a core file. Where would you suggest I 
look? The failure appears to happen before my scripts are run. The 
messages files are local to the nodes. Most of our input and output 
goes over NFS.
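
One way to make the core-file search repeatable across nodes is a small 
script. The following is only a sketch under assumptions: it treats any 
file named "core" or "core.<suffix>" as a candidate (the real name 
depends on the node's kernel settings), the directories to search are 
passed on the command line, and the limit it reports is that of the 
Python process itself, not necessarily the limit the shepherd runs under.

```python
import os
import resource
import sys


def core_dumps_enabled():
    """Return True if this process's soft RLIMIT_CORE allows writing a core file."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    # A soft limit of 0 means the kernel will not write a core file.
    return soft != 0


def find_core_files(roots):
    """Yield paths of files named 'core' or 'core.<suffix>' under each root.

    The naming rule is an assumption; nodes may be configured differently.
    """
    for root in roots:
        # os.walk skips unreadable or missing directories silently by default.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name == "core" or name.startswith("core."):
                    yield os.path.join(dirpath, name)


if __name__ == "__main__":
    print("core dumps enabled for this process:", core_dumps_enabled())
    # Search roots come from the command line, e.g. a spool or working directory.
    for path in find_core_files(sys.argv[1:]):
        print(path)
```

If the limit turns out to be 0 in the environment where the jobs start, 
that alone could explain why no core file was found.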

>Also, what version of GE and OS are you using? And did you compile GE
>from source or just use the pre-compiled binaries??
>  
>
We are running 6.0u6.
I determined this by typing 'qhost -xxx' -- is there a better way?  ;-)

We are running a mixed cluster, mostly Linux (Debian) nodes on dual-CPU 
Opterons.
A Linux node:

    bash$ uname -a
    Linux nodeXXXXX 2.6.11.10.20050515 #1 SMP Mon May 16 16:55:22 PDT 2005 x86_64 GNU/Linux

A Solaris node:

    bash$ uname -a
    SunOS nodeXXXXX 5.9 Generic_112233-08 sun4u sparc SUNW,Netra-T12

I'm not positive, but I believe we are using the pre-compiled binaries.

Thanks for your help on this.

-Kelly

>Rayson
>
>
>
>On 11/15/05, Kelly Felkins <kfelkins at lbl.gov> wrote:
>  
>
>>   11/15/2005 11:30:26|execd|node64t-10|E|shepherd of job 388421.1 died through signal = 11
>>
>>
>>I'm seeing this error in the messages files for specific nodes on our
>>cluster. At the moment we have a large array job running, so there are
>>similar jobs on nearly every node. A handful of the nodes get this error
>>and then the queue goes into error state. If you clear the error, soon
>>another task is attempted on the node, which then experiences the same
>>error and the queue goes back into error state.
>>
>>Please help me diagnose this problem.
>>
>>Thank you.
>>
>>-Kelly
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
>>    
>>
>

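Since the messages files are local to each node, it may help to pull 
the shepherd failures out of them mechanically. The sketch below 
assumes the pipe-delimited layout of the entry quoted above (timestamp, 
daemon, host, level, message) and assumes "388421.1" is a job id 
followed by an array task id. The field names and the function name are 
made up for illustration and are not part of Grid Engine.

```python
import re
import signal

# Matches the message text of the entry quoted in this thread.
SHEPHERD_DEATH = re.compile(
    r"shepherd of job (\d+)\.(\d+) died through signal = (\d+)"
)


def parse_shepherd_death(line):
    """Return a dict describing a shepherd-death entry, or None if no match."""
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, daemon, host, level, message = parts
    match = SHEPHERD_DEATH.search(message)
    if match is None:
        return None
    signo = int(match.group(3))
    try:
        # Map the number to the local platform's name for that signal.
        signame = signal.Signals(signo).name
    except ValueError:
        signame = "unknown"
    return {
        "timestamp": timestamp,
        "daemon": daemon,
        "host": host,
        "level": level,
        "job_id": int(match.group(1)),
        "task_id": int(match.group(2)),
        "signal": signo,
        "signal_name": signame,
    }


if __name__ == "__main__":
    sample = ("11/15/2005 11:30:26|execd|node64t-10|E|"
              "shepherd of job 388421.1 died through signal = 11")
    print(parse_shepherd_death(sample))
```

Running this over the messages file of each failing node would show 
whether the same signal and the same hosts keep recurring.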

