[GE users] shepherd of job 388421.1 died through signal = 11

Roland Dittel Roland.Dittel at Sun.COM
Wed Nov 16 09:33:24 GMT 2005



Hi,

Jinal A Jhaveri wrote:
> Adding to Kelly's comments!
> 
> This is really surprising because even though the message says that the 
> shepherd died through signal = 11, when I look at the kernel messages 
> (using dmesg), I see this:
> sge_execd[12014]: segfault at 00000000000000f0 rip 00002aaaab699e3b rsp 00007fffffff7040 error 4
> sge_execd[12015]: segfault at 00000000000000f0 rip 00002aaaab699e3b rsp 00007fffffff7040 error 4
> sge_execd[12013]: segfault at 00000000000000f0 rip 00002aaaab699e3b rsp 00007fffffff7040 error 4
> sge_execd[12012]: segfault at 00000000000000f0 rip 00002aaaab699e3b rsp 00007fffffff7040 error 4

Hm, it seems the execd dies after forking but before becoming the 
shepherd. But there is not much code between the fork and the start of the shepherd.
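
If a core file can be captured, a backtrace would show exactly where the 
forked execd crashes. A rough sketch of how one might be obtained, assuming 
core dumps are currently disabled and the execd is (re)started from a shell 
on an affected node (the arch string lx24-amd64 and the core/spool paths are 
assumptions, adjust them to your installation):

    # allow core dumps, then restart the execd so it inherits the limit
    bash$ ulimit -c unlimited
    bash$ $SGE_ROOT/$SGE_CELL/common/sgeexecd stop
    bash$ $SGE_ROOT/$SGE_CELL/common/sgeexecd start

    # after the next crash, look for a core under the execd spool /
    # active_jobs directories and inspect it
    bash$ find /path/to/execd_spool_dir -name 'core*'
    bash$ gdb $SGE_ROOT/bin/lx24-amd64/sge_execd /path/to/core
    (gdb) bt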

> Also when I try to run the shepherd manually it runs fine. My suspicion is 
> that there is some severe problem with sge_execd. Here are a few other 
> symptoms:
> 
> a) So far we have only seen this with array jobs
> b) So far we have only seen this on AMD 64-bit machines running 
> Debian

Are you using local or NFS spooling for your execds? We've heard about 
strange default mount options for NFS mounts on Linux, so maybe this 
is an NFS-related issue at file creation or file close time. Can you 
please switch to a local execd_spooldir and try again?
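
For reference, a sketch of how the spool setting could be inspected and moved 
to a local disk (the host name is taken from the messages below; the path 
/var/spool/sge is only an example):

    # show the global and the host-local configuration (look for execd_spool_dir)
    bash$ qconf -sconf
    bash$ qconf -sconf node64t-02

    # set a host-local execd_spool_dir on a local filesystem; -mconf opens an
    # editor where a line like the following can be added or changed
    bash$ qconf -mconf node64t-02
          execd_spool_dir   /var/spool/sge

    # restart the execd on that host so it picks up the new spool directory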

> c) Once the node goes into an error state, all the other jobs sent to this 
> node also go into an error state (even when we have cleared the error using 
> qmod)
> d) The only files created are config, environment and pe_hostfile in 
> the active_jobs/job-id/ directory. I don't see any other files like 
> exit_status, pid, etc. I think that because sge_execd is corrupted, the 
> shepherd dies as soon as it starts.

You are right. The files "config", "environment" and "pe_hostfile" are 
created by the execd for the shepherd. The shepherd reads these files at 
startup.
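
To see how far a task got, the per-task directory can be listed directly, and 
the queue error state cleared once the cause is fixed. A small sketch (the 
spool path is a placeholder, the job/task id is taken from the messages below):

    # list what the execd wrote for one array task; a healthy task later also
    # contains pid, exit_status, usage, trace, ...
    bash$ ls -l /path/to/execd_spool_dir/node64t-02/active_jobs/389770.217/

    # clear the error state on all queue instances of that host afterwards
    bash$ qmod -c '*@node64t-02'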

> e) Even though the shepherd has died through a signal, it seems like the 
> execd still tries to reap the job and look at its various files like 
> exit_status. Here is the sequence of messages I keep seeing:
> "
> 11/15/2005 19:25:51|execd|node64t-02|E|shepherd of job 389770.217 died 
> through signal = 11
> 11/15/2005 19:25:51|execd|node64t-02|W|reaping job "389770" ptf 
> complains: Job does not exist
> 11/15/2005 19:25:51|execd|node64t-02|E|abnormal termination of 
> shepherd for job 389770.217: no "exit_status" file
> 11/15/2005 19:25:51|execd|node64t-02|E|cant open file 
> active_jobs/389770.217/error: No such file or directory
> 11/15/2005 19:25:51|execd|node64t-02|E|can't open pid 
> file "active_jobs/389770.217/pid" for job 389770.217
> "

That's the correct behavior. The shepherd should not die with a segfault :)

> Guys, I would really appreciate it if someone could help with this. I am at 
> the SuperComputing '05 conference and was hoping to talk to someone 
> about this, but I didn't find any Grid Engine people!! Is anybody 
> attending the conference?
> 
> 
> Thanks
> --Jinal
> 
> 
> 
> 
> ----- Original Message -----
> From: Kelly Felkins <KFelkins at lbl.gov>
> Date: Tuesday, November 15, 2005 5:50 pm
> Subject: Re: [GE users] shepherd of job 388421.1 died through signal = 
> 11
> 
>> Rayson Ho wrote:
>>
>>> Signal 11 is SEGV, do you have the core file sitting somewhere??
>>>
>>>  
>>>
>> I did some searching and did not find a core file. Where would you 
>> suggest I look? This appears to be failing before my scripts are run. 
>> The messages files are local to the nodes. Most of our input and 
>> output is via nfs.
>>
>>> Also, what version of GE and OS are you using? And did you compile GE 
>>> from source or just use the pre-compiled binaries??
>>>  
>>>
>> We are running 6.0u6
>> I determined this from typing 'qhost -xxx' -- is there a better 
>> way?  ;-)
>>
>> We are running a mixed cluster, mostly linux (debian) nodes on 
>> dual cpu opterons.
>> A linux node:
>>
>>    bash$ uname -a
>>    Linux nodeXXXXX 2.6.11.10.20050515 #1 SMP Mon May 16 16:55:22 PDT
>>    2005 x86_64 GNU/Linux
>>
>> A solaris node:
>>
>>    bash$ uname -a
>>    SunOS nodeXXXXX 5.9 Generic_112233-08 sun4u sparc SUNW,Netra-T12
>>
>> I'm not positive but I believe we are using pre-compiled binaries.
>>
>> Thanks for your help on this.
>>
>> -Kelly
>>
>>> Rayson
>>>
>>>
>>>
>>> On 11/15/05, Kelly Felkins <kfelkins at lbl.gov> wrote:
>>>  
>>>
>>>>   11/15/2005 11:30:26|execd|node64t-10|E|shepherd of job 388421.1 died
>>>>   through signal = 11
>>>>
>>>>
>>>> I'm seeing this error in the messages files for specific nodes on our 
>>>> cluster. At the moment we have a large array job running, so there are 
>>>> similar jobs on nearly every node. A handful of the nodes get this error 
>>>> and then the queue goes into error state. If you clear the error, soon 
>>>> another task is attempted on the node, which then experiences the same 
>>>> error and the queue goes back into error state.
>>>>
>>>> Please help me diagnose this problem.
>>>>
>>>> Thank you.
>>>>
>>>> -Kelly
>>>>
>>>>
>>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



