[GE users] shepherd of job 388421.1 died through signal = 11

Jinal A Jhaveri JAJhaveri at lbl.gov
Wed Nov 16 03:47:42 GMT 2005


Adding to Kelly's comments!

This is really surprising because even though the message says that the
shepherd died through signal = 11, when I look at the kernel messages
(using dmesg), I see this:
sge_execd[12014]: segfault at 00000000000000f0 rip 00002aaaab699e3b rsp 00007fffffff7040 error 4
sge_execd[12015]: segfault at 00000000000000f0 rip 00002aaaab699e3b rsp 00007fffffff7040 error 4
sge_execd[12013]: segfault at 00000000000000f0 rip 00002aaaab699e3b rsp 00007fffffff7040 error 4
sge_execd[12012]: segfault at 00000000000000f0 rip 00002aaaab699e3b rsp 00007fffffff7040 error 4
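
Since all four dmesg lines show the same faulting address, a core dump
plus a backtrace would pinpoint the crash site. Here is a minimal
sketch of how one might capture that, assuming core dumps are currently
disabled and that the binaries live under $SGE_ROOT (the lx24-amd64
arch string and the core file path are assumptions -- adjust for your
install):

   # allow core files in the shell that will start the daemon
   bash$ ulimit -c unlimited
   # give cores a predictable name and location (path is an assumption)
   bash$ sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p
   # restart execd from this shell so it inherits the core limit
   bash$ $SGE_ROOT/bin/lx24-amd64/sge_execd
   # after the next segfault, load the core and get a backtrace
   bash$ gdb $SGE_ROOT/bin/lx24-amd64/sge_execd /var/tmp/core.sge_execd.<pid>
   (gdb) bt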


Also, when I try to run the shepherd manually, it runs fine. My
suspicion is that there is some severe problem with sge_execd itself.
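
Since the crash only seems to happen when execd spawns the shepherd,
one way to catch it in context might be to restart execd under strace.
A sketch, assuming the usual binary and init script locations (both are
assumptions for this cluster):

   # stop the running daemon first (init script name is an assumption)
   bash$ /etc/init.d/sgeexecd stop
   # -ff writes one trace file per child, so the forked shepherd is covered
   bash$ strace -ff -o /tmp/execd-trace $SGE_ROOT/bin/lx24-amd64/sge_execd
   # after a task fails, find which child took the SIGSEGV
   bash$ grep -l SIGSEGV /tmp/execd-trace.*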

Here are a few other symptoms:

a) So far we have only seen this with array jobs.
b) So far we have only seen this on AMD 64-bit machines running Debian.
c) Once the node goes into an error state, all other jobs sent to that
node also go into an error state, even after we have cleared the error
using qmod.
d) The only files created in the active_jobs/job-id/ directory are
config, environment, and pe_hostfile. I don't see any of the other
files like exit_status, pid, etc. (see the sketch after the log excerpt
below). I think that because sge_execd is corrupted, the shepherd dies
as soon as it starts.
e) Even though the shepherd has died through a signal, execd still
tries to reap the job and look at its various files like exit_status.
Here is the sequence of messages I keep seeing:
"
11/15/2005 19:25:51|execd|node64t-02|E|shepherd of job 389770.217 died 
through signal = 11
11/15/2005 19:25:51|execd|node64t-02|W|reaping job "389770" ptf 
complains: Job does not exist
11/15/2005 19:25:51|execd|node64t-02|E|abnormal termination of 
shepherd for job 389770.217: no "exit_status" file
11/15/2005 19:25:51|execd|node64t-02|E|cant open file 
active_jobs/389770.217/error: No such file or directory
11/15/2005 19:25:51|execd|node64t-02|E|can't open pid 
file "active_jobs/389770.217/pid" for job 389770.217
"


Guys, I would really appreciate it if someone could help with this. I
am at the SuperComputing '05 conference and was hoping to talk to
someone about this, but I didn't find any Grid Engine people! Is
anybody attending the conference?


Thanks
--Jinal




----- Original Message -----
From: Kelly Felkins <KFelkins at lbl.gov>
Date: Tuesday, November 15, 2005 5:50 pm
Subject: Re: [GE users] shepherd of job 388421.1 died through signal = 11

> Rayson Ho wrote:
> 
> >Signal 11 is SEGV, do you have the core file sitting somewhere??
> >
> >  
> >
> I did some searching and did not find a core file. Where would you
> suggest I look? This appears to be failing before my scripts are run.
> The messages files are local to the nodes. Most of our input and
> output is via nfs.
> 
> >Also, what version of GE and OS are you using? And did you compile GE
> >from source or just use the pre-compiled binaries??
> >
> We are running 6.0u6
> I determined this from typing 'qhost -xxx' -- is there a better 
> way?  ;-)
> 
> We are running a mixed cluster, mostly linux (debian) nodes on
> dual cpu opterons.
> A linux node:
> 
>    bash$ uname -a
>    Linux nodeXXXXX 2.6.11.10.20050515 #1 SMP Mon May 16 16:55:22 PDT
>    2005 x86_64 GNU/Linux
> 
> A solaris node:
> 
>    bash$ uname -a
>    SunOS nodeXXXXX 5.9 Generic_112233-08 sun4u sparc SUNW,Netra-T12
> 
> I'm not positive but I believe we are using pre-compiled binaries.
> 
> Thanks for your help on this.
> 
> -Kelly
> 
> >Rayson
> >
> >
> >
> >On 11/15/05, Kelly Felkins <kfelkins at lbl.gov> wrote:
> >  
> >
> >>   11/15/2005 11:30:26|execd|node64t-10|E|shepherd of job 388421.1 died through signal = 11
> >>
> >>
> >>I'm seeing this error in the messages files for specific nodes on our
> >>cluster. At the moment we have a large array job running, so there are
> >>similar jobs on nearly every node. A handful of the nodes get this error
> >>and then the queue goes into error state. If you clear the error, soon
> >>another task is attempted on the node, which then experiences the same
> >>error and the queue goes back into error state.
> >>
> >>Please help me diagnose this problem.
> >>
> >>Thank you.
> >>
> >>-Kelly
