[GE users] shepherd exited with exit status 19

Gonçalo Borges goncalo at lip.pt
Wed May 7 15:37:34 BST 2008




Hi All,

Thanks to everyone who tried to help! I think I understand why this 
was happening, although I also think there is some kind of SGE bug involved.

Let me summarize:

1) We were able to correlate those shepherd errors with controlled 
sge_execd shutdowns. A recent change in our fabric management system was 
restarting sge_execd in all queue instances, and afterwards, we were 
getting messages like:

05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
05/02/2008 18:51:36|execd|lflip19|I|found directory of job 
"active_jobs/330153.1"
05/02/2008 18:51:36|execd|lflip19|I|shepherd for job 
active_jobs/330153.1 has pid "16717" and is not alive
05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd for 
job 330153.1: "exit_status" file is empty
05/02/2008 18:51:36|execd|lflip19|E|can't open usage file 
"active_jobs/330153.1/usage" for job 330153.1: No such file or directory

2) I really don't know whether execd has a mechanism to recover running 
jobs across a controlled shutdown... If it has, it is not working 
properly, at least in our configuration: the running jobs disappear from 
SGE (apparently, at least), because the shepherd dies but the processes 
it started keep running on the machine... Looking at the process tree 
(pstree -Gap) after the controlled execd shutdown, you would see the user 
script hanging directly off "init,1" (a tiny demonstration of this 
reparenting follows the tree below):

init,1
|-sh,21479 /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333897
| |-bootstrap.k2181,21813 -w /tmp/bootstrap.k21810 /home/cms067/ 
ce02.lip.pt 
/home/cms067/.globus/job/ce02.lip.pt/22189.1210069337/x509_up ...
| |-bootstrap.k2181,21817 -w /tmp/bootstrap.k21810 /home/cms067/ 
ce02.lip.pt ...
| |-bootstrap.k2181,21974 -w /tmp/bootstrap.k21810 /home/cms067/ 
ce02.lip.pt ...
| |-sh,22013 -c...
| |-jobwrapper,22014 /opt/lcg/libexec/jobwrapper 
/home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
| |-globus-tmp.lfli,22015 
/home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
| |-globus-tmp.lfli,22134 
/home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
| |-time,22135 -p perl -e...
| |-sh,22137 -c ...
| |-jobExecutor,22138 2
| |-ch_tool,22149
| | |-programExecutor,22152 1
| | |-BossRuntime_cra,22158 ./BossRuntime_crabjob
| | |-CMSSW.sh,22162 CMSSW.sh 2 ...
| | | |-cmsRun,22330 -j crab_fjr.xml -p pset.cfg
| | |-programExecutor,22159 1
| | |-tee,22160 /tmp//BossTeePipe-crabjob22152
| | |-tee,22161 /tmp//BossTeePipe-crabjob22152
| |-dbUpdator,22144 1_2_1 746b7b06-ab68-4ae2-a9fe-0c7a94c6cf42 RTConfig.clad
| |-sleep,31466 30

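Just to make the reparenting explicit: when a process dies without killing
its children, the kernel hands the orphans to init (pid 1), which is exactly
why the job script shows up directly under "init,1" once its shepherd is
gone. A tiny standalone demonstration of the mechanism (plain POSIX, nothing
SGE-specific):

/* Minimal demonstration of orphan reparenting (not SGE code): the child
 * outlives its parent and is adopted by init (pid 1), just like the job
 * script in the pstree output above. */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0) {                      /* child: stand-in for the job script */
        printf("child: ppid before parent exit = %d\n", (int)getppid());
        sleep(2);                        /* wait until the parent has exited */
        printf("child: ppid after parent exit  = %d\n", (int)getppid());
        _exit(0);
    }

    /* parent: stand-in for the dying sge_shepherd */
    printf("parent (pid %d) exits without killing its child\n", (int)getpid());
    return 0;                            /* child keeps running under init */
}
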

3) The problem (bug?) which I think may exist is connected to:
a) If execd has a mechanism to recover running jobs across a restart, 
it doesn't seem to be working properly...
b) If execd does not have a mechanism to recover running jobs across a 
controlled shutdown, then it should kill all active jobs (and their 
child processes) properly. This doesn't seem to be the case either (a 
rough cleanup sketch follows this list).
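
About 3 b): since the job shell sits in its own process group (see the ps
output further down, where PID = PGID = SID = 16986 for the job shell), one
manual workaround is to signal the whole group instead of individual pids.
A rough sketch, assuming you have already looked up the job shell's process
group id with something like 'ps -jxa'; this is just a cleanup idea, not
something execd does for us:

/* Manual cleanup sketch (assumption: the job shell's process group id was
 * looked up beforehand, e.g. with `ps -jxa`).  Signals the whole process
 * group so all children of the orphaned job are terminated as well. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pgid-of-job-shell>\n", argv[0]);
        return 1;
    }

    pid_t pgid = (pid_t)atoi(argv[1]);

    /* try a polite shutdown first, then force it */
    if (killpg(pgid, SIGTERM) == -1)
        perror("killpg(SIGTERM)");

    sleep(5);                            /* give the job a chance to exit cleanly */

    if (killpg(pgid, SIGKILL) == -1 && errno != ESRCH)
        perror("killpg(SIGKILL)");

    return 0;
}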

4) Check the following example... This is the process tree for a 
normal job:

init,1
|-sge_execd,9328
| |-sge_shepherd,16984 -bg
| | |-sh,16986 /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815
| | |-bootstrap.P1737,17375 -w /tmp/bootstrap.P17371 /home/cms067/ 
ce02.lip.pt ...
| | |-bootstrap.P1737,17384 -w /tmp/bootstrap.P17371 /home/cms067/ 
ce02.lip.pt ...
| | |-bootstrap.P1737,17558 -w /tmp/bootstrap.P17371 /home/cms067/ 
ce02.lip.pt ...
| | |-sh,17653 -c...
| | |-jobwrapper,17656 /opt/lcg/libexec/jobwrapper 
/home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
| | |-globus-tmp.lfli,17658 
/home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
| | |-globus-tmp.lfli,18122 
/home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
| | |-time,18123 -p perl -e...
| | |-sh,18125 -c ...
| | |-jobExecutor,18127 15
| | |-ch_tool,18144
| | | |-programExecutor,18147 1
| | | |-BossRuntime_cra,18155 ./BossRuntime_crabjob
| | | |-CMSSW.sh,18159 CMSSW.sh 15 ...
| | | | |-cmsRun,18441 -j crab_fjr.xml -p pset.cfg
| | | |-programExecutor,18156 1
| | | |-tee,18157 /tmp//BossTeePipe-crabjob18147
| | | |-tee,18158 /tmp//BossTeePipe-crabjob18147
| | |-dbUpdator,18137 1_15_1 6a4a2a4d-a0af-4f88-a8c2-d810a4a6bfa6 
RTConfig.clad
| | |-sleep,30576 30


If you look at the different process IDs, you will see that sge_execd 
and the shepherd share the same SID:

PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
1 9328 9328 1801 ? -1 S 0 16:10 /usr/local/sge/V61u3/bin/lx26-x86/sge_execd
9328 16984 16984 1801 ? -1 S 0 0:00 sge_shepherd-333815 -bg
9328 21478 21478 1801 ? -1 S 0 0:00 sge_shepherd-333897 -bg

but the shepherd does not share the same SID with the user job...

[root at lflip32 ~]# ps -jxa | grep 
/usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815
16984 16986 16986 16986 ? -1 S 2067 0:00 -sh 
/usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815

Can this be the reason for the situation described in 3 b)?
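
For what it's worth, the pattern above (PID = PGID = SID = 16986 for the
job shell) is exactly what you get when a child calls setsid() right after
fork: it becomes the leader of a brand-new session, so signals aimed at the
shepherd's session or process group never reach it. A small illustration of
that POSIX mechanism (again, not SGE code, just my guess at what is involved):

/* Illustration of setsid(): the child detaches into its own session, so
 * PID = PGID = SID for the child -- the same pattern seen for the job
 * shell (16986) in the ps output above.  Not SGE code. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    printf("parent: pid=%d pgid=%d sid=%d\n",
           (int)getpid(), (int)getpgrp(), (int)getsid(0));

    pid_t pid = fork();
    if (pid == 0) {
        setsid();   /* new session: child no longer shares SID with the parent */
        printf("child : pid=%d pgid=%d sid=%d\n",
               (int)getpid(), (int)getpgrp(), (int)getsid(0));
        _exit(0);
    }

    wait(NULL);
    return 0;
}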

I hope this is useful to someone!
Cheers
Goncalo


McCalla, Mac wrote:
> I have also experienced this occasionally with mpi jobs in our SGE 6.1.u3 cluster (local spool, NFS for binaries).  We use lam/mpi 7.1.4 in a tight integration.  I have changed the timeout value in qsh.c to 180 seconds, which seemed to help in our case.
>
> Cheers,
>
> Mac McCalla
> -----Original Message-----
> From: brooks at face.aero.org [mailto:brooks at face.aero.org] On Behalf Of Brooks Davis
> Sent: 06 May 2008 13:38
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] shepherd exited with exit status 19
>
> On Tue, May 06, 2008 at 07:11:46PM +0100, Gonçalo Borges wrote:
>   
>> Hi,
>>
>> This is happening over and over again!!! Shepherd is dying with a 
>> message similar to:
>>
>>   05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
>>   05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
>>   05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
>>   05/02/2008 18:51:36|execd|lflip19|I|found directory of job 
>> "active_jobs/330153.1"
>>   05/02/2008 18:51:36|execd|lflip19|I|shepherd for job
>> active_jobs/330153.1 has pid "16717" and is not alive
>>   05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd 
>> for job 330153.1: "exit_status" file is empty
>>   05/02/2008 18:51:36|execd|lflip19|E|can't open usage file 
>> "active_jobs/330153.1/usage" for job 330153.1: No such file or directory
>>   05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit status 
>> 19
>>
>> NFS is not causing the problem because the spool directory is on local disk!
>>
>> After shepherd death, SGE thinks the job finished, and allows new jobs 
>> to enter. However, the processes which were controlled by the previously 
>> alive shepherd are still there...
>> It comes to a point where the machines end up under a very, very high load!!!!
>>     
>
> We've been experiencing this on our cluster, typically when starting large mpi parallel jobs (250 slots).  I've tried adjusting the timeouts in the sge versions of the rsh programs without success.  Like you, our spool directories are local.  Our sge binaries are on NFS so that's a possibility as are NIS timeout issues.  I've made a number of changes to try and mitigate both, but have not been able to fix the problem.
>
> -- Brooks
>
>   
>> Whom can I ask for more technical help on this issue? We really 
>> need help on this...
>>
>> Goncalo
>>
>>
>> Gonçalo Borges wrote:
>>     
>>> Hi All,
>>>
>>> I'm seeing the following problem in SGE V6u3_1:
>>>
>>> - A user started to complain that his jobs were not being executed 
>>> although there were free machines;
>>>
>>> - Indeed the machines were free (no jobs were shown by qstat) but 
>>> under a very heavy load, above the defined threshold, not allowing 
>>> new jobs to be executed.
>>>
>>> - The load was caused by old jobs not properly killed (we could 
>>> see several processes, using ps xuawww, still running on the 
>>> machine)... Somehow SGE lost control...
>>>
>>> - The logs on the machine showed something like:
>>>   05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
>>>   05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
>>>   05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
>>>   05/02/2008 18:51:36|execd|lflip19|I|found directory of job 
>>> "active_jobs/330153.1"
>>>   05/02/2008 18:51:36|execd|lflip19|I|shepherd for job
>>> active_jobs/330153.1 has pid "16717" and is not alive
>>>   05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of 
>>> shepherd for job 330153.1: "exit_status" file is empty
>>>   05/02/2008 18:51:36|execd|lflip19|E|can't open usage file 
>>> "active_jobs/330153.1/usage" for job 330153.1: No such file or directory
>>>   05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit 
>>> status 19
>>>
>>> Any hints?
>>> Cheers
>>> Goncalo
>>>

