[GE users] shepherd exited with exit status 19

Harald Pollinger Harald.Pollinger at Sun.COM
Thu May 8 10:54:32 BST 2008



Hi Gonçalo,

yes, please file a bug report for this!

Thanks!
Harald

Gonçalo Borges wrote:
> Hi,
> 
> Can someone tell me if this is sufficient to file a bug report?
> Cheers
> Goncalo
> 
> Gonçalo Borges wrote:
>>
>> Hi All,
>>
>> Thanks to everyone who tried to help! I think I understand why this 
>> was happening, although I also suspect there is some kind of SGE bug!
>>
>> Let me summarize:
>>
>> 1) We were able to correlate those shepherd errors with controlled 
>> sge_execd shutdowns. A recent change in our fabric management system 
>> restarted sge_execd on all queue instances, and afterwards we were 
>> getting messages like:
>>
>> 05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
>> 05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
>> 05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
>> 05/02/2008 18:51:36|execd|lflip19|I|found directory of job 
>> "active_jobs/330153.1"
>> 05/02/2008 18:51:36|execd|lflip19|I|shepherd for job 
>> active_jobs/330153.1 has pid "16717" and is not alive
>> 05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd 
>> for job 330153.1: "exit_status" file is empty
>> 05/02/2008 18:51:36|execd|lflip19|E|can't open usage file 
>> "active_jobs/330153.1/usage" for job 330153.1: No such file or directory
>>
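At startup, execd is evidently scanning the local spool's active_jobs directories and complaining when a job's exit_status file is empty. The same check can be scripted to find stale job directories before restarting execd. A minimal sketch, assuming the spool layout shown in the paths above (the function name and spool argument are illustrative, not part of SGE):

```shell
# List active_jobs entries whose exit_status file is missing or empty,
# i.e. jobs whose shepherd died before writing an exit status.
find_stale_jobs() {
    spool=$1    # e.g. /usr/local/sge/V61u3/default/spool/lflip19 (assumed layout)
    for d in "$spool"/active_jobs/*/; do
        [ -d "$d" ] || continue
        [ -s "${d}exit_status" ] || echo "no exit status: ${d%/}"
    done
}
```

Running it against a host's spool directory before a controlled restart would show which jobs execd will declare abnormally terminated.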
>> 2) I really don't know whether execd has a mechanism to recover running 
>> jobs across a controlled shutdown... If it does, it is not working 
>> properly, at least in our configuration, because the running jobs 
>> disappear from SGE (at least apparently): the shepherd dies, but 
>> the processes started by the shepherd keep running on the machine... 
>> Looking at the process tree (pstree -Gap) after the controlled execd 
>> shutdown, you would see the user script hanging directly off 
>> "init,1":
>>
>> init,1
>>  `-sh,21479 /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333897
>>     |-bootstrap.k2181,21813 -w /tmp/bootstrap.k21810 /home/cms067/ ce02.lip.pt /home/cms067/.globus/job/ce02.lip.pt/22189.1210069337/x509_up ...
>>     |-bootstrap.k2181,21817 -w /tmp/bootstrap.k21810 /home/cms067/ ce02.lip.pt ...
>>     |-bootstrap.k2181,21974 -w /tmp/bootstrap.k21810 /home/cms067/ ce02.lip.pt ...
>>     |-sh,22013 -c...
>>     |-jobwrapper,22014 /opt/lcg/libexec/jobwrapper /home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
>>     |-globus-tmp.lfli,22015 /home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
>>     |-globus-tmp.lfli,22134 /home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
>>     |-time,22135 -p perl -e...
>>     |-sh,22137 -c ...
>>     |-jobExecutor,22138 2
>>     |-ch_tool,22149
>>     |  |-programExecutor,22152 1
>>     |  |-BossRuntime_cra,22158 ./BossRuntime_crabjob
>>     |  |-CMSSW.sh,22162 CMSSW.sh 2 ...
>>     |  |  `-cmsRun,22330 -j crab_fjr.xml -p pset.cfg
>>     |  |-programExecutor,22159 1
>>     |  |-tee,22160 /tmp//BossTeePipe-crabjob22152
>>     |  `-tee,22161 /tmp//BossTeePipe-crabjob22152
>>     |-dbUpdator,22144 1_2_1 746b7b06-ab68-4ae2-a9fe-0c7a94c6cf42 RTConfig.clad
>>     `-sleep,31466 30
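One way to spot such orphans automatically is to look for processes whose parent is init (PPID 1) but whose command line still references an SGE job script. A hedged sketch: the job_scripts path component is taken from the pstree output above, and the function name is illustrative:

```shell
# Print the PIDs of processes reparented to init that still run an SGE
# job script. Feed it "PPID PID ARGS" lines, e.g.:
#   ps -e -o ppid= -o pid= -o args= | orphaned_job_pids
orphaned_job_pids() {
    awk '$1 == 1 && index($0, "job_scripts") { print $2 }'
}
```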
>>
>>
>> 3) The problem (bug?) which I think may exist is this:
>> a) If execd has a mechanism to recover running jobs after a 
>> restart, it doesn't seem to be working properly...
>> b) If execd does not have a mechanism to recover running jobs 
>> across a controlled shutdown, then it should kill all active jobs (and 
>> their child processes) properly. This doesn't seem to be the case either.
>>
>> 4) Check the following example... This is the tree of processes for a 
>> normal job...
>>
>> init,1
>>  |-sge_execd,9328
>>  |  `-sge_shepherd,16984 -bg
>>  |     `-sh,16986 /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815
>>  |        |-bootstrap.P1737,17375 -w /tmp/bootstrap.P17371 /home/cms067/ ce02.lip.pt ...
>>  |        |-bootstrap.P1737,17384 -w /tmp/bootstrap.P17371 /home/cms067/ ce02.lip.pt ...
>>  |        |-bootstrap.P1737,17558 -w /tmp/bootstrap.P17371 /home/cms067/ ce02.lip.pt ...
>>  |        |-sh,17653 -c...
>>  |        |-jobwrapper,17656 /opt/lcg/libexec/jobwrapper /home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
>>  |        |-globus-tmp.lfli,17658 /home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
>>  |        |-globus-tmp.lfli,18122 /home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
>>  |        |-time,18123 -p perl -e...
>>  |        |-sh,18125 -c ...
>>  |        |-jobExecutor,18127 15
>>  |        |-ch_tool,18144
>>  |        |  |-programExecutor,18147 1
>>  |        |  |-BossRuntime_cra,18155 ./BossRuntime_crabjob
>>  |        |  |-CMSSW.sh,18159 CMSSW.sh 15 ...
>>  |        |  |  `-cmsRun,18441 -j crab_fjr.xml -p pset.cfg
>>  |        |  |-programExecutor,18156 1
>>  |        |  |-tee,18157 /tmp//BossTeePipe-crabjob18147
>>  |        |  `-tee,18158 /tmp//BossTeePipe-crabjob18147
>>  |        |-dbUpdator,18137 1_15_1 6a4a2a4d-a0af-4f88-a8c2-d810a4a6bfa6 RTConfig.clad
>>  |        `-sleep,30576 30
>>
>>
>> If you look at the process IDs, you see that sge_execd 
>> and the shepherd share the same SID:
>>
>> PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
>> 1 9328 9328 1801 ? -1 S 0 16:10 
>> /usr/local/sge/V61u3/bin/lx26-x86/sge_execd
>> 9328 16984 16984 1801 ? -1 S 0 0:00 sge_shepherd-333815 -bg
>> 9328 21478 21478 1801 ? -1 S 0 0:00 sge_shepherd-333897 -bg
>>
>> but the shepherd does not share the same SID with the user job...
>>
>> [root at lflip32 ~]# ps -jxa | grep 
>> /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815
>> 16984 16986 16986 16986 ? -1 S 2067 0:00 -sh 
>> /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815
>>
>> Can this be the reason for the situation described in 3 b)?
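It may well be. If the shepherd starts the job in its own session, as the differing SIDs above suggest, then a shutdown that only signals processes in execd's session can never reach the job's processes. It also points at a manual cleanup path: since the job-script shell (sh,16986 above, with PGID = SID = its own PID) is a group leader, the whole orphaned tree can be signalled through its process group. A hedged sketch, not taken from the SGE sources; the function name is illustrative:

```shell
# Signal an orphaned job's entire process group, given the PID of the
# job-script shell (its own process-group leader per the ps -j output).
kill_job_group() {
    pgid=$(ps -o pgid= -p "$1" | tr -d ' ') || return 1
    [ -n "$pgid" ] || return 1
    kill -TERM -- "-$pgid"
}
```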
>>
>> I hope this is useful to someone!
>> Cheers
>> Goncalo
>>
>>
>> McCalla, Mac wrote:
>>> I have also experienced this occasionally with MPI jobs in our SGE 
>>> 6.1u3 cluster (local spool, NFS for binaries). We use LAM/MPI 7.1.4 
>>> in a tight integration. I have changed the timeout value in qsh.c to 
>>> 180 seconds, which seemed to help in our case.
>>> Cheers,
>>>
>>> Mac McCalla
>>> -----Original Message-----
>>> From: brooks at face.aero.org [mailto:brooks at face.aero.org] On Behalf Of 
>>> Brooks Davis
>>> Sent: 06 May 2008 13:38
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] shepherd exited with exit status 19
>>>
>>> On Tue, May 06, 2008 at 07:11:46PM +0100, Gonçalo Borges wrote:
>>>  
>>>> Hi,
>>>>
>>>> This is happening over and over again!!! Shepherd is dying with a 
>>>> message similar to:
>>>>
>>>>   05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
>>>>   05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
>>>>   05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
>>>>   05/02/2008 18:51:36|execd|lflip19|I|found directory of job 
>>>> "active_jobs/330153.1"
>>>>   05/02/2008 18:51:36|execd|lflip19|I|shepherd for job
>>>> active_jobs/330153.1 has pid "16717" and is not alive
>>>>   05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of 
>>>> shepherd for job 330153.1: "exit_status" file is empty
>>>>   05/02/2008 18:51:36|execd|lflip19|E|can't open usage file 
>>>> "active_jobs/330153.1/usage" for job 330153.1: No such file or 
>>>> directory
>>>>   05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit 
>>>> status 19
>>>>
>>>> NFS is not causing the problem because the spool directory is on 
>>>> local disk!
>>>>
>>>> After the shepherd's death, SGE thinks the job finished and allows new 
>>>> jobs to enter. However, the processes which were controlled by the 
>>>> previously alive shepherd are still there...
>>>> It comes to a point where the machines end up under a very, very high 
>>>> load!!!!
>>>>     
>>>
>>> We've been experiencing this on our cluster, typically when starting 
>>> large MPI parallel jobs (250 slots).  I've tried adjusting the 
>>> timeouts in the SGE versions of the rsh programs without success.  
>>> Like you, our spool directories are local.  Our SGE binaries are on 
>>> NFS, so that's a possibility, as are NIS timeout issues.  I've made a 
>>> number of changes to try to mitigate both, but have not been able to 
>>> fix the problem.
>>>
>>> -- Brooks
>>>
>>>  
>>>> To whom can I ask for more technical help on this issue? We really 
>>>> need help on this...
>>>>
>>>> Goncalo
>>>>
>>>>
>>>> Gonçalo Borges wrote:
>>>>   
>>>>> Hi All,
>>>>>
>>>>> I'm seeing the following problem in SGE V6u3_1:
>>>>>
>>>>> - A user started to complain that his jobs were not being executed 
>>>>> although there were free machines;
>>>>>
>>>>> - Indeed the machines were free (no jobs were shown by qstat) but 
>>>>> under very high load, above the defined threshold, which prevented 
>>>>> new jobs from being executed.
>>>>>
>>>>> - The load came from old jobs that were not properly killed (we could 
>>>>> see several processes, using ps xuawww, still running on the 
>>>>> machine)... Somehow SGE lost control...
>>>>>
>>>>> - The logs on the machine showed something like:
>>>>>   05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
>>>>>   05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
>>>>>   05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
>>>>>   05/02/2008 18:51:36|execd|lflip19|I|found directory of job 
>>>>> "active_jobs/330153.1"
>>>>>   05/02/2008 18:51:36|execd|lflip19|I|shepherd for job
>>>>> active_jobs/330153.1 has pid "16717" and is not alive
>>>>>   05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of 
>>>>> shepherd for job 330153.1: "exit_status" file is empty
>>>>>   05/02/2008 18:51:36|execd|lflip19|E|can't open usage file 
>>>>> "active_jobs/330153.1/usage" for job 330153.1: No such file or 
>>>>> directory
>>>>>   05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit 
>>>>> status 19
>>>>>
>>>>> Any hints?
>>>>> Cheers
>>>>> Goncalo
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>       
>>>
>>
>>


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         N1 Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Sitz der Gesellschaft: Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering




