[GE users] shepherd exited with exit status 19

Brooks Davis brooks at aero.org
Tue May 6 19:37:34 BST 2008



On Tue, May 06, 2008 at 07:11:46PM +0100, Gonçalo Borges wrote:
> Hi,
> 
> This is happening over and over again!!! Shepherd is dying with a 
> message similar to:
> 
>   05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
>   05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
>   05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
>   05/02/2008 18:51:36|execd|lflip19|I|found directory of job "active_jobs/330153.1"
>   05/02/2008 18:51:36|execd|lflip19|I|shepherd for job active_jobs/330153.1 has pid "16717" and is not alive
>   05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd for job 330153.1: "exit_status" file is empty
>   05/02/2008 18:51:36|execd|lflip19|E|can't open usage file "active_jobs/330153.1/usage" for job 330153.1: No such file or directory
>   05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit status 19
> 
> NFS is not causing the problem because the spool directory is on local disk!
> 
> After the shepherd dies, SGE thinks the job has finished and allows new 
> jobs to start. However, the processes that were controlled by the dead 
> shepherd are still there...
> Eventually the machines end up under a very, very high load!!!!

We've been experiencing this on our cluster, typically when starting
large MPI parallel jobs (250 slots).  I've tried adjusting the timeouts
in the SGE versions of the rsh programs without success.  As in your
setup, our spool directories are local.  Our SGE binaries are on NFS, so
that's a possibility, as are NIS timeout issues.  I've made a number of
changes to try to mitigate both, but have not been able to fix the problem.
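
To get a feel for how often this happens and which jobs are hit, one
option is to scan the execd messages file for the errors quoted above.
The following is only a minimal sketch in Python: the field layout
(date time|component|host|level|message) is inferred from the quoted
log lines, and the names parse_line and failed_jobs are made up for
illustration, not part of SGE.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the quoted log lines:
#   MM/DD/YYYY HH:MM:SS|component|host|level|message
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|"
    r"(?P<component>[^|]*)\|(?P<host>[^|]*)\|(?P<level>[A-Z])\|(?P<message>.*)$"
)

# Job ids appear in the quoted errors as "for job <jobid>.<taskid>".
JOB_RE = re.compile(r"for job (\d+\.\d+)")


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def failed_jobs(lines):
    """Map job id -> list of error-level ('E') messages that mention that job."""
    result = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is None or record["level"] != "E":
            continue
        job = JOB_RE.search(record["message"])
        if job:
            result[job.group(1)].append(record["message"])
    return dict(result)
```

Calling failed_jobs() on an open messages file (one log entry per line,
not wrapped as in this mail) returns a dictionary keyed by job id.
Errors that carry no job id, such as the bare "shepherd exited with
exit status 19" line, are skipped by this sketch.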

-- Brooks

> Whom can I ask for more technical help on this issue? We really need 
> help on this...
> 
> Goncalo
> 
> 
> Gonçalo Borges wrote:
> >Hi All,
> >
> >I'm seeing the following problem in SGE 6.1u3:
> >
> >- A user started to complain that his jobs were not being executed 
> >although there were free machines;
> >
> >- Indeed the machines were free (no jobs were shown by qstat) but 
> >under very heavy load, above the defined threshold, so no new jobs 
> >could be started on them.
> >
> >- The load came from old jobs that had not been properly killed (we 
> >could see several processes, using ps xuawww, still running on the 
> >machine)... Somehow SGE lost control...
> >
> >- The logs on the machine showed something like:
> >   05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
> >   05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
> >   05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
> >   05/02/2008 18:51:36|execd|lflip19|I|found directory of job "active_jobs/330153.1"
> >   05/02/2008 18:51:36|execd|lflip19|I|shepherd for job active_jobs/330153.1 has pid "16717" and is not alive
> >   05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd for job 330153.1: "exit_status" file is empty
> >   05/02/2008 18:51:36|execd|lflip19|E|can't open usage file "active_jobs/330153.1/usage" for job 330153.1: No such file or directory
> >   05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit status 19
> >
> >Any hints?
> >Cheers
> >Goncalo
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 
> 
> 




