[GE users] shepherd exited with exit status 19

Gonçalo Borges goncalo at lip.pt
Tue May 6 19:11:46 BST 2008



Hi,

This keeps happening, over and over again! The shepherd dies, and the
execd log shows messages similar to:

   05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
   05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
   05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
   05/02/2008 18:51:36|execd|lflip19|I|found directory of job "active_jobs/330153.1"
   05/02/2008 18:51:36|execd|lflip19|I|shepherd for job active_jobs/330153.1 has pid "16717" and is not alive
   05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd for job 330153.1: "exit_status" file is empty
   05/02/2008 18:51:36|execd|lflip19|E|can't open usage file "active_jobs/330153.1/usage" for job 330153.1: No such file or directory
   05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit status 19
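
To find every affected job on a host, the execd messages file can be
scanned for these entries. The following is only a minimal sketch: it
assumes each entry is a single line in the pipe-separated layout shown
above (date and time, daemon, host, level, message), and the names
parse_line and abnormal_shepherd_jobs are hypothetical helpers, not part
of Grid Engine.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class ExecdEntry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[ExecdEntry]:
    """Split one 'timestamp|daemon|host|level|message' line (assumed layout)."""
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None  # not a log entry in the assumed layout
    return ExecdEntry(*parts)


# Matches the job id in messages such as "... for job 330153.1: ..."
JOB_RE = re.compile(r"for job (\d+\.\d+)")


def abnormal_shepherd_jobs(lines: Iterable[str]) -> List[str]:
    """Return job ids (in order, without repeats) whose shepherd ended abnormally."""
    jobs: List[str] = []
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry.level != "E":
            continue
        if "abnormal termination of shepherd" in entry.message:
            match = JOB_RE.search(entry.message)
            if match and match.group(1) not in jobs:
                jobs.append(match.group(1))
    return jobs
```

It could be run over the open messages file of each execution host, for
example abnormal_shepherd_jobs(open(path)), where path is wherever the
execd spool directory of that host keeps its messages file.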

NFS is not causing the problem because the spool directory is on local disk!

After the shepherd dies, SGE thinks the job has finished and lets new
jobs start on the host. However, the processes that were controlled by
the dead shepherd are still running...
Eventually the machines end up under a very, very high load!
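
As a stopgap, the leftover processes can be listed from ps output. This
is a rough heuristic sketch, not something SGE provides: it assumes the
orphaned job processes have been reparented to PID 1, and it treats a
hard-coded set of account names (SYSTEM_USERS, purely an assumption) as
system accounts to skip. Legitimate user daemons also have PPID 1, so the
result needs a manual check before anything is killed. The function names
are hypothetical.

```python
import subprocess
from typing import FrozenSet, List, Tuple

# Assumed system accounts; adjust for the site.
SYSTEM_USERS: FrozenSet[str] = frozenset({"root", "daemon", "nobody", "sgeadmin"})


def find_reparented(
    ps_output: str, system_users: FrozenSet[str] = SYSTEM_USERS
) -> List[Tuple[int, str, str]]:
    """Return (pid, user, command) for non-system processes whose parent is PID 1.

    Expects the text produced by 'ps -eo pid,ppid,user,args'.
    """
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, user, args = fields
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # skips the header line
        if int(ppid) == 1 and user not in system_users:
            found.append((int(pid), user, args))
    return found


def current_ps_output() -> str:
    """Run ps on the local machine and return its output as text."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,user,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling find_reparented(current_ps_output()) on an execution host that
qstat reports as empty gives a candidate list to compare against the
users whose jobs last ran there.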

Whom can I ask for more technical help on this issue? We really need
help with this...

Goncalo


Gonçalo Borges wrote:
> Hi All,
>
> I'm seeing the following problem in SGE 6.1u3:
>
> - A user started to complain that his jobs were not being executed 
> although there were free machines;
>
> - Indeed the machines were free (no jobs were shown by qstat) but 
> under a very high load, above the defined threshold, which prevented 
> new jobs from being executed.
>
> - The load was caused by old jobs that had not been properly killed 
> (we could see several processes, using ps xuawww, still running on 
> the machine)... Somehow SGE lost control...
>
> - The logs on the machine showed something like:
>    05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
>    05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
>    05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
>    05/02/2008 18:51:36|execd|lflip19|I|found directory of job "active_jobs/330153.1"
>    05/02/2008 18:51:36|execd|lflip19|I|shepherd for job active_jobs/330153.1 has pid "16717" and is not alive
>    05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd for job 330153.1: "exit_status" file is empty
>    05/02/2008 18:51:36|execd|lflip19|E|can't open usage file "active_jobs/330153.1/usage" for job 330153.1: No such file or directory
>    05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit status 19
>
> Any hints?
> Cheers
> Goncalo
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

