[GE users] shepherd exited with exit status 19

Gonçalo Borges goncalo at lip.pt
Fri May 2 20:44:03 BST 2008



Hi All,

I'm seeing the following problem in SGE V6u3_1:

- A user complained that his jobs were not being executed 
although there were free machines.

- The machines were indeed free (qstat showed no jobs on them), but their 
load was very high, above the defined threshold, so no new jobs could 
be executed there.

- The load was caused by old jobs that had not been properly killed (using 
ps xuawww, we could see several processes still running on the machine). 
Somehow SGE lost control of them.

- The logs on the machine showed something like:
    05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
    05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
    05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
    05/02/2008 18:51:36|execd|lflip19|I|found directory of job "active_jobs/330153.1"
    05/02/2008 18:51:36|execd|lflip19|I|shepherd for job active_jobs/330153.1 has pid "16717" and is not alive
    05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd for job 330153.1: "exit_status" file is empty
    05/02/2008 18:51:36|execd|lflip19|E|can't open usage file "active_jobs/330153.1/usage" for job 330153.1: No such file or directory
    05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit status 19
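
For anyone triaging a similar messages file, here is a minimal Python sketch 
that splits entries like the ones above into fields and keeps only the error 
("E") lines. The field names (timestamp, daemon, host, level, message) are my 
own labels for the five pipe-separated columns, not official SGE terms, and 
the function names are hypothetical. It assumes a wrapped continuation line 
never looks like a five-field entry itself.

```python
from collections import namedtuple

# Field names are illustrative labels for the five pipe-separated columns.
LogEntry = namedtuple("LogEntry", "timestamp daemon host level message")


def parse_messages(text):
    """Parse pipe-delimited execd log text, rejoining mail-wrapped lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split into at most five fields so pipes in the message survive.
        parts = line.split("|", 4)
        if len(parts) == 5 and len(parts[3]) == 1:
            entries.append(LogEntry(*parts))
        elif entries:
            # Not a new entry: treat it as the wrapped tail of the previous one.
            last = entries[-1]
            entries[-1] = last._replace(message=last.message + " " + line)
    return entries


def error_entries(entries):
    """Return only the entries logged at level "E"."""
    return [e for e in entries if e.level == "E"]


# Three entries copied from the log above, one of them wrapped by the mailer.
SAMPLE = """\
05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd
for job 330153.1: "exit_status" file is empty
05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit status 19
"""

if __name__ == "__main__":
    for entry in error_entries(parse_messages(SAMPLE)):
        print(entry.host, entry.timestamp, entry.message)
```

Run against the real messages file of each execution host, this would list 
the hosts and times at which a shepherd was reported as abnormally terminated.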

Any hints?
Cheers
Goncalo
