[GE users] shepherd exited with exit status 19

Gonçalo Borges goncalo at lip.pt
Fri May 2 21:08:17 BST 2008


By the way, looking in the SGE qmaster logs, I also see:

05/02/2008 17:34:39|qmaster|sge01|E|commlib error: got pipe error 
(closing "sge01.lip.pt/schedd/1")
05/02/2008 17:34:47|qmaster|sge01|E|commlib error: endpoint is not 
unique error (endpoint "ce02.lip.pt/qstat/30496" is already connected)
05/02/2008 17:34:52|qmaster|sge01|E|can't open directory 
"resource_quotas": No such file or directory
05/02/2008 17:34:52|qmaster|sge01|E|wrong cull version, read 0x7273696f, 
but expected actual version 0x10020000
05/02/2008 17:34:52|qmaster|sge01|E|error in init_packbuffer: wrong cull 
version
05/02/2008 17:34:52|qmaster|sge01|E|not enough memory for unpacking job 
"jobs/00/0025/6239"
05/02/2008 17:34:52|qmaster|sge01|E|wrong cull version, read 0x7273696f, 
but expected actual version 0x10020000
05/02/2008 17:34:52|qmaster|sge01|E|error in init_packbuffer: wrong cull 
version
05/02/2008 17:34:52|qmaster|sge01|E|not enough memory for unpacking job 
"jobs/00/0025/6238"
05/02/2008 17:34:55|qmaster|sge01|E|commlib error: got read error 
(closing "ce02.lip.pt/qstat/2")

I'm not sure whether these are related to my previous problems, but these 
log entries look very suspicious and suggest that something went wrong!
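
To see which errors repeat, the entries above can be rejoined and counted 
with a small script. This is only a minimal sketch: it assumes the layout 
visible in the excerpt (date time|daemon|host|level|message) and that any 
line not starting with a timestamp is a mail-wrapped continuation of the 
previous entry. The names parse_entries and count_errors are made up for 
this example and are not part of SGE.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt only:
#   MM/DD/YYYY HH:MM:SS|daemon|host|level|message
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|")


def parse_entries(text):
    """Rejoin wrapped lines and split each entry into its five fields."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ENTRY_START.match(line):
            entries.append(line)
        elif entries:
            # No timestamp: treat as continuation of the previous entry.
            entries[-1] += " " + line
    parsed = []
    for entry in entries:
        timestamp, daemon, host, level, message = entry.split("|", 4)
        parsed.append({
            "timestamp": timestamp,
            "daemon": daemon,
            "host": host,
            "level": level,
            "message": message,
        })
    return parsed


def count_errors(parsed):
    """Count identical messages among entries logged at level E."""
    return Counter(e["message"] for e in parsed if e["level"] == "E")


SAMPLE = """\
05/02/2008 17:34:52|qmaster|sge01|E|wrong cull version, read 0x7273696f,
but expected actual version 0x10020000
05/02/2008 17:34:52|qmaster|sge01|E|error in init_packbuffer: wrong cull
version
05/02/2008 17:34:52|qmaster|sge01|E|wrong cull version, read 0x7273696f,
but expected actual version 0x10020000
"""

if __name__ == "__main__":
    for message, n in count_errors(parse_entries(SAMPLE)).most_common():
        print(n, message)
```

Run against the full messages file, this should make it easier to tell a 
single corrupted job file from a problem affecting the whole spool.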

Cheers
Goncalo



Gonçalo Borges wrote:
> Hi All,
>
> I'm seeing the following problem in SGE V6u3_1:
>
> - A user started to complain that his jobs were not being executed 
> although there were free machines;
>
> - Indeed, the machines were free (qstat showed no jobs) but were under 
> very heavy load, above the defined threshold, which prevented new jobs 
> from being executed.
>
> - The load was caused by old jobs that had not been properly killed (we 
> could see several processes, using ps xuawww, still running on the 
> machine)... Somehow SGE lost control...
>
> - The logs on the machine showed something like:
>    05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
>    05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
>    05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
>    05/02/2008 18:51:36|execd|lflip19|I|found directory of job 
> "active_jobs/330153.1"
>    05/02/2008 18:51:36|execd|lflip19|I|shepherd for job 
> active_jobs/330153.1 has pid "16717" and is not alive
>    05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of 
> shepherd for job 330153.1: "exit_status" file is empty
>    05/02/2008 18:51:36|execd|lflip19|E|can't open usage file 
> "active_jobs/330153.1/usage" for job 330153.1: No such file or directory
>    05/02/2008 18:51:36|execd|lflip19|E|shepherd exited with exit 
> status 19
>
> Any hints?
> Cheers
> Goncalo
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

