[GE users] Error message: Can't read usage file

Petra Kogel Petra.Kogel at ecmwf.int
Thu Aug 23 10:57:13 BST 2007


Harald,

I've experimented with
- killing the shepherd with signal 15 or 9
- killing the job script with signal 15 or 9

These produce the following messages in the qmaster log, none
of which matches the "can't read usage file" message left by the
"disappearing" jobs:

08/23/2007 09:46:39|qmaster|swarm-ge|W|job 1953330.1 failed on host bee-ge24 before writing exit_status because: shepherd exited with exit status 19
08/23/2007 09:49:26|qmaster|swarm-ge|W|job 1953437.1 failed on host bee-ge32 assumedly after job because: job 1953437.1 died through signal TERM (15)
08/23/2007 09:51:09|qmaster|swarm-ge|W|job 1953464.1 failed on host bee-ge20 before writing exit_status because: shepherd exited with exit status 19
08/23/2007 09:53:19|qmaster|swarm-ge|W|job 1953513.1 failed on host bee-ge32 assumedly after job because: job 1953513.1 died through signal KILL (9)
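
For comparing entries like these across a longer messages file, a
minimal Python sketch may help. It is not part of Grid Engine. It assumes
each entry occupies a single line in the file and follows the
pipe-separated layout seen in this thread (date and time, daemon, host,
level, message). The names parse_line and group_failures are made up for
illustration.

```python
import re
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    timestamp: datetime
    daemon: str
    host: str
    level: str
    message: str


# Assumed shape of a job-failure message, taken from the excerpts in this thread.
FAILURE_RE = re.compile(
    r"job (?P<job>\d+\.\d+) failed on host (?P<host>\S+) "
    r"(?P<phase>.+?) because: (?P<reason>.+)"
)


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one 'date time|daemon|host|level|message' line; None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    try:
        timestamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogEntry(timestamp, parts[1], parts[2], parts[3], parts[4])


def group_failures(
    lines: Iterable[str],
) -> Dict[Tuple[str, str], List[Tuple[str, str]]]:
    """Group failed jobs by (phase, reason); values are (job id, host) pairs."""
    groups: Dict[Tuple[str, str], List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        match = FAILURE_RE.search(entry.message)
        if match is None:
            continue
        # Replace job ids inside the reason so identical causes share one key.
        reason = re.sub(r"\d+\.\d+", "<job>", match.group("reason"))
        groups[(match.group("phase"), reason)].append(
            (match.group("job"), match.group("host"))
        )
    return dict(groups)


# Sample entries copied from this thread, unwrapped to one line each.
SAMPLE = [
    "08/23/2007 09:46:39|qmaster|swarm-ge|W|job 1953330.1 failed on host bee-ge24 before writing exit_status because: shepherd exited with exit status 19",
    "08/23/2007 09:49:26|qmaster|swarm-ge|W|job 1953437.1 failed on host bee-ge32 assumedly after job because: job 1953437.1 died through signal TERM (15)",
    "08/23/2007 09:51:09|qmaster|swarm-ge|W|job 1953464.1 failed on host bee-ge20 before writing exit_status because: shepherd exited with exit status 19",
    "08/23/2007 09:53:19|qmaster|swarm-ge|W|job 1953513.1 failed on host bee-ge32 assumedly after job because: job 1953513.1 died through signal KILL (9)",
    "08/19/2007 07:01:12|qmaster|swarm-ge|W|job 1882417.1 failed on host bee-ge08 assumedly after job because: can't read usage file for job 1882417.1",
]

if __name__ == "__main__":
    for (phase, reason), jobs in group_failures(SAMPLE).items():
        print(f"{phase} | {reason} | {jobs}")
```

Passing an open file object for the real messages file instead of SAMPLE
should work the same way, provided the layout assumption holds.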

What else could I try to debug this?

Many thanks,
Petra

Petra Kogel wrote:
> Harald,
> 
> thanks for this; I'll pursue it.
> 
> Kind regards,
> Petra
> 
> Harald Pollinger wrote:
>> To reproduce this error, just "kill -9" the sge_shepherd of the job.
>> Then it has no chance to write the usage file and the execd will 
>> vainly search for it.
>>
>> So my guess is: The sge_shepherd dies and could leave a core dump if 
>> your system is configured this way.
>>
>> Regards,
>> Harald
>>
>>
>> Petra Kogel wrote:
>>> Hi,
>>>
>>> from time to time, we have jobs "disappearing" without leaving an output
>>> or error file. These jobs run fine if re-submitted. When they fail,
>>> the following happens:
>>>
>>> - they execute our custom prolog, leaving a start time stamp in
>>>   our custom log
>>> - they execute our custom epilog, leaving an end time stamp in
>>>   our custom log
>>> - they log an error on the node's local message file, for example
>>>
>>> 08/19/2007 07:01:12|execd|bee-ge08|E|can't open usage file "active_jobs/1882417.1/usage" for job 1882417.1: No such file or directory
>>>
>>> 08/19/2007 07:01:12|execd|bee-ge08|E|can't read usage file for job 1882417.1
>>>
>>> - they log an error in the qmaster messages file, for example
>>>
>>> 08/19/2007 07:01:12|qmaster|swarm-ge|W|job 1882417.1 failed on host bee-ge08 assumedly after job because: can't read usage file for job 1882417.1
>>>
>>> For these "disappearing jobs", the time difference between start
>>> and end as logged by prolog/epilog is usually one second (if that,
>>> sometimes both timestamps are the same). Normally, these jobs
>>> would take several minutes to execute and complete.
>>>
>>> Would anybody know what could provoke this error message / what
>>> could be happening to the jobs?
>>>
>>> Our installation is sge6.0u8 on a SuSE linux cluster.
>>>
>>> Many thanks for your help,
>>>
>>> Petra
>>>
>>>
>>>
>>
>>
> 

-- 

Petra Kogel, Senior Systems Analyst, Servers & Desktops Section
European Centre for Medium-Range Weather Forecasts (ECMWF)
Shinfield Park, Reading, Berkshire, RG2 9AX, UK (http://www.ecmwf.int)
Email: pkogel at ecmwf.int Telephone: (++44) 118 9499364
