[GE users] Error message: Can't read usage file
Harald.Pollinger at Sun.COM
Thu Aug 23 22:40:27 BST 2007
Petra Kogel wrote:
> I've experimented with
> - killing the shepherd: signal 15 or 9
> - killing the job script: signal 15 or 9
> They produce the following messages in the qmaster log, none
> of which corresponds to the one created by the "disappearing" jobs:
> 08/23/2007 09:46:39|qmaster|swarm-ge|W|job 1953330.1 failed on host
> bee-ge24 before writing exit_status because: shepherd exited with exit
> status 19
> 08/23/2007 09:49:26|qmaster|swarm-ge|W|job 1953437.1 failed on host
> bee-ge32 assumedly after job because: job 1953437.1 died through signal
> TERM (15)
> 08/23/2007 09:51:09|qmaster|swarm-ge|W|job 1953464.1 failed on host
> bee-ge20 before writing exit_status because: shepherd exited with exit
> status 19
> 08/23/2007 09:53:19|qmaster|swarm-ge|W|job 1953513.1 failed on host
> bee-ge32 assumedly after job because: job 1953513.1 died through signal
> KILL (9)
> What else could I try to debug this?
Hmm.. sorry, I should have tested it instead of just looking at the
code. Of course the execd notices when its child gets killed.
I can't reproduce the problem exactly. If I submit an invalid binary (a
binary compiled for another architecture) I get:
08/23/2007 23:02:53|execd|dain|E|shepherd of job 59.1 exited with exit
status = 27
08/23/2007 23:02:53|execd|dain|W|reaping job "59" ptf complains: Job
does not exist
08/23/2007 23:02:53|execd|dain|E|can't open usage file
"active_jobs/59.1/usage" for job 59.1: No such file or directory
08/23/2007 23:02:53|execd|dain|E|08/23/2007 23:02:53 [151085:25430]:
execvp(testprg, "testprg") failed: Invalid argument
But this seems to be the same problem: The shepherd thinks it can start
the job (the job binary exists and its permissions allow it to be
started), but then it fails to start it, while the execd thinks the job
was started and expects a usage file.
So I still think that either the job itself is not executable or the
shepherd dies immediately before it starts the job. At that point the
shepherd consists of two processes: the parent monitors the job, while
the child prepares everything for the job and finally becomes the job
by calling exec(). Possibly the child shepherd dies for some reason,
or the job is not executable.
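
The two-process arrangement described above can be sketched roughly as
follows. This is only an illustration in Python, not the shepherd's
actual code: the function name run_job and the exit code 127 (the
shell's "command not found" convention) are assumptions made for the
sketch. It shows how the parent can tell apart a child that failed in
exec(), one that was killed by a signal, and one that ran normally.

```python
import os
import sys

# Assumption: 127 follows the shell convention for "could not execute";
# it is used for illustration only and is not the shepherd's real code.
EXEC_FAILED = 127


def run_job(argv):
    """Fork, exec argv in the child, and report how the child ended."""
    sys.stdout.flush()
    sys.stderr.flush()
    pid = os.fork()
    if pid == 0:
        # Child: a real shepherd would prepare the job environment here.
        try:
            os.execvp(argv[0], argv)  # on success this never returns
        except OSError as exc:
            msg = f"execvp({argv[0]}) failed: {exc.strerror}\n"
            os.write(2, msg.encode())
        # Leave immediately so the child never runs the parent's code.
        os._exit(EXEC_FAILED)
    # Parent: wait for the child and classify the result.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return ("signal", os.WTERMSIG(status))
    return ("exit", os.WEXITSTATUS(status))
```

Calling run_job() with a path that cannot be executed returns
("exit", 127) in this sketch, while a child that is killed with
"kill -9" shows up as ("signal", 9).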
So if you could collect more information about the jobs that fail, it
would be very helpful: How was the job submitted (complete qsub command
line incl. options from all sge_request files)? Are there more lines
about this job in the messages files? Was an admin mail sent because of
this job? What does "qacct -j <job-id>" tell about this job?
> Many thanks,
> Petra Kogel wrote:
>> thanks for this; I'll pursue this.
>> Kind regards,
>> Harald Pollinger wrote:
>>> To reproduce this error, just "kill -9" the sge_shepherd of the job.
>>> Then it has no chance to write the usage file and the execd will
>>> vainly search for it.
>>> So my guess is: The sge_shepherd dies and could leave a core dump if
>>> your system is configured this way.
>>> Petra Kogel wrote:
>>>> from time to time, we have jobs "disappearing" without leaving an
>>>> or error file. These jobs run fine if re-submitted. When they do not
>>>> - they execute our custom prolog, leaving a start time stamp in
>>>> our custom log
>>>> - they execute our custom epilog, leaving an end time stamp in
>>>> our custom log
>>>> - they log an error on the node's local message file, for example
>>>> 08/19/2007 07:01:12|execd|bee-ge08|E|can't open usage file
>>>> "active_jobs/1882417.1/usage" for job 1882417.1: No such file or
>>>> 08/19/2007 07:01:12|execd|bee-ge08|E|can't read usage file for job
>>>> - they log an error in the qmaster messages file, for example
>>>> 08/19/2007 07:01:12|qmaster|swarm-ge|W|job 1882417.1 failed on host
>>>> bee-ge08 assumedly after job because: can't read usage file for job
>>>> For these "disappearing jobs", the time difference between start
>>>> and end as logged by prolog/epilog is usually one second (if that,
>>>> sometimes both timestamps are the same). Normally, these jobs
>>>> would take several minutes to execute and complete.
>>>> Would anybody know what could provoke this error message / what
>>>> could be happening to the jobs?
>>>> Our installation is sge6.0u8 on a SuSE linux cluster.
>>>> Many thanks for your help,
Sun Microsystems GmbH Harald Pollinger
Dr.-Leo-Ritter-Str. 7 N1 Grid Engine Engineering
D-93049 Regensburg Phone: +49 (0)941 3075-209 (x60209)
Germany Fax: +49 (0)941 3075-222 (x60222)
mailto:harald.pollinger at sun.com
Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1,
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering