[GE users] Jobs in "hold" state disappear. Debugging help?
reuti at staff.uni-marburg.de
Wed Mar 24 23:02:09 GMT 2010
On 24.03.2010 at 22:32, gutnik wrote:
>> Okay, then let's have a look at the messages file of the qmaster:
>> $SGE_ROOT/default/spool/messages (or a local spool directory if
>> configured). Any hint of a `qdel`?
> No, no qdel in the qmaster's messages file.
> However, I tried qacct again: qacct -j <simjob>
> works, and gives me a bunch of information.
> qacct -j <cleanup>
> says the job was not found. But I certainly saw it in the queue.
> Under what circumstances would qacct say "error: job id 7858 not found" if
> qacct -s z
> lists it?
When the accounting entry is missing, the job never ran; hence you
wouldn't get an email anyway. Maybe a prolog for the job crashed. Then
there is a chance that the admin of the cluster got an email which was
sent from the qmaster node. Otherwise: hard to investigate from
>>> How do I do that? (Ideally, I'd like email for each job, and each
>>> change of
>>> status and reason.)
>> Just put a line
>> -m bea
>> into this file and hope that proper email handling was set up. With
>> an optional target address different from the local user could be
> Proper email handling was, apparently, not set up. :-/
> I'll see if I can work on that... but meanwhile, any thoughts
> on the qacct behavior?
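
One way to cross-check the qacct question above is to look for the job ID directly in the accounting file that qacct reads. The following is only a sketch: it assumes the cell is named "default", that the file lives at $SGE_ROOT/default/common/accounting, and that records are colon-separated with the job number in field 6 as described in accounting(5). The function name has_accounting_record is made up for illustration.

```shell
#!/bin/sh
# Sketch: search the accounting file for a job ID.
# Assumptions: cell "default", colon-separated records,
# job number in field 6 (see accounting(5)).

# has_accounting_record JOB_ID [FILE]
# Prints matching records; returns 0 if at least one was found,
# non-zero otherwise (including when the file cannot be read).
has_accounting_record() {
    job_id=$1
    file=${2:-$SGE_ROOT/default/common/accounting}
    awk -F: -v id="$job_id" '
        /^#/ { next }                  # skip comment lines
        $6 == id { print; found = 1 }  # field 6 is the job number
        END { exit found ? 0 : 1 }
    ' "$file"
}
```

For example, `has_accounting_record 7858 || echo "no accounting record"` would print the message if job 7858 never produced an accounting entry, which matches the case where the job never started.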
BTW: The emails for jobs are sent from the nodes; on a private cluster
this means that they must be relayed by the headnode.
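
The "-m bea" suggestion quoted above, together with an explicit target address, could be written into a default request file roughly as follows. This is a sketch under assumptions: it takes the per-user file ~/.sge_request as the file in question (the thread does not say which file was meant), the function name add_mail_defaults is hypothetical, and user@example.org is a placeholder. Only the standard qsub options -m and -M are used.

```shell
#!/bin/sh
# Sketch: add default mail options to a Grid Engine request file.
# Assumption: the per-user default request file ~/.sge_request is meant.

# add_mail_defaults ADDRESS [FILE]
# Appends each option only if the file has no line for it yet.
add_mail_defaults() {
    address=$1
    file=${2:-$HOME/.sge_request}
    # -m bea: send mail at the beginning (b), end (e) and abort (a) of a job
    grep -qs -e '^-m ' "$file" || printf '%s\n' '-m bea' >> "$file"
    # -M: deliver to this address instead of the submitting user's local account
    grep -qs -e '^-M ' "$file" || printf '%s\n' "-M $address" >> "$file"
}
```

Calling `add_mail_defaults user@example.org` twice leaves the file unchanged the second time. Whether the mail actually arrives still depends on the nodes being able to relay through the headnode, as noted above.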