[GE users] Jobs in "hold" state disappear. Debugging help?

reuti reuti at staff.uni-marburg.de
Wed Mar 24 23:02:09 GMT 2010


Am 24.03.2010 um 22:32 schrieb gutnik:

>> Okay, then let's have a look at the messages file of the qmaster:
>> $SGE_ROOT/default/spool/messages (or a local spool directory if
>> configured). Any hint of a `qdel`?
>
> No, no qdel in the qmaster's messages file.
>
> However, I tried qacct again: qacct -j <simjob>
>  works, and gives me a bunch of information.
> qacct -j <cleanup>
>  says the job was not found. But I certainly saw it in the queue.
> Under what circumstances would qacct say "error: job id 7858 not
> found" if qacct -s z lists it?

When the accounting entry is missing, the job never ran; hence you
wouldn't get an email anyway. Maybe a prolog for the job crashed. In
that case there is a chance that the admin of the cluster got an email
which was sent from the qmaster node. Otherwise: hard to investigate
remotely...
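
To check directly whether an accounting record exists for a job, one
could look into the accounting file itself instead of going through
qacct. This is only a sketch: `find_job_record` is a hypothetical
helper, the path assumes the usual `$SGE_ROOT/<cell>/common/accounting`
location, and it relies on the colon-separated layout described in
accounting(5), where the job number is the sixth field.

```shell
# Hypothetical helper: print the accounting record(s) of one job id.
# Assumption: colon-separated accounting(5) layout, job number in field 6.
# Assumption: default file location $SGE_ROOT/$SGE_CELL/common/accounting;
# pass a different path as the second argument if your site differs.
find_job_record() {
    job_id=$1
    acct_file=${2:-$SGE_ROOT/${SGE_CELL:-default}/common/accounting}
    # Print every line whose sixth field equals the requested job id.
    awk -F: -v id="$job_id" '$6 == id' "$acct_file"
}
```

Calling `find_job_record 7858` and getting no output would match the
"job id 7858 not found" message from qacct, i.e. no record was ever
written for that job.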


>>> How do I do that? (Ideally, I'd like email for each job, and each
>>> change of status and reason.)
>>
>> Just put a line
>>
>> -m bea
>>
>> into this file and hope that proper email handling was set up. With
>> -M, an optional target address different from the local user can be
>> specified.
>
> Proper email handling was, apparently, not set up. :-/
> I'll see if I can work on that... but meanwhile, any thoughts
> on the qacct behavior?


BTW: The emails for jobs are sent from the nodes; on a private cluster
this means that they must be relayed by the headnode.
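
For the `-m bea` and `-M` options quoted above (mail at begin, end and
abort of a job), a minimal sketch of adding them to a default request
file could look like this. Assumptions: the file is a per-user request
file such as `~/.sge_request`, `add_mail_request` is a hypothetical
helper name, and `user@example.com` is a placeholder address.

```shell
# Hypothetical helper: append mail options to a default request file.
# $1 = request file (assumption: e.g. "$HOME/.sge_request")
# $2 = target address for -M (placeholder, replace with a real one)
add_mail_request() {
    request_file=$1
    address=$2
    # -m bea: send mail at begin, end and abort of the job.
    # -M:     send it to this address instead of the local user.
    printf '%s\n' '-m bea' "-M $address" >> "$request_file"
}

# Example call (commented out so nothing is changed by accident):
# add_mail_request "$HOME/.sge_request" user@example.com
```

The same options can also be given per job on the command line, e.g.
`qsub -m bea -M user@example.com job.sh`, where `job.sh` stands for
your own job script.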

-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=251234
