[GE users] sge jobs when a node crashes

Iwona Sakrejda isakrejda at lbl.gov
Mon May 1 21:41:18 BST 2006



Reuti wrote:


> On 01.05.2006 at 21:49, Jinal Jhaveri wrote:

>> Recently I have been seeing a situation where a node on which a job is
>> running crashes, but the job is still shown in the "r" state in qstat.
>>

> 

> 
> can you try using the "reschedule_unknown" option and submit the jobs
> with "-r y"? Please have a look at "man sge_conf".
> 

But one has to be careful with this option. Sometimes it *is* only the network
or a dying daemon on the host that causes the lack of response, and the job
happily runs to completion, so restarting it can lead to confusion. Also, users
who want to take advantage of this option should make sure that all their
file-writing transactions are done cleanly, so that the restarted job does not
continue writing to a file started by the previous run and produce
unexpected results. I discovered the hard way that it created more trouble
than it was worth, and we settled on monitoring forced deletions instead.
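
To illustrate the point about clean file transactions, here is a minimal
sketch, assuming the job writes its results from a Python script to a local
POSIX filesystem. The helper name write_atomically and the file name
result.txt are hypothetical. The idea is to write to a temporary file and
rename it into place only when it is complete, so a rescheduled job never
appends to a half-written file left behind by the earlier run.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write data to path so that a partial file is never left at path.

    Hypothetical helper for illustration; not part of Grid Engine.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory so that the final
    # rename stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # Replace any output from a previous run in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the incomplete temporary file before re-raising.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    write_atomically("result.txt", "final results of this run\n")
```

With this pattern a restarted job overwrites the earlier output as a whole
instead of continuing it. Behaviour on network filesystems may differ and
should be checked separately.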

Iwona


