[GE users] Kill Jobs that appear to be doing nothing

reuti reuti at staff.uni-marburg.de
Mon Jan 11 14:52:51 GMT 2010


Hi,

On 11.01.2010 at 12:01, cgull wrote:

> We recently had a couple of parallel jobs that stopped running.

- what parallel library?
- tightly integrated into SGE?

If the parallel job just hangs there, it may be a programming issue.
The program should output an error message when it loses contact with
the slaves or is stalled for other reasons.


> But the job hung and did not finish correctly. Once the job hung,
> all the nodes related to this job had a load average of 0.00. Is
> there any functionality in SGE for the case where nodes that should
> have a job running on them are idle for a length of time, say two
> hours?

When the resources are still allocated to a (from SGE's point of  
view) running job: no. The parallel job could start a task on these  
nodes in the next second.


> That the job on the machine would be killed, or a notification  
> message sent?

You can specify -l h_rt=... at submission to set a hard wall-clock
runtime limit based on the estimated runtime of the job. After this
time has elapsed, the job will be killed and its resources returned.
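
As a rough illustration of the h_rt suggestion, here is a minimal
Python sketch that turns an estimated runtime into an h_rt value and
assembles the matching qsub command line. The script name "job.sh",
the parallel environment name "mpi", and the slot count are made-up
placeholders, and both helper functions are hypothetical, not part of
SGE. The sketch only builds the command; it does not submit anything.

```python
import shlex


def format_h_rt(seconds: int) -> str:
    """Format a wall-clock limit in seconds as H:MM:SS for -l h_rt=..."""
    if seconds <= 0:
        raise ValueError("runtime limit must be a positive number of seconds")
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def build_qsub_command(script, runtime_seconds, pe_name=None, slots=None):
    """Build a qsub argument list that requests a hard runtime limit.

    pe_name and slots are optional; when both are given, a -pe request
    is added. All values passed in here are placeholders for the example.
    """
    cmd = ["qsub", "-l", f"h_rt={format_h_rt(runtime_seconds)}"]
    if pe_name is not None and slots is not None:
        cmd += ["-pe", pe_name, str(slots)]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Hypothetical job: two-hour limit, 16 slots in a PE called "mpi".
    print(shlex.join(build_qsub_command("job.sh", 2 * 3600, "mpi", 16)))
```

Pick the limit with some headroom above the expected runtime, since a
job that is merely slow will be killed at the limit just like a hung
one.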

-- Reuti


> Thanks for your help.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238109

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].