[GE users] Kill Jobs that appear to be doing nothing

reuti reuti at staff.uni-marburg.de
Thu Jan 14 10:43:40 GMT 2010


On 11.01.2010 at 16:57, cgull wrote:

> Hi, thanks for your reply.
> By the sound of it, we will probably end up using -l h_rt=.
> Some of these jobs run for 60 hours or so. If this failure occurs
> in the first few hours, it could be a lot of cluster time wasted?

Yes, SGE cannot judge on its own what to do with such a job.

> Is there a better way than -l h_rt?

You could use a cron job which checks the CPU consumption of all
jobs in the system and kills any job whose consumption no longer
increases. The necessary information is available from:

$ qstat -j <job_id>

in the line "usage 1: cpu=00:24:49 ...". If the value does not change
after a certain time (I think how long you have to wait depends on
"load_report_time" in SGE's configuration), the job is eligible to be
killed.
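
A minimal sketch of the check such a cron job could perform, written in
Python. It assumes the usage line has the form shown above, with the cpu
value given as [days:]hours:minutes:seconds. The function names and the
min_increase threshold are made up for this illustration and are not part
of SGE.

```python
import re
import subprocess

# Matches e.g. "usage    1:    cpu=00:24:49, mem=..." in `qstat -j` output.
CPU_RE = re.compile(r"^usage\s+\d+:\s+cpu=([\d:]+)", re.MULTILINE)


def parse_cpu_seconds(qstat_output):
    """Return the cpu= value of the first usage line in seconds, or None."""
    match = CPU_RE.search(qstat_output)
    if match is None:
        return None  # no usage reported (job pending, or format differs)
    parts = [int(p) for p in match.group(1).split(":")]
    # Fields are read right to left: seconds, minutes, hours, optional days.
    multipliers = [1, 60, 3600, 86400]
    return sum(v * m for v, m in zip(reversed(parts), multipliers))


def cpu_seconds_of_job(job_id):
    """Run `qstat -j <job_id>` and return the consumed CPU time in seconds."""
    result = subprocess.run(
        ["qstat", "-j", str(job_id)],
        capture_output=True, text=True, check=True,
    )
    return parse_cpu_seconds(result.stdout)


def looks_stalled(previous, current, min_increase=1):
    """True if CPU time grew by less than min_increase seconds (assumed threshold)."""
    if previous is None or current is None:
        return False  # not enough information, so do not kill anything
    return current - previous < min_increase
```

The cron job would store the value from the previous run for each job,
compare it with the current one, and call qdel for the job only when
looks_stalled() returns true. The interval between two runs should be
long enough for the usage value to have been refreshed in between.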

-- Reuti

> Thanks again for your time,
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238124
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


