[GE users] job won't timeout with -l h_rt

Bill Comisky bcomisky at pobox.com
Tue Mar 28 01:03:10 BST 2006

I use SGE to run task arrays of jobs.  I submit them with qsub and a
hard run-time limit, for example:

-l h_rt=300

so that any job running longer than I want (5 minutes here) gets killed.
I tested this with a sleep script and it seemed to work well, but I've
since encountered jobs that are not killed.  A job stuck like this keeps
the task array from finishing and holds everything up.
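
For reference, the limit is given in seconds, so 5 minutes is 300.  A
minimal sketch of how such a submission line might be put together; the
task range "1-1200" and the script name "job.sh" are placeholders for
illustration only, and the command is printed rather than run:

```shell
# Convert a limit in minutes to the seconds value passed to -l h_rt.
h_rt_seconds() {
    echo $(( $1 * 60 ))
}

# Build (but do not run) a task-array submission with a 5-minute limit.
# "1-1200" and "job.sh" are hypothetical placeholders.
echo "qsub -t 1-1200 -l h_rt=$(h_rt_seconds 5) job.sh"
```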

I've turned off most accounting/logging because I run many small jobs,
but I'd like to know if there's anything I can do to run a post-mortem
on this job, to see why it failed and/or why the scheduler won't kill it.
The job in question is still sitting in the queue, and qstat shows it as
running.  It's job ID 14379 (task 2754) on node036 below:

$ qstat -q all.q@node036
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
   14379 0.50098 clens1w    user         r     03/26/2006 14:28:02 all.q@node036                      1 2754
   15421 0.50049 clens2     user         t     03/27/2006 13:17:46 all.q@node036                      1 239
   15421 0.50049 clens2     user         qw    03/27/2006 13:17:05                                    1 258-1200:1

$ cat /opt/sge/default/spool/node036/active_jobs/14379.2754/trace
03/26/2006 16:52:53 [0:28890]: shepherd called with uid = 0, euid = 0
03/26/2006 16:52:53 [0:28890]: starting up 6.0u7
03/26/2006 16:52:53 [0:28890]: setpgid(28890, 28890) returned 0
03/26/2006 16:52:53 [0:28890]: no prolog script to start
03/26/2006 16:52:53 [0:28890]: forked "job" with pid 28891
03/26/2006 16:52:53 [0:28891]: pid=28891 pgrp=28891 sid=28891 old pgrp=28890 getlogin()=<no login set>

$ rsh node036 -- ps auxw | grep 2889
root     28890  0.0  0.3  2840  952 ?        D    Mar26   0:00 sge_shepherd-14379 -bg
root     28891  0.0  0.3  2840  960 ?        Ds   Mar26   0:00 sge_shepherd-14379 -bg
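
Both shepherd processes show "D" in the STAT column.  Assuming this is
the usual Linux ps, "D" means uninterruptible sleep (typically waiting
on I/O), and a process in that state generally does not act on signals,
SIGKILL included, until it leaves it.  A minimal sketch for picking such
processes out of "ps auxw" style output; the helper name find_dstate is
made up for illustration:

```shell
# Print PID and STAT for every process whose STAT starts with "D".
# Expects "ps auxw" style columns on stdin: column 2 is PID, column 8 is STAT.
find_dstate() {
    awk '$8 ~ /^D/ { print $2, $8 }'
}

# Sample input copied from the node036 listing above.
find_dstate <<'EOF'
root     28890  0.0  0.3  2840  952 ?        D    Mar26   0:00 sge_shepherd-14379 -bg
root     28891  0.0  0.3  2840  960 ?        Ds   Mar26   0:00 sge_shepherd-14379 -bg
EOF
```

Against the live node this could be run as
"rsh node036 -- ps auxw | find_dstate".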

I can run this job manually and it works fine, exiting in a matter of
seconds.  Any pointers on what I'm doing wrong or how to diagnose the
problem?  For the work I'm doing it would be much better to hard kill this
job, or any like it, than to let it hold up completion of the task array.


Bill Comisky
bcomisky at pobox.com
