[GE users] job won't timeout with -l h_rt

Bill Comisky bcomisky at pobox.com
Tue Mar 28 21:01:55 BST 2006


On Tue, 28 Mar 2006, Reuti wrote:

> Hi,
>
> On 28.03.2006 at 02:03, Bill Comisky wrote:
>
>> I use SGE to run task arrays of jobs.  I've been using qsub with:
>> 
>> -l h_rt=300
>> 
>> for example, to kill any job that runs longer than I want (5 minutes in 
>> this case).  I tested with a sleep script and it seemed to work, but I've 
>> encountered jobs that don't get killed.  This keeps the task array from 
>> finishing and holds everything up.
>> 
>> I've turned off most accounting/logging because I run many small jobs, but 
>> I'd like to know if there's anything I can do to investigate this run after 
>> the fact, to see why it failed and/or why the scheduler won't kill it.  The 
>> job in question is still sitting in the queue, shown as running.  It's job 
>> ID 14379 on node036 below:
>> 
>> $ qstat -q all.q@node036
>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>>   14379 0.50098 clens1w    user         r     03/26/2006 14:28:02 all.q@node036                      1 2754
>>   15421 0.50049 clens2     user         t     03/27/2006 13:17:46 all.q@node036                      1 239
>>   15421 0.50049 clens2     user         qw    03/27/2006 13:17:05                                    1 258-1200:1
>> 
>> $ cat /opt/sge/default/spool/node036/active_jobs/14379.2754/trace
>> 03/26/2006 16:52:53 [0:28890]: shepherd called with uid = 0, euid = 0
>> 03/26/2006 16:52:53 [0:28890]: starting up 6.0u7
>> 03/26/2006 16:52:53 [0:28890]: setpgid(28890, 28890) returned 0
>> 03/26/2006 16:52:53 [0:28890]: no prolog script to start
>> 03/26/2006 16:52:53 [0:28890]: forked "job" with pid 28891
>> 03/26/2006 16:52:53 [0:28891]: pid=28891 pgrp=28891 sid=28891 old pgrp=28890 getlogin()=<no login set>
>> 
>> $ rsh node036 -- ps auxw | grep 2889
>> root     28890  0.0  0.3  2840  952 ?        D    Mar26   0:00 sge_shepherd-14379 -bg
>> root     28891  0.0  0.3  2840  960 ?        Ds   Mar26   0:00 sge_shepherd-14379 -bg
>> 
>
> status D means:
>
> D    Uninterruptible sleep (usually IO)
>

Good to know.
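
A process in state D does not act on signals, SIGKILL included, until the
blocking call returns, which would be consistent with the h_rt kill having
no visible effect here.  As a rough sketch for spotting such processes, the
Python below filters `ps auxw` style output for a STAT column starting with
D.  The function name, the header line and the syslogd row in the sample are
made up for illustration; only the two sge_shepherd rows come from the
output quoted above.

```python
def uninterruptible(ps_text):
    """Return (pid, stat, command) for each process whose STAT starts with 'D'."""
    found = []
    for line in ps_text.splitlines():
        # Expected columns:
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line, or a line in a layout we do not understand
        pid, stat, command = int(fields[1]), fields[7], fields[10].rstrip()
        if stat.startswith("D"):
            found.append((pid, stat, command))
    return found


# Sample input: the header and the syslogd row are hypothetical.
sample = """\
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root     28890  0.0  0.3  2840  952 ?        D    Mar26   0:00 sge_shepherd-14379 -bg
root     28891  0.0  0.3  2840  960 ?        Ds   Mar26   0:00 sge_shepherd-14379 -bg
root       812  0.0  0.1  1500  400 ?        Ss   Mar20   0:01 syslogd
"""

for pid, stat, command in uninterruptible(sample):
    print(pid, stat, command)
```

A single snapshot can also catch processes that are only briefly waiting on
I/O, so a process would presumably need to show up in D state on several
consecutive checks before it is treated as stuck.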

> Are you facing any harddisk or NFS problems in the cluster? Are the spool 
> directories local on the nodes or NFS mounted in e.g. 
> /usr/sge/default/spool/...?

The nodes are all diskless, with the spool directories NFS mounted.  I 
haven't noticed any other disk or NFS issues; it's just that, out of 
thousands of jobs across hundreds of task arrays, an occasional task hangs 
like this.  There have been two in the last few days.

Short of having a cron job periodically qdel jobs with this status, do you 
know of any SGE options which I could use to cull these jobs?  Or maybe an 
NFS mount option?
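
For reference, the cron-job fallback could look roughly like the Python
sketch below.  It assumes qstat output in the column layout shown above,
with each job on one line, and it takes the run-time limit as a parameter
because the h_rt value does not appear in that output.  The function names
and the 60 second grace period are made up for illustration, and the
`qdel -f <job> -t <task>` form should be checked against the local qdel(1)
man page before use.

```python
from datetime import datetime, timedelta


def overdue_tasks(qstat_text, now, limit_seconds, grace_seconds=60):
    """Return (job_id, task_id, queue) for running tasks past limit + grace."""
    cutoff = timedelta(seconds=limit_seconds + grace_seconds)
    overdue = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # A running array task is assumed to have exactly 10 columns:
        # job-ID prior name user state date time queue slots ja-task-ID
        if len(fields) != 10 or fields[4] != "r":
            continue
        try:
            started = datetime.strptime(
                fields[5] + " " + fields[6], "%m/%d/%Y %H:%M:%S"
            )
        except ValueError:
            continue  # not a date/time pair, so not a job line
        if now - started > cutoff:
            overdue.append((fields[0], fields[9], fields[7]))
    return overdue


def qdel_command(job_id, task_id):
    # Assumption: forced deletion of a single array task; verify the syntax
    # against the local qdel(1) before running it.
    return ["qdel", "-f", job_id, "-t", task_id]


# Sample input taken from the qstat output quoted above.
sample_qstat = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  14379 0.50098 clens1w    user         r     03/26/2006 14:28:02 all.q@node036                      1 2754
  15421 0.50049 clens2     user         t     03/27/2006 13:17:46 all.q@node036                      1 239
  15421 0.50049 clens2     user         qw    03/27/2006 13:17:05                                    1 258-1200:1
"""

for job_id, task_id, queue in overdue_tasks(
    sample_qstat, datetime(2006, 3, 28, 21, 0, 0), limit_seconds=300
):
    # Print the command instead of running it, so the list can be reviewed.
    print(" ".join(qdel_command(job_id, task_id)))
```

The sketch only prints the commands.  As far as I understand it, a forced
qdel removes the job from the qmaster's list even if the execd does not
respond, so the stuck shepherd process may well remain on the node until
the blocked I/O returns.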

Thanks for your help,
Bill

--
Bill Comisky
bcomisky at pobox.com
