[GE users] job won't timeout with -l h_rt

Reuti reuti at staff.uni-marburg.de
Wed Mar 29 06:51:08 BST 2006


Hey again,

On 28.03.2006, at 22:01, Bill Comisky wrote:

> On Tue, 28 Mar 2006, Reuti wrote:
>
>> Hi,
>>
>> On 28.03.2006, at 02:03, Bill Comisky wrote:
>>
>>> I use SGE to run task arrays of jobs.  I've been using qsub with:
>>> -l h_rt=300
>>> for example to kill any jobs that take longer than I want (5 min  
>>> for above).  I tested with a sleep script and it seemed to work  
>>> great, but I've encountered jobs that won't be killed.  This  
>>> keeps the task array from finishing and holds everything up.
>>> I've turned off most accounting/logging because I run many small  
>>> jobs.. but I'd like to know if there's anything I can do to post- 
>>> mortem this run to see why it failed and/or why the scheduler  
>>> won't kill it.  The job in question is still sitting in the  
>>> queue, and it shows that it is running. It's the jobID 14379 on  
>>> node036 below:
>>> $ qstat -q all.q@node036
>>> job-ID  prior   name       user   state submit/start at      queue          slots ja-task-ID
>>> ---------------------------------------------------------------------------------------------
>>>   14379 0.50098 clens1w    user   r     03/26/2006 14:28:02  all.q@node036      1 2754
>>>   15421 0.50049 clens2     user   t     03/27/2006 13:17:46  all.q@node036      1 239
>>>   15421 0.50049 clens2     user   qw    03/27/2006 13:17:05                     1 258-1200:1
>>> $ cat /opt/sge/default/spool/node036/active_jobs/14379.2754/trace
>>> 03/26/2006 16:52:53 [0:28890]: shepherd called with uid = 0, euid = 0
>>> 03/26/2006 16:52:53 [0:28890]: starting up 6.0u7
>>> 03/26/2006 16:52:53 [0:28890]: setpgid(28890, 28890) returned 0
>>> 03/26/2006 16:52:53 [0:28890]: no prolog script to start
>>> 03/26/2006 16:52:53 [0:28890]: forked "job" with pid 28891
>>> 03/26/2006 16:52:53 [0:28891]: pid=28891 pgrp=28891 sid=28891 old pgrp=28890 getlogin()=<no login set>
>>> $ rsh node036 -- ps auxw | grep 2889
>>> root     28890  0.0  0.3  2840  952 ?        D    Mar26   0:00 sge_shepherd-14379 -bg
>>> root     28891  0.0  0.3  2840  960 ?        Ds   Mar26   0:00 sge_shepherd-14379 -bg
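
For the record, the kind of submission described at the top of the quote
boils down to something like the following (the script name and task
range are made up here; only the -l h_rt part matters):

$ qsub -t 1-3000 -l h_rt=300 ./run_task.sh

Once h_rt is exceeded, each task is supposed to be killed after the given
number of seconds, which is exactly what is not happening for the
shepherds shown above.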
>>
>> status D means:
>>
>> D    Uninterruptible sleep (usually IO)
>>
>
> Good to know..
>
>> Are you facing any hard disk or NFS problems in the cluster? Are
>> the spool directories local on the nodes or NFS mounted, e.g. in
>> /usr/sge/default/spool/...?
>
> The nodes are all diskless, with the spool directories NFS mounted.
> I haven't noticed any other disk/NFS issues, other than that out of
> thousands of jobs and hundreds of task arrays the occasional task
> will do this; there have been two in the last few days.
>
> Short of having a cron job periodically qdel jobs with this status,  
> do you know of any SGE options which I could use to cull these  
> jobs?  Or maybe an NFS mount option?
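
As an aside on the cron idea: I don't know of an SGE option for this
case, but a crude check run from cron on the nodes could be as simple as
the sketch below. It only reports shepherds stuck in uninterruptible
sleep; the field positions are assumptions about your ps output, so
verify them before relying on it. Jobs found this way can then be
removed by hand with qdel -f <job-ID>.

#!/bin/sh
# report sge_shepherd processes stuck in state D (uninterruptible sleep)
ps -eo stat,pid,args | awk '$1 ~ /^D/ && $3 ~ /^sge_shepherd/ { print }'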

Back to the actual problem: are the directories hard- or soft-mounted,
or mounted via the automounter? Although I have no immediate clue, maybe
you can post some lines of the /etc/exports on the server and the
/etc/fstab of the nodes.
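
For illustration only (host and path names made up), the relevant lines
usually look something like this:

node /etc/fstab:     server:/opt/sge  /opt/sge  nfs  rw,hard,intr  0 0
server /etc/exports: /opt/sge  node*(rw,no_root_squash,sync)

With a soft mount, NFS gives up and returns an I/O error once the retries
are exhausted; with a hard mount the process keeps waiting in
uninterruptible sleep, which would fit the state D you are seeing.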

-- Reuti

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
