[GE users] job won't timeout with -l h_rt

Reuti reuti at staff.uni-marburg.de
Tue Mar 28 07:17:21 BST 2006


Hi,

On 28.03.2006 at 02:03, Bill Comisky wrote:

> I use SGE to run task arrays of jobs.  I've been using qsub with:
>
> -l h_rt=300
>
> for example to kill any jobs that take longer than I want (5 min  
> for above).  I tested with a sleep script and it seemed to work  
> great, but I've encountered jobs that won't be killed.  This keeps  
> the task array from finishing and holds everything up.
>
> I've turned off most accounting/logging because I run many small
> jobs, but I'd like to know if there's anything I can do to
> post-mortem this run to see why it failed and/or why the scheduler
> won't kill it.  The job in question is still sitting in the queue,
> and it shows that it is running. It's the jobID 14379 on node036 below:
>
> $ qstat -q all.q@node036
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>   14379 0.50098 clens1w    user         r     03/26/2006 14:28:02 all.q@node036                      1 2754
>   15421 0.50049 clens2     user         t     03/27/2006 13:17:46 all.q@node036                      1 239
>   15421 0.50049 clens2     user         qw    03/27/2006 13:17:05                                    1 258-1200:1
>
> $ cat /opt/sge/default/spool/node036/active_jobs/14379.2754/trace
> 03/26/2006 16:52:53 [0:28890]: shepherd called with uid = 0, euid = 0
> 03/26/2006 16:52:53 [0:28890]: starting up 6.0u7
> 03/26/2006 16:52:53 [0:28890]: setpgid(28890, 28890) returned 0
> 03/26/2006 16:52:53 [0:28890]: no prolog script to start
> 03/26/2006 16:52:53 [0:28890]: forked "job" with pid 28891
> 03/26/2006 16:52:53 [0:28891]: pid=28891 pgrp=28891 sid=28891 old pgrp=28890 getlogin()=<no login set>
>
> $ rsh node036 -- ps auxw | grep 2889
> root     28890  0.0  0.3  2840  952 ?        D    Mar26   0:00 sge_shepherd-14379 -bg
> root     28891  0.0  0.3  2840  960 ?        Ds   Mar26   0:00 sge_shepherd-14379 -bg
>

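Status D means:

D    Uninterruptible sleep (usually IO)

Both shepherd processes are stuck in this state. A process in
uninterruptible sleep does not act on signals (not even SIGKILL) until
the pending I/O completes, which would explain why the h_rt limit
cannot kill the job. Are you facing any hard disk or NFS problems in
the cluster? Are the spool directories local on the nodes or NFS
mounted in e.g. /usr/sge/default/spool/...?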
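
Both questions can be checked from a shell on the affected node. The
following is only a rough sketch: the function names d_state_procs and
spool_fs_type are made up for illustration and are not part of SGE,
and the second helper assumes GNU df (its -T option adds a filesystem
type column).

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not SGE tools.

# Keep rows whose second column (the process state) starts with "D",
# i.e. processes in uninterruptible sleep.
# Expects "PID STAT WCHAN COMMAND..." lines on stdin.
d_state_procs() {
    awk '$2 ~ /^D/'
}

# Print the filesystem type backing a directory, e.g. "nfs" or "ext3".
# Assumes GNU df, whose -T option adds a Type column.
spool_fs_type() {
    df -PT "$1" | awk 'NR == 2 { print $2 }'
}
```

On the node, `ps -eo pid,stat,wchan,args | d_state_procs` should list
the stuck processes together with the kernel function they are
sleeping in, and `spool_fs_type /opt/sge/default/spool/node036` (the
spool path from the trace above) shows whether that directory is on
NFS or on a local disk.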

-- Reuti


> I can run this job manually and it works fine and exits in a matter  
> of seconds.  Any pointers for what I'm doing wrong or how to  
> diagnose the problem?  For the work I'm doing it would be much  
> better to hard kill this job or any like it than let it hold up the  
> completion of the task array.
>
> thanks,
> Bill
>
> --
> Bill Comisky
> bcomisky at pobox.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
