[GE users] job won't timeout with -l h_rt

Bill Comisky bcomisky at pobox.com
Wed Mar 29 16:43:27 BST 2006


On Wed, 29 Mar 2006, Reuti wrote:

> Hey again,
>
>> > Are you facing any harddisk or NFS problems in the cluster? Are the spool 
>> > directories local on the nodes or NFS mounted in e.g. 
>> > /usr/sge/default/spool/...?
>> 
>> The nodes are all diskless with spool directories NFS mounted.  I haven't 
>> noticed any other issues with disk/NFS, other than that an occasional task 
>> array job, out of thousands of jobs and hundreds of task arrays, will do 
>> this; there have been two in the last few days.
>> 
>> Short of having a cron job periodically qdel jobs with this status, do you 
>> know of any SGE options which I could use to cull these jobs?  Or maybe an 
>> NFS mount option?
>
> are the directories hard- or soft-mounted, or handled by the automounter? 
> Although I have no immediate clue, maybe you can post some lines of the 
> /etc/exports of the server and the /etc/fstab of the nodes.

The entries in the NFS server exports file:

/opt     172.16.0.0/255.255.252.0(rw,no_root_squash,async)
/tmpnfs 172.16.0.0/255.255.252.0(rw,no_root_squash,async)
/home   172.16.0.0/255.255.252.0(rw,no_root_squash,async)

and the corresponding node fstab entries:

thor-sharedfs:/opt     /opt             nfs     nfsvers=2       0 0
thor-sharedfs:/tmpnfs  /tmpnfs           nfs     nfsvers=2       0 0
thor-sharedfs:/home     /home   nfs     nfsvers=2       0 0

I did a little more googling on 'uninterruptible sleep' processes: they 
can't be killed, even with 'kill -9'.  'qdel' won't remove them from the 
queue (presumably since they can't be killed), but 'qdel -f' will remove 
them from the queue, though the processes aren't killed on the node.
A quick search for these processes on our cluster finds 4 in total:

$ pdsh -- ps auxw | grep ' D '
node006: root      9274  0.0  0.0  8448 1076 ?        D    Mar28   0:00 sge_shepherd-15823 -bg
node036: root     28890  0.0  0.3  2840  952 ?        D    Mar26   0:00 sge_shepherd-14379 -bg
node030: root     23771  0.0  0.3  2840  952 ?        D    Mar23   0:00 sge_shepherd-13145 -bg
node031: root      6820  0.0  0.3  2840  952 ?        D    Mar23   0:00 sge_shepherd-13270 -bg
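
A plain grep for ' D ' only catches a state of exactly "D" and could also match 
a stray " D " elsewhere on the line. As a minimal sketch, the filter below 
tests the STAT column itself. The function name d_state is made up for 
illustration, and it assumes the pdsh "node:" prefix shown above, which makes 
STAT the ninth whitespace-separated field (it would be the eighth for plain 
ps output).

```shell
# Keep only processes in uninterruptible sleep from `pdsh ... ps auxw` output.
# Assumption: each line starts with pdsh's "node:" prefix, so the ps STAT
# column is field 9. Matching /^D/ also catches variants such as "Ds" or "D+".
d_state() {
    awk '$9 ~ /^D/'
}

# Usage: pdsh -- ps auxw | d_state
```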

I may have to reboot to get rid of them, but in the meantime I would still 
like to force SGE to 'qdel -f' these jobs if they exceed the -l h_rt time 
limit.
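
As a stopgap, something like the following could be run by hand or from cron. 
It is only a sketch: the function name stuck_job_cmds is hypothetical, it 
assumes the number after "sge_shepherd-" is the SGE job ID, and it does not 
look at h_rt at all, only at whether the shepherd is stuck in D state. It 
prints the 'qdel -f' commands rather than running them, since for a task 
array the bare job ID would target the whole array and the list should be 
reviewed first.

```shell
# Read `pdsh ... ps auxw` lines on stdin and print one "qdel -f <jobid>"
# line per shepherd stuck in uninterruptible sleep. Nothing is executed.
# Assumptions: pdsh "node:" prefix (STAT is field 9, command is field 12),
# and the shepherd process name has the form sge_shepherd-<jobid>.
stuck_job_cmds() {
    awk '$9 ~ /^D/ && $12 ~ /^sge_shepherd-[0-9]+$/ {
        sub(/^sge_shepherd-/, "", $12)   # keep only the numeric job ID
        print "qdel -f " $12
    }' | sort -u                         # one line per job ID
}

# Usage: pdsh -- ps auxw | stuck_job_cmds
```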

Bill

--
Bill Comisky
bcomisky at pobox.com
