[GE users] job won't timeout with -l h_rt

Reuti reuti at staff.uni-marburg.de
Wed Mar 29 18:14:11 BST 2006


On 29.03.2006 at 17:43, Bill Comisky wrote:

> On Wed, 29 Mar 2006, Reuti wrote:
>
>> Hey again,
>>
>>> > Are you facing any harddisk or NFS problems in the cluster? Are
>>> > the spool directories local on the nodes or NFS mounted in e.g.
>>> > /usr/sge/default/spool/...?
>>> The nodes are all diskless with spool directories NFS mounted.  I  
>>> haven't noticed any other issues with disk/NFS, other than I'll  
>>> get one of these task array jobs out of thousands of jobs and  
>>> hundreds of task arrays that will do this.. there have been two  
>>> in the last few days.
>>> Short of having a cron job periodically qdel jobs with this  
>>> status, do you know of any SGE options which I could use to cull  
>>> these jobs?  Or maybe an NFS mount option?
>>
>> Are the directories hard- or soft-mounted, or handled by the
>> automounter? Although I have no immediate clue, maybe you can post
>> some lines of the server's /etc/exports and the nodes' /etc/fstab.
>
> The entries in the NFS server exports file:
>
> /opt     172.16.0.0/255.255.252.0(rw,no_root_squash,async)
> /tmpnfs 172.16.0.0/255.255.252.0(rw,no_root_squash,async)
> /home   172.16.0.0/255.255.252.0(rw,no_root_squash,async)
>
> and the corresponding node fstab entries:
>
> thor-sharedfs:/opt     /opt             nfs     nfsvers=2       0 0
> thor-sharedfs:/tmpnfs  /tmpnfs           nfs     nfsvers=2       0 0
> thor-sharedfs:/home     /home   nfs     nfsvers=2       0 0

What about using UDP and NFS version 3 (the mounts above use nfsvers=2)?

> I did a little more googling on 'uninterruptible sleep'
> processes: they can't be killed, even with 'kill -9'.  'qdel' won't
> remove them from the queue (presumably since they can't be killed),
> but 'qdel -f' will remove them from the queue, though the processes
> aren't killed on the node.
> A quick search for these processes on our cluster finds 4
> processes in total:
>
> $ pdsh -- ps auxw | grep ' D '
> node006: root      9274  0.0  0.0  8448 1076 ?        D    Mar28   0:00 sge_shepherd-15823 -bg
> node036: root     28890  0.0  0.3  2840  952 ?        D    Mar26   0:00 sge_shepherd-14379 -bg
> node030: root     23771  0.0  0.3  2840  952 ?        D    Mar23   0:00 sge_shepherd-13145 -bg
> node031: root      6820  0.0  0.3  2840  952 ?        D    Mar23   0:00 sge_shepherd-13270 -bg

Can you check with "lsof" which files these processes have
open? - Reuti
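
If lsof is not installed on the diskless nodes, roughly the same
information can be read from /proc on Linux. The following is only a
sketch under that assumption: the function names are made up for
illustration, and it relies on the state letter being the first field
after the parenthesised command name in /proc/<pid>/stat and on
/proc/<pid>/fd holding one symlink per open file. It also avoids the
false matches that grep ' D ' can produce on other columns.

```python
import os


def parse_stat_state(stat_line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' instead of on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state


def open_files(pid, proc_root="/proc"):
    """Return the targets of the fd symlinks of pid (similar to lsof -p)."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = []
    try:
        names = os.listdir(fd_dir)
    except OSError:
        # Process vanished or we lack permission: report nothing.
        return targets
    for name in names:
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            pass
    return targets


def find_d_state(proc_root="/proc"):
    """Return (pid, comm, open files) for every process in state D."""
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_state(fh.read())
        except OSError:
            continue
        if state == "D":
            hits.append((pid, comm, open_files(pid, proc_root)))
    return hits
```

Run as root on a node, printing the result of find_d_state() should
list the stuck sge_shepherd processes together with the paths they
hold open, which would show whether they all hang on the NFS-mounted
spool directory.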


> I may have to reboot to get rid of them, but in the meantime I  
> would still like to force SGE to 'qdel -f' these jobs if they  
> exceed the -l h_rt time limit.
>
> Bill
>
> --
> Bill Comisky
> bcomisky at pobox.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
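
Regarding the cron job idea: the decision part could look like the
sketch below. It is an illustration only, not an SGE feature. The
function names and the grace period are hypothetical, and the list of
(job id, start time, h_rt) tuples is assumed to be collected some other
way, since how to get it from qstat is not covered here. The sketch
only builds the 'qdel -f' command lines and does not run them.

```python
def overrunning_jobs(jobs, now, grace_seconds=300):
    """Return ids of jobs that ran longer than h_rt plus a grace period.

    jobs is an iterable of (job_id, start_epoch, h_rt_seconds) tuples;
    now is the current time in seconds since the epoch.
    """
    late = []
    for job_id, start_epoch, h_rt_seconds in jobs:
        if now - start_epoch > h_rt_seconds + grace_seconds:
            late.append(job_id)
    return late


def qdel_force_commands(job_ids):
    """Build one 'qdel -f <job_id>' argument list per job id."""
    return [["qdel", "-f", str(job_id)] for job_id in job_ids]
```

The grace period is there so that the script does not race with the
normal h_rt enforcement by the execd. Selecting single tasks of an
array job is left out. As noted above, 'qdel -f' only removes the job
from the queue; the process in state D stays on the node.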
