[GE users] job won't timeout with -l h_rt
bcomisky at pobox.com
Wed Mar 29 16:43:27 BST 2006
On Wed, 29 Mar 2006, Reuti wrote:
> Hey again,
>> > Are you facing any harddisk or NFS problems in the cluster? Are the spool
>> > directories local on the nodes or NFS mounted in e.g.
>> > /usr/sge/default/spool/...?
>> The nodes are all diskless with spool directories NFS mounted. I haven't
>> noticed any other issues with disk/NFS, except that out of thousands of
>> jobs and hundreds of task arrays, an occasional task array job will do
>> this. There have been two in the last few days.
>> Short of having a cron job periodically qdel jobs with this status, do you
>> know of any SGE options which I could use to cull these jobs? Or maybe an
>> NFS mount option?
> Are the directories hard- or soft-mounted, or handled by an automounter?
> Although I have no immediate clue, maybe you can post some lines from the
> /etc/exports of the server and the /etc/fstab of the nodes.
The entries in the NFS server exports file:
and the corresponding node fstab entries:
thor-sharedfs:/opt /opt nfs nfsvers=2 0 0
thor-sharedfs:/tmpnfs /tmpnfs nfs nfsvers=2 0 0
thor-sharedfs:/home /home nfs nfsvers=2 0 0
I did a little more googling on 'uninterruptible sleep' processes: they
can't be killed, even with 'kill -9'. 'qdel' won't remove them from the
queue (presumably because they can't be killed), but 'qdel -f' will remove
them from the queue, although the processes aren't killed on the node.
A quick search for these processes on our cluster finds 4 of them:
$ pdsh -- ps auxw | grep ' D '
node006: root 9274 0.0 0.0 8448 1076 ? D Mar28 0:00 sge_shepherd-15823 -bg
node036: root 28890 0.0 0.3 2840 952 ? D Mar26 0:00 sge_shepherd-14379 -bg
node030: root 23771 0.0 0.3 2840 952 ? D Mar23 0:00 sge_shepherd-13145 -bg
node031: root 6820 0.0 0.3 2840 952 ? D Mar23 0:00 sge_shepherd-13270 -bg
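As an aside, grepping for ' D ' only matches a bare D in the STAT column, so it would miss states such as 'Ds' or 'D+'. A minimal sketch of a stricter filter follows. It assumes the input is pdsh-prefixed output of 'ps -eo pid,stat,args' (so the state is the third field), and the function name filter_d_state is made up for illustration.

```shell
# Keep only lines whose process state starts with D (uninterruptible sleep).
# Assumed input format: "<node>: <pid> <stat> <command...>", i.e. what
# pdsh prints for `ps -eo pid,stat,args` on each node.
filter_d_state() {
    awk '$3 ~ /^D/'
}

# Example use (assumes pdsh -a reaches all nodes on this cluster):
#   pdsh -a 'ps -eo pid,stat,args' | filter_d_state
```

Matching on the state field rather than on the whole line also avoids false hits from command lines that happen to contain ' D '.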
I may have to reboot to get rid of them, but in the meantime I would still
like to force SGE to 'qdel -f' these jobs if they exceed the -l h_rt time.
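In case no built-in option turns up, here is a rough sketch of the cron-job fallback, not a tested solution. It makes several assumptions: that 'qstat -s r -u "*"' lists running jobs with the start date and time in columns 6 and 7, that 'qstat -j' prints a 'hard resource_list:' line containing h_rt, and that GNU date is available. The function names, the grace period and the CULL_DRY_RUN variable are made up for illustration. Note that it acts on the whole job ID, so for a task array it would remove every task, not just the stuck one.

```shell
#!/bin/bash
# Sketch: force-delete running SGE jobs that have overrun their h_rt request.

# Convert an h_rt value (plain seconds or H:M:S) to seconds.
hrt_to_seconds() {
    case "$1" in
        *:*:*) echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }' ;;
        *)     echo "$1" ;;
    esac
}

# Read `qstat -j <id>` output on stdin and print the h_rt request in
# seconds; print nothing if the job has no h_rt request.
extract_hrt() {
    local value
    value=$(sed -n 's/^hard resource_list:.*h_rt=\([0-9:]*\).*/\1/p' | head -n 1)
    if [ -n "$value" ]; then
        hrt_to_seconds "$value"
    fi
}

# Succeed if a job started at epoch $1 with a limit of $2 seconds is
# overdue at epoch $3, allowing $4 seconds of grace.
is_overdue() {
    [ $(( $3 - $1 )) -gt $(( $2 + $4 )) ]
}

# Walk the running jobs and delete the overdue ones.
# $1 is the grace period in seconds (default 300).
cull_overdue_jobs() {
    local grace=${1:-300} now id start_date start_time limit start
    now=$(date +%s)
    # Assumed column layout: job-ID prior name user state date time ...
    qstat -s r -u '*' | awk 'NR > 2 { print $1, $6, $7 }' | sort -u |
    while read -r id start_date start_time; do
        limit=$(qstat -j "$id" | extract_hrt)
        [ -n "$limit" ] || continue
        start=$(date -d "$start_date $start_time" +%s) || continue
        if is_overdue "$start" "$limit" "$now" "$grace"; then
            if [ "${CULL_DRY_RUN:-1}" = "1" ]; then
                # Default: only report what would be deleted.
                echo "would run: qdel -f $id (h_rt=${limit}s exceeded)"
            else
                qdel -f "$id"
            fi
        fi
    done
}

# From cron, something like:  CULL_DRY_RUN=0 cull_overdue_jobs 300
```

The dry-run default is deliberate: 'qdel -f' only clears the job from the queue and leaves the stuck process on the node, so it seems worth checking the candidate list before letting cron act on it.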
bcomisky at pobox.com