[GE users] Jobs remaining in d state

Duncan Mortimer duncan at fmrib.ox.ac.uk
Tue May 16 10:56:43 BST 2006

We occasionally see a similar situation: the shepherd hangs around
and can't be killed (HUP or KILL). Looking through the process
listing, the child process can be found and appears to be a zombie,
having no parent. This is under Mac OS X.
Our only solution to clear the locked slot is to reboot the cluster.
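
For anyone wanting to spot such processes in the listing, here is a minimal Python sketch. It assumes that `ps -A -o pid,ppid,stat,command` is accepted (it is by the BSD-derived `ps` on Mac OS X and by procps on Linux) and that a zombie shows a state starting with `Z`. The function names are hypothetical, not part of Grid Engine. It only reports zombies; it does not clear the locked slot.

```python
import subprocess


def find_zombies(ps_output):
    """Return (pid, ppid, command) for every row whose state starts with 'Z'.

    Expects text in the layout produced by: ps -A -o pid,ppid,stat,command
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)  # the command column may contain spaces
        if len(fields) < 3:
            continue  # blank or malformed row
        pid, ppid, stat = fields[0], fields[1], fields[2]
        command = fields[3] if len(fields) > 3 else ""
        if stat.startswith("Z"):
            zombies.append((int(pid), int(ppid), command))
    return zombies


def current_zombies():
    """Run ps on this host and return its zombie processes."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,ppid,stat,command"],
        capture_output=True, text=True, check=True,
    )
    return find_zombies(result.stdout)
```

Calling `current_zombies()` on the execution host returns a list of tuples; the PPID column shows which process, if any, is still expected to reap each zombie.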

On 8 May 2006, at 12:37, Jean-Paul Minet wrote:

> Ron,
>> If the node the job runs on is not reachable by qmaster, then
>> you will encounter that. You can use "qdel -f" to force a
>> cleanup.
> The node is reachable by qmaster: another job is running on the
> same (biproc) node and its cpu usage is correctly reported/updated
> through qstat -ext.  To me there is another problem somewhere.
> Isn't the fact that the qrsh wrapper/link and machinefile have been
> removed from the $TMP directory an indicator that something was
> done in response to the qdel command, but could not be carried
> through to completion?
> Jean-Paul
>>  -Ron
>> --- Jean-Paul Minet <minet at cism.ucl.ac.be> wrote:
>>> Hi,
>>> Regularly, I see jobs deleted by users (qdel) remaining in the
>>> d state.  For example, I have in the qmaster message file:
>>> 05/05/2006 14:12:55|qmaster|lmsp|I|hermet has registered the job 11025 for deletion
>>> and three days later, qstat shows
>>> 11025 0.00581 run.para hermet  dr 05/05/2006 09:40:43 all.q@lmexec-82
>>> There is no user process left running on the mpich head/master
>>> node nor on child/slave nodes.  On the head node, the rsh link
>>> and machine file generated by the startmpi.sh script have been
>>> removed from the /tmp/11025.1.all.q directory, but a
>>> qrsh_client_cache file remains there.
>>> Any clue of where to look for additional info (what prevents
>>> SGE from completing job deletion)?
>>> Thanks
>>> Jean-Paul
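
The thread mentions two things that can be combined: the `dr` state shown by qstat and the `qdel -f` suggestion. A rough Python sketch follows, assuming the default qstat column order (job ID, priority, name, user, state, ...) and job names without whitespace. The function names are hypothetical, and the code only builds the command strings; it does not run them. Whether `qdel -f` frees the slot when the shepherd itself is hung is not settled in this thread.

```python
def jobs_pending_deletion(qstat_output):
    """Return the IDs of jobs whose qstat state column contains 'd'."""
    job_ids = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; headers and rules do not.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        if "d" in fields[4]:  # fifth column is the state, e.g. 'dr'
            job_ids.append(fields[0])
    return job_ids


def force_delete_commands(job_ids):
    """Build (but do not execute) one 'qdel -f' command per job ID."""
    return ["qdel -f " + job_id for job_id in job_ids]
```

Feeding it the qstat line quoted above yields job 11025, and `force_delete_commands` turns that into the `qdel -f 11025` command to review before running.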

Duncan Mortimer
duncan at fmrib.ox.ac.uk
