[GE users] Jobs remaining in d state

Duncan Mortimer duncan at fmrib.ox.ac.uk
Tue May 16 10:56:43 BST 2006


We occasionally see a similar situation: the shepherd hangs around
and can't be killed (HUP or KILL). Looking through the process
listing, the child process can be found and appears to be a zombie
with no parent. This is under Mac OS X.
Our only solution to clear the locked slot is to reboot the cluster
node.
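
For anyone wanting to repeat the process-listing check described above, here is a minimal sketch in Python. It is only an illustration, not what we actually run: the function names (parse_ps, find_zombies, list_zombies) are made up for this example, and it assumes a ps that accepts the POSIX options -A and -o and reports a state string beginning with "Z" for zombie processes.

```python
import subprocess


def parse_ps(text):
    """Parse lines of 'pid ppid stat command' into a list of dicts."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        pid, ppid, stat, comm = parts
        procs.append(
            {"pid": int(pid), "ppid": int(ppid), "stat": stat, "comm": comm.strip()}
        )
    return procs


def find_zombies(procs):
    """Return (zombie, parent) pairs; parent is None if not in the listing."""
    by_pid = {p["pid"]: p for p in procs}
    zombies = []
    for p in procs:
        # Assumption: a state string starting with "Z" marks a zombie.
        if p["stat"].startswith("Z"):
            zombies.append((p, by_pid.get(p["ppid"])))
    return zombies


def list_zombies():
    """Run ps on the local node and report zombies with their parents."""
    out = subprocess.run(
        ["ps", "-A", "-o", "pid=", "-o", "ppid=", "-o", "stat=", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return find_zombies(parse_ps(out))
```

Calling list_zombies() on the affected node returns each zombie together with whatever process the listing shows as its parent, which makes it easier to see whether the stuck shepherd is the one that should be reaping it.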

Duncan

On 8 May 2006, at 12:37, Jean-Paul Minet wrote:

> Ron,
>
>> If the node the job runs on is not reachable by qmaster, then
>> you will encounter that. You can use "qdel -f" to force a
>> cleanup.
>
> The node is reachable by qmaster: another job is running on the
> same (dual-processor) node and its CPU usage is correctly
> reported/updated through qstat -ext.  To me there is another problem
> somewhere.  Isn't the fact that the qrsh wrapper/link and machinefile
> have been removed from the $TMP directory an indicator that something
> was done in response to the qdel command, but could not be carried
> through to completion?
>
> Jean-paul
>
>>  -Ron
>> --- Jean-Paul Minet <minet at cism.ucl.ac.be> wrote:
>>> Hi,
>>>
>>> Regularly, I see jobs deleted by users (qdel) remaining in the
>>> d state.  For example, I have in the qmaster message file:
>>>
>>> 05/05/2006 14:12:55|qmaster|lmsp|I|hermet has registered the job 11025 for deletion
>>>
>>> and three days later, qstat shows
>>>
>>> 11025 0.00581 run.para hermet  dr 05/05/2006 09:40:43 all.q@lmexec-82
>>>
>>> There is no user process left running on the mpich head/master
>>> node nor on the child/slave nodes.  On the head node, the rsh link
>>> and machine file generated by the startmpi.sh script have been
>>> removed from the /tmp/11025.1.all.q directory, but a
>>> qrsh_client_cache file remains there.
>>>
>>> Any clue as to where to look for additional info (what prevents
>>> SGE from completing the job deletion)?
>>>
>>> Thanks
>>>
>>> Jean-Paul
>
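
On the quoted question of where to look for additional info: one option is to pull every entry mentioning the job out of the messages files. The sketch below is a rough illustration only. It assumes the entries are pipe-delimited with five fields, as in the qmaster line quoted above; the field names (timestamp, component, host, level, message) and the function names are my own labels, not official SGE terms.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    # Field names are assumptions based on the quoted example line.
    timestamp: str
    component: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one 'timestamp|component|host|level|message' line."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the expected format
    return LogEntry(*parts)


def entries_for_job(lines, job_id):
    """Return parsed entries whose message mentions job_id as a whole number."""
    pattern = re.compile(r"(?<!\d)%s(?!\d)" % re.escape(str(job_id)))
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and pattern.search(entry.message):
            found.append(entry)
    return found
```

Feeding it the lines of a messages file (for example, open(path).readlines(), with path being wherever your installation keeps that file) and the job number 11025 would list what was logged about that job around the time of the qdel.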

Duncan Mortimer
duncan at fmrib.ox.ac.uk






