[GE users] sge_shepard not dying

Andy Schwierskott andy.schwierskott at sun.com
Wed Jun 20 09:42:32 BST 2007


Hi,

I found a readable Q&A on "How to kill a process in uninterruptible
sleep state":

   http://linuxgazette.net/issue83/tag/6.html

On Linux a process in an uninterruptible sleep is in "D" state.

Do a

   ps auxww | egrep "PID|sge_"

and check for the letter in the "STAT" column. See the ps(1) man page for a
complete explanation of process states.
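
To show only the processes stuck in uninterruptible sleep, the check above
could be scripted roughly like this. It is a minimal sketch, not part of
Grid Engine: the function name filter_d_state is made up for illustration,
and it assumes the "ps auxww" column layout, where STAT is the 8th field.

```shell
# Hypothetical helper: reads "ps auxww"-style output on stdin and prints
# the header line plus every process whose STAT field starts with "D"
# (uninterruptible sleep). Assumes STAT is the 8th column.
filter_d_state() {
    awk 'NR == 1 || $8 ~ /^D/'
}
```

On a compute node this would be used as "ps auxww | filter_d_state". If
shepherds show up in the output, they are blocked inside the kernel.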

Andy

> Hi,
>
> Is the execd spool directory located on NFS? The most common reason why a
> process can't be killed with SIGKILL is an NFS problem where the process is
> trying to do some I/O. Or, more technically speaking: the process is in an
> 'uninterruptible sleep'. As far as I know, a simple test case to reproduce
> this behavior is to mount an NFS file system with the "hard" option, do some
> I/O (e.g. a long-lasting "cat BIGFILE"), and then unplug the network cable
> or kill the NFS server: you won't be able to kill the "cat" process.
>
> Other reasons for not being able to kill a process could be kernel bugs where
> a process stays in such an uninterruptible sleep where it shouldn't.
>
> An "strace" on the shepherd processes might reveal what they currently do.
>
> I'm kind of surprised that the sge_shepherd processes have different owners
> - what's the background there?
>
> Andy
>
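
One way to answer the NFS question above is to list the NFS mounts on the
node and compare the mount points with the execd spool directory. This is
only a sketch: the function name nfs_mounts is hypothetical, and it assumes
the Linux /proc/mounts line format (device, mount point, type, options).
A mount without "soft" in its options is reported as "hard", which is the
NFS default.

```shell
# Hypothetical helper: reads /proc/mounts-format lines on stdin and prints
# "<mount point> <fs type> <hard|soft>" for every NFS mount.
nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        # Wrap the option list in commas so "soft" matches only as a whole option.
        mode = (("," $4 ",") ~ /,soft,/) ? "soft" : "hard"
        print $2, $3, mode
    }'
}
```

It would be run as "nfs_mounts < /proc/mounts". A spool directory below a
"hard" mount fits the uninterruptible-sleep explanation.
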
>> "kill -9" doesn't kill them.
>> 
>> On Jun 19, 2007, at 12:40 PM, Valentin Ruano wrote:
>> 
>>> Well, I reckon that you can always kill them individually using the 
>>> "kill" command.
>>> 
>>> First give them the chance to terminate themselves:
>>> $ kill -TERM <list of pids>
>>> 
>>> If they resist, then force them to die:
>>> $ kill -KILL <list of pids>
>>> 
>>> V.
>>> 
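
The TERM-then-KILL sequence above can be wrapped in a small function. This
is a sketch under assumptions: the name term_then_kill and the grace-period
argument are invented for illustration. As the rest of the thread shows,
even SIGKILL has no effect on a process in uninterruptible sleep.

```shell
# Hypothetical helper: term_then_kill <grace seconds> <pid>...
# Sends SIGTERM to all PIDs, waits, then sends SIGKILL to any still alive.
term_then_kill() {
    grace=$1
    shift
    kill -TERM "$@" 2>/dev/null || true
    sleep "$grace"
    for pid in "$@"; do
        # kill -0 only checks whether the process still exists.
        if kill -0 "$pid" 2>/dev/null; then
            kill -KILL "$pid" 2>/dev/null || true
        fi
    done
}
```

For example, "term_then_kill 5 21361 21362" gives both processes five
seconds to exit before forcing them.
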
>>> 
>>> On 6/19/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>>> I have one user who submits jobs, sometimes deletes them, and leaves
>>> the compute nodes full of sge_shepherd-nnn -bg jobs.
>>> 
>>> [root at compute-0-1 ~]# ps -ef | grep sge*
>>> sge       4207     1  0 May09 ?        03:07:50 /opt/gridengine/bin/lx26-amd64/sge_execd
>>> sge      19994  4207  0 May23 ?        00:00:03 sge_shepherd-176 -bg
>>> sge      20070  4207  0 May23 ?        00:00:03 sge_shepherd-181 -bg
>>> sge      21361  4207  0 May24 ?        00:00:01 sge_shepherd-184 -bg
>>> nanguyen 21362 21361  0 May24 ?        00:00:00 sge_shepherd-184 -bg
>>> sge      28576  4207  0 Jun06 ?        00:00:00 sge_shepherd-286 -bg
>>> nanguyen 28577 28576  0 Jun06 ?        00:00:00 sge_shepherd-286 -bg
>>> sge      28584  4207  0 Jun06 ?        00:00:00 sge_shepherd-288 -bg
>>> nanguyen 28585 28584  0 Jun06 ?        00:00:00 sge_shepherd-288 -bg
>>> sge      28652  4207  0 Jun06 ?        00:00:00 sge_shepherd-297 -bg
>>> nanguyen 28653 28652  0 Jun06 ?        00:00:00 sge_shepherd-297 -bg
>>> sge      31052  4207  0 Jun18 ?        00:00:00 sge_shepherd-470 -bg
>>> nanguyen 31053 31052  0 Jun18 ?        00:00:00 sge_shepherd-470 -bg
>>> root      3220  3085  0 12:03 pts/1    00:00:00 grep sge*
>>> 
>>> 
>>> Until these are cleared from the node, jobs won't run.  I know that I
>>> can clear the jobs by rebooting the compute node, but there must be a
>>> cleaner way of clearing the sge_shepherd jobs.
>>> 
>>> Any idea how the user is doing this?  Other users do not leave the
>>> sge_shepherd jobs around.
>>> 
>>> Thanks.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>