[GE users] sge_shepard not dying

Andy Schwierskott andy.schwierskott at sun.com
Wed Jun 20 09:26:50 BST 2007


Hi,

is the execd spool directory located on NFS? The most common reason why a
process can't be killed with SIGKILL is an NFS problem where the process is
blocked in I/O. Or more technically speaking: the process is in an
'uninterruptible sleep'. As far as I know, a simple test case to reproduce
this behavior is to mount an NFS file system with the "hard" option, do some
I/O (e.g. a long-lasting "cat BIGFILE"), and then unplug the network cable
or kill the NFS server: you won't be able to kill the "cat" process.

Other reasons for not being able to kill a process could be kernel bugs where
a process stays in such an uninterruptible sleep when it shouldn't.

An "strace" on the shepherd processes might reveal what they currently do.
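
To check whether the stuck shepherds really are in uninterruptible sleep,
something like the following might help. It is only a sketch: it assumes a
Linux node with the procps version of "ps", and the function names
filter_dstate and show_dstate are made up for illustration.

```shell
# Keep the ps header line plus every row whose STAT column starts with "D"
# (uninterruptible sleep). Reads ps-style output on stdin.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Hypothetical wrapper: list all D-state processes together with the kernel
# function they are waiting in (WCHAN), their owner and their command line.
# Assumes procps "ps" on Linux.
show_dstate() {
    ps -eo pid,stat,wchan:32,user,args | filter_dstate
}
```

Running show_dstate on the compute node should list any sge_shepherd that is
in state "D". An NFS-related name in the WCHAN column would point towards the
spool directory. For a single process, "strace -p <pid>" attaches to it.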

I'm kind of surprised that the sge_shepherd processes have different owners
- what's the background there?

Andy

> "kill -9" doesn't kill them.
>
> On Jun 19, 2007, at 12:40 PM, Valentin Ruano wrote:
>
>> Well, I reckon that you can always kill them individually using the kill
>> command.
>> 
>> First give them the chance to terminate themselves:
>> $ kill -TERM <list of pids>
>> 
>> If they resist then force them to die:
>> $ kill -KILL <list of pids>
>> 
>> V.
>> 
>> 
>> On 6/19/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>> I have one user who submits jobs, sometimes deletes them and leaves
>> the compute nodes full of sge_shepherd-nnn -bg jobs.
>> 
>> [root at compute-0-1 ~]# ps -ef | grep sge*
>> sge       4207     1  0 May09 ?        03:07:50 /opt/gridengine/bin/lx26-amd64/sge_execd
>> sge      19994  4207  0 May23 ?        00:00:03 sge_shepherd-176 -bg
>> sge      20070  4207  0 May23 ?        00:00:03 sge_shepherd-181 -bg
>> sge      21361  4207  0 May24 ?        00:00:01 sge_shepherd-184 -bg
>> nanguyen 21362 21361  0 May24 ?        00:00:00 sge_shepherd-184 -bg
>> sge      28576  4207  0 Jun06 ?        00:00:00 sge_shepherd-286 -bg
>> nanguyen 28577 28576  0 Jun06 ?        00:00:00 sge_shepherd-286 -bg
>> sge      28584  4207  0 Jun06 ?        00:00:00 sge_shepherd-288 -bg
>> nanguyen 28585 28584  0 Jun06 ?        00:00:00 sge_shepherd-288 -bg
>> sge      28652  4207  0 Jun06 ?        00:00:00 sge_shepherd-297 -bg
>> nanguyen 28653 28652  0 Jun06 ?        00:00:00 sge_shepherd-297 -bg
>> sge      31052  4207  0 Jun18 ?        00:00:00 sge_shepherd-470 -bg
>> nanguyen 31053 31052  0 Jun18 ?        00:00:00 sge_shepherd-470 -bg
>> root      3220  3085  0 12:03 pts/1    00:00:00 grep sge*
>> 
>> 
>> Until these are cleared from the node, jobs won't run.  I know that I
>> can clear the jobs by rebooting the compute node, but there must be a
>> cleaner way of clearing the sge_shepherd jobs.
>> 
>> Any idea how the user is doing this?  Other users do not leave the
>> sge_shepherd jobs around.
>> 
>> Thanks.
