[GE users] sge_shepherd not dying

Margaret Doll Margaret_Doll at brown.edu
Tue Jun 26 14:28:03 BST 2007


I only have /state/partition1 being exported to the compute nodes.
I haven't changed any of the defaults in the Rocks 4.2.1 installation.

I am only having problems with jobs from one user.  When they delete
their jobs, the sge_shepherd stays running.  As root I tried to remedy
the situation with "qdel -f jobid".  That creates a second sge_shepherd
owned by root.  I have only been able to remove the sge_shepherd
processes left by this user's jobs by rebooting the compute node.
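
For reference, a minimal check of whether a leftover shepherd can be killed
at all -- the PID here is just an example taken from the ps listing quoted
further down:

# see whether the shepherd is stuck in uninterruptible sleep
# ("D" in the STAT column), which even kill -9 cannot interrupt
ps -o pid,stat,wchan,cmd -p 28577

# ask it to exit cleanly first, then force it
kill -TERM 28577
sleep 5
kill -KILL 28577

If STAT shows "D", neither signal takes effect until the I/O the process is
blocked on completes.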

The user is using g++ as their compiler.  They are just submitting a
stand-alone job with

qsub ./shell-s

where shell-s contains:

#!/bin/bash

# job name
#$ -N C-256

# send the standard output to your current working directory
#$ -cwd

# define the name of your output file
#$ -o C-2e6.log
# merge error and stdout into a single file
#$ -j y

# Mail to user on a=abort, b=begin, e=end
#$ -m ae
## Can specify another e-mail address, if you'd like
#$ -M user at brown.edu

# hard runtime limit (h_rt).  Make sure it is long enough to run your
# program but short enough to stop an unexpectedly long job.  The format
# is hour:minute:second.
#$ -l h_rt=100:30:00

# put in a timestamp
echo Starting execution at `date`

# run your code; specify the absolute path for your program in the bash shell
./Cexlg
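
On the execution host the shepherd for a given job shows up as
sge_shepherd-<jobid>, so a job id can be matched to its leftover shepherd
with something like the following (job id 470 is just an example taken from
the ps listing quoted at the bottom of this message):

ps -ef | grep sge_shepherd-470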



I thought that the user in my case was submitting jobs with memory
requirements way beyond the means of the system.  The queued jobs were
shown as running by qstat and qmon, but top showed them taking up no CPU
time.  "qstat -j jobid" was showing the following usage:

usage    1:                 cpu=4:00:53:55, mem=4999.40220 GBs,
io=0.00000, vmem=68.496M, maxvmem=68.496M

The user could not ssh into compute nodes where they had submitted jobs,
but could ssh into compute nodes where no jobs were submitted.  Other
users had no problems ssh'ing into any of the nodes.

No one could complete a "df" command on the nodes running the queued jobs.
I can "ls /share/apps", but "ls /home" gets stuck on the nodes with
queued jobs.
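
A minimal way to poke at this without losing the login shell is to run the
command in the background and look at its process state; a "D" in the STAT
column means it is blocked in uninterruptible sleep, typically waiting on
NFS:

# run df in the background so a hung mount does not hang this shell
df -h /home &
DF_PID=$!
sleep 5
# check whether the background df is stuck in "D" state
ps -o pid,stat,wchan,cmd -p $DF_PID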


I rebooted the nodes they were having difficulties with.  The user was the
only one with jobs submitted on those nodes.  After the reboot the user can
ssh into any of the nodes.


I don't understand the information that I get from "qstat -j".

I have a job which is running.  top shows it taking up 100% of a CPU.
"qstat -j" shows

usage    1:                 cpu=16:14:10, mem=233.43112 GBs,
io=0.00000, vmem=57.906M, maxvmem=57.906M

What does the "mem" item mean?  This job seems to be running fine, but I
only have 8 GB of memory on this compute node.  strace shows the job is
working.

/share/apps/strace/strace -p 6515
Process 6515 attached - interrupt to quit
read(0, "     -0.36501\n      -10.07321   "..., 32768) = 32768
read(0, "6           0.59918\n        6.45"..., 32768) = 32768
read(0, "         -22.28450\n        0.486"..., 32768) = 32768
read(0, ".22488           1.03087\n       "..., 32768) = 32768
read(0, "     0.35352          -0.15990\n "..., 32768) = 32768
read(0, "           0.73432           0.1"..., 32768) = 32768
read(0, ".43226           0.32386        "..., 32768) = 32768
read(0, "     0.54479          -1.11933  "..., 32768) = 32768
read(0, "31\n       -0.38627          -0.7"..., 32768
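
A back-of-the-envelope check, assuming -- as the SGE accounting(5) man page
describes it -- that "mem" is not an instantaneous value but the integral
memory usage in Gbyte CPU-seconds, so dividing it by the consumed CPU time
gives the average memory use:

# cpu=16:14:10 converted to seconds
CPU_SECONDS=$((16*3600 + 14*60 + 10))                 # 58450
# average memory in MB, if mem really is a GB*s integral
echo "scale=4; 233.43112 * 1024 / $CPU_SECONDS" | bc  # ~4.09

That would be a few megabytes on average, consistent with the ~58M vmem and
nowhere near the 8 GB on the node.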



On Jun 20, 2007, at 4:26 AM, Andy Schwierskott wrote:

> Hi,
>
> Is the execd spool directory located on NFS? The most common reason why a
> process can't be killed with SIGKILL is an NFS problem where a process is
> trying to do some I/O. Or, more technically speaking: the process is in an
> 'uninterruptible sleep'. As far as I know, a simple test case to reproduce
> this behavior is to mount an NFS file system with the "hard" option, do
> some I/O, e.g. a long-lasting "cat BIGFILE", and then unplug the network
> cable or kill the NFS server: you won't be able to kill the "cat" process.
>
> Other reasons for not being able to kill a process could be kernel bugs
> where a process stays in such an uninterruptible sleep where it shouldn't.
>
> An "strace" on the shepherd processes might reveal what they  
> currently do.
>
> I'm kind of surprised that the sge_shepherd processes have different
> owners - what's the background there?
>
> Andy
>
>> "kill -9" doesn't kill them.
>>
>> On Jun 19, 2007, at 12:40 PM, Valentin Ruano wrote:
>>
>>> Well, I reckon that you can always kill them individually using  
>>> the command KILL.
>>> First give them the chance to terminate themselves:
>>> $ kill -TERM <list of pids>
>>> If they resist, then force them to die:
>>> $ kill -KILL <list of pids>
>>> V.
>>> On 6/19/07, Margaret Doll <Margaret_Doll at brown.edu> wrote:
>>> I have one user who submits jobs, sometimes deletes them, and leaves
>>> the compute nodes full of sge_shepherd-nnn -bg processes.
>>> [root at compute-0-1 ~]# ps -ef | grep sge*
>>> sge       4207     1  0 May09 ?        03:07:50 /opt/gridengine/bin/
>>> lx26-amd64/sge_execd
>>> sge      19994  4207  0 May23 ?        00:00:03 sge_shepherd-176 -bg
>>> sge      20070  4207  0 May23 ?        00:00:03 sge_shepherd-181 -bg
>>> sge      21361  4207  0 May24 ?        00:00:01 sge_shepherd-184 -bg
>>> nanguyen 21362 21361  0 May24 ?        00:00:00 sge_shepherd-184 -bg
>>> sge      28576  4207  0 Jun06 ?        00:00:00 sge_shepherd-286 -bg
>>> nanguyen 28577 28576  0 Jun06 ?        00:00:00 sge_shepherd-286 -bg
>>> sge      28584  4207  0 Jun06 ?        00:00:00 sge_shepherd-288 -bg
>>> nanguyen 28585 28584  0 Jun06 ?        00:00:00 sge_shepherd-288 -bg
>>> sge      28652  4207  0 Jun06 ?        00:00:00 sge_shepherd-297 -bg
>>> nanguyen 28653 28652  0 Jun06 ?        00:00:00 sge_shepherd-297 -bg
>>> sge      31052  4207  0 Jun18 ?        00:00:00 sge_shepherd-470 -bg
>>> nanguyen 31053 31052  0 Jun18 ?        00:00:00 sge_shepherd-470 -bg
>>> root      3220  3085  0 12:03 pts/1    00:00:00 grep sge*
>>> Until these are cleared from the node, jobs won't run.  I know that I
>>> can clear them by rebooting the compute node, but there must be a
>>> cleaner way of clearing the sge_shepherd processes.
>>> Any idea how the user is doing this?  Other users do not leave
>>> sge_shepherd processes around.
>>> Thanks.
>
