[GE users] 6.2u5: "failed to deliver signal 20 to job"

ccaamad m.c.dixon at leeds.ac.uk
Tue Feb 9 09:09:28 GMT 2010


On Mon, 8 Feb 2010, reuti wrote:
...
>> 02/08/2010 16:36:35|  main|c1s0b8n0|W|job 6018.1 exceeded hard
>> wallclock time - initiate terminate method
>> 02/08/2010 16:36:35|  main|c1s0b8n0|W|failed to deliver signal 20
>> to job 6018.1 task 1.c1s0b8n0 for KILL (shepherd with pid 420): No
>> such file or directory
>
> you redefined the warning signals for a kill to be sigtstp?
>
> qstat is still listing the job? Sometimes there are some files left
> in the subdirectory of the spool directory of the node reading .../
> jobs/00/0000/6018.1 which must be removed by hand to get rid of these
> messages.

Hi Reuti, many thanks for responding.

I've not done anything with signals; qstat is no longer listing the jobs, 
and there are no files in the (local) spool directory of the node :(

$ find /var/spool/sge/c1s0b8n0 | xargs ls -ld
drwxr-xr-x 5 sge sge     4096 Jan 25 11:39 /var/spool/sge/c1s0b8n0
drwxr-xr-x 2 sge sge     4096 Feb  8 16:53 /var/spool/sge/c1s0b8n0/active_jobs
-rw-r--r-- 1 sge sge        4 Feb  8 14:46 /var/spool/sge/c1s0b8n0/execd.pid
drwxr-xr-x 2 sge sge     4096 Feb  8 16:53 /var/spool/sge/c1s0b8n0/jobs
drwxr-xr-x 2 sge sge     4096 Feb  5 15:29 /var/spool/sge/c1s0b8n0/job_scripts
-rw-r--r-- 1 sge sge 30865885 Feb  9 08:53 /var/spool/sge/c1s0b8n0/messages

The main weird thing about the cluster is that SGE communicates with the 
qmaster using IP over the InfiniBand network.

/var/spool/sge/c1s0b8n0/messages is still filling up with messages. A 
snippet from this morning is pasted at the end of this message. The 
shepherd pids mentioned don't exist. The only SGE process running is:

$ ps -ef | grep sge
sge        319     1  0 Feb08 ?        00:00:01 /services/sge/bin/lx24-amd64/sge_execd
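In case it's useful to anyone hitting the same thing, here's roughly how I 
confirmed the shepherd pids are gone. This is just a sketch: the 
check_shepherds function name is mine, and it assumes the messages file 
uses the exact "shepherd with pid N" wording shown in the snippet below.

```shell
# Extract every shepherd pid mentioned in "failed to deliver signal"
# warnings and report whether each process still exists.
check_shepherds() {
    grep -o 'shepherd with pid [0-9]*' "$1" | awk '{print $4}' | sort -u |
    while read -r pid; do
        # kill -0 sends no signal; it only tests whether the pid exists
        if kill -0 "$pid" 2>/dev/null; then
            echo "pid $pid still running"
        else
            echo "pid $pid gone"
        fi
    done
}

MSG=/var/spool/sge/c1s0b8n0/messages
if [ -r "$MSG" ]; then
    check_shepherds "$MSG"
fi
```

On this node every pid comes back "gone", which matches the ps output above.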

At least some of these jobs have made it to the accounting file:

$ qacct -j 6032
==============================================================
qname        test.q
hostname     c1s0b11n1.arc1.leeds.ac.uk
group        users
owner        issmcd
project      ISS
department   defaultdepartment
jobname      fluent
jobnumber    6032
taskid       undefined
account      sge
priority     0
qsub_time    Mon Feb  8 16:01:14 2010
start_time   Mon Feb  8 16:01:27 2010
end_time     Mon Feb  8 16:03:26 2010
granted_pe   ib-nb-24-c1s0
slots        16
failed       0
exit_status  0
ru_wallclock 119
ru_utime     0.640
ru_stime     0.419
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    110602
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     20608
ru_nivcsw    818
cpu          2023.000
mem          2023.000
io           0.000
iow          0.000
maxvmem      17.000G
arid         undefined

...
>> I seem to have triggered this problem quite a bit: at one point the 
>> execd refused to start a new job because it had run out of group ids to 
>> use - until I restarted the daemon.
>
> Which range did you define for the additional group ids?
...

gid_range is set to "20000-20100".
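If I understand the mechanism right, the execd reserves one additional 
group id from that range per concurrently running job, so the range caps 
the node at 101 jobs; if stuck jobs never release theirs, the pool would 
exhaust after roughly that many, which would explain the refusal to start 
new jobs:

```shell
# Inclusive size of gid_range "20000-20100": one additional GID
# per concurrently running job on the node.
echo $((20100 - 20000 + 1))   # prints 101
```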

I really ought to spend some time reading the source code when the dust 
settles after getting this cluster into production. At the moment, my only 
quick fix is to restart sgeexecd on the compute node.
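For the record, the workaround amounts to the following. I've written it 
with a dry-run guard so the commands can be reviewed before running as 
root; the sgeexecd wrapper path is the usual $SGE_ROOT/default/common 
location, which is an assumption - adjust for your install or service 
manager.

```shell
# Restart the execd on the affected compute node. With DRY_RUN set,
# the commands are printed instead of executed.
restart_execd() {
    run() { if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi; }
    run "$SGE_ROOT/default/common/sgeexecd" stop
    # Any stale entries under the node's active_jobs/ would be
    # removed by hand at this point (none were present here).
    run "$SGE_ROOT/default/common/sgeexecd" start
}

# Dry run against this cluster's install prefix (from the ps output above)
DRY_RUN=1 SGE_ROOT=/services/sge restart_execd
```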

Thanks,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------

02/09/2010 08:58:55|  main|c1s0b8n0|W|job 6045.1 exceeded hard wallclock time - initiate terminate method
02/09/2010 08:58:55|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6045.1 task 1.c1s0b8n0 for KILL (shepherd with pid 1382): No such file or directory
02/09/2010 08:59:15|  main|c1s0b8n0|W|job 6034.1 exceeded hard wallclock time - initiate terminate method
02/09/2010 08:59:15|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6034.1 task 1.c1s0b8n0 for KILL (shepherd with pid 773): No such file or directory
02/09/2010 08:59:17|  main|c1s0b8n0|W|job 6040.1 exceeded hard wallclock time - initiate terminate method
02/09/2010 08:59:17|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6040.1 task 1.c1s0b8n0 for KILL (shepherd with pid 1124): No such file or directory
02/09/2010 08:59:19|  main|c1s0b8n0|W|job 6029.1 exceeded hard wallclock time - initiate terminate method
02/09/2010 08:59:19|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6029.1 task 1.c1s0b8n0 for KILL (shepherd with pid 502): No such file or directory
02/09/2010 08:59:24|  main|c1s0b8n0|W|job 6018.1 exceeded hard wallclock time - initiate terminate method
02/09/2010 08:59:24|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6018.1 task 1.c1s0b8n0 for KILL (shepherd with pid 420): No such file or directory
02/09/2010 08:59:27|  main|c1s0b8n0|W|job 6044.1 exceeded hard wallclock time - initiate terminate method
02/09/2010 08:59:27|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6044.1 task 1.c1s0b8n0 for KILL (shepherd with pid 1368): No such file or directory
02/09/2010 08:59:31|  main|c1s0b8n0|W|job 6042.1 exceeded hard wallclock time - initiate terminate method
02/09/2010 08:59:31|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6042.1 task 1.c1s0b8n0 for KILL (shepherd with pid 1243): No such file or directory
02/09/2010 08:59:33|  main|c1s0b8n0|W|job 6038.1 exceeded hard wallclock time - initiate terminate method
02/09/2010 08:59:33|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6038.1 task 1.c1s0b8n0 for KILL (shepherd with pid 1002): No such file or directory
02/09/2010 08:59:38|  main|c1s0b8n0|W|job 6032.1 exceeded hard wallclock time - initiate terminate method
02/09/2010 08:59:38|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6032.1 task 1.c1s0b8n0 for KILL (shepherd with pid 531): No such file or directory
