[GE users] 6.2u5: "failed to deliver signal 20 to job"

reuti reuti at staff.uni-marburg.de
Wed Feb 10 14:51:48 GMT 2010


Hi,

On 09.02.2010 at 10:09, ccaamad wrote:

> On Mon, 8 Feb 2010, reuti wrote:
> ...
>>> 02/08/2010 16:36:35|  main|c1s0b8n0|W|job 6018.1 exceeded hard
>>> wallclock time - initiate terminate method
>>> 02/08/2010 16:36:35|  main|c1s0b8n0|W|failed to deliver signal 20
>>> to job 6018.1 task 1.c1s0b8n0 for KILL (shepherd with pid 420): No
>>> such file or
>>
>> Did you redefine the warning signals for a kill to be SIGTSTP?
>>
>> Is qstat still listing the job? Sometimes files are left behind in a
>> subdirectory of the node's spool directory, with a path like
>> .../jobs/00/0000/6018.1, and these must be removed by hand to get rid
>> of these messages.
>
> Hi Reuti, many thanks for responding.
>
> I've not done anything with signals, qstat is no longer listing the jobs
> and there are no files in the (local) spool directory of the node :(
>
> $ find /var/spool/sge/c1s0b8n0 | xargs ls -ld
> drwxr-xr-x 5 sge sge     4096 Jan 25 11:39 /var/spool/sge/c1s0b8n0
> drwxr-xr-x 2 sge sge     4096 Feb  8 16:53 /var/spool/sge/c1s0b8n0/active_jobs
> -rw-r--r-- 1 sge sge        4 Feb  8 14:46 /var/spool/sge/c1s0b8n0/execd.pid
> drwxr-xr-x 2 sge sge     4096 Feb  8 16:53 /var/spool/sge/c1s0b8n0/jobs
> drwxr-xr-x 2 sge sge     4096 Feb  5 15:29 /var/spool/sge/c1s0b8n0/job_scripts
> -rw-r--r-- 1 sge sge 30865885 Feb  9 08:53 /var/spool/sge/c1s0b8n0/messages
>
> The main weird thing about the cluster is that SGE communicates with the
> qmaster using IP over the InfiniBand network.

Are you using only IB, or is there also an Ethernet connection? You
could try to route SGE over a secondary interface, as it doesn't
generate much traffic.


> <snip>
> gid_range is set to "20000-20100".

this should be fine.
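
As a rough illustration of why this range looks sufficient: as far as I know, each job running on an execution host takes one additional group ID from gid_range, so the size of the range bounds the number of jobs that can run at the same time on one host. A minimal sketch, assuming a single "low-high" expression (the helper name gid_range_size is made up for this example):

```shell
# Count how many group IDs a simple "low-high" gid_range provides.
# Assumption: one range expression only, no comma-separated list.
gid_range_size() {
    range=$1
    low=${range%-*}     # text before the dash
    high=${range#*-}    # text after the dash
    echo $(( high - low + 1 ))
}

gid_range_size "20000-20100"   # prints 101
```

So the configured range leaves room for 101 concurrent jobs per node, which should be well above what a single compute node runs.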


> I really ought to spend some time reading the source code when the dust
> settles after getting this cluster into production. At the moment, my only
> quick fix is to restart sgeexecd on the compute node.

Yeah, there are not that many places where SIGTSTP is sent.
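
For reference, the number-to-name mapping can be checked on the node itself. This is only a sketch: signal_name is a hypothetical helper, it assumes a shell whose kill builtin accepts -l with a number (bash and dash do), and it assumes the "20" in the execd message is the operating system's signal number, which the log alone does not confirm.

```shell
# Print the name of a signal number as the local system defines it.
# Signal numbering is platform dependent, so run this on the node itself.
signal_name() {
    kill -l "$1"
}

signal_name 20   # typically TSTP on x86 Linux; other platforms may differ
signal_name 9    # KILL
```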

-- Reuti


> Thanks,
>
> Mark
> -- 
> -----------------------------------------------------------------
> Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
> HPC/Grid Systems Support         Tel (int): 35429
> Information Systems Services     Tel (ext): +44(0)113 343 5429
> University of Leeds, LS2 9JT, UK
> -----------------------------------------------------------------
>
> 02/09/2010 08:58:55|  main|c1s0b8n0|W|job 6045.1 exceeded hard wallclock time - initiate terminate method
> 02/09/2010 08:58:55|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6045.1 task 1.c1s0b8n0 for KILL (shepherd with pid 1382): No such file or directory
> 02/09/2010 08:59:15|  main|c1s0b8n0|W|job 6034.1 exceeded hard wallclock time - initiate terminate method
> 02/09/2010 08:59:15|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6034.1 task 1.c1s0b8n0 for KILL (shepherd with pid 773): No such file or directory
> 02/09/2010 08:59:17|  main|c1s0b8n0|W|job 6040.1 exceeded hard wallclock time - initiate terminate method
> 02/09/2010 08:59:17|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6040.1 task 1.c1s0b8n0 for KILL (shepherd with pid 1124): No such file or directory
> 02/09/2010 08:59:19|  main|c1s0b8n0|W|job 6029.1 exceeded hard wallclock time - initiate terminate method
> 02/09/2010 08:59:19|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6029.1 task 1.c1s0b8n0 for KILL (shepherd with pid 502): No such file or directory
> 02/09/2010 08:59:24|  main|c1s0b8n0|W|job 6018.1 exceeded hard wallclock time - initiate terminate method
> 02/09/2010 08:59:24|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6018.1 task 1.c1s0b8n0 for KILL (shepherd with pid 420): No such file or directory
> 02/09/2010 08:59:27|  main|c1s0b8n0|W|job 6044.1 exceeded hard wallclock time - initiate terminate method
> 02/09/2010 08:59:27|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6044.1 task 1.c1s0b8n0 for KILL (shepherd with pid 1368): No such file or directory
> 02/09/2010 08:59:31|  main|c1s0b8n0|W|job 6042.1 exceeded hard wallclock time - initiate terminate method
> 02/09/2010 08:59:31|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6042.1 task 1.c1s0b8n0 for KILL (shepherd with pid 1243): No such file or directory
> 02/09/2010 08:59:33|  main|c1s0b8n0|W|job 6038.1 exceeded hard wallclock time - initiate terminate method
> 02/09/2010 08:59:33|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6038.1 task 1.c1s0b8n0 for KILL (shepherd with pid 1002): No such file or directory
> 02/09/2010 08:59:38|  main|c1s0b8n0|W|job 6032.1 exceeded hard wallclock time - initiate terminate method
> 02/09/2010 08:59:38|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6032.1 task 1.c1s0b8n0 for KILL (shepherd with pid 531): No such file or directory
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=244052
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=244256



