[GE users] 6.2u5: "failed to deliver signal 20 to job"

ccaamad m.c.dixon at leeds.ac.uk
Mon Feb 8 16:42:00 GMT 2010


I was wondering whether anyone else has seen this. I've been doing some 
testing, trying to get a new parallel application running on my 6.2u5 
cluster. It seems that my execds have become confused and endlessly keep 
printing out messages like:

02/08/2010 16:36:35|  main|c1s0b8n0|W|job 6018.1 exceeded hard wallclock time - initiate terminate method
02/08/2010 16:36:35|  main|c1s0b8n0|W|failed to deliver signal 20 to job 6018.1 task 1.c1s0b8n0 for KILL (shepherd with pid 420): No such file or directory

The job and shepherd have already finished, but the execd seems to have 
trouble forgetting about them - it keeps printing the message every couple 
of minutes.
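
For anyone wanting to check how often their own execds repeat the warning, here is a minimal sketch that counts the "failed to deliver signal" lines per host and job. It assumes the pipe-separated layout of the two lines quoted above (timestamp, thread, host, level, message); the function name and the regular expression are my own and are only modelled on those two lines, so other versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern modelled on the warning quoted above; this wording is an
# assumption and may differ in other Grid Engine versions.
FAILED_SIGNAL = re.compile(
    r"failed to deliver signal (?P<signal>\d+) to job (?P<job>\d+\.\d+)"
)


def count_failed_signals(lines):
    """Return a Counter mapping (host, job) to the number of warnings seen."""
    counts = Counter()
    for line in lines:
        # Expected layout: timestamp|thread|host|level|message
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # skip lines that are not in the expected layout
        _timestamp, _thread, host, level, message = fields
        if level != "W":
            continue  # only warnings are of interest here
        match = FAILED_SIGNAL.search(message)
        if match:
            counts[(host.strip(), match.group("job"))] += 1
    return counts
```

Passing an open execd messages file to count_failed_signals() and printing the result of most_common() should list the jobs the execd is still trying to signal, with the most frequent first.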

I seem to have triggered this problem quite often: at one point the execd 
refused to start a new job because it had run out of group IDs to use, and 
it only recovered once I restarted the daemon.
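
To get a feel for how many stuck jobs it takes to exhaust the pool, here is a rough sketch that counts the IDs in a range specification. It assumes the group IDs in question come from the gid_range setting in the cluster configuration (as shown by "qconf -sconf"), written as a comma-separated list of n-m ranges or single numbers; the function name and the example values are hypothetical.

```python
def count_gids(gid_range):
    """Count the group IDs in a spec such as "20000-20100" or "20000-20009,20050"."""
    total = 0
    for part in gid_range.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        low, _, high = part.partition("-")
        first = int(low)
        # A single number counts as a range of one ID.
        last = int(high) if high else first
        if last < first:
            raise ValueError(f"descending range: {part!r}")
        total += last - first + 1
    return total
```

If each job that the execd fails to forget keeps holding one ID from that range, the count gives an upper bound on how many such jobs a host can accumulate before it refuses new work.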

Any ideas?

Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK

