[GE users] qmaster won't start, can't find culprit job

jlb jlb at salilab.org
Mon Jul 12 00:37:53 BST 2010


I'm running SGE 6.1u3 on a fully patched CentOS 5 x86_64 system.  Late on 
Friday night, the sge_schedd process died.  This isn't exactly uncommon 
for us (and, yes, an upgrade to 6.2 is in the works).  The usual fix is to 
simply 'service sgemaster stop; service sgemaster start' (and then, 
sometimes, find the array job that's burning through hundreds of tasks per 
second and kill it).

This time, when I tried to restart sgemaster, I got the following error in 
the qmaster messages file:

07/11/2010 10:44:17|qmaster|$HOST|I|read job database with 315 entries in 6 seconds
07/11/2010 10:44:17|qmaster|$HOST|C|!!!!!!!!!! JB_ja_tasks not found in element !!!!!!!!!!
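
A minimal sketch for pulling such entries out of a messages file, assuming each line has the five pipe-separated fields seen above (timestamp, component, host, level, text) and that level "C" marks a critical entry. The names MessageEntry, parse_line and critical_entries are hypothetical helpers, not part of SGE, and this only isolates the failing message; it does not by itself identify the offending job.

```python
from typing import Iterable, List, NamedTuple, Optional


class MessageEntry(NamedTuple):
    """One line of the messages file, split into its five fields."""
    timestamp: str
    component: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[MessageEntry]:
    """Split on the first four '|' so the message text may itself contain '|'."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed five-field format
    return MessageEntry(*parts)


def critical_entries(lines: Iterable[str]) -> List[MessageEntry]:
    """Return only the entries whose level field is 'C' (assumed: critical)."""
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level == "C":
            found.append(entry)
    return found


# The two lines quoted above, used as sample input.
sample = [
    "07/11/2010 10:44:17|qmaster|$HOST|I|read job database with 315 entries in 6 seconds",
    "07/11/2010 10:44:17|qmaster|$HOST|C|!!!!!!!!!! JB_ja_tasks not found in element !!!!!!!!!!",
]

for entry in critical_entries(sample):
    print(entry.timestamp, entry.text)
```

To run it against a real file, pass an open file object to critical_entries() in place of the sample list.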

At that point, sge_qmaster dies.  sge_schedd still starts, but it is 
useless without the qmaster.  I can't figure out which job is causing the 
problem.  I've tried all sorts of things to get SGE to be more verbose, 
but to no avail.  How can I track down the culprit job?  I'd really rather 
not scrap the whole queue if it can be avoided.

Thanks.

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=267408