[GE users] qmaster won't start, can't find culprit job

reuti reuti at staff.uni-marburg.de
Mon Jul 12 20:38:03 BST 2010


Hi,

On 12.07.2010 at 19:27, jlb wrote:

> On Sun, 11 Jul 2010 at 4:37pm, jlb wrote
>
>> 07/11/2010 10:44:17|qmaster|$HOST|I|read job database with 315 entries in 6 seconds
>> 07/11/2010 10:44:17|qmaster|$HOST|C|!!!!!!!!!! JB_ja_tasks not found in element !!!!!!!!!!
>>
>> At that point, sge_qmaster dies. sge_schedd still fires up, but,
>> obviously, is useless. For the life of me, I can't figure out which
>> job is causing the problem. I've tried all sorts of things to get SGE
>> to be more verbose, but to no avail. How can I track down which job
>> is causing the problem? I'd really rather not scrap the whole queue
>> if it can be avoided.
>
> I've found a significant number (80) of job IDs in
> $SGE_ROOT/$SGE_CELL/spool/qmaster/jobs which don't have corresponding
> job scripts in $SGE_ROOT/$SGE_CELL/spool/qmaster/job_scripts. Is that
> a normal situation? I'm considering removing those files/directories
> and restarting qmaster in the hope that I'll have gotten rid of the
> offending job(s). Bad idea?

I don't recall exactly. In 6.2, even the job scripts are stored in the
BDB. Do you have a backup of a working state of the installation?
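For classic (non-BDB) spooling, the comparison described in the quoted message can be sketched in Python. This is a minimal sketch under stated assumptions: it assumes each entry directly inside `jobs` and `job_scripts` is named after a numeric job ID (your spool layout may be nested differently), and the helper names `numeric_names` and `orphaned_jobs` are hypothetical. With BDB spooling, as noted above for 6.2, a missing script file says nothing by itself.

```python
import os
import sys


def numeric_names(names):
    """Return the entry names that are plain numeric job IDs, as integers.

    ASSUMPTION: spool entries are named after their job ID, e.g. "12345".
    """
    return {int(n) for n in names if n.isdigit()}


def orphaned_jobs(job_entries, script_entries):
    """Job IDs present in jobs/ that have no matching entry in job_scripts/."""
    return sorted(numeric_names(job_entries) - numeric_names(script_entries))


def main():
    # Hypothetical paths built from the environment; assumes classic spooling
    # with one entry per job ID directly inside each directory.
    root = os.environ.get("SGE_ROOT")
    cell = os.environ.get("SGE_CELL", "default")
    if root is None:
        print("SGE_ROOT is not set", file=sys.stderr)
        return 1
    spool = os.path.join(root, cell, "spool", "qmaster")
    jobs_dir = os.path.join(spool, "jobs")
    scripts_dir = os.path.join(spool, "job_scripts")
    if not (os.path.isdir(jobs_dir) and os.path.isdir(scripts_dir)):
        print("spool directories not found under " + spool, file=sys.stderr)
        return 1
    # Only lists candidates; it never deletes anything.
    for job_id in orphaned_jobs(os.listdir(jobs_dir), os.listdir(scripts_dir)):
        print(job_id)
    return 0


if __name__ == "__main__":
    main()
```

The script only reports candidates. Whether removing them is safe is exactly the open question here, so a copy of the spool directory is worth taking before touching anything.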

-- Reuti


> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=267565
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=267601