[GE users] qmaster won't start, can't find culprit job

jlb jlb at salilab.org
Mon Jul 12 18:27:33 BST 2010

On Sun, 11 Jul 2010 at 4:37pm, jlb wrote:

> 07/11/2010 10:44:17|qmaster|$HOST|I|read job database with 315 entries in 6 seconds
> 07/11/2010 10:44:17|qmaster|$HOST|C|!!!!!!!!!! JB_ja_tasks not found in element !!!!!!!!!!
>
> At that point, sge_qmaster dies.  sge_schedd still fires up, but,
> obviously, is useless.  For the life of me, I can't figure out which job
> is causing the problem.  I've tried all sorts of things to get SGE to be
> more verbose, but to no avail.  How can I track down which job is causing
> the problem?  I'd really rather not scrap the whole queue if it can be
> avoided.

I've found a significant number (80) of job IDs in
$SGE_ROOT/$SGE_CELL/spool/qmaster/jobs that have no corresponding job
scripts in $SGE_ROOT/$SGE_CELL/spool/qmaster/job_scripts.  Is that
normal?  I'm considering removing those files/directories and
restarting qmaster in the hope that doing so gets rid of the offending
job(s).  Is that a bad idea?
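
For anyone wanting to reproduce that comparison, here is a minimal Python
sketch.  It rests on assumptions not confirmed above: classic (flat-file)
spooling, a jobs/ layout in which the zero-padded job ID is split across
three directory levels (e.g. jobs/00/0001/2345 for job 12345), and job
scripts named by plain job ID.  The function names are made up for
illustration.  It only reports IDs and does not remove anything.

```python
import os
from pathlib import Path


def spooled_job_ids(jobs_dir):
    """Collect job IDs found under jobs_dir.

    ASSUMPTION: the zero-padded job ID is split across three path levels,
    e.g. jobs/00/0001/2345 for job 12345.  The third level may be a file
    (single job) or a directory (array job); both are counted.
    """
    jobs_dir = Path(jobs_dir)
    ids = set()
    if not jobs_dir.is_dir():
        return ids
    for entry in jobs_dir.glob("*/*/*"):
        parts = entry.relative_to(jobs_dir).parts
        # Skip anything that does not look like a split numeric job ID.
        if all(p.isdecimal() for p in parts):
            ids.add(int("".join(parts)))
    return ids


def script_job_ids(scripts_dir):
    """Collect job IDs that have a file in scripts_dir.

    ASSUMPTION: each job script is stored under its numeric job ID.
    """
    scripts_dir = Path(scripts_dir)
    if not scripts_dir.is_dir():
        return set()
    return {int(p.name) for p in scripts_dir.iterdir() if p.name.isdecimal()}


def jobs_without_scripts(jobs_dir, scripts_dir):
    """Return the sorted job IDs present in jobs_dir but not in scripts_dir."""
    return sorted(spooled_job_ids(jobs_dir) - script_job_ids(scripts_dir))


if __name__ == "__main__":
    root = os.environ.get("SGE_ROOT")
    cell = os.environ.get("SGE_CELL", "default")
    if root:
        spool = Path(root) / cell / "spool" / "qmaster"
        for job_id in jobs_without_scripts(spool / "jobs", spool / "job_scripts"):
            print(job_id)
```

Run with SGE_ROOT (and optionally SGE_CELL) set in the environment, it
prints one job ID per line.  If the spool layout on a given installation
differs from the assumptions above, the path handling in
spooled_job_ids() would need adjusting.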

Joshua Baker-LePain
QB3 Shared Cluster Sysadmin

