[GE users] Problem with 6.1 after vacation

Brett_W_Grant at raytheon.com Brett_W_Grant at raytheon.com
Mon Aug 6 19:44:25 BST 2007


Of course, upon returning from my vacation, I am told that our cluster isn't 
working properly.  The thing is, it appears to be working properly, but I am 
not 100% sure that it is.  I am new to the administration side of it, so I 
need a little bit of help.

I've got a mix of Apple G5 and DuoCore machines, 268 processors total.  The 
qmaster runs on a G5 machine configured as a head node.  All of the 
machines are running SGE 6.1.

99% of our submitted jobs are array jobs with between 100 and 20000 tasks. 
 The system was working fine for 2 weeks before I left on vacation.

The first complaint is that the system is not accepting jobs.  I don't see 
that behavior, so I will have to talk to the submitter, who isn't currently 
here, so let us set this one aside.

The second complaint is that the array jobs are not being taken in order. 
For example, the output of qstat might show (I can't copy and paste from the 
original system, so I hand copied it):

        4018 0.45734 P4s4s vaa7748 qw 07/09/2007 14:33:27       1 
304,305,309,312-6000:1

That seems a little odd, but the job eventually gets started.  It is just 
that I have never seen this type of output in running millions of jobs 
unless there is an error with one of the queue instances.

qstat -f shows no queues in any kind of error state, but not every job slot 
is taken, and it can sit that way for a while.
qstat -g c shows no errors.
qhost shows that all of the hosts are connected and running.

I went to $SGE_ROOT/default/spool/qmaster and looked at the messages file, 
which was huge (~0.5 GB).
Every 40 seconds there is a series of messages that say something like:

        08/06/2007 10:50:22|qmaster|rebel2|W|unable to find job "4838" 
from the ticket order

and there are a whole bunch of those, but the job ID changes.  For each 
time period, it seems to be the same series of job IDs, and then there is 
a message:

        08/06/2007 10:50:22|qmaster|rebel2|E|execd at drone040.local reports 
running job (4822.1634/master) in queue 
"priority at drone040.echobase.cluster" that was not supposed to be there - 
killing
        08/06/2007 10:50:22|qmaster|rebel2|E|execd at drone026.local reports 
running job (4822.2547/master) in queue 
"priority at drone026.echobase.cluster" that was not supposed to be there - 
killing
        08/06/2007 10:50:42|qmaster|rebel2|W|scheduler send a order for a 
changed user/project "vaa2088" (version: old 40724) new 40725
        08/06/2007 10:50:42|qmaster|rebel2|W|scheduler send a order for a 
changed user/project "vaa7748" (version: old 119928) new 119929
        08/06/2007 10:50:42|qmaster|rebel2|W|unable to find job 4832 from 
the scheduler order package
        08/06/2007 10:50:42|qmaster|rebel2|W|unable to find job 4832 from 
the scheduler order package
        08/06/2007 10:50:42|qmaster|rebel2|W|unable to find job 4832 from 
the scheduler order package
        08/06/2007 10:50:42|qmaster|rebel2|W|unable to find job 4832 from 
the scheduler order package

and then it seems to repeat itself, except that sometimes the hosts change 
in the execd messages.
I can't find any of the job names or IDs that it lists in the messages 
file.  I went to the spool directories for the hosts listed (e.g. drone026) 
and looked at their messages files; the only error that I see is that it 
couldn't open file active_jobs:4822.2630/error: No such file or directory.
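For anyone retracing my digging, this is roughly how I tallied which job IDs 
the qmaster keeps complaining about (a sketch; the sample file below just 
stands in for my real $SGE_ROOT/default/spool/qmaster/messages):

```shell
# Stand-in sample for the qmaster messages file (illustrative lines only)
cat > /tmp/messages.sample <<'EOF'
08/06/2007 10:50:22|qmaster|rebel2|W|unable to find job "4838" from the ticket order
08/06/2007 10:50:42|qmaster|rebel2|W|unable to find job 4832 from the scheduler order package
08/06/2007 10:50:42|qmaster|rebel2|W|unable to find job 4832 from the scheduler order package
EOF

# Extract the job IDs from both warning variants (quoted and unquoted)
# and count how often each one shows up
grep -o 'unable to find job "\{0,1\}[0-9]\+' /tmp/messages.sample \
  | grep -o '[0-9]\+' \
  | sort | uniq -c | sort -rn
```

On my real file that let me see whether it is a handful of jobs repeating 
forever or a steadily growing set.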

I thought that this might be an NFS issue, but I don't see any problems in 
the system log.

The nature of the jobs is such that I don't want to restart the qmaster 
unless I have to.

I have a feeling that someone submitted some jobs, deleted the directories 
that they were writing to, and then deleted the jobs, but I can't say for 
sure that is what happened.  I searched for these errors in the users 
list, but I didn't see anything applicable.
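To test that theory, I was planning to check the accounting records for the 
suspect job IDs; qacct should still show a deleted job.  A sketch (job 4822 
is just one of the IDs from the messages above, and the fallback message is 
mine):

```shell
# If this is run on a submit/admin host with SGE sourced, qacct reads the
# accounting file; a "deleted" failure state would confirm a qdel happened.
if command -v qacct >/dev/null 2>&1; then
    qacct -j 4822
else
    echo "qacct not on PATH - run this on a submit/admin host"
fi
```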

Any suggestions?

Thanks,
Brett Grant



More information about the gridengine-users mailing list