[GE users] Interesting Problems

templedf dan.templeton at sun.com
Fri May 15 14:56:25 BST 2009


Could someone please verify that the following can be reproduced?  I'm 
using the 6.2u3 beta on a 1-node test cluster.  I do the following:

0) Submit a job to the only queue and watch it run ok -- no issues with 
all.q
1) Create a second queue
2) Set the second queue's prolog to /dev/null -- yes, I want jobs to 
fail there (see the command sketch after this list)
3) Submit a job with a soft request for the second queue
4) qstat -f
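
For anyone trying to reproduce this, steps 1-3 look roughly like the 
following.  (The -mattr form and the sleep job are just from memory as 
an illustration; any job and queue name will do.)

 > qconf -aq test.q                             # opens an editor with a queue template; save to add the queue
 > qconf -mattr queue prolog /dev/null test.q   # /dev/null exists but is not executable, so the prolog will fail
 > echo "sleep 60" | qsub -soft -q test.q       # soft request: the scheduler prefers test.q but may fall back to all.q
 > qstat -f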

Here's what I see:

 > qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@ultra20                  BIP   0/0/8          0.18     sol-amd64     E
---------------------------------------------------------------------------------
test.q@ultra20                 BIP   0/0/8          0.18     sol-amd64     E

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    112 0.55500 STDIN      dant         qw    05/15/2009 06:35:17     1

Both queues are in error state!  That test.q is in error state is 
expected because of the prolog, but I know that all.q is fine.  I 
tested it both before and after this experiment, and as long as the job 
goes to all.q first, there are no problems.  The qstat -j output for 
the job shows only why test.q went into error, nothing about all.q:

 > qstat -j 112
==============================================================
job_number:                 112
...
error reason    1:          05/15/2009 06:35:17 [40240:7813]: prolog file "/dev/null" is not executable
scheduling info:            (Collecting of scheduler job information is turned off)

So, next I:

5) qmod -d test.q
6) qmod -cq all.q
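
In other words, disable test.q so the job can't land there again, and 
clear the error state on all.q.  For completeness (from memory):

 > qmod -d test.q     # disable the queue instance; anything already running is untouched
 > qmod -cq all.q     # clear the E state on all.q
 > qmod -cj 112       # not needed here (the job stayed qw), but this clears a job-level Eqw state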

What happens next depends on whether KEEP_ACTIVE is set.  If it's not 
set, then when the job is rescheduled, it runs in all.q without error.  
If KEEP_ACTIVE is set, then after the job is rescheduled (and sits for 
a really long time in the 't' (transferring) state), I get this again:

 > qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@ultra20                  BIP   0/0/8          0.18     sol-amd64     E
---------------------------------------------------------------------------------
test.q@ultra20                 BIP   0/0/8          0.18     sol-amd64     dE

but this time, qstat -j tells me:

 > qstat -j 112
==============================================================
job_number:                 112
...
error reason    1:          05/15/2009 06:35:17 [40240:7813]: prolog file "/dev/null" is not executable
                1:          can't create directory active_jobs/112.1: File exists
scheduling info:            (Collecting of scheduler job information is turned off)
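
The stale directory is easy to confirm on the execution host.  With 
classic spooling and the default layout it should be under the execd 
spool dir -- something like this (cell name and spool path are the 
defaults; adjust for your setup):

 > ls $SGE_ROOT/default/spool/ultra20/active_jobs
112.1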


So...  There are a number of issues here:

1) It looks like the job is being rescheduled before the active_jobs 
directory can be cleaned up.  (I have flush_submit_sec and 
flush_finish_sec set to 1.)
2) The qstat -j output is not picking up the reason for the failure in 
the second queue the job lands in -- all.q's error reason was missing 
from the first qstat -j entirely, and on the second attempt it shows up 
labeled "1:".
3) With KEEP_ACTIVE=TRUE, a job can't be rescheduled to the same host, 
so it's important not to leave it set to TRUE in a production cluster.
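
For anyone who wants to check these settings: KEEP_ACTIVE lives in 
execd_params in the cluster configuration, and the flush intervals are 
in the scheduler configuration.  Roughly (the grep patterns are just 
illustrative):

 > qconf -sconf | grep execd_params   # look for KEEP_ACTIVE=TRUE in the global config
 > qconf -mconf                       # edit execd_params to drop KEEP_ACTIVE (or set it to FALSE)
 > qconf -ssconf | grep flush         # flush_submit_sec / flush_finish_sec
 > qconf -msconf                      # edit the scheduler config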

Daniel

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=195920
