[GE users] Interesting Problems

pollinger harald.pollinger at sun.com
Fri May 22 12:47:57 BST 2009


Hi Daniel,

yes, I can confirm this error. Please file an issue for it.
The KEEP_ACTIVE problem is not directly related to this bug; it's a known 
problem, IIRC. If an issue doesn't already exist for it, please file a 
separate one.

Thanks!
Harald


templedf wrote:
> Could someone please verify that the following can be reproduced?  I'm 
> using the 6.2u3 beta for a 1-node test cluster.  I do the following:
> 
> 0) Submit a job to the only queue and watch it run ok -- no issues with 
> all.q
> 1) Create a second queue
> 2) Set the second queue's prolog to /dev/null -- yes, I want jobs to 
> fail there
> 3) Submit a job with a soft request for the second queue
> 4) qstat -f
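
The steps above might look like this on the command line (a sketch only;
"job.sh" stands in for whatever job script was submitted, and `qconf -aq`
opens an editor in which the queue is defined):

```shell
# 1) Create a second queue (opens an editor; name it test.q)
qconf -aq test.q

# 2) Point its prolog at /dev/null -- deliberately not executable,
#    so jobs landing in test.q fail in the prolog
qconf -mattr queue prolog /dev/null test.q

# 3) Submit a job with a soft request for the second queue
qsub -soft -q test.q job.sh

# 4) Check the queue states
qstat -f
```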
> 
> Here's what I see:
> 
>  > qstat -f
> queuename                      qtype resv/used/tot. load_avg 
> arch          states
> ---------------------------------------------------------------------------------
> all.q at ultra20                  BIP   0/0/8          0.18     sol-amd64     E
> ---------------------------------------------------------------------------------
> test.q at ultra20                 BIP   0/0/8          0.18     sol-amd64     E
> 
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>     112 0.55500 STDIN      dant         qw    05/15/2009 06:35:17     1
> 
> Both queues are in error state!  That test.q should be in error state is 
> expected because of the prolog, but I know that all.q is fine.  I tested 
> it both before and after this experiment, and as long as the job goes to 
> all.q first, there are no problems.  The qstat -j output for the job 
> only shows why test.q went into the error state:
> 
>  > qstat -j 112
> ==============================================================
> job_number:                 112
> ...
> error reason    1:          05/15/2009 06:35:17 [40240:7813]: prolog 
> file "/dev/null" is not executable
> scheduling info:            (Collecting of scheduler job information is 
> turned off)
> 
> So, next I:
> 
> 5) qmod -d test.q
> 6) qmod -cq all.q
> 
> What happens next depends on whether KEEP_ACTIVE is set or not.  If it's 
> not set, then when the job is rescheduled, it runs in all.q without 
> error.  If KEEP_ACTIVE is set, after the job is rescheduled (and sits 
> for a really long time in the t state), I get this again:
> 
>  > qstat -f
> queuename                      qtype resv/used/tot. load_avg 
> arch          states
> ---------------------------------------------------------------------------------
> all.q at ultra20                  BIP   0/0/8          0.18     sol-amd64     E
> ---------------------------------------------------------------------------------
> test.q at ultra20                 BIP   0/0/8          0.18     
> sol-amd64     dE
> 
> but this time, qstat -j tells me:
> 
>  > qstat -j 112
> ==============================================================
> job_number:                 112
> ...
> error reason    1:          05/15/2009 06:35:17 [40240:7813]: prolog 
> file "/dev/null" is not executable
>                 1:          can't create directory active_jobs/112.1: 
> File exists
> scheduling info:            (Collecting of scheduler job information is 
> turned off)
> 
> 
> So...  There are a number of issues here:
> 1) It looks like the job is being rescheduled before the active_jobs 
> directory can be cleaned up.  I have flush_*_sec set to 1.
> 2) The qstat -j output is not picking up the reason for the second 
> queue's failure.
> 3) With KEEP_ACTIVE=TRUE, a job can't be scheduled to the same host 
> twice, so it's important not to leave it set to TRUE in a production 
> cluster.
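
Regarding that last point: KEEP_ACTIVE lives in execd_params in the
cluster configuration, so checking and resetting it could look like this
(a sketch, assuming the setting is in the global configuration):

```shell
# Show the current execd_params entry in the global cluster config
qconf -sconf global | grep execd_params

# Edit the global config (opens an editor); set
#   execd_params  KEEP_ACTIVE=false
# or drop the entry entirely, since false is the default
qconf -mconf global
```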
> 
> Daniel


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         Sun Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering
