[GE users] SGE6 does not backfill

Juha Jäykkä juhaj at iki.fi
Sun Apr 10 10:02:42 BST 2005



I have the following setup (only the parameters relevant to backfilling
are shown here):

max_reservation                   100
default_duration                  337:0:0
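
For reference, default_duration is given as hours:minutes:seconds, so
337:0:0 is 337 hours. The small sketch below (the helper name is made up
for illustration) converts it to seconds, which is the unit used in the
duration field of the "schedule" output further down:

```python
def duration_to_seconds(text: str) -> int:
    """Convert an h:m:s string such as '337:0:0' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


print(duration_to_seconds("337:0:0"))  # 1213200
```

The result, 1213200, is the same number that appears as the duration of
several running jobs in the schedule listing.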

There are 24 CPUs on 12 identical nodes. Here is the queue:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    169 0.52752 co7Lf207   A      r     04/07/2005 17:43:39 all.q@compute-0-0.local               1
    172 0.52724 co9Lf208   A      r     04/08/2005 00:44:39 all.q@compute-0-0.local               1
     34 0.60500 co1Lf20    A      r     04/04/2005 16:38:56 all.q@compute-0-1.local               1
     35 0.60500 co2Lf20    A      r     04/04/2005 16:39:11 all.q@compute-0-1.local               1
    165 0.52984 co0Lf20    A      r     04/07/2005 17:08:54 all.q@compute-0-10.local              1
    183 0.52562 nagpd9020  B      r     04/08/2005 11:57:52 all.q@compute-0-11.local              1
    137 0.53453 nagpd8090  B      r     04/07/2005 11:03:09 all.q@compute-0-2.local               1
    173 0.52724 co9Lf209   A      r     04/08/2005 01:22:54 all.q@compute-0-3.local               1
     36 0.60500 co3Lf20    A      r     04/06/2005 09:54:09 all.q@compute-0-6.local               1
    168 0.52771 co2Lf209   A      r     04/07/2005 17:32:54 all.q@compute-0-7.local               1
    184 0.52555 nagpd10020 B      r     04/07/2005 20:06:24 all.q@compute-0-7.local               1
    144 0.53410 nagpd8010  B      r     04/07/2005 11:30:09 all.q@compute-0-8.local               1
    142 0.53415 nagpd3020  B      r     04/07/2005 17:04:24 all.q@compute-0-9.local               1
    167 0.52771 co2Lf208   A      r     04/07/2005 17:32:39 all.q@compute-0-9.local               1
    182 0.52708 GLtest_226 C      qw    04/07/2005 18:09:44                                   24        
    181 0.52708 GLtest_225 C      qw    04/07/2005 18:09:41                                   20        
    177 0.52707 GLtest_221 C      qw    04/07/2005 18:09:12                                    4        
    178 0.52707 GLtest_222 C      qw    04/07/2005 18:09:20                                    8        
    180 0.52707 GLtest_224 C      qw    04/07/2005 18:09:36                                   16        
    179 0.52707 GLtest_223 C      qw    04/07/2005 18:09:33                                   12        
    190 0.50500 co0Lf202   A      qw    04/08/2005 15:10:25                                    1        
    191 0.50500 co0Lf204   A      qw    04/08/2005 15:10:28                                    1        
    192 0.50500 co0Lf206   A      qw    04/08/2005 15:10:31                                    1        
    193 0.50500 co0Lf207   A      qw    04/08/2005 15:10:33                                    1        
    194 0.50500 co0Lf208   A      qw    04/08/2005 15:10:36                                    1        
    195 0.50500 co0Lf209   A      qw    04/08/2005 15:10:40                                    1        


All the jobs currently running have h_rt values which tell the scheduler
that they won't finish until tomorrow evening. All the parallel jobs in
the queue have been submitted with -R y in order to reserve the CPUs for
them. Everything is fine, except that there are 10 free CPUs which no one
is using. The parallel jobs only request 2 hours of CPU time each, and I
even tested with a serial job which requests just 10 minutes, but nothing
gets backfilled!
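
To spell out what I would expect, here is a minimal sketch of the
textbook backfill condition: a waiting job may start now if it fits into
the idle slots and its requested run time ends before the reservation
begins. This is an illustration only, not the actual logic of the SGE
scheduler. The names Candidate and can_backfill are made up, "now" is
assumed to be the timestamp of this message, and the reservation start is
the value shown for job 182 in the schedule output further down.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Candidate:
    slots: int       # slots the waiting job asks for
    runtime_s: int   # requested run time (h_rt) in seconds


def can_backfill(job: Candidate, free_slots: int, now: int, reservation_start: int) -> bool:
    """Textbook test: the job fits in idle slots and ends before the reservation starts."""
    fits_now = job.slots <= free_slots
    ends_in_time = now + job.runtime_s <= reservation_start
    return fits_now and ends_in_time


# Assumption: "now" is the time this message was sent (10:02:42 BST = 09:02:42 UTC).
now = int(datetime(2005, 4, 10, 9, 2, 42, tzinfo=timezone.utc).timestamp())
reservation_start = 1114096134  # start time of job 182's reservation in the schedule file

serial_test = Candidate(slots=1, runtime_s=10 * 60)  # the 10-minute serial test job
print(can_backfill(serial_test, free_slots=10, now=now, reservation_start=reservation_start))  # True
```

Under this simplified condition the 10-minute serial job qualifies, which
is why I expected it to be started on one of the idle CPUs.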

What is wrong here? Am I missing some parameter somewhere? The only place
where the manual talks about backfilling is the sched_conf man page, and
it concerns only the two options I mentioned at the beginning. So as far
as I can tell from the documentation, backfilling should occur!

Here is what "schedule" says:

34:1:RUNNING:1112621936:1213200:Q:all.q@compute-0-1.local:slots:1.000000
35:1:RUNNING:1112621951:1213200:Q:all.q@compute-0-1.local:slots:1.000000
36:1:RUNNING:1112770449:1213200:Q:all.q@compute-0-6.local:slots:1.000000
137:1:RUNNING:1112860989:1213200:Q:all.q@compute-0-2.local:slots:1.000000
144:1:RUNNING:1112862609:360000:Q:all.q@compute-0-8.local:slots:1.000000
142:1:RUNNING:1112882664:1213200:Q:all.q@compute-0-9.local:slots:1.000000
165:1:RUNNING:1112882934:1213200:Q:all.q@compute-0-10.local:slots:1.000000
167:1:RUNNING:1112884359:172800:Q:all.q@compute-0-9.local:slots:1.000000
168:1:RUNNING:1112884374:172800:Q:all.q@compute-0-7.local:slots:1.000000
169:1:RUNNING:1112885019:172800:Q:all.q@compute-0-0.local:slots:1.000000
184:1:RUNNING:1112893584:360000:Q:all.q@compute-0-7.local:slots:1.000000
172:1:RUNNING:1112910279:172800:Q:all.q@compute-0-0.local:slots:1.000000
173:1:RUNNING:1112912574:172800:Q:all.q@compute-0-3.local:slots:1.000000
183:1:RUNNING:1112950672:345600:Q:all.q@compute-0-11.local:slots:1.000000
182:1:RESERVING:1114096134:7200:P:lam:slots:24.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-2.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-3.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-6.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-8.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-11.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-5.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-10.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-4.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-0.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-1.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-7.local:slots:2.000000
182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-9.local:slots:2.000000

I can see from this that the resources are indeed reserved (and the fact
that the smaller jobs do not get run is consistent with this).
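
The listing can also be checked mechanically. The sketch below assumes
the colon-separated layout job:task:state:start:duration:level:object:
resource:amount, which is my reading of the lines above; the Record type
and parse_line helper are made up for illustration. It uses a few of the
lines from the listing and compares the latest projected end of a running
job with the start of job 182's reservation.

```python
from typing import NamedTuple


class Record(NamedTuple):
    job_id: int
    task_id: int
    state: str
    start: int      # epoch seconds
    duration: int   # seconds
    level: str      # e.g. Q (queue instance) or P (parallel environment)
    obj: str
    resource: str
    amount: float


def parse_line(line: str) -> Record:
    """Split one schedule line into typed fields (assumed 9-field layout)."""
    job, task, state, start, dur, level, obj, res, amount = line.strip().split(":")
    # Tolerate list-archive mangling of "@" in queue instance names.
    obj = obj.replace(" at ", "@")
    return Record(int(job), int(task), state, int(start), int(dur), level, obj, res, float(amount))


sample = """\
142:1:RUNNING:1112882664:1213200:Q:all.q@compute-0-9.local:slots:1.000000
165:1:RUNNING:1112882934:1213200:Q:all.q@compute-0-10.local:slots:1.000000
183:1:RUNNING:1112950672:345600:Q:all.q@compute-0-11.local:slots:1.000000
182:1:RESERVING:1114096134:7200:P:lam:slots:24.000000
"""

records = [parse_line(line) for line in sample.splitlines()]
latest_end = max(r.start + r.duration for r in records if r.state == "RUNNING")
reservation = next(r for r in records if r.state == "RESERVING")

print(latest_end)         # 1114096134
print(reservation.start)  # 1114096134
```

The reservation for job 182 starts exactly when job 165 is projected to
end. Job 165 and several other running jobs carry a duration of 1213200
seconds, which equals default_duration (337 hours), so either they
request h_rt=337:0:0 or the scheduler is applying the default to them.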

--
                 ---------------------------------------------
                | Juha Jäykkä, juolja at utu.fi			|
		| Laboratory of Theoretical Physics		|
		| Department of Physics, University of Turku	|
                | home: http://www.utu.fi/~juolja/              |
                 -----------------------------------------------

