[GE users] SGE6 does not backfill

Reuti reuti at staff.uni-marburg.de
Sun Apr 10 13:58:01 BST 2005



Hi,

what does "qconf -tsm" give you in /usr/sge/default/common/schedd_runlog?

CU - Reuti


Quoting Juha Jäykkä <juhaj at iki.fi>:

> I have the following setup (only the backfilling-relevant parameters are
> shown here):
> 
> max_reservation                   100
> default_duration                  337:0:0
> 
> Now, there are 24 CPUs on 12 identical nodes, here is the queue.
> 
> job-ID  prior    name        user  state  submit/start at      queue                     slots  ja-task-ID
> ----------------------------------------------------------------------------------------------------------
>    169  0.52752  co7Lf207    A     r      04/07/2005 17:43:39  all.q@compute-0-0.local       1
>    172  0.52724  co9Lf208    A     r      04/08/2005 00:44:39  all.q@compute-0-0.local       1
>     34  0.60500  co1Lf20     A     r      04/04/2005 16:38:56  all.q@compute-0-1.local       1
>     35  0.60500  co2Lf20     A     r      04/04/2005 16:39:11  all.q@compute-0-1.local       1
>    165  0.52984  co0Lf20     A     r      04/07/2005 17:08:54  all.q@compute-0-10.local      1
>    183  0.52562  nagpd9020   B     r      04/08/2005 11:57:52  all.q@compute-0-11.local      1
>    137  0.53453  nagpd8090   B     r      04/07/2005 11:03:09  all.q@compute-0-2.local       1
>    173  0.52724  co9Lf209    A     r      04/08/2005 01:22:54  all.q@compute-0-3.local       1
>     36  0.60500  co3Lf20     A     r      04/06/2005 09:54:09  all.q@compute-0-6.local       1
>    168  0.52771  co2Lf209    A     r      04/07/2005 17:32:54  all.q@compute-0-7.local       1
>    184  0.52555  nagpd10020  B     r      04/07/2005 20:06:24  all.q@compute-0-7.local       1
>    144  0.53410  nagpd8010   B     r      04/07/2005 11:30:09  all.q@compute-0-8.local       1
>    142  0.53415  nagpd3020   B     r      04/07/2005 17:04:24  all.q@compute-0-9.local       1
>    167  0.52771  co2Lf208    A     r      04/07/2005 17:32:39  all.q@compute-0-9.local       1
>    182  0.52708  GLtest_226  C     qw     04/07/2005 18:09:44                               24
>    181  0.52708  GLtest_225  C     qw     04/07/2005 18:09:41                               20
>    177  0.52707  GLtest_221  C     qw     04/07/2005 18:09:12                                4
>    178  0.52707  GLtest_222  C     qw     04/07/2005 18:09:20                                8
>    180  0.52707  GLtest_224  C     qw     04/07/2005 18:09:36                               16
>    179  0.52707  GLtest_223  C     qw     04/07/2005 18:09:33                               12
>    190  0.50500  co0Lf202    A     qw     04/08/2005 15:10:25                                1
>    191  0.50500  co0Lf204    A     qw     04/08/2005 15:10:28                                1
>    192  0.50500  co0Lf206    A     qw     04/08/2005 15:10:31                                1
>    193  0.50500  co0Lf207    A     qw     04/08/2005 15:10:33                                1
>    194  0.50500  co0Lf208    A     qw     04/08/2005 15:10:36                                1
>    195  0.50500  co0Lf209    A     qw     04/08/2005 15:10:40                                1
> 
> 
> Now, all the jobs currently running have h_rt values which tell the
> scheduler they won't finish until tomorrow evening. All the parallel jobs
> in the queue have been submitted with -R y in order to reserve the CPUs
> for them. Everything is fine, except that there are 10 free CPUs which no
> one is using. The parallel jobs only request 2 hours of CPU time each, and
> I even tested with a serial job which requests just 10 minutes, but
> nothing gets backfilled!
> 
> What is wrong here? Am I missing some parameter somewhere? The only place
> where the manual talks about backfilling is the sched_conf man page, which
> covers the two options I mentioned at the beginning. So as far as I can
> tell from the documentation, backfilling should occur!
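For reference, the feasibility test behind backfilling is simple: a job may use reserved slots only if its requested runtime ends before the reservation starts. A minimal sketch of that check (plain Python for illustration, not SGE source; the epoch values are taken from the "schedule" output below, the "now" value is approximate):

```python
# Sketch of the backfill feasibility test (illustrative, not SGE code):
# a candidate job fits into a reserved hole only if it would finish
# before the earliest reservation on those slots begins.

def can_backfill(now, job_runtime, reservation_start):
    """True if a job started now would end before the reservation starts."""
    return now + job_runtime <= reservation_start

# Job 182's reservation starts at epoch 1114096134 (from the schedule
# file).  A 2-hour (7200 s) job started around 04/08/2005 (epoch value
# approximate) ends long before that, so it should be backfillable:
now = 1112967000
print(can_backfill(now, 7200, 1114096134))        # → True

# A job with no h_rt falls under default_duration 337:0:0 (= 1213200 s,
# the duration most running jobs show above) and cannot fit:
print(can_backfill(now, 337 * 3600, 1114096134))  # → False
```

So any job without an explicit h_rt is assumed to run for the full default_duration and will never pass this test.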
> 
> Here is what "schedule" says:
> 
> 34:1:RUNNING:1112621936:1213200:Q:all.q@compute-0-1.local:slots:1.000000
> 35:1:RUNNING:1112621951:1213200:Q:all.q@compute-0-1.local:slots:1.000000
> 36:1:RUNNING:1112770449:1213200:Q:all.q@compute-0-6.local:slots:1.000000
> 137:1:RUNNING:1112860989:1213200:Q:all.q@compute-0-2.local:slots:1.000000
> 144:1:RUNNING:1112862609:360000:Q:all.q@compute-0-8.local:slots:1.000000
> 142:1:RUNNING:1112882664:1213200:Q:all.q@compute-0-9.local:slots:1.000000
> 165:1:RUNNING:1112882934:1213200:Q:all.q@compute-0-10.local:slots:1.000000
> 167:1:RUNNING:1112884359:172800:Q:all.q@compute-0-9.local:slots:1.000000
> 168:1:RUNNING:1112884374:172800:Q:all.q@compute-0-7.local:slots:1.000000
> 169:1:RUNNING:1112885019:172800:Q:all.q@compute-0-0.local:slots:1.000000
> 184:1:RUNNING:1112893584:360000:Q:all.q@compute-0-7.local:slots:1.000000
> 172:1:RUNNING:1112910279:172800:Q:all.q@compute-0-0.local:slots:1.000000
> 173:1:RUNNING:1112912574:172800:Q:all.q@compute-0-3.local:slots:1.000000
> 183:1:RUNNING:1112950672:345600:Q:all.q@compute-0-11.local:slots:1.000000
> 182:1:RESERVING:1114096134:7200:P:lam:slots:24.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-2.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-3.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-6.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-8.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-11.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-5.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-10.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-4.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-0.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-1.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-7.local:slots:2.000000
> 182:1:RESERVING:1114096134:7200:Q:all.q@compute-0-9.local:slots:2.000000
> 
> I can see from this that the resources are indeed reserved (and the fact
> that the smaller jobs do not get run agrees with this).
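Incidentally, those colon-separated "schedule" lines are easy to split mechanically when checking reservations. A rough parser follows; the field names are my guesses inferred from the output above, not taken from SGE documentation:

```python
# Rough parser for lines from the scheduler's "schedule" file.
# Field names are guesses inferred from the output above, not official.
FIELDS = ("job_id", "task_id", "state", "start_time", "duration",
          "level", "object", "resource", "amount")

def parse_schedule_line(line):
    """Split one colon-separated schedule line into a labelled record."""
    rec = dict(zip(FIELDS, line.strip().split(":")))
    rec["start_time"] = int(rec["start_time"])   # epoch seconds
    rec["duration"] = int(rec["duration"])       # seconds
    rec["amount"] = float(rec["amount"])         # e.g. slot count
    return rec

rec = parse_schedule_line(
    "182:1:RESERVING:1114096134:7200:P:lam:slots:24.000000")
print(rec["state"], rec["duration"], rec["amount"])  # RESERVING 7200 24.0
```

For "Q" lines the object field is the queue instance (e.g. all.q@compute-0-2.local); for "P" lines it is the parallel environment.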
> 
> --
>                  ---------------------------------------------
>                 | Juha Jäykkä, juolja at utu.fi			|
> 		| Laboratory of Theoretical Physics		|
> 		| Department of Physics, University of Turku	|
>                 | home: http://www.utu.fi/~juolja/              |
>                  -----------------------------------------------
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



