[GE users] SGE6 does not backfill

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Mon Apr 11 08:40:13 BST 2005


Hello,

we discovered a bug in the resource reservation code which prevents
backfilling.

Could you verify whether you are running into this bug?

[Issue 1543]  reserved jobs prevent other jobs from starting
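(Independent of the bug, it is worth checking that backfilling can happen
at all: reservation scheduling must be enabled in the scheduler
configuration, and the large job must actually request a reservation. A
sketch with the standard qconf/qsub tools; the PE name, runtime and
script name are only illustrative:)

```shell
# Backfilling requires reservation scheduling to be switched on:
# max_reservation in the scheduler configuration must be > 0.
qconf -ssconf | grep max_reservation

# The large parallel job must request a reservation explicitly,
# e.g. (illustrative values only):
qsub -R y -pe lam 24 -l h_rt=24:00:00 big_job.sh
```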

Cheers,
Stephan

Reuti wrote:

>Quoting Juha Jäykkä <juhaj at iki.fi>:
>
>  
>
>>>The jobs that are still running have an h_rt set, and the newly
>>>submitted serial ones do as well?
>>>      
>>>
>>The ones in the queue do, but I did not check all new serial jobs. What
>>I did was submit one with -l h_rt=00:01:00 just for testing. It does
>>not backfill.
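(For reference, the test amounts to something like the following; -b y
and the sleep payload are illustrative, not the exact command used:)

```shell
# A short serial job with a hard runtime limit of one minute; with
# working backfilling it should slot in ahead of the waiting 24-CPU job.
qsub -b y -l h_rt=00:01:00 /bin/sleep 30
```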
>>
>>    
>>
>>>I wonder, why all of your queues are dropped. qstat -f shows that all
>>>are empty  besides the three running ones?
>>>      
>>>
>>They are empty. Now there is one "quirk" in our setup:
>>all.q@topaasi.local has slots=0 in it, but I cannot see how that could
>>affect it.
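(One way to double-check that setting and the state of the suspect queue
instance, sketched with the standard qconf/qstat calls; the grep pattern
is only illustrative:)

```shell
# Show the slots attribute of the queue definition...
qconf -sq all.q | grep slots

# ...and the current state of the queue instance in question.
qstat -f -q all.q@topaasi.local
```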
>>
>>Should a queue be dropped only when something is actually consuming its
>>CPUs, or should a resource reservation cause it to be dropped, too?
>>    
>>
>
>For me it is only dropped if there is something running in all slots of a
>queue, but not for a reservation. - Reuti
>
>  
>
>>You will notice that the top job in the queue wants all the resources
>>of the cluster and thus reserves all the queues. But this does not
>>change the behaviour: I changed the priorities of the jobs and put a
>>smaller job on top - same behaviour.
>>
>>Am I missing some config parameter or something? Urgency value for h_rt,
>>perhaps?
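(Whether h_rt contributes any urgency can be read from the complex
configuration; qconf -sc lists the complexes, and one of the columns is
each complex's urgency value - a sketch, assuming the default complex
names:)

```shell
# List the complex definitions; the urgency column shows how much each
# requested resource contributes to a job's urgency.
qconf -sc | egrep '^#|h_rt|slots'
```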
>>
>>    
>>
>>>>Sun Apr 10 18:34:53 2005|-------------START-SCHEDULER-RUN-------------
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@topaasi.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queues dropped because they are full: all.q@topaasi.local
>>>>Sun Apr 10 18:34:53 2005|Job 182 cannot run because available slots combined under PE "lam" are not in range of job
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-6.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-10.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-8.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-3.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-9.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-2.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-0.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-4.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-5.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-7.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-11.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@compute-0-1.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queue instance "all.q@topaasi.local" dropped because it is full
>>>>Sun Apr 10 18:34:53 2005|queues dropped because they are full: all.q@compute-0-6.local all.q@compute-0-10.local all.q@compute-0-8.local all.q@compute-0-3.local all.q@compute-0-9.local all.q@compute-0-2.local all.q@compute-0-0.local all.q@compute-0-4.local all.q@compute-0-5.local all.q@compute-0-7.local
>>>>Sun Apr 10 18:34:53 2005|queues dropped because they are full: all.q@compute-0-11.local all.q@compute-0-1.local all.q@topaasi.local
>>>>Sun Apr 10 18:34:53 2005|--------------STOP-SCHEDULER-RUN-------------
>>>>
>>>>BTW, the queues have changed a little since my last mail, but the
>>>>situation at the moment is: three jobs are running and job 182
>>>>requests 24 (= all) CPUs. The jobs still have many hours until their
>>>>h_rt runs out.
>>>>
>>>>In fact, I have never seen SGE6 backfill anything yet... I have only
>>>>seen the highest-priority job being dispatched. What worries me here
>>>>is that in the snippet from /sgeroot/cell/common/scheduler I sent in
>>>>the first mail, reservations are only made for the first job and the
>>>>rest of the jobs are apparently not even considered!
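(A hedged suggestion for seeing what the scheduler actually reserves:
with MONITOR=1 in the scheduler's params, reservation decisions are
written to the schedule file under the cell's common directory; the path
below assumes the default cell name:)

```shell
# Turn on reservation monitoring in the scheduler configuration,
# i.e. set the line:  params  MONITOR=1
qconf -msconf

# Then watch the per-run reservation/schedule records:
tail -f $SGE_ROOT/default/common/schedule
```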
>>>>
>>>>-- 
>>>>		 -----------------------------------------------
>>>>		| Juha Jäykkä, juolja at utu.fi			|
>>>>		| home: http://www.utu.fi/~juolja/		|
>>>>		 -----------------------------------------------
>>>>
>>>>---------------------------------------------------------------------
>>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>-- 
>>		 -----------------------------------------------
>>		| Juha Jäykkä, juolja at utu.fi			|
>>		| home: http://www.utu.fi/~juolja/		|
>>		 -----------------------------------------------
>>

