[GE users] resource reservation not working

Andreas.Haas at Sun.COM
Thu Sep 20 10:49:21 BST 2007


Hi Ross,

On Wed, 19 Sep 2007, Ross Dickson wrote:

> Hi Andreas.
>
>> qconf -ssconf | egrep "weight_urgency|weight_priority|weight_ticket"
> weight_tickets_functional         0
> weight_tickets_share              0
> weight_ticket                     0.010000
> weight_urgency                    0.100000
> weight_priority                   1.000000

Ok.

>
> The only thing in the complex with an urgency value is "slots"...
>> qconf -sc | grep slots
> slots               s          INT         <=    YES         YES        1        1000
> ...everything else has zero.

Ok.

> I can't think of any way that the smaller jobs could have been higher 
> priority than #3568, but I'm pretty new at this and many things are still 
> obscure to me.  Job priorities, in my experience, are determined by the slot 
> count (more slots --> higher priority).

Yes, this is how it works.
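
As a rough cross-check, the "prior" values in the qstat listing quoted
further down can be reproduced from the three weights above. The sketch
below is not Grid Engine code. It assumes that the final priority is a
weighted sum of normalized POSIX priority, urgency and tickets, that each
input is min-max scaled over the jobs in the system, that equal values
normalize to 0.5, that waiting-time and deadline urgency contributions are
zero, and that the smallest job in the system requests 1 slot. The names
normalize and job_priority are made up for this illustration.

```python
# Weights as reported by "qconf -ssconf" in this thread.
WEIGHT_URGENCY = 0.1
WEIGHT_TICKET = 0.01
WEIGHT_PRIORITY = 1.0

# Urgency value of the "slots" complex as reported by "qconf -sc".
SLOTS_URGENCY = 1000


def normalize(value, lowest, highest):
    """Min-max scale value into [0, 1].

    Assumption: when all jobs share the same value the result is 0.5.
    """
    if highest == lowest:
        return 0.5
    return (value - lowest) / (highest - lowest)


def job_priority(slots, min_slots, max_slots, npprior=0.5, ntckts=0.5):
    """Weighted sum of normalized POSIX priority, urgency and tickets.

    npprior=0.5 and ntckts=0.5 are assumptions: nobody used "qalter -p"
    and all ticket weights are 0, so every job has the same raw value.
    """
    urgency = slots * SLOTS_URGENCY
    nurg = normalize(urgency,
                     min_slots * SLOTS_URGENCY,
                     max_slots * SLOTS_URGENCY)
    return (WEIGHT_PRIORITY * npprior
            + WEIGHT_URGENCY * nurg
            + WEIGHT_TICKET * ntckts)


# Assumed range of pending/running jobs: 1 slot (smallest) to 20 slots.
print(f"{job_priority(20, 1, 20):.5f}")  # 20-slot job such as 3568
print(f"{job_priority(4, 1, 20):.5f}")   # 4-slot jobs such as 3566
```

Under these assumptions the two results match the 0.60500 and 0.52079
shown by qstat, which is consistent with 3568 really having had the higher
priority when the priorities were last computed.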

> We did nothing like "qalter -p" that 
> would either lower 3568 or raise the other jobs' posix priority.
>
> There are 4 (and only 4) effectively identical jobs queued up with "-R y". 
> Only one is showing reservations in the "schedule" file, but that doesn't 
> trouble me.  If I could get one of them going it would at least demonstrate 
> that reservation works.

Seeing no reservations is fine only if the cluster is full, meaning
that there are no queues left with free slots.

I recently found a case where reservation is broken:

    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2344

but there might be other reasons for the behaviour you encounter.

How about your job runtimes?

---> Are they enforced or not?

According to the 'schedule' file, job 3568 has a runtime of just 11 minutes
(660 seconds). Is it really that short? How about the runtimes of the
smaller jobs?
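
For reference, this is how the runtime can be read off one of the
RESERVING lines quoted further down. It is a minimal sketch which assumes
that the colon-separated fields are job id, task id, state, start time
(seconds since the epoch), duration (seconds), level character, object
name, resource name and utilization. The ScheduleRecord type and the
parse_schedule_line helper are invented names for this example, not part
of Grid Engine.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class ScheduleRecord(NamedTuple):
    """One line of the 'schedule' file (field names are assumptions)."""
    job_id: int
    task_id: int
    state: str
    start_time: int   # seconds since the epoch
    duration: int     # seconds
    level: str
    object_name: str
    resource: str
    utilization: float


def parse_schedule_line(line):
    """Split one colon-separated record into typed fields."""
    parts = line.strip().split(":")
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}: {line!r}")
    job, task, state, start, duration, level, obj, resource, util = parts
    return ScheduleRecord(int(job), int(task), state, int(start),
                          int(duration), level, obj, resource, float(util))


# Line copied from the original post (the list archive shows "@" as " at ").
sample = ("3568:1:RESERVING:1190217135:660:Q:"
          "all.q at cl021.smu.acenet.ca:slots:1.000000")

rec = parse_schedule_line(sample)
start = datetime.fromtimestamp(rec.start_time, tz=timezone.utc)
end = datetime.fromtimestamp(rec.start_time + rec.duration, tz=timezone.utc)

print(f"job {rec.job_id} {rec.state} {rec.utilization:g} {rec.resource} "
      f"on {rec.object_name}")
print(f"window: {start.isoformat()} to {end.isoformat()} "
      f"({rec.duration / 60:g} minutes)")
```

If the 660 seconds is only an assumed duration rather than the real
runtime of the job, then the reservation window shown in the file says
little about when the slots will actually be needed.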

Regards,
Andreas

>
> Cheers,
> Ross
>
>
> Andreas.Haas at Sun.COM wrote:
>> Hi Ross,
>> 
>> are you sure 3568 also had higher priority at the time when these smaller 
>> jobs were assigned? Could it be that 3568 got no reservation in the meantime 
>> due to the small max_reservation of 5? How do you ensure jobs like 3568 get 
>> high priority? I would expect you are using an urgency contribution of 1000 
>> for the 'slots' resource. Is there any other resource with a significant 
>> urgency contribution?
>> 
>> What weights are you using for priorities:
>>
>>  # qconf -ssconf | egrep "weight_urgency|weight_priority|weight_ticket"
>> 
>> Regards,
>> Andreas
>> 
>> 
>> On Wed, 19 Sep 2007, Ross Dickson wrote:
>> 
>>> Hello all.
>>> 
>>> We've got a Red Hat cluster running N1GE 6.0u9.  We've got resource 
>>> reservation turned on:
>>> 
>>> % qconf -ssconf | grep reservation
>>> max_reservation                   5
>>> 
>>> ...and four jobs in the waiting list with "-R y".  Here's one:
>>> 
>>> % qstat -j 3568 | grep reserv
>>> reserve:                    y
>>> 
>>> But since it went in on Sept 14, other jobs (of lower priority!) have been 
>>> submitted and scheduled. Here are some highlights from qstat:
>>> 
>>> job-ID  prior   name       user         state submit/start at     queue                             slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>> ....
>>>  3566 0.52079 rs1.90_cmc itamblyn     r     09/18/2007 11:04:59 all.q at cl026.smu.acenet.ca          4
>>>  3668 0.52079 L099A      mcoates      r     09/18/2007 12:27:44 all.q at cl027.smu.acenet.ca          4
>>>  3563 0.52079 rs1.90_cmc itamblyn     r     09/13/2007 15:52:23 all.q at cl028.smu.acenet.ca          4
>>>  3667 0.52079 L022       mcoates      r     09/18/2007 12:27:44 all.q at cl029.smu.acenet.ca          4
>>> ....
>>>  3568 0.60500 Metis      kghazino     qw    09/14/2007 13:55:52                                     20
>>> ....
>>> 
>>> Note the start times on 3566, 3667, 3668.  When I set "params MONITOR=1" 
>>> in qconf -msconf, I can see that 3568 is reserving cpus:
>>> 
>>> % tail -3 /opt/n1ge6u9/default/common/schedule
>>> 3568:1:RESERVING:1190217135:660:Q:all.q at cl021.smu.acenet.ca:slots:1.000000 
>>> 3568:1:RESERVING:1190217135:660:Q:all.q at cl034.smu.acenet.ca:slots:1.000000 
>>> 3568:1:RESERVING:1190217135:660:Q:all.q at cl020.smu.acenet.ca:slots:1.000000 
>>> 
>>> This looks suspiciously like a case mentioned on this mailing list in Dec 
>>> 2006 by Jean-Paul Minet, but no answer to his final query appears in the 
>>> archives.  Why are the smaller jobs getting in front of the reserving job? 
>>> What am I missing?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>

http://gridengine.info/
