[GE users] scheduling strategy

Andreas.Haas at Sun.COM
Tue Dec 4 13:10:38 GMT 2007


Hi Jan,

On Wed, 28 Nov 2007, Jan Sundermeyer wrote:

> Andreas.Haas at Sun.COM schrieb:
>> Hi Jan,
>>
>> On Tue, 27 Nov 2007, Jan Sundermeyer wrote:
>>
>>> Hello,
>>>
>>> we have installed sge 6.1 with a standard strategy, which means:
>>>
>>> There is a high/normal/low-priority queue.
>>> Lower priority queues are suspended when the higher priorities load the
>>> machines completely. It does not work perfectly as the use of
>>> subordinates only suspends complete queues but it should be okay.
>>>
>>> My problem now is that actually i would like to have a different set-up
>>> which does not need priority queues.
>>> The target would be fair utilization of the machines.
>>>
>>> For example: 2 users start a number of jobs;
>>> every user should get the same number of concurrently running jobs.
>>>
>>> If they start their jobs at the same time, fine.
>>> But if one starts before the other, the queues are full and the next
>>> user has to wait until the jobs are finished.
>>
>> With functional and/or share tree ticket policy you can implement this
>> (without separate queues), but that doesn't help you to achieve preemption.
>> That means users must wait until resources become available again. That
>> seems nasty, but resource quotas could be used to mitigate the problem.
>> E.g. if you have 200 slots in your SGE cluster, you could configure a
>> resource quota limit like
>>
>>   limit users {*} to slots=150
>>
>> to prevent users grabbing more than 75% at a time.
>>
>>>
>>> I would prefer if sge makes room for the new jobs by suspending jobs
>>> from the first user, so that a fair share is reached as soon as possible.
>>> This way any user could start as many jobs as he likes and limitations
>>> come up only if other users need resources as well.
>>>
>>> One way of doing this might be the use of checkpoint after time periods,
>>> which with our simulator (spectre) practically leads to a reschedule of the
>>> jobs. However it takes some time for simulator to recover to the last
>>> state. Therefore it would be preferable to do checkpointing only when it
>>> is necessary.
>>
>> Getting the checkpointing done only when needed is just one part of
>> the solution. Yet the above reads as if you intend to enforce fair
>> share. How will you be doing this? Are you thinking of a co-scheduler
>> that triggers job preemption?
>>
> Hi Andreas,
>
> I was thinking of a co-scheduler.
> How is this best implemented?

The general approach is to periodically retrieve the relevant job status
information, feed it into a co-scheduling decision-making component, and use
qmod -sj <jobid> or qmod -rj <jobid> to trigger job migration or rescheduling.
That is how it would work in broad terms, though I must confess I have no
concrete experience in this area. As for job priorities, I would expect that no
special treatment of jobs is needed, since the share policy would determine how
resources get used anyway once job preemption has released them.
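
To make the loop a bit more concrete, here is a minimal sketch in Python of the
decision-making part only. It is untested against a real cluster and rests on
several assumptions that are mine, not anything SGE prescribes: every job
occupies exactly one slot, the fair share is simply total slots divided by the
number of users with running or pending jobs, and the job lists have already
been gathered elsewhere (for example by parsing qstat output, which is not
shown). The function names are hypothetical. Whether qmod -sj or qmod -rj
actually frees a slot depends on how the jobs and the checkpointing environment
are configured, which this sketch does not cover.

```python
import subprocess
from collections import defaultdict


def jobs_to_preempt(running, pending, total_slots):
    """Pick running jobs to preempt so waiting users can reach an equal share.

    running / pending: lists of (job_id, user) tuples.
    Assumption: one slot per job, equal share for every active user.
    """
    by_user = defaultdict(list)
    for job_id, user in running:
        by_user[user].append(job_id)

    active_users = set(by_user) | {user for _, user in pending}
    if not active_users:
        return []
    share = total_slots // len(active_users)
    free = total_slots - len(running)

    # Count the slots waiting users still lack to reach their share.
    needed = 0
    for user in {u for _, u in pending}:
        waiting = sum(1 for _, u in pending if u == user)
        deficit = share - len(by_user.get(user, []))
        needed += max(0, min(deficit, waiting))
    needed = max(0, needed - free)

    # Take jobs from the users furthest above their share,
    # starting with the last job listed for each of them.
    victims = []
    for user in sorted(by_user, key=lambda u: len(by_user[u]), reverse=True):
        excess = len(by_user[user]) - share
        while excess > 0 and len(victims) < needed:
            victims.append(by_user[user].pop())
            excess -= 1
    return victims


def qmod_command(job_id, reschedule=False):
    """Build the qmod call: -sj suspends a job, -rj reschedules it."""
    return ["qmod", "-rj" if reschedule else "-sj", str(job_id)]


def preempt(job_ids, reschedule=False, dry_run=True):
    """Print the qmod commands, or run them when dry_run is False."""
    for job_id in job_ids:
        cmd = qmod_command(job_id, reschedule)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

A co-scheduler would call jobs_to_preempt() at a fixed interval and pass the
result to preempt(). Keeping dry_run=True at first lets you check the decisions
before any job is actually touched.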

Regards,
Andreas
