[GE users] Advanced scheduling with checkpointing

Reuti reuti at staff.uni-marburg.de
Fri Sep 26 21:46:25 BST 2008


Hi Gerald,

On 26.09.2008 at 19:53, Gerald Ragghianti wrote:

> Hi Reuti,
> Thanks for the ideas.  Am I correct that in the first scenario each
> job's priority (and corresponding queue) would have to be a
> permanent feature of that job (i.e. a job could not start as high
> priority and later become low priority)?

No. You can use qalter to change the priority, override tickets,
queue, ... of a job, even while it is running. The changed settings
are honored once the job is rescheduled into the waiting state; they
have no effect as long as the job is still running.
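
If you want to script such a change, a rough sketch could look like
the following. It only builds the qalter command line and does not
run anything. The job ID 4711, the queue name low.q and the helper
name are made up for illustration; -p, -q and -ot are the qalter
options for priority, queue list and override tickets (changing -ot,
or raising -p above 0, needs operator or manager rights).

```python
import shlex


def build_qalter_cmd(job_id, priority=None, queue=None, override_tickets=None):
    """Return the argv list for a qalter call; nothing is executed here."""
    cmd = ["qalter"]
    if priority is not None:
        cmd += ["-p", str(priority)]
    if queue is not None:
        cmd += ["-q", queue]
    if override_tickets is not None:
        cmd += ["-ot", str(override_tickets)]
    cmd.append(str(job_id))
    return cmd


# Hypothetical job 4711: lower its priority and point it at a low
# priority queue (both values are placeholders).
cmd = build_qalter_cmd(4711, priority=-100, queue="low.q")
print(shlex.join(cmd))
```

On a submit host the list could then be handed to
subprocess.run(cmd, check=True).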

-- Reuti


>   In that case, I think the co-scheduler would be the correct  
> solution for our user base.  I'll have to test these options out soon.
> Reuti wrote:
>> Hi,
>>
>> On 26.09.2008 at 18:49, Gerald Ragghianti wrote:
>>
>>> I have a certain scheduling policy that I would like to  
>>> implement, but I am having trouble determining if it is even  
>>> possible with SGE.  I would like to have job priorities  
>>> determined by share tree tickets (no problem there).  Then I want  
>>> jobs to be checkpointed/suspended or started/resumed based on the  
>>> job priorities each iteration.  This would allow us to remove all  
>>> limits on number of used job slots per user while still ensuring  
>>> low queue times for those with high enough priority.  This seems  
>>> like a kind of panacea of scheduling algorithms (and relatively  
>>> simple), but I have yet to find a resource manager that will  
>>> support it.
>>>
>>> So can SGE do this or something close to it?
>>
>> Once a job is running, SGE will not move it back to the waiting
>> state based on the share tree. What you can implement is:
>>
>> - one queue for the low priority jobs, which must already support
>> checkpointing on their own
>> - define the checkpoint environment to migrate on suspend
>> - one high priority queue for certain jobs
>> - subordinate the low priority queue to this high priority queue
>> - when the low priority queue gets suspended, its jobs are
>> migrated, i.e. the low priority job is requeued
>> (the advantage over a simple subordination is that the low
>> priority job can start again as soon as another node becomes free,
>> instead of waiting for exactly this high priority job on the same
>> node to end)
>>
>> The pitfall is that resources are only released after the low
>> priority job has left the node. Depending on your submission
>> request, the high priority job may therefore be unable to start
>> for lack of resources, although they would become available as
>> soon as it starts. However, if you define all resources as
>> resource quota sets (RQS), they can be bound to queues, so the
>> resources appear available to each queue independently.
>>
>> Another option would be to have a co-scheduler, which sends the
>> migrate command to the low priority job when it discovers a
>> waiting high priority job (while also disabling the low priority
>> queue until the high priority job has started and blocks that
>> queue on its own).
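
The decision step of such a co-scheduler might be sketched roughly
like this. It is only an illustration under assumptions, not a tested
setup: the function name, the queue name low.q and the way job data
is passed in are made up, and the plan is returned as a list of
command lines instead of being executed. qmod -d (disable queue) and
qmod -sj (suspend job) are existing qmod options; whether a suspend
actually leads to checkpoint and migration depends on the checkpoint
environment containing "x" in its "when" attribute.

```python
def plan_preemption(waiting_high_slots, running_low_jobs, low_queue="low.q"):
    """Plan qmod calls that free slots for waiting high priority jobs.

    waiting_high_slots: slots requested by waiting high priority jobs.
    running_low_jobs:   list of (job_id, slots) for running low priority jobs.
    Returns a list of argv lists; nothing is executed here.
    """
    if waiting_high_slots <= 0:
        return []
    # Keep new low priority jobs from grabbing the slots being freed.
    plan = [["qmod", "-d", low_queue]]
    freed = 0
    for job_id, slots in running_low_jobs:
        if freed >= waiting_high_slots:
            break
        # Assumption: the job's checkpoint environment migrates on suspend.
        plan.append(["qmod", "-sj", str(job_id)])
        freed += slots
    return plan


# Hypothetical state: two slots wanted, three single-slot low priority jobs.
for argv in plan_preemption(2, [(101, 1), (102, 1), (103, 1)]):
    print(" ".join(argv))
```

Once the high priority job has started, the low priority queue would
be enabled again with qmod -e.
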
>>
>> -- Reuti
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>
>
> -- 
> Gerald Ragghianti
> IT Administrator - High Performance Computing
> http://hpc.usg.utk.edu/
> Office of Information Technology
> University of Tennessee
> Phone: 865-974-2448
> E-mail: geri at utk.edu
>
>





