[GE users] Advanced scheduling with checkpointing

Gerald Ragghianti geri at utk.edu
Fri Sep 26 18:53:57 BST 2008



Hi Reuti,
Thanks for the ideas.  Am I correct that in the first scenario each
job's priority (and corresponding queue) would have to be a permanent
feature of that job (i.e. a job could not start as high priority and
later become low priority)?  If so, I think the co-scheduler would be
the right solution for our user base.  I'll have to test these options
out soon.
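
To make the co-scheduler idea quoted below concrete, here is a minimal
sketch of the decision step only. Everything in it is an assumption for
illustration: the Job fields, the queue names "low" and "high", and the
abstract "disable_queue" / "migrate_job" actions are hypothetical, and it
counts slots cluster-wide instead of per node. It does not call SGE; a
real co-scheduler would have to map these actions onto the appropriate
SGE commands.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    priority: float  # assumed to be derived from share-tree tickets
    slots: int
    state: str       # "running" or "waiting"
    queue: str       # "low" or "high" (hypothetical queue names)


def plan_preemption(jobs, free_slots):
    """Return (action, target) tuples that would free enough slots for
    the waiting high priority jobs. Slots are counted cluster-wide,
    which is a simplification of real per-node placement."""
    waiting_high = sorted(
        (j for j in jobs if j.state == "waiting" and j.queue == "high"),
        key=lambda j: j.priority, reverse=True)
    # Candidates for migration, lowest priority first.
    running_low = sorted(
        (j for j in jobs if j.state == "running" and j.queue == "low"),
        key=lambda j: j.priority)

    victims = []
    for job in waiting_high:
        if job.slots <= free_slots:
            free_slots -= job.slots  # fits without preempting anything
            continue
        chosen, reclaimed = [], 0
        for candidate in running_low:
            if free_slots + reclaimed >= job.slots:
                break
            chosen.append(candidate)
            reclaimed += candidate.slots
        if free_slots + reclaimed < job.slots:
            continue  # cannot be satisfied even by migrating everything
        for candidate in chosen:
            running_low.remove(candidate)
        victims.extend(chosen)
        free_slots = free_slots + reclaimed - job.slots

    actions = []
    if victims:
        # Keep new low priority jobs out until the high priority jobs start.
        actions.append(("disable_queue", "low"))
        actions.extend(("migrate_job", v.job_id) for v in victims)
    return actions
```

The co-scheduler would run this once per scheduling interval and
re-enable the low priority queue after the high priority jobs have
started.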

Reuti wrote:
> Hi,
>
> Am 26.09.2008 um 18:49 schrieb Gerald Ragghianti:
>
>> I have a certain scheduling policy that I would like to implement, 
>> but I am having trouble determining if it is even possible with SGE.  
>> I would like to have job priorities determined by share tree tickets 
>> (no problem there).  Then I want jobs to be checkpointed/suspended or 
>> started/resumed based on the job priorities each iteration.  This 
>> would allow us to remove all limits on number of used job slots per 
>> user while still ensuring low queue times for those with high enough 
>> priority.  This seems like a kind of panacea of scheduling algorithms 
>> (and relatively simple), but I have yet to find a resource manager 
>> that will support it.
>>
>> So can SGE do this or something close to it?
>
> once a job is in running state, SGE will not move it again to a 
> waiting state based on share-tree. What you can implement is:
>
> - one queue for the low priority jobs, which must already support 
> checkpointing on their own
> - define the checkpoint environment to migrate on suspend
> - one high priority queue for certain jobs
> - subordinate the low priority queue to this high priority queue
> - the low priority queue will then get suspended, which means the low 
> priority job is migrated and requeued
> (the advantage over simple subordination is that the low priority job 
> can start again when another node becomes free, instead of waiting for 
> exactly this high priority job on the same node to end)
>
> This has the pitfall that resources will only be released after the 
> low priority job has left the node. This means that, depending on your 
> submission request, the high priority job may not start because of a 
> lack of resources, although they would become available for the job 
> once it starts. However, if you code all resources as RQS, then these 
> can be bound to queues, so that the resources appear available to each 
> queue independently.
>
> Another option would be to have a co-scheduler, which sends the 
> migrate command to the low priority job when it discovers a waiting 
> high priority job (while also disabling the low priority queue until 
> the high priority job has started and blocks that queue on its own).
>
> -- Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>


-- 
Gerald Ragghianti
IT Administrator - High Performance Computing
http://hpc.usg.utk.edu/
Office of Information Technology
University of Tennessee
Phone: 865-974-2448
E-mail: geri at utk.edu


More information about the gridengine-users mailing list