[GE users] Advanced scheduling with checkpointing

Reuti reuti at staff.uni-marburg.de
Fri Sep 26 18:32:25 BST 2008


On 26.09.2008 at 18:49, Gerald Ragghianti wrote:

> I have a certain scheduling policy that I would like to implement,  
> but I am having trouble determining if it is even possible with  
> SGE.  I would like to have job priorities determined by share tree  
> tickets (no problem there).  Then I want jobs to be checkpointed/ 
> suspended or started/resumed based on the job priorities each  
> iteration.  This would allow us to remove all limits on number of  
> used job slots per user while still ensuring low queue times for  
> those with high enough priority.  This seems like a kind of panacea  
> of scheduling algorithms (and relatively simple), but I have yet to  
> find a resource manager that will support it.
> So can SGE do this or something close to it?

Once a job is in the running state, SGE will not move it back to a
waiting state based on the share tree. What you can implement is:

- one queue for the low priority jobs, which must already support
checkpointing on their own
- define the checkpoint environment to migrate on suspend
- one high priority queue for certain jobs
- subordinate the low priority queue to this high priority queue
- when the low priority queue gets suspended, the low priority job is
migrated, i.e. it is requeued
(The advantage over a simple subordination is that the low priority
job can start again as soon as another node becomes free, instead of
waiting for exactly this high priority job on the same node to end.)
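
The list above might translate into something like the following
fragments. This is only a sketch: the names low.q, high.q and
migrate_on_suspend are made up for illustration, the threshold of 1 and
the ckpt_dir are assumptions, and the lines starting with # are
annotations for the reader that should be left out of the real files.
The field names are the ones described in checkpoint(5) and
queue_conf(5). Whether a migr_command or a signal is needed depends on
how the application writes its checkpoints.

```
# checkpoint environment, e.g. loaded with: qconf -Ackpt <file>
ckpt_name          migrate_on_suspend
interface          application-level
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               x

# in the low priority queue (qconf -mq low.q)
ckpt_list          migrate_on_suspend

# in the high priority queue (qconf -mq high.q)
subordinate_list   low.q=1
```

A low priority job would then be submitted with something like
"qsub -ckpt migrate_on_suspend -q low.q job.sh", where job.sh is a
placeholder.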

This has the pitfall that resources will only be released after the
low priority job has left the node. This means that, depending on your
resource requests, the high priority job may not be able to start for
lack of resources, although they would become available shortly after
it starts. However, if you define all resources in resource quota sets
(RQS), they can be bound to queues, so that the resources are presented
as available to each queue independently.
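
As an illustration of the RQS idea, a resource quota set along these
lines would give each queue its own slot budget on every host. The
queue names, the rule set name and the value 4 are assumptions, and
slots stands in for whatever resources your jobs actually request; the
syntax follows sge_resource_quota(5).

```
{
   name         per_queue_slots
   description  "independent slot budget for each queue on every host"
   enabled      TRUE
   limit        queues low.q hosts {*} to slots=4
   limit        queues high.q hosts {*} to slots=4
}
```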

Another option would be to have a co-scheduler, which sends the
migrate command to the low priority job when it discovers a waiting
high priority job (while also disabling the low priority queue until
the high priority job has started and blocks that queue on its own).
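
The decision part of such a co-scheduler could be sketched roughly as
below. This is a simplified illustration, not an existing tool: the
RunningJob record, the plan_preemption function and the queue name
low.q are hypothetical, slots are treated as one pool without regard
to which host the high priority job needs, and the sketch assumes that
suspending a job triggers the migration (a checkpoint environment with
"when x"). It only builds the qmod calls and does not run them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunningJob:
    """A running low priority job (hypothetical record, e.g. filled from qstat)."""
    job_id: int
    host: str
    slots: int


def plan_preemption(needed_slots: int, free_slots: int,
                    low_jobs: List[RunningJob],
                    low_queue: str = "low.q") -> List[List[str]]:
    """Return the qmod calls to issue, as argument lists; nothing is executed."""
    shortfall = needed_slots - free_slots
    commands: List[List[str]] = []
    disabled_hosts = set()
    # Largest jobs first, so that as few jobs as possible are migrated.
    for job in sorted(low_jobs, key=lambda j: j.slots, reverse=True):
        if shortfall <= 0:
            break
        if job.host not in disabled_hosts:
            # Keep new low priority jobs off this host for now.
            commands.append(["qmod", "-d", f"{low_queue}@{job.host}"])
            disabled_hosts.add(job.host)
        # Suspending triggers the migration (assumes "when x" in the ckpt env).
        commands.append(["qmod", "-sj", str(job.job_id)])
        shortfall -= job.slots
    return commands


if __name__ == "__main__":
    jobs = [RunningJob(101, "node01", 2), RunningJob(102, "node02", 4)]
    for cmd in plan_preemption(needed_slots=4, free_slots=0, low_jobs=jobs):
        print(" ".join(cmd))
```

Once the high priority job has started, the disabled queue instances
would be re-enabled with "qmod -e".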

-- Reuti
