[GE users] chkpt/migrate dealing with suspension for dedicated and shared resource balancing

Reuti reuti at staff.uni-marburg.de
Wed May 23 16:37:21 BST 2007


On 23.05.2007 at 16:08, Patrice Seyed wrote:

> I'm configuring a scenario in which a set of queues has a certain
> resource attached for a specific project, and the rest are general
> queues for all groups to use. For each general queue I will create
> another queue that is subordinate to it and gets suspended when 2
> jobs are running in the general queue, similar to "express" queues
> but without the time limit.
> To take advantage of these, I will have a script that automates
> submission to the project-specific queues using the "-now y"
> directive. If the job does not run immediately, the script will
> submit it to the subordinate queue with the same run-now directive.
> If that cannot run either, the script will queue the job in the
> project-specific queue, up to a certain number of jobs.
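
The fall-through described above could be sketched roughly as follows. This is only an illustration, not Patrice's actual script: the function names, the queue names and the pending-job limit are hypothetical, and it assumes that qsub returns a non-zero exit status when a "-now y" job cannot be started immediately. How the number of already queued jobs is counted is left to the caller.

```python
import subprocess


def build_qsub(queue, script, now):
    """Build a qsub command line; '-now y' requests an immediate start or none."""
    return ["qsub", "-now", "y" if now else "n", "-q", queue, script]


def submit_with_fallback(script, project_queue, subordinate_queue,
                         pending_in_project, max_pending,
                         run=subprocess.call):
    """Try the project queue, then the subordinate queue, then queue up.

    Assumption: 'run' returns the exit status of qsub, and a non-zero
    status means the job was not accepted (for '-now y': not started).
    """
    # 1. Immediate start in the project-specific queue.
    if run(build_qsub(project_queue, script, now=True)) == 0:
        return "project-now"
    # 2. Immediate start in the subordinate queue.
    if run(build_qsub(subordinate_queue, script, now=True)) == 0:
        return "subordinate-now"
    # 3. Ordinary queued submission, only while below the (hypothetical) limit.
    if pending_in_project < max_pending:
        if run(build_qsub(project_queue, script, now=False)) == 0:
            return "project-queued"
    return "rejected"
```

Passing the runner in as a parameter makes it possible to dry-run the decision logic without a cluster, e.g. with a function that only records the command lines.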


> The one problem I see is that if jobs go to the subordinate queue
> and are then suspended, there is no predicting how long they will
> sit in the S state. I found the following thread on
> checkpointing/migrate:
> http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=5589
> with entries from Kirk Patton and Charu V. Chaubal

Yes, the time of the suspension is not known.

> I see that the script supplied by Kirk will, after a given interval
> of suspension, kill the script. Should I assume it is meant to be
> tweaked to taste, so that the same script resubmits the job after
> killing it?

If SGE discovers that the job is no longer there, it will reschedule
it automatically when the application-level interface is used, so
killing the job in the migration script is both safe and necessary.
But if the migration script lets the job run to a proper end, so that
it is gone from the node, then SGE will start the same job again...
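
As a rough illustration of the "kill after a given interval of suspension" idea, here is a minimal sketch. It is not Kirk's script: the function name, the grace period and the still_suspended callback are hypothetical, and how a migration command learns the job's PID and whether the job is still suspended depends on the local setup.

```python
import os
import signal
import time


def kill_if_suspended_too_long(pid, grace_seconds, still_suspended,
                               poll_seconds=30, sleep=time.sleep,
                               kill=os.kill):
    """Kill 'pid' if the job stays suspended for the whole grace period.

    'still_suspended' is a caller-supplied function (an assumption of this
    sketch) that returns True while the job is suspended.
    Returns True if the job was killed, False if it resumed in time.
    """
    waited = 0
    while waited < grace_seconds:
        if not still_suspended():
            return False          # job resumed, leave it alone
        sleep(poll_seconds)
        waited += poll_seconds
    if not still_suspended():
        return False
    # The job is gone after this; SGE then notices and reschedules it.
    kill(pid, signal.SIGKILL)
    return True
```

The script should not let the job finish normally; it only kills it and leaves the rescheduling to SGE, as described above.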

> If I interpret this correctly, you have to do this yourself; SGE
> will not. If there have been other efforts to make this work, please
> let me know.
> Or if there are other comments as to this technique for taking  
> advantage of a dedicated set of queues and lower priority use of  
> general queues...

You can find some good state transition diagrams for the various
checkpointing interfaces here:


Maybe another checkpointing interface is more appropriate for your  

Other sources of information:

man sge_ckpt
man checkpoint

HTH - Reuti

> I did initially find Sean Dilda's thread:
> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=15053
> on high and low priority queues, but I don't think it avoids the
> possibility of running 4 jobs on the same machine via the two 2-slot
> queues. The load_threshold technique is nifty, but I think the
> subordinate queue strategy reaches a similar result and will not
> allow more than 3 concurrent jobs on a node if you suspend on 2.
> I could also work around this by creating a third set of queues that
> takes a section of the general queues for long-running jobs, so
> suspension periods would be minimized, but then you are limiting
> potentially free resources.
> -Patrice
> -- 
> Patrice Seyed
> Linux System Administrator - LinGA
> Boston University Medical Campus
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
