[GE users] chkpt/migrate dealing with suspension for dedicated and shared resource balancing

Patrice Seyed apseyed at bu.edu
Wed May 23 15:08:54 BST 2007


I'm configuring a scenario in which one set of queues has a certain 
resource attached for a specific project, and the rest are general 
queues for all groups to use. For each general queue I will create 
another queue that is subordinate to it and is suspended when 2 jobs 
are running in the general queue, similar to "express" queues, but 
without the time limit.

To take advantage of these I will have a script that automates 
submission to the project-specific queues using the "-now y" directive. 
If a job does not run immediately, the script will resubmit it to the 
subordinate queue with the same "-now y" directive. If that cannot run 
it either, the script will queue the job in the project-specific queue, 
up to a certain number of jobs.
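
The fallback order described above could be sketched roughly as 
follows. This is only an illustration in Python: the function names, 
the queue names passed in, and the pending_in_project_q callable (which 
would have to count pending jobs somehow) are all hypothetical, and it 
assumes qsub exits with a non-zero status when a "-now y" job cannot be 
scheduled immediately.

```python
import subprocess


def run_qsub(args):
    """Run qsub with the given arguments; True if it exited with status 0."""
    return subprocess.run(["qsub"] + args).returncode == 0


def submit(script, project_q, subordinate_q,
           pending_in_project_q, max_pending, qsub=run_qsub):
    """Try the three submission steps in order and report which one worked.

    pending_in_project_q is a hypothetical callable returning the number
    of jobs already waiting for the project-specific queue.
    """
    # 1. Project-specific queue, run now or be rejected.
    if qsub(["-now", "y", "-q", project_q, script]):
        return "project-now"
    # 2. Subordinate queue, same run-now directive.
    if qsub(["-now", "y", "-q", subordinate_q, script]):
        return "subordinate-now"
    # 3. Queue in the project-specific queue, up to a fixed backlog.
    if pending_in_project_q() < max_pending:
        if qsub(["-q", project_q, script]):
            return "project-queued"
    return "rejected"
```

Passing a different qsub callable makes it possible to dry-run the 
logic without a cluster.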

The one problem I see is that if jobs go to the subordinate queue and 
are then suspended, there is no predicting how long they will sit in 
the S state. I found the following thread on checkpointing/migrate:

http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=5589
with entries from Kirk Patton and Charu V. Chaubal

I see that the script supplied by Kirk will kill the script after a 
given interval of suspension. Should I assume it is meant to be tweaked 
to taste, so that the same script resubmits the job after killing it? 
If I interpret this correctly, you have to do this yourself; SGE will 
not. If there have been other efforts to make this work, please let me 
know.
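
For illustration, the kill-and-resubmit step might look something like 
the sketch below. It is not Kirk's script. It assumes something else 
records when each job was first seen in the S state (the suspended_since 
mapping here is hypothetical), and it only builds the qdel and qsub 
command lines rather than running them.

```python
def jobs_to_requeue(suspended_since, now, max_suspended_seconds):
    """Return ids of jobs suspended for at least max_suspended_seconds.

    suspended_since maps job id -> time (in seconds) at which the job
    was first observed as suspended; how that is gathered is left open.
    """
    return sorted(
        job_id
        for job_id, since in suspended_since.items()
        if now - since >= max_suspended_seconds
    )


def requeue_commands(job_id, script, queue):
    """Commands that delete a job and then submit its script again."""
    return [["qdel", str(job_id)], ["qsub", "-q", queue, script]]
```

Each command list could then be handed to subprocess.run by whatever 
loop polls the job states.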

I would also welcome other comments on this technique for combining a 
dedicated set of queues with lower-priority use of the general queues.

I did initially find Sean Dilda's thread:
http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=15053

on high and low priority queues, but I don't think it avoids the 
possibility of running 4 jobs on the same machine, via the 2 2-slot 
queues. The load_threshold technique is nifty, but I think the 
subordinate queue strategy reaches a similar result and will not allow 
more than 3 concurrent jobs on a node if you suspend on 2.
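
The limit of 3 can be checked with a small model. It assumes that both 
queues have 2 slots per node and that every job in the subordinate 
queue is suspended once the general queue reaches the threshold; the 
function names are made up for this example.

```python
def running_jobs(general_jobs, subordinate_jobs, threshold):
    """Jobs actually running on a node for a given occupancy."""
    if general_jobs >= threshold:
        # Subordinate queue is suspended; only general jobs run.
        return general_jobs
    return general_jobs + subordinate_jobs


def max_running(general_slots, subordinate_slots, threshold):
    """Worst-case number of concurrently running jobs on one node."""
    return max(
        running_jobs(g, s, threshold)
        for g in range(general_slots + 1)
        for s in range(subordinate_slots + 1)
    )
```

With max_running(2, 2, 2) the worst case is 1 general job plus 2 
subordinate jobs. A threshold that is never reached, such as 
max_running(2, 2, 3), gives the 4-job case of two independent 2-slot 
queues.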

I could also work around this by creating a third set of queues that 
sets aside a section of the general queues for long-running jobs, so 
that suspension periods are minimized, but that limits the use of 
potentially free resources.

-Patrice

-- 
Patrice Seyed
Linux System Administrator - LinGA
RHCE, SCSA
Boston University Medical Campus 
