[GE users] Limit Number of Jobs on Exec Hosts

craffi dag at sonsorol.org
Wed Jan 20 20:46:05 GMT 2010

myiagros wrote:
> I know about setting resource requirements but some of the larger
> simulations can use up a large amount of memory, sometimes up to 6GB or
> more. If I can change it so that only one or two jobs can run on each
> execution host then I don't have to worry about two jobs starting up on
> the same host which may both end up using 6GB, crashing the system in
> the process.

Controlling the number of jobs allowed to run on an execution host is 
one of the easiest things to do in Grid Engine so you should be able to 
make some progress in that area ...

Grid jobs run in "job slots" and the number of job slots available on 
the execution host will be your upper limit on the number of jobs that 
can run concurrently. If you set the number of job slots to 2 on each 
host then you'll never have more than 2 jobs running there at a time.

There are some permutations here that I'll list, hopefully with enough 
keywords to help you find the details in the documentation:

1. If you want to hard-code slot counts for execution hosts you can do 
this via the queue configuration. Run "qconf -sq all.q" and look at the 
slots line to see what the syntax looks like.
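
A rough sketch of what that looks like in practice, assuming the default 
"all.q" queue and a made-up host name (node01). Treat the numbers as 
placeholders and check the queue_conf man page for your version:

```shell
# Assumptions: queue "all.q" exists; 2 slots per host is the goal.
QUEUE="all.q"
SLOTS=2

# In the queue configuration the slots line takes a default value plus
# optional per-host overrides in brackets, e.g. (node01 is hypothetical):
#   slots                 2,[node01=1]

if command -v qconf >/dev/null 2>&1; then
    # Show the current slots line.
    qconf -sq "$QUEUE" | grep '^slots'
    # Change the default slot count without opening an editor.
    qconf -mattr queue slots "$SLOTS" "$QUEUE"
else
    echo "qconf not found; run this on an SGE admin host" >&2
fi
```

Keep in mind that slots are counted per queue, so a host that sits in 
several queues can still end up with more jobs than any one slots line 
suggests.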

2. If you have big threaded/SMP jobs that you know will need all of the 
available resources on the system and you are using SGE 6.2 or later 
then you can take advantage of "exclusive host access". In this scenario 
you might have a node with 2 slots free, but if you submit an exclusive 
job it will block both slots while running so that it gets the machine 
to itself.
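
The usual recipe for exclusive access, as far as I know, is a boolean 
complex with the EXCL relation that you attach to each host and then 
request at submit time. A minimal sketch, assuming your 6.2 update level 
supports EXCL; the complex name "exclusive", the host "node01" and the 
script "bigjob.sh" are all made up for illustration:

```shell
# Hypothetical names; adjust to your site.
HOST="node01"
# Line to add to the complex list (edit it with "qconf -mc").
# Columns: name shortcut type relop requestable consumable default urgency
COMPLEX_LINE="exclusive excl BOOL EXCL YES YES 0 1000"

if command -v qconf >/dev/null 2>&1; then
    echo "Add this line via 'qconf -mc' first: $COMPLEX_LINE"
    # Mark the host as able to grant exclusive access.
    qconf -mattr exechost complex_values exclusive=true "$HOST"
    # Then a job can ask for the whole machine, for example:
    #   qsub -l exclusive=true bigjob.sh
else
    echo "qconf not found; run this on an SGE admin host" >&2
fi
```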

2.5 If you are not running 6.2 or later and still want to allow a "big" 
job to take over all the available slots on the execution host then 
search for "threaded PE hack" on Google or "threaded PE" on 
gridengine.info for advice on how to abuse the SGE parallel environment 
tools to get the behavior you desire.

3. A more dynamic way to adjust the number of jobs allowed to run at a 
time is via the Resource Quota System ("RQS"). You can quickly and 
easily alter resource quota rules to constrain the number of jobs 
allowed and where they can run. The command line interface to this 
system is scriptable enough that you could automate it if it made sense.
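
To give an idea of how scriptable this is, here is a sketch that writes 
a rule set to a temporary file and loads it. The rule name 
"max_slots_per_host" and the limit of 2 are my own placeholders, not 
anything from your cluster:

```shell
# Write a resource quota rule set to a temporary file.
RQS_FILE="$(mktemp)"
cat > "$RQS_FILE" <<'EOF'
{
   name         max_slots_per_host
   description  "At most 2 slots in use on any single host"
   enabled      TRUE
   limit        hosts {*} to slots=2
}
EOF

if command -v qconf >/dev/null 2>&1; then
    # Add the rule set from the file, then list what is configured.
    qconf -Arqs "$RQS_FILE"
    qconf -srqs
else
    echo "qconf not found; rule set left in $RQS_FILE" >&2
fi
```

The "{*}" form applies the limit to each host individually rather than 
to all hosts combined, and "qquota" should show how much of each limit 
is currently in use.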

4. You mentioned that your machines are used by people and tasks outside 
of SGE. You should look carefully at the load sensor and load threshold 
bits of SGE's documentation. There are multiple ways that a compute node 
can be configured to monitor "how busy" the system is with non-SGE jobs. 
When the "busy" threshold is exceeded, SGE stops sending new jobs to 
that node until the load drops back below the configurable threshold.
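
As a starting point for the threshold side of this, a sketch assuming 
the default "all.q" queue and the built-in np_load_avg value (load 
average divided by the number of CPUs). The 1.0 figure is only an 
example, pick whatever suits your machines:

```shell
# Assumptions: queue "all.q"; close the node once load per CPU passes 1.0.
QUEUE="all.q"
THRESHOLD="np_load_avg=1.0"

if command -v qconf >/dev/null 2>&1; then
    # Show the current setting (1.75 is the usual default).
    qconf -sq "$QUEUE" | grep '^load_thresholds'
    # Lower the threshold so outside load closes the queue sooner.
    qconf -mattr queue load_thresholds "$THRESHOLD" "$QUEUE"
    # Queue instances over the threshold show state "a" (alarm);
    # this prints the reason for each one.
    qstat -f -explain a
else
    echo "qconf not found; run this on an SGE admin host" >&2
fi
```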


