[GE users] Limit Number of Jobs on Exec Hosts
dag at sonsorol.org
Wed Jan 20 20:46:05 GMT 2010
> I know about setting resource requirements but some of the larger
> simulations can use up a large amount of memory, sometimes up to 6GB or
> more. If I can change it so that only one or two jobs can run on each
> execution host then I don't have to worry about two jobs starting up on
> the same host which may both end up using 6GB, crashing the system in
Controlling the number of jobs allowed to run on an execution host is
one of the easiest things to do in Grid Engine so you should be able to
make some progress in that area ...
Grid jobs run in "job slots" and the number of job slots available on
the execution host will be your upper limit on the number of jobs that
can run concurrently. If you set the number of job slots to 2 on each
host then you'll never have more than 2 jobs running on it at a time.
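As a rough sketch of what that looks like from the command line, the helpers below show the current slots line for a queue and set a new queue-wide slot count. The function names and the DRY_RUN switch are my own inventions for illustration; the "qconf -mattr queue slots ..." form is how I would expect to change the attribute without opening an editor, but check it against the qconf man page for your version before relying on it.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are hypothetical, not part of SGE.

# Run a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show the slots line of a queue (defaults to all.q).
# Per-host overrides appear in brackets, e.g. "slots 1,[node1=2]".
show_slots() {
    qconf -sq "${1:-all.q}" | grep '^slots'
}

# Set the queue-wide slot count: $1 = queue name, $2 = slots per host.
set_queue_slots() {
    run qconf -mattr queue slots "$2" "$1"
}
```

Running "DRY_RUN=1 set_queue_slots all.q 2" only prints the qconf command, which is a cheap way to check what would be changed before touching a live queue.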
There are some permutations here that I'll list, hopefully with enough
keywords to help you find the details in the documentation:
1. If you want to hard code slot numbers for execution hosts you can do
this via the queue configuration process. Do a "qconf -sq all.q" and
look at the slots line to see what the syntax looks like.
2. If you have big threaded/SMP jobs that you know will need all of the
available resources on the system and you are using SGE 6.2 or later
then you can take advantage of "exclusive host access". In this scenario
you might have a node with 2 slots free but if you submit an exclusive
job then it will block both slots while running so it gets the machine
2.5 If you are not running 6.2 or later and still want to allow a "big"
job to take over all the available slots on the execution host then
google around for "threaded PE hack" or search for "threaded PE" on
gridengine.info for advice on how to abuse the SGE parallel environment
tools to get the behavior you desire.
3. A more dynamic way to adjust the number of jobs allowed to run at a
time is via the Resource Quota System ("RQS"). You can quickly and easily
alter resource quota rules to constrain the number of jobs allowed and
where they can run. The command line interface to this system is
scriptable enough that you could automate it if it made sense.
4. You mentioned that your machines are used by people and tasks outside
of SGE. You should look carefully at the load sensor and load threshold
bits of SGE's documentation. There are multiple ways that a compute node
can be configured to monitor "how busy" the system is with non-SGE jobs.
When the "busy" threshold is exceeded, SGE will stop sending new jobs to
that host until the load drops back below the configurable threshold.
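To make option 3 a bit more concrete, here is a minimal sketch of scripting a resource quota rule that caps the slots used on each execution host. The helper names, the DRY_RUN switch and the rule set name are made up for this example; the rule text follows the resource quota file format as I understand it, and "qconf -Arqs" adds a rule set from a file. Treat it as a starting point, not a tested recipe.

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN and the rule set name are hypothetical.

# Run a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print a rule set: $1 = rule set name, $2 = max slots per host.
# "hosts {*}" applies the limit to each host separately, not to the sum.
render_rqs() {
    cat <<EOF
{
   name         $1
   description  "Limit slots per execution host"
   enabled      TRUE
   limit        hosts {*} to slots=$2
}
EOF
}

# Write the rule set to a temp file and add it with qconf.
apply_rqs() {
    tmp=$(mktemp) || return 1
    render_rqs "$1" "$2" > "$tmp"
    run qconf -Arqs "$tmp"
    rm -f "$tmp"
}
```

Calling "render_rqs max_slots_per_host 2" lets you read the rule before anything is loaded, and "qconf -srqs" should list what is currently in place.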