[GE users] Specifying maximum number of jobs per node

Bradford, Matthew matthew.bradford at eds.com
Mon Sep 22 10:00:09 BST 2008

We have a similar problem to Craig, and I don't think the suggested
solution quite fits our requirements.
We have a cluster containing both 4-core and 8-core nodes, with all
nodes allowed to run any job when they are available. We don't want
to partition the cluster by type of batch/PE job, so any job can run
on any node. However, we also have a requirement that if an MPI
parallel job spanning more than one node is running, then no other
jobs may run on those nodes. If a serial, single-core job is running
on a node, other single-core jobs can also run on that node, but no
parallel jobs can be started there.
We currently use mutual subordination between queues: a parallel
queue with a single slot and various PEs, and a serial queue with 1
slot per core.
Because queue subordination prevents resource reservation from
functioning correctly, we are looking at a configuration with a
single queue (or as few queues as possible), 1 slot per core, and no
queue subordination. When users only need to request a number of
cores for a job, this is fine, as we can have parallel environments
with allocation rules locked down to either 4 or 8 cores.
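For illustration, a locked-down PE of this kind looks something like
the following (the PE name mpi_4 and the slot total are just
examples, as shown by qconf -sp):

```
pe_name            mpi_4
slots              9999
allocation_rule    4
control_slaves     TRUE
job_is_first_task  FALSE
```

With allocation_rule set to a fixed number, SGE only grants the job
exactly that many slots on each node it selects.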
If a user submits a request such as:
qsub -pe mpi_* 32 mpi_application
then SGE will fit the job onto either eight 4-core machines or four
8-core machines, which is fine, and the usage accounting is accurate
(we are accounted for as NSLOTS x time).
The problem we have is that a user may sometimes want to specify the
number of nodes over which to execute the job, using only 2 cores
per node, such as:
qsub_wrapper -pe mpi_* 8x2 mpi_application
but they don't want any other jobs to be able to start on those nodes.
If we multiply the requested node count by 4 in the qsub_wrapper,
the job could run on eight 4-core nodes, as the requested 32 slots
would use up all the slots on those nodes, and the start-up script
for the selected parallel environment would then modify the PE
machine file so that each node is listed only twice. That way SGE
thinks the node is full, the usage is accounted correctly, but the
integrated PE only tries to start 2 processes per node.
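The machine-file rewrite in the PE start-up script amounts to
something like this minimal sketch (cap_hostfile is a hypothetical
helper, not our production script; a $PE_HOSTFILE line has the form
"hostname slots queue processors"):

```shell
#!/bin/sh
# Sketch: cap each host's slot count in the PE machine file at 2, so
# the PE starts at most 2 ranks per node even though the job was
# granted every slot on the node.
cap_hostfile() {
    # $1 = path to the PE hostfile; rewrites it in place
    tmp=$(mktemp)
    awk '{ if ($2 > 2) $2 = 2; print }' "$1" > "$tmp" && mv "$tmp" "$1"
}
```

The job still occupies all slots as far as SGE's scheduler and
accounting are concerned; only the MPI process count per node is
reduced.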
This is fine in a homogeneous cluster where all nodes have the same
number of cores, as it lets us multiply each slot request by a
constant. In a cluster that mixes 4- and 8-core machines, we don't
know what constant to multiply the slot request by in the
qsub_wrapper at submission time, so in the above example the job may
run on four 8-core machines rather than eight 4-core machines.
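In the homogeneous case the wrapper's arithmetic is trivial; a
minimal sketch (the NxP request syntax and the wrapper itself are
our local convention, not stock SGE):

```shell
#!/bin/sh
# Sketch for a homogeneous cluster: convert a request "NxP" (N nodes,
# P processes per node) into a plain slot count by multiplying N by
# the fixed core count per node. P is handled later by the PE
# start-up script, not here.
CORES_PER_NODE=4   # assumption: every node has 4 cores

parse_request() {
    # $1 = "NxP"; prints the total slot count to pass to qsub -pe
    nodes=${1%x*}
    echo $((nodes * CORES_PER_NODE))
}
```

In a mixed 4/8-core cluster there is no single CORES_PER_NODE value,
which is exactly the problem described above.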
We need to be able:
1. to allow users to specify the number of nodes,
2. to give such jobs exclusive access to their nodes,
3. to account correctly using the RESERVED_USAGE parameters (1 slot
per core, with all slots on a node consumed by a running job),
4. to avoid subordination, as it breaks resource reservation.
If this doesn't make any sense, then I'll have another go at
explaining it. Any help would be much appreciated.
Thanks very much,

More information about the gridengine-users mailing list