[GE users] Preemption vs dedicating nodes by group?

Marconnet, James E Mr /Computer Sciences Corporation james.marconnet at smdc.army.mil
Thu Apr 28 17:06:38 BST 2005

Sorry, I'm still struggling to find a suitable "scheme" using 6.0u3 for our
two competing groups.

At the moment, half our queue instances are dedicated to one group and the
other half to the other group, so either group can get their runs to start
immediately when they walk up to the cluster. This wastes half our nodes
most of the time. We had primary/secondary queues set up, but when the
cluster was full, the users found it unacceptable to have to wait for an
already-running job on the cluster to finish before their first new job
could start.

GE has a wonderful ability to assign jobs to nodes, but once jobs are
started on all available queue instances, without a good preemption scheme
the next user who shows up must wait for at least one job to finish before
any of his newly submitted jobs can start (FIFO). Otherwise the nodes must
be oversubscribed, which seems to cause intermittent, subtle run problems
on our cluster.
I'm researching a preemption scheme. Otherwise we will have to continue
dedicating all or a percentage of our nodes by group. The higher the
percentage of nodes we dedicate, the more of the cluster gets wasted
whenever one group's usage is lower than the other's.

I read the following in Issue 35:


Set up a preemption scheme by using a customized combination of
suspend_thresholds/load_thresholds in the subordinated and superordinated
queues.

Preemption of jobs running in the subordinated queue can be achieved by
using more tolerant load thresholds in the higher-priority queue (e.g.
load_thresholds np_load_avg=1.5) than in the low-priority queues (e.g.
load_thresholds np_load_avg=0.75).

This ensures that jobs are still dispatched into the high priority queue
even if the machine is already full due to low priority jobs. 

To initiate preemption of jobs in the low-priority queue, a suspend
threshold (e.g. suspend_thresholds np_load_avg=1.25) is used.

In combination with load adjustments (see 'job_load_adjustments' in
sched_conf(5)), more immediate preemption can be achieved.


This is promising, except for several concerns:

Subordination on multiple-slot machines is suboptimal (hence Issue 35);
about half our nodes are dual-processor hyper-threaded. But we could live
with this rather than wasting half our nodes most of the time.

Using the high-priority queue will most likely result in oversubscription
of nodes (not acceptable to us).

Suspended jobs still take up RAM (this could be a problem with larger
jobs; don't know yet).

If jobs were killed instead (not sure how this could be done automatically
anyway), their accumulated CPU time would be lost, resubmission (automatic
or manual) would be required, and the users would wonder what happened to
their jobs (probably not acceptable).

TIA for any suggestions to clear away my haze on this subject.

Jim Marconnet

To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
