[GE users] Scheduler tuning

Mark Dixon m.c.dixon at leeds.ac.uk
Wed Nov 19 10:25:44 GMT 2008

On Wed, 19 Nov 2008, Robert Healey wrote:

> I'm currently using that flag, doesn't seem to help too much.  I also
> use slots as the scheduling criteria instead of load.

Hi Bob,

This may or may not be useful to you, but I'll describe what we've been 
doing on an older cluster and the problems we encountered with SGE.

We have a cluster with 348 cores (4 cores per node), connected to a single 
Myrinet 2000 switch. It's running SGE 6.0u7_1 (yep, I should upgrade).

What we did:

* Set the PE allocation_rule to $fill_up, as you have done.

* Set the scheduler queue_sort_method to seqno, as you have done. This
   helped reduce the Myrinet component hop count within a parallel job.

* _Hope_ that the majority of users submit a parallel job divisible by the
   number of cores per node, which would reduce fragmentation.

* Separated out serial jobs into their own queue on the same hosts, and
   made the two queues mutually subordinating. This ensured a node was
   only running parallel jobs, or only running serial jobs, which reduced
   fragmentation of parallel jobs.
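
To make the divisibility point concrete, here is a minimal Python sketch of
the arithmetic. It assumes the 4-core nodes described above and a job that
is packed onto otherwise empty nodes; the function names are made up for
illustration and are not part of SGE.

```python
CORES_PER_NODE = 4  # taken from the cluster described above


def round_up_to_whole_nodes(slots, cores_per_node=CORES_PER_NODE):
    """Return the smallest multiple of cores_per_node that is >= slots."""
    if slots < 1:
        raise ValueError("slots must be a positive integer")
    # Ceiling division using integers only.
    nodes = -(-slots // cores_per_node)
    return nodes * cores_per_node


def leftover_cores(slots, cores_per_node=CORES_PER_NODE):
    """Cores left unused on the last node when the job fills empty nodes."""
    return round_up_to_whole_nodes(slots, cores_per_node) - slots


if __name__ == "__main__":
    for requested in (8, 10, 13):
        print(requested, round_up_to_whole_nodes(requested),
              leftover_cores(requested))
```

A 10-slot request, for example, occupies three nodes and leaves 2 cores on
the last one, which is where another job's fragment can end up.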

Problems we encountered:

* I found that "seqno" only worked as I expected when I gave each queue
   instance a different sequence number.

   e.g. in the queue definition:

   seq_no                799,[comp00=800],[comp01=801],[comp02=802]

* Although hoping that users do something is not a great strategy, it
   actually worked pretty well. However, there's a new feature coming in
   early 2009 called "Job Submission Verifier":


   If I've understood it properly, it can be used to round-up parallel jobs
   to the nearest whole-node number of cores. This will mean I can stop
   "hoping" that users submit parallel jobs divisible by a particular
   number, and therefore help reduce fragmentation.

* Mutually-subordinating queues is a bad idea. Really bad. The Resource
   Reservation feature doesn't understand this use case: it can only
   reserve unsubordinated resources. With 6.0u7_1 we also occasionally see
   jobs being scheduled to both queues on the same host at the same time -
   but I've been unable to reproduce this on a test system.

   I don't really have a simple alternative to this. Serial jobs could
   fragment parallel jobs quite badly. One idea is to have the parallel
   queue subordinate the serial, and for serial jobs to be then
   checkpointed and migrated using BLCR. Published methods
   to do this rely on a specially-written user submission script:


   However, I've developed a config for SGE which allows BLCR to
   transparently checkpoint SGE jobs. I really ought to write this up and
   submit it to the list.
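
Giving every queue instance its own sequence number, as in the seq_no
example under the first problem above, is tedious to type by hand for many
hosts. A minimal Python sketch of generating that value follows. The
function name is hypothetical, the default of 799 and the starting value
of 800 simply copy the example, and the host list is an assumption to be
replaced with the real host names.

```python
def seq_no_value(hosts, default=799, first=800):
    """Build a seq_no value with one distinct override per host.

    The result has the form: default,[host0=first],[host1=first+1],...
    """
    overrides = [
        "[{}={}]".format(host, first + offset)
        for offset, host in enumerate(hosts)
    ]
    return ",".join([str(default)] + overrides)


if __name__ == "__main__":
    # Host names here follow the example above; adjust to the real cluster.
    hosts = ["comp00", "comp01", "comp02"]
    print("seq_no                " + seq_no_value(hosts))
```

The printed line can then be pasted into the queue definition, for
instance while editing the queue with qconf -mq.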

Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK

