[GE users] SGE 5.3p6 one node getting disproportionate hits

Dr R L Oswald, Cranfield University UK R.L.Oswald at cranfield.ac.uk
Tue Jun 29 12:01:37 BST 2004

We have a small cluster of 15 identical dual-CPU Xeon servers (RH9) 
running SGE 5.3p6, intended as an HPC batch/compute grid facility for 
student compute jobs (Fluent etc.), single-CPU jobs only. Two further 
servers act as qmasters/shadows/front-end login hosts.

SGE configuration is done via master scripts. Each server has four 
queues corresponding to "short", "medium", "long" and "verylong" resource 
profiles, requestable with the -l option. Thus every host and queue has 
the same configuration apart from hostname and queue name. Scheduling is 
by load, though each queue has a unique sequence number, increasing 
serially.

Users cannot select a particular queue, as we have made the queuename 
object non-requestable, so jobs just go to the first available 
processor (only 1 job per CPU allowed).
From tests, we expected that the lower-numbered hosts would 
statistically handle more jobs than the higher-numbered ones.

e.g. SGE allocates jobs - host1/cpu0 host2/cpu0.... host15/cpu0 then it 
goes back to host1/cpu1 host2/cpu1... host15/cpu1 (modified by load of 

Statistical analysis of the SGE accounting data shows this approximate 
trend over hosts 1...15, but there is a huge anomaly: the 10th host gets 
a disproportionate number of jobs. Looking at nearly 2000 job records 
over one month of use, the 10th host has taken 25% of the jobs, whereas 
the first host, which has the next-largest share, has taken only 10%.
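
For anyone wanting to repeat the per-host tally, here is a minimal 
sketch of the kind of count described above. It assumes the accounting 
file is the usual colon-separated format with the execution hostname as 
the second field (check accounting(5) for your release), and that the 
file lives somewhere like $SGE_ROOT/default/common/accounting. The 
function names jobs_per_host and shares are made up for illustration and 
are not part of SGE.

```python
from collections import Counter


def jobs_per_host(lines):
    """Count accounting records per execution host.

    Assumes colon-separated records with the hostname in the second
    field; blank lines and '#' comment lines are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 2:
            continue  # malformed record, ignore it
        counts[fields[1]] += 1
    return counts


def shares(counts):
    """Convert per-host job counts into percentages of all jobs."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {host: 100.0 * n / total for host, n in counts.items()}
```

Passing an open file handle for the accounting file to jobs_per_host and 
then printing shares(...) sorted by percentage should make an outlier 
host stand out. Note that this counts by host, so the four queues on 
each server are lumped together.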

Has anyone seen similar scheduling behaviour with SGE 5.3, or can 
anyone offer an explanation for it?

Les Oswald
HPC Support
Cranfield University
