[GE users] Newbie question - queues, queue instances, and slots
gladden at chem.washington.edu
Mon Jun 1 23:56:48 BST 2009
Thank you for your prompt reply. I stumbled into the question posed in my post after reading the document titled "BEGINNER'S GUIDE TO
SUN™ GRID ENGINE 6.2 - Installation and Configuration White Paper - September 2008", which I found on the SGE project website (https://www.sun.com/offers/details/Sun_Grid_Engine_62_install_and_config.html). The document includes an appendix titled "Appendix A Configuring an SMP/Batch Cluster".
This appendix contains suggestions for setting up a cluster that will run shared memory jobs, non-shared memory parallel jobs, and serial jobs. The appendix suggests setting up at least two queues, one for SMP jobs and one or more for everything else. It further suggests changing the "queue_sort_method" from "load" to "seqno" and then explicitly providing sequence numbers for each queue instance - with the queue instances for the SMP queue being ordered in a monotonically increasing fashion and the queue instances for the other queues being numbered in the opposite manner. If I understand this suggestion correctly, this will cause the scheduler to "pack" the non-SMP jobs onto nodes starting from one end of the node pool, and allocate nodes to SMP jobs starting from the other end. This apparently avoids the situation wherein the scheduler effectively balkanizes the node pool by scattering non-SMP jobs among the least loaded nodes.
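If I read the appendix right, the configuration it describes would look something like the following sketch (queue and host names here are hypothetical, not from the white paper):

```shell
# Switch the scheduler from load-based to sequence-number-based sorting.
# (qconf -msconf opens the scheduler config in an editor; set the line shown.)
qconf -msconf
#   queue_sort_method   seqno

# Number the SMP queue's instances ascending and the serial queue's
# descending, so the two workloads fill the pool from opposite ends
# (hypothetical queue/host names):
qconf -mq smp.q
#   seq_no   0,[node01=1],[node02=2],[node03=3]
qconf -mq serial.q
#   seq_no   0,[node01=3],[node02=2],[node03=1]
```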
After some experimentation, I became suspicious that this suggestion would in fact create a system which, when heavily subscribed, would allow the scheduler to assign at least two jobs to each CPU - one from the SMP queue and one from each of the other queues. This led me to perform the two-queue experiment that was the subject of my original post.
So, with that preamble, let me back up and explain what I am trying to do. We are setting up a new cluster for academic research computing that will consist of 55 dual-quad core nodes. The work load is primarily a mixture of serial jobs (locally developed Monte Carlo codes, etc.) and parallel jobs (mostly Gaussian and VASP). The parallel jobs are capable of using both shared memory and the network interconnect for inter-process communication.
In general, the parallel codes don't scale all that well beyond 8 processes, so the most commonly desired scenario will be to allocate an entire node to a parallel job. I have done some testing of the SGE parallel environment, and it appears that if I set up a PE with "allocation_rule=$pe_slots" then jobs requesting this PE will always have all their processes allocated on the same node. So it appears that I can handle this case just by setting up a PE for this specific purpose.
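For the record, the PE I tested looks roughly like this (a sketch; the PE name "smp" is my own choice):

```shell
# A PE whose allocation rule keeps all of a job's slots on one host:
qconf -sp smp
#   pe_name            smp
#   slots              999
#   allocation_rule    $pe_slots

# An 8-way job submitted against it then lands all 8 processes on a
# single node:
qsub -pe smp 8 job.sh
```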
There are, however, a couple of slightly more complicated parallel cases. First, benchmarking indicates that it is not always productive, when running Gaussian or VASP, to actually utilize all eight cores on a node - memory bandwidth starvation results. So in some cases it might well be useful if a job could request exclusive use of two or more nodes, with the explicit intent of actually only starting four processes on each node. Conversely, there may be a few cases where utilizing 16 CPUs, allocated exclusively across two nodes, may prove useful. As near as I can determine, SGE does not offer the ability to request specific "job geometries" (n processes on m nodes), so I am not clear as to how one handles cases like these.
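For what it's worth, allocation_rule also accepts a fixed integer (processes per host), which gets part way toward a fixed geometry, though it does not by itself give exclusive use of the nodes. A sketch, with a hypothetical PE name:

```shell
# A PE that places exactly 4 processes per host; a request for 8 slots
# then spreads as 4 processes on each of 2 nodes (no exclusivity implied):
qconf -sp mpi4
#   pe_name            mpi4
#   slots              999
#   allocation_rule    4

qsub -pe mpi4 8 job.sh   # 8 slots / 4 per host = 2 hosts
```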
Serial jobs are likely to be the minority case, at least if you count them by actual CPU seconds used. As I alluded to earlier, in this case I believe it is desirable to have the scheduler "pack" serial jobs onto nodes rather than spread them out across idle nodes. The latter scenario could seriously impact the parallel job throughput by making it hard for the scheduler to find empty nodes for parallel jobs that require dedicated use of a node. We could of course divide the nodes into separate pools for serial and parallel job use, but this does not seem likely to produce an efficient use of resources.
I suspect there is nothing particularly unusual about our job mix or computing environment, so I assume that other admins have found satisfactory solutions for these situations. I would appreciate any suggestions you might be able to offer.
A few things:
- Your priority setting has no effect on SGE or the order in which jobs
are dispatched. The param you are setting effectively sets the
Unix nice level of your tasks - this is an OS thing that has nothing
to do with SGE policies, resource allocation, or scheduling. Most
people don't use this parameter unless they are intentionally
oversubscribing (more running jobs than CPUs on each machine).
- There are a few different ways to get at what you want; not sure
if it's worth going through the details if you are just experimenting
at this point. If you can clearly explain what you want the system to
do, we can probably suggest ways to implement it.
If you insist on keeping 2 cluster queues, you may want to search the
SGE docs for information on "subordinate queues" - that would let you
set up a system by which the low priority queue stops accepting work
when the high priority queue is occupied.
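A minimal sketch of that, assuming your existing queue names ("all.q" and "high"): the superordinate queue lists the queues to suspend in its subordinate_list.

```shell
# On the high-priority queue, list all.q as a subordinate; all.q's
# instance on a host is suspended once the given number of high's
# slots on that host are in use:
qconf -mq high
#   subordinate_list   all.q=1   # suspend all.q when >=1 high slot is busy on the host
```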
You may also want to read up on SGE Resource Quotas, which are a
fantastic tool - reading the docs on that may give you some ideas for
resource quotas that would let you simplify the queue structure. For
example, it would be trivial to create a global resource quota that
does not let the system have more than 10 active jobs at any one time
-- this is one way to deal with the "20 jobs / 10 processors" issue you
describe.
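A sketch of such a resource quota set (added with `qconf -arqs`; the rule name is made up):

```shell
# Global cap on occupied slots, regardless of which queue a job came
# through (rule name is hypothetical):
{
   name         max_total_slots
   description  "No more than 10 occupied slots cluster-wide"
   enabled      TRUE
   limit        to slots=10
}
```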
Welcome to SGE!
On Jun 1, 2009, at 2:29 PM, jagladden wrote:
I am new to SGE, so I am trying it out on a small test cluster as a
first step. Having done some experiments, I find myself a little
confused about how SGE handles queue instances and slots.
My test cluster has two compute nodes, with a total of 10 cores, as
shown by 'qhost':
[root at testpe bin64]# qhost
HOSTNAME                ARCH         NCPU  LOAD   MEMTOT   MEMUSE
-----------------------------------------------------------------
global                  -               -     -        -        -
compute-0-0             lx26-amd64      2  0.00     2.0G   102.8M
compute-0-1             lx26-amd64      8  0.00    15.7G   119.9M
I have set up two cluster queues. The first of these is the
standard default queue 'all.q' as shown by 'qconf -sq':
[root at testpe ~]# qconf -sq all.q
qtype BATCH INTERACTIVE
pe_list make mpich mpi orte
The second is a "high priority" queue, which is identical except for
having a higher default job priority:
[root at testpe ~]# qconf -sq high
qname                 high
hostlist              @allhosts
priority              10
qtype                 BATCH INTERACTIVE
pe_list               make
slots                 1,[compute-0-0.local=2],[compute-0-1.local=8]
My point of confusion arises when I submit jobs to both these
queues. There are only 10 CPU's available, and I would expect the
queuing system to only allow a maximum of 10 jobs to run at any one
time. What happens in practice is that SGE allows 10 jobs from each
of the two queues to run at the same time, for a total of 20 jobs,
thus effectively allocating two jobs to each CPU. In the following
example I have submitted 24 jobs, 12 to each queue. Note that
'qstat' shows 20 of them to be running simultaneously, with four waiting:
[gladden at testpe batchtest]$ qstat
job-ID  prior    name       user     state  submit/start at      queue                     slots ja-task-ID
----------------------------------------------------------------------------------------------------------
    110 0.55500  test_simpl gladden  r      06/01/2009 10:08:37  all.q at compute-0-0.local      1
    114 0.55500  test_simpl gladden  r      06/01/2009 10:08:43  all.q at compute-0-0.local      1
    109 0.55500  test_simpl gladden  r      06/01/2009 10:08:37  all.q at compute-0-1.local      1
    111 0.55500  test_simpl gladden  r      06/01/2009 10:08:40  all.q at compute-0-1.local      1
    112 0.55500  test_simpl gladden  r      06/01/2009 10:08:40  all.q at compute-0-1.local      1
    113 0.55500  test_simpl gladden  r      06/01/2009 10:08:40  all.q at compute-0-1.local      1
    115 0.55500  test_simpl gladden  r      06/01/2009 10:08:43  all.q at compute-0-1.local      1
    116 0.55500  test_simpl gladden  r      06/01/2009 10:08:43  all.q at compute-0-1.local      1
    117 0.55500  test_simpl gladden  r      06/01/2009 10:08:46  all.q at compute-0-1.local      1
    118 0.55500  test_simpl gladden  r      06/01/2009 10:08:46  all.q at compute-0-1.local      1
    121 0.55500  test_simpl gladden  r      06/01/2009 10:09:08  high at compute-0-0.local       1
    126 0.55500  test_simpl gladden  r      06/01/2009 10:09:11  high at compute-0-0.local       1
    122 0.55500  test_simpl gladden  r      06/01/2009 10:09:08  high at compute-0-1.local       1
    123 0.55500  test_simpl gladden  r      06/01/2009 10:09:08  high at compute-0-1.local       1
    124 0.55500  test_simpl gladden  r      06/01/2009 10:09:08  high at compute-0-1.local       1
    125 0.55500  test_simpl gladden  r      06/01/2009 10:09:11  high at compute-0-1.local       1
    127 0.55500  test_simpl gladden  r      06/01/2009 10:09:11  high at compute-0-1.local       1
    128 0.55500  test_simpl gladden  r      06/01/2009 10:09:11  high at compute-0-1.local       1
    129 0.55500  test_simpl gladden  r      06/01/2009 10:09:14  high at compute-0-1.local       1
    130 0.55500  test_simpl gladden  r      06/01/2009 10:09:14  high at compute-0-1.local       1
    119 0.55500  test_simpl gladden  qw     06/01/2009 10:08:44                                 1
    120 0.55500  test_simpl gladden  qw     06/01/2009 10:08:45                                 1
    131 0.55500  test_simpl gladden  qw     06/01/2009 10:09:12                                 1
    132 0.55500  test_simpl gladden  qw     06/01/2009 10:09:13                                 1
What I had expected was that SGE would first dispatch 10 jobs from
the "high priority" queue and then, as those jobs completed and
slots became available, dispatch and run additional jobs from the
default queue - but allowing only 10 jobs to run at one time.
Instead, SGE seems to regard the 10 queue instances associated with
the "high" queue as being associated with slots that are independent
from the 10 that are associated with "all.q".
Have I failed to configure something properly? Is there not a way
to feed jobs from multiple queues to the same set of nodes while
limiting the number of active jobs to one per CPU?
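(One approach often suggested for exactly this, for what it's worth, is to define the slot count once per execution host, so that all queue instances on a host draw from a single per-host pool; a sketch using this cluster's host names:)

```shell
# Cap usable slots at the execution-host level (one per core), shared
# by every queue instance on that host:
qconf -me compute-0-1
#   complex_values   slots=8
qconf -me compute-0-0
#   complex_values   slots=2
```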