[GE users] Queue subordination and custom complexes

David Olbersen dolbersen at nextwave.com
Tue Apr 8 17:19:33 BST 2008


*) This brings up an idea for an RFE to get it working - anyone think
it's useful?:

limit q1*4,q2 hosts {*} to slots=8

I think this kind of syntax would be handy. In my case it would be even
better if you could do some math, e.g.

limit q1, q2 hosts {*} to slots=$cpu

Or

limit q1, q2 hosts {*} to slots=$cpu*4
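
For reference, a plain resource quota set (available since 6.1) can already
cap the combined slots of both queues per host - just without the per-queue
weighting or the arithmetic proposed above. A rough, untested sketch (rule
name made up):

  {
     name         cap_q1_q2_per_host
     description  Cap the combined q1+q2 slots on every host
     enabled      TRUE
     limit        queues q1,q2 hosts {*} to slots=8
  }

If I remember right, dynamic limits such as "slots=$num_proc" are also
accepted in RQS rules (see sge_resource_quota(5)), which would cover the
"$cpu" part, but still not the weighting.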

-- 
David Olbersen
 

-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Monday, April 07, 2008 3:55 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Queue subordination and custom complexes

Hi David,

On 07.04.2008 at 19:15, David Olbersen wrote:
> So I've tried this on my lab cluster and see that I can set the number
> of job slots as you say.
> That looks pretty good, but there's still the problem of 
> oversubscription.
>
> For example, node-1 is in the "@dualcores" hostgroup.
> Q1 says:
> 	slots                 4,[@dualcores=2]
> Q2 says:
> 	slots                 16,[@dualcores=8]

great.

> The problem is that the machine can end up running 10 jobs. That's not
> how I need it to work.
> Any of the following mixes would be OK:
> 2 jobs from q1, 0 from q2	(q1 is allowed to dominate)
> 0 jobs from q1, 8 from q2	(q2 is allowed to dominate)
> 1 job from q1, 4 from q2	(sharing)

There is no core affinity for now (unless you implement it on your own).
Hence the kernel will share these 2 cores among 5 processes with its own
scheduler. You could of course try to give q2 a nice value of 19 (the
priority setting in the queue configuration). But there is no guarantee
what amount of time will be given to each process then.
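
A minimal sketch of that nice-value approach (untested; adjust the queue
name to yours):

  # qconf -mq q2
  ...
  priority              19
  ...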

> Using just job slot tuning at the queue-cluster level I can end up 
> with
> 2 jobs from q1, 8 from q2. That's too many.
>
> Any suggestions?

If you have a hierarchy between these two queues, you could use a suspend
threshold in one of them to drop the load.
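
For example (a rough sketch - thresholds and intervals are just placeholder
values):

  # qconf -mq q2
  ...
  suspend_thresholds    np_load_avg=1.50
  nsuspend              1
  suspend_interval      00:01:00
  ...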

> Maybe the problem is that I'm trying to treat q1 and q2 as equals (no 
> job suspension) and that just won't work using this configuration.

This was exactly the point I was wondering about all the time. If you
used subordination and suspended either queue, all would be fine.
Even 6.1 wouldn't help here*. As you mentioned "to get around the
waiting jobs in q2" in your original post: do you want to suspend by
hand?

-- Reuti

*) This brings up an idea for an RFE to get it working - anyone think
it's useful?:

limit q1*4,q2 hosts {*} to slots=8


> --
> David Olbersen
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Tuesday, April 01, 2008 2:55 PM
> To: David Olbersen
> Subject: PM: Re: [GE users] Queue subordination and custom complexes
>
> Hey David,
>
> don't give up so early ;-) Just forget for a few minutes completely 
> about your complex.
>
> On 01.04.2008 at 23:22, David Olbersen wrote:
>> Reuti,
>>
>>> So, contrary to your first post, you don't want to use subordination
>>> any longer - where only one queue is active at a given point in time
>>> and the others are suspended?
>>
>> That's not true at all!
>>
>> In the first post I describe my experiences trying to configure queue
>> subordination when exechost complexes are being used. My experience is
>> that this does not work -- jobs don't get suspended. I wondered out
>> loud if maybe it was because the exechost complex wouldn't be
>> considered "released" when the job was suspended.
>>
>> You replied suggesting I move these complexes from the exechosts to 
>> the queues.
>>
>> I replied trying to explain why that doesn't make sense to me: this 
>> complex is by definition host-specific. Moving the complex to the 
>> queue level would require a hardware homogeneity I don't have.
>
> Nope, there is nothing homogeneous in the configuration I posted:
>
> slots                 2,[@p3-1100=1],[node10=1],[node02=1],[node03=1],[node09=1]
>
> and to explain it for your configuration by using hostgroups or each
> node:
>
> high.q:
> slots                 1,[@quad_cores=4],[@dual_cores=2]
> subordinate_list mid.q=1,low.q=1
>
> mid.q:
> slots                 2,[@quad_cores=8],[@dual_cores=4]
> subordinate_list low.q=1
>
> low.q:
> slots                 4,[@quad_cores=16],[@dual_cores=8]
> subordinate_list NONE
>
> No slot limit in any exec_host, no custom complexes.
>
> We are speaking here of cluster queues, and for each host there will
> be one queue instance residing on that host. Each host in the hostgroup
> gets its own slot count, and even in a mixed cluster each host gets the
> number of slots it deserves.
>
> -- Reuti
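
(To apply and verify a layout like the one quoted above - the hostgroup
names are of course placeholders for your own - something along these
lines should do:

  qconf -mq high.q   # set the slots line and subordinate_list as above
  qconf -sq high.q   # check the cluster queue configuration
  qstat -f           # shows the slot count of every queue instance
)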
>
>
>>
>> Then you suggested that I change the number of slots on each exechost,
>> rather than using the complex I have set up.
>>
>> I replied suggesting that doesn't make sense to me since if I set the
>> slot count too high, I get more jobs on a machine than I want, and if
>> I set it too low I end up wasting resources.
>>
>> It sounds like this just isn't going to work. Thanks for your time and
>> effort.
>>
>> --
>> David Olbersen
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Tuesday, April 01, 2008 1:10 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Queue subordination and custom complexes
>>
>> On 01.04.2008 at 18:28, David Olbersen wrote:
>>> Reuti,
>>>
>>> We want to use a DOUBLE because we consider some of our jobs to use
>>> less than a whole CPU. We have some jobs that need to run that never
>>> do very much CPU processing at all. For example, we have one type of
>>> job which we consider to use 1/4 of a CPU.
>>>
>>> The "smaller" jobs only request 1/4 of a CPU via "-l cores=0.25".  
>>> The
>
>>> queue these jobs run in has it's slot count set to 16 (4 cores * 4 
>>> jobs per core = 16). However, these machines may also be used by 
>>> queues which use whole, or even multiple CPUs. So in this situation,

>>> what would I set the slots attribute to on this machine? 1? 4?
>>> 16? It
>
>>> seems impossible to set it correctly -- if I set it to 16 I can have

>>> an over-subscribed (by your definition) machine. If I set it to 4 I 
>>> can still have an over-subscribed machine if some multi-threaded 
>>> jobs
>
>>> come along. If I set it to 1 I'll end up wasting resources.
>>
>> So, contrary to your first post, you don't want to use subordination 
>> any longer - where only one queue is active at a given point in time 
>> and the others are suspended?
>>
>> -- Reuti
>>
>>
>>> --
>>> David Olbersen
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Tuesday, April 01, 2008 12:36 AM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] Queue subordination and custom complexes
>>>
>>> On 01.04.2008 at 00:11, David Olbersen wrote:
>>>> Reuti,
>>>>
>>>>> What you can do: attach the resource to the queues, not to the 
>>>>> host.
>>>>> Hence every queue supplies the specified amount per node on its 
>>>>> own.
>>>>
>>>> I think you're missing the idea. My "cores" complex is the same as
>>>> "num_proc" except a DOUBLE instead of an INT. Specifying it on a
>>>> per-queue basis isn't appropriate since I'm trying to over-subscribe
>>>> my hosts. Also, my hosts have varying numbers of cores (2 or 4).
>>>
>>> It is appropriate, as it is the limit per queue instance in a queue
>>> definition:
>>>
>>> slots                 2,[@p3-1100=1],[node10=1],[node02=1],[node03=1],[node09=1]
>>>
>>> But the term "over-subscribe" usually means to have more jobs running
>>> at the same time than there are cores in the machine. But it seems you
>>> want to avoid over-subscription.
>>>
>>> Therefore you can also set "slots" in each exec host's configuration
>>> and both limits will apply per node (or even use an RQS for it). It
>>> just fills the node from different queues and avoids oversubscription.
>>> But if you want to use subordination (as you stated in your first
>>> post), you mustn't specify it on a per-node basis at all. Just set
>>> "subordinate_list other.q=1" and other.q will get suspended as soon as
>>> one slot is used in the current queue.
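
(A rough sketch of both variants mentioned above, with example names: the
per-host cap is set in the execution host configuration,

  # qconf -me node-1
  complex_values        slots=4

 while plain subordination only needs, in the dominant queue (qconf -mq q1):

  subordinate_list      q2=1
)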
>>>
>>> But I don't get why you want to have a DOUBLE for it.
>>>
>>> -- Reuti
>>>
>>>
>>>> To elaborate: we want to give each job a whole CPU to play with. On a
>>>> 4-processor machine that means only 4 jobs can run.
>>>>
>>>> However, to get the most utilization out of a machine, we may allow
>>>> many queues to run on it, to the point of having 8-12 slots total.
>>>> However, if all 8 or 12 slots were full on the one machine, we'd have
>>>> more jobs/CPU than we really want, causing all the jobs to slow down.
>>>>
>>>> To accommodate this situation, each job requires 1 "cores" consumable
>>>> by default. This makes it such that any mixture of jobs from various
>>>> queues can run on the machine, so long as there are still "cores"
>>>> available. It also means that if a job is multi-threaded and needs all
>>>> 4 cores, it can request as much and consume an entire machine.
>>>>
>>>> For example: node-a has 4 CPUs and is in q1, q2, and q3. q1, q2, and
>>>> q3 are set to put 4 slots on each machine they're on. This means that
>>>> node-a has 12 slots, but only 4 CPUs. I set its "cores" complex = 4.
>>>> Now any combination of 4 jobs from queues q1, q2, and q3 can run. This
>>>> gets the most utilization out of the machine.
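
(For reference, the kind of setup described here would look roughly like
the following - a sketch with hypothetical names, not a verbatim dump of
the real configuration:

  # qconf -mc  - add the consumable:
  #name   shortcut  type    relop  requestable  consumable  default  urgency
  cores   cores     DOUBLE  <=     YES          YES         1        0

  # qconf -me node-a  - make 4 of them available on the host:
  complex_values        cores=4

  # a quarter-CPU job then requests:
  qsub -l cores=0.25 job.sh
)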
>>>>
>>>> So given that this resource has to remain at the node level, are there
>>>> any ways to get around this? Maybe give the resource back when the job
>>>> gets suspended, then take it back when it gets resumed?
>>>>
>>>> --
>>>> David Olbersen
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: Monday, March 31, 2008 10:37 AM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] Queue subordination and custom complexes
>>>>
>>>> Hi,
>>>>
>>>> On 31.03.2008 at 18:46, David Olbersen wrote:
>>>>> I have the following configuration in my lab cluster:
>>>>>
>>>>> Q1 runs on machines #1, #2, and #3.
>>>>> Q2 runs on the same machines.
>>>>> Q2 is configured to have Q1 as a subordinate.
>>>>> All machines have 2GB of RAM.
>>>>>
>>>>> If I submit 3 jobs to Q1 and 3 to Q2, the expected results are
>>>>> given: jobs start in Q1 (submitted first) then get suspended while
>>>>> jobs in Q2 run.
>>>>>
>>>>> Awesome.
>>>>>
>>>>> Next I try specifying hard resource requirements by adding "-hard -l
>>>>> mem_free=1.5G" to each job. This still ends up working out, probably
>>>>> because the jobs don't actually consume 1.5G of memory.
>>>>> The jobs are simple things that drive up CPU utilization by dd'ing
>>>>> from /dev/urandom out to /dev/null.
>>>>>
>>>>> Next, to further replicate my production environment I add a custom
>>>>> complex named "cores" that gets set on a per-host basis to the number
>>>>> of CPUs the machine has. Please note that we're not using "num_proc"
>>>>> because we want some jobs to use fractions of a CPU and num_proc is
>>>>> an INT.
>>>>>
>>>>> So each job will take up 1 "core" and each job has 1 "core".
>>>>> With this setup the jobs in Q1 run, and the jobs in Q2 wait. No
>>>>> suspension happens at all. Is this because the host resource is 
>>>>> actually being consumed? Is there any way to get around this?
>>>>
>>>> Yes, you can check the remaining amount of this complex with "qhost -F
>>>> cores". Or also per job: "qstat -j <jobid>" (when "schedd_job_info
>>>> true" is set in the scheduler configuration). Be aware that only
>>>> complete queues can be suspended, and not just some slots of them.
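
(Concretely, assuming an otherwise default setup:

  qhost -F cores       # remaining value of the consumable on each host
  qconf -msconf        # set: schedd_job_info  true
  qstat -j <jobid>     # then shows why the job is still waiting
)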
>>>>
>>>> What you can do: attach the resource to the queues, not to the 
>>>> host.
>>>> Hence every queue supplies the specified amount per node on its 
>>>> own.
>>>>
>>>> (Sidenote: to avoid requesting the resource all the time and
>>>> specifying the correct queue in addition, you could also have two
>>>> resources, cores1 and cores2. Attach cores1 to Q1 and likewise cores2
>>>> to Q2. "qsub -l cores2=1" will then also select the Q2 queue.)
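
(A sketch of that trick, again with made-up values: define cores1 and
cores2 like the "cores" consumable above, then attach them per queue,

  # qconf -mq Q1
  complex_values        cores1=4

  # qconf -mq Q2
  complex_values        cores2=4

 so that "qsub -l cores2=1 ..." can only be scheduled into Q2.)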
>>>>
>>>> -- Reuti