[GE users] Queue subordination and custom complexes

Reuti reuti at staff.uni-marburg.de
Tue Apr 8 18:09:40 BST 2008


Am 08.04.2008 um 18:19 schrieb David Olbersen:
> *) This brings up an idea for an RFE to get it working - anyone think
> it's useful?:
>
> limit q1*4,q2 hosts {*} to slots=8
>
> I think this kind of syntax would be handy. In my case it would be even
> better if you could do some math, e.g.
>
> limit q1, q2 hosts {*} to slots=$cpu
>
> Or
>
> limit q1, q2 hosts {*} to slots=$cpu*4

This is already implemented, but I don't think it would solve your problem:

http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/devel/rfe/ResourceQuotaSpecification.html

It would set the slot count on a host level, but you want q1 and q2 to
have a different weight, i.e. the sum (q1 * 4 + q2) should not exceed 8.
This could mean any combination of "slots" in both queues.
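
For reference, a per-host dynamic limit in the syntax of that spec
(added with "qconf -arqs") could look roughly like this -- an untested
sketch with a made-up rule name; it caps the combined q1+q2 slots per
host at twice the core count, but it cannot weight the two queues
differently:

   {
      name     per_host_slots
      enabled  TRUE
      limit    queues q1,q2 hosts {*} to slots=$num_proc*2
   }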

-- Reuti


> -- 
> David Olbersen
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Monday, April 07, 2008 3:55 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Queue subordination and custom complexes
>
> Hi David,
>
> Am 07.04.2008 um 19:15 schrieb David Olbersen:
>> So I've tried this on my lab cluster and see that I can set the number
>> of job slots as you say.
>> That looks pretty good, but there's still the problem of
>> oversubscription.
>>
>> For example, node-1 is in the "@dualcores" hostgroup.
>> Q1 says:
>> 	slots                 4,[@dualcores=2]
>> Q2 says:
>> 	slots                 16,[@dualcores=8]
>
> great.
>
>> The problem is that the machine can end up running 10 jobs. That's not
>> how I need it to work.
>> Any of the following mixes would be OK:
>> 2 jobs from q1, 0 from q2	(q1 is allowed to dominate)
>> 0 jobs from q1, 8 from q2	(q2 is allowed to dominate)
>> 1 job from q1, 4 from q2	(sharing)
>
> There is no core affinity for now (unless you implement it on your
> own), hence the kernel will share these 2 cores among 5 processes with
> its own scheduler. You could of course try to give q2 a nice value of
> 19 (the priority setting in the queue configuration), but there is no
> guarantee then what amount of CPU time each process will get.
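>
> For example, a minimal sketch (assuming the queue is literally q2):
>
>    qconf -mq q2
>    ...
>    priority              19
>
> The "priority" attribute of a queue is the nice value at which jobs in
> that queue's instances are started.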
>
> Using just job slot tuning at the cluster-queue level I can end up with
> 2 jobs from q1, 8 from q2. That's too many.
>>
>> Any suggestions?
>
> If there is a hierarchy between these two queues, you could use a
> suspend threshold in one of them to drop the load.
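>
> For example, in the lower-priority queue (untested sketch):
>
>    suspend_thresholds    np_load_avg=1.50
>    nsuspend              1
>    suspend_interval      00:05:00
>
> would suspend one job every 5 minutes as long as the normalized load
> on the host stays above 1.5, and resume it once the load drops again.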
>
>> Maybe the problem is that I'm trying to treat q1 and q2 as equals (no
>> job suspension) and that just won't work using this configuration.
>
> This was exactly the point I was wondering about all the time. If you
> used subordination and suspended either queue, everything would be
> fine. Even 6.1 wouldn't help here*. As you mentioned "to get around the
> waiting jobs in q2" in your original post: do you want to suspend by
> hand?
>
> -- Reuti
>
> *) This brings up an idea for an RFE to get it working - anyone think
> it's useful?:
>
> limit q1*4,q2 hosts {*} to slots=8
>
>
>> --
>> David Olbersen
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Tuesday, April 01, 2008 2:55 PM
>> To: David Olbersen
>> Subject: PM: Re: [GE users] Queue subordination and custom complexes
>>
>> Hey David,
>>
>> don't give up so early ;-) Just forget completely about your complex
>> for a few minutes.
>>
>> Am 01.04.2008 um 23:22 schrieb David Olbersen:
>>> Reuti,
>>>
>>>> So, contrary to your first post, you don't want to use
>>>> subordination any longer - where only one queue is active at a
>>>> given point in time and the others are suspended?
>>>
>>> That's not true at all!
>>>
>>> In the first post I describe my experiences trying to configure
>>> queue subordination when exechost complexes are being used. My
>>> experience is that this does not work -- jobs don't get suspended. I
>>> wondered out loud if maybe it was because the exechost complex
>>> wouldn't be considered "released" when the job was suspended.
>>>
>>> You replied suggesting I move these complexes from the exechosts to
>>> the queues.
>>>
>>> I replied trying to explain why that doesn't make sense to me: this
>>> complex is by definition host-specific. Moving the complex to the
>>> queue level would require a hardware homogeneity I don't have.
>>
>> Nope, there is nothing homogeneous in the configuration I posted:
>>
>> slots     2,[@p3-1100=1],[node10=1],[node02=1],[node03=1],[node09=1]
>>
>> and to translate it to your configuration, using hostgroups (or
>> individual nodes):
>>
>> high.q:
>> slots                 1,[@quad_cores=4],[@dual_cores=2]
>> subordinate_list mid.q=1,low.q=1
>>
>> mid.q:
>> slots                 2,[@quad_cores=8],[@dual_cores=4]
>> subordinate_list low.q=1
>>
>> low.q:
>> slots                 4,[@quad_cores=16],[@dual_cores=8]
>> subordinate_list NONE
>>
>> No slot limit in any exec_host, no custom complexes.
>>
>> We are speaking here of cluster queues, and for each host there will
>> be one queue instance residing on it. Each host in the hostgroup gets
>> its own slot count, and even in a mixed cluster each host gets the
>> number of slots it deserves.
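>>
>> You can check what each queue instance actually got, e.g. with:
>>
>>    qconf -sq low.q | grep slots
>>    qstat -f -q low.q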
>>
>> -- Reuti
>>
>>
>>>
>>> Then you suggested that I change the number of slots on each
>>> exechost, rather than using the complex I have set up.
>>>
>>> I replied suggesting that doesn't make sense to me since if I set the
>>> slot count too high, I get more jobs on a machine than I want, and if
>>> I set it too low I end up wasting resources.
>>>
>>> It sounds like this just isn't going to work. Thanks for your time
>>> and effort.
>>>
>>> --
>>> David Olbersen
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Tuesday, April 01, 2008 1:10 PM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] Queue subordination and custom complexes
>>>
>>> Am 01.04.2008 um 18:28 schrieb David Olbersen:
>>>> Reuti,
>>>>
>>>> We want to use a DOUBLE because we consider some of our jobs to use
>>>> less than a whole CPU. We have some jobs that need to run but never
>>>> do very much CPU processing at all. For example, we have one type of
>>>> job which we consider to use 1/4 of a CPU.
>>>>
>>>> The "smaller" jobs only request 1/4 of a CPU via "-l cores=0.25". The
>>>> queue these jobs run in has its slot count set to 16 (4 cores * 4
>>>> jobs per core = 16). However, these machines may also be used by
>>>> queues which use whole, or even multiple, CPUs. So in this situation,
>>>> what would I set the slots attribute to on this machine? 1? 4? 16? It
>>>> seems impossible to set it correctly -- if I set it to 16 I can have
>>>> an over-subscribed (by your definition) machine. If I set it to 4 I
>>>> can still have an over-subscribed machine if some multi-threaded jobs
>>>> come along. If I set it to 1 I'll end up wasting resources.
>>>
>>> So, contrary to your first post, you don't want to use subordination
>>> any longer - where only one queue is active at a given point in time
>>> and the others are suspended?
>>>
>>> -- Reuti
>>>
>>>
>>>> --
>>>> David Olbersen
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: Tuesday, April 01, 2008 12:36 AM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] Queue subordination and custom complexes
>>>>
>>>> Am 01.04.2008 um 00:11 schrieb David Olbersen:
>>>>> Reuti,
>>>>>
>>>>>> What you can do: attach the resource to the queues, not to the
>>>>>> host.
>>>>>> Hence every queue supplies the specified amount per node on its
>>>>>> own.
>>>>>
>>>>> I think you're missing the idea. My "cores" complex is the same as
>>>>> the "num_proc" complex, except it is a DOUBLE instead of an INT.
>>>>> Specifying it on a per-queue basis isn't appropriate since I'm
>>>>> trying to over-subscribe my hosts. Also, my hosts have varying
>>>>> numbers of cores (2 or 4).
>>>>
>>>> It is appropriate, as it is the limit per queue instance in a queue
>>>> definition:
>>>>
>>>> slots     2,[@p3-1100=1],[node10=1],[node02=1],[node03=1],[node09=1]
>>>>
>>>> But the term "over-subscribe" usually means to have more jobs
>>>> running at the same time than there are cores in the machine. It
>>>> seems you want to avoid over-subscription.
>>>>
>>>> Therefore you can also set "slots" in each exec host's configuration
>>>> and both limits will apply per node (or even use an RQS for it). It
>>>> just fills the node from different queues and avoids
>>>> oversubscription. But if you want to use subordination (as you
>>>> stated in your first post), you mustn't specify it on a per-node
>>>> basis at all. Just set "subordinate_list other.q=1" and other.q will
>>>> get suspended as soon as one slot is used in the current queue.
>>>>
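>>>> For example (sketch, with made-up queue names):
>>>>
>>>>    qconf -mq busy.q
>>>>    ...
>>>>    subordinate_list      other.q=1
>>>>
>>>> suspends the whole other.q instance on a host as soon as one slot of
>>>> busy.q is in use there.
>>>>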
>>>> But I don't get why you want to have a DOUBLE for it.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> To elaborate: we want to give each job a whole CPU to play with.
>>>>> On a 4-processor machine that means only 4 jobs can run.
>>>>>
>>>>> However, to get the most utilization out of a machine, we may
>>>>> allow many queues to run on it, to the point of having 8-12 slots
>>>>> total. However, if all 8 or 12 slots were full on the one machine,
>>>>> we'd have more jobs/CPU than we really want, causing all the jobs
>>>>> to slow down.
>>>>>
>>>>> To accommodate this situation, each job requires 1 "cores"
>>>>> consumable by default. This makes it such that any mixture of jobs
>>>>> from various queues can run on the machine, so long as there are
>>>>> still "cores" available. It also means that if a job is
>>>>> multi-threaded and needs all 4 cores, it can request as much and
>>>>> consume an entire machine.
>>>>>
>>>>> For example: node-a has 4 CPUs and is in q1, q2, and q3. q1, q2,
>>>>> and q3 are set to put 4 slots on each machine they're on. This
>>>>> means that node-a has 12 slots, but only 4 CPUs. I set its "cores"
>>>>> complex = 4. Now any combination of 4 jobs from queues q1, q2, and
>>>>> q3 can run. This gets the most utilization out of the machine.
>>>>>
>>>>> So given that this resource has to remain at the node-level, are
>>>>> there any ways to get around this? Maybe give the resource back
>>>>> when the job gets suspended, then take it back when it gets
>>>>> resumed?
>>>>>
>>>>> --
>>>>> David Olbersen
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>>> Sent: Monday, March 31, 2008 10:37 AM
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] Queue subordination and custom complexes
>>>>>
>>>>> Hi,
>>>>>
>>>>> Am 31.03.2008 um 18:46 schrieb David Olbersen:
>>>>>> I have the following configuration in my lab cluster:
>>>>>>
>>>>>> Q1 runs on machines #1, #2, and #3.
>>>>>> Q2 runs on the same machines.
>>>>>> Q2 is configured to have Q1 as a subordinate.
>>>>>> All machines have 2GB of RAM.
>>>>>>
>>>>>> If I submit 3 jobs to Q1 and 3 to Q2, the expected results are
>>>>>> given: jobs start in Q1 (submitted first) then get suspended while
>>>>>> jobs in Q2 run.
>>>>>>
>>>>>> Awesome.
>>>>>>
>>>>>> Next I try specifying hard resource requirements by adding
>>>>>> "-hard -l mem_free=1.5G" to each job. This still ends up working
>>>>>> out, probably because the jobs don't actually consume 1.5G of
>>>>>> memory. The jobs are simple things that drive up CPU utilization
>>>>>> by dd'ing from /dev/urandom out to /dev/null.
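>>>>>>
>>>>>> (For example, roughly: "qsub -b y -hard -l mem_free=1.5G dd
>>>>>> if=/dev/urandom of=/dev/null" -- a binary job submitted with
>>>>>> "-b y", so no wrapper script is needed.)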
>>>>>>
>>>>>> Next, to further replicate my production environment I add a
>>>>>> custom complex named "cores" that gets set on a per-host basis to
>>>>>> the number of CPUs the machine has. Please note that we're not
>>>>>> using "num_proc" because we want some jobs to use fractions of a
>>>>>> CPU and num_proc is an INT.
>>>>>>
>>>>>> So each job will take up 1 "core", and each host supplies as many
>>>>>> "cores" as it has CPUs.
>>>>>> With this set up the jobs in Q1 run, and the jobs in Q2 wait. No
>>>>>> suspension happens at all. Is this because the host resource is
>>>>>> actually being consumed? Is there any way to get around this?
>>>>>
>>>>> yes, you can check the remaining amount of this complex with
>>>>> "qhost -F cores". Or also per job: "qstat -j <jobid>" (when
>>>>> "schedd_job_info true" is set in the scheduler setup). Be aware
>>>>> that only complete queues can be suspended, and not just some slots
>>>>> of them.
>>>>>
>>>>> What you can do: attach the resource to the queues, not to the
>>>>> host.
>>>>> Hence every queue supplies the specified amount per node on its
>>>>> own.
>>>>>
>>>>> (Sidenote: to avoid requesting the resource all the time and
>>>>> specifying the correct queue in addition, you could also have two
>>>>> resources cores1 and cores2. Attach cores1 to Q1 and likewise
>>>>> cores2 to Q2. "qsub -l cores2=1" will then also select the Q2
>>>>> queue.)
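>>>>>
>>>>> A sketch of the two complex definitions (edited with "qconf -mc"):
>>>>>
>>>>>    #name   shortcut type   relop requestable consumable default urgency
>>>>>    cores1  c1       DOUBLE <=    YES         YES        1       0
>>>>>    cores2  c2       DOUBLE <=    YES         YES        1       0
>>>>>
>>>>> and then e.g. "complex_values cores1=4" in Q1's queue configuration
>>>>> to supply 4 of them per queue instance.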
>>>>>
>>>>> -- Reuti
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list