[GE users] Queue subordination and custom complexes

David Olbersen dolbersen at nextwave.com
Mon Apr 7 21:32:01 BST 2008


Roberta,

That's awesome!
It's too bad I'm stuck at SGE 6.0u8 and can't upgrade to get RQS.
At least, I can't upgrade just yet. This is probably a big enough
motivator that I could put in for a little down time.

On that note, I noticed that in the upgrade instructions you're supposed
to stop all jobs and disable all the queues. Is there a way to upgrade
without having to do that? Some of our jobs run for months and it would
be unfortunate if we had to wait until they all finished before we could
upgrade the cluster. Is there some kind of phased approach that has
worked for people in the past?

-- 
David Olbersen
 

-----Original Message-----
From: Roberta Gigon [mailto:RGigon at slb.com] 
Sent: Monday, April 07, 2008 1:00 PM
To: David Olbersen
Subject: RE: Re: [GE users] Queue subordination and custom complexes

Hi Dave,
I was able to get this set up using RQS.  I'm sure this can be done via
the command line, but I used the Resource Quotas Configuration button in
QMON...

Here is an example that perhaps will help you...

{
        name       sdr_rule_1
        enabled    TRUE
        limit      queues nuclear_hi.q,webmi_low.q hosts {*} to slots=4
}

In the example above, the combination of jobs in nuclear_hi.q and
webmi_low.q can never use more than 4 slots on any host.
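
If you prefer the command line, the same rule should be manageable with
qconf's resource quota options (a sketch only, since I went through QMON
and haven't verified the exact invocation myself):

        qconf -srqs                 # show the resource quota sets currently defined
        qconf -arqs sdr_rule_1      # add a new set (opens the {...} block above in $EDITOR)
        qconf -mrqs sdr_rule_1      # or modify an existing one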

Hope this helps!
Roberta

----------------------------------------------------------------------
Roberta M. Gigon
Schlumberger-Doll Research
One Hampshire Street, MD-B253
Cambridge, MA 02139
617.768.2099 - phone
617.768.2381 - fax

This message is considered Schlumberger CONFIDENTIAL.  Please treat the
information contained herein accordingly.


-----Original Message-----
From: David Olbersen [mailto:dolbersen at nextwave.com]
Sent: Monday, April 07, 2008 1:16 PM
To: users at gridengine.sunsource.net
Subject: RE: Re: [GE users] Queue subordination and custom complexes

Reuti,

So I've tried this on my lab cluster and see that I can set the number
of job slots as you say.
That looks pretty good, but there's still the problem of
oversubscription.

For example, node-1 is in the "@dualcores" hostgroup.
Q1 says:
        slots                 4,[@dualcores=2]
Q2 says:
        slots                 16,[@dualcores=8]

The problem is that the machine can end up running 10 jobs. That's not
how I need it to work.
Any of the following mixes would be OK:
2 jobs from q1, 0 from q2       (q1 is allowed to dominate)
0 jobs from q1, 8 from q2       (q2 is allowed to dominate)
1 job from q1, 4 from q2        (sharing)

Using just job-slot tuning at the cluster-queue level I can end up with
2 jobs from q1 and 8 from q2. That's too many.

Any suggestions?

Maybe the problem is that I'm trying to treat q1 and q2 as equals (no
job suspension) and that just won't work using this configuration.
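
One thing I may still try on 6.0, along the lines of Reuti's suggestion
of setting "slots" in the exec host configuration, is a per-host cap so
the queue totals can never add up past the machine. A rough sketch
(assuming node-1 should never run more than 8 slots in total; the value
is only for illustration):

        qconf -me node-1
        ...
        complex_values        slots=8

Both the queue-level and the host-level limits would then apply, so 2
jobs from q1 plus 8 from q2 would no longer fit. It only bounds the raw
slot count, though, so by itself it doesn't express the "one q1 job
costs as much as four q2 jobs" weighting above.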

--
David Olbersen


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Tuesday, April 01, 2008 2:55 PM
To: David Olbersen
Subject: PM: Re: [GE users] Queue subordination and custom complexes

Hey David,

don't give up so early ;-) Just forget about your complex completely
for a few minutes.

On 01.04.2008 at 23:22, David Olbersen wrote:
> Reuti,
>
>> So, contrary to your first post, you don't want to use subordination 
>> any longer - where only one queue is active at a given point in time 
>> and the others are suspended?
>
> That's not true at all!
>
> In the first post I describe my experiences trying to configure queue 
> subordination when exechost complexes are being used. My experience is
> that this does not work -- jobs don't get suspended. I wondered out
> loud if maybe it was because the exechost complex wouldn't be 
> considered "released" when the job was suspended.
>
> You replied suggesting I move these complexes from the exechosts to 
> the queues.
>
> I replied trying to explain why that doesn't make sense to me: this 
> complex is by definition host-specific. Moving the complex to the 
> queue level would require a hardware homogeneity I don't have.

Nope, there is nothing homogeneous in the configuration I posted:

slots                 2,[@p3-1100=1],[node10=1],[node02=1],[node03=1],[node09=1]

and to adapt it to your configuration using hostgroups (or individual
nodes):

high.q:
slots                 1,[@quad_cores=4],[@dual_cores=2]
subordinate_list mid.q=1,low.q=1

mid.q:
slots                 2,[@quad_cores=8],[@dual_cores=4]
subordinate_list low.q=1

low.q:
slots                 4,[@quad_cores=16],[@dual_cores=8]
subordinate_list NONE

No slot limit in any exec_host, no custom complexes.

We are speaking here of cluster queues, and for each host there will be
one queue instance residing on that host. Each host in the hostgroup gets
its own slot count, and even in a mixed cluster each host gets the
number of slots it deserves.
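
To set this up, you just edit each cluster queue and adjust its "slots"
line and "subordinate_list", then check the resulting queue instances
per host, e.g. (a sketch, using the queue and hostgroup names from the
example above):

        qconf -mq high.q        # set: slots  1,[@quad_cores=4],[@dual_cores=2]
                                #      subordinate_list  mid.q=1,low.q=1
        qconf -sq low.q         # show the final queue definition
        qstat -f -q low.q       # list the per-host queue instances and their slots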

-- Reuti


>
> Then you suggested that I change the number of slots on each exechost,
> rather than using the complex I have set up.
>
> I replied suggesting that doesn't make sense to me since if I set the 
> slot count too high, I get more jobs on a machine than I want, and if 
> I set it too low I end up wasting resources.
>
> It sounds like this just isn't going to work. Thanks for your time and
> effort.
>
> --
> David Olbersen
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Tuesday, April 01, 2008 1:10 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Queue subordination and custom complexes
>
> On 01.04.2008 at 18:28, David Olbersen wrote:
>> Reuti,
>>
>> We want to use a DOUBLE because we consider some of our jobs to use 
>> less than a whole CPU. We have some jobs that need to run that never 
>> do very much CPU processing at all. For example, we have one type of 
>> job which we consider to use 1/4 of a CPU.
>>
>> The "smaller" jobs only request 1/4 of a CPU via "-l cores=0.25". The

>> queue these jobs run in has it's slot count set to 16 (4 cores * 4 
>> jobs per core = 16). However, these machines may also be used by 
>> queues which use whole, or even multiple CPUs. So in this situation, 
>> what would I set the slots attribute to on this machine? 1? 4? 16? It

>> seems impossible to set it correctly -- if I set it to 16 I can have 
>> an over-subscribed (by your definition) machine. If I set it to 4 I 
>> can still have an over-subscribed machine if some multi-threaded jobs

>> come along. If I set it to 1 I'll end up wasting resources.
>
> So, contrary to your first post, you don't want to use subordination 
> any longer - where only one queue is active at a given point in time 
> and the others are suspended?
>
> -- Reuti
>
>
>> --
>> David Olbersen
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Tuesday, April 01, 2008 12:36 AM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Queue subordination and custom complexes
>>
>> On 01.04.2008 at 00:11, David Olbersen wrote:
>>> Reuti,
>>>
>>>> What you can do: attach the resource to the queues, not to the 
>>>> host.
>>>> Hence every queue supplies the specified amount per node on its 
>>>> own.
>>>
>>> I think you're missing the idea. My "cores" complex is the same as the
>>> "num_procs" except a DOUBLE instead of an INT. Specifying it on a
>>> per-queue basis isn't appropriate since I'm trying to over-subscribe
>>> my hosts. Also, my hosts have varying numbers of cores (2 or 4).
>>
>> It is appropriate, as it is the limit per queue instance in a queue
>> definition:
>>
>> slots                 2,[@p3-1100=1],[node10=1],[node02=1],[node03=1],[node09=1]
>>
>> But the term "over-subscribe" usually means to have more jobs running
>> at the same time than there are cores in the machine. But it seems you
>> want to avoid over-subscription.
>>
>> Therefore you can also set "slots" in each exec host's configuration
>> and both limits will apply per node (or even use an RQS for it). It
>> just fills the node from different queues and avoids oversubscription.
>> But if you want to use subordination (as you stated in your first
>> post), you mustn't specify it on a per-node basis at all. Just set
>> "subordinate_list other.q=1" and other.q will get suspended as soon as
>> one slot is used in the current queue.
>>
>> But I don't get why you want to have a DOUBLE for it.
>>
>> -- Reuti
>>
>>
>>> To elaborate: we want to give each job a whole CPU to play with. On a
>>> 4-processor machine that means only 4 jobs can run.
>>>
>>> However, to get the most utilization out of a machine, we may allow
>>> many queues to run on it, to the point of having 8-12 slots total.
>>> However, if all 8 or 12 slots were full on the one machine, we'd have
>>> more jobs/CPU than we really want, causing all the jobs to slow down.
>>>
>>> To accommodate this situation, each job requires 1 "cores" consumable
>>> by default. This makes it such that any mixture of jobs from various
>>> queues can run on the machine, so long as there are still "cores"
>>> available. It also means that if a job is multi-threaded and needs all
>>> 4 cores, it can request as much and consume an entire machine.
>>>
>>> For example: node-a has 4 CPUs and is in q1, q2, and q3. q1, q2, and
>>> q3 are set to put 4 slots on each machine they're on. This means that
>>> node-a has 12 slots, but only 4 CPUs. I set its "cores" complex = 4.
>>> Now any combination of 4 jobs from queues q1, q2, and q3 can run. This
>>> gets the most utilization out of the machine.
>>>
>>> So given that this resource has to remain at the node level, are there
>>> any ways to get around this? Maybe give the resource back when the job
>>> gets suspended, then take it back when it gets resumed?
>>>
>>> --
>>> David Olbersen
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Monday, March 31, 2008 10:37 AM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] Queue subordination and custom complexes
>>>
>>> Hi,
>>>
>>> On 31.03.2008 at 18:46, David Olbersen wrote:
>>>> I have the following configuration in my lab cluster:
>>>>
>>>> Q1 runs on machines #1, #2, and #3.
>>>> Q2 runs on the same machines.
>>>> Q2 is configured to have Q1 as a subordinate.
>>>> All machines have 2GB of RAM.
>>>>
>>>> If I submit 3 jobs to Q1 and 3 to Q2, the expected results are
>>>> given: jobs start in Q1 (submitted first) then get suspended while 
>>>> jobs in Q2 run.
>>>>
>>>> Awesome.
>>>>
>>>> Next I try specifying hard resource requirements by adding
>>>> "-hard -l mem_free=1.5G" to each job. This still ends up working out,
>>>> probably because the jobs don't actually consume 1.5G of memory.
>>>> The jobs are simple things that drive up CPU utilization by dd'ing 
>>>> from /dev/urandom out to /dev/null.
>>>>
>>>> Next, to further replicate my production environment I add a custom
>>>> complex named "cores" that gets set on a per-host basis to the number
>>>> of CPUs the machine has. Please note that we're not using "num_proc"
>>>> because we want some jobs to use fractions of a CPU and num_proc is
>>>> an INT.
>>>>
>>>> So each job will take up 1 "core" and each job has 1 "core".
>>>> With this set up the jobs in Q1 run, and the jobs in Q2 wait. No 
>>>> suspension happens at all. Is this because the host resource is 
>>>> actually being consumed? Is there any way to get around this?
>>>
>>> yes, you can check the remaining amount of this complex with "qhost
>>> -F cores". Or also per job: "qstat -j <jobid>" (when "schedd_job_info
>>> true" is set in the scheduler setup). Be aware that only complete
>>> queues can be suspended, and not just some slots of them.
>>>
>>> What you can do: attach the resource to the queues, not to the host.
>>> Hence every queue supplies the specified amount per node on its own.
>>>
>>> (Sidenote: to avoid requesting the resource all the time and
>>> specifying the correct queue in addition, you could also have two
>>> resources, cores1 and cores2. Attach cores1 to Q1 and likewise cores2
>>> to Q2. "qsub -l cores2=1" will then also get the Q2 queue.)
>>>
>>> -- Reuti