[GE users] Jobs running but not using resources

reuti reuti at staff.uni-marburg.de
Thu Nov 6 00:22:16 GMT 2008



Hi Hugo,

On 06.11.2008 at 00:09, Hugo Hernandez-Mora wrote:

> Reuti,
> I followed your suggestions about the memory limits and we ran  
> into another problem.  Here is what I did:
>
> o set h_vmem as consumable with a default value of 1.5G,
> o set complex_values h_vmem=15G on the execution hosts,
> o set h_vmem=5G in the queue configuration,
>
> as you can see, we are reserving 1G for the OS,

IMO this is far too much. I even tend to set h_vmem on the execution  
hosts equal to the installed memory. If all applications really used  
the granted memory (which is unlikely with a job mix), the OS would  
swap a little and that's all. But you can of course leave it set to  
15 GB and observe over time whether the remaining 1 GB could also be  
used for applications.
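
As a sketch, for a node with 16 GB of installed memory that would mean  
(hypothetical node name; edit with qconf -me node01):

    complex_values        h_vmem=16G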


> and this configuration restricts SGE to running a maximum of 6 jobs  
> on each execution host.  That's fine for us, but it is a problem for  
> jobs needing more than 1.5G of memory.   Okay, for these jobs we can  
> use the complex included in the queue configuration, but our users  
> are not sophisticated enough to specify the amount of memory their  
> applications will require to run.

Often this is handled by a wrapper used to submit the jobs, i.e. one  
that generates a jobscript on the fly and submits it. You could either  
ask the user for one single option to define a different limit, or  
provide just two scripts (on Linux often one script is used, but it is  
called via two different links, so the script changes its behavior  
according to the name it was called by), e.g. "submit" and "submit-big".
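
A minimal sketch of the second variant, assuming "submit-big" is a  
symlink to "submit" (names and limits are only examples):

    #!/bin/sh
    # Request memory according to the name the script was called by.
    case "$(basename "$0")" in
        submit-big) MEM=5G   ;;   # up to the queue's h_vmem limit
        *)          MEM=1.5G ;;   # the default consumable value
    esac
    exec qsub -l h_vmem=$MEM "$@"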


>   Indeed, we are using an application which submits jobs to the  
> cluster via DRMAA, and most of the users have no idea how their  
> jobs get onto the cluster and run there.

Great, then you could already use this layer to request a higher limit  
where needed (e.g. by putting the -l h_vmem=... request into the job's  
DRMAA native specification).


>   If we set a default request for the memory limit, we will be  
> limiting our cluster's resource usage.

The default limit is 1.5 GB in the above setup. When a job crashes  
because it needs more memory, the user has to submit it again with a  
flag to your wrapper or via the second script. Over time they will  
learn which types of jobs need more memory. If they tend to submit all  
jobs unnecessarily with the higher limit, they will have to wait  
longer, as fewer jobs can run in the cluster at a time.


> What we want to do is just to reserve about 1G of memory for the OS  
> and share the rest of the memory among all the jobs running on the  
> execution host.  Is there a way to set this up?

Until the crystal-ball feature appears in a future release of SGE: no.  
Or well: yes. You could implement a co-scheduler for SGE which mimics  
the behavior of the OOM killer (out-of-memory killer) in the Linux  
kernel, which kills processes when all memory is used up. Just let all  
jobs use all the memory they like. If the co-scheduler discovers that  
the node is running out of memory, it kills one job (i.e. qdel) chosen  
by an advanced algorithm (the user has to resubmit the job - maybe  
next time he has more luck). Then it waits until the condition arises  
again. IMO this isn't worth pursuing.
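
Just to illustrate the idea, such a co-scheduler loop running on each  
node could look roughly like this (the threshold and the "victim"  
selection are made up for illustration; again, I wouldn't recommend it):

    #!/bin/sh
    # Rough sketch: qdel one job on this node when free memory gets low.
    THRESHOLD_KB=524288                     # assumed threshold: 512 MB free
    while sleep 30; do
        free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
        if [ "$free_kb" -lt "$THRESHOLD_KB" ]; then
            # naive choice: the newest running job on this host
            victim=$(qstat -s r -u '*' | grep "@$(hostname)" | \
                     awk '{print $1}' | tail -1)
            [ -n "$victim" ] && qdel "$victim"
        fi
    done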


>   We would like to use as many resources as we can, but protect our  
> system from crashes.

The system will not crash; only the jobs will crash when they don't  
get enough memory. Did you define some scratch space for the worst case?

As an example: sometimes it's also important to request space on a  
node's local scratch disk. Otherwise the jobs would block each other  
when they fill up the local disk. Here too SGE needs some information,  
as with other resources - it's the purpose of resource requests to  
provide SGE with information about the job. There is no built-in  
prediction for any of them.
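
As a sketch, such a scratch consumable could be set up like this (the  
complex name "scratch" and all sizes are only examples):

    # 1. In the complex configuration (qconf -mc) add a consumable:
    #       scratch    scr    MEMORY    <=    YES    YES    0    0
    # 2. Attach the local disk size to each exec host (qconf -me <nodename>):
    #       complex_values   h_vmem=15G,scratch=200G
    # 3. The users (or your wrapper) then request it at submit time:
    qsub -l h_vmem=1.5G,scratch=20G jobscript.sh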

-- Reuti


> Thanks,
> -Hugo
>
> --
> Hugo R. Hernandez-Mora
> System Administrator
> Laboratory of Neuro Imaging, UCLA
> 635 Charles E. Young Drive South, Suite 225
> Los Angeles, CA 90095-7332
> Tel: 310.267.5076
> Fax: 310.206.5518
> hugo.hernandez at loni.ucla.edu
> --
>
> "Si seus esfor?os, foram vistos com indefren?a, não desanime,
> que o sol faze un espectacolo maravilhoso todas as manhãs
> cuando a maior parte das pessoas, ainda estam durmindo"
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Monday, November 03, 2008 3:09 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Jobs running but not using resources
>
> Hi Hugo,
>
> On 03.11.2008 at 19:19, Hugo Hernandez-Mora wrote:
>
>> Reuti,
>> What we want to do is to save some memory for the OS.
>
> as SGE can't predict the memory requirements of the jobs to be
> scheduled, it needs some information about them. You can set a default
> value, as is often done, either by putting a value in the "default"
> column of the complex configuration (qconf -mc), or by specifying a
> default request for each job in $SGE_ROOT/default/common/sge_request
> (i.e. the line: -l h_vmem=2g).
>
> For this setup to work, h_vmem must be consumable and have a sensible
> value attached to each exechost. Without this, it is otherwise only a
> limit per job.
>
> EXAMPLE: suppose you have a 16GB node.
>
> - You want a default of 2GB per job, so you make h_vmem consumable
> (yes in the column "consumable" in qconf -mc) and specify 2g (column
> default) in the complex configuration.
>
> - You want SGE to use only 15.5GB of the installed memory, so you
> need: "complex_values   h_vmem=15.5G" in each exec host's
> configuration (qconf -me <nodename>).
>
> - For special applications, you will allow the user to request up to
> 10g in the node. For this you need to set in the queue configuration
> "h_vmem   10g". This will be the maximum per job.
>
> So every job will get a default of 2GB, until the 15.5GB are used up
> on the node. Then no further jobs will be scheduled to it. If a
> user needs more, he can request it with e.g. "qsub -l h_vmem=8G". The
> limit for this would be the defined 10g in the queue configuration.
>
> h_vmem is an enforced limit, meaning: if the user's application
> needs even one byte more, the job will be killed.
>
> -- Reuti
>
>
>> But what we don't want to do is to tell the users they have to
>> request the amount of memory for their jobs by using a complex
>> value for h_vmem.  Can I set the complex value to a default that
>> users don't need to specify when submitting jobs to the cluster?
>> Thanks for all your help.
>> -Hugo
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Saturday, November 01, 2008 5:50 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Jobs running but not using resources
>>
>> Hi Hugo,
>>
>> On 01.11.2008 at 20:29, Hugo Hernandez-Mora wrote:
>>
>>> Reuti,
>>> The SGE version we are using is 6.1u4.  You are right about the
>>> rule for swap_total.  I have removed it, because what we are really
>>> looking for is to prevent the users from using 100% of the memory
>>> on each compute node.  Now, regarding the output of the command
>>> qquota, here is what I have:
>>>
>>>
>>> myhost> qquota -u "*"
>>> resource quota rule limit                filter
>>> ----------------------------------------------------------------------------
>>> memory_usage/1     mem_total=7g         users {*} hosts {@v20zHosts}
>>> memory_usage/2     mem_total=15g        users {*} hosts {@x2200Hosts}
>>
>> unless you made mem_total consumable (which the above output doesn't
>> suggest), this is a limit per job, which can be requested with qsub.
>> It won't enforce any limit for running jobs though, nor will it be
>> added up across jobs.
>>
>> What you can do to have enforced limits:
>>
>> - make h_vmem in the complex configuration consumable and give a
>> proper default value there
>> - attach h_vmem to each exechost under complex_values and set it equal
>> to the installed memory (or a little bit less, when you want to save
>> some memory for the OS)
>> - define an rqs
>>
>>     limit        name memory users {*} hosts {*} to h_vmem=15g
>>
>> for e.g. 32gb installed. Then you can request: qsub -l h_vmem=2g ...
>>
>> (if you only want to save some memory for the OS, you don't need an
>> RQS at all; just specify a little bit less for each exechost)
>>
>> You can do the same with virtual_free instead of h_vmem, if you trust
>> your users not to exceed the requested memory. Depending on your
>> working style, the unconditional kill in case of one byte too much
>> might not be what you want.
>>
>> (Regarding the manpage sge_resource_quota: the information that the
>> resource must be consumable is missing there. I'll file an issue.)
>>
>>> max_per_queue/1    slots=7/672          users user1 queues short.q
>>> max_per_queue/1    slots=26/672         users user2 queues short.q
>>> max_per_queue/1    slots=1/672          users user3 queues short.q
>>> max_per_queue/1    slots=46/672         users user4 queues short.q
>>> max_per_queue/2    slots=2/192          users user5 queues medium.q
>>> max_per_queue/2    slots=2/192          users user1 queues medium.q
>>> max_per_queue/2    slots=1/192          users user6 queues medium.q
>>> max_per_queue/3    slots=2/111          users user7 queues long.q
>>> max_per_queue/3    slots=1/111          users user8 queues long.q
>>> max_per_queue/3    slots=1/111          users user1 queues long.q
>>> max_per_queue/3    slots=12/111         users user7 queues long.q
>>> max_per_queue/3    slots=45/111         users user4 queues long.q
>>> max_per_queue/4    slots=109/1810       users user0 queues queue0.q
>>>
>>> user5 and user7 are running array jobs, and they are the ones
>>> reporting very low CPU usage.
>>
>> This is nothing SGE can change, I think. I would suggest running the
>> applications interactively to check how they behave, and then doing
>> the same inside the cluster.
>>
>> -- Reuti
>>
>>> Thanks for your time and your help!!!!
>>> -Hugo
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Thursday, October 30, 2008 5:28 PM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] Jobs running but not using resources
>>>
>>> Hi,
>>>
>>> On 30.10.2008 at 22:51, Hugo Hernandez-Mora wrote:
>>>
>>>> Hello all,
>>>> We have been experiencing strange behavior in our cluster since
>>>> last weekend.  Most of the jobs running in our cluster (we have
>>>> 300+ SunFire V20z and 80 SunFire X2200 nodes with 3,500+ available
>>>> slots) are not using the resources as expected.   Indeed, most of
>>>> them are not using the resources at all (0 CPU for the associated
>>>> processes).
>>>
>>> which SGE version?
>>>
>>> You mean jobs are scheduled but doing nothing? Or aren't the jobs
>>> scheduled at all?
>>>
>>>> We have set the following resource limits:
>>>>
>>>> {
>>>>    name         memory_usage
>>>>    description  Limit the memory used for all users (per machine
>>>> type)
>>>>    enabled      TRUE
>>>>    limit        users {*} hosts {@v20zHosts} to mem_total=7g
>>>>    limit        users {*} hosts {@x2200Hosts} to mem_total=15g
>>>>    limit        users {*} to swap_total=10g
>>>
>>> I'm puzzled about this last rule. Are you requesting swap_total for
>>> the jobs? If one of the former rules allow execution of the job, the
>>> follow-up rules won't be checked at all.
>>>
>>>> }
>>>> {
>>>>    name         sysadm_rule
>>>>    description  Restrict user user1 to use only 50 slots in
>>>> queue0.q queue
>>>>    enabled      TRUE
>>>>    limit        users {user1} queues queue0.q to slots=50
>>>> }
>>>> {
>>>>    name         max_per_queue
>>>>    description  Limit the maximum allowed cluster queue slots per
>>>> user
>>>>    enabled      TRUE
>>>>    limit        users {*} queues short.q to slots=672
>>>>    limit        users {*} queues medium.q to slots=192
>>>>    limit        users {*} queues long.q to slots=111
>>>>    limit        users {*} queues special.q to slots=1810
>>>> }
>>>>
>>>> For the last limit, max_per_queue, we are restricting each user to
>>>> at most the available slots in each queue, to prevent anyone from
>>>> monopolizing the cluster's resources.   The total of available
>>>> slots per queue is:
>>>>
>>>> myhost> qstat -g c
>>>> CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
>>>> --------------------------------------------------------------------------
>>>> long.q                            0.48    185      0    240     41     32
>>>> medium.q                          0.48      5     59    330    230     40
>>>> special.q                         0.57    134   1741   2190     10    325
>>>> short.q                           0.48    986      4   1140     24    142
>>>> queue0.q                          3.14    185      0    185    185      0
>>>>
>>>> We have not made any changes to our configuration.  Have any of you
>>>> experienced similar problems, or can you give me some hints about
>>>> what to check?  Any help will be greatly appreciated.
>>>> Thanks in advance,
>>>
>>> Is there any helpful output from the command:
>>>
>>> $ qquota -u "*"
>>>
>>> BTW: Giving the rules names might make the output easier to read.
>>>
>>> -- Reuti
>>>
>>>
>>>>
>>>> -Hugo
>>>>
>>>> --
>>>> Hugo R. Hernandez-Mora
>>>> System Administrator
>>>> Laboratory of Neuro Imaging, UCLA
>>>> 635 Charles E. Young Drive South, Suite 225
>>>> Los Angeles, CA 90095-7332
>>>> Tel: 310.267.5076
>>>> Fax: 310.206.5518
>>>> hugo.hernandez at loni.ucla.edu
>>>> --
>>>>
>>>> "Si seus esfor?os, foram vistos com indefren?a, não desanime,
>>>> que o sol faze un espectacolo maravilhoso todas as manhãs
>>>> cuando a maior parte das pessoas, ainda estam durmindo"
>>>>
>>>
>>>
>>
>>
>>
>
>
