[GE users] Jobs running but not using resources

Reuti reuti at staff.uni-marburg.de
Sat Nov 1 22:49:40 GMT 2008



Hi Hugo,

On 01.11.2008, at 20:29, Hugo Hernandez-Mora wrote:

> Reuti,
> The SGE version we are using is 6.1u4.  You are right about the
> swap_total rule; I have removed it, because what we really want is
> to prevent users from using 100% of the memory on each compute node.
> Now, regarding the output of the qquota command, here is what I have:
>
>
> myhost> qquota -u "*"
> resource quota rule limit                filter
> --------------------------------------------------------------------------------
> memory_usage/1     mem_total=7g         users {*} hosts {@v20zHosts}
> memory_usage/2     mem_total=15g        users {*} hosts {@x2200Hosts}

Unless you made mem_total consumable (which the above output doesn't
suggest), this is a per-job limit that can be requested in the qsub.
It won't enforce any limit on running jobs though, nor will anything
be added up across jobs.
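
For example (just a sketch; job.sh and the values are placeholders), a
user could submit

    qsub -l mem_total=7g job.sh

and the 7g/15g limits above are only compared against that per-job
request at scheduling time; nothing is summed over the jobs already
running on a host, and the actual memory consumption of the processes
is never checked.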

What you can do to have enforced limits:

- make h_vmem in the complex configuration consumable and give it a
proper default value there
- attach h_vmem to each exechost under complex_values and set it equal
to the installed memory (or a little bit less, if you want to save
some memory for the OS)
- define an rqs

    limit        name memory users {*} hosts {*} to h_vmem=15g

for e.g. 32 GB installed. Then you can request: qsub -l h_vmem=2g ...
(a sketch of these steps with qconf follows below)

(if you only want to save some memory for the OS, you don't need an  
RQS at all; just specify a little bit less for each exechost)
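
As a rough sketch of those steps (the 30g value and the host name
node01 are only placeholders; use the memory actually installed in
your nodes):

    # 1. qconf -mc -- make h_vmem consumable and give it a default:
    #name    shortcut  type    relop requestable consumable default urgency
    h_vmem   h_vmem    MEMORY  <=    YES         YES        1g      0

    # 2. qconf -me node01 -- per exechost, a little below the installed RAM:
    complex_values        h_vmem=30g

    # 3. qconf -arqs -- the optional RQS limiting each user per host:
    {
       name         memory
       description  NONE
       enabled      TRUE
       limit        name memory users {*} hosts {*} to h_vmem=15g
    }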

You can do the same with virtual_free instead of h_vmem, if you trust
your users not to exceed the requested memory. Depending on your
working style, the unconditional kill when a job goes even one byte
over its h_vmem request might not be worth it.
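
The virtual_free setup would look the same, only with the other
attribute (again just a sketch); the request then becomes

    qsub -l virtual_free=2g ...

and SGE will only do the bookkeeping, never kill a job that uses more
than it requested.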

(Regarding the sge_resource_quota man page: the information that the
attribute must be consumable is missing there. I'll file an issue.)

> max_per_queue/1    slots=7/672          users user1 queues short.q
> max_per_queue/1    slots=26/672         users user2 queues short.q
> max_per_queue/1    slots=1/672          users user3 queues short.q
> max_per_queue/1    slots=46/672         users user4 queues short.q
> max_per_queue/2    slots=2/192          users user5 queues medium.q
> max_per_queue/2    slots=2/192          users user1 queues medium.q
> max_per_queue/2    slots=1/192          users user6 queues medium.q
> max_per_queue/3    slots=2/111          users user7 queues long.q
> max_per_queue/3    slots=1/111          users user8 queues long.q
> max_per_queue/3    slots=1/111          users user1 queues long.q
> max_per_queue/3    slots=12/111         users user7 queues long.q
> max_per_queue/3    slots=45/111         users user4 queues long.q
> max_per_queue/4    slots=109/1810       users user0 queues queue0.q
>
> user5 and user7 are running array jobs, and they are the ones
> reporting very low CPU usage.

This is nothing SGE can change, I think. I would suggest running the
applications interactively and checking how they behave, and then
doing the same inside the cluster.
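
One way to do that check on a compute node itself (just a sketch; the
memory request and the commands are only examples):

    # interactive session on an execution host through SGE
    qrsh -l h_vmem=2g
    # run the application there and watch its CPU usage
    top -u $USER
    # or take a snapshot of the accumulated CPU time per process
    ps -u $USER -o pid,pcpu,time,args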

-- Reuti

> Thanks for your time and your help!!!!
> -Hugo
>
> --
> Hugo R. Hernandez-Mora, M.Sc.
> System Administrator
> Laboratory of Neuro Imaging, UCLA
> 635 Charles E. Young Drive South, Suite 225
> Los Angeles, CA 90095-7332
> Tel: 310.267.5076
> Fax: 310.206.5518
> hugo.hernandez at loni.ucla.edu
> --
>
> "Si seus esfor?os, foram vistos com indefren?a, não desanime,
> que o sol faze un espectacolo maravilhoso todas as manhãs
> cuando a maior parte das pessoas, ainda estam durmindo"
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Thursday, October 30, 2008 5:28 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Jobs running but not using resources
>
> Hi,
>
> On 30.10.2008, at 22:51, Hugo Hernandez-Mora wrote:
>
>> Hello all,
>> We have been experiencing strange behavior in our cluster since last
>> weekend.  Most of the jobs running in our cluster (we have 300+
>> Sun Fire V20z and 80 Sun Fire X2200 nodes with 3,500+ available
>> slots) are not using the resources as expected.  Indeed, most of
>> them are not using any resources at all (0% CPU for the associated
>> processes).
>
> which SGE version?
>
> You mean jobs are scheduled but doing nothing? Or aren't the jobs
> scheduled at all?
>
>> We have set the following resource limits:
>>
>> {
>>    name         memory_usage
>>    description  Limit the memory used for all users (per machine  
>> type)
>>    enabled      TRUE
>>    limit        users {*} hosts {@v20zHosts} to mem_total=7g
>>    limit        users {*} hosts {@x2200Hosts} to mem_total=15g
>>    limit        users {*} to swap_total=10g
>
> I'm puzzled by this last rule. Are you requesting swap_total for
> the jobs? If one of the preceding rules allows execution of the job,
> the follow-up rules won't be checked at all: for example, a job that
> lands on a host in @v20zHosts is only checked against the
> mem_total=7g rule, and the swap_total=10g rule is only ever
> consulted for hosts that belong to neither host group.
>
>> }
>> {
>>    name         sysadm_rule
>>    description  Restrict user user1 to use only 50 slots in
>> queue0.q queue
>>    enabled      TRUE
>>    limit        users {user1} queues queue0.q to slots=50
>> }
>> {
>>    name         max_per_queue
>>    description  Limit the maximum allowed cluster queue slots per  
>> user
>>    enabled      TRUE
>>    limit        users {*} queues short.q to slots=672
>>    limit        users {*} queues medium.q to slots=192
>>    limit        users {*} queues long.q to slots=111
>>    limit        users {*} queues special.q to slots=1810
>> }
>>
>> With the last limit, max_per_queue, we are restricting users from
>> taking all the available slots in a queue, to prevent anyone from
>> monopolizing the cluster's resources.  The total number of available
>> slots per queue is:
>>
>> myhost> qstat -g c
>> CLUSTER QUEUE     CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
>> -------------------------------------------------------------
>> long.q              0.48    185      0    240     41     32
>> medium.q            0.48      5     59    330    230     40
>> special.q           0.57    134   1741   2190     10    325
>> short.q             0.48    986      4   1140     24    142
>> queue0.q            3.14    185      0    185    185      0
>>
>> We have not made any changes to our configuration.  Have any of you
>> experienced a similar problem, or can you give me some hints about
>> what to check?  Any help will be greatly appreciated.
>> Thanks in advance,
>
> Is there any helpful output in the command:
>
> $ qquota -u "*"
>
> BTW: Giving the rules names might make the output easier to read.
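>
> For example (only a sketch based on your rule set above), a named
> rule like
>
>     limit        name v20z_mem users {*} hosts {@v20zHosts} to mem_total=7g
>
> would show up in qquota as memory_usage/v20z_mem instead of
> memory_usage/1.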
>
> -- Reuti
>
>
>>
>> -Hugo
>>
>> --
>> Hugo R. Hernandez-Mora
>> System Administrator
>> Laboratory of Neuro Imaging, UCLA
>> 635 Charles E. Young Drive South, Suite 225
>> Los Angeles, CA 90095-7332
>> Tel: 310.267.5076
>> Fax: 310.206.5518
>> hugo.hernandez at loni.ucla.edu
>> --
>>
>> "Si seus esfor?os, foram vistos com indefren?a, não desanime,
>> que o sol faze un espectacolo maravilhoso todas as manhãs
>> cuando a maior parte das pessoas, ainda estam durmindo"
>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



