[GE users] Strange problem with resource quotas in 6.2u5

icaci hristo at mc.phys.uni-sofia.bg
Mon Mar 8 10:03:46 GMT 2010


Hi,

On 08.03.2010, at 11:27, reuti wrote:

> Hi,
> 
> Am 08.03.2010 um 10:08 schrieb icaci:
> 
>> Hi, Reuti,
>> 
>> On 08.03.2010, at 01:43, reuti wrote:
>> 
>>> Hi,
>>> 
>>> Am 07.03.2010 um 23:56 schrieb icaci:
>>> 
>>>> Hello all!
>>>> 
>>>> I'm witnessing some odd behaviour of the resource quotas subsystem
>>>> of our 6.2u5 installation. We have two types of queues, each one in
>>>> both parallel and batch flavour:
>>>> - for long running jobs (p_long.q and b_long.q);
>>>> - for jobs with h_rt up to 48 hours (p_med.q and b_med.q).
>>> 
>>> What is your current setting of queue_sort_method in the scheduler
>>> configuration?
>>> 
>> 
>> queue_sort_method is set to seqno and *_med.q's get properly  
>> selected for jobs with h_rt < 48:0:0 because of their lower sequence  
>> numbers compared to *_long.q.
> 
> I think you are aware of the fact that this will also allow med jobs
> to run in the long queues when all med slots are full, as there is no
> upper limit on any resource in the queue definition.
> 

I am well aware of that. My intention is to cap long-running jobs at 48 slots. It's fine if shorter jobs consume slots from *_long.q, although I don't think that can happen, since each queue spans the entire cluster and the resource quotas limit the combined slot usage of *_med.q and *_long.q.
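
For reference, this is roughly how the ordering is arranged; the sequence numbers below are illustrative rather than our exact values:

   # current scheduler setting
   qconf -ssconf | grep queue_sort_method

   # the *_med.q queues get lower (= preferred) sequence numbers than *_long.q
   qconf -mattr queue seq_no 10 p_med.q
   qconf -mattr queue seq_no 10 b_med.q
   qconf -mattr queue seq_no 20 p_long.q
   qconf -mattr queue seq_no 20 b_long.q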

> I asked about queue_sort_method because there is an issue which also
> just hit a friend of mine and which I still cannot put into exact
> words:
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2538
> 
> 
>>>> I want to limit our users to 64 slots in total but give them only 48
>>>> slots for long running jobs so I've set up the following resource
>>>> quota ruleset:
>>>> 
>>>> {
>>>> name         users
>>>> description  Limits imposed on ordinary users
>>>> enabled      TRUE
>>>> limit        name long users {*} queues *_long.q to slots=48
>>>> limit        name total users {*} to slots=64
>>>> }
>>> 
>>> I think it must be put into two RQS. If you put it into one RQS, you
>>> can get 48 slots for jobs in *_long.q plus 64 slots for jobs not
>>> running in any *_long.q. Only the first rule whose condition matches
>>> is checked; the job is then either accepted or refused.
>>> 
>>> -- Reuti
>>> 
>> 
>> We have an additional quota set that limits each project in the same
>> manner as we limit each user. I've split all rulesets into separate
>> RQS and now qquota shows that the limits work as expected, both per
>> user and per project. I also see no complaints about exceeded limits
>> in the output of qstat -j for the sample job. There are no free slots
>> at the moment, so I'm not able to test whether it actually works.
> 
> Great. If some machines/slots are still being ignored, try setting
> queue_sort_method to load and back afterwards, to check whether you
> are also facing the effect mentioned above.
> 
> -- Reuti
> 

Resources just became available and I was able to conduct some tests. The results are beyond my comprehension. What I did was submit the same parallel sleeper job as before, which requests 56 slots and h_rt=47:59:59. The RQS in place are:

{
   name         usr_long
   description  Limits imposed on ordinary users
   enabled      TRUE
   limit        users {*} queues *_long.q to slots=48
}
{
   name         usr_med+long
   description  Limits imposed on ordinary users
   enabled      TRUE
   limit        users {*} queues *_med.q,*_long.q to slots=64
}
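
(In case someone wants to reproduce this, the rulesets can be inspected with the usual commands:

   qconf -srqsl                           # list the names of all resource quota sets
   qconf -srqs usr_long usr_med+long      # dump the two rulesets shown above
)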

This setup works and the job ends in p_med.q as expected. But if I change usr_med+long to
  limit users {*} queues * to slots=64
or to
  limit users {*} queues *.q to slots=64
or just to
  limit users {*} to slots=64
I get
  cannot run because it exceeds limit "hristo/////" in rule "usr_med+long/1"

It might somehow be connected to issue 2538, but changing queue_sort_method to load does not make the job run.
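
(For completeness, the switch was done the usual way, by editing the scheduler configuration, roughly:

   qconf -msconf    # set queue_sort_method to load, resubmit the job, then set it back to seqno
)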

Either I don't understand how RQS filter matching works, or I should do some debugging. For now I'll stick to specifying the full list of queues, which does the trick.
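
If I do get around to debugging, my plan is to start from the scheduler's own trace rather than from the source, along these lines (the path assumes the default cell name):

   qconf -tsm                                       # trigger one monitored scheduling run
   less $SGE_ROOT/default/common/schedd_runlog      # read the dispatch decisions it logged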

Thanks for your time,

Hristo

> 
>> Best regards,
>> 
>> Hristo
>> 
>>> 
>>>> But when I try to submit a simple 56-slot parallel job with
>>>> something like:
>>>> 
>>>> echo "sleep 30" | qsub -pe ompix8 56 -l h_rt=47:59:59
>>>> 
>>>> the job stays in "qw" state and qstat shows the following:
>>>> ...
>>>> cannot run because it exceeds limit "hristo/////" in rule "users/total"
>>>> cannot run because it exceeds limit "hristo/////" in rule "users/long"
>>>> ...
>>>> The 56-slot requirement clearly exceeds the 48-slot limit from the
>>>> "users/long" rule, but for some obscure reason SGE thinks that it
>>>> also exceeds the 64-slot limit from the "users/total" rule.
>>>> 
>>>> I tried to split the ruleset into two separate rulesets:
>>>> 
>>>> {
>>>> name         users_long
>>>> description  Limits imposed on ordinary users
>>>> enabled      TRUE
>>>> limit        users {*} queues *_long.q to slots=48
>>>> }
>>>> {
>>>> name         users_total
>>>> description  Limits imposed on ordinary users
>>>> enabled      TRUE
>>>> limit        users {*} to slots=64
>>>> }
>>>> 
>>>> Still no luck:
>>>> ...
>>>> cannot run because it exceeds limit "hristo/////" in rule
>>>> "users_total/1"
>>>> cannot run because it exceeds limit "hristo/////" in rule
>>>> "users_total/1"
>>>> cannot run because it exceeds limit "hristo/////" in rule
>>>> "users_total/1"
>>>> ...
>>>> 
>>>> The job runs fine if I disable the users_total rule.
>>>> 
>>>> We used to run 6.2u2_1 before we upgraded to 6.2u5, and a colleague
>>>> of mine insists that he was able to run 56-slot jobs before the
>>>> upgrade. Have I stumbled upon a bug in 6.2u5, or did I miss something
>>>> when setting up my resource quotas?
>>>> 
>>>> Any help would be greatly appreciated.
>>>> 
>>>> Hristo
>>>> --
>>>> Dr Hristo Iliev
>>>> Monte Carlo research group
>>>> Faculty of Physics, University of Sofia
>>>> 5 James Bourchier blvd, 1164 Sofia, Bulgaria
>>>> http://cluster.phys.uni-sofia.bg/hristo/
>>>> 
>>> 
>> 
>> --
>> Dr Hristo Iliev
>> Monte Carlo research group
>> Faculty of Physics, University of Sofia
>> 5 James Bourchier blvd, 1164 Sofia, Bulgaria
>> http://cluster.phys.uni-sofia.bg/hristo/
>> 
> 

--
Dr Hristo Iliev
Monte Carlo research group
Faculty of Physics, University of Sofia
5 James Bourchier blvd, 1164 Sofia, Bulgaria
http://cluster.phys.uni-sofia.bg/hristo/
