[GE users] Strange problem with resource quotas in 6.2u5

reuti reuti at staff.uni-marburg.de
Mon Mar 8 09:27:17 GMT 2010


Hi,

On 08.03.2010 at 10:08, icaci wrote:

> Hi, Reuti,
>
> On 08.03.2010, at 01:43, reuti wrote:
>
>> Hi,
>>
>> On 07.03.2010 at 23:56, icaci wrote:
>>
>>> Hello all!
>>>
>>> I'm witnessing some odd behaviour of the resource quotas subsystem
>>> of our 6.2u5 installation. We have two types of queues, each one in
>>> both parallel and batch flavour:
>>> - for long running jobs (p_long.q and b_long.q);
>>> - for jobs with h_rt up to 48 hours (p_med.q and b_med.q).
>>
>> What is your current setting of queue_sort_method in the scheduler
>> configuration?
>>
>
> queue_sort_method is set to seqno and *_med.q's get properly  
> selected for jobs with h_rt < 48:0:0 because of their lower sequence  
> numbers compared to *_long.q.

You are probably aware that this also allows med jobs to run in the
long queues once all med slots are full, as there is no upper limit
for any resource in the queue definition.
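
To make that spill-over concrete, here is a rough sketch of seqno-style
queue selection. It is only a toy model, not the scheduler's actual
code: the sequence numbers, the free-slot counts and the function name
pick_queue are made up for illustration, and only the queue names come
from this thread.

```python
# Toy model: name -> (seq_no, free_slots). All numbers are hypothetical.
def pick_queue(queues):
    """Return the queue with the lowest seq_no that still has a free slot."""
    candidates = [(seq_no, name)
                  for name, (seq_no, free) in queues.items()
                  if free > 0]
    return min(candidates)[1] if candidates else None


# While p_med.q has free slots, it wins because of its lower seq_no.
print(pick_queue({"p_med.q": (10, 4), "p_long.q": (20, 12)}))

# Once p_med.q is full, nothing in this model stops the job from
# landing in p_long.q, since no queue limit (e.g. h_rt) excludes it.
print(pick_queue({"p_med.q": (10, 0), "p_long.q": (20, 12)}))
```

If med jobs should never end up in the long queues, something besides
the sequence number has to keep them out.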

I asked about queue_sort_method because there is an issue which also
just hit a friend of mine, and which I still cannot describe
precisely:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=2538


>>> I want to limit our users to 64 slots in total but give them only 48
>>> slots for long running jobs so I've set up the following resource
>>> quota ruleset:
>>>
>>> {
>>> name         users
>>> description  Limits imposed on ordinary users
>>> enabled      TRUE
>>> limit        name long users {*} queues *_long.q to slots=48
>>> limit        name total users {*} to slots=64
>>> }
>>
>> I think it must be split into two RQS. If you put both rules into one
>> RQS, you can get 48 slots for jobs in *_long.q plus 64 slots for jobs
>> not running in any *_long.q. Only the first rule which fits the
>> condition is checked. Then the job is either accepted or refused.
>>
>> -- Reuti
>>
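
The first-match behaviour I described above can be sketched roughly as
follows. This is a simplified model under my own assumptions (one user,
one-slot jobs, usage booked against the first matching rule of every
rule set), not the real scheduler logic; the data layout and the names
first_match and max_slots are hypothetical.

```python
from fnmatch import fnmatch

# A rule is (name, queue pattern or None for "any queue", slot limit).
SINGLE_RQS = [[("long", "*_long.q", 48), ("total", None, 64)]]
SPLIT_RQS = [[("long", "*_long.q", 48)], [("total", None, 64)]]


def first_match(ruleset, queue):
    """Return the first rule of the set whose filter matches the queue."""
    for rule in ruleset:
        _name, pattern, _limit = rule
        if pattern is None or fnmatch(queue, pattern):
            return rule
    return None


def max_slots(rulesets, requests):
    """Admit (queue, slots) requests in order; return total slots granted."""
    usage = {}  # (rule set index, rule name) -> slots already booked
    granted = 0
    for queue, slots in requests:
        hits = []
        for index, ruleset in enumerate(rulesets):
            rule = first_match(ruleset, queue)
            if rule is not None:
                hits.append(((index, rule[0]), rule[2]))
        # The request must fit under the matching rule of every rule set.
        if all(usage.get(key, 0) + slots <= limit for key, limit in hits):
            for key, _limit in hits:
                usage[key] = usage.get(key, 0) + slots
            granted += slots
    return granted


# 100 one-slot jobs aimed at a long queue, then 100 at a med queue.
requests = [("p_long.q", 1)] * 100 + [("p_med.q", 1)] * 100
print(max_slots(SINGLE_RQS, requests))  # 112: 48 long + 64 elsewhere
print(max_slots(SPLIT_RQS, requests))   # 64: long jobs count twice
```

In the single rule set a long job stops at the "long" rule and never
touches "total", which is where the extra 48 slots come from.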
>
> We have an additional quota set that limits each project in the same
> manner as we limit each user. I've split all rulesets into separate
> RQS and now qquota shows that limits work as expected, both per user
> and per project. I also see no complaints about exceeded limits in
> the output of qstat -j for the sample job. There are no free slots at
> the moment, so I'm not able to test whether the job actually runs.

Great. If it still ignores some machines/slots, try setting
queue_sort_method to load and then back to seqno, to check whether you
are also hitting the effect mentioned above.
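
When checking, it can help to tally the scheduler messages per rule
instead of reading them one by one. A small sketch, assuming the
messages look exactly like the ones quoted in this thread and are not
wrapped; the function name blocking_rules is hypothetical.

```python
import re
from collections import Counter

# Matches: cannot run because it exceeds limit "<filter>" in rule "<rule>"
PATTERN = re.compile(r'exceeds limit "([^"]*)" in rule "([^"]*)"')


def blocking_rules(text):
    """Count how often each rule is named in 'exceeds limit' messages."""
    return Counter(rule for _limit, rule in PATTERN.findall(text))


sample = '''
cannot run because it exceeds limit "hristo/////" in rule "users/total"
cannot run because it exceeds limit "hristo/////" in rule "users/long"
'''
for rule, count in blocking_rules(sample).items():
    print(rule, count)
```

Feed it the saved output of qstat -j for the waiting job.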

-- Reuti


> Best regards,
>
> Hristo
>
>>
>>> But when I try to submit a simple 56-slot parallel job with
>>> something like:
>>>
>>> echo "sleep 30" | qsub -pe ompix8 56 -l h_rt=47:59:59
>>>
>>> the job stays in "qw" state and qstat shows the following:
>>> ...
>>> cannot run because it exceeds limit "hristo/////" in rule "users/total"
>>> cannot run because it exceeds limit "hristo/////" in rule "users/long"
>>> ...
>>> The 56-slot requirement clearly exceeds the 48-slot limit from the
>>> "users/long" rule, but for some obscure reason SGE thinks that it
>>> also exceeds the 64-slot limit from the "users/total" rule.
>>>
>>> I tried to split the ruleset into two separate rules:
>>>
>>> {
>>> name         users_long
>>> description  Limits imposed on ordinary users
>>> enabled      TRUE
>>> limit        users {*} queues *_long.q to slots=48
>>> }
>>> {
>>> name         users_total
>>> description  Limits imposed on ordinary users
>>> enabled      TRUE
>>> limit        users {*} to slots=64
>>> }
>>>
>>> Still no luck:
>>> ...
>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>> cannot run because it exceeds limit "hristo/////" in rule "users_total/1"
>>> ...
>>>
>>> The job runs fine if I disable the users_total rule.
>>>
>>> We used to run 6.2u2_1 before we upgraded to 6.2u5 and a colleague
>>> of mine insists that he was able to run 56-slot jobs before the
>>> upgrade. Have I stumbled upon a bug in 6.2u5 or did I miss the point
>>> in setting up my resource quotas?
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Hristo
>>> --
>>> Dr Hristo Iliev
>>> Monte Carlo research group
>>> Faculty of Physics, University of Sofia
>>> 5 James Bourchier blvd, 1164 Sofia, Bulgaria
>>> http://cluster.phys.uni-sofia.bg/hristo/
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=247462
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=247503

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


