[GE users] Help with resource quota sets

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Fri Jan 25 13:59:56 GMT 2008


Hi John,

On Fri, 25 Jan 2008, jallen at it.uts.edu.au wrote:

> Hi,
>
> Firstly, sorry for the long mail.
>
> We have been experimenting with the use of the new RQS feature as a way of
> limiting resources to specific projects (using GE6.1u2). Unfortunately in
> our setup we have a legacy attribute "big_slots" which is a forced
> consumable attached to each host indicating how many processes the host
> can run, so we do:
>
> qsub -l big_slots=2 <args>
>
> We have the following example resource quota sets which are enabled
> automatically at 9AM and disabled at 9PM (by cron job) by flipping the
> "enabled" flag.
>
> % qconf -srqs
> {
>   name         project_slots
>   description  "divide the available daytime slots between projects"
>   enabled      TRUE
>   limit        projects show1 hosts !@workstations to big_slots=40
>   limit        projects show2,show3 hosts !@workstations to big_slots=214
> }
>
> In case we have this wrong, the rules are intended to do the following:
> 1. Limit show1 to a total 40 "big_slots" on non-workstations.
> 2. Limit show2+show3 to a total of 214 "big_slots" on non-workstations.
>
> The restrictions are intended to apply only to hosts that aren't in
> @workstations as they do not provide a stable and reliable resource (so
> any show can use @workstations without being counted as quota).
>
> This seems to work fine, and we even wrote a small tool around qstat to
> count quota use.
>
> % python quota.py
> {'show2+show3': 198, 'NONE': 16, 'workstations': 18, 'show1': 8}
>
> Now to the problems we are facing:
>
> If we run the qquota command it prints nothing regardless of what
> arguments are given:
>
> % qquota -P show2
> <no output>

There is nothing wrong with your RQS definition. Instead, it seems qquota
has problems with negative host scopes:

    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2472

Rephrasing the quota definition helped me get the expected qquota
output. I could reproduce this in 6.1, 6.1u2, and 6.1u3, and I'm
sure we have it in maintrunk as well.
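
For illustration only, one possible rephrasing names the non-workstation
hosts positively instead of negating @workstations. This is a sketch under
assumptions, not the confirmed #2472 workaround: @servers is a hypothetical
host group that would have to be defined to contain every host outside
@workstations, and it is assumed that a positive host group scope avoids
the qquota problem.

```
{
  name         project_slots
  description  "daytime slots per project (@servers is a hypothetical host group)"
  enabled      TRUE
  limit        projects show1 hosts @servers to big_slots=40
  limit        projects show2,show3 hosts @servers to big_slots=214
}
```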

>
> A more serious problem is that, at random (usually after 1-5 days) the
> quota starts allowing far fewer jobs to run than the quota should allow.
> For jobs submitted with -P, if we do qstat -j <job> it will mention the
> job can't run because it exceeds rule /2 of the resource quota. If we
> restart the scheduler/qmaster, the restriction gets cleared and the quota
> will work properly for a while until it breaks again.
> Any ideas why this would happen? Is it a problem with our defined RQSs or
> perhaps the use of "big_slots" consumable? Could this be a bug in GE that
> is fixed in 6.1u3/u4? Could disabling and enabling the quota with the
> enabled attribute create problems?

I may be wrong, but I would not expect rephrasing the quota to solve
it. At least, the code that causes the wrong output is used only in
qquota.

So I checked the source code to see whether it could be related to
short/long hostnames being used at debit/undebit time, but I believe I
can rule this out.

I also tested launching and finishing quota-bound jobs while the rephrased
big_slots quota was disabled. It seems to be working, except that qquota
fooled me.

Possibly the #2472 workaround helps you understand the phenomenon. If
not, we might have to add some instrumentation to qmaster so that we can
find the cause of this nasty thing.
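
Until then, an independent tally along the lines of your quota.py could be
compared against what the scheduler reports when the quota misbehaves. The
following is a minimal sketch, not your actual tool: the record format
(project, host, big_slots), the QUOTA_GROUPS and LIMITS tables, and both
function names are assumptions made for illustration, and how the records
are obtained from qstat is deliberately left out.

```python
from collections import Counter

# Hypothetical tables mirroring the two limit rules of the quota set.
QUOTA_GROUPS = {"show1": "show1", "show2": "show2+show3", "show3": "show2+show3"}
LIMITS = {"show1": 40, "show2+show3": 214}


def tally_big_slots(jobs, workstations):
    """Sum big_slots per quota group.

    jobs: iterable of (project, host, big_slots) tuples for running jobs
          (assumed input format; project may be None for jobs without -P).
    workstations: set of host names belonging to @workstations.
    """
    usage = Counter()
    for project, host, big_slots in jobs:
        if host in workstations:
            # Workstation usage is not counted against any project quota.
            usage["workstations"] += big_slots
        else:
            usage[QUOTA_GROUPS.get(project, "NONE")] += big_slots
    return dict(usage)


def remaining(usage):
    """Return the big_slots still available under each limit."""
    return {group: limit - usage.get(group, 0) for group, limit in LIMITS.items()}
```

If the remaining counts computed this way stay well above zero while
qstat -j still reports that rule /2 is exceeded, that would point at the
scheduler's bookkeeping rather than at the rule set.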

Best regards,
Andreas
