[GE users] Re: "use it or lose it" share tree scheduling

Ryan Thomas Ryan.Thomas at nuance.com
Thu Jun 21 17:15:01 BST 2007

I've been using the functional policy but want the additional
flexibility of the share tree.  I find the functional policy to be too
fiddly with parameters and like the fact that the share tree actually
tries to figure out how to assign tickets to make the priorities right.

I think the solution to noisy instantaneous measurements that works for
me is setting the SHARETREE_RESERVED_USAGE execd parameter so that the
resource isn't sampled but is constant for the entire time.

I also tried really short half-lives but found that didn't work either.
I think that the reason boils down to the same thing, namely that the
usage of running jobs isn't decayed at all.  The negative half-life is
just an extreme example of this.

The usage for a running process (consuming one unit per second) should
never exceed T/ln(2), where T is the half-life, because that is the
integral over all time of the exponential decay function 2^(-t/T).  SGE
ignores this, which leaves long-running jobs with very high share tree
usage and therefore a very poor share tree priority.
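
As a rough illustration of the bound above, here is a minimal Python
sketch.  It assumes a job that consumes at a constant rate and a
scheduler that decays a running job's usage continuously, which, per
the point above, is not what SGE does.  The function names are made up
for this example.  With those assumptions the decayed usage after t
seconds is rate * T/ln(2) * (1 - 2^(-t/T)), which levels off at
T/ln(2), while the undecayed usage keeps growing with t.

```python
import math

def decayed_usage(rate, elapsed, half_life):
    """Closed form: usage of a job consuming `rate` units/s for `elapsed`
    seconds when every unit consumed decays with the given half-life.
    (Assumption for illustration; SGE does not decay running jobs.)"""
    tau = half_life / math.log(2)                  # T / ln(2)
    return rate * tau * (1.0 - 2.0 ** (-elapsed / half_life))

def decayed_usage_numeric(rate, elapsed, half_life, steps=100_000):
    """Midpoint-rule integration of the same quantity, as a cross-check."""
    dt = elapsed / steps
    total = 0.0
    for i in range(steps):
        age = elapsed - (i + 0.5) * dt             # how long ago this slice was consumed
        total += rate * 2.0 ** (-age / half_life) * dt
    return total

half_life = 3600.0                                 # hypothetical 1 hour half-life
bound = half_life / math.log(2)
for elapsed in (60, 3600, 36_000, 360_000):
    print(elapsed, "undecayed:", float(elapsed),
          "decayed:", decayed_usage(1.0, elapsed, half_life))
print("bound T/ln(2):", bound)
```

The gap between the two columns is what a long-running job is charged
beyond what the same consumption would cost once the job had finished
and started decaying.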

-----Original Message-----
From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
Sent: Thursday, June 21, 2007 11:07 AM
To: users at gridengine.sunsource.net
Subject: [GE users] Re: "use it or lose it" share tree scheduling


Interesting point.  The reason for the behavior you're seeing is that if
you set the halflife_decay_list to all -1's, the share tree usage is
only affected by jobs that are currently running.  The only data the
system has to go on is the accumulated resource usage of the currently
running jobs.  Hence, user A with his really long-running job gets
penalized, while user B, who is actually getting more of the resources,
is forgiven his sins because his jobs don't hang around long enough to
count against him.  Perhaps not exactly intuitive from reading the docs,
but it's all there in the source code. ;)

Let's talk for a second about how you would fix this issue.  Given that
with halflife_decay_list as -1, the scheduler can only use information
from running jobs, how would you look at a snapshot of the running job
list and decide how to assign priorities?  You implied that ignoring the
accumulated resource usage would be better, but if you ignore that, what
have you got?  Even if you were to take, say, a 1 second sampling on the
jobs' usage, your numbers would still be far from accurate, as the jobs
will most likely not have uniform resource usage throughout their
lifetimes.  My point is not that the Grid Engine behavior in this case
is optimal.  My point is only that I don't see that there is an optimal
solution, so it's a matter of choosing your shortcomings.
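
To make the sampling point concrete, here is a small Python sketch
using an invented per-second CPU trace for a single job (30 s busy,
60 s waiting on I/O, 10 s busy again).  The trace and the helper names
are hypothetical and only show how far a single 1 second sample can sit
from the job's average over its lifetime.

```python
# Hypothetical per-second CPU utilisation of one job (invented numbers):
# 30 s busy, 60 s idle waiting on I/O, 10 s busy again.
trace = [1.0] * 30 + [0.0] * 60 + [1.0] * 10

def lifetime_mean(samples):
    """Average utilisation over the whole run."""
    return sum(samples) / len(samples)

def worst_sample_error(samples):
    """Largest gap between any single 1 s sample and the lifetime mean."""
    mean = lifetime_mean(samples)
    return max(abs(s - mean) for s in samples)

print("lifetime mean:", lifetime_mean(trace))
print("worst 1 s sample error:", worst_sample_error(trace))
```

Every individual sample here reads either fully busy or fully idle, so
no single snapshot matches the job's real average.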

Let me ask the obvious question.  Have you considered using the
functional policy?  It is what you would expect the share tree to be if
it were flat and had halflife_decay_list set to -1.  Another option
might be to use a halflife_decay_list with a very fast decay rate.  That
may come closer to approximating what you're trying to do than setting
it to -1.


> Date: Thu, 21 Jun 2007 09:09:47 -0400
> From: Ryan Thomas <Ryan.Thomas at nuance.com>
> Subject: "use it or lose it" share tree scheduling
> It seems from reading the docs that if the halflife_decay_list
> are set to -1 that only the running jobs are used in usage
> This seems to imply that it's possible to implement a "use it or lose
> it" share tree policy where if any entity in the share tree isn't
> currently using its resources that they will have no future claim on
> them.  I think that this is a fairly intuitive and important
> policy that should be easy to implement.
> I've tried implementing this and found that it's not that simple by
> reading the code.  The problem is that current usage for a job is
> defined to be the accumulation of all resources consumed by that job
> over its entire run.  If all jobs were approximately the same in
> resource usage then there would be no problem.  In the case that there
> are wide variations in job length then very strange scheduling results
> occur.  
> Consider the simple example of 2 users who are configured in a share
> tree to each get 50% of the cpu resources on a 10 node grid.  User A
> always runs jobs that take 100000 seconds while user B's jobs take only
> 10 seconds.  If we assume that A and B have enough jobs queued up to
> keep the entire grid busy for a very long time, then the scheduler will
> fairly quickly reach a steady-state where user A can only run 1 job
> while user B gets 9 machines on the grid.  The problem is that user B's
> total usage in this case can never exceed 90 because the longest his
> jobs run is 10 seconds and he can get 9 machines on the grid.  User
> usage reaches 90 when only 90 seconds have passed and he has to wait
> another 100000-90 seconds until his usage gets down below user A's so
> that he can get his next job scheduled.  This is very far from a 50/50
> grid split that was specified in the share tree.
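
The arithmetic in the quoted example can be reproduced with a short
Python sketch.  It takes the steady state described in the post (user A
on 1 node, user B on 9) as given rather than simulating the scheduler,
and it assumes each running job accrues one unit of usage per second.
The function and key names are made up for this illustration.

```python
def example_numbers(nodes=10, len_a=100_000, len_b=10, nodes_a=1):
    """Numbers from the quoted 2-user example, assuming 1 unit of usage
    per running job per second and the steady state stated in the post."""
    nodes_b = nodes - nodes_a
    cap_b = nodes_b * len_b        # B's running usage can never exceed this
    t_cross = cap_b / nodes_a      # seconds until A's running usage passes cap_b
    wait = len_a - t_cross         # remainder of A's job, spent above B's cap
    return {
        "cap_b": cap_b,
        "t_cross": t_cross,
        "wait": wait,
        "share_a": nodes_a / nodes,  # fraction of the grid A actually holds
    }

for name, value in example_numbers().items():
    print(name, value)
```

The share_a value is the figure to compare against the 50% that the
share tree was configured to give user A.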

To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
