[GE users] Re: "use it or lose it" share tree scheduling

Rayson Ho rayrayson at gmail.com
Thu Jun 21 20:29:06 BST 2007

On 6/21/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
> I miss the wall clock very much. It happens a lot that a job, for various
> reasons, will "idle" on a node and not use CPU. It blocks a resource and is not
> penalized for it. Having an option to use wall clock instead of CPU
> was an easy way to deal with it.

I remember Andy mentioned a way to do this -- I didn't save the
original message in my (own) DB...

> Another problem I am having is that array jobs seem to be overcharged
> when the usage is calculated (could you point me to the section of code that
> deals with it? I'll be happy to read it). Looks like each array job gets
> the CPU usage of the whole array. Array jobs are very helpful but users are
> fleeing from them in droves.....

How can this be reproduced? Is it a parallel or a serial job?

The CPU usage is collected by the execds on each node... and then sent
to the qmaster before it gets written to the accounting file.


> Thank You,
> iwona
> Daniel Templeton wrote:
> > Ryan,
> >
> > Interesting point.  The reason for the behavior you're seeing is that
> > if you set the halflife_decay_list to all -1's, the share tree usage
> > is only affected by jobs that are currently running.  The only data
> > the system has to go on is the accumulated resource usage of the
> > currently running jobs.  Hence, user A with his really long-running
> > job gets penalized, while user B who is actually getting more of the
> > resources is forgiven his sins because his jobs don't hang around long
> > enough to count against him.  Perhaps not exactly intuitive from
> > reading the docs, but it's all there in the source code. ;)
> >
> > Let's talk for a second about how you would fix this issue.  Given
> > that with halflife_decay_list as -1, the scheduler can only use
> > information from running jobs, how would you look at a snapshot of the
> > running job list and decide how to assign priorities?  You implied
> > that ignoring the accumulated resource usage would be better, but if
> > you ignore that, what have you got?  Even if you were to take, say, a
> > 1 second sampling on the jobs' usage, your numbers would still be far
> > from accurate, as the jobs will most likely not have uniform resource
> > usage throughout their lifetimes.  My point is not that the Grid
> > Engine behavior in this case is optimal.  My point is only that I
> > don't see that there is an optimal solution, so it's a matter of
> > choosing your shortcomings.
> >
> > Let me ask the obvious question.  Have you considered using the
> > functional policy?  It is what you would expect the share tree to be
> > if it were flat and had hdl set to -1.  Another option might be to use
> > a halflife_decay_list with a very fast decay rate.  That may come
> > closer to approximating what you're trying to do than setting it to -1.
> >
> > Daniel
> >
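To illustrate Daniel's suggestion of a very fast decay rate, here is a minimal sketch of generic half-life decay. The function name and the sample numbers are assumptions made for illustration; this is not Grid Engine's actual usage calculation, which may differ in detail.

```python
def decayed_usage(usage, elapsed_minutes, half_life_minutes):
    """Return past usage after exponential half-life decay.

    Generic textbook formula, assumed here for illustration only:
    the remaining usage halves every `half_life_minutes`.
    """
    if half_life_minutes <= 0:
        raise ValueError("half_life_minutes must be positive")
    return usage * 0.5 ** (elapsed_minutes / half_life_minutes)


# Hypothetical comparison: 1000 CPU-seconds of past usage, one hour later.
fast = decayed_usage(1000.0, elapsed_minutes=60, half_life_minutes=1)
slow = decayed_usage(1000.0, elapsed_minutes=60, half_life_minutes=7 * 24 * 60)
print(f"fast decay: {fast:.6f}  slow decay: {slow:.1f}")
```

With a short half-life, finished jobs stop counting against a user almost immediately, which comes close to the "running jobs only" behaviour while still keeping some record of very recent usage.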
> >> Date: Thu, 21 Jun 2007 09:09:47 -0400
> >> From: Ryan Thomas <Ryan.Thomas at nuance.com>
> >> Subject: "use it or lose it" share tree scheduling
> >>
> >>   It seems from reading the docs that if the halflife_decay_list
> >> elements are set to -1, only the running jobs are used in the usage
> >> calculation.
> >> This seems to imply that it's possible to implement a "use it or lose
> >> it" share tree policy where any entity in the share tree that isn't
> >> currently using its resources will have no future claim on
> >> them.  I think that this is a fairly intuitive and important scheduling
> >> policy that should be easy to implement.
> >>
> >> I've tried implementing this and found, from reading the code, that it's
> >> not that simple.  The problem is that current usage for a job is
> >> defined to be the accumulation of all resources consumed by that job
> >> over its entire run.  If all jobs were approximately the same in their
> >> resource usage then there would be no problem.  When there are wide
> >> variations in job length, however, very strange scheduling results
> >> occur.
> >>
> >> Consider the simple example of 2 users who are configured in a share
> >> tree to each get 50% of the cpu resources on a 10 node grid.  User A
> >> always runs jobs that take 100000 seconds while user B's jobs only take
> >> 10 seconds.  If we assume that A and B have enough jobs queued up to
> >> keep the entire grid busy for a very long time, then the scheduler will
> >> fairly quickly reach a steady-state where user A can only run 1 job
> >> while user B gets 9 machines on the grid.  The problem is that user B's
> >> total usage in this case can never exceed 90 because the longest his
> >> jobs run is 10 seconds and he can get 9 machines on the grid.  User A's
> >> usage reaches 90 when only 90 seconds have passed and he has to wait
> >> another 100000-90 seconds until his usage gets down below user B's so
> >> that he can get his next job scheduled.  This is very far from the
> >> 50/50 grid split that was specified in the share tree.
> >>
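The arithmetic in Ryan's example can be checked with a short sketch. It only reproduces the steady state he describes (user A holding 1 node, user B holding 9). It assumes each running job accrues one CPU-second per wall-clock second. The constant and function names are made up for illustration and do not come from Grid Engine.

```python
NODES = 10               # size of the grid in the example
A_JOB_SECONDS = 100_000  # run time of each of user A's jobs
B_JOB_SECONDS = 10       # run time of each of user B's jobs
A_RUNNING = 1            # steady state described in the example
B_RUNNING = NODES - A_RUNNING


def steady_state():
    """Return the key numbers of the described steady state."""
    # B's running jobs can each have accumulated at most B_JOB_SECONDS.
    b_usage_cap = B_RUNNING * B_JOB_SECONDS
    # A's running jobs accrue A_RUNNING CPU-seconds per second (assumption).
    seconds_until_a_exceeds_b = b_usage_cap // A_RUNNING
    # From then until A's job ends, every freed node goes to B.
    seconds_a_is_blocked = A_JOB_SECONDS - seconds_until_a_exceeds_b
    return {
        "b_usage_cap": b_usage_cap,
        "seconds_until_a_exceeds_b": seconds_until_a_exceeds_b,
        "seconds_a_is_blocked": seconds_a_is_blocked,
        "a_share": A_RUNNING / NODES,
        "b_share": B_RUNNING / NODES,
    }


if __name__ == "__main__":
    for name, value in steady_state().items():
        print(f"{name}: {value}")
```

Under these assumptions user A's running usage exceeds user B's for 99910 of every 100000 seconds, so the configured 50/50 split ends up as roughly 10/90.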
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net