[GE users] Re: "use it or lose it" share tree scheduling

Iwona Sakrejda isakrejda at lbl.gov
Thu Jun 21 20:03:45 BST 2007


I would like to add a couple of points to this discussion.
I moved to SGE from LSF, and I very much miss the flexibility the latter
gave me in choosing what should be taken into account.

I particularly miss wall clock as a usage measure. It happens a lot that
a job, for various reasons, will "idle" on a node and not use any CPU.
It blocks a resource yet is not penalized for it. Having the option to
charge wall clock instead of CPU was an easy way to deal with that.

Having the option to charge just the number of running jobs was also
useful under certain circumstances.

And then I could weigh all those ingredients any way I wanted.
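
Just to make it concrete, here is a rough sketch of the kind of charge I
could assemble before (Python, with made-up weights and field names; SGE
does not expose anything like this, it is only an illustration): a mix
of CPU time, wall clock and a flat per-job cost, so an idle job still
pays for the node it blocks.

# Rough sketch of the kind of usage charge I mean.  The weights and the
# "job" fields are made up for illustration; this is not an SGE knob.

def charge(job, w_cpu=1.0, w_wall=0.5, w_count=10.0):
    """Combine CPU time, wall clock and a flat per-job cost.

    A job that sits idle on a node still accumulates wall clock,
    so it is no longer "free" the way it is when only CPU is charged.
    """
    return (w_cpu * job["cpu_seconds"]
            + w_wall * job["wallclock_seconds"]
            + w_count * 1.0)          # flat cost just for holding a slot

# Example: an idle job that has held a node for an hour vs. a busy one.
idle_job = {"cpu_seconds": 2.0, "wallclock_seconds": 3600.0}
busy_job = {"cpu_seconds": 3600.0, "wallclock_seconds": 3600.0}

print(charge(idle_job))   # ~1812: still pays for blocking the node
print(charge(busy_job))   # ~5410: pays for the CPU on top of that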

Another problem I am having is that array jobs seem to be overcharged
when usage is calculated (could you point me to the section of code that
deals with this? I'll be happy to read it). It looks as if each task of
an array gets charged the CPU usage of the whole array. Array jobs are
very helpful, but users are fleeing from them in droves...
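
To show what I mean by "overcharged", here is my guess at the symptom,
written out in Python purely as an illustration (this is not the actual
Grid Engine code, just what the numbers look like from the outside):

# A 100-task array where every task burned 50 CPU-seconds.  My guess at
# what seems to happen vs. what I would expect; not the real scheduler.

task_cpu = [50.0] * 100          # per-task CPU usage of one array job

def observed_charge(task_cpu):
    """What it looks like: every task is billed the whole array's usage."""
    total = sum(task_cpu)
    return [total for _ in task_cpu]         # 100 tasks x 5000 = 500000

def expected_charge(task_cpu):
    """What I would expect: each task is billed only its own usage."""
    return list(task_cpu)                    # 100 tasks x 50 = 5000

print(sum(observed_charge(task_cpu)))   # 500000.0
print(sum(expected_charge(task_cpu)))   # 5000.0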

Thank You,

iwona



Daniel Templeton wrote:
> Ryan,
>
> Interesting point.  The reason for the behavior you're seeing is that 
> if you set the halflife_decay_list to all -1's, the share tree usage 
> is only affected by jobs that are currently running.  The only data 
> the system has to go on is the accumulated resource usage of the 
> currently running jobs.  Hence, user A with his really long-running 
> job gets penalized, while user B, who is actually getting more of the
> resources, is forgiven his sins because his jobs don't hang around long
> enough to count against him.  Perhaps not exactly intuitive from 
> reading the docs, but it's all there in the source code. ;)
>
> Let's talk for a second about how you would fix this issue.  Given 
> that with halflife_decay_list as -1, the scheduler can only use 
> information from running jobs, how would you look at a snapshot of the 
> running job list and decide how to assign priorities?  You implied 
> that ignoring the accumulated resource usage would be better, but if 
> you ignore that, what have you got?  Even if you were to take, say, a 
> 1 second sampling on the jobs' usage, your numbers would still be far 
> from accurate, as the jobs will most likely not have uniform resource
> usage throughout their lifetimes.  My point is not that the Grid 
> Engine behavior in this case is optimal.  My point is only that I 
> don't see that there is an optimal solution, so it's a matter of 
> choosing your shortcomings.
>
> Let me ask the obvious question.  Have you considered using the 
> functional policy?  It is what you would expect the share tree to be 
> if it were flat and had hdl set to -1.  Another option might be to use 
> a halflife_decay_list with a very fast decay rate.  That may come 
> closer to approximating what you're trying to do than setting it to -1.
>
> Daniel
>
>> Date: Thu, 21 Jun 2007 09:09:47 -0400
>> From: Ryan Thomas <Ryan.Thomas at nuance.com>
>> Subject: "use it or lose it" share tree scheduling
>>
>>   It seems from reading the docs that if the halflife_decay_list
>> elements are set to -1, only the running jobs are used in the usage
>> calculation.  This seems to imply that it's possible to implement a
>> "use it or lose it" share tree policy, where any entity in the share
>> tree that isn't currently using its resources has no future claim on
>> them.  I think that this is a fairly intuitive and important scheduling
>> policy that should be easy to implement.
>>
>>  
>>
>> I've tried implementing this and, from reading the code, found that it's
>> not that simple.  The problem is that the current usage for a job is
>> defined to be the accumulation of all resources consumed by that job
>> over its entire run.  If all jobs were approximately the same in their
>> resource usage, there would be no problem.  But when there are wide
>> variations in job length, very strange scheduling results occur.
>>  
>>
>> Consider the simple example of 2 users who are configured in a share
>> tree to each get 50% of the cpu resources on a 10 node grid.  User A
>> always runs jobs that take 100000 seconds while user B's jobs only take
>> 10 seconds.  If we assume that A and B have enough jobs queued up to
>> keep the entire grid busy for a very long time, then the scheduler will
>> fairly quickly reach a steady-state where user A can only run 1 job
>> while user B gets 9 machines on the grid.  The problem is that user B's
>> total usage in this case can never exceed 90 because the longest his
>> jobs run is 10 seconds and he can get 9 machines on the grid.  User A's
>> usage reaches 90 when only 90 seconds have passed, and he has to wait
>> another 100000-90 seconds until his usage gets back down below user B's
>> so that he can get his next job scheduled.  This is very far from the
>> 50/50 grid split that was specified in the share tree.
>>   
>
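
P.S. Ryan's example above is easy to check on the back of an envelope.
Writing his numbers out (just my restatement of his reasoning in Python,
nothing taken from the scheduler code) shows how lopsided the split gets:

# Back-of-the-envelope check of Ryan's numbers, with "usage" defined as
# the accumulated CPU of the currently running jobs only.

nodes     = 10
job_len_a = 100000      # seconds per user-A job
job_len_b = 10          # seconds per user-B job

# Once A is down to a single running job and B holds the other nine
# nodes, B's snapshot usage can never exceed:
b_usage_ceiling = (nodes - 1) * job_len_b     # 9 * 10 = 90 CPU-seconds

# A's lone job crosses that ceiling after only:
t_a_overtakes = b_usage_ceiling               # 90 seconds of runtime

# ...and then keeps A "over budget" for the rest of its run:
t_a_locked_out = job_len_a - t_a_overtakes    # 100000 - 90 = 99910 seconds

print("A holds about", 1.0 / nodes, "of the grid")             # 0.1
print("B holds about", (nodes - 1.0) / nodes, "of the grid")   # 0.9
print("A is locked out for", t_a_locked_out, "s of every job") # 99910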

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



