[GE users] Re: "use it or lose it" share tree scheduling

Iwona Sakrejda isakrejda at lbl.gov
Wed Sep 12 23:14:09 BST 2007



Hi,

I'm coming back to an old issue:

Rayson Ho wrote:
> OK, now with the correct option name, I can google for the manpage:
>
> sge_conf(5):
>       SHARETREE_RESERVED_USAGE
>              If this parameter is set to true, reserved usage is taken for
>              the Grid Engine share tree consumption instead of measured
>              usage.
>
> So it should do what you want...
Is the reserved usage just the wallclock time actually used, or the time the
job requested?

And if a job does not specify any wallclock limit, is the limit taken from
the queue?

I want to take wallclock time into account, but only the time that was
actually used.
The man page is not quite clear about this...
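
For concreteness, the setup I have in mind is roughly the following (just a
sketch; it assumes SHARETREE_RESERVED_USAGE is switched on through
execd_params in the global configuration and that jobs request h_rt, and the
script name is only a placeholder):

    # enable reserved usage for the share tree (edit the global config)
    qconf -mconf global
        execd_params   SHARETREE_RESERVED_USAGE=true

    # a job that requests 10 hours of wallclock but finishes after 1 hour
    qsub -l h_rt=10:00:00 myjob.sh

Is such a job charged the 10 hours it requested, or the 1 hour it actually
used?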

Thank You,

Iwona



>
> Rayson
>
>
> On 6/21/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>> I looked at that entry, and it seems to me that it refers to proper
>> accounting of wall clock, not to substituting wall clock for CPU in the
>> priority calculation.......
>> The issue is not with charging, but with what is taken into account when
>> priorities based on shares and usage are calculated.....
>>
>> Iwona
>>
>> >
>> >
>> >>
>> >>> Another problem I am having is that array jobs seem to be overcharged
>> >>> when the usage is calculated (could you point me to the section of
>> >>> code that deals with it? I'll be happy to read it).  Looks like each
>> >>> array task gets the CPU usage of the whole array.  Array jobs are very
>> >>> helpful, but users are fleeing from them in droves.....
>> >>>
>> >> How to reproduce it?? Is it a parallel or serial job??
>> >>
>> >> The CPU usage is collected by the execds on each node... and then sent
>> >> to the qmaster before it gets written to the accounting file.
>> >>
>> >> Rayson
>> >>
>> >>
>> >>
>> >>> Thank You,
>> >>>
>> >>> iwona
>> >>>
>> >>>
>> >>>
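
(Those were plain serial tasks, no parallel environment.  To reproduce what I
was seeing, something along these lines should do; burn_cpu.sh is just a
placeholder for any script that burns a known amount of CPU:

    # submit a 10-task serial array, each task using roughly 60s of CPU
    qsub -t 1-10 ./burn_cpu.sh

    # once it is done, compare the per-task records against what the
    # share tree was actually charged
    qacct -j <job_id>

If each task really is charged the CPU of the whole array, the user's
share-tree usage comes out roughly 10 times too high.)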
>> >>> Daniel Templeton wrote:
>> >>>
>> >>>> Ryan,
>> >>>>
>> >>>> Interesting point.  The reason for the behavior you're seeing is that
>> >>>> if you set the halflife_decay_list to all -1's, the share tree usage
>> >>>> is only affected by jobs that are currently running.  The only data
>> >>>> the system has to go on is the accumulated resource usage of the
>> >>>> currently running jobs.  Hence, user A with his really long-running
>> >>>> job gets penalized, while user B, who is actually getting more of the
>> >>>> resources, is forgiven his sins because his jobs don't hang around
>> >>>> long enough to count against him.  Perhaps not exactly intuitive from
>> >>>> reading the docs, but it's all there in the source code. ;)
>> >>>>
>> >>>> Let's talk for a second about how you would fix this issue.  Given
>> >>>> that with halflife_decay_list as -1, the scheduler can only use
>> >>>> information from running jobs, how would you look at a snapshot of
>> >>>> the running job list and decide how to assign priorities?  You
>> >>>> implied that ignoring the accumulated resource usage would be better,
>> >>>> but if you ignore that, what have you got?  Even if you were to take,
>> >>>> say, a 1 second sampling of the jobs' usage, your numbers would still
>> >>>> be far from accurate, as the jobs will most likely not have uniform
>> >>>> resource usage throughout their lifetimes.  My point is not that the
>> >>>> Grid Engine behavior in this case is optimal.  My point is only that
>> >>>> I don't see that there is an optimal solution, so it's a matter of
>> >>>> choosing your shortcomings.
>> >>>>
>> >>>> Let me ask the obvious question.  Have you considered using the
>> >>>> functional policy?  It is what you would expect the share tree to be
>> >>>> if it were flat and had hdl set to -1.  Another option might be to
>> >>>> use a halflife_decay_list with a very fast decay rate.  That may come
>> >>>> closer to approximating what you're trying to do than setting it
>> >>>> to -1.
>> >>>>
>> >>>> Daniel
>> >>>>
>> >>>>
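
(For reference, the fast-decay variant Daniel suggests would go into the
scheduler configuration.  A sketch of what I think it would look like; I am
assuming the <usage>=<value> list form from sched_conf(5), with the value
being a half-life in minutes, so please check the man page for your version:

    qconf -msconf
        halflife_decay_list   cpu=5:mem=5:io=5

    # versus the "only running jobs count" behaviour discussed here:
    #   halflife_decay_list   cpu=-1:mem=-1:io=-1
)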
>> >>>>> Date: Thu, 21 Jun 2007 09:09:47 -0400
>> >>>>> From: Ryan Thomas <Ryan.Thomas at nuance.com>
>> >>>>> Subject: "use it or lose it" share tree scheduling
>> >>>>>
>> >>>>> It seems from reading the docs that if the halflife_decay_list
>> >>>>> elements are set to -1, only the running jobs are used in the usage
>> >>>>> calculation.  This seems to imply that it's possible to implement a
>> >>>>> "use it or lose it" share tree policy where, if any entity in the
>> >>>>> share tree isn't currently using its resources, it will have no
>> >>>>> future claim on them.  I think that this is a fairly intuitive and
>> >>>>> important scheduling policy that should be easy to implement.
>> >>>>>
>> >>>>> I've tried implementing this and found, by reading the code, that
>> >>>>> it's not that simple.  The problem is that current usage for a job
>> >>>>> is defined to be the accumulation of all resources consumed by that
>> >>>>> job over its entire run.  If all jobs were approximately the same in
>> >>>>> their resource usage then there would be no problem.  Where there
>> >>>>> are wide variations in job length, very strange scheduling results
>> >>>>> occur.
>> >>>>>
>> >>>>> Consider the simple example of 2 users who are configured in a share
>> >>>>> tree to each get 50% of the cpu resources on a 10 node grid.  User A
>> >>>>> always runs jobs that take 100000 seconds while user B's jobs only
>> >>>>> take 10 seconds.  If we assume that A and B have enough jobs queued
>> >>>>> up to keep the entire grid busy for a very long time, then the
>> >>>>> scheduler will fairly quickly reach a steady state where user A can
>> >>>>> only run 1 job while user B gets 9 machines on the grid.  The
>> >>>>> problem is that user B's total usage in this case can never exceed
>> >>>>> 90, because the longest his jobs run is 10 seconds and he can get 9
>> >>>>> machines on the grid.  User A's usage reaches 90 when only 90
>> >>>>> seconds have passed, and he has to wait another 100000-90 seconds
>> >>>>> until his usage gets back down below user B's so that he can get his
>> >>>>> next job scheduled.  This is very far from the 50/50 grid split that
>> >>>>> was specified in the share tree.
>> >>>>>
>> >>>>>
>> >>>>
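
Re-stating Ryan's steady state with the numbers, as I understand it:

    user B:  9 nodes x 10 s jobs   ->  at most 9 x 10 = 90 s of usage from
             running jobs at any instant
    user A:  1 node x 100000 s job ->  accumulated usage passes 90 s after
             about 90 s and keeps growing until the job ends, so A stays
             above B for roughly 100000 - 90 seconds

So the long-run split is about 1:9 instead of the configured 50/50.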

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list