[GE users] "use it or lose it" share tree scheduling
Ryan.Thomas at nuance.com
Thu Jun 21 14:09:47 BST 2007
It seems from reading the docs that if the halflife_decay_list elements
are set to -1, only the running jobs are used in the usage calculation.
This seems to imply that it's possible to implement a "use it or lose
it" share tree policy, where any entity in the share tree that isn't
currently using its resources has no future claim on them. I think that
this is a fairly intuitive and important scheduling policy that should
be easy to implement.
I've tried implementing this and found, by reading the code, that it's
not that simple. The problem is that current usage for a job is
defined to be the accumulation of all resources consumed by that job
over its entire run. If all jobs were approximately the same in their
resource usage then there would be no problem. If there are wide
variations in job length, however, then very strange scheduling results.
Consider the simple example of 2 users who are configured in a share
tree to each get 50% of the cpu resources on a 10 node grid. User A
always runs jobs that take 100000 seconds while user B's jobs only take
10 seconds. If we assume that A and B have enough jobs queued up to
keep the entire grid busy for a very long time, then the scheduler will
fairly quickly reach a steady-state where user A can only run 1 job
while user B gets 9 machines on the grid. The problem is that user B's
total usage in this case can never exceed 90 because the longest his
jobs run is 10 seconds and he can get 9 machines on the grid. User A's
usage reaches 90 when only 90 seconds have passed, and he has to wait
another 100000-90 seconds until his usage gets down below user B's so
that he can get his next job scheduled. This is very far from the 50/50
grid split that was specified in the share tree.
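The arithmetic of this example can be written out as a rough sketch. It
is not Grid Engine code; the function and variable names are made up
for illustration, and it assumes each running job accrues one unit of
usage per second, as in the description above.

```python
# Back-of-the-envelope check of the two-user example (not Grid Engine code).
# Assumption: every running job accrues 1 unit of usage per second.

NODES = 10
A_JOB_SECONDS = 100_000   # length of each of user A's jobs
B_JOB_SECONDS = 10        # length of each of user B's jobs


def steady_state(a_running=1):
    """Return the figures quoted in the example for the claimed steady state."""
    b_running = NODES - a_running

    # B's accumulated usage is capped: at most b_running jobs, each of
    # which has run for at most B_JOB_SECONDS.
    b_usage_cap = b_running * B_JOB_SECONDS

    # A's jobs accrue a_running units per second, so A passes B's cap
    # after this many seconds.
    seconds_until_a_passes_b = b_usage_cap // a_running

    # A then stays above B until the long job finishes.
    a_wait = A_JOB_SECONDS - seconds_until_a_passes_b

    a_share = a_running / NODES
    return b_usage_cap, seconds_until_a_passes_b, a_wait, a_share


if __name__ == "__main__":
    cap, crossover, wait, share = steady_state()
    print("B usage cap:", cap)
    print("A passes B after (s):", crossover)
    print("A then waits (s):", wait)
    print("A share of grid:", share, "(configured: 0.5)")
```

With one running job for A, this gives a cap of 90 for B, a crossover at
90 seconds, a wait of 99910 seconds, and a 10% share for A.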
It seems to me that whenever halflife_decay_list = -1 (the infinitely
fast decay case), the usage for a job should be either an instantaneous
sampling of the current job usage or some sort of average resource
usage over the time of the job, if the instantaneous sample is deemed
to be too noisy.
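To make the difference concrete, here is a small sketch comparing the
accumulated usage with the averaged usage suggested above. It is only
an illustration under my own assumptions, not the actual scheduler
code: RunningJob, accumulated_usage and average_rate_usage are
hypothetical names, and usage is reduced to CPU seconds.

```python
# Illustration only; none of these names exist in Grid Engine.
from dataclasses import dataclass


@dataclass
class RunningJob:
    cpu_seconds: float       # CPU time the job has consumed so far
    elapsed_seconds: float   # wall-clock time since the job started


def accumulated_usage(jobs):
    """Usage as described above: total consumption of the running jobs."""
    return sum(j.cpu_seconds for j in jobs)


def average_rate_usage(jobs):
    """Proposed alternative: each job counts by its average usage rate."""
    return sum(j.cpu_seconds / j.elapsed_seconds
               for j in jobs if j.elapsed_seconds > 0)


# Five long jobs halfway through their run vs. five short jobs halfway through.
user_a = [RunningJob(50_000.0, 50_000.0) for _ in range(5)]
user_b = [RunningJob(5.0, 5.0) for _ in range(5)]

if __name__ == "__main__":
    print("accumulated:", accumulated_usage(user_a), accumulated_usage(user_b))
    print("average rate:", average_rate_usage(user_a), average_rate_usage(user_b))
```

With the accumulated measure the two users differ by a factor of 10000
even though each holds five nodes; with the averaged measure both come
out at 5, which is what a 50/50 split should look like.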
Is anyone using a negative halflife_decay_list parameter successfully?
Would anyone object to a patch that attempts to implement the change in
how usage for running jobs is computed in the case that
halflife_decay_list has a negative parameter?