[GE users] Improving throughput/responsiveness
aaron at cs.york.ac.uk
Wed May 5 09:29:39 BST 2004
We currently have a series of queues, essentially:
- very long (restricted access, 2 weeks, priority 0, 4 slots)
- long (1 week, priority 0, 8 slots)
- medium (96 hours, priority -20, 8 slots)
- short (24 hours, priority 0, 12 slots)
This has been working pretty well, with long-term
loads of 50 to 80%.
Note that the total number of slots exceeds the number
of processors (20). This is intended to allow the system to
be CPU saturated with jobs of longer or shorter types,
so that even if no short jobs are submitted the CPUs are
used as much as possible.
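To make the oversubscription concrete, here is a minimal
sketch of the arithmetic. The slot counts and processor count
are the ones given above; the names QUEUE_SLOTS, PROCESSORS
and oversubscription are made up for illustration and are not
Grid Engine settings.

```python
# Slot counts per queue, taken from the queue list above.
QUEUE_SLOTS = {"very_long": 4, "long": 8, "medium": 8, "short": 12}
PROCESSORS = 20  # number of processors in the system


def oversubscription(queue_slots, processors):
    """Return (total slots, ratio of slots to processors)."""
    total = sum(queue_slots.values())
    return total, total / processors


total_slots, ratio = oversubscription(QUEUE_SLOTS, PROCESSORS)
print(f"{total_slots} slots on {PROCESSORS} CPUs, ratio {ratio:.2f}")
```

So if every slot were filled there would be 32 jobs competing
for 20 processors.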
We were having an issue with memory starvation (the suggested
changes are not yet implemented, and are due for implementation
next week), but I have also introduced some changes to the share
system. Previously usage was totally unrestricted, in
an attempt to soak up all spare CPU cycles while the number
of users was comparatively small. The number of users
and their usage have since grown steadily to something more
substantial, so I modified the share system to enforce
shares per user (an equal share for every user). However, to
try and soak up CPU time I also set a compensation factor
of 2, and a short half life of 1 week.
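As a rough illustration of what a one week half life means
for past usage, here is a minimal sketch. It assumes plain
exponential decay, which is my reading of "half life" and may
not match the scheduler's accounting in detail. The function
name decayed_usage is hypothetical, and the compensation
factor is not modelled here.

```python
HALF_LIFE_HOURS = 7 * 24  # one week, as configured above


def decayed_usage(usage, age_hours, half_life_hours=HALF_LIFE_HOURS):
    """Weight of past usage after age_hours, assuming simple
    exponential decay with the given half life."""
    return usage * 0.5 ** (age_hours / half_life_hours)


# 100 CPU-hours used one, two and four weeks ago.
for weeks in (1, 2, 4):
    print(weeks, decayed_usage(100.0, weeks * 7 * 24))
```

Under that assumption, usage from a week ago counts half as
much as usage from today, and usage from a month ago counts
for very little.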
Users are now complaining that the short queue is difficult
to get access to. Since users are not really using checkpointing,
many of them are using the longer queues (hence the
length of these queues, which is longer than ideal).
The priorities were copied from another site's configuration,
and I am wondering if these are sensible, or if there are
other schemes that might lead to good responsiveness for users
wanting to run short jobs and longer jobs.
What I was considering was splitting the queues and creating
a series of high priority queues, and low priority queues,
with other parameters (apart from slots and priority) remaining
the same as they are currently. Would this be a sensible step,
or are there better techniques? Should I change the balance
of slots in the queues to ensure that even a system full of
long jobs leaves CPUs free for short jobs?
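To put numbers on the slot balance question, here is a small
sketch. It assumes each occupied slot ties up one CPU, which
is a simplification, and the "trimmed" slot counts are purely
hypothetical values for comparison, not a proposal anyone has
tested. The function name cpus_left_for_short is also made up.

```python
def cpus_left_for_short(queue_slots, processors, short_queue="short"):
    """CPUs that stay free for the short queue when every other
    queue is full, assuming one CPU per occupied slot."""
    other = sum(n for q, n in queue_slots.items() if q != short_queue)
    return max(0, processors - other)


# Slot counts from the queue list at the top of this message.
current = {"very_long": 4, "long": 8, "medium": 8, "short": 12}
# Hypothetical alternative: two slots fewer on long and medium.
trimmed = {"very_long": 4, "long": 6, "medium": 6, "short": 12}

print(cpus_left_for_short(current, 20))
print(cpus_left_for_short(trimmed, 20))
```

With the current numbers the non-short queues add up to
exactly 20 slots, so a system full of longer jobs leaves no
CPU idle for short jobs.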
P.S. In theory, the short queue should currently be able to
suspend the long queue when the load gets too high. However,
the threshold seems to be a bit too high for this to become
operational, so it is still difficult to get jobs running
in the short queue.