[GE users] user loads
dag at sonsorol.org
Tue Sep 16 01:33:04 BST 2008
I think that what you are looking for is already built into Grid
Engine. Read the SGE documentation and queue configuration guides and
look particularly for "load alarm threshold" ... the basic idea is
that SGE has a built-in protection mechanism against overworking the
compute nodes. The nodes periodically report their load averages and
if the value exceeds a certain threshold the queue instance trips
into 'alarm' state and will not take new work even if there are job
slots free. The nice thing about this feature is that the load alarm
clears automatically when the load drops on the remote node.
Even if you have no resource quotas and no resource allocation
policies in effect your system will still be protected from overload
by this system.
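To make the load alarm behaviour concrete, here is a minimal Python
sketch of the decision described above. It is only a model, not SGE
code: the QueueInstance class and its field names are hypothetical,
and the 1.75 threshold assumes the stock queue setting
(load_thresholds np_load_avg=1.75), so check your own queue
configuration before relying on it.

```python
from dataclasses import dataclass

# Assumed threshold: stock SGE queues ship with load_thresholds set to
# np_load_avg=1.75 (load average divided by core count). Verify this
# against your own queue configuration.
LOAD_THRESHOLD = 1.75


@dataclass
class QueueInstance:
    """Hypothetical model of one queue instance on one exec host."""
    host: str
    cores: int
    slots_total: int
    slots_used: int = 0
    load_avg: float = 0.0  # most recent load report from the host

    def np_load_avg(self) -> float:
        # Load average normalized by the number of cores.
        return self.load_avg / self.cores

    def in_alarm(self) -> bool:
        # Alarm state is recomputed from the latest report, so it clears
        # on its own once the reported load falls below the threshold.
        return self.np_load_avg() >= LOAD_THRESHOLD

    def accepts_new_job(self) -> bool:
        # A free slot is not enough: the instance must also be out of alarm.
        return self.slots_used < self.slots_total and not self.in_alarm()
```

For example, a 4-core node with 3 free slots refuses work while it
reports a load of 9.0 and takes work again once the load falls to 2.0:

```python
q = QueueInstance(host="node01", cores=4, slots_total=4, slots_used=1)
q.load_avg = 9.0
print(q.accepts_new_job())  # False: 9.0 / 4 = 2.25 is over the threshold
q.load_avg = 2.0
print(q.accepts_new_job())  # True: 2.0 / 4 = 0.5 is under the threshold
```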
As someone who regularly builds SGE systems and implements policies
for others I've got some advice for you ...
- Don't make the mistake of trying to get your SGE configuration
"perfect" the first time around. This is generally impossible --
you'll never really know the finer points of how your system should be
configured and tuned until some time after you have turned real users
loose and started doing real work on the system.
My rule of thumb for new SGE projects goes like this:
1. Collect requirements from IT admins and end-users, "translating"
SGE capabilities if needed. If they can't describe what the cluster
should do for them or can't really understand SGE without using it
then I will usually deploy SGE in the default mode with a simple
"fairshare by user" policy. This is a nice simple way to expose SGE to
users and everyone understands and appreciates fairshare-by-user when
you tell them "the scheduler will work to make sure everyone gets a
fair and equal share of available resources"
2. Implement a "best guess" SGE configuration, then open up the
cluster for users during this "beta" or "trial" period
3. After a few weeks or a month or so, go back and talk to users and
operators and see what they like and (most importantly) don't like
about the system
4. Based on feedback from step #3, start refining your configuration,
making policy changes or adding resource quotas as needed
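To illustrate what "fairshare by user" from step 1 means to an end
user, here is a small Python sketch that hands out free slots by
cycling through users so that nobody is starved by someone with a
long backlog. This is only an illustration of the idea: SGE's real
policy is ticket-based and more involved, and the function name and
the job data are made up for the example.

```python
from collections import deque


def dispatch_equal_share(pending, free_slots):
    """Hand out free slots one at a time, cycling through users.

    pending: dict mapping user name -> list of job names (hypothetical)
    Returns the (user, job) pairs in the order they would be started.
    """
    # Only users who actually have pending jobs take part in the rotation.
    queues = deque(
        (user, deque(jobs)) for user, jobs in pending.items() if jobs
    )
    dispatched = []
    while queues and len(dispatched) < free_slots:
        user, jobs = queues.popleft()
        dispatched.append((user, jobs.popleft()))
        if jobs:
            # User still has work waiting: go to the back of the line.
            queues.append((user, jobs))
    return dispatched
```

With three free slots, a user with four pending jobs and a user with
one pending job both get started right away:

```python
pending = {"alice": ["a1", "a2", "a3", "a4"], "bob": ["b1"]}
print(dispatch_equal_share(pending, free_slots=3))
# [('alice', 'a1'), ('bob', 'b1'), ('alice', 'a2')]
```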
I have not had time to read all of your most recent messages (business
travel) but if you are really deploying this system for the first time
I would not try to get too tricky and/or complicated right at the
start.
You would be well served by:
- Installing SGE with most of the default options enabled
- Installing a simple "fairshare by user" policy (http://gridengine.info/2006/01/17/easy-setup-of-equal-user-fairshare-policy)
- If you are using SGE 6.2 you should re-enable the schedd_job_info
scheduler parameter
Once you have the basics up and running you can experiment with queue
settings and resource quotas as needed but I'd recommend just running
in a basic mode for as long as it takes for you to be able to
characterize your requirements and your particular application workflow.
On Sep 15, 2008, at 8:14 PM, Mag Gam wrote:
> Hello All,
> As many of you know we are putting together a GRID at my university's
> engineering lab. I wanted to know if we can throttle a user's job
> depending on the load of the system. Let's say I have 16 servers and I
> would like to submit a job. Each of these servers is an exec host.