[GE users] user loads

Chris Dagdigian dag at sonsorol.org
Tue Sep 16 01:33:04 BST 2008


I think that what you are looking for is already built into Grid
Engine. Read the SGE documentation and queue configuration guides and
look particularly for "load alarm threshold" (the load_thresholds
setting in the queue configuration) ... the basic idea is that SGE has
a built-in protection mechanism against overloading the compute nodes.
The nodes periodically report their load averages and if the value
exceeds a certain threshold the queue instance trips into 'alarm'
state and will not take new work even if there are job slots free.
The nice thing about this feature is that the load alarm clears
automatically when the load drops on the remote node.
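
As a rough illustration of the load alarm behaviour described above,
here is a minimal Python sketch. It is a toy model, not SGE code: the
QueueInstance class, its field names and the 1.75 threshold value are
assumptions chosen for the example, and real SGE evaluates thresholds
per queue instance from the load values the exec hosts report.

```python
from dataclasses import dataclass


@dataclass
class QueueInstance:
    """Toy model of one queue instance on one exec host (hypothetical)."""
    host: str
    slots_total: int
    slots_used: int = 0
    load_threshold: float = 1.75   # assumed alarm threshold for this sketch
    reported_load: float = 0.0     # most recent load value from the host

    def report_load(self, load_avg: float) -> None:
        # Hosts report load periodically; only the latest value is kept.
        self.reported_load = load_avg

    @property
    def in_alarm(self) -> bool:
        # The alarm is derived from the latest report, so it clears by
        # itself once a report no longer exceeds the threshold.
        return self.reported_load > self.load_threshold

    def can_accept_job(self) -> bool:
        # No new work while in alarm, even if job slots are free.
        return (not self.in_alarm) and self.slots_used < self.slots_total


q = QueueInstance("node01", slots_total=4, slots_used=1)
q.report_load(0.8)
print(q.can_accept_job())   # free slots, load below threshold
q.report_load(2.5)
print(q.can_accept_job())   # free slots, but queue instance is in alarm
q.report_load(1.0)
print(q.can_accept_job())   # load dropped, alarm cleared on its own
```

The point of the sketch is that nothing has to reset the alarm by
hand: the state follows the most recent load report.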

Even if you have no resource quotas and no resource allocation
policies in effect your system will still be protected from overload
by this mechanism.

As someone who regularly builds SGE systems and implements policies  
for others I've got some advice for you ...

- Don't make the mistake of trying to get your SGE configuration  
"perfect" the first time around. This is generally impossible --  
you'll never really know the finer points of how your system should be  
configured and tuned until some time after you have turned real users  
loose and started doing real work on the system

My rule of thumb for new SGE projects goes like this:

1. Collect requirements from IT admins and end-users, "translating"  
SGE capabilities if needed. If they can't describe what the cluster  
should do for them or can't really understand SGE without using it  
then I will usually deploy SGE in the default mode with a simple  
"fairshare by user" policy. This is a nice simple way to expose SGE to  
users and everyone understands and appreciates fairshare-by-user when  
you tell them "the scheduler will work to make sure everyone gets a  
fair and equal share of available resources"

2. Implement a "best guess" SGE configuration, open up the cluster for  
users during this "beta" or "trial" period

3. After a few weeks or a month or so, go back and talk to users and
operators and see what they like and (most importantly) don't like
about the system

4. Based on feedback from step #3 start refining your configuration,  
making policy changes or adding resource quotas as needed


I have not had time to read all of your most recent messages (business  
travel) but if you are really deploying this system for the first time  
I would not try to get too tricky and/or complicated right at the  
beginning.

You would be well served by:

- Installing SGE with most of the default options enabled
- Installing a simple "fairshare by user" policy
  (http://gridengine.info/2006/01/17/easy-setup-of-equal-user-fairshare-policy)
- If you are using SGE 6.2 you should re-enable the schedd_job_info
  parameter
  (http://article.gmane.org/gmane.comp.clustering.gridengine.users/11768)
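
To show what "a fair and equal share" means in practice, here is a
small Python sketch of an equal-share-per-user split. It is a
deliberately simplified model and not SGE's actual scheduling
algorithm: the equal_user_shares function and the 10000 ticket pool
are assumptions made up for the example.

```python
from collections import Counter


def equal_user_shares(pending_jobs, total_tickets=10000):
    """Split a ticket pool equally across users, then across each user's jobs.

    pending_jobs: list of (job_id, user) tuples.
    Returns {job_id: tickets}. Simplified illustration only.
    """
    jobs_per_user = Counter(user for _, user in pending_jobs)
    if not jobs_per_user:
        return {}
    # Every user with pending work gets the same slice of the pool,
    # no matter how many jobs that user has submitted.
    per_user = total_tickets / len(jobs_per_user)
    return {
        job_id: per_user / jobs_per_user[user]
        for job_id, user in pending_jobs
    }


jobs = [(1, "alice"), (2, "alice"), (3, "alice"), (4, "alice"), (5, "bob")]
for job_id, tickets in equal_user_shares(jobs).items():
    print(job_id, tickets)
```

In this model a user who floods the queue with many jobs does not gain
a larger share; the same slice is simply spread over more jobs.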

Once you have the basics up and running you can experiment with queue  
settings and resource quotas as needed but I'd recommend just running  
in a basic mode for as long as it takes for you to be able to  
characterize your requirements and your particular application workflow.



Regards,
Chris






On Sep 15, 2008, at 8:14 PM, Mag Gam wrote:

> Hello All,
>
> As many of you know we are putting together a GRID at my university's
> engineering lab. I wanted to know if we can throttle a user's job
> depending on the load of the system. Let's say I have 16 servers and I
> would like to submit a job. Each of these servers is an exec host.
>
