[GE users] Jobs for cluster management?

Chris Dagdigian dag at sonsorol.org
Fri Dec 23 12:58:28 GMT 2005


Sounds like using Grid Engine's disable-queue function
("qmod -d <queue instance>") would get you the same thing (a command
sketch follows the list):

- running jobs are untouched in disabled queues
- no user jobs are ever touched, suspended, re-queued or killed
- no new work gets sent to disabled queues (thus draining the machine)
- you can easily disable every node in the cluster ("qmod -d '*'")
  or do it in manageable groups
- you know which nodes still need admin work done because they are
  in state 'd'
- a node that is rebooted for admin reasons (updates, new kernel,
  etc.) will still come online in the 'disabled' state
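
A minimal sketch of the cycle (queue and host names here are examples;
adjust for your site):

    # stop new work being dispatched anywhere
    qmod -d '*'

    # ...or drain a single node's queue instance
    qmod -d 'all.q@node01'

    # see which queue instances are still disabled ('d' in the state column)
    qstat -f

    # once the admin work on a node is done, put it back in service
    qmod -e 'all.q@node01'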


-Chris


On Dec 23, 2005, at 7:46 AM, Jon Lockley wrote:

> Hi everyone,
>
> I'm wondering if the following is already possible (in a non-kludgy
> way) or whether it's something sensible to ask for as a new feature.
>
> Traditionally, when we want to upgrade the software on nodes in a
> cluster, we drain work off those nodes by shortening the wall clock
> limit every few hours such that it reaches zero at the time the
> maintenance work is scheduled. This is a bit of a pain for the users,
> but they prefer it to the alternative: killing all running jobs at a
> scheduled time. Obviously this means the cluster gets fairly empty,
> so I'm wondering if there's a better option.
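>
> For concreteness, the ratchet looks something like this (queue name
> and limit values are examples):
>
>     # drop the queue's wall clock limit as the maintenance slot nears;
>     # jobs requesting more than the current limit won't be scheduled
>     qconf -mattr queue h_rt 4:00:00 all.q
>     # ...a few hours later...
>     qconf -mattr queue h_rt 2:00:00 all.q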
>
> My idea is to have some form of "management job" in the SGE software.
> Management jobs run once and once only on each node selected (usually
> the whole cluster, I guess) as soon as the current (user) job on it
> finishes. In other words, they jump ahead of regular user jobs on
> nodes which haven't yet run the management job. The node in question
> could then be automatically released back to normal duties, and
> eventually the whole cluster will have been upgraded/changed.
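>
> The closest I can see today is a hand-rolled loop along these lines
> (rough sketch; queue and host names are examples), which is exactly
> the kludge I'd like the scheduler to absorb:
>
>     for host in node01 node02 node03; do
>         qmod -d "all.q@$host"       # stop new jobs landing on this host
>         # wait until nothing is running on the host any more
>         while [ -n "$(qstat -s r -q all.q@$host)" ]; do
>             sleep 300
>         done
>         ssh "$host" /root/upgrade.sh   # the actual management task
>         qmod -e "all.q@$host"       # release it back to normal duties
>     done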
>
> The advantages of the management-job approach are: 1) you don't have
> to empty the cluster or kill jobs to do upgrades, 2) you're not
> changing anything while users have jobs running, and 3) SGE keeps
> track of which machines do/don't still need to execute the management
> tasks.
>
> I grant that this wouldn't be appropriate for every upgrade, e.g.
> where post-upgrade nodes can't work with pre-upgrade nodes for
> parallel applications. However, I can see a lot of scenarios where it
> makes sense to couple the job scheduler with cluster management tasks
> to keep the cluster as "alive" as possible at all times.
>
> So, as I said, I'm curious to know if/how this can be done, or
> alternatively whether other people would find it a useful SGE feature.
>
> All the best,
>
> Jon
>
> --------------------------------------------------------------------------
> | Dr Jon Lockley, Centre Manager   |                                      |
> | Oxford Supercomputing Centre     | Email jon.lockley at comlab.ox.ac.uk |
> | Oxford University Computing Lab. | Tel +44 (0)1865 283569               |
> | Wolfson Building                 | Fax +44 (0)1865 273839               |
> | Parks Rd.                        | www.osc.ox.ac.uk                     |
> | Oxford, OX1 3QD                  | "Out of Darkness Cometh Light"       |
> | UK                               |                                      |
> --------------------------------------------------------------------------


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net