[GE users] Jobs for cluster management?

Jon Lockley Jon.Lockley at comlab.ox.ac.uk
Fri Dec 23 13:10:14 GMT 2005


Well, kind of.

You'd have to "qmod -d" everything you need to upgrade and make a list
of all those nodes. Then run a cron (or similar) job frequently to check
when those nodes become empty. After applying the upgrade to a node you
then need to remove it from the list (so that you don't try to upgrade
it again). You shouldn't rely on the queue state as a test of whether a
node has been upgraded, as there are all sorts of reasons why a queue
might be disabled.
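
For what it's worth, here's a rough sketch of what that cron job might
look like. The qmod/qstat calls are standard SGE, but the list file
(/root/nodes-to-upgrade), the queue name ("all.q") and the upgrade
script (/root/do_upgrade.sh) are just placeholders I've made up for
illustration:

  #!/bin/sh
  # Sketch only: one drain-and-upgrade pass, meant to be run from cron.
  # Assumes one hostname per line in $LIST, written the same way qstat
  # reports hosts, plus an upgrade script of your own.
  LIST=/root/nodes-to-upgrade

  [ -s "$LIST" ] || exit 0              # nothing left to upgrade

  for node in $(cat "$LIST"); do
      # keep new work off the node
      qmod -d "all.q@${node}"

      # crude emptiness test: is any running job reported on this host?
      if qstat -s r -u '*' | grep -q "@${node}"; then
          continue                      # still busy, try again next run
      fi

      # node is idle: upgrade it, re-enable the queue, and drop it from
      # the list so it never gets upgraded twice
      if /root/do_upgrade.sh "$node"; then
          qmod -e "all.q@${node}"
          grep -F -x -v "$node" "$LIST" > "$LIST.tmp"
          mv "$LIST.tmp" "$LIST"
      fi
  done

Run something like that every ten minutes or so and the cluster drains
and upgrades itself node by node rather than all at once.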

So yes, it's doable with some scripting, but *if* other folks are doing
the same stuff, would it make any sense (and would it be worth the
hassle!) to make it part of SGE?

Thanks,

Jon

On Fri, 23 Dec 2005, Chris Dagdigian wrote:

>
> Sounds like using Grid Engine's disable-queue function ("qmod -d
> <queue instance>") would get you the same thing:
>
> - running jobs are untouched in disabled queues
> - no user jobs are ever touched, suspended, re-queued or killed
> - no new work gets sent to disabled queues (thus draining the machine)
> - you can easily disable every node in the cluster ("qmod -d '*'")
> or in manageable groups
> - you know which nodes still need admin work done because they are in
> state 'd'
> - a node that is rebooted for admin reasons (an update, a new kernel,
> etc.) will still come online in 'disabled' state
>
>
> -Chris
>
>
> On Dec 23, 2005, at 7:46 AM, Jon Lockley wrote:
>
> > Hi everyone,
> >
> > I'm wondering if the following is already possible (in a non-kludgy
> > way) or whether it's something sensible to ask for as a new feature.
> >
> > Traditionally, when we want to upgrade the software on nodes in a
> > cluster, we drain work off those nodes by shortening the wall clock
> > limit every few hours so that it reaches zero when the maintenance is
> > scheduled. This is a bit of a pain for the users but they prefer it
> > to the alternative: killing all running jobs at a scheduled time.
> > Obviously this means the cluster gets fairly empty, so I'm wondering
> > if there's a better option.
> >
> > My idea is to have some form of "management job" in the SGE software.
> > Management jobs run once and once only on each node selected (usually
> > the whole cluster, I guess) as soon as the current (user) job on it
> > finishes. In other words they jump ahead of regular user jobs on nodes
> > which haven't yet run the management job. The node in question could
> > then be automatically released back to normal duties and eventually
> > the whole cluster will have been upgraded/changed.
> >
> > The advantages of doing things this way are 1) you don't have to
> > empty the cluster or kill jobs to do upgrades, 2) you're not changing
> > anything while users have jobs running, and 3) SGE keeps track of
> > which machines do/don't still need to execute the management tasks.
> >
> > I grant that this wouldn't be appropriate for every upgrade, e.g.
> > where post-upgrade nodes can't work with pre-upgrade nodes for
> > parallel applications. However, I can see a lot of scenarios where it
> > makes sense to couple the job scheduler with cluster management tasks
> > to keep the cluster as "alive" as possible at all times.
> >
> > So as I said, I'm curious to know if/how this can be done, or
> > alternatively if other people would find it a useful SGE feature.
> >
> > All the best,
> >
> > Jon


------------------------------------------------------------------------
| Dr Jon Lockley, Centre Manager   |                                   |
| Oxford Supercomputing Centre     | Email jon.lockley at comlab.ox.ac.uk |
| Oxford University Computing Lab. | Tel +44 (0)1865 283569            |
| Wolfson Building                 | Fax +44 (0)1865 273839            |
| Parks Rd.                        | www.osc.ox.ac.uk                  |
| Oxford, OX1 3QD                  | "Out of Darkness Cometh Light"    |
| UK                               |                                   |
------------------------------------------------------------------------

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



