[GE users] Jobs for cluster management?

Jon Lockley Jon.Lockley at comlab.ox.ac.uk
Fri Dec 23 12:46:24 GMT 2005


Hi everyone,

I'm wondering if the following is already possible (in a non-kludgy way)
or whether it's something sensible to ask for as a new feature.

Traditionally, when we want to upgrade the software on nodes in a cluster,
we drain work off those nodes by shortening the wall clock limit every few
hours so that it reaches zero at the time the upgrade work is scheduled.
This is a bit of a pain for the users, but they prefer it to the
alternative: killing all running jobs at a scheduled time. Obviously this
means the cluster gets fairly empty, so I'm wondering if there's a better
option.
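
For illustration, the draining scheme above amounts to capping the wall
clock limit at the time remaining before the maintenance window. The
sketch below is plain Python, not an SGE interface; the function name
`drain_limit`, the 48-hour normal limit and the example date are all
assumptions made up for the example.

```python
from datetime import datetime, timedelta


def drain_limit(now, maintenance_start, normal_limit):
    """Return the wall clock limit to offer to jobs started at `now`.

    The limit is the normal limit, capped at the time left before the
    maintenance window, and never below zero.
    """
    remaining = maintenance_start - now
    if remaining <= timedelta(0):
        return timedelta(0)
    return min(normal_limit, remaining)


if __name__ == "__main__":
    # Arbitrary example values, not taken from any real cluster.
    start = datetime(2006, 1, 9, 9, 0)
    normal = timedelta(hours=48)
    for hours_before in (72, 24, 6, 0):
        now = start - timedelta(hours=hours_before)
        print(hours_before, "h before:", drain_limit(now, start, normal))
```

Re-evaluating this every few hours and applying the result as the queue's
limit reproduces the manual procedure, including its drawback: nodes sit
idle whenever no waiting job fits in the remaining time.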

My idea is to have some form of "management job" in the SGE software.
Management jobs run once and once only on each node selected (usually the
whole cluster I guess) as soon as the current (user) job on it finishes.
In other words they jump ahead of regular user jobs on nodes which haven't
yet run the management job. The node in question could then be
automatically released back to normal duties and eventually the whole
cluster will have been upgraded/changed.
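
The proposed behaviour can be sketched as a toy model. This is only an
illustration of the idea under simplifying assumptions (one job per node
at a time, a single FIFO queue of user jobs); `Scheduler`, `node_free`
and the other names are hypothetical and do not correspond to anything in
SGE.

```python
from collections import deque


class Scheduler:
    """Toy model: management jobs jump ahead of user jobs, once per node."""

    def __init__(self, node_names):
        # Per-node queue of management jobs that node has not yet run.
        self.pending_mgmt = {name: deque() for name in node_names}
        # Cluster-wide FIFO queue of ordinary user jobs.
        self.user_queue = deque()

    def submit_user_job(self, job):
        self.user_queue.append(job)

    def submit_management_job(self, job, node_names=None):
        """Mark `job` as outstanding on the selected nodes (default: all)."""
        targets = node_names if node_names is not None else list(self.pending_mgmt)
        for name in targets:
            self.pending_mgmt[name].append(job)

    def node_free(self, name):
        """Called when the current job on `name` finishes.

        Returns the next job for that node, or None if there is no work.
        An outstanding management job always wins over user jobs, and is
        removed from the node's queue so it runs there only once.
        """
        if self.pending_mgmt[name]:
            return self.pending_mgmt[name].popleft()
        if self.user_queue:
            return self.user_queue.popleft()
        return None

    def outstanding(self):
        """Nodes that still have to run at least one management job."""
        return sorted(n for n, q in self.pending_mgmt.items() if q)


if __name__ == "__main__":
    sched = Scheduler(["node1", "node2"])
    for job in ("user-a", "user-b"):
        sched.submit_user_job(job)
    sched.submit_management_job("upgrade")
    print(sched.node_free("node1"))   # management job goes first
    print(sched.node_free("node1"))   # then back to normal duties
    print(sched.outstanding())        # node2 has not been upgraded yet
```

The per-node queue is what gives the "once and once only" property and
also the bookkeeping of which machines still need the change.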

The advantages of doing things this way are 1) you don't have to empty the
cluster or kill jobs to do upgrades, 2) you're not changing anything while
users have jobs running and 3) SGE keeps track of which machines do/don't
still need to execute the management tasks.

I grant that this wouldn't be appropriate for every upgrade, e.g. where
post-upgrade nodes can't work with pre-upgrade nodes for parallel
applications.  However, I can see a lot of scenarios where it makes sense
to couple the job scheduler with cluster management tasks to keep the
cluster as "alive" as possible at all times.

So as I said, I'm curious to know if/how this can be done or alternatively
if other people would find it a useful SGE feature.

All the best,

Jon

------------------------------------------------------------------------
| Dr Jon Lockley, Centre Manager   |                                   |
| Oxford Supercomputing Centre     | Email jon.lockley at comlab.ox.ac.uk |
| Oxford University Computing Lab. | Tel +44 (0)1865 283569            |
| Wolfson Building                 | Fax +44 (0)1865 273839            |
| Parks Rd.                        | www.osc.ox.ac.uk                  |
| Oxford, OX1 3QD                  | "Out of Darkness Cometh Light"    |
| UK                               |                                   |
------------------------------------------------------------------------

