[GE users] Upgrade howto?

Paul MacInnis macinnis at dal.ca
Mon Apr 14 12:04:31 BST 2008


On Fri, 11 Apr 2008, David Olbersen wrote:

> Hi,
>  
> I read the upgrade HOWTO (from 6.0 -> 6.1) and it makes it sound like
> you have to shut down the entire cluster to upgrade.
> I don't know if my users will tolerate that. Are there alternatives that
> others have used? Is there a phased approach, or something else I can do
> where I don't have to wait for all the jobs to finish?
>  
> ________________________________
> 
> David Olbersen
>  

Hi David,

You might consider installing (not upgrading) 6.1 to run in parallel
with 6.0 for a time.

This is our experience, for what it's worth ...

Last year we moved from 5.3 to 6.1.

Rather than upgrade, which wasn't possible, we did a complete new
install of 6.1 onto the master node.  To keep the two versions separate we
used a different SGE_ROOT location and different SGE_QMASTER_PORT and
SGE_EXECD_PORT values.
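
For the 6.0/6.1 case in the question, the separation could be sketched
roughly like this. The install paths, the port numbers and the use_sge
helper are made up for illustration; only the three variable names come
from our setup.

```shell
# Sketch only: paths and port numbers below are hypothetical examples.
# use_sge selects which installation the current shell talks to.
use_sge() {
    case "$1" in
        old)
            SGE_ROOT=/opt/sge60        # assumed location of the 6.0 install
            SGE_QMASTER_PORT=6444
            SGE_EXECD_PORT=6445
            ;;
        new)
            SGE_ROOT=/opt/sge61        # assumed location of the 6.1 install
            SGE_QMASTER_PORT=7444
            SGE_EXECD_PORT=7445
            ;;
        *)
            echo "usage: use_sge old|new" >&2
            return 1
            ;;
    esac
    # Export so that client commands started from this shell see the values.
    export SGE_ROOT SGE_QMASTER_PORT SGE_EXECD_PORT
}
```

After "use_sge new", commands started from that shell see only the new
installation's settings, and "use_sge old" switches back.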

We cut several nodes from 5.3 and installed 6.1 on them for testing.
We ran 6.1 and 5.3 like this in parallel for several weeks until we
had 6.1 queues, etc. set up the way we wanted.  Note, we use classic
spooling so everything related to each version was stored under the
appropriate SGE_ROOT.

When everything was ready we installed 6.1 on all remaining slave nodes
and set a date for the switchover.  On that date we changed the
system-wide login script to source the 6.1 sge_settings.sh rather than
the 5.3 one. We also made the 5.3 qsub non-executable.  After that new
jobs went to 6.1 and the running jobs on 5.3 eventually finished.  Note,
our queues all have load_avg as a load_threshold which prevented each
scheduler from sending jobs to nodes running the other scheduler's jobs.
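
Making the old qsub non-executable needs nothing more than chmod.  A
minimal sketch, where the function name is made up and the path to the
old qsub is passed in because it depends on your install:

```shell
# Sketch only: disable_old_qsub is a hypothetical helper name.
# $1 is the full path to the old installation's qsub binary.
disable_old_qsub() {
    qsub_path=$1
    if [ ! -f "$qsub_path" ]; then
        echo "no such file: $qsub_path" >&2
        return 1
    fi
    # Remove execute permission for everyone.  Jobs that are already
    # running are not affected; only new submissions are blocked.
    chmod a-x "$qsub_path"
}
```

Restoring access later is just "chmod a+x" on the same file.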

During the switchover we discovered that one user's jobs wouldn't work on
6.1 so we changed his login to source the 5.3 sge_settings.sh and allowed
him to use 5.3's qsub until we could trace the problem.  There was a
slight danger here that an idle node could be hit by both schedulers
and become overloaded but it never happened.
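
We changed that one user's login by hand, but the same exception could
also be handled in the system-wide login script.  A rough sketch, with a
made-up user name and made-up paths to the two sge_settings.sh files:

```shell
# Sketch only: the user list and both paths are hypothetical.
LEGACY_USERS="alice"                       # users who stay on the old version
OLD_SETTINGS=/opt/sge53/sge_settings.sh    # assumed path
NEW_SETTINGS=/opt/sge61/sge_settings.sh    # assumed path

# Print the settings file that the given user should source.
settings_for_user() {
    for u in $LEGACY_USERS; do
        if [ "$1" = "$u" ]; then
            echo "$OLD_SETTINGS"
            return 0
        fi
    done
    echo "$NEW_SETTINGS"
}

# In the login script one might then write:
#   . "$(settings_for_user "$(id -un)")"
```

Once the problem is traced, removing the name from the list moves that
user to the new version at the next login.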

This parallel operation worked well because the SGE developers seem
to have taken care that all the pieces - qstat, qhost, qmod, etc - follow
the caller's environment settings for SGE_ROOT, SGE_QMASTER_PORT and
SGE_EXECD_PORT.
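
Because of that, an admin can query either cluster from one shell by
setting the three variables for a single command.  A small sketch; the
with_sge wrapper is a made-up name, and the path and ports in the usage
line are placeholders:

```shell
# Sketch only: with_sge is a hypothetical wrapper.
# Usage: with_sge ROOT QMASTER_PORT EXECD_PORT command [args...]
with_sge() {
    root=$1
    qmaster_port=$2
    execd_port=$3
    shift 3
    # The subshell keeps the caller's own SGE settings untouched.
    (
        export SGE_ROOT="$root"
        export SGE_QMASTER_PORT="$qmaster_port"
        export SGE_EXECD_PORT="$execd_port"
        "$@"
    )
}

# Example (placeholder values):
#   with_sge /opt/sge61 7444 7445 qstat
```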

I hope this helps,

Paul