[GE users] Upgrade howto?

David Olbersen dolbersen at nextwave.com
Mon Apr 14 17:33:30 BST 2008


Paul,

That's awesome, thank you so much for sharing.
This might be the approach we end up taking.
You say you only changed SGE_ROOT and the ports. Did you need to
change the cell name, or was that handled by having a different root?

-- 
David Olbersen
 

-----Original Message-----
From: Paul MacInnis [mailto:macinnis at dal.ca] 
Sent: Monday, April 14, 2008 4:05 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Upgrade howto?

On Fri, 11 Apr 2008, David Olbersen wrote:

> Hi,
>  
> I read the upgrade HOWTO (from 6.0 -> 6.1) and it makes it sound like 
> you have to shut down the entire cluster to upgrade.
> I don't know if my users will tolerate that. Are there alternatives 
> that others have used? Is there a phased approach, or something else I
> can do where I don't have to wait for all the jobs to finish?
>  
> ________________________________
> 
> David Olbersen
>  

Hi David,

You might consider installing (not upgrading) 6.1 to run in parallel
with 6.0 for a time.

This is our experience, for what it's worth ...

Last year we moved from 5.3 to 6.1.

Rather than upgrade, which wasn't possible, we did a complete new
install of 6.1 onto the master node.  To keep the 2 versions separate we
used a different SGE_ROOT location and different SGE_QMASTER_PORT and
SGE_EXECD_PORT port numbers.
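
As a rough illustration of that separation, here is a minimal sketch of
a shell helper that switches one login session between the two
installs.  The function name use_sge, the install paths and the port
numbers are all hypothetical; substitute whatever your site uses.

```shell
# Sketch only: paths and port numbers below are assumed example values,
# not the ones from our cluster.
use_sge() {
    case "$1" in
        old)
            export SGE_ROOT=/opt/sge53
            export SGE_QMASTER_PORT=536
            export SGE_EXECD_PORT=537
            ;;
        new)
            export SGE_ROOT=/opt/sge61
            export SGE_QMASTER_PORT=6444
            export SGE_EXECD_PORT=6445
            ;;
        *)
            echo "usage: use_sge old|new" >&2
            return 1
            ;;
    esac
}

# Point this shell at the new install and show what was selected.
use_sge new
echo "$SGE_ROOT $SGE_QMASTER_PORT $SGE_EXECD_PORT"
```

The only point is that each install gets its own root and its own pair
of ports, so the two sets of daemons never talk to each other.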

We cut several nodes from 5.3 and installed 6.1 on them for testing.
We ran 6.1 and 5.3 like this in parallel for several weeks until we had
6.1 queues, etc. set up the way we wanted.  Note, we use classic spooling
so everything related to each version was stored under the appropriate
SGE_ROOT.

When everything was ready we installed 6.1 on all remaining slave nodes
and set a date for the switchover.  On that date we changed the
system-wide login script to source the 6.1 sge_settings.sh rather than
the 5.3 one. We also made the 5.3 qsub non-executable.  After that new
jobs went to 6.1 and the running jobs on 5.3 eventually finished.  Note,
our queues all have load_avg as a load_threshold which prevented each
scheduler from sending jobs to nodes running the other scheduler's jobs.
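
The two switchover steps could be sketched roughly as below.  This is
not our actual script: the function names are made up, and it assumes
sge_settings.sh sits directly under each install root and that qsub
lives in per-architecture subdirectories of bin.  A stock install may
keep its settings file elsewhere, e.g. under the cell's common
directory.

```shell
# Source the settings file for the install rooted at $1.
# Assumption: the file is $1/sge_settings.sh; adjust for your layout.
source_sge_settings() {
    settings="$1/sge_settings.sh"
    if [ -r "$settings" ]; then
        . "$settings"
    else
        echo "cannot read $settings" >&2
        return 1
    fi
}

# Remove execute permission from every qsub under the old install's
# bin directory (one copy per architecture subdirectory).
disable_old_qsub() {
    old_root=$1
    [ -d "$old_root/bin" ] || return 0
    find "$old_root/bin" -type f -name qsub -exec chmod a-x {} +
}
```

On the switchover date the system-wide login script would call
source_sge_settings with the 6.1 root, and an administrator would run
disable_old_qsub once against the 5.3 root.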

During the switchover we discovered that one user's jobs wouldn't work
on 6.1, so we changed his login to source the 5.3 sge_settings.sh and
allowed him to use 5.3's qsub until we could trace the problem.  There
was a slight danger here that an idle node could be hit by both
schedulers and become overloaded but it never happened.
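
A per-user exception like this could be handled in the login script
along the following lines.  Again this is only a sketch under
assumptions: the variable SGE_OLD_USERS, the function name and the
paths are hypothetical, not what we actually ran.

```shell
# Hypothetical install locations and user list; adjust for your site.
OLD_ROOT=${OLD_ROOT:-/opt/sge53}
NEW_ROOT=${NEW_ROOT:-/opt/sge61}
SGE_OLD_USERS=${SGE_OLD_USERS:-}   # space-separated logins kept on 5.3

# Print the settings file the given user should source:
# the 5.3 one if the login is listed in SGE_OLD_USERS, else the 6.1 one.
pick_sge_settings() {
    user=$1
    if [ -n "$user" ]; then
        case " $SGE_OLD_USERS " in
            *" $user "*)
                echo "$OLD_ROOT/sge_settings.sh"
                return
                ;;
        esac
    fi
    echo "$NEW_ROOT/sge_settings.sh"
}
```

The login script would then source whatever path
pick_sge_settings "$(id -un)" prints.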

This parallel operation worked well because the SGE developers seem to
have taken care that all the pieces - qstat, qhost, qmod, etc - follow
the caller's environment settings for SGE_ROOT, SGE_QMASTER_PORT and
SGE_EXECD_PORT.

I hope this helps,

Paul






---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net


