[GE users] Best way to shut down an SGE cluster

templedf dan.templeton at sun.com
Tue Feb 9 21:38:57 GMT 2010

The fast way to shut down a cluster is "qconf -kej all -km".  That stops 
the qmaster and all execds cleanly and aborts all running jobs.  The 
jobs will need to be resubmitted.
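
As a minimal sketch, the fast path can be wrapped in a small shell function.
The function name and the RUN variable are hypothetical, added here only so
the command can be previewed (RUN=echo) before it is run for real; the sketch
assumes qconf is on the PATH and that you are an SGE manager.

```shell
# Fast shutdown: stop every execd (aborting its running jobs), then the qmaster.
# RUN is a hypothetical dry-run hook; leave it unset to really run the command.
fast_shutdown() {
    ${RUN:-} qconf -kej all -km   # -kej comes first; -km must be last
}
```

Calling "RUN=echo; fast_shutdown" only prints the command, which is a cheap
way to double-check it before a real outage.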

If you have time for an extra step or two, you can run "qmod -d \*; 
qmod -rj \*; qconf -kej all -km".  That will disable all queues and then 
requeue all running jobs, before stopping all the execds and the master.  
When you bring the cluster back up, run "qmod -e \*" to bring your queues 
back online, and the jobs should automatically be rescheduled.  This all 
requires that your queues have rerun set to TRUE, though.
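
The longer sequence above could be scripted roughly as follows.  This is a
sketch under assumptions, not a tested site procedure: the function names and
the RUN dry-run hook are hypothetical, and it assumes qmod and qconf are on
the PATH and that you are an SGE manager.  The show_rerun helper reads the
"rerun" attribute from "qconf -sq" so you can check the precondition before
relying on requeueing.

```shell
#!/bin/sh
# RUN is a hypothetical dry-run hook: RUN=echo prints commands instead of
# running them; left empty, the commands run for real.
RUN=${RUN:-}

# Print each cluster queue followed by its current rerun setting.
show_rerun() {
    for q in $(qconf -sql); do
        qconf -sq "$q" | awk -v q="$q" '$1 == "rerun" { print q, $2 }'
    done
}

# Disable all queues, requeue running jobs, then stop the execds and qmaster.
graceful_shutdown() {
    $RUN qmod -d '*'
    $RUN qmod -rj '*'
    $RUN qconf -kej all -km
}

# Once the daemons are running again, put the queues back online.
enable_queues() {
    $RUN qmod -e '*'
}
```

If show_rerun reports FALSE for a queue, something like
"qconf -mattr queue rerun TRUE <queue>" should change it; jobs in queues left
at FALSE would presumably still need to be resubmitted by hand.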

Also note, the -ke or -kej has to come before the -km.  Nothing after 
the -km will have any effect.


On 02/09/10 13:20, prentice wrote:
> I'm sure this has been discussed many times on this list, but I couldn't
> find a good answer by searching (I probably wasn't using the magical
> combination of search terms to get my answer):
> If you need to shut down an entire cluster quickly (power outage, for
> example) what is the best way to shut down SGE, so that there's minimal
> disruption when the cluster starts up again?
> Will jobs that were running be restarted/requeued, or will they need to
> be submitted again? I know that data will be lost if the jobs don't
> provide their own checkpointing.

