[GE users] switching off nodes

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Thu Feb 10 18:08:22 GMT 2005

On Thu, 2005-02-10 at 12:57, Sebastian Stark wrote:

> Has anyone ever done this?

I've not done this, but I have been looking at shutdown/restart of grid
engine execution nodes, and at least for Fedora Core 1 and 6.0u3 lx24-x86
it's not very clean.  I often have to go in manually and restart
grid engine after a host comes back up, because the
administration/scheduler still thinks the host is around and
rejects the new connection because it already has one by the same name.
Here is the error message:

02/09/2005 09:04:37|execd|dgruhn-lx|E|commlib error: endpoint is not
unique error (endpoint "dgruhn-lx.group-w-inc.com/execd/1" is already
02/09/2005 09:04:40|execd|dgruhn-lx|E|getting configuration: unable to
contact qmaster using port 461 on host "alice.group-w-inc.com"
02/09/2005 09:04:40|execd|dgruhn-lx|W|can't get configuration from
qmaster -- waiting ...
02/09/2005 09:04:41|execd|dgruhn-lx|E|there is already a client endpoint
dgruhn-lx.group-w-inc.com/execd/1 connected to qmaster service

> I want to shut down idle nodes to save money. I don't see any problem from the 
> technical side (buy power switches and make them configurable over the 
> network or serial port) but I'm wondering what issues there could be with 
> gridengine.
> To decide whether some nodes should be shut down I would have to gather usage 
> data and make the nodes' power status dependent on the actual number of jobs 
> waiting for resources or something like this. Has anyone ever thought this 
> through a bit?
> Issues I can think of are:
>  - race conditions (system decides to shut down node but between decision and 
> shutdown a job slips in)

You could solve this one by suspending any queues on the node and then
checking to make sure that no jobs have been allocated to those queues.  If
none have, proceed with an orderly shutdown and power off.  If you are
really cutting the power, make sure that your BIOS is set to power up
when power is restored.
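
As a rough sketch of that close-the-queues-then-check sequence (untested
against a live cluster, so treat it as an outline only): the version below
disables the queue instances with "qmod -d" rather than suspending them,
because disabling stops new jobs from being dispatched without touching
anything already running; swap in "qmod -s" / "qmod -us" if you want
suspension instead.  It assumes that "qstat -s r -q" prints nothing when no
dispatched jobs match, and the names safe_to_power_off and run_command are
made up for the example.  The actual shutdown and power-off step is left
to the caller.

```python
import subprocess
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> str:
    """Run a grid engine command and return its standard output."""
    result = subprocess.run(list(cmd), check=True,
                            capture_output=True, text=True)
    return result.stdout


def safe_to_power_off(host: str,
                      run: Callable[[Sequence[str]], str] = run_command) -> bool:
    """Close the host's queues, then report whether the host is idle.

    Returns True if no jobs are on the host (queues stay disabled so
    nothing can slip in before the shutdown).  Returns False if a job
    was found, in which case the queues are re-enabled.
    """
    queues = f"*@{host}"  # every queue instance on this host

    # Step 1: stop the scheduler from dispatching new jobs to the host.
    run(["qmod", "-d", queues])

    # Step 2: only now look for jobs, so a job cannot arrive between
    # the check and the shutdown.  Assumption: empty output = no jobs.
    jobs = run(["qstat", "-s", "r", "-q", queues])
    if jobs.strip():
        # A job got there first; back out and leave the node up.
        run(["qmod", "-e", queues])
        return False
    return True
```

The ordering is the point: the queues are closed before the job check,
so the window between "decide" and "shut down" is gone.  After the node
comes back up, "qmod -e" on the same queues would be needed to put it
back in service.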

>  - nodes get shut down based on wrong decisions by the system
>  - waiting time until a node is booted (there should be hot spares...)
>  - don't really shut them down but put them in acpi standby and use 
> wake-on-lan to bring them back to life, should be reasonably fast.
> Thanks for any input!
> -Sebastian
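
On the wake-on-lan idea, the wake-up side is small.  A magic packet is
six 0xFF bytes followed by the target MAC address repeated sixteen times,
normally sent as a UDP broadcast.  A minimal sketch, assuming the NIC and
BIOS have wake-on-lan enabled; the function names, the default broadcast
address and the choice of port 9 are assumptions for the example:

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a wake-on-lan magic packet for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    # bytes.fromhex raises ValueError if the digits are not valid hex.
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling wake("00:11:22:33:44:55") from a machine on the same subnet would
send the packet; whether the node actually resumes depends on its
hardware and firmware settings.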


