[GE users] Infiniband loadsensor

fx d.love at liverpool.ac.uk
Mon Aug 23 19:57:51 BST 2010

erilon78se <erik.lonroth at scania.com> writes:

> Hello!
> Does anyone know how to disable a node in SGE/OGE if a custom "load sensor" detects that the state of the link is bad?

[Someone else with unreliable cards :-(.]

Although using a load sensor is neat, you probably want alerts for the
alarm state on queues anyway.  If you use something like Nagios for
that, you might as well do the whole job with it.  The details will
depend on your setup, but I detect patterns in centralized syslog-ng
logs and disable the affected queues with a Nagios event handler.  The
nagios account has to be an SGE admin user, of course, and nagios has
to run on an admin host.  It would also be nice to have IB on the head
node, to check the health of the fabric directly.
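
For illustration, an event handler along those lines might look like the
sketch below.  It is not the script described above, just a minimal
example under some assumptions: Nagios passes the $HOSTNAME$,
$SERVICESTATE$ and $SERVICESTATETYPE$ macros as arguments, the Nagios
host name matches the SGE execution host name, and the SGE environment
(SGE_ROOT and so on) is already set up for the nagios account so that
qmod is on the PATH.  The function name build_command is made up for
the example.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios event handler that disables SGE queues on a host."""
import subprocess
import sys


def build_command(host, state, state_type):
    """Return the qmod argument list for this event, or None to do nothing."""
    # Act only on HARD states, i.e. after Nagios has exhausted its
    # check retries, so a single failed check does not disable a node.
    if state_type != "HARD":
        return None
    if state == "CRITICAL":
        # "*@host" is an SGE wildcard for every queue instance on the host.
        return ["qmod", "-d", "*@" + host]
    # Recovery is deliberately left manual in this sketch.
    return None


def main(argv):
    """Run qmod if the event calls for it; return a process exit status."""
    cmd = build_command(argv[1], argv[2], argv[3])
    if cmd is None:
        return 0
    return subprocess.call(cmd)


# Run only when called with exactly the three expected arguments.
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(sys.argv))
```

The Nagios command definition would then pass the three macros in that
order.  Leaving the queues disabled until someone looks at the node is
a design choice here; re-enabling on an OK state would be a small
addition if the fault is known to clear by itself.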

The load sensor would have an advantage if it reduced the window for
another job to start up and fail on that node.  As an alternative, I've
thought of testing and acting on the state in an epilogue, assuming the
errors cause the job to fail, as they normally do with us.  The epilogue
would need access to data from the other nodes somehow.
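
For comparison, the load sensor route could be sketched roughly as
follows.  The load sensor protocol itself is standard SGE: the execd
writes a newline to the sensor's stdin for each report it wants, the
sensor answers with a begin/end block of host:name:value lines, and it
exits when it reads "quit".  Everything else is an assumption for the
example: the load value name ib_link_ok is made up, the port state is
read from the Linux sysfs files under /sys/class/infiniband, and every
port on the node is expected to be ACTIVE (a node with an unused second
port would need a narrower glob).  A matching complex and a load
threshold on the queues are also needed to put them into the alarm
state; that configuration is not shown.

```python
#!/usr/bin/env python3
"""Sketch of an SGE load sensor reporting InfiniBand link state."""
import glob
import socket
import sys

# Linux sysfs files holding the port state, e.g. "4: ACTIVE".
PORT_STATE_GLOB = "/sys/class/infiniband/*/ports/*/state"


def link_ok(state_files):
    """Return 1 if there is at least one port and all are ACTIVE, else 0."""
    if not state_files:
        return 0
    for path in state_files:
        try:
            with open(path) as f:
                text = f.read()
        except OSError:
            return 0
        if "ACTIVE" not in text:
            return 0
    return 1


def report(host, value):
    """Format one load report block; ib_link_ok is a hypothetical name."""
    return "begin\n%s:ib_link_ok:%d\nend\n" % (host, value)


def run(stdin, stdout, host, pattern):
    """Answer each input line with a report until EOF or 'quit'."""
    for line in iter(stdin.readline, ""):
        if line.strip() == "quit":
            break
        stdout.write(report(host, link_ok(glob.glob(pattern))))
        stdout.flush()  # execd reads the block as soon as it is written


def main():
    # Assumes socket.gethostname() gives the name SGE uses for this host.
    run(sys.stdin, sys.stdout, socket.gethostname(), PORT_STATE_GLOB)
```

A deployed script would end with a call to main() and be named in the
load_sensor entry of the host or global configuration.  The window for
another job to land on the node is then bounded by the load report
interval rather than by log processing and the Nagios check cycle.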

Dave Love
Advanced Research Computing, Computing Services, University of Liverpool
AKA fx at gnu.org

