[GE users] checking mount points or any other user defined attributes

fx d.love at liverpool.ac.uk
Sun Nov 28 15:55:57 GMT 2010

craffi <dag at sonsorol.org> writes:

> Best implementation I saw was at a site where the admins had a script 
> that probed for every OS issue they had ever encountered in the past. 
> The script ran at node boot time and periodically afterwards. As soon as 
> any problem was detected the node gets put into disabled state 'd' and 
> the admins get notified.

I'd have hoped that sort of thing was standard practice, for some value
of `every OS issue'.  (I use Nagios.)  You do need to judge whether it's
worth it for a particular failure mode, in terms of both the effort to
write and organize a test and the resources to run it, which might have a
significant effect on the compute nodes, or on the head node if you're
running it there.

The SGE angle is that the job prolog/epilog are a convenient place to
run tests just at the time they particularly matter, without putting a
continual load on the node.  You can either make sure the queue goes into
an error state on failure, monitor for that, and work out the cause
afterwards, or report the failure directly with something like NSCA under
Nagios.  To have Nagios disable queues, for instance, you have to be
careful either to run specific commands under sudo or to make sure the
nagios user has appropriate SGE privileges, and it's not necessarily easy
to test that it all works.
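
For what it's worth, a prolog check along those lines might look roughly
like the sketch below.  It is only an illustration, not something I'd
claim is tested on a real cluster: the mount list and the function names
are made up, and it relies on the documented SGE behaviour that a prolog
exit status other than 0, 99 or 100 puts the queue into error state.

```python
import os
import sys

# Hypothetical list of file systems a job needs; adjust per site.
REQUIRED_MOUNTS = ["/home", "/scratch"]

# Exit status 0 lets the job start.  In SGE, a prolog exit status other
# than 0, 99 (requeue the job) or 100 (job error state) puts the queue
# into error state, which is what we want for a broken node.
EXIT_OK = 0
EXIT_QUEUE_ERROR = 1


def missing_mounts(paths, ismount=os.path.ismount):
    """Return the entries of `paths` that are not currently mount points."""
    return [p for p in paths if not ismount(p)]


def main(paths=REQUIRED_MOUNTS, ismount=os.path.ismount, err=sys.stderr):
    """Run the mount check and return the exit status for the prolog."""
    missing = missing_mounts(paths, ismount)
    if missing:
        # The message is there to help whoever has to work out why the
        # queue went into error state.
        print("prolog check failed, not mounted: " + ", ".join(missing),
              file=err)
        return EXIT_QUEUE_ERROR
    return EXIT_OK

# When installed as the prolog script, finish with: sys.exit(main())
```

The `ismount` argument is there only so the logic can be exercised
without touching real file systems.  The same function could equally be
called from a Nagios check if you'd rather test periodically than at job
start.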

Dave Love
Advanced Research Computing, Computing Services, University of Liverpool
AKA fx at gnu.org

