[GE users] Machines constantly in Error state, and won't stay cleared...

sgenedharvey sge at nedharvey.com
Thu Jan 28 21:35:12 GMT 2010


> I have no execution directory on the machines that are failing.  I can't
> seem to figure out why...

OK, I found the root cause of this problem today, and the solution.

Root cause:
When you run the stop script, /etc/init.d/sgeexecd.p700 stop, for some
reason sge_execd does not die.  So when I uninstall, rm all the files, and
then reinstall the execd, the old daemon is still holding the listening
port.  The new execd cannot bind to that port, and therefore never creates
the execd_spool directory.  Naturally, no jobs can be run on that host:
the first job that tries will fail, and the machine will go into 'E'
(Error) state.
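
A quick way to check for this condition is to look for a leftover daemon
after the stop script has run.  This is only a sketch: the function name
leftover_daemon is made up for illustration, and it assumes pgrep is
available on the exec host.

```shell
# Report whether a daemon with the given name is still running.
# leftover_daemon is a hypothetical helper, not part of Grid Engine.
# Returns 0 if a leftover process was found, 1 if none was found.
leftover_daemon() {
    daemon=${1:-sge_execd}
    # pgrep -x matches the process name exactly and prints matching PIDs.
    if pids=$(pgrep -x "$daemon" 2>/dev/null); then
        echo "$daemon still running, PIDs:" $pids
        return 0
    fi
    echo "no $daemon process found"
    return 1
}
```

Run it as "leftover_daemon sge_execd" right after the stop script; if it
reports PIDs, the old daemon survived.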

The solution, of course, is to run the stop script, and then manually kill
the daemon, before attempting to reinstall or restart it.
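
Roughly, the stop-then-kill sequence could be scripted like this.  It is a
sketch under assumptions, not a tested procedure: stop_and_verify is a
made-up name, the grace period of 10 seconds is arbitrary, and it assumes
pgrep and pkill exist on the host.

```shell
# Stop a daemon via its init script, then make sure it is really gone.
# stop_and_verify is a hypothetical helper; the default grace period
# is an assumed value, not something Grid Engine prescribes.
stop_and_verify() {
    stop_script=$1    # e.g. /etc/init.d/sgeexecd.p700
    daemon=$2         # e.g. sge_execd
    grace=${3:-10}    # seconds to wait for a clean exit

    # Step 1: ask nicely through the init script, if it is present.
    if [ -x "$stop_script" ]; then
        "$stop_script" stop || true
    fi

    # Step 2: give the daemon a few seconds to exit on its own.
    i=0
    while pgrep -x "$daemon" >/dev/null 2>&1 && [ "$i" -lt "$grace" ]; do
        sleep 1
        i=$((i + 1))
    done

    # Step 3: still alive, so kill it by hand (TERM first, then KILL).
    if pgrep -x "$daemon" >/dev/null 2>&1; then
        pkill -TERM -x "$daemon" || true
        sleep 2
        pkill -KILL -x "$daemon" 2>/dev/null || true
        sleep 1
    fi

    # Step 4: only report success once no such process remains.
    if pgrep -x "$daemon" >/dev/null 2>&1; then
        echo "$daemon is still running" >&2
        return 1
    fi
    echo "$daemon is stopped"
    return 0
}
```

On the affected host this would be called as
"stop_and_verify /etc/init.d/sgeexecd.p700 sge_execd", and the reinstall
or restart attempted only if it returns success.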

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241596
