[GE users] Help. My cluster has fallen and can't get back up
grantb at grantb.org
Fri Jul 10 15:04:56 BST 2009
I have a cluster of Mac Xserves running SGE 6.1. It has been up and working fairly well for more than a year now. Last week I submitted a new kind of job that is very disk I/O intensive, and I think it has caused some problems.
Anyway, the head node, where sgemaster runs, became totally unresponsive yesterday, so it was given a soft reboot. When I try to start sgemaster, it gives:
and then it hangs for a while and then gives:
daemonize error: timeout while waiting for daemonize state
error: getting configuration: failed receiving gdi request
error: can't get configuration from qmaster -- backgrounding
The entire time, the sge_qmaster process uses about 62% of the processor. I let it run like this for about 24 hours before I tried to stop the process. The script says that the qmaster and scheduler are stopped, but they still have active processes. I can kill them only with kill -9.
I get the above errors every time I try to restart the qmaster. The only file that seems to change is the heartbeat. All of the other files appear to have last changed right before the server became non-responsive.
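A check of which files are still changing can be sketched in Python. This is only an illustration: `recently_modified` is a hypothetical helper, and the directory to scan (for example the qmaster spool directory) has to be supplied by hand, since its location depends on the installation.

```python
import os
import time
from pathlib import Path


def recently_modified(directory, since_seconds):
    """Return (path, mtime) pairs for files under `directory` that were
    modified within the last `since_seconds` seconds, newest first."""
    cutoff = time.time() - since_seconds
    hits = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = Path(root) / name
            try:
                mtime = path.stat().st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                hits.append((str(path), mtime))
    # Sort by modification time, most recent first.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling `recently_modified(some_dir, 600)` after a restart attempt would list only the files touched in the last ten minutes, which makes it easy to confirm whether anything other than the heartbeat file is being written.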
The last time I looked while the server was working, there were about 14k jobs in the queue. I don't mind losing the jobs, but I would prefer not to have to reinstall gridware, which was the only solution I found from searching. Perhaps I am not searching for the right terms.
For the most part, SGE installed just fine and has worked fairly trouble-free, so I don't really know the ins and outs of diagnosing problems. Any help would be appreciated.