[GE users] Help. My cluster has fallen and can't get back up
dag at sonsorol.org
Fri Jul 10 15:24:17 BST 2009
There should be good information in the SGE spool logs
($SGE_ROOT/$SGE_CELL/spool/qmaster/messages). If nothing useful shows
up there, look for log files in /tmp/, which is the panic log location
SGE uses as a last resort.
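As a rough sketch of how to check both places in one go, something like
the following could work. The helper names (tail, candidate_logs,
report) are made up for illustration, and the "*messages*" pattern for
the /tmp files is an assumption, so adjust it to whatever actually shows
up on your head node.

```python
import glob
import os
from collections import deque


def tail(path, n=20):
    """Return the last n lines of a text file as a list of strings."""
    with open(path, errors="replace") as fh:
        return list(deque(fh, maxlen=n))


def candidate_logs(sge_root, sge_cell="default", tmp_dir="/tmp"):
    """List the qmaster messages file (if present) plus fallback logs in tmp_dir."""
    main_log = os.path.join(sge_root, sge_cell, "spool", "qmaster", "messages")
    logs = [main_log] if os.path.isfile(main_log) else []
    # Assumption: fallback logs have "messages" somewhere in the file name.
    pattern = os.path.join(tmp_dir, "*messages*")
    logs += sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    return logs


def report(sge_root, sge_cell="default", tmp_dir="/tmp", n=20):
    """Print the last n lines of every candidate log file."""
    for path in candidate_logs(sge_root, sge_cell, tmp_dir):
        print("==>", path)
        for line in tail(path, n):
            print(line, end="")
```

On the qmaster host this could be run as
report(os.environ["SGE_ROOT"], os.environ.get("SGE_CELL", "default")),
after which the last few error lines before the hang are the ones worth
posting back to the list.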
On Jul 10, 2009, at 10:04 AM, brettgrant99 wrote:
> I have a cluster of mac xserves running sge6.1. It has been up and
> working fairly well for more than a year now. Last week I submitted
> a new kind of job that is very disk I/O intensive and I think that
> it has caused some problems.
> Anyway, the headnode, where sgemaster runs, became totally
> nonresponsive yesterday, so it was given a soft reboot. When I
> attempted to start up the sgemaster, it gives:
> starting sge_qmaster
> starting sge_schedd
> and then it hangs for a while and then gives:
> daemonize error: timeout while waiting for daemonize state
> error: getting configuration: failed receiving gdi request
> error: can't get configuration from qmaster -- backgrounding
> The entire time, the sge_qmaster process takes about 62% of the
> processor. I let it do this for about 24 hours before I tried to
> stop the process. The script says that the qmaster and scheduler
> are stopped, but they still have active processes. I can kill them
> only with a -9 argument.
> I get the above errors every time I try to restart the qmaster. The
> only file that seems to change is the heartbeat. All of the other
> files appear to have last changed right before the server became
> non-responsive.
> The last time that I looked while the server was working there were
> about 14k jobs in the queue. I don't mind losing the jobs, but would
> prefer not to have to reinstall gridware, which was the only
> solution that I found from searching. Perhaps I am not searching
> for the right stuff.
> For the most part, sge installed just fine and has worked fairly
> trouble-free, so I don't really know the ins and outs to diagnose
> problems, so any help would be appreciated.
> Brett Grant