[GE users] Help. My cluster has fallen and can get back up
grantb at grantb.org
Fri Jul 10 16:12:20 BST 2009
I am the only one who admins this cluster, so no changes have been made.
At the time of the crash, there were probably 250-280 jobs running. My thought was that a database got corrupted, but I am not sure how to check or fix that. To be honest, I don't remember what I chose when I installed it, probably classic spooling. How do I tell?
craffi <dag at sonsorol.org>
07/10/09 07:54 AM
The schedd messages are just saying that the qmaster refused
connections. I'm surprised there are no qmaster messages indicating why
it is not starting.
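
To pull only the error-level entries out of a messages file, a minimal sketch follows. It assumes the pipe-separated layout seen in the schedd lines quoted further down (timestamp|component|host|level|message) and one entry per line, i.e. not re-wrapped by a mail client. The names `LogEntry`, `parse_line` and `filter_by_level` are made up for this illustration and are not part of SGE.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    timestamp: str
    component: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one 'timestamp|component|host|level|message' line.

    Returns None when the line does not have five pipe-separated fields.
    """
    parts = line.rstrip("\n").split("|", 4)  # the message itself may contain '|'
    if len(parts) != 5:
        return None
    return LogEntry(*(p.strip() for p in parts))


def filter_by_level(lines: Iterable[str],
                    levels: Tuple[str, ...] = ("E",)) -> List[LogEntry]:
    """Keep only the entries whose level field is in `levels`."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry is not None and entry.level in levels]
```

For example, `filter_by_level(open(path))` on the qmaster or schedd messages file (or on a panic log under /tmp) would list the `E` entries; pass `levels=("E", "W")` to include warnings as well.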
I'd consider doing a hard reboot if possible and also making sure
that no leftover zombie daemons are still blocking the TCP ports.
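
One rough way to see whether something is still bound to the qmaster port before restarting is to try a TCP connection to it. The sketch below assumes the conventional defaults of 6444 for sge_qmaster and 6445 for sge_execd; a given installation may use different values (set through SGE_QMASTER_PORT / SGE_EXECD_PORT or /etc/services), so treat the port numbers as assumptions. The function name `port_is_listening` is made up for this illustration.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port is accepted.

    connect_ex() returns 0 on success and an errno value otherwise, so a
    refused or timed-out connection yields False instead of an exception.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


# Example use (ports are the assumed defaults, adjust for your install):
#   port_is_listening("localhost", 6444)   # sge_qmaster
#   port_is_listening("localhost", 6445)   # sge_execd
```

If this returns True while the startup script claims the daemons are stopped, an old process is probably still holding the port. It only shows that something is listening, not which process it is.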
The CPU utilization does make me think that the qmaster is crunching
active and pending job data in the spool. Were there very many jobs
active when the system went down? Are you using classic or Berkeley DB
spooling? Normally, corrupt Berkeley DB spool files generate very
obvious error messages, though.
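
On telling classic from Berkeley DB spooling: as far as I know, the choice made at install time is recorded as a `spooling_method` line in the cell's bootstrap file, normally $SGE_ROOT/$SGE_CELL/common/bootstrap. A minimal sketch that reads it is below. The file location and key name are stated from memory of a typical 6.x install, so verify them on your system; the function names are made up for this illustration.

```python
import os
from typing import Optional


def bootstrap_path(sge_root: str, sge_cell: str = "default") -> str:
    """Build the assumed location of the bootstrap file for a cell."""
    return os.path.join(sge_root, sge_cell, "common", "bootstrap")


def spooling_method(bootstrap_text: str) -> Optional[str]:
    """Return the value of the 'spooling_method' line, or None if absent.

    Blank lines and '#' comment lines are skipped; each remaining line is
    treated as 'key value' separated by whitespace.
    """
    for raw in bootstrap_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "spooling_method":
            return fields[1].strip()
    return None
```

Usage would look like `spooling_method(open(bootstrap_path(os.environ["SGE_ROOT"], os.environ.get("SGE_CELL", "default"))).read())`, which should give back either `classic` or `berkeleydb`.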
If your system has been up and running for a very long time without
issue I'd also look for subtle configuration changes that could have
entered your cluster over time that SGE would not notice until it was
restarted. Something as simple as a DNS change or someone manually
changing /etc/hosts on you could also be a culprit (although that sort
of stuff would not generate the CPU load you are seeing).
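
A quick check for that kind of drift is to compare forward and reverse name resolution for the master host. The sketch below is only an illustration: `check_resolution` is a made-up helper, and the resolver functions are passed in as parameters so the logic can be tried without touching real DNS.

```python
import socket
from typing import Callable, Dict, Union


def check_resolution(
    hostname: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> Dict[str, Union[str, bool]]:
    """Resolve hostname to an IP, resolve that IP back, and compare names.

    `reverse` must return a tuple whose first item is the primary host
    name, as socket.gethostbyaddr() does.
    """
    ip = forward(hostname)
    reverse_name = reverse(ip)[0]
    return {
        "hostname": hostname,
        "ip": ip,
        "reverse_name": reverse_name,
        "consistent": reverse_name.lower() == hostname.lower(),
    }
```

Calling `check_resolution(socket.getfqdn())` on the head node and on an exec host, then comparing the results, would show whether the two machines still agree on the master's name and address. A mismatch is a hint to look at, not proof of a problem, since short versus fully qualified names can differ legitimately.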
On Jul 10, 2009, at 10:34 AM, brettgrant99 wrote:
> There are no messages in the qmaster/messages file. I didn't
> realize that it would put stuff in /tmp. However, looking at the
> files, they have the same messages that I gave in my original email,
> with the exception of:
> 07/09/2009 09:05:56|schedd|hoth|E|commlib error: got read error
> (closing "hoth.tuc.us.ray.com/qmaster/1")
> 07/09/2009 09:05:57|schedd|hoth|E|commlib error: can't connect to
> service (Connection refused)
> 07/09/2009 09:10:07|schedd|hoth|W|daemonize error: timeout while
> waiting for daemonize state
> 07/09/2009 09:10:10|schedd|hoth|E|getting configuration: failed
> receiving gdi request
> 07/09/2009 09:13:22|schedd|hoth|E|can't get configuration from
> qmaster -- backgrounding
> Brett Grant
> craffi <dag at sonsorol.org>
> 07/10/09 07:25 AM
> There should be good information in the SGE spool logs ($SGE_ROOT/
> $SGE_CELL/spool/qmaster/messages ) and if all else fails you should
> look for log files in /tmp/ which is the panic log location SGE uses
> when all else fails.
> On Jul 10, 2009, at 10:04 AM, brettgrant99 wrote:
> > I have a cluster of Mac Xserves running SGE 6.1. It has been up and
> > working fairly well for more than a year now. Last week I submitted
> > a new kind of job that is very disk I/O intensive and I think that
> > it has caused some problems.
> > Anyway, the headnode, where sgemaster runs, became totally
> > nonresponsive yesterday, so it was given a soft reboot. When I
> > attempted to start up the sgemaster, it gives:
> > starting sge_qmaster
> > starting sge_schedd
> > and then it hangs for a while and then gives:
> > daemonize error: timeout while waiting for daemonize state
> > error: getting configuration: failed receiving gdi request
> > error: can't get configuration from qmaster -- backgrounding
> > The entire time, the sge_qmaster process takes about 62% of the
> > processor. I let it do this for about 24 hours before I tried to
> > stop the process. The script says that the qmaster and scheduler
> > are stopped, but they still have active processes. I can kill them
> > only with a -9 argument.
> > I get the above errors every time I try to restart the qmaster. The
> > only file that seems to change is the heartbeat. All of the other
> > files appear to have last changed right before the server became
> > nonresponsive.
> > The last time that I looked while the server was working there were
> > about 14k jobs in the queue. I don't mind losing the jobs, but would
> > prefer not to have to reinstall gridware, which was the only
> > solution that I found from searching. Perhaps I am not searching
> > for the right stuff.
> > For the most part, sge installed just fine and has worked fairly
> > trouble-free, so I don't really know the ins and outs to diagnose
> > problems, so any help would be appreciated.
> > Thanks,
> > Brett Grant