[GE users] Help. My cluster has fallen and can get back up

craffi dag at sonsorol.org
Fri Jul 10 15:51:36 BST 2009


The schedd messages only say that the qmaster refused connections. I am
surprised there are no qmaster messages indicating why it is not
starting.
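
To pull only the error lines out of a messages file, something like the sketch below may help. It assumes the pipe-separated layout seen in the quoted schedd lines (timestamp|component|host|level|message). The names LogEntry, parse_line and entries_at_level are made up for illustration and are not part of SGE.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    component: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one messages line into its five fields, or return None.

    Assumes the layout seen in the quoted log:
    timestamp|component|host|level|message
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return LogEntry(*(p.strip() for p in parts))


def entries_at_level(lines: Iterable[str], levels=("E",)) -> List[LogEntry]:
    """Keep only the parsed entries whose level is in `levels`."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.level in levels]
```

Feeding it the lines of the qmaster or schedd messages file (or whatever turns up in /tmp) gives a quick list of the "E" entries to read first.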

I'd consider doing a hard reboot if possible, and also making sure that
no stale daemon processes are still holding the qmaster's TCP ports.
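
A rough way to see whether anything is still listening on the daemon ports is sketched below. The port numbers 6444 (sge_qmaster) and 6445 (sge_execd) are the commonly registered defaults and are an assumption here; your site may have set different values in /etc/services or in the SGE settings file. The helper names are hypothetical.

```python
import socket

# Assumed default ports; check /etc/services or your SGE settings file.
ASSUMED_PORTS = {"sge_qmaster": 6444, "sge_execd": 6445}


def port_in_use(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts a TCP connection on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


def report(host: str = "127.0.0.1", ports=ASSUMED_PORTS) -> dict:
    """Map each service name to whether its port is currently answering."""
    return {name: port_in_use(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, busy in report().items():
        print(f"{name}: {'in use' if busy else 'free'}")
```

If a port shows as in use while the startup script claims the daemons are stopped, a leftover process is the likely suspect.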

The CPU utilization does make me think that the qmaster is crunching  
active and pending job data in the spool. Were there very many jobs  
active when the system went down? Are you using classic or Berkeley DB
spooling? Normally corrupt Berkeley DB spool files generate very
obvious error messages, though.
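
If you are not sure which spooling method is in use, it is normally recorded in the cell's bootstrap file. The sketch below assumes that file lives at $SGE_ROOT/$SGE_CELL/common/bootstrap and holds whitespace-separated "key value" lines, including a spooling_method entry; verify that against your own install. The function names are made up for this example.

```python
import os


def read_bootstrap(text: str) -> dict:
    """Parse 'key value' lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def bootstrap_path(sge_root: str, sge_cell: str = "default") -> str:
    """Assumed location of the bootstrap file for a given cell."""
    return os.path.join(sge_root, sge_cell, "common", "bootstrap")


if __name__ == "__main__":
    path = bootstrap_path(os.environ["SGE_ROOT"],
                          os.environ.get("SGE_CELL", "default"))
    with open(path) as handle:
        settings = read_bootstrap(handle.read())
    print("spooling_method:", settings.get("spooling_method", "(not found)"))
```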

If your system has been up and running for a very long time without  
issue I'd also look for subtle configuration changes that could have  
entered your cluster over time that SGE would not notice until it was  
restarted. Something as simple as a DNS change or someone manually  
changing /etc/hosts on you could also be a culprit (although that sort  
of stuff would not generate the CPU load you are seeing).
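
A quick forward and reverse lookup on the head node can rule the name resolution theory in or out. This is only a sketch using the standard resolver; the host name in the usage line comes from the quoted log, and check_host is a hypothetical helper. The lookup functions are passed in as parameters so the logic can be tried without touching the network.

```python
import socket


def check_host(name,
               forward=socket.gethostbyname,
               reverse=socket.gethostbyaddr):
    """Resolve name to an address and back, and report whether they agree."""
    address = forward(name)
    canonical, aliases, _addresses = reverse(address)
    reverse_names = sorted({canonical, *aliases})
    return {
        "name": name,
        "address": address,
        "reverse_names": reverse_names,
        # True only if the reverse lookup hands back the name we started with
        "consistent": name in reverse_names,
    }


if __name__ == "__main__":
    print(check_host("hoth.tuc.us.ray.com"))
```

Running it on the head node and on an exec host and comparing the two results should show whether they still agree on the master's name and address.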

-Chris


On Jul 10, 2009, at 10:34 AM, brettgrant99 wrote:

>
> There are no messages in the qmaster/messages file.  I didn't  
> realize that it would put stuff in /tmp.  However, looking at the  
> files, they have the same messages that I gave in my original email,  
> with the exception of:
>
> 07/09/2009 09:05:56|schedd|hoth|E|commlib error: got read error  
> (closing "hoth.tuc.us.ray.com/qmaster/1")
> 07/09/2009 09:05:57|schedd|hoth|E|commlib error: can't connect to  
> service (Connection refused)
> 07/09/2009 09:10:07|schedd|hoth|W|daemonize error: timeout while  
> waiting for daemonize state
> 07/09/2009 09:10:10|schedd|hoth|E|getting configuration: failed  
> receiving gdi request
> 07/09/2009 09:13:22|schedd|hoth|E|can't get configuration from  
> qmaster -- backgrounding
>
> Thanks,
> Brett Grant
>
> craffi <dag at sonsorol.org> wrote on 07/10/09 07:25 AM:
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Help.  My cluster has fallen and can get back up
>
> There should be good information in the SGE spool log
> ($SGE_ROOT/$SGE_CELL/spool/qmaster/messages). If that file is empty,
> look for log files in /tmp/, which is the panic log location SGE uses
> when all else fails.
>
>
> On Jul 10, 2009, at 10:04 AM, brettgrant99 wrote:
>
> > I have a cluster of Mac Xserves running SGE 6.1.  It has been up and
> > working fairly well for more than a year now.  Last week I submitted
> > a new kind of job that is very disk I/O intensive and I think that
> > it has caused some problems.
> >
> > Anyway, the headnode, where sgemaster runs, became totally
> > nonresponsive yesterday, so it was given a soft reboot.  When I
> > attempted to start up the sgemaster, it gives:
> >
> >   starting sge_qmaster
> >   starting sge_schedd
> >
> > and then it hangs for a while and then gives:
> >
> >   daemonize error: timeout while waiting for daemonize state
> >   error: getting configuration: failed receiving gdi request
> >   error: can't get configuration from qmaster -- backgrounding
> >
> > The entire time the sge_qmaster process takes about 62% of the
> > processor.  I let it do this for about 24 hours before I tried to
> > stop the process.  The script says that the qmaster and scheduler
> > are stopped, but they still have active processes.  I can kill them
> > only with a -9 argument.
> >
> > I get the above errors every time I try to restart the qmaster.  The
> > only file that seems to change is the heartbeat.  All of the other
> > files appear to have last changed right before the server became
> > non-responsive.
> >
> > The last time that I looked while the server was working there were
> > about 14k jobs in the queue.  I don't mind losing the jobs, but would
> > prefer not to have to reinstall gridware, which was the only
> > solution that I found from searching.  Perhaps I am not searching
> > for the right stuff.
> >
> > For the most part, sge installed just fine and has worked fairly
> > trouble-free, so I don't really know the ins and outs to diagnose
> > problems, so any help would be appreciated.
> >
> > Thanks,
> > Brett Grant
> >
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206426

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


