[GE users] Help. My cluster has fallen and can get back up

brettgrant99 grantb at grantb.org
Fri Jul 10 15:34:15 BST 2009


There are no messages in the qmaster/messages file.  I didn't realize that SGE would also write logs to /tmp.  Looking at the files there, they contain the same messages that I gave in my original email, plus the following:

07/09/2009 09:05:56|schedd|hoth|E|commlib error: got read error (closing "hoth.tuc.us.ray.com/qmaster/1")
07/09/2009 09:05:57|schedd|hoth|E|commlib error: can't connect to service (Connection refused)
07/09/2009 09:10:07|schedd|hoth|W|daemonize error: timeout while waiting for daemonize state
07/09/2009 09:10:10|schedd|hoth|E|getting configuration: failed receiving gdi request
07/09/2009 09:13:22|schedd|hoth|E|can't get configuration from qmaster -- backgrounding
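
For anyone sifting through these files, here is a minimal Python sketch that splits lines like the ones above into fields and keeps only the error entries. The field names (component, host, level) and the reading of "E" as error and "W" as warning are assumptions inferred from the sample lines, not taken from SGE documentation; LogEntry and parse_line are hypothetical names.

```python
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    date: str
    time: str
    component: str  # assumed meaning: daemon that wrote the line
    host: str       # assumed meaning: host the daemon runs on
    level: str      # assumed meaning: "E" = error, "W" = warning
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one pipe-delimited log line; return None if it does not fit."""
    # Split on the first four pipes only, so a '|' inside the message survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, component, host, level, message = parts
    date, _, time = stamp.partition(" ")
    return LogEntry(date, time, component, host, level, message)


# The five lines quoted in this message.
SAMPLE = [
    '07/09/2009 09:05:56|schedd|hoth|E|commlib error: got read error (closing "hoth.tuc.us.ray.com/qmaster/1")',
    "07/09/2009 09:05:57|schedd|hoth|E|commlib error: can't connect to service (Connection refused)",
    "07/09/2009 09:10:07|schedd|hoth|W|daemonize error: timeout while waiting for daemonize state",
    "07/09/2009 09:10:10|schedd|hoth|E|getting configuration: failed receiving gdi request",
    "07/09/2009 09:13:22|schedd|hoth|E|can't get configuration from qmaster -- backgrounding",
]

entries = [e for e in map(parse_line, SAMPLE) if e is not None]
errors = [e for e in entries if e.level == "E"]

if __name__ == "__main__":
    for e in errors:
        print(e.date, e.time, e.component, e.message)
```

To run it against a real file, replace SAMPLE with the lines read from that file.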

Thanks,
Brett Grant



On 07/10/09, at 07:25 AM, craffi <dag at sonsorol.org> wrote:

There should be good information in the SGE spool logs
($SGE_ROOT/$SGE_CELL/spool/qmaster/messages). If nothing useful is
there, look for log files in /tmp/, which is the panic log location
SGE uses as a last resort.
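
As a rough illustration of the two places mentioned above, the following Python sketch builds the qmaster messages path from the environment and lists candidate files in /tmp. The function names are hypothetical, the fallback cell name "default" is an assumption about a typical install, and the "*messages*" glob pattern is only a guess at how the panic logs are named.

```python
import glob
import os


def qmaster_messages_path(env=os.environ):
    """Return $SGE_ROOT/$SGE_CELL/spool/qmaster/messages, or None if SGE_ROOT is unset."""
    root = env.get("SGE_ROOT")
    if not root:
        return None
    # Assumption: fall back to a cell named "default" when SGE_CELL is unset.
    cell = env.get("SGE_CELL", "default")
    return os.path.join(root, cell, "spool", "qmaster", "messages")


def fallback_logs(tmp_dir="/tmp", pattern="*messages*"):
    """List files in tmp_dir matching pattern (the pattern is an assumption)."""
    return sorted(glob.glob(os.path.join(tmp_dir, pattern)))


if __name__ == "__main__":
    path = qmaster_messages_path()
    if path is None:
        print("SGE_ROOT is not set; source the SGE settings file first")
    elif os.path.isfile(path) and os.path.getsize(path) > 0:
        print("spool log:", path)
    else:
        print("spool log missing or empty:", path)
    for candidate in fallback_logs():
        print("possible panic log:", candidate)
```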


On Jul 10, 2009, at 10:04 AM, brettgrant99 wrote:

> I have a cluster of Mac Xserves running SGE 6.1.  It has been up and
> working fairly well for more than a year now.  Last week I submitted
> a new kind of job that is very disk I/O intensive and I think that
> it has caused some problems.
>
> Anyway, the headnode, where sgemaster runs, became totally
> nonresponsive yesterday, so it was given a soft reboot.  When I
> attempted to start up the sgemaster, it gives:
>
>   starting sge_qmaster
>   starting sge_schedd
>
> and then it hangs for a while and then gives:
>
>   daemonize error: timeout while waiting for daemonize state
>   error: getting configuration: failed receiving gdi request
>   error: can't get configuration from qmaster -- backgrounding
>
> The entire time, the sge_qmaster process takes about 62% of the
> processor.  I let it do this for about 24 hours before I tried to
> stop the process.  The script says that the qmaster and scheduler
> are stopped, but they still have active processes.  I can kill them
> only with a -9 argument.
>
> I get the above errors every time I try to restart the qmaster.  The
> only file that seems to change is the heartbeat.  All of the other
> files appear to have last changed right before the server became non-
> responsive.
>
> The last time that I looked while the server was working there were
> about 14k jobs in the queue.  I don't mind losing the jobs, but would
> prefer not to have to reinstall gridware, which was the only
> solution that I found from searching.  Perhaps I am not searching
> for the right stuff.
>
> For the most part, sge installed just fine and has worked fairly
> trouble-free, so I don't really know the ins and outs to diagnose
> problems, so any help would be appreciated.
>
> Thanks,
> Brett Grant
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206422
