[GE users] Help. My cluster has fallen and can't get back up

craffi dag at sonsorol.org
Fri Jul 10 15:24:17 BST 2009


There should be good information in the SGE qmaster spool log
($SGE_ROOT/$SGE_CELL/spool/qmaster/messages). If that file tells you
nothing useful, look for log files in /tmp/, which is the panic log
location SGE uses as a last resort.
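
A minimal sketch of that check, in Python, for anyone who wants to script it. It assumes the qmaster spool directory is in the location given above (a site can configure a different one), that SGE_CELL falls back to "default" when unset, and that panic logs in /tmp have "messages" somewhere in their file name. That glob pattern is a guess for illustration, not something confirmed in this thread, and the function names are made up.

```python
import glob
import os
from collections import deque


def qmaster_messages_path(sge_root, sge_cell="default"):
    """Build $SGE_ROOT/$SGE_CELL/spool/qmaster/messages (the assumed spool layout)."""
    return os.path.join(sge_root, sge_cell, "spool", "qmaster", "messages")


def tail(path, n=50):
    """Return the last n lines of a text file without holding the whole file in memory."""
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def panic_logs(tmp_dir="/tmp", pattern="*messages*"):
    """List files in tmp_dir matching pattern, newest first. The pattern is an assumption."""
    matches = glob.glob(os.path.join(tmp_dir, pattern))
    return sorted(matches, key=os.path.getmtime, reverse=True)


if __name__ == "__main__":
    root = os.environ.get("SGE_ROOT", "")
    cell = os.environ.get("SGE_CELL", "default")
    path = qmaster_messages_path(root, cell)
    if os.path.isfile(path):
        print("\n".join(tail(path)))
    else:
        print("no messages file at", path)
    for log in panic_logs():
        print(log)
```

Run it on the head node as a user who can read the spool directory. If the messages file stopped changing when the machine hung, the last lines written before the hang are the ones to read first.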


On Jul 10, 2009, at 10:04 AM, brettgrant99 wrote:

> I have a cluster of Mac Xserves running SGE 6.1.  It has been up and
> working fairly well for more than a year now.  Last week I submitted
> a new kind of job that is very disk I/O intensive, and I think that
> it has caused some problems.
>
> Anyway, the headnode, where sgemaster runs, became totally  
> nonresponsive yesterday, so it was given a soft reboot.  When I  
> attempted to start up the sgemaster, it gives:
>
>   starting sge_qmaster
>   starting sge_schedd
>
> and then it hangs for a while and then gives:
>
>   daemonize error: timeout while waiting for daemonize state
>   error: getting configuration: failed receiving gdi request
>   error: can't get configuration from qmaster -- backgrounding
>
> The entire time, the sge_qmaster process takes about 62% of the
> processor.  I let it do this for about 24 hours before I tried to
> stop the process.  The script says that the qmaster and scheduler
> are stopped, but they still have active processes.  I can kill them
> only with kill -9.
>
> I get the above errors every time I try to restart the qmaster.  The
> only file that seems to change is the heartbeat.  All of the other
> files appear to have last changed right before the server became
> non-responsive.
>
> The last time that I looked while the server was working, there were
> about 14k jobs in the queue.  I don't mind losing the jobs, but would
> prefer not to have to reinstall gridware, which was the only
> solution that I found from searching.  Perhaps I am not searching
> for the right stuff.
>
> For the most part, SGE installed just fine and has worked fairly
> trouble-free, so I don't really know the ins and outs of diagnosing
> problems.  Any help would be appreciated.
>
> Thanks,
> Brett Grant
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206422