[GE users] new exec host grief
dag at sonsorol.org
Fri Feb 13 12:23:13 GMT 2009
The root cause seems to be that the compute node can't get to port 701
on host "linux6" - you should look into the standard firewall,
routing, DNS lookup and other issues that typically can cause "can't
get to host X, port Y" type problems.
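One quick way to confirm this from the compute node is to attempt a plain TCP connection to the qmaster port. Below is a minimal Python sketch of that check. The function name `can_connect` and the 3-second timeout are my own illustrative choices; the host name and port in the usage comment come from the error message in your output.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers name-resolution failure, connection refused,
        # timeouts and "no route to host" alike.
        return False


# Example, run on the new exec host (values taken from the error message):
#   can_connect("linux6", 701)
```

A False result only says the connection failed, not why. Name resolution, firewall and routing problems all look the same at this level, so it is a starting point rather than a diagnosis.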
- Double check that the exact hostname listed in
$SGE_ROOT/$SGE_CELL/common/act_qmaster is resolvable and that there are
no typos in /etc/hosts. Based on your pasted output, it appears your
master is called
- Verify that DNS is not giving different information than /etc/hosts
- Check /tmp for log messages from the sge_execd
- Check the spool logs for minitel
- Check the process table on minitel to make sure there are no
old/zombie sge_execd daemons still cluttering things up
- Check the sge_qmaster spool messages file just to see if there is
anything interesting there
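The first two checks in the list can be roughed out in Python along these lines. This is only a sketch under stated assumptions: the helper names `hosts_file_lookup` and `report_master_resolution` are hypothetical, the cell name "default" is assumed, and `socket.gethostbyname` follows the system's normal lookup order (which may consult /etc/hosts before DNS), so it is not a pure DNS query.

```python
import os
import socket


def hosts_file_lookup(hosts_text, name):
    """Return the addresses that /etc/hosts-style text maps to `name`."""
    addresses = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        # fields[0] is the address; the rest are the name and its aliases.
        if name in fields[1:]:
            addresses.append(fields[0])
    return addresses


def report_master_resolution(sge_root, sge_cell="default",
                             hosts_path="/etc/hosts"):
    """Compare what /etc/hosts and the system resolver say about the master."""
    act_qmaster = os.path.join(sge_root, sge_cell, "common", "act_qmaster")
    with open(act_qmaster) as fh:
        master = fh.read().strip()
    with open(hosts_path) as fh:
        from_hosts = hosts_file_lookup(fh.read(), master)
    try:
        from_resolver = socket.gethostbyname(master)
    except OSError:
        from_resolver = None  # the name did not resolve at all
    print("master:", master)
    print("/etc/hosts:", from_hosts)
    print("resolver:", from_resolver)
    return master, from_hosts, from_resolver
```

Running `report_master_resolution(os.environ["SGE_ROOT"])` on the new exec host should show whether the two sources agree. To query DNS alone, a tool such as `host` or `dig` is the better choice.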
On Feb 13, 2009, at 6:45 AM, lonegroover wrote:
> Trying to add a new execution host to my cluster is proving a shade
> awkward. I've added the new host to the grid according to the
> documentation, doing qconf -mq all.q, qconf -mhgrp @allhosts and
> even qconf -ah <new hostname>.
> I've added the relevant /etc/services entries on the new box, and
> given it the cluster name and qmaster name. It can resolve the
> qmaster name as is thanks to /etc/hosts.
> On the master I can see the host in the output of qstat -f, ie:
> all.q@minitel BIP 0/2 -NA- -NA- au
> .. but starting the gridengine daemon on the new box gives:
> root@minitel:/etc/init.d# ./gridengine-exec start
> error: can't connect to service
> error: can't get configuration from qmaster -- backgrounding
> error: getting configuration: unable to contact qmaster using port
> 701 on host "linux6"
> .. every time.
> Can anyone suggest a possible avenue of problem-solving opportunity?