[GE users] Exec daemon can't resolve master hostname
andrew.reid at nist.gov
Thu Jan 7 21:59:46 GMT 2010
Hi all --
I am attempting to install the Debian-packaged version of the Sun GridEngine on a group of Debian "lenny" machines, and have run into a problem.
The configuration is as follows: the master host has two network interfaces, a routable "front" one and a "back" one, 192.168.0.206, connected to the cluster subnet. The "front" interface IP address has a DNS host name, and $SGE_ROOT/utilbin/lx26-amd64/gethostname reports this name. The "back" interface has a name assigned via the /etc/hosts file.
There is one submit host (so far), and it's similarly configured.
The exec hosts are all on the private subnet, and can only see the "back" of the master host. All of the hosts, master, submit, and exec, are configured to use the /etc/hosts name of the "back" master interface as the master.
But, after running for two minutes, the exec daemons report:
> E can't send asynchronous message to commproc (qmaster:1) on host "<configured-master-name>": can't resolve host name
Following this, the host disappears from the queue, and jobs can no longer be run.
The cluster subnet network configuration appears to be fine. I can ping the master host by name, and I can ssh to it. /etc/nsswitch.conf is set up for "files" name resolution on the exec hosts. The SGE-provided gethostbyname and gethostbyaddr utilities give answers that are consistent and correct on the exec hosts.
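For reference, the consistency check described above amounts to roughly the following. This is only a sketch in Python using the standard socket module, not the SGE utilities themselves, and it may not resolve names exactly the way the daemons do. The function name check_round_trip and the host names in the usage note are made up for illustration.

```python
import socket


def check_round_trip(name,
                     forward=socket.gethostbyname_ex,
                     reverse=socket.gethostbyaddr):
    """Resolve name -> addresses -> names and report what comes back.

    Returns a dict mapping each address to a tuple:
    (True if `name` is among the names returned by the reverse lookup,
     primary name returned by the reverse lookup).
    The resolver functions are parameters so the check can be exercised
    without touching the real resolver.
    """
    _, _, addresses = forward(name)
    results = {}
    for addr in addresses:
        primary, aliases, _ = reverse(addr)
        names = [primary] + list(aliases)
        results[addr] = (name in names, primary)
    return results
```

Running check_round_trip("master-back") on an exec host (where "master-back" stands in for the configured master name) would show whether the reverse lookup of the back address returns the configured name as its primary name or only as an alias.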
The only possible sources of trouble I can see are, firstly, that the master host's gethostname gives an answer which is not consistent with the configured master host name, and secondly, in the exec host's /etc/hosts files, some of the aliases for the master host are the same as the master host's DNS name, i.e. that of the "front" interface.
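To make the second suspicion concrete, here is a small Python sketch that scans hosts-file text and lists the addresses on which a given name appears as an alias rather than as the first (canonical) name. The helper names parse_hosts and alias_collisions, and the sample host names, are hypothetical; whether such an overlap is what actually upsets the exec daemon is exactly what I do not know.

```python
def parse_hosts(hosts_text):
    """Parse hosts-file text into (address, canonical_name, aliases) tuples."""
    entries = []
    for raw in hosts_text.splitlines():
        # Drop trailing comments, then split on whitespace.
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1], fields[2:]))
    return entries


def alias_collisions(hosts_text, dns_name):
    """Return addresses whose alias list (not canonical name) contains dns_name."""
    wanted = dns_name.lower()
    return [addr
            for addr, _canonical, aliases in parse_hosts(hosts_text)
            if wanted in (alias.lower() for alias in aliases)]


# Hypothetical hosts-file content resembling the setup described above.
sample = (
    "127.0.0.1      localhost\n"
    "192.168.0.206  master-back front.example.org front  # cluster side\n"
)
collisions = alias_collisions(sample, "front.example.org")
```

With the sample text, collisions holds the back address, meaning the front DNS name is also attached to the private interface in that hosts file.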
I am perplexed, and would be grateful for any extra clues...