[GE users] SGE 6.0u1 problem with name resolution
Ed.Struzynski at rlx.com
Mon Nov 29 14:20:36 GMT 2004
I'm having a problem configuring a Linux-based SGE 6.0u1 cluster. I have scripts that automatically add and remove hostgroups and queues. They normally work, but about one run in four leaves me with no queues or hostgroups. Examining the output of the SGE commands the scripts run, I find something like:
# qconf -Ahgrp /tmp/hostgroup
unable to resolve host "rlx-100-190-1"
Checking /etc/hosts, I find that all of my hostname entries are correct. I've set the hosts entry in /etc/nsswitch.conf to "files" only. When I run the SGE utilbin gethostbyname program just before the qconf call in the script, it prints the correct data. Grepping for the hostname in /etc/hosts also finds it, and I can ping the host I'm trying to add from within the script with no problems. But the qconf still fails.
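For anyone wanting to repeat these checks from inside a script, here is a minimal sketch in POSIX sh. The function names hosts_sources and host_in_file are made up for illustration and are not part of SGE. It only confirms what the system resolver is configured to do and that the name is present in a hosts-format file; it says nothing about how qconf or qmaster resolve names internally.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SGE) for checking resolver configuration.

# Print the lookup sources configured for "hosts" in an nsswitch.conf-style file.
# Usage: hosts_sources [path]    (path defaults to /etc/nsswitch.conf)
hosts_sources() {
    awk '
        { sub(/#.*/, "") }                      # drop trailing comments
        $1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print; exit }
    ' "${1:-/etc/nsswitch.conf}"
}

# Return 0 if NAME appears as a hostname or alias in a hosts-format file.
# Usage: host_in_file NAME [path]    (path defaults to /etc/hosts)
host_in_file() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                      # drop trailing comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }
    ' "${2:-/etc/hosts}"
}

# Example use just before the qconf call (hostname taken from the error above):
#   [ "$(hosts_sources)" = "files" ] || echo "hosts lookup is not files-only" >&2
#   host_in_file rlx-100-190-1 || echo "rlx-100-190-1 missing from /etc/hosts" >&2
```

Unlike a plain grep, host_in_file matches whole fields only, so a name that is a prefix of another hostname does not produce a false positive.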
Usually, after the script fails I can rerun the failed command by hand and it will work.
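Since a manual rerun usually succeeds, a stopgap is to have the script retry the failed command. The sketch below is a workaround only, not a fix for the underlying resolution problem. The retry function is a hypothetical helper, and it assumes that qconf exits with a non-zero status when it prints "unable to resolve host"; if it does not, the script would have to inspect the command output instead.

```shell
#!/bin/sh
# retry MAX DELAY CMD [ARGS...]
# Hypothetical helper: run CMD until it succeeds or MAX attempts have been made,
# sleeping DELAY seconds between attempts. Returns 0 on success, 1 otherwise.
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_attempt=1
    while :; do
        "$@" && return 0
        [ "$retry_attempt" -ge "$retry_max" ] && return 1
        echo "attempt $retry_attempt of $retry_max failed: $*" >&2
        retry_attempt=$((retry_attempt + 1))
        sleep "$retry_delay"
    done
}

# Example use in the configuration script (assumes a non-zero exit on failure):
#   retry 3 2 qconf -Ahgrp /tmp/hostgroup || echo "hostgroup not added" >&2
```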
The only way I've found to reliably solve this problem is to turn on DNS and have it supply the hostname. So what I'm seeing is that SGE will (*sometimes*) use DNS to resolve hostnames even when the system is explicitly set up not to use DNS. Note that even if DNS is turned on and the order in nsswitch.conf is files, then dns, the hostname must be resolvable in DNS for SGE to work; just having an entry in /etc/hosts still does not work.