[GE users] How to clear internal hostname cache?

Andy Schwierskott andy.schwierskott at sun.com
Tue Mar 21 12:02:48 GMT 2006


Hi

Did you already shut down qmaster/schedd/commd and restart them?

A qmaster/commd restart will be pretty quick (unless you have zillions of
jobs in the cluster), so there will be only a short service interruption.

Note:  qconf -ks -km does not kill sge_commd, so you need to kill sge_commd
separately. Once qmaster is down you can safely send it a SIGKILL.
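
As a rough sketch of that sequence: only "qconf -ks -km" comes from the note
above. The "run" dry-run wrapper is a hypothetical helper, and using pkill to
stop sge_commd is an assumption about your system, so treat this as an
illustration rather than a tested procedure. By default it only prints what
it would do.

```shell
#!/bin/sh
# Sketch only: prints the shutdown steps by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper (not an SGE tool): echo the command in dry-run mode,
# otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

shutdown_sge() {
    # Stop the scheduler and qmaster (flags taken from the note above).
    run qconf -ks -km
    # qconf leaves sge_commd running, so stop it separately.
    # Assumption: pkill is available; send SIGKILL only once qmaster is down.
    run pkill -KILL -x sge_commd
}

shutdown_sge
```

After that, restart the daemons with whatever startup script your
installation uses.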

Of course the recommendation in the thread is right: do a "grep" in the
default/common and qmaster spool directories to check for any old references
*in* the spooled files:

     cd <qmaster_spool_dir>
     grep network-0-0 *
     grep network-0-0 */*
     cd <sge_root>/<cell>/common
     grep network-0-0 *
     grep network-0-0 */*
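
The same check can be sketched as one recursive search, which also catches
files nested more than one level deep. The function name find_stale_host is
made up for illustration, and "grep -r" assumes a grep that supports
recursion (GNU and BSD grep do).

```shell
#!/bin/sh
# Sketch only: list every file under the given directories that mentions a hostname.
# find_stale_host is a hypothetical helper name, not an SGE command.
find_stale_host() {
    name="$1"
    shift
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            # -r: descend into subdirectories, -l: print matching file names only,
            # -F: treat the name as a fixed string, so "." is not a wildcard.
            grep -rlF -e "$name" -- "$dir"
        else
            echo "skipping $dir: not a directory" >&2
        fi
    done
    # grep exits non-zero when nothing matches; that is not an error here.
    return 0
}
```

Call it with the stale name followed by your qmaster spool directory and your
<sge_root>/<cell>/common directory. Any file it prints still contains the old
hostname.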

Andy

> Hi Andy,
>
> output as follows:
>
> [root at compute-0-7 lx24-amd64]# ./gethostbyname compute-0-7.local
> Hostname: compute-0-7.local
> Aliases:  compute-0-7
> Host Address(es): 10.255.255.247
>
> [root at compute-0-7 lx24-amd64]# ./gethostbyname compute-0-7
> Hostname: compute-0-7.local
> Aliases:  compute-0-7
> Host Address(es): 10.255.255.247
>
> [root at compute-0-7 lx24-amd64]# ./gethostbyaddr 10.255.255.247
> Hostname: compute-0-7.local
> Aliases:  compute-0-7
> Host Address(es): 10.255.255.247
>
>
> In case you are wondering how "network-0-0" came into the picture:
> https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2006-March/017441.html
>
> Thread: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2006-March/thread.html#17441
>
>
> I guess "network-0-0" is defined somewhere but I'm out of my wits.
>
>
> Thanks,
> KL
>
>
> On 3/21/06, Andy Schwierskott <andy.schwierskott at sun.com> wrote:
>> Kim Leng,
>>
>> qmaster sees the execution host 10.255.255.247 as host "network-0-0.local".
>>
>> The reason can be errors in the hostname resolving as Chris wrote or the
>> execution host has several network interfaces.
>>
>> What's the output on the qmaster host when you enter:
>>
>>     <sge-root>/utilbin/<arch>/gethostbyname compute-0-7.local
>>     <sge-root>/utilbin/<arch>/gethostbyname compute-0-7
>>     <sge-root>/utilbin/<arch>/gethostbyaddr 10.255.255.247
>>
>> Andy
>>
>>
>>
>> On Tue, 21 Mar 2006, Kim Leng Goh wrote:
>>
>>> Hi Andy,
>>>  I do not have "127.0.0.1   localhost  compute-0-7.local" but:
>>>
>>> [root at compute-0-7 root]# head -5 /etc/hosts
>>> # Do not remove the following line, or various programs
>>> # that require network functionality will fail.
>>> 127.0.0.1 localhost.localdomain localhost
>>> 172.18.36.248 frontend.foo.com
>>> 10.255.255.247  compute-0-7.local  compute-0-7
>>>
>>>
>>> I changed the last line above to "10.255.255.247  compute-0-7" and
>>> "qstat -f" still returns:
>>>
>>> [root at compute-0-7 root]# qstat -f
>>> denied: host "network-0-0.local" is neither submit nor admin host
>>>
>>>
>>> Thanks,
>>> KL
>>>
>>> On 3/21/06, Andy Schwierskott <andy.schwierskott at sun.com> wrote:
>>>> Hi,
>>>>
>>>> the message
>>>>
>>>>>> This host has the local hostname >compute-0-7.local<.
>>>>
>>>> indicates that /etc/hosts lists the actual hostname as an alias for
>>>> localhost:
>>>>
>>>>    127.0.0.1   localhost  compute-0-7.local
>>>>
>>>> This happens in some Linux distributions.
>>>>
>>>> Delete "compute-0-7.local" from that line and any other names but
>>>> "localhost" and you'll be fine regarding this error.
>>>>
>>>> Andy
>>>>
>>>>
>>>>
>>>> On Tue, 21 Mar 2006, Chris Dagdigian wrote:
>>>>
>>>>>
>>>>> I'm willing to bet that this hostname is defined somewhere on your system.
>>>>> I've wrestled with SGE hostname resolution issues on many clusters and in
>>>>> many complicated network, hostname and DNS resolving environments and the
>>>>> root cause for name issues was *always* external and not within SGE.
>>>>>
>>>>> I've also not seen caching activity do anything significant when making
>>>>> changes -- when I've fixed DNS or nameservice mistakes they are quickly
>>>>> picked up by SGE.
>>>>>
>>>>> You did not mention testing with the "gethostname" and "gethostbyaddr" and
>>>>> the other utility binaries that should be in /opt/gridengine/utilbin/<arch>
>>>>> on your system. Try running those directly to see what SGE sees. After that,
>>>>> carefully make sure that what is in /etc/hosts matches what is being returned
>>>>> by forward and reverse DNS. Depending on your operating system there can also
>>>>> be other files and locations where hardcoded hostnames may be lying around.
>>>>>
>>>>>
>>>>> -Chris
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mar 21, 2006, at 4:46 AM, Kim Leng Goh wrote:
>>>>>
>>>>>> Hi Christian,
>>>>>>   Thanks for the speedy reply.
>>>>>>
>>>>>> On 3/21/06, christian reissmann <Christian.Reissmann at sun.com> wrote:
>>>>>> [...]
>>>>>>>
>>>>>>> The cl_commlib.c module was developed for 6.0! The 5.3p6 version uses
>>>>>>> sge_commd to resolve hostnames and has no cache at all.
>>>>>>> So I don't understand the question.
>>>>>> [...]
>>>>>>
>>>>>> My problem is that SGE seems to think that my compute-0-7 node has the
>>>>>> hostname "network-0-0.local" when in fact it doesn't (which prompted me
>>>>>> to think that this was in some cache somewhere or stored somewhere
>>>>>> else):
>>>>>>
>>>>>> [root at compute-0-7 root]# qstat -f
>>>>>> denied: host "network-0-0.local" is neither submit nor admin host
>>>>>>
>>>>>>
>>>>>> Reinstalling sge on the compute node or reinstalling the compute node
>>>>>> doesn't seem to help:
>>>>>>
>>>>>>
>>>>>> [root at compute-0-7 gridengine]# ./install_execd -auto
>>>>>>
>>>>>> Confirm Grid Engine default installation settings
>>>>>> -------------------------------------------------
>>>>>>
>>>>>> The following default settings can be used for an accelerated
>>>>>> installation procedure:
>>>>>>
>>>>>>       $SGE_ROOT          = /opt/gridengine
>>>>>>       service            = sge_commd
>>>>>>       admin user account = sge
>>>>>>
>>>>>> Do you want to use these configuration parameters (y/n) [y] >>
>>>>>> denied: host "network-0-0.local" is neither submit nor admin host
>>>>>>
>>>>>>
>>>>>>
>>>>>> Checking hostname resolving
>>>>>> ---------------------------
>>>>>> denied: host "network-0-0.local" is neither submit nor admin host
>>>>>>
>>>>>> denied: host "network-0-0.local" is neither submit nor admin host
>>>>>>
>>>>>>
>>>>>> This host has the local hostname >compute-0-7.local<.
>>>>>>
>>>>>> This host is unknown on the qmaster host.
>>>>>>
>>>>>> Please make sure that you added this host as administrative host!
>>>>>> If you did not, please add this host now with the command
>>>>>>
>>>>>>    # qconf -ah HOSTNAME
>>>>>>
>>>>>> on your qmaster host.
>>>>>>
>>>>>> Check again (y/n) [y] >>
>>>>>>
>>>>>> Checking hostname resolving
>>>>>> ---------------------------
>>>>>> denied: host "network-0-0.local" is neither submit nor admin host
>>>>>>
>>>>>> denied: host "network-0-0.local" is neither submit nor admin host
>>>>>>
>>>>>>
>>>>>> This host has the local hostname >compute-0-7.local<.
>>>>>>
>>>>>> This host is unknown on the qmaster host.
>>>>>>
>>>>>> Please make sure that you added this host as administrative host!
>>>>>> If you did not, please add this host now with the command
>>>>>>
>>>>>>    # qconf -ah HOSTNAME
>>>>>>
>>>>>> on your qmaster host.
>>>>>>
>>>>>> If this host is already added as administrative host on your qmaster host
>>>>>> there may be a hostname resolving problem on this machine.
>>>>>>
>>>>>> Please check your >/etc/hosts< file and >/etc/nsswitch.conf< file.
>>>>>>
>>>>>> Hostname resolving problems will cause the problem that the
>>>>>> execution host will not be accepted by qmaster. Qmaster will
>>>>>> receive no load report values and show a load value
>>>>>> (>load_avg<) of 99.99 for this host.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>>
