[GE users] access denied (client IP resolved to host name "". This is not identical to clients host name "")

Chris Dagdigian dag at sonsorol.org
Mon May 8 15:41:51 BST 2006


Hi John,

I've posted about this error before when I was struggling with it  
showing up on maybe 10% of the Apple OS X based SGE systems I've been  
involved with.

The main points we were able to discover after much trial and error:

  - No correlation to Apple CPU architecture (G4 vs. G5)
  - No correlation to OS X version number or installed update patches
  - All systems showing the problem had good forward/reverse DNS ability
  - The utility binaries in $SGE_ROOT/utilbin/ were all able to  
successfully query DNS forward and reverse
  - The command that generated the error will usually work if retried a
minute or so later
  - Only seen on Apple OS X (until now!)
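
For anyone who wants to repeat the forward/reverse comparison outside of
the utilbin tools, here is a minimal Python sketch. It is only an
illustration: the function name and the injectable resolve/reverse
parameters are made up for this example, and it goes through the system
resolver rather than Grid Engine's commlib. Since the utilbin checks
passed on the affected systems, a clean result here does not rule the
problem out.

```python
import socket


def check_forward_reverse(hostname,
                          resolve=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, map that IP back to a name, and compare.

    resolve/reverse default to the system resolver; they are parameters
    (hypothetical, for this sketch only) so the check can be exercised
    without touching real DNS.
    """
    ip = resolve(hostname)
    reverse_name, aliases, _addresses = reverse(ip)
    # Host names are case-insensitive, so compare in lower case.
    known_names = {reverse_name.lower()} | {a.lower() for a in aliases}
    return {
        "hostname": hostname,
        "ip": ip,
        "reverse_name": reverse_name,
        "consistent": hostname.lower() in known_names,
    }

# Example against the local machine (performs real lookups):
#   print(check_forward_reverse(socket.gethostname()))
```

A result with "consistent" set to False means the name that comes back
from the reverse lookup (or its aliases) does not include the name that
was looked up in the first place.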

As to making the problem go away on the affected systems:

  - On ~50% of the systems showing the problem, a simple upgrade to
the latest courtesy binaries fixed things
  - The remaining ~50% required building SGE from source on the
affected system

So the good news is that I've never worked on a system where I was not
able to make the problem go away, although at the extreme end I had to
build Grid Engine from source on the affected system. My email from
last week was the first time I had seen the issue on Linux.

It was also the first time that a change to /etc/hosts seems to have
cleared things up. The specific change I made was making sure that
/etc/hosts had a proper entry listing *exactly* the machine name that
Grid Engine thinks it is using (i.e. the one defined in
$SGE_ROOT/default/common/act_qmaster). I'm sure, though, that we had
/etc/hosts set up this way on the Apple systems that still had the
problem, which is why I was pleasantly surprised that the issue
cleared itself on the Opteron/CentOS system after the hosts edit.
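
The /etc/hosts check described above can be sketched in Python roughly
as follows. This is an illustration under assumptions, not something
from Grid Engine itself: both function names are hypothetical, and the
sketch assumes act_qmaster contains nothing but the qmaster host name.
The match is deliberately exact (no case folding, no short-name
fallback), because the point is that the name must appear exactly as
Grid Engine has it.

```python
def hosts_names(hosts_text):
    """Return {name: ip} for every host name in /etc/hosts-style text."""
    mapping = {}
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            # Keep the first entry for a name, as a first-match lookup would.
            mapping.setdefault(name, ip)
    return mapping


def qmaster_in_hosts(act_qmaster_text, hosts_text):
    """Return (qmaster_name, ip), with ip None if there is no exact entry."""
    name = act_qmaster_text.strip()
    return name, hosts_names(hosts_text).get(name)

# Example with real files (paths as used in the message above):
#   import os
#   root = os.environ["SGE_ROOT"]
#   with open(os.path.join(root, "default/common/act_qmaster")) as f:
#       qmaster = f.read()
#   with open("/etc/hosts") as f:
#       print(qmaster_in_hosts(qmaster, f.read()))
```

If the second value comes back as None, /etc/hosts has no entry for the
exact name in act_qmaster. If it comes back as a loopback address, the
entry exists but does not carry the public IP.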

Regards,
Chris



On May 8, 2006, at 10:11 AM, John Saalwaechter wrote:

> For what it's worth, my qmaster system has this same problem
> all the time.  I'd say that more than 50% of the time any
> SGE command run from the qmaster results in the error message.
> This problem only happens on our qmaster, so I've worked around
> it by always using another host to do SGE admin work.
>
> Of note is the fact that this is a SPARC V880 running Solaris 9
> and N1GE 6.0u4.  Like Chris, I've checked and rechecked all
> DNS and /etc/hosts entries, but I cannot find any problems there.
>
> Chris -- can you explain in more detail your comments below
> about /etc/hosts?  My system is not behind any private network,
> but we do have a private link on this host for NFS connectivity
> to $SGE_ROOT.
>
> Also, when I get the error, it's also accompanied by this:
> ERROR: unable to contact qmaster using port 537 on host "xxx"
>
> John
>
> --- Chris Dagdigian <dag at sonsorol.org> wrote:
>
>>
>> I got lucky today.
>>
>> For the first time ever on a non-Apple OS X system I was able to
>> recreate the mysterious
>>
>>   access denied (client IP resolved to host name "". This is not
>> identical to clients host name "")
>>
>> ... error
>>
>> To make things more fun, the error condition also produces
>> another bug-worthy case of non-compliant XML output: the empty "<>"
>> tags break automated XML parsers.
>>
>> Check this out:
>>
>>> [dag at test xmlqstat]$ qstat -f -xml -j 1
>>> error: commlib error: access denied (client IP resolved to host
>>> name "". This is not identical to clients host name "")
>>> <?xml version='1.0'?>
>>> <comunication_error  xmlns:xsd="http://www.w3.org/2001/XMLSchema">
>>>   <>
>>>     <AN_status>11</AN_status>
>>>     <AN_text>unable to contact qmaster using port 701 on host
>>> "test.gridengine.info"</AN_text>
>>>     <AN_quality>0</AN_quality>
>>>   </>
>>> </comunication_error>
>>> *** glibc detected *** double free or corruption (fasttop):
>>> 0x0000000040254440 ***
>>> Aborted
>>> [dag at test xmlqstat]$
>>
>> This was in the qmaster messages spool file:
>>> 05/04/2006 17:39:43|qmaster|test|E|commlib error: local host name
>>> error (can't resolve client IP address)
>>
>> This is on a single-CPU Opteron system running CentOS 4 and SGE
>> courtesy binaries downloaded about 30 minutes ago (SGE 6.0u7)
>>
>> This system has good DNS and working utilbin/ binaries but it did not
>> have an entry in /etc/hosts with the public IP and fully qualified
>> hostname.
>>
>> Shortly after making the /etc/hosts entry the problem went away.
>>
>> In my experience with this error in the past, it's always been a
>> transient "comes and goes" issue. I'm hoping the /etc/hosts addition
>> resolved the problem but it would also be nice if it does not since
>> this is a testing box that I can use for further tracing and
>> debugging if needed. I'm also going to see if I can find the bits of
>> source code that may be producing the bad XML output for this error
>> condition.
>>
>> -Chris
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
>
>
> --
> johnsaalwaechter at yahoo.com
