[GE users] error: commlib error: access denied (client IP resolved to host name "". (was Re: [GE users] Running MPICH jobs)

Chris Dagdigian dag at sonsorol.org
Tue May 2 14:21:25 BST 2006


This reply is not on-topic for this "running MPICH"  thread but I  
wanted to add my $.02 in here regarding this particular error message.

I see this error occasionally on Apple OS X based clusters. Usually
the main symptom is an SGE admin approaching us to say that "qstat"
will fail at random intervals and then suddenly start working again
within a minute or two. The specific error usually looks like this:


>> error: commlib error: access denied (client IP resolved to host name
>> "". This is not identical to clients host name "")
>> unable to contact qmaster using port 701 on host
>> "xxx.xxx(hostname deleted).xxx"


Whenever I've been able to log in to the system in question I've been
able to confirm the behavior -- sometimes qstat will work, sometimes
it will not and will fail with the error noted above. I have
collectively spent many days trying to fix this error. It appears
randomly on about 5% of the Apple OS X based clusters that I
work on. I've never been able to correlate it to a particular system
configuration and I've never been able to reproduce the error after
"fixing" it.  The operating system version does not matter and the
CPU arch (G4 vs G5) does not matter.
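
For anyone trying to confirm the same on/off behavior, a rough sketch
of how one might sample it is below. This is not something from my
notes on the affected clusters; the function name and the rule for
what counts as a failure (nonzero exit status, or "commlib error" in
stderr) are assumptions made for illustration.

```python
import subprocess
import time


def sample_command(cmd, attempts=10, delay=0.0):
    """Run cmd repeatedly and return a list of (attempt_index, stderr) failures.

    Assumption: a run counts as failed if it exits nonzero or if its
    stderr contains the text "commlib error".
    """
    failures = []
    for i in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0 or "commlib error" in result.stderr:
            failures.append((i, result.stderr.strip()))
        if delay:
            time.sleep(delay)  # pause between samples
    return failures
```

Calling something like sample_command(["qstat"], attempts=60, delay=5)
on a submit host would give a rough failure rate over five minutes.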

In all cases, forward and reverse DNS is functioning perfectly, both  
at the /etc/hosts and the DNS resolver levels.

In all cases, all of the SGE utilbin/ binaries are also functioning
perfectly and able to resolve names and IPs correctly and without error.
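
To illustrate the kind of forward and reverse consistency check meant
here, a minimal sketch independent of the SGE tools follows. It only
uses the system resolver through Python's socket module; the function
name and the injectable forward/reverse parameters are hypothetical,
and it does not claim to reproduce how commlib itself compares names.

```python
import socket


def check_forward_reverse(hostname, forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return (ip, reverse_name, matches) for one host name.

    matches is True when the name obtained by reverse lookup (or one
    of its aliases) equals the original host name, ignoring case.
    """
    ip = forward(hostname)                        # name -> IPv4 address
    reverse_name, aliases, _addrs = reverse(ip)   # address -> name(s)
    known = {reverse_name.lower()} | {a.lower() for a in aliases}
    return ip, reverse_name, hostname.lower() in known
```

Running check_forward_reverse(socket.getfqdn()) on each node, and
again with the qmaster host name, is one way to spot a mismatch. On
the clusters described here such a check would be expected to pass,
which is what makes the error so puzzling.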

Over the past year or so, I've been able to fix this issue on about  
50% of the SGE systems showing the behavior simply by dropping new or  
updated courtesy binaries into place.  The remaining 50% of the  
clusters are not fixed by this and continue to show the odd behavior  
even when the latest binaries are dropped into place.

For those systems not fixed by new binaries, the only way (after  
*much* trial and error and experimentation) I've been able to  
conclusively make the problem go away is to build Grid Engine from  
source on the affected system. Hand-built binaries installed into  
$SGE_ROOT have always cleared the issue. This is the only "fix" that  
works for us right now for this particular issue.

This is a real issue that I've seen on multiple different (Apple)
systems, but since I can't figure out the root cause or "fix" it by
any means other than rebuilding from source code, I've never filed an
Issue report. If I ever learn more I'll open up something in the
Issue tracker.

Anyway, like I said, this is not on topic for the thread, but the error
message quoted below brought back bad memories (heh!) and I thought
I'd send a note so it would get listed in the archives. Maybe this
will help someone doing a Google or archive search on "access denied
(client IP resolved to host name """ in the future.

-Chris



On May 2, 2006, at 5:33 AM, Reuti wrote:

>> error: commlib error: access denied (client IP resolved to host  
>> name "". This is not identical to clients host name "")
