[GE users] "commlib error: access denied" with 6.2 install_execd

Harry Mangalam harry.mangalam at uci.edu
Tue Oct 28 16:58:44 GMT 2008

Hi All,

Background follows immediately, real issue at **ISSUE** below

I'm an SGE newbie, and have installed sge6.2 on a test cluster that 
includes a login node & a scheduler node, each with an external i/f 
and an internal interface (192.168.0.xxx) to the compute nodes, which 
only have private i/f's.

root can ssh to all nodes without passwords and the SGE_ROOT is 
mounted by all nodes.

I've done both interactive installs and config-file-based installs and 
the qmaster installs fine on the sched node with the command:

 cd $SGE_ROOT; sudo ./inst_sge -m -auto /path/to/sge.conf

However, the execution nodes fail to install with the corresponding 
command:

 cd $SGE_ROOT; sudo ./inst_sge -x -auto /path/to/sge.conf

and give almost no feedback.  The log file produces output that implies 
that it has worked:

 bduc-i32-1
 remote execd installation on host bduc-i32-1
 adminhost "bduc-i32-1" already exists
 bduc-i32-2
 remote execd installation on host bduc-i32-2
 adminhost "bduc-i32-2" already exists

but also produces output like this:

 Reading configuration from file /home/hmangala/sge_bducs.conf
 gethostbyaddr() took 20 seconds and returns success
 do_ypcall: clnt_call: RPC: Timed out
 do_ypcall: clnt_call: RPC: Timed out
 gethostbyaddr() took 25 seconds and returns success
 do_ypcall: clnt_call: RPC: Timed out
 do_ypcall: clnt_call: RPC: Timed out
 do_ypcall: clnt_call: RPC: Timed out
 do_ypcall: clnt_call: RPC: Timed out
 TERM environment variable not set.
 gethostbyaddr() took 30 seconds and returns success
 gethostbyaddr() took 30 seconds and returns success
 gethostbyaddr() took 30 seconds and returns success

The log nonetheless implies that the install was successful in the end.

However, there's no SGE rc startup script written to the execution 
nodes and no sge_execd running on them afterwards.


Another post suggested running qhost to check the connections among 
hosts and that seems to have identified the problem:

From the qmaster host (sched):
$ qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
global                  -               -     -       -       -       -       -

From an execution host:
$ qhost
error: commlib error: access denied (client IP resolved to host name "bduc-login.nacs.uci.edu". This is not identical to clients host name "bduc-i32-2")
error: unable to contact qmaster using port 536 on host "bduc-sched.nacs.uci.edu"

For some reason the execution node is being resolved to the login 
node, but using nslookup and the included gethost... utils, the name 
& IP resolve correctly both ways on the login host, the sched host, 
and the execution host.
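
Going by the error text, the qmaster looks up the connecting client's IP and compares the resulting name with the name the client reports. One thing that may be worth ruling out (an assumption, nothing in this post confirms it) is that the hosts file on the sched host lists the same IP on more than one line, in which case the first entry typically wins for reverse lookups. A minimal shell sketch that lists such duplicates follows; the function name dup_ips is made up for illustration.

```shell
#!/bin/sh
# Sketch: print every IP address that appears on more than one line of a
# hosts-format file, together with the first name given on each line.
# dup_ips is a hypothetical helper name, not part of SGE.
dup_ips() {
    awk '
        /^[ \t]*(#|$)/ { next }    # skip comment lines and blank lines
        { count[$1]++; names[$1] = names[$1] " " $2 }
        END {
            for (ip in count)
                if (count[ip] > 1)
                    print ip ":" names[ip]
        }
    ' "${1:-/etc/hosts}"
}
```

Running `dup_ips /etc/hosts` on the sched host prints nothing if every IP is listed only once. Separately, `getent hosts <ip-of-exec-node>` on the sched host shows which name that host's own resolver order returns for the address, which can differ from what nslookup reports.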

The ports 536/7 are entered in the /etc/services on the sched host, 
but not on the exec hosts - is that a problem?  They are defined 
as 536/7 in the config file.
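
To compare the service entries across hosts, a small sketch like the following could be run on the sched host and on each exec host. It assumes the standard service names sge_qmaster and sge_execd; the function name sge_ports is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the SGE entries found in a services-format file
# (default /etc/services). sge_ports is a hypothetical helper name.
sge_ports() {
    awk '
        # match on the service name in column 1; comment lines never match
        $1 == "sge_qmaster" || $1 == "sge_execd" { print $1, $2 }
    ' "${1:-/etc/services}"
}
```

An empty result on an exec host confirms the entries are missing there. The exec host's error above already mentions port 536, so it appears to be getting the qmaster port from somewhere (possibly the SGE_QMASTER_PORT environment variable), which suggests the missing /etc/services entries may not be the cause of the access-denied error.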

Are there other suggestions to resolve this?  This appears to be 
similar to the issue as raised in:
but this one involves hostnames in conflict, not IP# <-> hostnames.

All references in the config file are by hostname, and the 
sched /etc/hosts file refers to them all by name as well (no 
naked IP #s are used).


Best harry

Harry Mangalam - Research Computing, NACS, E2148, Engineering Gateway, 
UC Irvine 92697  949 824-0084(o), 949 285-4487(c)
Why not work towards banning gloomy marriages?
