[GE users] "commlib error: access denied" with 6.2 install_execd

Harry Mangalam harry.mangalam at uci.edu
Tue Oct 28 22:53:13 GMT 2008



Reuti's advice was spot on.  There were at least 4 problems.  

1 - The two interfaces Reuti mentioned, which we addressed by making
explicit entries in the hosts file used to generate the YP maps.

2 - Using YP in the first place; I didn't understand how it was
configured.

3 - Unknown to me, the admins had set up the cluster execution nodes
to use Kerberos to allow direct login to them.  Kerberos makes
external name service calls, and since those calls are tunneled
through the login node, they all resolved to the issuing node, hence
the strange message.  When we turned off the Kerberos service, SGE
started behaving (qhost gave identical responses on the master and
exec nodes).

4 - I'm not sure why, but the nodes need explicit local-net entries
in /etc/hosts for the qmaster.  It seems that they shouldn't, since
they use YP maps, but nodes that /don't/ have the entry /don't/ work
and the few nodes that /do/ have it /do/ work.  This may be a holdover
from various other problems.  We'll see if it survives a reboot.
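
For anyone chasing the same "access denied" message, here is a minimal
sketch of the name -> IP -> name round trip that the error complains
about.  It is only an illustration, not what commlib does internally:
the function name check_roundtrip and the choice to compare short host
names case-insensitively are my own assumptions.  It checks what the
local resolver says, so it may help with points 1 and 4, but it may not
catch the situation in point 3, where the lookups looked fine on every
host.

```python
import socket


def check_roundtrip(hostname, forward=socket.gethostbyname,
                    reverse=socket.gethostbyaddr):
    """Resolve hostname -> IP -> name and report whether the names agree.

    Returns (ip, resolved_name, ok).  The resolver functions are
    parameters so the check can be exercised without a live name service.
    """
    ip = forward(hostname)
    # gethostbyaddr returns (primary_name, alias_list, address_list)
    resolved_name = reverse(ip)[0]

    def short(name):
        # Assumption: compare only the part before the first dot, ignoring case
        return name.split(".")[0].lower()

    return ip, resolved_name, short(resolved_name) == short(hostname)
```

Run on the qmaster and on an exec node, something like
check_roundtrip("bduc-i32-2") should come back with ok set to True on
both; a lookup that fails outright raises OSError instead.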

Thanks, Reuti!


On Tuesday 28 October 2008, Reuti wrote:
> Hi Harry,
>
> Am 28.10.2008 um 17:58 schrieb Harry Mangalam:
> > Hi All,
> >
> > Background follows immediately, real issue at **ISSUE** below
> >
> > I'm an SGE newbie, and have installed sge6.2 on a test cluster
> > that includes a login node & scheduler node, each with an
> > external interface and an internal interface (192.168.0.xxx) to
> > the compute nodes, which have only private interfaces.
> >
> > root can ssh to all nodes without passwords and the SGE_ROOT is
> > mounted by all nodes.
> >
> > I've done both interactive installs and config-file-based
> > installs and the qmaster installs fine on the sched node with the
> > command:
> >
> >  cd $SGE_ROOT; sudo ./inst_sge -m -auto /path/to/sge.conf
> >
> >
> > However, the execution nodes fail to install with the
> > corresponding command:
> >
> >  cd $SGE_ROOT; sudo ./inst_sge -x -auto /path/to/sge.conf
> >
> > with almost no feedback.  The log file produces output that
> > implies that it has worked:
> > <extract>
> >       8 bduc-i32-1
> >       9 remote execd installation on host bduc-i32-1
> >      10 adminhost "bduc-i32-1" already exists
> >      11 bduc-i32-2
> >      12 remote execd installation on host bduc-i32-2
> >      13 adminhost "bduc-i32-2" already exists
> > </extract>
> >
> > but also produces output like this:
> >
> > <extract>
> >      47 Reading configuration from file /home/hmangala/sge_bducs.conf
> >      48 gethostbyaddr() took 20 seconds and returns success
> >      49
> >      50 do_ypcall: clnt_call: RPC: Timed out
> >      51 do_ypcall: clnt_call: RPC: Timed out
> >      52 gethostbyaddr() took 25 seconds and returns success
> >      53
> >      54 do_ypcall: clnt_call: RPC: Timed out
> >      55 do_ypcall: clnt_call: RPC: Timed out
> >      56 do_ypcall: clnt_call: RPC: Timed out
> >      57 do_ypcall: clnt_call: RPC: Timed out
> >      58 TERM environment variable not set.
> >      59 gethostbyaddr() took 30 seconds and returns success
> >      60
> >      61 gethostbyaddr() took 30 seconds and returns success
> >      62
> >      63 gethostbyaddr() took 30 seconds and returns success
> > </extract>
> >
> > but implies that it was successful in the end.
> >
> > However, there's no SGE rc startup script written to the
> > execution nodes and no sge_execd running on them afterwards.
> >
> >
> > **ISSUE**
> >
> > Another post suggested running qhost to check the connections
> > among hosts and that seems to have identified the problem:
> >
> > from the qmaster host (sched):
> > $ qhost
> > HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> > -------------------------------------------------------------------------------
> > global                  -               -     -       -       -       -       -
> >
> > from an execution host:
> > $ qhost
> > error: commlib error: access denied (client IP resolved to host
> > name "bduc-login.nacs.uci.edu". This is not identical to clients
> > host name "bduc-i32-2")
> > error: unable to contact qmaster using port 536 on
> > host "bduc-sched.nacs.uci.edu"
> >
> > For some reason the execution node is being resolved to the login
> > node, but using nslookup and the included gethost... utils, the
> > name & IP resolve correctly both ways on the login host, the
> > sched host, and the execution host.
> >
> > The ports 536/7 are entered in /etc/services on the sched
> > host, but not on the exec hosts - could that be a problem?  They
> > are defined as 536/7 in the config file.
> >
> >
> > Are there other suggestions to resolve this?  This appears to be
> > similar to the issue as raised in:
> > <http://gridengine.sunsource.net/issues/show_bug.cgi?id=1661>
> > and
> > <http://gridengine.sunsource.net/issues/show_bug.cgi?id=1358>
> > but this one involves hostnames in conflict, not IP# <->
> > hostnames.
> >
> > All references in the config file are by hostname, and the
> > sched /etc/hosts file refers to them all by name as well (no
> > naked IP #s are used).
>
> please have a look here:
>
> http://gridengine.sunsource.net/howto/multi_intrfcs.html
>
> you will need two entries for the login and scheduler node.
>
>
> You might also check (you are using NIS?), that the traffic from
> the login node to the scheduler node is also using the secondary
> interface with a rule in the routing like:
>
> external.scheduler.node.edu *   255.255.255.255 UH   0   0   0 eth1
>
> or similar.
>
> (Suppose the NIS is also running on the scheduler node, a password
> change from the login node wouldn't be possible otherwise [assuming
> NIS is only active on the private side].)
>
> -- Reuti
>
> > Suggestions?
> >
> > Best harry
> >
> > --
> > Harry Mangalam - Research Computing, NACS, E2148, Engineering
> > Gateway, UC Irvine 92697  949 824-0084(o), 949 285-4487(c)
> > ---
> > Why not work towards banning gloomy marriages?
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
>



-- 
Harry Mangalam - Research Computing, NACS, E2148, Engineering Gateway, 
UC Irvine 92697  949 824-0084(o), 949 285-4487(c)
---
Why not work towards banning gloomy marriages?





