[GE users] "commlib error: access denied" with 6.2 install_execd

Reuti reuti at staff.uni-marburg.de
Tue Oct 28 17:21:18 GMT 2008


Hi Harry,

On 28.10.2008 at 17:58, Harry Mangalam wrote:

> Hi All,
>
> Background follows immediately, real issue at **ISSUE** below
>
> I'm an SGE newbie, and have installed sge6.2 on a test cluster that
> includes a login node & scheduler node, each with an external i/f
> and an internal interface (192.168.0.xxx) to the compute nodes, which
> only have private i/f's.
>
> root can ssh to all nodes without passwords and the SGE_ROOT is
> mounted by all nodes.
>
> I've done both interactive installs and config-file-based installs and
> the qmaster installs fine on the sched node with the command:
>
>  cd $SGE_ROOT; sudo ./inst_sge -m -auto /path/to/sge.conf
>
>
> However, the execution nodes fail to install with the corresponding
> command:
>
>  cd $SGE_ROOT; sudo ./inst_sge -x -auto /path/to/sge.conf
>
> with almost no feedback.  The log file produces output that implies
> that it has worked:
> <extract>
>       8 bduc-i32-1
>       9 remote execd installation on host bduc-i32-1
>      10 adminhost "bduc-i32-1" already exists
>      11 bduc-i32-2
>      12 remote execd installation on host bduc-i32-2
>      13 adminhost "bduc-i32-2" already exists
> </extract>
>
> but also produces output like this:
>
> <extract>
>      47 Reading configuration from file /home/hmangala/sge_bducs.conf
>      48 gethostbyaddr() took 20 seconds and returns success
>      49
>      50 do_ypcall: clnt_call: RPC: Timed out
>      51 do_ypcall: clnt_call: RPC: Timed out
>      52 gethostbyaddr() took 25 seconds and returns success
>      53
>      54 do_ypcall: clnt_call: RPC: Timed out
>      55 do_ypcall: clnt_call: RPC: Timed out
>      56 do_ypcall: clnt_call: RPC: Timed out
>      57 do_ypcall: clnt_call: RPC: Timed out
>      58 TERM environment variable not set.
>      59 gethostbyaddr() took 30 seconds and returns success
>      60
>      61 gethostbyaddr() took 30 seconds and returns success
>      62
>      63 gethostbyaddr() took 30 seconds and returns success
> </extract>
>
> but implies that it was successful in the end.
>
> However, there's no SGE rc startup script written to the execution
> nodes and no sge_execd running on them afterwards.
>
>
> **ISSUE**
>
> Another post suggested running qhost to check the connections among
> hosts and that seems to have identified the problem:
>
> from the qmaster host (sched):
> $ qhost
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
>
> from an execution host:
> $ qhost
> error: commlib error: access denied (client IP resolved to host
> name "bduc-login.nacs.uci.edu". This is not identical to clients host
> name "bduc-i32-2")
> error: unable to contact qmaster using port 536 on
> host "bduc-sched.nacs.uci.edu"
>
> For some reason the execution node is being resolved to the login
> node, but using nslookup and the included gethost... utils, the name
> & IP resolve correctly both ways on the login host, the sched host,
> and the execution host.
>
> The ports 536/7 are entered in the /etc/services on the sched host,
> but not on the exec hosts - is that a problem?  They are defined
> as 536/7 in the config file.
>
>
> Are there other suggestions to resolve this?  This appears to be
> similar to the issue as raised in:
> <http://gridengine.sunsource.net/issues/show_bug.cgi?id=1661>
> and
> <http://gridengine.sunsource.net/issues/show_bug.cgi?id=1358>
> but this one involves hostnames in conflict, not IP# <-> hostnames.
>
> All references in the config file are by hostname, and the
> sched /etc/hosts file refers to them all by name as well (no
> naked IP #s are used).

Please have a look here:

http://gridengine.sunsource.net/howto/multi_intrfcs.html

You will need two entries for the login and scheduler nodes.
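
The "access denied" message quoted above describes a comparison between the name the client reports and the name that the client's source IP resolves to. A minimal Python sketch of that kind of check follows. It is only an illustration, not the commlib implementation: the function names, the `ignore_domain` switch, and the sample address in the comment are assumptions made for this example.

```python
import socket


def short_name(name):
    """Return the host part before the first dot, lower-cased."""
    return name.split(".", 1)[0].lower()


def names_match(claimed_name, peer_ip, reverse=socket.gethostbyaddr,
                ignore_domain=True):
    """Reverse-resolve peer_ip and compare the result with claimed_name.

    Returns (matches, resolved_name). `reverse` is injectable so the check
    can be exercised without touching DNS, NIS or /etc/hosts.
    `ignore_domain` is an assumption of this sketch: when True, only the
    short host names are compared.
    """
    resolved_name = reverse(peer_ip)[0]
    if ignore_domain:
        matches = short_name(resolved_name) == short_name(claimed_name)
    else:
        matches = resolved_name.lower() == claimed_name.lower()
    return matches, resolved_name


# Example call, to be run on the qmaster host (the address is hypothetical):
#   names_match("bduc-i32-2", "192.168.0.12")
```

Note that the address to test is the one the qmaster actually sees as the source of the connection, which depends on routing and may differ from the address the execution host's name resolves to.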


You might also check (are you using NIS?) that the traffic from the
login node to the scheduler node also uses the secondary
interface, with a host route in the routing table like:

external.scheduler.node.edu *               255.255.255.255 UH    0      0        0 eth1

or similar.

(If NIS is also running on the scheduler node, a password
change from the login node wouldn't be possible otherwise [assuming
NIS is only active on the private side].)
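
To see whether such a host route is present, the text of the routing table can be scanned for an entry with the "H" flag. The Python sketch below assumes the eight-column layout shown in the example above (Destination, Gateway, Genmask, Flags, Metric, Ref, Use, Iface); the function name is hypothetical, and the routing table text has to be supplied by the caller.

```python
def host_route_iface(route_output, destination):
    """Return the interface of the host route for `destination`, or None.

    `route_output` is the text of a routing table in the eight-column
    layout: Destination Gateway Genmask Flags Metric Ref Use Iface.
    Only entries whose Flags column contains "H" (host route) are accepted.
    """
    for line in route_output.splitlines():
        fields = line.split()
        if len(fields) != 8:
            # Skip titles, blank lines and anything in another layout.
            continue
        dest, _gateway, _genmask, flags, *_counters, iface = fields
        if dest == destination and "H" in flags:
            return iface
    return None
```

A return value of None means no host route for that destination was found in the given text. Long host names may be shortened in the Destination column, so comparing numeric addresses is likely to be more reliable than comparing names.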

-- Reuti


>
> Suggestions?
>
> Best harry
>
> -- 
> Harry Mangalam - Research Computing, NACS, E2148, Engineering Gateway,
> UC Irvine 92697  949 824-0084(o), 949 285-4487(c)
> ---
> Why not work towards banning gloomy marriages?
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>

