[GE users] "commlib error: access denied" with 6.2 install_execd

Reuti reuti at staff.uni-marburg.de
Tue Oct 28 23:13:06 GMT 2008


On 28.10.2008 at 23:53, Harry Mangalam wrote:

> Reuti's advice was spot on.  There were at least 4 problems.
>
> 1 - the 2 interfaces Reuti mentioned, addressed by making explicit
> entries in the host file used to generate the YP maps
> 2 - Using YP in the 1st place - I didn't understand how it was
> configured.
>
> 3 -  Unknown to me, the admins set up the cluster execution nodes to
> use kerberos to allow direct login to them.  The kerberos makes
> external nameservice calls and since the calls are tunneled thru the
> login node, they all resolved to the issuing node, hence the strange
> message.  When we turned off the kerberos service, SGE started
> behaving (qhost gave identical responses on the master and exec
> nodes).
>
> 4 - I'm not sure why, but the nodes need explicit local net entries
> in /etc/hosts for the Q master.   It seems that they shouldn't, as
> they use YP maps, but nodes that /don't/ have it /don't/ work and the
> few nodes that /do/ have it /do/ work.  This may be a holdover from
> various other problems.  We'll see if it maintains over a reboot.

Harry,

it's good to know that it was helpful :-)

There are two points:

a) What is in /etc/nsswitch.conf on the nodes? Is "nis" mentioned there
for the hosts entry? Is /etc/yp.conf using an address or a name?

b) Is NIS started before SGE during boot?
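
Both checks could be scripted roughly as below. This is only a minimal
sketch, assuming a glibc-style /etc/nsswitch.conf and SysV-style rc
directories such as /etc/rc3.d. The function names are made up for
illustration, and the init script names (ypbind, sgeexecd) are
assumptions that may differ on your nodes:

```shell
# Sketch only: function names and init-script names are illustrative.

# (a) Print the lookup sources on the "hosts:" line of an nsswitch-style file.
# $1 = file to read (defaults to /etc/nsswitch.conf)
hosts_sources() {
    awk '{ sub(/#.*/, "") }
         $1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print }' \
        "${1:-/etc/nsswitch.conf}"
}

# Succeed if "nis" is one of those sources.
uses_nis_for_hosts() {
    hosts_sources "$1" | tr ' ' '\n' | grep -qx nis
}

# (b) Print the two-digit start number of a service in a SysV rc directory,
# e.g. S27ypbind -> 27. Fails if no start link matches.
# $1 = rc directory, $2 = service name prefix
start_order() {
    for link in "$1"/S[0-9][0-9]"$2"*; do
        [ -e "$link" ] || [ -L "$link" ] || continue
        basename "$link" | sed 's/^S\([0-9][0-9]\).*/\1/'
        return 0
    done
    return 1
}

# Succeed if service $2 starts before service $3 in rc directory $1.
# Equal start numbers are treated as "not before".
starts_before() {
    first=$(start_order "$1" "$2") || return 1
    second=$(start_order "$1" "$3") || return 1
    [ "$first" -lt "$second" ]
}

# Example use on a node (paths and names are assumptions):
#   hosts_sources
#   uses_nis_for_hosts && echo "hosts are looked up via NIS"
#   starts_before /etc/rc3.d ypbind sgeexecd || echo "NIS may start after SGE"
```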

-- Reuti


> Thanks, Reuti!
>
>
> On Tuesday 28 October 2008, Reuti wrote:
>> Hi Harry,
>>
>> On 28.10.2008 at 17:58, Harry Mangalam wrote:
>>> Hi All,
>>>
>>> Background follows immediately, real issue at **ISSUE** below
>>>
>>> I'm an SGE newbie, and have installed sge6.2 on a test cluster
>>> that includes a login node & scheduler node, each with an
>>> external i/f and an internal interface (192.168.0.xxx) to the
>>> compute nodes, which only have private i/f's.
>>>
>>> root can ssh to all nodes without passwords and the SGE_ROOT is
>>> mounted by all nodes.
>>>
>>> I've done both interactive installs and config-file-based
>>> installs and the qmaster installs fine on the sched node with the
>>> command:
>>>
>>>  cd $SGE_ROOT; sudo ./inst_sge -m -auto /path/to/sge.conf
>>>
>>>
>>> However, the execution nodes fail to install with the
>>> corresponding command:
>>>
>>>  cd $SGE_ROOT; sudo ./inst_sge -x -auto /path/to/sge.conf
>>>
>>> with almost no feedback.  The log file produces output that
>>> implies that it has worked:
>>> <extract>
>>>       8 bduc-i32-1
>>>       9 remote execd installation on host bduc-i32-1
>>>      10 adminhost "bduc-i32-1" already exists
>>>      11 bduc-i32-2
>>>      12 remote execd installation on host bduc-i32-2
>>>      13 adminhost "bduc-i32-2" already exists
>>> </extract>
>>>
>>> but also produces output like this:
>>>
>>> <extract>
>>>      47 Reading configuration from file /home/hmangala/sge_bducs.conf
>>>      48 gethostbyaddr() took 20 seconds and returns success
>>>      49
>>>      50 do_ypcall: clnt_call: RPC: Timed out
>>>      51 do_ypcall: clnt_call: RPC: Timed out
>>>      52 gethostbyaddr() took 25 seconds and returns success
>>>      53
>>>      54 do_ypcall: clnt_call: RPC: Timed out
>>>      55 do_ypcall: clnt_call: RPC: Timed out
>>>      56 do_ypcall: clnt_call: RPC: Timed out
>>>      57 do_ypcall: clnt_call: RPC: Timed out
>>>      58 TERM environment variable not set.
>>>      59 gethostbyaddr() took 30 seconds and returns success
>>>      60
>>>      61 gethostbyaddr() took 30 seconds and returns success
>>>      62
>>>      63 gethostbyaddr() took 30 seconds and returns success
>>> </extract>
>>>
>>> but implies that it was successful in the end.
>>>
>>> However, there's no SGE rc startup script written to the
>>> execution nodes and no sge_execd running on them afterwards.
>>>
>>>
>>> **ISSUE**
>>>
>>> Another post suggested running qhost to check the connections
>>> among hosts and that seems to have identified the problem:
>>>
>>> from the qmaster host (sched):
>>> $ qhost
>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global                  -               -     -       -       -       -       -
>>>
>>> from an execution host:
>>> $ qhost
>>> error: commlib error: access denied (client IP resolved to host
>>> name "bduc-login.nacs.uci.edu". This is not identical to clients
>>> host name "bduc-i32-2")
>>> error: unable to contact qmaster using port 536 on
>>> host "bduc-sched.nacs.uci.edu"
>>>
>>> For some reason the execution node is being resolved to the login
>>> node, but using nslookup and the included gethost... utils, the
>>> name & IP resolve correctly both ways on the login host, the
>>> sched host, and the execution host.
>>>
>>> The ports 536/7 are entered in the /etc/services on the sched
>>> host, but not on the exec hosts - is that a problem?  They are
>>> defined as 536/7 in the config file.
>>>
>>>
>>> Are there other suggestions to resolve this?  This appears to be
>>> similar to the issue raised in:
>>> <http://gridengine.sunsource.net/issues/show_bug.cgi?id=1661>
>>> and
>>> <http://gridengine.sunsource.net/issues/show_bug.cgi?id=1358>
>>> but this one involves hostnames in conflict, not IP# <->
>>> hostnames.
>>>
>>> but all references in the config file are by hostname (and the
>>> sched /etc/hosts file refers to them all by name as well; no
>>> naked IP #s are used).
>>
>> please have a look here:
>>
>> http://gridengine.sunsource.net/howto/multi_intrfcs.html
>>
>> you will need two entries for the login and scheduler node.
>>
>>
>> You might also check (you are using NIS?), that the traffic from
>> the login node to the scheduler node is also using the secondary
>> interface with a rule in the routing like:
>>
>> external.scheduler.node.edu *               255.255.255.255 UH    0      0        0 eth1
>>
>> or similar.
>>
>> (Suppose the NIS is also running on the scheduler node, a password
>> change from the login node wouldn't be possible otherwise [assuming
>> NIS is only active on the private side].)
>>
>> -- Reuti
>>
>>> Suggestions?
>>>
>>> Best harry
>>>
>>> --
>>> Harry Mangalam - Research Computing, NACS, E2148, Engineering
>>> Gateway, UC Irvine 92697  949 824-0084(o), 949 285-4487(c)
>>> ---
>>> Why not work towards banning gloomy marriages?
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net





