[GE users] How to configure one execution host for two qmasters?

Colvin, Joshua jcolvin at sfwmd.gov
Wed Jul 11 12:29:36 BST 2007


Hi Dan,
Yes, I've reinstalled the master with no luck, and I also grepped
recursively to try to figure out where the software is getting these
port numbers from. I found out (see below) where the execution hosts
can get the port number, but (again, see below) it isn't consistent and
doesn't work very well. At this point I'm installing the latest Grid
Engine version to see if that behaves any differently. Thanks for the
help.
Josh
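
For reference, the lookup order Dan describes further down (environment
variable first, then the local /etc/services) can be sketched as a small
shell helper. This is only an illustration, not how the installer is
actually implemented: the function name sge_port_source is made up, and
the service names sge_qmaster and sge_execd are assumed to be the ones
used in /etc/services.

```shell
# Hypothetical helper: report which port would be resolved for a daemon,
# checking the environment variable first and then a services file.
# Usage: sge_port_source VAR_NAME SERVICE_NAME [SERVICES_FILE]
sge_port_source() {
    var_name=$1
    service=$2
    services_file=${3:-/etc/services}

    # 1. Environment variable (e.g. SGE_QMASTER_PORT), if set and non-empty.
    eval "value=\${$var_name:-}"
    if [ -n "$value" ]; then
        echo "$value env:$var_name"
        return 0
    fi

    # 2. First matching tcp entry in the services file.
    port=$(awk -v s="$service" \
        '$1 == s && $2 ~ /\/tcp$/ { split($2, a, "/"); print a[1]; exit }' \
        "$services_file" 2>/dev/null)
    if [ -n "$port" ]; then
        echo "$port file:$services_file"
        return 0
    fi

    # Neither source defines a port.
    echo "unset"
    return 1
}
```

Running "sge_port_source SGE_QMASTER_PORT sge_qmaster" and
"sge_port_source SGE_EXECD_PORT sge_execd" on the execution host shows
which of the two sources would supply a value there.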


-----Original Message-----
From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
Sent: Monday, July 09, 2007 4:00 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] How to configure one execution host for two
qmasters?

Josh,

Have you tried reinstalling the qmaster with the new port numbers set in
your environment?  The qmaster install writes the cell files, like the
settings.[c]sh scripts and the sgemaster and sgeexecd scripts.  The
installer may be getting the port number from one of those files.  You
could also try "grep 536 $SGE_ROOT/$SGE_CELL/common/*" to see if the
port is set somewhere there.  Regardless, though, I think you'll have to
reinstall the master.  The port numbers get encoded into some of the
paths, so reinstalling will be safer than editing out the old port
numbers.

Daniel
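
A slightly stricter variant of the grep suggested above might look like
the sketch below. It searches recursively and matches the port only as a
whole word, so 536 does not also match something like 5360. The function
name find_port_refs is made up for illustration, and the default
directory simply mirrors the $SGE_ROOT/$SGE_CELL/common path mentioned
above; "grep -r" assumes a grep that supports recursive search, such as
GNU grep.

```shell
# Hypothetical helper: list every file and line under a directory that
# mentions a given port number as a whole word.
# Usage: find_port_refs PORT [DIRECTORY]
find_port_refs() {
    port=$1
    dir=${2:-$SGE_ROOT/$SGE_CELL/common}
    # -r: recurse, -n: show line numbers, -w: whole-word match only
    grep -rnw -- "$port" "$dir"
}
```

For example, "find_port_refs 536" run with the cell's settings sourced
would print file:line:text for each hit, and return non-zero if the port
appears nowhere in that directory.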

Colvin, Joshua wrote:
> Hmmm, I'm just not seeing any settings for port 536 (what the
> install is trying to use). I even manually set SGE_QMASTER_PORT
> and SGE_EXECD_PORT, yet the install still uses port 536, even 
> though running the same command manually works fine:
>
> This is on the execution node:
> [root at dcluster28 sge]# grep -i sge /etc/services
> [root at dcluster28 sge]# set|grep -i sge
> LD_LIBRARY_PATH=/grid/sge/lib/lx24-amd64
> MANPATH=/grid/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/share/man:/usr/X11R6/man
> OLDPWD=/grid/sge
> PATH=/grid/sge/bin/lx24-amd64:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin
> PWD=/grid/sge
> SGE_CELL=default-2b
> SGE_EXECD_PORT=539
> SGE_QMASTER_PORT=538
> SGE_ROOT=/grid/sge
> .
> .
> .
> .
> Checking hostname resolving
> ---------------------------
>
> Cannot contact qmaster. The command failed:
>
>    ./bin/lx24-amd64/qconf -sh
>
> The error message was:
>
>    error: commlib error: can't connect to service (Connection refused)
> ERROR: unable to contact qmaster using port 536 on host "dcluster2b"
>
> You can fix the problem now or abort the installation procedure.
> The problem can be:
>
>    - the qmaster is not running
>    - the qmaster host is down
>    - an active firewall blocks your request
>
> Contact qmaster again (y/n) ('n' will abort) [y] >>
> [root at dcluster28 sge]# ./bin/lx24-amd64/qconf -sh
> dcluster28.dcluster.gov
> dcluster2b
> [root at dcluster28 sge]#
>
>
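
The transcript above shows qconf -sh succeeding once SGE_CELL,
SGE_QMASTER_PORT and SGE_EXECD_PORT are set in the shell. One way to
keep those settings from leaking into commands meant for the other cell
is to scope them to a single command, as in the sketch below. The
function name with_cell is made up, and whether the installer or the
daemons honor these variables in the same way is exactly what is
unresolved in this thread.

```shell
# Hypothetical wrapper: run one command with a given cell's settings.
# Usage: with_cell CELL QMASTER_PORT EXECD_PORT command [args...]
with_cell() {
    # The subshell keeps the assignments and exports from leaking into
    # the calling shell, so commands for the other cell are unaffected.
    (
        SGE_CELL=$1 SGE_QMASTER_PORT=$2 SGE_EXECD_PORT=$3
        export SGE_CELL SGE_QMASTER_PORT SGE_EXECD_PORT
        shift 3
        "$@"
    )
}
```

With the values from the transcript, that would be called as
"with_cell default-2b 538 539 ./bin/lx24-amd64/qconf -sh".
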
>
> I'm also not seeing the settings.sh script set any port values in 
> the environment:
>
> [root at dcluster28 ~]# set|grep SGE
> [root at dcluster28 ~]# . /grid/sge/default-2b/common/settings.sh
> [root at dcluster28 ~]# set|grep SGE
> SGE_CELL=default-2b
> SGE_ROOT=/grid/sge
> [root at dcluster28 ~]#
>
> Note /grid/sge is nfs mounted if that matters.
>
> I reinstalled both the qmaster and the execution node with no luck
> either; for some reason the execution node keeps trying to use 536
> (the other qmaster's port), not what's set in my environment. If I add
> the new ports to /etc/services on the execution node, the new
> sge_execd processes start up fine. However, stopping the other sge
> processes (the ones using port 536/537) then fails, complaining it
> can't reach the 'default' cell on that port value (which is correct).
>
> I had to add the ports to /etc/services, start the new services, then
> erase them from /etc/services so the other sge execd process will
> work. However, there should be an easier way.
>
>
> -----Original Message-----
> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
> Sent: Monday, July 09, 2007 2:38 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] How to configure one execution host for two
> qmasters?
>
> Josh,
>
> First, the execd port number used during the execd install must be the
> same as the execd port number used during the qmaster install.  Second,
> there are only two sources for the port numbers.  Either they come from
> the SGE_QMASTER_PORT and SGE_EXECD_PORT environment variables (set by
> the settings.[c]sh scripts), or they're set in /etc/services (grep sge
> /etc/services).  The former gets set either by having those variables
> defined when you do the install, or by telling the installer the port
> numbers you want to use when it asks (which is only an option if the
> installer can't find the port numbers in the env vars or
> /etc/services).  For /etc/services, note that each daemon will look at
> its localhost's /etc/services file.
>
> Daniel
>
> Colvin, Joshua wrote:
>   
>> Thanks Dan. Unfortunately I can install sge execd without being
>> prompted for port numbers (6.0u8), nothing but SGE_CELL and
>> SGE_ROOT are defined in my environment, and nothing is in
>> /etc/services for the execution node.
>>
>> However the execution node is running jobs fine for qmaster #1, just
>> not for qmaster #2 (fails silently). If I change the ports in the
>> qmaster's /etc/services file and reinstall the SW on the execution
>> node, the execution node can't talk to qmaster at all:
>>
>>    error: commlib error: can't connect to service (Connection refused)
>> ERROR: unable to contact qmaster using port 536 on host "dcluster2b"
>>
>> so I'm wondering where it gets these port numbers to try from?
>> "set|grep -i sge" returns nothing but SGE_CELL and SGE_ROOT. I've run
>> "grep -R port" in the SGE_ROOT directory with no luck.
>>
>>
>>
>> -----Original Message-----
>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
>> Sent: Monday, July 09, 2007 1:45 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] How to configure one execution host for two
>> qmasters?
>>
>> Josh,
>>
>> The ports are defined during installation.  Before running the
>> install, you can set the SGE_QMASTER_PORT and/or SGE_EXECD_PORT to
>> force the installer to use those port numbers.  Otherwise it will take
>> the port numbers defined in /etc/services, or it will ask you for port
>> numbers if none are defined in /etc/services.
>>
>> Daniel
>>
>> Colvin, Joshua wrote:
>>> Hello all,
>>>
>>> I am replacing some servers and wanted to install a new parallel
>>> cluster alongside the existing one. The qmasters will be different,
>>> but the execution nodes (for now) will be the same. I see everything
>>> I'd expect from both qmasters (qstat -f shows all the nodes I've
>>> configured for both), and I can submit jobs fine to the first cluster
>>> I start; however, the second sge execd process refuses to start on
>>> any execution node. I see no error msgs anywhere (stdout, spool,
>>> /var/log/messages), but I imagine it can't bind to an already-used
>>> port. However, I don't see where to define the port for sge execd
>>> (not in /etc/init.d, etc...).
>>>
>>> Is there any trick to getting one execution host to be a member of
>>> multiple clusters?
>>>
>>> Thanks!
>>>
>>> Josh
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>



