[GE users] How to configure one execution host for two qmasters?

Colvin, Joshua jcolvin at sfwmd.gov
Mon Jul 9 20:03:03 BST 2007


Hmmm, I'm just not seeing any settings for port 536 (what the
install is trying to use). I even manually set SGE_QMASTER_PORT
and SGE_EXECD_PORT, yet the install still uses port 536, even 
though running the same command manually works fine:

This is on the execution node:
[root@dcluster28 sge]# grep -i sge /etc/services
[root@dcluster28 sge]# set|grep -i sge
LD_LIBRARY_PATH=/grid/sge/lib/lx24-amd64
MANPATH=/grid/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/share/man:/usr/X11R6/man
OLDPWD=/grid/sge
PATH=/grid/sge/bin/lx24-amd64:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin
PWD=/grid/sge
SGE_CELL=default-2b
SGE_EXECD_PORT=539
SGE_QMASTER_PORT=538
SGE_ROOT=/grid/sge
.
.
.
.
Checking hostname resolving
---------------------------

Cannot contact qmaster. The command failed:

   ./bin/lx24-amd64/qconf -sh

The error message was:

   error: commlib error: can't connect to service (Connection refused)
ERROR: unable to contact qmaster using port 536 on host "dcluster2b"

You can fix the problem now or abort the installation procedure.
The problem can be:

   - the qmaster is not running
   - the qmaster host is down
   - an active firewall blocks your request

Contact qmaster again (y/n) ('n' will abort) [y] >>
[root@dcluster28 sge]# ./bin/lx24-amd64/qconf -sh
dcluster28.dcluster.gov
dcluster2b
[root@dcluster28 sge]#



I'm also not seeing the settings.sh script set any port values in 
the environment:

[root@dcluster28 ~]# set|grep SGE
[root@dcluster28 ~]# . /grid/sge/default-2b/common/settings.sh
[root@dcluster28 ~]# set|grep SGE
SGE_CELL=default-2b
SGE_ROOT=/grid/sge
[root@dcluster28 ~]#

Note that /grid/sge is NFS-mounted, if that matters.

I reinstalled both the qmaster and the execution node, with no luck
either. For some reason the execution node keeps trying to use 536 (the
other qmaster's port), not what's set in my environment. If I add the new
ports to /etc/services on the execution node, the new sge_execd processes
start up fine. However, stopping the other SGE processes (the ones using
ports 536/537) then fails, complaining that it can't reach the 'default'
cell on that port value (which is correct).

I had to add the ports to /etc/services, start the new services, then
erase them from /etc/services so that the other sge execd process would
work. However, there should be an easier way.
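
One possibly easier approach is to wrap each cell's commands in their own
environment instead of editing /etc/services. This is only a sketch, and
it assumes that the client commands and daemons honor SGE_QMASTER_PORT
and SGE_EXECD_PORT from their environment, as the manual qconf run above
suggests. The with_cell helper is hypothetical and not part of SGE; the
cell name, ports, and /grid/sge default are the ones from this thread.

```shell
# Hypothetical helper (not part of SGE): run one command with the
# environment of a specific cell, without changing the SGE_* variables
# of the calling shell.
# Usage: with_cell CELL QMASTER_PORT EXECD_PORT COMMAND [ARGS...]
with_cell() {
    cell=$1
    qmaster_port=$2
    execd_port=$3
    shift 3
    # env hands the variables to the child process only.
    env SGE_ROOT="${SGE_ROOT:-/grid/sge}" \
        SGE_CELL="$cell" \
        SGE_QMASTER_PORT="$qmaster_port" \
        SGE_EXECD_PORT="$execd_port" \
        "$@"
}
```

For example, "with_cell default-2b 538 539 ./bin/lx24-amd64/qconf -sh"
would query the second qmaster while the shell itself stays untouched.
Whether the execd startup script respects the same variables is not
confirmed here and would need to be checked.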


-----Original Message-----
From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
Sent: Monday, July 09, 2007 2:38 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] How to configure one execution host for two qmasters?

Josh,

First, the execd port number used during the execd install must be the 
same as the execd port number used during the qmaster install.  Second, 
there are only two sources for the port numbers.  Either they come from 
the SGE_QMASTER_PORT and SGE_EXECD_PORT environment variables (set by 
the settings.[c]sh scripts), or they're set in /etc/services (grep sge 
/etc/services).  The former gets set either by having those variables 
defined when you do the install, or by telling the installer the port 
numbers you want to use when it asks (which is only an option if the 
installer can't find the port numbers in the env vars or 
/etc/services).  For /etc/services, note that each daemon will look at 
its localhost's /etc/services file.
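
The lookup order described above can be sketched as a small shell
function for checking which source would supply a port on a given host.
This is an illustration, not the installer's actual code. The function
name resolve_sge_port is hypothetical, and the service names sge_qmaster
and sge_execd are assumed to be the entries used in /etc/services.

```shell
# Hypothetical helper: report the port and where it came from.
# Usage: resolve_sge_port VAR_NAME SERVICE_NAME [SERVICES_FILE]
resolve_sge_port() {
    var_name=$1
    service=$2
    services_file=${3:-/etc/services}

    # 1. An environment variable, if set and non-empty, is used first.
    eval "value=\${$var_name:-}"
    if [ -n "$value" ]; then
        echo "$value env"
        return 0
    fi

    # 2. Otherwise look for the service name in the local services file.
    #    Entries look like "sge_qmaster 536/tcp"; keep only the number.
    port=$(awk -v svc="$service" \
        '$1 == svc { split($2, a, "/"); print a[1]; exit }' \
        "$services_file" 2>/dev/null)
    if [ -n "$port" ]; then
        echo "$port services"
        return 0
    fi

    # 3. Neither source defines the port.
    echo "unset"
    return 1
}
```

Running "resolve_sge_port SGE_QMASTER_PORT sge_qmaster" and
"resolve_sge_port SGE_EXECD_PORT sge_execd" on both the qmaster host and
the execution host shows whether the two sides agree.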

Daniel

Colvin, Joshua wrote:
> Thanks Dan. Unfortunately I can install sge execd without being
> prompted for port numbers (6.0u8), nothing but SGE_CELL and
> SGE_ROOT are defined in my environment, and nothing is in
> /etc/services for the execution node.
>
> However the execution node is running jobs fine for qmaster #1, just
> not for qmaster #2 (fails silently). If I change the ports in the 
> qmaster's /etc/services file and reinstall the SW on the execution
> node, the execution node can't talk to qmaster at all:
>
>    error: commlib error: can't connect to service (Connection refused)
> ERROR: unable to contact qmaster using port 536 on host "dcluster2b"
>
> so I'm wondering where it gets these port numbers from. Running
> "set|grep -i sge" returns nothing but SGE_CELL and SGE_ROOT. I've also
> run "grep -R port" in the SGE_ROOT directory, with no luck.
>
>
>
> -----Original Message-----
> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
> Sent: Monday, July 09, 2007 1:45 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] How to configure one execution host for two qmasters?
>
> Josh,
>
> The ports are defined during installation.  Before running the install,
> you can set the SGE_QMASTER_PORT and/or SGE_EXECD_PORT to force the
> installer to use those port numbers.  Otherwise it will take the port
> numbers defined in /etc/services, or it will ask you for port numbers
> if none are defined in /etc/services.
>
> Daniel
>
> Colvin, Joshua wrote:
>   
>> Hello all,
>>
>> I am replacing some servers and wanted to install a new parallel
>> cluster alongside the existing one. The qmasters will be different,
>> but the execution nodes (for now) will be the same. I see everything
>> I'd expect from both qmasters (qstat -f shows all the nodes I've
>> configured for both), and I can submit jobs fine to the first cluster
>> I start, however the second sge execd process refuses to start on any
>> execution node. I see no error msgs anywhere (stdout, spool,
>> /var/log/messages), but I imagine it can't bind to an already-used
>> port, however I don't see where to define the port for sge execd (not
>> in /etc/init.d, etc...).
>>
>> Is there any trick to getting one execution host to be a member of
>> multiple clusters?
>>
>> Thanks!
>>
>> Josh
>>
>>     
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>




