[GE users] SGE 5.3p6 broken since IP/subnet migration - need help!

Chris Dagdigian dag at sonsorol.org
Mon Jun 25 21:02:33 BST 2007


Hello,

Seems like 2 problems ...

The sge_commd/tcp error is caused by a missing entry in /etc/services  
for sge_commd or perhaps an accidentally unset $SGE_COMMD_PORT  
environment variable.

Everything else after that does not matter. You need to sort out the  
TCP port and services issue before anything else has a hope of  
starting up properly.
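
A quick way to see which of the two is missing is a small script along these lines. This is only a sketch: the function name find_commd_port is made up for illustration, and checking the environment variable before /etc/services is my assumption, not necessarily the order SGE itself uses.

```python
import os
import socket


def find_commd_port(service="sge_commd", proto="tcp",
                    env_var="SGE_COMMD_PORT", environ=None):
    """Return (source, port) for the commd port, or (None, None) if unset.

    Hypothetical helper: it looks at the environment variable first and
    then at the services database. That order is an assumption.
    """
    environ = os.environ if environ is None else environ

    value = environ.get(env_var)
    if value:
        try:
            return ("env", int(value))
        except ValueError:
            pass  # variable is set but is not a number; fall through

    try:
        # Consults the system services database (normally /etc/services).
        return ("services", socket.getservbyname(service, proto))
    except OSError:
        return (None, None)


if __name__ == "__main__":
    source, port = find_commd_port()
    if source is None:
        print("sge_commd/tcp not found in the environment or in /etc/services")
    else:
        print(f"sge_commd port {port} found via {source}")
```

If this reports nothing found when run as the same user and in the same shell that starts the daemons, the "can't resolve service" error is expected.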

For hostname and resolution issues, the best tools to use are the
actual SGE binaries in your utilbin directory:

Example:
> [root@dcore-amd sge-6s2u1]# /opt/sge-6s2u1/utilbin/lx26-amd64/gethostname
> Hostname: dcore-amd.sonsorol.net
> Aliases:  dcore-amd
> Host Address(es): 66.92.70.152
>
> [root@dcore-amd sge-6s2u1]# /opt/sge-6s2u1/utilbin/lx26-amd64/gethostbyaddr 66.92.70.152
> Hostname: dcore-amd.sonsorol.net
> Aliases:  dcore-amd
> Host Address(es): 66.92.70.152
> [root@dcore-amd sge-6s2u1]#
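
If you want a second opinion from outside SGE, the same forward/reverse round trip can be sketched with the system resolver. The function name check_round_trip is hypothetical, and the system resolver may not see exactly what SGE sees, so treat this as a complement to the utilbin tools rather than a replacement.

```python
import socket


def check_round_trip(name,
                     forward=socket.gethostbyname_ex,
                     reverse=socket.gethostbyaddr):
    """Return a list of problems found resolving name forward and back.

    An empty list means every address for the name maps back to the same
    canonical hostname. The forward/reverse parameters exist only so the
    lookups can be swapped out (for example in tests).
    """
    try:
        canonical, _aliases, addresses = forward(name)
    except OSError as exc:  # gaierror and herror are OSError subclasses
        return [f"forward lookup of {name} failed: {exc}"]

    problems = []
    for addr in addresses:
        try:
            back, _back_aliases, _back_addrs = reverse(addr)
        except OSError as exc:
            problems.append(f"reverse lookup of {addr} failed: {exc}")
            continue
        if back.lower() != canonical.lower():
            problems.append(f"{addr} maps back to {back}, not {canonical}")
    return problems


if __name__ == "__main__":
    # Check this machine's own name, as the gethostname utility does.
    for line in check_round_trip(socket.gethostname()) or ["round trip OK"]:
        print(line)
```

After an IP migration, a stale reverse (PTR) record or an old /etc/hosts line would typically show up here as a mismatch.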

If you fix your sge_commd/tcp error it may start -- I've always
personally found that SGE will honor entries in the /etc/hosts file.

-Chris





On Jun 25, 2007, at 3:57 PM, Richard Hobbs wrote:

> Hello,
>
> We have recently migrated our network from a 192.168.3.0/255.255.255.0
> network to a 192.168.128.0/255.255.128.0 network, and since doing so,
> our qmaster will not start.
>
> We keep getting the following:
>
> ======================================================================
> [root@stg2 sge]# /etc/init.d/rcsge start
>    starting sge_qmaster
> critical error: can't check for running qmaster: can't resolve service
> "sge_commd/tcp"
>    starting sge_schedd
> error: can't resolve hostname "stg2.domain.co.uk"
> error: can't get configuration from qmaster -- backgrounding
>    starting sge_execd
> critical error: can't enroll to commd: CANT GET SERVICE
> [root@stg2 sge]#
> ======================================================================
>
> Does anyone know what is causing this?
>
> I have even tried a global find and replace of the old IP address range
> for the new IP address range, but it still doesn't start up.
>
> I'm getting desperate now, and have no ideas left, so any suggestions
> are gratefully received! :-)
>
> Just for the record, the same user on the same machine in the same
> terminal *can* resolve stg2.domain.co.uk, as below:
>
> ======================================================================
> [root@stg2 sge]# host stg2.crl.toshiba.co.uk
> stg2.domain.co.uk has address 192.168.144.2
> [root@stg2 sge]#
> ======================================================================
>
> And yes - I've also tried adding stg2.domain.co.uk to /etc/hosts, but
> the qmaster just will not start.
>
> Please help! :-)
>
> Thanks in advance,
> Richard.
>
> -- 
> Richard Hobbs (Systems Administrator)
> Toshiba Research Europe Ltd. - Speech Technology Group
> Web: http://www.toshiba-europe.com/research/
> Email: richard.hobbs at crl.toshiba.co.uk
> Tel: +44 1223 376964        Mobile: +44 7811 803377
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

--
Chris Dagdigian  <dag at sonsorol.org>
Current coordinates: Boston-area, USA
GPS: http://bioteam.net/dagbin/gps?42.385693+N+71.115535+W


