[GE users] Errorno=111 when doing qstat

Chris Dagdigian dag at sonsorol.org
Thu Oct 28 23:26:04 BST 2004


Jeffrey,

Grid Engine is very sensitive to hostname resolution and DNS issues. A 
quick workaround that may get you past the unable-to-connect errors is 
to construct an SGE host_aliases file that will point your daemons to 
the IP address of your private Warewulf network.

127.0.0.1 is the local loopback address and should probably not be 
associated with an eth device. You may just want to give your network 
card an actual IP address from 192.168.x or a different 10.x subnet. 
That alone may be causing your problems.
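
As a rough illustration of the point about 127.0.0.1, the sketch below
uses Python's standard ipaddress module to tell loopback addresses apart
from private-range ones. The function name classify() is made up for
this example and is not part of Grid Engine or Warewulf.

```python
import ipaddress


def classify(address):
    """Return 'loopback', 'private' or 'public' for an IPv4/IPv6 address string."""
    ip = ipaddress.ip_address(address)
    # Check loopback first: Python also reports 127.0.0.0/8 as private.
    if ip.is_loopback:
        return "loopback"
    if ip.is_private:
        return "private"
    return "public"


# Addresses taken from this thread, plus one example from 192.168.x.
for candidate in ("127.0.0.1", "10.0.0.253", "192.168.1.10"):
    print(candidate, "->", classify(candidate))
```

An address that comes back as "loopback" is one that other machines
cannot use to reach the host.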

A simpler option may be to just shut eth0 off since you are not using it 
at all. Just use the device attached to the network with your nodes on it.

On to more productive things...

First verify the hostname that Grid Engine thinks it should be 
connecting to. This is the file "act_qmaster" and it will be located at 
  $SGE_ROOT/$SGE_CELL/common/act_qmaster. I bet it says localhost 
despite your intention for it to say master-admin.
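
If you would rather script that check, here is a minimal sketch in
Python. It assumes SGE_ROOT is set in the environment and falls back to
a cell name of "default" when SGE_CELL is unset (an assumption; adjust
it for your install). The helper names are invented for this example.

```python
import os
import socket


def act_qmaster_path(sge_root, sge_cell="default"):
    """Build $SGE_ROOT/$SGE_CELL/common/act_qmaster."""
    return os.path.join(sge_root, sge_cell, "common", "act_qmaster")


def read_act_qmaster(sge_root, sge_cell="default"):
    """Return the hostname stored in act_qmaster, without whitespace."""
    with open(act_qmaster_path(sge_root, sge_cell)) as fh:
        return fh.read().strip()


def report():
    """Print the qmaster name and the address it resolves to on this host."""
    root = os.environ["SGE_ROOT"]                 # raises KeyError if unset
    cell = os.environ.get("SGE_CELL", "default")  # assumed fallback
    master = read_act_qmaster(root, cell)
    print("act_qmaster names:", master)
    try:
        print("resolves to:", socket.gethostbyname(master))
    except socket.gaierror as exc:
        print("does not resolve:", exc)
```

Calling report() on the master node should show whether the file says
localhost or master-admin, and which IP address that name maps to.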

When Grid Engine starts up, it reads that file to learn the hostname it 
should connect to. It then looks for a file called "host_aliases" in the 
same directory. You can use this alias file to force connections to go 
to your private NIC with address 10.0.0.253.

Whatever hostname is listed in act_qmaster should be put into a 
host_aliases file, followed by the 'alias', which is master-admin or 
10.0.0.253.

I'm betting that SGE is starting up, learning its hostname as 
'localhost' and then bombing out.
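
For reference, error number 111 on Linux is ECONNREFUSED: nothing was
listening at the address and port the client tried. The sketch below is
one way to see the same error outside of qstat. The probe() and
describe() helpers are invented for this example, and the port is
deliberately a parameter because the qmaster port depends on how your
installation was configured.

```python
import errno
import os
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connect; return 0 on success or the errno value on failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns an error code instead of raising for
        # ordinary connection failures such as ECONNREFUSED.
        return sock.connect_ex((host, port))


def describe(code):
    """Turn an errno value into a readable string such as 'ECONNREFUSED (...)'."""
    if code == 0:
        return "connected"
    name = errno.errorcode.get(code, "unknown")
    return "{} ({})".format(name, os.strerror(code))
```

For example, describe(probe("master-admin", qmaster_port)) with your
own qmaster port filled in shows whether the private address accepts
connections, and the same call with "localhost" shows what qstat is
running into.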

A host_aliases file that may work for you could be something like:

   localhost master-admin


Of course master-admin needs to be defined in your /etc/hosts file for 
this to even have a chance of working.
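
A quick way to confirm the /etc/hosts side is sketched below. It only
parses hosts-file text (address first, then names, with # comments) and
does not touch Grid Engine; the helper names are invented for this
example. The field order inside host_aliases itself is worth
double-checking against the host_aliases man page for your SGE version.

```python
def lookup_in_hosts(hosts_text, name):
    """Return the first address whose hosts-file entry lists `name`, else None."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if name in names:                     # exact, case-sensitive match
            return address
    return None


def read_hosts(path="/etc/hosts"):
    """Read the hosts file as text."""
    with open(path) as fh:
        return fh.read()
```

On the master node, lookup_in_hosts(read_hosts(), "master-admin")
should come back as 10.0.0.253 if the entry is in place, and None if it
is missing.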

--Chris




Jeffrey B. Layton wrote:

> Hello,
> 
>   I had posted this email earlier and the response I got didn't
> help and unfortunately I had to go out of town so I couldn't
> follow-up. Just to recap, I'm running SGE 6.0p1 on a small
> Warewulf cluster running Linux. The  master node has two
> network connections: eth0 to the  outside world as 127.0.0.1
> with a host name of localhost; eth1 is only known to the cluster
> as 10.0.0.253 with a hostname of master-admin.  (the cluster is
> not on the net, so I haven't configured eth0 as anything but
> localhost). I configured SGE to use master-admin as the host
> name. When I try 'qstat' it  tells me that it can't connect to
> localhost and gives me  an error number of 111. I looked
> through the manual but didn't find anything and I tried to
> follow the directions on the SGE website for having two
> network cards in the master, but the instructions are Sun
> specific. I was wondering if someone could help me?
> (BTW, I used the binaries  on the SGE website for the 2.4
> kernel, glibc 2.2).
> 
> 
> TIA!
> 
> Jeff
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

-- 
Chris Dagdigian, <dag at sonsorol.org>
BioTeam  - Independent life science IT & informatics consulting
Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E iChat/AIM: bioteamdag  Web: http://bioteam.net
