[GE users] Exec daemon can't resolve master hostname
h.a.slim at durham.ac.uk
Sat Jan 9 19:25:45 GMT 2010
From: reidac [mailto:andrew.reid at nist.gov]
Sent: Fri 08/01/2010 16:37
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Exec daemon can't resolve master hostname
On Fri, Jan 08, 2010 at 06:01:45AM -0500, reuti wrote:
> On 07.01.2010 at 22:59, reidac wrote:
> [ About my host-name resolution problems ]
> Sounds like you need an $SGE_ROOT/default/common/host_aliases file to
> tell the qmaster to run on the internal interface (i.e. just one line
> in this file):
This looks useful, and might indeed be part of the problem,
but it's not the whole answer -- it's not that the daemons
can't communicate at all, it's that they stop communicating
after a while.
In the basic configuration (without the host-alias modifications),
the master daemon is listening on all interfaces, according to
netstat. This is fine for now, although eventually I'd like
to confine it to the cluster sub-net.
The misbehavior is more subtle -- when the exec hosts first start
up, they appear in "qstat -f" with sensible load values, and jobs will
run on them, provided I start them within the first two minutes.
It's afterwards that they run into trouble. The host-resolution
failure message appears on the exec host's "messages" file after
two minutes of run-time, and within a few minutes after that,
"qstat -f" on the master host starts reporting the host as
having an unknown load and being unavailable.
I have tried a number of simple things, including diddling the
/etc/hosts file to remove conflicting aliases (i.e. not having
the canonical name of the master host be an alias for it on the
subnet), and a few others, but it's still misbehaving.
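A minimal sketch of the kind of /etc/hosts check described above,
in Python. It only looks for names that are listed under more than
one address in hosts-format text; the function name and the idea
that this is the relevant conflict are assumptions for illustration,
not anything Grid Engine itself provides.

```python
from collections import defaultdict


def find_conflicting_names(hosts_text):
    """Return {name: [addresses]} for names listed under more than one address.

    hosts_text is the content of a hosts-format file: one address per line,
    followed by a canonical name and optional aliases; '#' starts a comment.
    """
    addresses_by_name = defaultdict(set)
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        for name in names:
            # Host names are case-insensitive, so compare in lower case.
            addresses_by_name[name.lower()].add(address)
    return {
        name: sorted(addresses)
        for name, addresses in addresses_by_name.items()
        if len(addresses) > 1
    }
```

Calling find_conflicting_names(open("/etc/hosts").read()) on the
master and on an exec host, and comparing the results, would show
whether the two machines disagree about any name.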
I'm exploring the host-alias angle as well -- it's possible
that fixing this will fix the other thing.
One thing I have noticed, that was wrong in my earlier e-mail,
is that the master host is apparently *not* configured to have
its act_qmaster set to the back-side hostname -- something is
re-setting this file to the canonical host-name at start-up.
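A rough sketch, in Python, of how the act_qmaster contents could be
checked after each start-up. It assumes the file sits at
$SGE_ROOT/default/common/act_qmaster (the "default" cell, as in the
host_aliases path quoted above) and holds a single host name; the
function names are made up for illustration.

```python
import os


def read_act_qmaster(sge_root, cell="default"):
    """Return the host name recorded in <sge_root>/<cell>/common/act_qmaster."""
    path = os.path.join(sge_root, cell, "common", "act_qmaster")
    with open(path) as handle:
        return handle.read().strip()


def act_qmaster_matches(sge_root, expected_host, cell="default"):
    """True if act_qmaster names expected_host (case-insensitive comparison)."""
    return read_act_qmaster(sge_root, cell).lower() == expected_host.lower()
```

Running act_qmaster_matches(os.environ["SGE_ROOT"], name) with the
back-side host name before and after restarting the master daemon
would confirm whether the file really is being rewritten at start-up.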
Investigations continue, clues still appreciated.
Dr. Andrew C. E. Reid
Computer Operations Administrator
Center for Theoretical and Computational Materials Science
National Institute of Standards and Technology, Mail Stop 8910
Gaithersburg MD 20899 USA
andrew.reid at nist.gov