[GE users] Exec daemon can't resolve master hostname

henk h.a.slim at durham.ac.uk
Sat Jan 9 19:25:45 GMT 2010





________________________________
From: reidac [mailto:andrew.reid at nist.gov]
Sent: Fri 08/01/2010 16:37
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Exec daemon can't resolve master hostname


On Fri, Jan 08, 2010 at 06:01:45AM -0500, reuti wrote:
> On 07.01.2010 at 22:59, reidac wrote:
>
> [ About my host-name resolution problems ]
>
> Sounds like you need an $SGE_ROOT/default/common/host_aliases file to
> tell the qmaster to run on the internal interface (i.e. just one line
> in this file):
>
> http://gridengine.sunsource.net/howto/multi_intrfcs.html
>
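
For reference, each line of a host_aliases file is a space-separated list of names, and the first name on the line is the one Grid Engine uses for that host. The sketch below only illustrates that format by parsing one such line. The function name and both host names are hypothetical, and which name belongs first on a given cluster is the question the howto above addresses.

```python
def parse_host_aliases(text):
    """Map every name on a host_aliases line to that line's first (primary) name."""
    primary_for = {}
    for line in text.splitlines():
        names = line.split()
        if not names:
            continue  # skip blank lines
        primary = names[0]
        for name in names:
            # Host names compare case-insensitively, so store keys lowercased.
            primary_for[name.lower()] = primary
    return primary_for


# Hypothetical names: "node-master" stands for the internal-interface name,
# "master.example.org" for the public canonical name.
example = "node-master master.example.org\n"
print(parse_host_aliases(example))
```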

  This looks useful, and might indeed be part of the problem,
but it's not the whole answer -- it's not that the daemons
can't communicate at all, it's that they stop communicating
after a while.

  In the basic configuration (without the host-alias modifications),
the master daemon is listening on all interfaces, according to
netstat.  This is fine for now, although eventually I'd like
to confine it to the cluster sub-net.

  The misbehavior is more subtle -- when the exec hosts first start
up, they appear in "qstat -f" with sensible load values, and jobs will
run on them, provided I start them within the first two minutes.

  It's afterwards that they run into trouble.  The host-resolution
failure message appears in the exec host's "messages" file after
two minutes of run-time, and within a few minutes after that,
"qstat -f" on the master host starts reporting the host as
having an unknown load and being unavailable.

  I have tried a number of simple things, including diddling the
/etc/hosts file to remove conflicting aliases (i.e. not having
the canonical name of the master host be an alias for it on the
subnet), and a few others, but it's still misbehaving.
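
When editing /etc/hosts like this, it can help to see what the system resolver returns for a name on each machine, forward and reverse. The following is a minimal sketch using only the Python standard library. The function name `resolution_report` is made up for illustration, and it shows what the operating system resolves, which may differ from the name Grid Engine settles on once host_aliases is in play.

```python
import socket


def resolution_report(name):
    """Return forward and reverse lookup results for a host name."""
    report = {"query": name}
    try:
        canonical, aliases, addresses = socket.gethostbyname_ex(name)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        report["error"] = str(exc)
        return report
    report["canonical"] = canonical
    report["aliases"] = aliases
    report["addresses"] = addresses
    # Look each address up again to see which name it maps back to.
    reverse = {}
    for addr in addresses:
        try:
            reverse[addr] = socket.gethostbyaddr(addr)[0]
        except OSError as exc:
            reverse[addr] = "reverse lookup failed: %s" % exc
    report["reverse"] = reverse
    return report
```

Calling `resolution_report()` with the master's name on both the master and an exec host, then comparing the "canonical" and "reverse" entries, would show whether the two machines disagree about what the master is called.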

  I'm exploring the host-alias angle as well -- it's possible
that fixing this will fix the other thing.

  One thing I have noticed, that was wrong in my earlier e-mail,
is that the master host is apparently *not* configured to have
its act_qmaster set to the back-side hostname -- something is
re-setting this file to the canonical host-name at start-up.
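
A small check of what act_qmaster currently contains, run before and after a restart, would confirm whether the file is being rewritten. This is a sketch under assumptions: it takes the file to live at $SGE_ROOT/<cell>/common/act_qmaster with the cell named "default" (as in the host_aliases path quoted above), and the function names are hypothetical.

```python
import os


def read_act_qmaster(sge_root, cell="default"):
    """Return the host name recorded in <sge_root>/<cell>/common/act_qmaster."""
    path = os.path.join(sge_root, cell, "common", "act_qmaster")
    with open(path) as handle:
        # The file is expected to hold a single host name on its first line.
        return handle.readline().strip()


def same_host_name(recorded, expected):
    """Compare two host names, ignoring case."""
    return recorded.lower() == expected.lower()
```

For example, `same_host_name(read_act_qmaster(os.environ["SGE_ROOT"]), "node-master")` would report whether the file still holds the internal name, where "node-master" is a placeholder for the back-side hostname.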

  Investigations continue, clues still appreciated.

                                -- A.
--
Dr. Andrew C. E. Reid
Computer Operations Administrator
Center for Theoretical and Computational Materials Science
National Institute of Standards and Technology, Mail Stop 8910
Gaithersburg MD 20899 USA
andrew.reid at nist.gov
