[GE users] name server, and restarting sge master without losing jobs?

gutnik gutnik at gmail.com
Thu Aug 12 15:43:52 BST 2010

On Wed, Aug 11, 2010 at 12:06 PM, reuti <reuti at staff.uni-marburg.de> wrote:
> Am 11.08.2010 um 20:20 schrieb gutnik:
>> On Wed, Aug 11, 2010 at 10:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>>> Hi,
>>> Am 11.08.2010 um 19:22 schrieb gutnik:
>>>> Our network admin is changing the name server, but every time he
>>>> brings down the old name server, sge hangs.
>>>> The machine on which sge is running has the correct resolv.conf, and
>>>> can use the new name server with no problems.
>>>> So,
>>>> 1) Does SGE cache network information (including name server)? Is
>>>> there a way to flush that?
>>> IIRC there is an internal buffer for 10 minutes for the hostnames. But did I get you right that only the machine which runs the name server changes, but not the name of any machine? To avoid such side effects, I usually put all machines of the cluster in /etc/hosts. So even when the name server is gone, the cluster will operate as usual on the internal side.
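For illustration, a minimal /etc/hosts on the qmaster along those lines might look like the fragment below. The addresses and the qmaster's name are made up for this sketch; only host1.net.local appears in this thread:

```
# /etc/hosts on the qmaster -- illustrative addresses only
127.0.0.1      localhost
192.168.1.10   sgemaster.net.local   sgemaster   # assumed qmaster address
192.168.1.21   host1.net.local       host1       # exec host, assumed address
```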

So, that did not go well. When we changed name servers, "qhost" showed
that several machines had lost contact with the sge master.
/var/spool/gridengine/qmaster/messages shows things like

  worker|sgemaster|W|gethostbyname(host1.net.local) took 20 seconds and returns TRY_AGAIN
  worker|sgemaster|E|can't send asynchronous message to commproc (execd:1) on host "host1.net.local": can't resolve host name

...and other warnings, even though from a shell on sgemaster, "host
host1.net.local" resolves just fine. I waited for more than the
10-minute timeout and kept getting this sort of error. What is most
likely wrong?
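One thing worth checking here (an assumption about the setup, not a confirmed diagnosis): the "host" utility queries the DNS servers in /etc/resolv.conf directly, while sge_qmaster resolves names through gethostbyname(), which follows the full NSS stack configured in nsswitch.conf. "getent hosts" exercises the same path as gethostbyname(), so comparing the two can show whether the daemon's lookup path differs from what the shell test sees:

```shell
# getent resolves through the same NSS stack (nsswitch.conf ->
# /etc/hosts -> DNS) that gethostbyname() inside sge_qmaster uses,
# whereas "host" queries the resolv.conf name servers directly.
# Using localhost so this sketch is self-contained; in practice compare
#   host host1.net.local
#   getent hosts host1.net.local
getent hosts localhost
```

If "host" succeeds but "getent hosts" is slow or fails, the problem is in the NSS path (or a cache in front of it) rather than in DNS itself.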

> Well, besides putting all the exechost in /etc/hosts on the qmaster machine, I would also run the dhcpd for the nodes on the qmaster machine.

Why is it important to put the hosts into /etc/hosts if they continue
to resolve?

>>> I don't know whether it's the case for resolv.conf, but e.g. the nsswitch.conf is only read once per process which uses it.

Is there some way to tell the sgemaster process to reread the network
information? It turns out to be quite painful to restart it.

>>>> 2) Last time I restarted the sge master server, I believe all queued
>>>> jobs were killed. Is there some way
>>> When you just shut down the qmaster and start it again, nothing should happen to any job: neither the running ones nor the waiting ones. Running jobs will just continue, and waiting ones will be scheduled once the qmaster is up again.
>> Is that true even if I reboot the qmaster?
> Definitely yes. An exception applies when the qmaster machine is also the file server for /home,... then it depends on the NFS setup.

So, we saw the following: when the qmaster sge process was restarted,
sge lost contact with some of the execservers.
They showed "-" for status in qhost, and I think something like aAu
status in qmon for all their queues.

For the execservers the qmaster didn't lose contact with, we saw many
lines like this in the logs:
  worker|sgemaster|E|execd@host1.net.local reports running job (106726.1/master) in queue "default.q@host1.net.Local" that was not supposed to be there - killing

For the ones that did lose contact with the sgemaster, their jobs kept
running... until we did a softstop on the sgeexecd, killed the
sgeexecd, and then restarted sgeexecd. That procedure normally doesn't
cause a problem, but this time, once they contacted the
sgemaster, it killed the processes.

We have a relatively small installation, but there are usually a
couple dozen jobs running > 8 hours on the execservers at any one
time, so killing them all every time we make a network change or
change an exec host's IP address is quite painful. Is there something
I have misconfigured? What happened?




More information about the gridengine-users mailing list