[GE users] name server, and restarting sge master without losing jobs?

reuti reuti at staff.uni-marburg.de
Thu Aug 12 23:21:40 BST 2010


On 12.08.2010 at 16:43, gutnik wrote:

>>>> <snip>
> So, that did not go well. When we changed nameservers, "qhost" showed
> that several machines lost
> contact with the sgeserver. /var/spool/gridengine/qmaster/messages
> show things like
> worker|sgemaster|W|gethostbyname(host1.net.local) took 20 seconds and
> returns TRY_AGAIN

Yep, the sgemaster couldn't contact the new nameserver for some reason. I'm not sure whether this is SGE related; it may instead be related to the `nscd` cache, which you can reset with:

$ nscd -i hosts

Can you try the tools in $SGE_ROOT/utilbin/lx24-amd64? Do they resolve the addresses correctly (`gethostbyname`, ...)?
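
Note that `host` queries DNS directly, while the qmaster resolves names through the libc resolver (nsswitch.conf, /etc/hosts, possibly `nscd`), so the two can disagree. As a minimal sketch of a check that takes the same path, assuming `getent` is available on the sgemaster, the function below prints every name that fails to resolve. `check_hosts` and the `RESOLVE_CMD` override are hypothetical names introduced only for this illustration:

```shell
# check_hosts HOST...: print each host name the libc resolver (NSS) cannot resolve.
# RESOLVE_CMD is a hypothetical override hook; by default "getent hosts" is used,
# which follows nsswitch.conf just like gethostbyname() does.
check_hosts() {
    resolve_cmd=${RESOLVE_CMD:-"getent hosts"}
    for h in "$@"; do
        # getent exits non-zero when the name cannot be found
        $resolve_cmd "$h" >/dev/null 2>&1 || echo "UNRESOLVED: $h"
    done
}
```

Run on the sgemaster as, e.g., `check_hosts host1.net.local`; no output means every name resolved.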

> worker|sgemaster|E|can't send asynchronous message to commproc
> (execd:1) on host "host1.net.local": can't resolve host name
> ...and other warnings, even though from a shell on sgemaster, "host
> host1.net.local" resolves just fine. I waited
> for more than the 10 minute timeout and kept getting this sort of
> error. What is most likely wrong?
>> Well, besides putting all the exechosts in /etc/hosts on the qmaster machine, I would also run the dhcpd for the nodes on the qmaster machine.
> Why is it important to put the hosts into /etc/hosts if they continue
> to resolve?

To be on the safe side. With your current configuration, the health of the cluster depends on the health of the external nameserver. When you run a private nameserver for your nodes on the headnode of the cluster, the chance of losing this service is lower.
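
Entries in /etc/hosts only protect against a nameserver outage if the `hosts:` line in nsswitch.conf lists `files` before `dns`. A small sketch to verify this, assuming the usual nsswitch.conf layout (the function name `hosts_uses_files_first` is made up for this example):

```shell
# hosts_uses_files_first [FILE]: succeed if the "hosts:" line of an
# nsswitch.conf-style FILE (default /etc/nsswitch.conf) lists "files"
# before "dns"; fail otherwise.
hosts_uses_files_first() {
    awk '$1 == "hosts:" {
             for (i = 2; i <= NF; i++) {
                 if ($i == "files") { found = 1; exit }
                 if ($i == "dns")   { exit }
             }
         }
         END { exit found ? 0 : 1 }' "${1:-/etc/nsswitch.conf}"
}

hosts_uses_files_first && echo "files is consulted before dns"
```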

>>>> I don't know whether it's the case for resolv.conf, but e.g. the nsswitch.conf is only read once per process which uses it.
> Is there some way to tell the sgemaster process to reread the network
> information? It turns out to be quite painful to restart it.
>>>>> 2) Last time I restarted the sge master server, I believe all queued
>>>>> jobs were killed. Is there some way
>>>> When you just shut down the qmaster and start it again, nothing should happen to any job, neither to the running ones nor to the waiting ones. Running jobs will just continue, and waiting ones will be scheduled once the qmaster is up again.
>>> Is that true even if I reboot the qmaster?
>> Definitely yes. An exception applies when the qmaster machine is also the file server for /home, ...; in that case it depends on the NFS setup.
> So, we saw the following: when the qmaster sge process was restarted,
> sge lost contact with some of the execservers.
> They showed "-" for status in qhost, and I think something like aAu
> status in qmon for all their queues.
> For the execservers it didn't lose contact with, we saw many lines
> like this in the logs:
>  worker|sgemaster|E|execd at host1.net.local reports running job
> (106726.1/master) in queue "default.q at host1.net.Local" that was not
> supposed to be there - killing
> For the ones that did lose contact with the sgemaster, their jobs kept
> running... until we did a softstop on the sgeexecd, killed the
> sgeexecd, and then restarted sgeexecd. That procedure normally doesn't
> cause a problem, but this time, once they contacted the
> sgemaster, it killed the processes.

Do all nodes get fixed TCP/IP addresses based on their MAC address?

> We have a relatively small installation, but there are usually a
> couple dozen jobs running > 8 hours on the execservers at
> any one time, so killing them all every time we make a network change
> or change an exec IP address is quite painful.

Most likely you have to remove a node first, change the name/MAC address relation in the DHCP server and then add it again with the new address.

> Is there
> something I have misconfigured? What happened?

SGE is not designed to handle varying (i.e. dynamic) TCP/IP addresses, or to have them changed on the fly. So it's best to keep them all fixed (this can still be done via DHCP for the nodes). What does your network layout look like? When all exechosts are in a private subnet and the headnode has two network cards, the outside network can be changed without affecting the internal network or the operation of the cluster.
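
For fixed addresses handed out by DHCP, a minimal sketch, assuming the ISC dhcpd is used: each node gets a `host` declaration that ties its MAC address to one IP address. The helper `dhcpd_host_entry` and the node name, MAC and IP below are invented for illustration only:

```shell
# dhcpd_host_entry NAME MAC IP: print an ISC dhcpd "host" declaration that
# always hands the same IP address to the given MAC address.
dhcpd_host_entry() {
    printf 'host %s {\n    hardware ethernet %s;\n    fixed-address %s;\n}\n' \
        "$1" "$2" "$3"
}

# Example values are made up; the output would go into dhcpd.conf.
dhcpd_host_entry node01 00:11:22:33:44:55 192.168.1.101
```

With such entries in place, a node keeps its address across reboots, so the name/address relation known to SGE stays stable.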

-- Reuti

>  Vadim
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=274003
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

