[GE users] name server, and restarting sge master without losing jobs?

reuti reuti at staff.uni-marburg.de
Wed Aug 11 20:06:10 BST 2010


On 11.08.2010 at 20:20, gutnik wrote:

> On Wed, Aug 11, 2010 at 10:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
>> Hi,
>> 
>> On 11.08.2010 at 19:22, gutnik wrote:
>> 
>>> Our network admin is changing the name server, but every time he
>>> brings down the old name server, sge hangs.
>>> The machine on which sge is running has the correct resolv.conf, and
>>> can use the new name server with no problems.
>>> 
>>> So,
>>> 
>>> 1) Does SGE cache network information (including name server)? Is
>>> there a way to flush that?
>>> 
>> 
>> IIRC hostnames are held in an internal buffer for 10 minutes. But did I get you right that only the machine which runs the name server changes, and not the name of any machine? To avoid such side effects, I usually put all machines of the cluster in /etc/hosts. So even when the name server is gone, the cluster will operate as usual on the internal side.
> 
> Yes, it's just the nameserver that changes. But it's pretty clearly
> related to that-- if we bring down the old nameserver, qstat, qhost,
> qmon... all the sge
> commands take a very long time to finish (like they're waiting for
> some timeout), and many of the exec hosts stop showing up in the qhost
> list.

Well, besides putting all the exec hosts in /etc/hosts on the qmaster machine, I would also run the dhcpd for the nodes on the qmaster machine.
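
A minimal sketch of how the /etc/hosts entries could be generated and how slow lookups could be spotted, assuming Python is available on the qmaster machine. The function names, the node names and the addresses used with them are hypothetical, and listing the fully qualified name before the short alias is an assumption about the site's naming, not something SGE is known to require:

```python
import socket
import time


def hosts_file_lines(nodes):
    """Format {hostname: ip} pairs as /etc/hosts lines (IP, full name, short alias)."""
    lines = []
    for fqdn, ip in sorted(nodes.items()):
        short = fqdn.split(".")[0]
        # Only add the short alias when it differs from the full name.
        names = fqdn if short == fqdn else f"{fqdn} {short}"
        lines.append(f"{ip}\t{names}")
    return lines


def timed_lookup(hostname):
    """Resolve hostname via the system resolver; return (address or None, seconds)."""
    start = time.monotonic()
    try:
        address = socket.gethostbyname(hostname)
    except OSError:
        # Covers resolver failures (socket.gaierror is a subclass of OSError).
        address = None
    return address, time.monotonic() - start
```

Calling hosts_file_lines() with a mapping such as {"node01.cluster.example": "192.168.1.101"} gives lines that can be appended to /etc/hosts, and calling timed_lookup() for each exec host while the old name server is down should show whether lookups still wait for a resolver timeout.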


>> I don't know whether it's the case for resolv.conf, but nsswitch.conf, for example, is read only once by each process which uses it.
> 
> 
>>> 2) Last time I restarted the sge master server, I believe all queued
>>> jobs were killed. Is there some way
>> 
>> When you just shut down the qmaster and start it again, nothing should happen to any job, neither to the running ones nor to the waiting ones. Running jobs will just continue, and waiting ones will be scheduled once the qmaster is up again.
> 
> Is that true even if I reboot the qmaster?

Definitely yes. The exception is when the qmaster machine is also the file server for /home,...; in that case it depends on the NFS setup.
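
To check that nothing was lost, one option is to save the job listing (e.g. from `qstat -u '*'`) before the restart and again afterwards, and compare the job IDs. The following is only a sketch: the function names are made up for illustration, and it assumes the plain default qstat layout, where job rows start with a numeric job ID and the state is the fifth column:

```python
def parse_job_states(qstat_text):
    """Map job ID -> state from plain qstat output (assumed default column layout)."""
    jobs = {}
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; header and separator lines do not.
        if len(fields) >= 5 and fields[0].isdigit():
            jobs[fields[0]] = fields[4]
    return jobs


def missing_after_restart(before_text, after_text):
    """Return job IDs listed before the restart but absent afterwards."""
    before = parse_job_states(before_text)
    after = parse_job_states(after_text)
    return sorted(set(before) - set(after), key=int)
```

A job ID reported as missing is not necessarily a lost job: it may simply have finished while the qmaster was offline.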

-- Reuti


>> If it does happen that some jobs go missing, the next thing to investigate is the messages file of the qmaster. Maybe some jobs just ended while the qmaster was offline.
> 
> Well, I'll hope for the best then. Thank you.
> 
>  Vadim
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=273782
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=273791
