[GE users] name server, and restarting sge master without losing jobs?

gutnik gutnik at gmail.com
Wed Aug 11 19:20:20 BST 2010


On Wed, Aug 11, 2010 at 10:43 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
> On 11.08.2010 at 19:22, gutnik wrote:
>
>> Our network admin is changing the name server, but every time he
>> brings down the old name server, sge hangs.
>> The machine on which sge is running has the correct resolv.conf, and
>> can use the new name server with no problems.
>>
>> So,
>>
>> 1) Does SGE cache network information (including name server)? Is
>> there a way to flush that?
>>
>
> IIRC the hostnames are held in an internal buffer for 10 minutes. But did I get you right that only the machine which runs the name server changes, but not the name of any machine? To avoid such side effects, I usually put all machines of the cluster in /etc/hosts. So even when the name server is gone, the cluster will operate as usual on the internal side.
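
A minimal sketch of how one could check the /etc/hosts suggestion above: it parses hosts-file text and reports which cluster hosts have no entry. The function names, the host names (node01 etc.) and the addresses are made up for illustration; nothing here is part of SGE.

```python
def hosts_file_names(hosts_text):
    """Return every name and alias listed in hosts-file text, lowercased."""
    names = set()
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # fields[0] is the address; the rest are the canonical name and aliases.
        names.update(name.lower() for name in fields[1:])
    return names


def missing_from_hosts(cluster_hosts, hosts_text):
    """Return the cluster hosts that have no entry in the hosts-file text."""
    known = hosts_file_names(hosts_text)
    return [host for host in cluster_hosts if host.lower() not in known]


# Hypothetical sample data; in practice one would read /etc/hosts instead.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
10.0.0.1    node01 node01.cluster   # exec host
10.0.0.2    node02
"""

print(missing_from_hosts(["node01", "node02", "node03"], SAMPLE_HOSTS))
```

Whether /etc/hosts is actually consulted before DNS depends on the hosts line in nsswitch.conf, so this only checks that the entries exist.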

Yes, it's just the nameserver that changes. But the problem is pretty
clearly related to that: if we bring down the old nameserver, qstat,
qhost, qmon... all the sge commands take a very long time to finish
(as if they're waiting for some timeout), and many of the exec hosts
stop showing up in the qhost list.
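
One way to confirm that the delay comes from name resolution would be to time the lookups themselves from the qmaster machine. This is only a sketch: the function names, the 2-second threshold and the host names are assumptions, and it measures the resolver used by the Python process, not whatever SGE has cached internally.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.gethostbyname):
    """Resolve one hostname; return (address or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        address = resolver(hostname)
    except OSError:
        # socket.gaierror and socket.herror are subclasses of OSError.
        address = None
    return address, time.monotonic() - start


def slow_or_failed(hostnames, threshold=2.0, resolver=socket.gethostbyname):
    """Return (hostname, address, elapsed) for lookups that failed or were slow."""
    problems = []
    for name in hostnames:
        address, elapsed = time_lookup(name, resolver)
        if address is None or elapsed > threshold:
            problems.append((name, address, elapsed))
    return problems


if __name__ == "__main__":
    # Hypothetical exec host names; replace with the real ones.
    for name, address, elapsed in slow_or_failed(["node01", "node02"]):
        print(f"{name}: address={address} took {elapsed:.2f}s")
```

If lookups take several seconds each while the old name server is down, the process is probably still trying the old server first and waiting for it to time out.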

> I don't know whether it's the case for resolv.conf, but e.g. the nsswitch.conf is only read once per process which uses it.


>> 2) Last time I restarted the sge master server, I believe all queued
>> jobs were killed. Is there some way
>
> When you just shut down the qmaster and start it again, nothing should happen to any job, neither to the running ones nor to the waiting ones. The running ones will just continue, and the waiting ones will be scheduled once the qmaster is up again.

Is that true even if I reboot the qmaster?

> If it does happen that you are missing some jobs, the next step is to investigate the messages file of the qmaster. Maybe some jobs just ended while the qmaster was offline.

Well, I'll hope for the best then. Thank you.

  Vadim

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=273782
