[GE users] SGE not freeing up client endpoints

McCalla, Mac macmccalla at hess.com
Mon Feb 28 22:00:55 GMT 2005


Hi Sean,
	I am also seeing this behavior with a new installation of SGE
6.0u3. I have set max_unheard to 00:10:00 and load_report_time to
00:05:00 to match my existing production 5.3p6 system. I have not
determined how much elapsed time is needed for recovery to happen
automatically, but I did have a system that was down for about 4 days
and still did not reconnect to the SGE 6.0u3 qmaster at boot time.
Reconnection to the 5.3p6 system was successful at boot time.

Regards,
Mac McCalla


-----Original Message-----
From: Sean Dilda [mailto:agrajag at dragaera.net] 
Sent: Monday, February 28, 2005 3:21 PM
To: users at gridengine.sunsource.net
Subject: [GE users] SGE not freeing up client endpoints


I'm using SGE 6.0u3.  I had a problem where if I rebooted a compute
node, it would come back up before SGE had acknowledged that the node
was down, and sge_execd wouldn't start right.  I believe I've fixed this
by reducing max_unheard.  However, now when the node reboots, sge_execd
prints out this error when it tries to start:

02/28/2005 16:09:06|execd|cbcb-n12|E|commlib error: endpoint is not unique error (endpoint "cbcb-n12/execd/1" is already connected)
02/28/2005 16:09:09|execd|cbcb-n12|E|getting configuration: unable to contact qmaster using port 535 on host "head4"
02/28/2005 16:09:09|execd|cbcb-n12|W|can't get configuration from qmaster -- waiting ...
02/28/2005 16:09:10|execd|cbcb-n12|E|there is already a client endpoint cbcb-n12/execd/1 connected to qmaster service
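
[Editor's sketch] The messages quoted above appear to be pipe-delimited (timestamp, daemon, host, level, text). A minimal Python sketch for spotting this particular failure in a script, assuming that five-field layout holds and that any mail-wrapped lines have been rejoined first. The class and function names are illustrative and are not part of SGE:

```python
from typing import NamedTuple, Optional


class ExecdMessage(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str
    text: str


def parse_message(line: str) -> Optional[ExecdMessage]:
    """Split one message line into its five pipe-delimited fields.

    Returns None when the line does not have the assumed layout.
    """
    # maxsplit=4 keeps any '|' inside the message text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return ExecdMessage(*parts)


def is_stale_endpoint_error(msg: ExecdMessage) -> bool:
    """True for the two 'endpoint already connected' errors quoted above."""
    return msg.level == "E" and (
        "endpoint is not unique" in msg.text
        or "already a client endpoint" in msg.text
    )
```

The substring checks are matched against the wording in this thread only; other SGE versions may phrase the error differently.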

I wait a few minutes after the node reboots, and SGE is definitely
showing it as down; however, if I try to restart sge_execd, it still
gives this same error.  If I wait long enough (I haven't timed how
long that is), I am finally able to start sge_execd without errors.

Has anyone else seen this?  Is there some reason SGE isn't freeing up 
the endpoint?  Is there something I can do to keep from having to 
manually restart sge_execd every time I reboot a compute node?
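
[Editor's sketch] One possible stopgap for the manual restart is to retry the start until qmaster has released the old endpoint. This is a minimal sketch, not a fix for the underlying behavior. It assumes the start command exits non-zero while the endpoint is still held, which has not been verified for sge_execd; the helper names and the command path in the usage note are hypothetical:

```python
import subprocess
import time
from typing import Callable, Sequence


def start_with_retry(
    start: Callable[[], bool],
    attempts: int = 10,
    delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call start() until it returns True.

    Returns the 1-based attempt number that succeeded, or 0 if every
    attempt failed. No sleep happens after the final attempt.
    """
    for attempt in range(1, attempts + 1):
        if start():
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return 0


def run_command(argv: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(list(argv)).returncode == 0
```

Usage would look like `start_with_retry(lambda: run_command(["/path/to/your/execd-start-script"]))`, with the placeholder path replaced by whatever your installation uses. If the start command daemonizes and always exits 0, the success check would instead need to inspect the execd messages for the endpoint error.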

Thanks,


Sean

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net






