[GE users] SGE not freeing up client endpoints

Ron Chen ron_chen_123 at yahoo.com
Tue Mar 1 02:21:14 GMT 2005


Just FYI:

In SGE 5.3, we have the communication daemon (commd),
which handles the communications between the qmaster
and the execds.

In SGE 6.0, the new communication library (libcomm.a)
handles the communications, so that's why you get the
new behaviour/bug.
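
If it helps to check whether a node is hitting this, here is a minimal
Python sketch (not part of SGE) that scans execd messages lines for the
two "endpoint already connected" errors quoted below. The line layout
(date time|daemon|host|level|message) is assumed from those quoted log
lines only, and the names LogEntry, parse_messages and
stale_endpoint_errors are made up for illustration.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple


class LogEntry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


# Assumed layout, inferred from the quoted log: date time|daemon|host|level|message
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|([^|]+)\|([^|]+)\|([A-Z])\|(.*)$"
)

# Substrings taken from the two endpoint errors in the quoted log.
STALE_MARKERS = ("is already connected", "already a client endpoint")


def parse_messages(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield one LogEntry per line that matches the assumed layout."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            yield LogEntry(*match.groups())


def stale_endpoint_errors(lines: Iterable[str]) -> List[LogEntry]:
    """Return error-level entries that report an already-connected endpoint."""
    return [
        entry
        for entry in parse_messages(lines)
        if entry.level == "E"
        and any(marker in entry.message for marker in STALE_MARKERS)
    ]


# The four log entries from the quoted message, one entry per line.
SAMPLE = [
    '02/28/2005 16:09:06|execd|cbcb-n12|E|commlib error: endpoint is not '
    'unique error (endpoint "cbcb-n12/execd/1" is already connected)',
    '02/28/2005 16:09:09|execd|cbcb-n12|E|getting configuration: unable to '
    'contact qmaster using port 535 on host "head4"',
    "02/28/2005 16:09:09|execd|cbcb-n12|W|can't get configuration from "
    "qmaster -- waiting ...",
    '02/28/2005 16:09:10|execd|cbcb-n12|E|there is already a client endpoint '
    'cbcb-n12/execd/1 connected to qmaster service',
]

for entry in stale_endpoint_errors(SAMPLE):
    print(entry.timestamp, entry.host, entry.message)
```

Feeding it the lines of a real messages file instead of SAMPLE should
work the same way, assuming each entry sits on a single line in the file.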

 -Ron


--- "McCalla, Mac" <macmccalla at hess.com> wrote:
> Hi Sean,
> 	I am also seeing this behavior with a new installation of sge 6.0u3.
> I have set max_unheard to 00:10:00 and load_report_time= 00:05:00 to match
> my existing production 5.3p6 system.  I have not determined what elapsed
> time is necessary to allow the recovery to be automatic, but I did have
> a system that was down about 4 days, that still did not reconnect at
> boot time to the sge6.0u3 qmaster.  reconnection to the 5.3p6 system was
> successful at boot time.
> 
> Regards,
> Mac McCalla
> 
> 
> -----Original Message-----
> From: Sean Dilda [mailto:agrajag at dragaera.net] 
> Sent: Monday, February 28, 2005 3:21 PM
> To: users at gridengine.sunsource.net
> Subject: [GE users] SGE not freeing up client endpoints
> 
> 
> I'm using SGE 6.0u3.  I had a problem where if I rebooted a compute
> node, it would come back up before SGE had acknowledged that the node
> was down, and sge_execd wouldn't start right.  I believe I've fixed this
> by reducing max_unheard.  However, now when the node reboots, sge_execd
> prints out this error when it tries to start:
> 
> 02/28/2005 16:09:06|execd|cbcb-n12|E|commlib error: endpoint is not unique error (endpoint "cbcb-n12/execd/1" is already connected)
> 02/28/2005 16:09:09|execd|cbcb-n12|E|getting configuration: unable to contact qmaster using port 535 on host "head4"
> 02/28/2005 16:09:09|execd|cbcb-n12|W|can't get configuration from qmaster -- waiting ...
> 02/28/2005 16:09:10|execd|cbcb-n12|E|there is already a client endpoint cbcb-n12/execd/1 connected to qmaster service
> 
> I will wait a few minutes after the node rebooted, and SGE is definitely
> showing it as down, however if I try to restart sge_execd, it'll still
> give this same error.  However, if I wait long enough (haven't timed to
> see how long that is), I will finally be able to start sge_execd without
> errors.
> 
> Has anyone else seen this?  Is there some reason SGE isn't freeing up
> the endpoint?  Is there something I can do to keep from having to
> manually restart sge_execd every time I reboot a compute node?
> 
> Thanks,
> 
> 
> Sean
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail:
> users-help at gridengine.sunsource.net



	
		




