[GE users] SGE not freeing up client endpoints

Sean Dilda agrajag at dragaera.net
Mon Feb 28 21:20:38 GMT 2005



I'm using SGE 6.0u3.  I had a problem where a rebooted compute node 
would come back up before SGE had acknowledged that it was down, and 
sge_execd would then fail to start properly.  I believe I've fixed this 
by reducing max_unheard.  Now, however, when the node reboots, sge_execd 
prints these errors when it tries to start:

02/28/2005 16:09:06|execd|cbcb-n12|E|commlib error: endpoint is not unique error (endpoint "cbcb-n12/execd/1" is already connected)
02/28/2005 16:09:09|execd|cbcb-n12|E|getting configuration: unable to contact qmaster using port 535 on host "head4"
02/28/2005 16:09:09|execd|cbcb-n12|W|can't get configuration from qmaster -- waiting ...
02/28/2005 16:09:10|execd|cbcb-n12|E|there is already a client endpoint cbcb-n12/execd/1 connected to qmaster service
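
In case it helps anyone filter these messages, the log lines above look
pipe-delimited with five fields.  Here is a minimal Python sketch that
splits them, once each wrapped entry has been joined back into one line.
The field names (timestamp, component, host, level, message) are my own
guesses from the output, not anything taken from the SGE documentation.

```python
from typing import NamedTuple, Optional


class ExecdLogLine(NamedTuple):
    # Field names are assumptions inferred from the sample output.
    timestamp: str
    component: str
    host: str
    level: str
    message: str


def parse_log_line(line: str) -> Optional[ExecdLogLine]:
    """Split one log line into its five fields, or return None."""
    # Split on the first four pipes only, so a pipe inside the
    # message text would be left intact.
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None
    return ExecdLogLine(*parts)


# The last error from the output above, joined onto a single line.
sample = (
    "02/28/2005 16:09:10|execd|cbcb-n12|E|there is already a client "
    "endpoint cbcb-n12/execd/1 connected to qmaster service"
)
entry = parse_log_line(sample)
```

With this, entries whose level is "E" and whose message mentions the
endpoint could be picked out of a longer log.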

Even if I wait a few minutes after the node has rebooted, and SGE is 
definitely showing it as down, restarting sge_execd still gives the 
same error.  If I wait long enough (I haven't timed how long that is), 
I am finally able to start sge_execd without errors.
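
As a stopgap, the manual "wait and try again" step could presumably be
scripted as a retry loop.  This is only a sketch under assumptions: the
command path is a placeholder, the attempt count and delay are arbitrary,
and it assumes the start command exits non-zero when startup fails, which
I have not verified for sge_execd (it may daemonize and only log the
error).

```python
import subprocess
import time
from typing import Callable, Sequence


def start_with_retry(
    command: Sequence[str],
    attempts: int = 10,
    delay_seconds: float = 30.0,
    run: Callable[[Sequence[str]], int] = subprocess.call,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `command` until it exits with status 0 or attempts run out.

    Returns True on the first successful run, False otherwise.
    `run` and `sleep` are injectable so the loop can be tested
    without launching a real process or actually waiting.
    """
    for attempt in range(1, attempts + 1):
        if run(list(command)) == 0:
            return True
        # Only wait if another attempt will follow.
        if attempt < attempts:
            sleep(delay_seconds)
    return False


# Hypothetical usage; the path below is a placeholder, not a real
# SGE location:
#   start_with_retry(["/path/to/execd-start-script"], attempts=20)
```

This does not explain why the endpoint stays registered; it only
automates the waiting.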

Has anyone else seen this?  Is there some reason SGE isn't freeing up 
the endpoint?  Is there something I can do to keep from having to 
manually restart sge_execd every time I reboot a compute node?

Thanks,


Sean




