[GE users] SGE not freeing up client endpoints

Marco Donauer Marco.Donauer at Sun.COM
Tue Mar 8 19:50:55 GMT 2005


Olle,

thanks for your patch!
I will have a look at it and then check it in!
Thanks again!

Regards,
Marco

Olle Liljenzin wrote:

> We see this when execd is not properly stopped at shutdown.
>
> There are two patches for the Linux installation that just need to be 
> checked in to CVS by someone with permission to do it:
>
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1424
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1426
>
> /Olle
>
> Sean Dilda wrote:
>
>> I'm using SGE 6.0u3.  I had a problem where if I rebooted a compute 
>> node, it would come back up before SGE had acknowledged that the node 
>> was down, and sge_execd wouldn't start right.  I believe I've fixed 
>> this by reducing max_unheard.  However, now when the node reboots, 
>> sge_execd prints out this error when it tries to start:
>>
>> 02/28/2005 16:09:06|execd|cbcb-n12|E|commlib error: endpoint is not 
>> unique error (endpoint "cbcb-n12/execd/1" is already connected)
>> 02/28/2005 16:09:09|execd|cbcb-n12|E|getting configuration: unable to 
>> contact qmaster using port 535 on host "head4"
>> 02/28/2005 16:09:09|execd|cbcb-n12|W|can't get configuration from 
>> qmaster -- waiting ...
>> 02/28/2005 16:09:10|execd|cbcb-n12|E|there is already a client 
>> endpoint cbcb-n12/execd/1 connected to qmaster service
>>
>> I will wait a few minutes after the node rebooted, and SGE is 
>> definitely showing it as down, however if I try to restart sge_execd, 
>> it'll still give this same error.  However, if I wait long enough 
>> (haven't timed to see how long that is), I will finally be able to 
>> start sge_execd without errors.
>>
>> Has anyone else seen this?  Is there some reason SGE isn't freeing up 
>> the endpoint?  Is there something I can do to keep from having to 
>> manually restart sge_execd every time I reboot a compute node?
>>
>> Thanks,
>>
>>
>> Sean
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>
>
>
>
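
The log excerpt quoted above can be checked for this condition automatically. The following is only a minimal sketch in Python, assuming that execd messages use the pipe-separated layout seen in the excerpt (timestamp|component|host|level|message) with each message on a single line. The names LogEntry, parse_line and endpoint_conflicts are hypothetical helpers for illustration and are not part of SGE.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    timestamp: str
    component: str
    host: str
    level: str
    message: str


# Matches an endpoint name of the form host/component/id, optionally quoted.
# Assumption: endpoint names look like the "cbcb-n12/execd/1" in the excerpt.
ENDPOINT_RE = re.compile(r'endpoint "?([\w.\-]+/\w+/\d+)"?')


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one 'timestamp|component|host|level|message' line.

    Returns None if the line does not have five pipe-separated fields.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return LogEntry(*parts)


def endpoint_conflicts(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, endpoint) for each error-level line that reports
    an endpoint as already connected."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry.level != "E":
            continue
        if "already" not in entry.message:
            continue
        match = ENDPOINT_RE.search(entry.message)
        if match:
            hits.append((entry.timestamp, match.group(1)))
    return hits
```

Passing the lines of a messages file to endpoint_conflicts gives the times at which the qmaster still considered the old endpoint connected, which may help in measuring how long the endpoint takes to be released after a reboot.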





