[GE users] SGE and TruCluster

James Chamberlain jamesc at exa.com
Tue May 9 06:57:18 BST 2006


I already have, and it didn't help.  The problem is that sge_execd binds to
all interfaces, not that I need it and the qmaster to communicate over only
one.  When sge_execd binds to the wildcard address on one node of the
TruCluster, that prevents sge_execd from starting on any other node.  As far
as the other nodes are concerned, once a service has bound to a port on the
cluster address, that port is in use and cannot be bound on other nodes.
Take the following example:

Cluster_1:
Node_1
Node_2
Node_3

Node_1 binds to *:536 == Node_1:536, Cluster_1:536
Node_2 attempts to bind to *:536 == Node_2:536, Cluster_1:536
Cluster_1:536 registers as already in use -> bind() fails.
Node_2 returns that the port is in use.  sge_execd exits with an error.
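
The sequence above can be approximated on a single ordinary host, where two
sockets competing for the same wildcard port stand in for Node_1 and Node_2.
This is only an analogy for the cluster-wide behaviour, not a TruCluster
test; the helper names bind_wildcard() and bound_port() are made up for
illustration and are not part of SGE.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Illustrative helper (not from SGE): bind a new TCP socket to the wildcard
 * address on `port` (host byte order). Port 0 asks the kernel to pick one.
 * Returns the socket on success, or -1 with errno set by socket()/bind(). */
int bind_wildcard(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);   /* the "*" in "*:536" */
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        int saved = errno;      /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Illustrative helper: return the port a bound socket actually holds,
 * or 0 if getsockname() fails. */
unsigned short bound_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return 0;
    return ntohs(addr.sin_port);
}
```

Calling bind_wildcard(0), reading the port back with bound_port(), and then
calling bind_wildcard() again with that port plays the part of Node_2: the
second call should fail with errno set to EADDRINUSE, since neither socket
sets SO_REUSEADDR or SO_REUSEPORT.
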

That is my understanding of what is happening under TruCluster, and why the 
multiple interface trick won't work.  TruCluster is a tightly integrated 
clustering environment where many things you wouldn't expect are visible 
across the entire cluster.
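
For contrast, the code change discussed further down the thread amounts to
putting one specific address into the sockaddr instead of INADDR_ANY before
calling bind().  A minimal sketch, assuming IPv4 only; bind_to_address() is a
hypothetical helper, not an SGE function, and whether TruCluster would then
let every member hold port 536 has not been verified here.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of SGE): bind a new TCP socket to a single
 * IPv4 address given as a dotted quad, rather than to INADDR_ANY.
 * Returns the socket on success, or -1 with errno set. */
int bind_to_address(const char *ip, unsigned short port)
{
    struct sockaddr_in addr;
    int fd;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);

    /* inet_pton() returns 1 only when `ip` is a valid IPv4 address. */
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        int saved = errno;      /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

On a cluster member this would be called with the node-local address, for
example bind_to_address(node_ip, 536), where node_ip is a placeholder for
whatever address is local to that member.
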

Thanks for the pointer anyway, though.

James

On Mon, 8 May 2006, Ron Chen wrote:

> I don't have experience with TruCluster, but it sounds like it is
> similar to the "hosts with NIC cards" issue, so see this HOWTO:
>
> http://gridengine.sunsource.net/howto/multi_intrfcs.html
>
> -Ron
>
>
> --- James Chamberlain <jamesc at exa.com> wrote:
>> That's the very problem I've been working on, Rayson.  Any idea how to get
>> sge_execd to bind to the local IP?  I haven't seen an option for sge_execd
>> which lets me specify which IP to bind to, but have been trying to
>> investigate on the assumption that a facility to do this exists within
>> TruCluster.  SGE can't be the only thing affected by an issue like this.
>>
>> Thanks,
>>
>> James
>>
>> On Mon, 8 May 2006, Rayson Ho wrote:
>>
>>> Maybe you got everything working by now?? :)
>>>
>>> Anyway, I used to work with TruClusters a few years ago, and I
>>> remember there are 2 IPs on each node (one is local to the node, and
>>> one is to address the entire cluster). If you want to have an SGE
>>> cluster running within the TruCluster, then you should tell each execd
>>> to bind to the IP address that is local to its node.
>>>
>>> I think I need to read the TruCluster docs to refresh my memory :p,
>>> but do let me know if you get any further...
>>>
>>> Rayson
>>>
>>>
>>>
>>>
>>> On 5/5/06, James Chamberlain <jamesc at exa.com> wrote:
>>>> Hi Fred,
>>>>
>>>> Thanks for the feedback.
>>>>
>>>> TruCluster is an interesting environment.  It seems to be set up for load
>>>> balancing by default, with a single root filesystem shared amongst all its
>>>> members.  Any changes you make on one node to most of the files in /etc
>>>> immediately show up on all other nodes.
>>>>
>>>> I'm hoping there may be a less drastic way around this than to patch the
>>>> code.  I notice that inetd, ntpd, apache and other network services all are
>>>> listening on the wildcard address on all nodes with no problem.  Sure, HP
>>>> could have modified them all to be cluster aware, but I'd like to believe
>>>> there's a way for unmodified apps to bind to the wildcard address, too.
>>>>
>>>> Thanks,
>>>>
>>>> James
>>>>
>>>> On Fri, 5 May 2006, Fred L Youhanaie wrote:
>>>>
>>>>>
>>>>> Hi James,
>>>>>
>>>>> James Chamberlain wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Does anyone know of a way to make sge_execd bind to a specific interface?
>>>>>> By binding to *:536, sge_execd is picking up the cluster IP address in a
>>>>>> TruCluster (Tru64 UNIX) environment.  As a result, other nodes in the
>>>>>> TruCluster say they can't start sge_execd because port 536 is already in
>>>>>> use. For reference, I'm using SGE 5.3p7.
>>>>>
>>>>> I just had a look at the source, it appears that execd ultimately calls
>>>>> cl_com_tcp_connection_request_handler_setup(), which in turn binds to the
>>>>> 'wildcard' address.
>>>>>
>>>>> source/libs/comm/cl_tcp_framework.c:
>>>>> =====================
>>>>> int cl_com_tcp_connection_request_handler_setup(...)
>>>>> ...
>>>>>   /* bind an address to socket */
>>>>>   ...
>>>>>   serv_addr.sin_addr.s_addr = htonl(INADDR_ANY);
>>>>>   ...
>>>>>   if (bind(sockfd, (struct sockaddr *) &serv_addr, ...
>>>>> =====================
>>>>>
>>>>> So, you will need to file an RFE, or patch it yourself!
>>>>>
>>>>> Cheers
>>>>> f.
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>
