[GE users] CPUs versus slots

Reuti reuti at Staff.Uni-Marburg.DE
Tue Oct 28 12:38:03 GMT 2008


Hi Gerald,

Am 27.10.2008 um 21:40 schrieb Gerald Ragghianti:

> Hi Reuti,
> I think I should probably tell you a bit more about how this system  
> is set up.  We have the first cluster segment on a private network  
> switch with the head node as the gateway.  These nodes use the  
> domain "local."  The second cluster segment has nodes that are on  
> the public network and with the domain "ornl."  Qmaster listens on  
> both physical network interfaces.

you mean you gave both interfaces the same name in /etc/hosts on
this machine? I did this in the past too, and it does indeed work.
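
For reference, SGE also has a dedicated mechanism for a host that
is known under several names: the file
$SGE_ROOT/default/common/host_aliases. A minimal sketch (the names
below just follow your description, they are not taken from your
setup):

    # $SGE_ROOT/default/common/host_aliases
    # first entry per line is the unique name, the rest are aliases
    qmaster qmaster.local qmaster.ornl

With such a line on every host, both interface names map to the
same SGE host object, so a single act_qmaster content would work
everywhere.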

As the domain names are different, the safe way would be to install
SGE with FQDN resolving enabled, give different names to the two
interfaces, and set up a proper gateway in the network. This is not
directly related to SGE, but to the general setup of the
cluster/network. As your static-route setup is working, you can also
stay with it if you prefer. The only pitfall might be that you can't
copy a raw disk image from one side of the cluster to use it on the
other side, due to the different routing setups.
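
For completeness: whether SGE compares hosts by their FQDN is
recorded in $SGE_ROOT/default/common/bootstrap. A sketch of the
relevant entries (example values, not taken from your installation):

    # $SGE_ROOT/default/common/bootstrap (excerpt)
    ignore_fqdn       false   # compare hosts by their full name
    default_domain    none    # don't append a domain to short names

With ignore_fqdn set to false, node.local and node.ornl are treated
as two distinct hosts, which is what you would want here.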

>   We do not share $SGE_ROOT via NFS.  Each node has a copy of  
> $SGE_ROOT and the act_qmaster file is modified to either  
> qmaster.local or qmaster.ornl depending on the domain that the node  
> is in.  This works except that the queue instances for "ornl"  
> machines have only 1 slot each.

I'm still of the opinion that this is only because of the queue
definition. Can you please post it?
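
You can dump the cluster queue configuration with (assuming the
queue is named all.q; adjust the name as needed):

    qconf -sq all.q

The interesting line is "slots". With host-specific overrides it
might look, for example, like:

    slots  1,[ornl28.ornl=2],[ornl29.ornl=4],[ornl30.ornl=4]

If the bracketed entries carry the wrong host names (e.g. only the
.local variants), the .ornl queue instances fall back to the
default of 1.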

-- Reuti


>
> So I guess the more general question is: how should SGE be set up  
> when you have a single qmaster node with nodes on two different  
> switches on different network interfaces?  We have used different  
> domains for each network segment so that we can use just the  
> machine name in configuration files and the domain will be  
> determined by the DNS search order and resolved to the correct  
> qmaster interface.
>
> Thanks for the help.
>
> - Gerald
>
> Reuti wrote:
>> Hi Gerald,
>>
>> Am 27.10.2008 um 15:33 schrieb Gerald Ragghianti:
>>
>>> Hi users,
>>> I recently brought online a new cluster segment that connects to  
>>> my qmaster machine via a second network interface (the machines  
>>> are on a different subnet than the "normal" cluster nodes).
>>
>> you mean one of the two segments is on the primary interface, and
>> you now added a second network card for the other segment?
>>
>>>   I did this by changing the act_qmaster contents
>>
>> This means you are not sharing $SGE_ROOT/default/common? I don't
>> see why it's necessary to put the TCP/IP address there. When the
>> act_qmaster is in a different subnet, all that should be necessary
>> is to enter the qmaster as the gateway in the nodes' network
>> setup. They will then discover that they should contact the
>> gateway instead of trying to access this machine directly.
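>>
>> A sketch of such a static route on a node (the subnet and gateway
>> address are made-up examples, adjust them to your addressing):
>>
>>    # reach the qmaster's other subnet via the qmaster itself
>>    route add -net 10.1.0.0 netmask 255.255.255.0 gw 10.2.0.254
>>
>> or the equivalent permanent entry in the distribution's network
>> configuration files.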
>>
>> Are all nodes still in one DNS domain? As long as the node names
>> are unique, I don't think you need to have SGE installed so that
>> it honors the FQDN.
>>
>>> to indicate the ip name that corresponds to the second qmaster  
>>> interface.  This seems to be working (I can run jobs)
>>
>> Interesting, I wouldn't have expected this to work.
>>
>>> with the following exception: SGE has only allocated 1 slot per  
>>> machine on the new cluster segment.
>>>
>>> admin at ornl28.ornl              BIP   0/1       0.00     lx24-amd64
>>> admin at ornl29.ornl              BIP   0/1       0.00     lx24-amd64
>>> admin at ornl30.ornl              BIP   0/1       0.00     lx24-amd64
>>
>> Maybe SGE added the wrong hostname to the slot definition in the
>> queue configuration. The number of slots defined in the queue
>> definition is not related to the physically installed cores, which
>> are correctly reported AFAICS.
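>>
>> If that's the case, the per-host slot counts could be corrected
>> with e.g. (queue and host names are assumptions):
>>
>>    qconf -aattr queue slots "[ornl29.ornl=4]" all.q
>>
>> once per affected host, or interactively via "qconf -mq all.q".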
>>
>> -- Reuti
>>
>> PS: When you have a) parallel jobs with nodes from both
>> subclusters or b) an additional login node and want to run
>> interactive jobs, then complete routing must be set up on the
>> qmaster, I think.
>>
>>
>>> Even though the machines have more than one processor and SGE  
>>> indicates this:
>>>
>>> ornl28                  lx24-amd64      2  0.00    3.9G   61.6M  960.0M     0.0
>>> ornl29                  lx24-amd64      4  0.00    7.8G  123.2M  960.0M     0.0
>>> ornl30                  lx24-amd64      4  0.00    7.8G  128.0M  960.0M     0.0
>>>
>>> I have installed the machines using the same automated system  
>>> that installs the other machines on the first cluster segment  
>>> (with the exception of changing the act_qmaster file).  What  
>>> could be the problem here?  Is there a better way to configure  
>>> this cluster segment that needs to access the qmaster machine via  
>>> a different interface?
>>>
>>
>>
>
>
> -- 
> Gerald Ragghianti
> IT Administrator - High Performance Computing
> http://hpc.usg.utk.edu/
> Office of Information Technology
> University of Tennessee
> Phone: 865-974-2448
> E-mail: geri at utk.edu
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



