[GE users] CPUs versus slots

Gerald Ragghianti geri at utk.edu
Tue Oct 28 13:06:08 GMT 2008


Hi Reuti,
I found the problem that was causing the cluster queue instances to 
report only 1 slot.  When I created the cluster queue, I simply took an 
existing queue and modified the hostlist to include the new nodes.  This 
action did not put the hosts in the "Attributes for Host/Hostgroup" 
field.  As a consequence, the "slots" setting for the "@/" hostgroup 
took effect and was set to 1.  I recreated the cluster queue by cloning 
all.q, and the instances now report the correct number of slots.
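
For reference, the relevant attributes of the recreated queue (as shown 
by "qconf -sq <queue_name>") now look roughly like the sketch below; 
the per-host slot counts are only illustrative:

   hostlist              @allhosts ornl28 ornl29 ornl30
   slots                 1,[ornl28=2],[ornl29=4],[ornl30=4]

The bracketed per-host entries are what was missing before: without 
them, every instance falls back to the default of 1 slot.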

Thanks,

- Gerald

Reuti wrote:
> Hi Gerald,
>
> On 27.10.2008, at 21:40, Gerald Ragghianti wrote:
>
>> Hi Reuti,
>> I think I should probably tell you a bit more about how this system 
>> is set up.  We have the first cluster segment on a private network 
>> switch with the head node as the gateway.  These nodes use the domain 
>> "local."  The second cluster segment has nodes that are on the public 
>> network and with the domain "ornl."  Qmaster listens on both physical 
>> network interfaces.
>
> you mean you gave both interfaces the same name in /etc/hosts on this 
> machine? I did this in the past too, and it does indeed work.
>
> As the domain names are different, the safe way would be to install 
> SGE with FQDN resolving enabled, give different names to the two 
> interfaces, and set up a proper gateway in the network. This is not 
> directly related to SGE, but to the general setup of the 
> cluster/network. As your static route setup is working, you can also 
> stay with it if you prefer. The only pitfall might be that you can't 
> copy a raw disk image from one side of the cluster to use it on the 
> other side, due to the different routing setups.
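>
> As a rough sketch (addresses invented for illustration), the 
> single-name variant would be an /etc/hosts on the qmaster like:
>
>    10.0.0.1      qmaster
>    192.168.1.1   qmaster
>
> whereas with FQDN resolving each interface gets its own name:
>
>    10.0.0.1      qmaster.local
>    192.168.1.1   qmaster.ornl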
>
>>   We do not share $SGE_ROOT via NFS.  Each node has a copy of 
>> $SGE_ROOT, and the act_qmaster file is modified to contain either 
>> qmaster.local or qmaster.ornl, depending on the domain the node is 
>> in.  This works, except that the queue instances for "ornl" machines 
>> have only 1 slot each.
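>>
>> (For concreteness: $SGE_ROOT/default/common/act_qmaster holds a 
>> single line with the qmaster's name, so on a private-segment node it 
>> reads
>>
>>    qmaster.local
>>
>> and on a public-segment node it reads "qmaster.ornl" instead.)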
>
> I'm still of the opinion that this is only because of the queue 
> definition. Can you please post it?
>
> -- Reuti
>
>
>>
>> So I guess the more general question is: how should SGE be set up 
>> when you have a single qmaster host with nodes on two different 
>> switches, reached via different network interfaces?  We have used a 
>> different domain for each network segment so that we can use just the 
>> machine name in configuration files; the domain is then determined by 
>> the DNS search order and resolves to the correct qmaster interface.
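>>
>> For example, the resolver setup could differ per segment roughly like 
>> this (nameserver lines omitted):
>>
>>    # /etc/resolv.conf on an "ornl" node: "qmaster" -> qmaster.ornl
>>    search ornl
>>
>>    # /etc/resolv.conf on a "local" node: "qmaster" -> qmaster.local
>>    search local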
>>
>> Thanks for the help.
>>
>> - Gerald
>>
>> Reuti wrote:
>>> Hi Gerald,
>>>
>>> On 27.10.2008, at 15:33, Gerald Ragghianti wrote:
>>>
>>>> Hi users,
>>>> I recently brought online a new cluster segment that connects to my 
>>>> qmaster machine via a second network interface (the machines are on 
>>>> a different subnet than the "normal" cluster nodes).
>>>
>>> you mean one of the two segments is on the primary interface, and 
>>> now you added a new network card for a different segment?
>>>
>>>>   I did this by changing the act_qmaster contents
>>>
>>> This means you are not sharing $SGE_ROOT/default/common? I don't 
>>> see why it's necessary to put the TCP/IP address there. When the 
>>> act_qmaster is in a different subnet, all that should be necessary 
>>> is to enter the qmaster as the gateway in the nodes' network setup. 
>>> The nodes will then discover that they should contact the gateway 
>>> instead of trying to reach this machine directly.
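>>>
>>> A sketch of what I mean, with placeholder addresses for the remote 
>>> subnet and the qmaster's near-side interface:
>>>
>>>    route add -net 10.0.0.0 netmask 255.255.255.0 gw 192.168.1.1
>>>
>>> (or simply a default route via the qmaster, when the nodes have no 
>>> other way out of their subnet).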
>>>
>>> Are all nodes still in one DNS domain? As long as the node names 
>>> are unique, I don't think you must install SGE to honor the FQDN.
>>>
>>>> to hold the name that resolves to the second qmaster interface's 
>>>> IP address.  This seems to be working (I can run jobs)
>>>
>>> Interesting, I wouldn't have expected this to work.
>>>
>>>> with the following exception: SGE has only allocated 1 slot per 
>>>> machine on the new cluster segment.
>>>>
>>>> admin@ornl28.ornl              BIP   0/1       0.00     lx24-amd64
>>>> admin@ornl29.ornl              BIP   0/1       0.00     lx24-amd64
>>>> admin@ornl30.ornl              BIP   0/1       0.00     lx24-amd64
>>>
>>> Maybe SGE added the wrong hostname for the slot definition in the 
>>> queue configuration. The number of slots defined in the queue 
>>> definition is not related to the physically installed cores, which 
>>> are correctly reported AFAICS.
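>>>
>>> You could verify this with e.g.:
>>>
>>>    qconf -sq <queue_name> | grep slots
>>>
>>> and check whether the new hosts appear there with the intended 
>>> values.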
>>>
>>> -- Reuti
>>>
>>> PS: When you have a) parallel jobs using nodes from both subclusters 
>>> or b) an additional login node from which you want to run 
>>> interactive jobs, then I think complete routing must be set up on 
>>> the qmaster.
>>>
>>>
>>>> Even though the machines have more than one processor and SGE 
>>>> indicates this:
>>>>
>>>> ornl28                  lx24-amd64      2  0.00    3.9G   61.6M  960.0M     0.0
>>>> ornl29                  lx24-amd64      4  0.00    7.8G  123.2M  960.0M     0.0
>>>> ornl30                  lx24-amd64      4  0.00    7.8G  128.0M  960.0M     0.0
>>>>
>>>> I have installed the machines using the same automated system that 
>>>> installs the other machines on the first cluster segment (with the 
>>>> exception of changing the act_qmaster file).  What could be the 
>>>> problem here?  Is there a better way to configure this cluster 
>>>> segment that needs to access the qmaster machine via a different 
>>>> interface?


-- 
Gerald Ragghianti
IT Administrator - High Performance Computing
http://hpc.usg.utk.edu/
Office of Information Technology
University of Tennessee
Phone: 865-974-2448
E-mail: geri at utk.edu


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



