[GE users] CPUs versus slots

Gerald Ragghianti geri at utk.edu
Mon Oct 27 20:40:45 GMT 2008



Hi Reuti,
I think I should tell you a bit more about how this system is set up.  
The first cluster segment is on a private network switch with the head 
node acting as the gateway; these nodes use the domain "local."  The 
second cluster segment's nodes are on the public network and use the 
domain "ornl."  Qmaster listens on both physical network interfaces.  
We do not share $SGE_ROOT via NFS: each node has its own copy of 
$SGE_ROOT, and the act_qmaster file is set to either qmaster.local or 
qmaster.ornl, depending on the node's domain.  This works, except that 
the queue instances for the "ornl" machines have only 1 slot each.
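If the wrong slot counts were simply recorded in the queue 
configuration, they can be corrected per queue instance.  A sketch, 
assuming the cluster queue is named all.q (substitute your actual 
queue name); the slot values here just mirror the core counts that 
qhost reports for these machines:

```shell
# Show the current slot definition; the "slots" line may carry
# per-host overrides in "[host=n]" form.
qconf -sq all.q | grep slots

# Set the slot count on each "ornl" queue instance to match its cores:
qconf -mattr queue slots 2 all.q@ornl28.ornl
qconf -mattr queue slots 4 all.q@ornl29.ornl
qconf -mattr queue slots 4 all.q@ornl30.ornl
```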

So I guess the more general question is: how should SGE be set up when 
you have a single qmaster host with compute nodes on two different 
switches, reached through different network interfaces?  We use a 
different DNS domain for each network segment so that configuration 
files can contain bare machine names; the DNS search order then 
supplies the domain and resolves each name to the correct qmaster 
interface.
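To make the per-segment resolution concrete, here is a rough sketch of 
what we rely on; the nameserver entries are placeholders, and the 
exact files will differ per site:

```shell
# /etc/resolv.conf on a "local" (private) node -- bare names get
# ".local" appended by the search list:
#     search local
#     nameserver <private-dns-ip>
#
# /etc/resolv.conf on an "ornl" (public) node -- bare names get
# ".ornl" appended:
#     search ornl
#     nameserver <public-dns-ip>

# So the same bare name resolves differently per segment:
getent hosts qmaster   # qmaster.local on one segment, qmaster.ornl on the other
```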

Thanks for the help.

- Gerald

Reuti wrote:
> Hi Gerald,
>
> On 27.10.2008, at 15:33, Gerald Ragghianti wrote:
>
>> Hi users,
>> I recently brought online a new cluster segment that connects to my 
>> qmaster machine via a second network interface (the machines are on a 
>> different subnet than the "normal" cluster nodes).
>
> you mean one of the two segments is on the primary interface, and now 
> you added a second network card for the other segment?
>
>>   I did this by changing the act_qmaster contents
>
> This means you are not sharing $SGE_ROOT/default/common? I don't see 
> why it's necessary to put the TCP/IP address there. When the 
> act_qmaster is in a different subnet, all that's necessary should be 
> to enter the qmaster as the gateway in the nodes' network setup. The 
> nodes will then discover that they should contact the gateway instead 
> of trying to reach this machine directly.
>
> Are all nodes still in one DNS domain? As long as the node names are 
> unique, I don't think you must configure SGE to honor the FQDN.
>
>> to indicate the ip name that corresponds to the second qmaster 
>> interface.  This seems to be working (I can run jobs)
>
> Interesting, I wouldn't have expected this to work.
>
>> with the following exception: SGE has only allocated 1 slot per 
>> machine on the new cluster segment.
>>
>> admin@ornl28.ornl              BIP   0/1       0.00     lx24-amd64   
>> admin@ornl29.ornl              BIP   0/1       0.00     lx24-amd64   
>> admin@ornl30.ornl              BIP   0/1       0.00     lx24-amd64
>
> Maybe SGE recorded the wrong hostname in the slot definition of the 
> queue configuration. The number of slots defined in the queue 
> definition is not related to the number of physically installed 
> cores, which are correctly reported AFAICS.
>
> -- Reuti
>
> PS: When you have a) parallel jobs spanning nodes from both 
> subclusters or b) an additional login node and want to run 
> interactive jobs, then complete routing must be set up on the 
> qmaster, I think.
>
>
>> Even though the machines have more than one processor and SGE 
>> indicates this:
>>
>> ornl28                  lx24-amd64      2  0.00    3.9G   61.6M  960.0M     0.0
>> ornl29                  lx24-amd64      4  0.00    7.8G  123.2M  960.0M     0.0
>> ornl30                  lx24-amd64      4  0.00    7.8G  128.0M  960.0M     0.0
>>
>> I have installed the machines using the same automated system that 
>> installs the other machines on the first cluster segment (with the 
>> exception of changing the act_qmaster file).  What could be the 
>> problem here?  Is there a better way to configure this cluster 
>> segment that needs to access the qmaster machine via a different 
>> interface?
>>
>> -- 
>> Gerald Ragghianti
>> IT Administrator - High Performance Computing
>> http://hpc.usg.utk.edu/
>> Office of Information Technology
>> University of Tennessee
>> Phone: 865-974-2448
>> E-mail: geri at utk.edu
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>
>



