[GE users] CPUs versus slots

maluyao ma.luyao at gmail.com
Tue Oct 28 14:40:55 GMT 2008



qconf -mq YOUR_QUEUE_NAME

and modify the "slots" attribute there.
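
E.g. the slots line in the queue configuration can carry per-host and
per-hostgroup overrides in square brackets; the leading value applies to
every host without an explicit override. Something like (hostgroup and
host names below are only placeholders):

slots    1,[@ornlhosts=4],[ornl28.ornl=2]

If the leading default is 1 and the new hosts have no override, every
instance ends up with 1 slot.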


2008/10/28 Gerald Ragghianti <geri at utk.edu>

> Hi Reuti,
> I found the problem that was causing the cluster queue instances to report
> only 1 slot.  When I created the cluster queue, I simply took an existing
> queue and modified the hostlist to include the new nodes.  This action did
> not put the hosts in the "Attributes for Host/Hostgroup" field.  As a
> consequence, the "slots" setting for the "@/" hostgroup took effect and was
> set to 1.  I recreated the cluster queue by cloning all.q and the instances
> now report the correct number of slots.
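>
> (For the record: the effective values can be checked with something like
> "qconf -sq all.q | grep slots"; the bracketed entries in the output are
> the per-host/hostgroup overrides. The queue name is just an example.)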
>
> Thanks,
>
> - Gerald
>
> Reuti wrote:
>
>> Hi Gerald,
>>
>> Am 27.10.2008 um 21:40 schrieb Gerald Ragghianti:
>>
>>> Hi Reuti,
>>> I think I should probably tell you a bit more about how this system is
>>> set up.  We have the first cluster segment on a private network switch with
>>> the head node as the gateway.  These nodes use the domain "local."  The
>>> second cluster segment has nodes that are on the public network and with the
>>> domain "ornl."  Qmaster listens on both physical network interfaces.
>>>
>>
>> you mean you gave both interfaces the same name in /etc/hosts on this
>> machine? I also did this in the past, and it does indeed work.
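>>
>> E.g. both addresses mapped to the same name in /etc/hosts (the
>> addresses here are of course only placeholders):
>>
>> 192.168.1.1   qmaster
>> 10.0.0.1      qmaster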
>>
>> As the domain names are different, the safe way would be to install SGE
>> with FQDN resolving enabled, give different names to the two interfaces,
>> and set up a proper gateway in the network. This is not directly related
>> to SGE, but to the general setup of the cluster/network. As your static
>> route setup is working, you can also stay with it if you prefer that
>> setup. The only pitfall might be that you can't copy a raw disk image
>> from one side of the cluster to use it on the other side, due to the
>> different routing setups.
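>>
>> (By FQDN resolving I mean the install-time choice that ends up in
>> $SGE_ROOT/default/common/bootstrap, roughly:
>>
>> ignore_fqdn      false
>> default_domain   none
>>
>> The exact values depend on your installation; treat this only as a
>> sketch.)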
>>
>>> We do not share $SGE_ROOT via NFS.  Each node has a copy of $SGE_ROOT
>>> and the act_qmaster file is modified to either qmaster.local or qmaster.ornl
>>> depending on the domain that the node is in.  This works except that the
>>> queue instances for "ornl" machines have only 1 slot each.
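>>>
>>> (The file holds just the single master hostname, i.e.
>>> $SGE_ROOT/default/common/act_qmaster contains the one line
>>> "qmaster.local" on the .local nodes and "qmaster.ornl" on the .ornl
>>> nodes.)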
>>>
>>
>> I'm still of the opinion that this is only because of the queue
>> definition. Can you please post it?
>>
>> -- Reuti
>>
>>
>>
>>> So I guess the more general question is: how should SGE be set up when
>>> you have a single qmaster node with nodes on two different switches on
>>> different network interfaces?  We have used different domains for each
>>> network segment so that we can use just the machine name in configuration
>>> files and the domain will be determined by the DNS search order and resolved
>>> to the correct qmaster interface.
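>>>
>>> (Concretely, /etc/resolv.conf on the two segments differs only in the
>>> search line, roughly "search local" on one side and "search ornl" on
>>> the other, so that the bare name "qmaster" resolves to the interface
>>> on the node's own segment.)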
>>>
>>> Thanks for the help.
>>>
>>> - Gerald
>>>
>>> Reuti wrote:
>>>
>>>> Hi Gerald,
>>>>
>>>> Am 27.10.2008 um 15:33 schrieb Gerald Ragghianti:
>>>>
>>>>> Hi users,
>>>>> I recently brought online a new cluster segment that connects to my
>>>>> qmaster machine via a second network interface (the machines are on a
>>>>> different subnet than the "normal" cluster nodes).
>>>>>
>>>>
>>>> you mean one of the two segments is on the primary interface, and you
>>>> now added a new network card for a different segment?
>>>>
>>>>> I did this by changing the act_qmaster contents
>>>>>
>>>>
>>>> This means you are not sharing $SGE_ROOT/default/common? I don't see
>>>> why it's necessary to put the TCP/IP address there. When the act_qmaster
>>>> is in a different subnet, all that should be necessary is to enter the
>>>> qmaster also as gateway in the nodes' network setup. They will then
>>>> discover that they should contact the gateway, instead of trying to
>>>> access this machine directly.
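>>>>
>>>> E.g. on a node in the second subnet, something along the lines of (all
>>>> addresses are only placeholders):
>>>>
>>>> route add -net 192.168.1.0 netmask 255.255.255.0 gw 10.0.0.1
>>>>
>>>> with 10.0.0.1 being the qmaster's interface in the node's own subnet.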
>>>>
>>>> Are all nodes still in one DNS domain? As long as the node names are
>>>> unique, I don't think you must install SGE to honor the FQDN.
>>>>
>>>>> to indicate the hostname that corresponds to the second qmaster
>>>>> interface.  This seems to be working (I can run jobs)
>>>>>
>>>>
>>>> Interesting, I wouldn't have expected this to work.
>>>>
>>>>> with the following exception: SGE has only allocated 1 slot per machine
>>>>> on the new cluster segment.
>>>>>
>>>>> admin at ornl28.ornl              BIP   0/1       0.00     lx24-amd64
>>>>> admin at ornl29.ornl              BIP   0/1       0.00     lx24-amd64
>>>>> admin at ornl30.ornl              BIP   0/1       0.00     lx24-amd64
>>>>>
>>>>
>>>> Maybe SGE added the wrong hostname for the slot definition in the queue
>>>> configuration. The number of slots defined in the queue definition is not
>>>> related to the physically installed cores, which are correctly reported
>>>> AFAICS.
>>>>
>>>> -- Reuti
>>>>
>>>> PS: When you have a) parallel jobs with nodes from both subclusters, or
>>>> b) an additional login node and want to run interactive jobs, then
>>>> complete routing must be set up on the qmaster, I think.
>>>>
>>>>
>>>>> Even though the machines have more than one processor and SGE indicates
>>>>> this:
>>>>>
>>>>> ornl28                  lx24-amd64      2  0.00    3.9G   61.6M  960.0M     0.0
>>>>> ornl29                  lx24-amd64      4  0.00    7.8G  123.2M  960.0M     0.0
>>>>> ornl30                  lx24-amd64      4  0.00    7.8G  128.0M  960.0M     0.0
>>>>>
>>>>> I have installed the machines using the same automated system that
>>>>> installs the other machines on the first cluster segment (with the exception
>>>>> of changing the act_qmaster file).  What could be the problem here?  Is
>>>>> there a better way to configure this cluster segment that needs to access
>>>>> the qmaster machine via a different interface?
>>>>>
>>>>
>>>
>>
>
> --
> Gerald Ragghianti
> IT Administrator - High Performance Computing
> http://hpc.usg.utk.edu/
> Office of Information Technology
> University of Tennessee
> Phone: 865-974-2448
> E-mail: geri at utk.edu
>
>


