[GE users] advice re: network topology aware MPI host selection

Reuti reuti at staff.uni-marburg.de
Tue Oct 14 16:58:11 BST 2008


Hi Chris,

On 14.10.2008, at 17:53, Chris Dagdigian wrote:

>
> Following up on this thread again ...
>
> Using the approach that Dan recommended below to "pin" certain  
> types of MPI jobs to certain blades that are advantageously close  
> in network topology terms:
>
> pe_list               make,[@hg1=make2],[@hg2=make2 make3] ...

Are you still hitting this limit when you enter the value across more
than one line using backslashes?

Or when setting export SGE_SINGLE_LINE=1?
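
For example, inside the editor opened by "qconf -mq bladeCluster.q" you
should be able to continue the value on the next line with a trailing
backslash (the host group and PE names below are just made up):

pe_list   make,[@chassis1=make MPI1],[@chassis2=make MPI2], \
          [@chassis3=make MPI3],[@chassis4=make MPI4]

or set the variable before calling qconf:

export SGE_SINGLE_LINE=1
qconf -mq bladeCluster.q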

-- Reuti

> ... we run into what appears to be a 100-character limit on the
> length of the "pe_list" line
>
> My main question: what do people think about the size limitation on
> the "pe_list" field?
>
> If I want this field to be bigger, should I file an Issue or an RFE?
>
> -Chris
>
>
>
>
> On Oct 3, 2008, at 10:32 AM, Daniel Templeton wrote:
>
>> Why not just add the PEs to the queue instances instead of to the  
>> queue, e.g.:
>>
>> %qconf -sq bladeCluster.q | grep pe_list
>> pe_list     make,[@chassis1=make MPI1],[@chassis2=make MPI2]
>>
>> or more clearly:
>>
>> %qconf -sq bladeCluster.q@@chassis1 | grep pe_list
>> pe_list     make MPI1
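>>
>> The @chassis host groups have to exist first, of course; something
>> like the following, created with "qconf -ahgrp @chassis1" (the blade
>> host names here are just made up):
>>
>> %qconf -shgrp @chassis1
>> group_name @chassis1
>> hostlist blade1-01 blade1-02 blade1-03 ... blade1-14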
>>
>> Daniel
>>
>> Chris Dagdigian wrote:
>>>
>>> Hi folks,
>>>
>>> I'm dealing with a latency-sensitive MPI application that is  
>>> having some trouble due to network topology issues.
>>>
>>> The scenario:
>>>
>>> - A standard blade cluster made up of many 14-blade chassis
>>> - 14 server blades per chassis
>>> - Each chassis has a pair of 7-port Gigabit internal switch modules
>>> with a 2 Gb/sec interconnect between them
>>> - Each chassis has a 100 Mbit external uplink to an aggregation
>>> switch that ties all the blade chassis together
>>>
>>> The cluster is (by design) not meant to support latency-sensitive
>>> apps, so we are just trying to do the best we can with the network
>>> we have.
>>>
>>> What we see:
>>>
>>> - The application performs great when all MPI nodes are within the
>>> same 7-port internal switch zone
>>> - It performs almost as well when the job stays within a single
>>> blade chassis and just spans the two internal switching modules
>>> - Performance drops to unacceptable levels when the job has to span
>>> multiple blade chassis (the 100 Mbit uplink is killing us)
>>>
>>> This falls into the standard sort of SGE question "how do I pin  
>>> my MPI jobs to particular groups of machines?"
>>>
>>> The normal best practice in this case would be:
>>>
>>> - Define multiple PEs
>>> - Create multiple queues, each with a custom tuned hostlist
>>> - Associate a PE to each of the new queues
>>> - Wildcard PE submission "qsub -pe MPI* "
>>>
>>> ... with wildcard PEs we'll be dispatched to a cluster queue that  
>>> just so happens to be configured with an ideal hostlist from a  
>>> topology perspective.
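>>>
>>> The submit line would then be something like this (the PE and
>>> script names are made up):
>>>
>>> %qsub -pe "MPI_chassis*" 8 mpi_job.sh
>>>
>>> and the scheduler picks whichever matching PE (and therefore which
>>> queue/hostlist) can currently satisfy the slot request.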
>>>
>>>
>>> I've got a problem with this though.
>>>
>>> It requires multiple queues to be set up: PEs themselves no longer
>>> have a hostlist associated with them, so any host groupings need to
>>> be done at the queue level. This particular user has bought entirely
>>> into the SGE mindset of "fewer queues are better" and is already
>>> doing really well with a single "bladeCluster.q" queue.
>>>
>>> Forcing a user who has already worked hard to get down to a single
>>> cluster queue to now deploy a bunch of extra queues, just to get
>>> topology-aware MPI host selection, is really not an ideal outcome.
>>>
>>> I know that the wildcard PE method would work. I would just  
>>> rather not configure a ton of new cluster queues just to get one  
>>> application performing better.
>>>
>>> Does anyone have any other methods that may work?
>>>
>>> How about hostgroups? Can I use wildcards with hostgroups? Can I
>>> submit with multiple comma- or space-separated queue requests?
>>>
>>> I was thinking about doing something like this:
>>>
>>> (1) define hostgroups according to ideal network topology
>>> (2) submit jobs with multiple queue requests like so:
>>>
>>> qsub -q "all.q@@NodeGroup1 all.q@@NodeGroup2 all.q@@NodeGroup3 ... "
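>>>
>>> or maybe as a comma separated list, if that's the separator qsub
>>> expects:
>>>
>>> qsub -q "all.q@@NodeGroup1,all.q@@NodeGroup2,all.q@@NodeGroup3"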
>>>
>>> Ideally wildcards would allow something like:
>>>
>>> qsub -q all.q@@NodeGrouping*
>>>
>>>
>>>
>>>
>>> Any tips or recommendations welcome. I'd really like to avoid  
>>> configuring multiple queues for one app.
>>>
>>> -Chris
>>>
>>
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



