[GE users] advice re: network topology aware MPI host selection

Chris Dagdigian dag at sonsorol.org
Tue Oct 14 16:53:28 BST 2008


Following up on this thread again ...

Using the approach that Dan recommended below to "pin" certain types  
of MPI jobs to certain blades that are advantageously close in network  
topology terms:

pe_list               make,[@hg1=make2],[@hg2=make2 make3] ...

... we run into an issue where we hit some sort of 100-character limit  
on the length of the "pe_list" field.
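
(As an aside: the per-hostgroup entries can also be added one at a  
time with "qconf -aattr" instead of editing the whole line at once.  
A rough sketch, re-using the hg1/hg2/make2/make3 names from the  
example above; I haven't checked whether this route gets around the  
limit, since the value presumably still ends up stored as a single  
pe_list line:)

%qconf -aattr queue pe_list make2 bladeCluster.q@@hg1
%qconf -aattr queue pe_list make2 bladeCluster.q@@hg2
%qconf -aattr queue pe_list make3 bladeCluster.q@@hg2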

My main question is what people think about the size limitation on  
the "pe_list" field.

If I want this field to be bigger, should I file an Issue or an RFE?

-Chris




On Oct 3, 2008, at 10:32 AM, Daniel Templeton wrote:

> Why not just add the PEs to the queue instances instead of to the  
> queue, e.g.:
>
> %qconf -sq bladeCluster.q | grep pe_list
> pe_list     make,[@chassis1=make MPI1],[@chassis2=make MPI2]
>
> or more clearly:
>
> %qconf -sq bladeCluster.q@@chassis1 | grep pe_list
> pe_list     make MPI1
>
> Daniel
>
> Chris Dagdigian wrote:
>>
>> Hi folks,
>>
>> I'm dealing with a latency-sensitive MPI application that is having  
>> some trouble due to network topology issues.
>>
>> The scenario:
>>
>> - A standard blade cluster made up of many 14-blade chassis
>> - 14 server blades per chassis
>> - In each chassis, a pair of 7-port Gigabit internal switch modules  
>> with a 2Gb/sec interconnect between them
>> - In each chassis, a 100Mbit external uplink to an aggregation  
>> switch that ties all the blade chassis together
>>
>> The cluster is (by design) not meant to support latency-sensitive  
>> apps, so we are just trying to do the best we can with the network  
>> we have.
>>
>> What we see:
>>
>> - The application performs great when all MPI nodes are within the  
>> same 7-port internal switch zone
>> - It performs almost as well when it stays within the same blade  
>> chassis and just spans the two internal switching modules
>> - Performance drops to unacceptable levels when the job has to span  
>> multiple blade chassis (the 100Mbit uplink is killing us)
>>
>> This falls into the standard sort of SGE question: "how do I pin my  
>> MPI jobs to particular groups of machines?"
>>
>> The normal best practice in this case would be:
>>
>> - Define multiple PEs
>> - Create multiple queues, each with a custom tuned hostlist
>> - Associate a PE to each of the new queues
>> - Wildcard PE submission "qsub -pe MPI* "
>>
>> ... with wildcard PEs we'll be dispatched to a cluster queue that  
>> just so happens to be configured with an ideal hostlist from a  
>> topology perspective.
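>>
>> (Roughly, with made-up PE and queue names, one queue/PE pair per  
>> chassis, the setup would look something like this:)
>>
>> %qconf -sq chassis1.q | egrep 'hostlist|pe_list'
>> hostlist              @chassis1
>> pe_list               MPI_chassis1
>>
>> %qsub -pe "MPI_chassis*" 8 ./my_mpi_job.sh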
>>
>>
>> I've got a problem with this though.
>>
>> It requires multiple queues to be set up, because PEs themselves no  
>> longer have a hostlist associated with them, so any host groupings  
>> need to be done at the queue level. This particular user has bought  
>> entirely into the SGE mindset of "fewer queues are better" and is  
>> already doing really well with a single "bladeCluster.q" queue.
>>
>> Forcing a user who has already worked hard to get down to a single  
>> cluster queue to now deploy a bunch of extra queues, just to get  
>> topology-aware MPI host selection, is really not an ideal outcome.
>>
>> I know that the wildcard PE method would work. I would just rather  
>> not configure a ton of new cluster queues just to get one  
>> application performing better.
>>
>> Does anyone have any other methods that may work?
>>
>> How about hostgroups? Can I use wildcards with hostgroups? Can I  
>> submit with multiple comma- or space-separated queue requests?
>>
>> I was thinking about doing something like this:
>>
>> (1) define hostgroups according to ideal network topology
>> (2) submit jobs with multiple queue requests like so:
>>
>> qsub -q "all.q@@NodeGroup1 all.q@@NodeGroup2 all.q@@NodeGroup3 ... "
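>>
>> (For step (1), each hostgroup would simply hold the blades of one  
>> chassis; the host names below are made up:)
>>
>> %qconf -shgrp @NodeGroup1
>> group_name @NodeGroup1
>> hostlist blade1-01 blade1-02 blade1-03 blade1-04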
>>
>> Ideally wildcards would allow something like:
>>
>> qsub -q all.q@@NodeGrouping*
>>
>>
>>
>>
>> Any tips or recommendations welcome. I'd really like to avoid  
>> configuring multiple queues for one app.
>>
>> -Chris
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



