[GE users] advice re: network topology aware MPI host selection

Chris Dagdigian dag at sonsorol.org
Fri Oct 3 15:23:26 BST 2008

Hi folks,

I'm dealing with a latency-sensitive MPI application that is having  
some trouble due to network topology issues.

The scenario:

  - Standard blade cluster made up of many 14-blade chassis
  - 14 server blades per chassis
  - In each chassis there is a pair of 7-port Gigabit internal switch
modules w/ a 2Gb/sec interconnect between them
  - In each chassis there is a 100Mbit external uplink to an  
aggregation switch that aggregates all the blade chassis

The cluster is (by design) not meant to support latency-sensitive apps,
so we are just trying to do the best we can with the network we have.

What we see:

  - Application performs great when all MPI nodes are within the 7-port
internal switch zone
  - Application performs almost as well when it stays within the same  
blade chassis and just spans the internal switching modules
  - Application performance drops to unacceptable levels when it has to
span multiple blade chassis (the 100Mbit link is killing us)

This falls into the standard sort of SGE question "how do I pin my MPI  
jobs to particular groups of machines?"

The normal best practice in this case would be:

  - Define multiple PEs
  - Create multiple queues, each with a custom tuned hostlist
  - Associate a PE to each of the new queues
  - Wildcard PE submission "qsub -pe MPI* "

... with wildcard PEs we'll be dispatched to a cluster queue that just
so happens to be configured with an ideal hostlist from a topology
perspective.
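
For anyone who has not set this up before, here is a rough sketch of what the wildcard PE approach could look like with one PE and one cluster queue per chassis. The names (mpi_chassisN, chassisN.q, job.sh) and the slot count are made up for illustration, and the PEs and queues are assumed to exist already. The script only prints the commands so they can be reviewed; it does not talk to SGE.

```shell
#!/bin/sh
# Sketch only: prints the commands for a per-chassis wildcard PE setup.
# mpi_chassisN, chassisN.q and job.sh are hypothetical names.
print_wildcard_pe_plan() {
    n_chassis=$1
    i=1
    while [ "$i" -le "$n_chassis" ]; do
        # qconf -aattr appends a value to a list attribute of an existing
        # object; here it would add PE mpi_chassis$i to the pe_list of
        # cluster queue chassis$i.q.
        echo "qconf -aattr queue pe_list mpi_chassis$i chassis$i.q"
        i=$((i + 1))
    done
    # The wildcard is quoted so the shell does not try to expand it.
    # 14 slots is an assumed request size (one full chassis).
    echo "qsub -pe 'mpi_chassis*' 14 job.sh"
}

print_wildcard_pe_plan 3
```

Each queue would still need its own hostlist restricted to the blades of one chassis, which is exactly the per-queue work described above.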

I've got a problem with this though.

It requires multiple queues to be set up because PEs themselves no
longer have a hostlist associated with them, so any host groupings need
to be done at the queue level. This particular user has bought
entirely into the SGE mindset of "less queues is better" and they are
already doing really well with a single "bladeCluster.q" queue.

Forcing a user who has already worked hard to get down to a single
cluster queue to now deploy a bunch of extra queues just to get
topology-aware MPI host selection is really not going to be an ideal
solution.

I know that the wildcard PE method would work. I would just rather not  
configure a ton of new cluster queues just to get one application  
performing better.

Does anyone have any other methods that may work?

How about hostgroups? Can I use wildcards with hostgroups? Can I  
submit with multiple comma or space separated queue requests?

I was thinking about doing something like this:

(1) define hostgroups according to ideal network topology
(2) submit jobs with multiple queue requests like so:

qsub -q "all.q@@NodeGroup1 all.q@@NodeGroup2 all.q@@NodeGroup3 ... "

Ideally wildcards would allow something like:

  qsub -q all.q@@NodeGrouping*
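
To make the multiple-request idea concrete, here is a small sketch that builds the -q argument from a list of hostgroup names. It assumes the comma-separated form is the one to use (the qsub man page describes -q as taking a comma-separated queue list), and the helper name build_queue_request and the NodeGroupN names are invented for this example. It only builds the string; it does not submit anything.

```shell
#!/bin/sh
# Sketch only: joins cluster_queue@@hostgroup entries with commas.
# build_queue_request and the NodeGroupN names are hypothetical.
build_queue_request() {
    queue=$1
    shift
    request=""
    for group in "$@"; do
        # Hostgroup names start with "@", so queue@@group means
        # "this cluster queue, restricted to hosts in hostgroup @group".
        request="${request:+$request,}$queue@@$group"
    done
    echo "$request"
}

build_queue_request all.q NodeGroup1 NodeGroup2 NodeGroup3
```

The result could then be passed along as qsub -q "$(build_queue_request all.q NodeGroup1 NodeGroup2)". Whether the scheduler would keep a job inside a single group when several are listed is still the open question here.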

Any tips or recommendations welcome. I'd really like to avoid  
configuring multiple queues for one app.


