[GE users] advice re: network topology aware MPI host selection
dag at sonsorol.org
Fri Oct 3 15:23:26 BST 2008
I'm dealing with a latency-sensitive MPI application that is having
some trouble due to network topology issues.
- Standard blade cluster made up of many 14-blade chassis
- 14 server blades per chassis
- In each chassis there is a pair of 7-port Gigabit internal switch
modules w/ a 2Gb/sec interconnect between them
- In each chassis there is a 100Mbit external uplink to an
aggregation switch that aggregates all the blade chassis
The cluster is (by design) not meant to support latency-sensitive apps,
so we are just trying to do the best we can with the network we have.
What we see:
- Application performs great when all MPI nodes are within the
7-port internal switch zone
- Application performs almost as well when it stays within the same
blade chassis and just spans the internal switching modules
- Application performance drops to unacceptable levels when it has to
span multiple blade chassis (the 100Mbit link is killing us)
This falls into the standard sort of SGE question "how do I pin my MPI
jobs to particular groups of machines?"
The normal best practice in this case would be:
- Define multiple PEs
- Create multiple queues, each with a custom tuned hostlist
- Associate a PE to each of the new queues
- Wildcard PE submission "qsub -pe MPI* "
... with wildcard PEs we'll be dispatched to a cluster queue that just
so happens to be configured with an ideal hostlist from a topology
perspective.
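For reference, the PE half of that recipe could be sketched roughly as below. This is only an illustration under assumptions: the PE names (MPI_chassis1..3), the queue names (chassis1.q..3.q), the slot count of 14 and job.sh are all made up, and the PE field list differs between SGE versions, so it should be compared against "qconf -sp" output for an existing PE. The script writes the definition files to a temporary directory and only prints the qconf/qsub commands rather than running them.

```shell
#!/bin/sh
# Sketch: one PE definition file per topology zone (all names hypothetical).
# Nothing is sent to SGE here; the qconf/qsub commands are only printed.

make_pe_file() {
    # $1 = PE name, $2 = slot count, $3 = output directory
    # Field list is an assumption; check it against `qconf -sp <existing_pe>`.
    cat > "$3/$1.pe" <<EOF
pe_name            $1
slots              $2
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    \$fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
EOF
}

outdir=$(mktemp -d)

for zone in 1 2 3; do
    pe="MPI_chassis$zone"
    make_pe_file "$pe" 14 "$outdir"
    # Register the PE from its file, then attach it to the zone's queue.
    # The chassisN.q queues would have to exist already (e.g. via qconf -aq).
    echo "qconf -Ap $outdir/$pe.pe"
    echo "qconf -aattr queue pe_list $pe chassis$zone.q"
done

# Wildcard PE request; quoted so the shell does not expand the '*'.
echo "qsub -pe 'MPI_chassis*' 14 job.sh"
```

The one-queue-per-zone loop is exactly the part I would like to avoid, which is what the rest of this message is about.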
I've got a problem with this though.
It requires multiple queues to be set up because PEs themselves no
longer have a hostlist associated with them so any host groupings need
to be done at the queue level. This particular user has bought
entirely into the SGE mindset of "less queues is better" and they are
already doing really well with a single "bladeCluster.q" queue.
Forcing a user who has already worked hard to get down to a single
cluster queue to now deploy a bunch of extra queues just to get the
topology-aware MPI host selection is really not going to be an ideal
solution.
I know that the wildcard PE method would work. I would just rather not
configure a ton of new cluster queues just to get one application
working.
Does anyone have any other methods that may work?
How about hostgroups? Can I use wildcards with hostgroups? Can I
submit with multiple comma- or space-separated queue requests?
I was thinking about doing something like this:
(1) define hostgroups according to ideal network topology
(2) submit jobs with multiple queue requests like so:
qsub -q "all.q@@NodeGroup1 all.q@@NodeGroup2 all.q@@NodeGroup3 ... "
Ideally wildcards would allow something like:
qsub -q all.q@@NodeGrouping*
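To make the idea concrete, here is a small sketch of how the request string could be assembled. It assumes the comma-separated form, which is how the qsub man page describes the -q queue list; the hostgroup names, the "mpi" PE name, the slot count and job.sh are placeholders. It only builds and prints the command, and it makes no claim that the scheduler will keep all slots inside a single hostgroup when several are listed.

```shell
#!/bin/sh
# Sketch: join hostgroup names into one comma-separated -q request.
# Queue, hostgroup, PE and script names are hypothetical placeholders.

build_queue_request() {
    # $1 = cluster queue name; remaining args = hostgroup names (no leading @)
    queue=$1
    shift
    request=""
    for group in "$@"; do
        # queue@@group selects the queue instances on the hosts in that group
        request="${request:+$request,}$queue@@$group"
    done
    printf '%s\n' "$request"
}

qlist=$(build_queue_request all.q NodeGroup1 NodeGroup2 NodeGroup3)

# Print the command instead of running it, so it can be checked first.
echo "qsub -q \"$qlist\" -pe mpi 14 job.sh"
```

Keeping the hostgroup names in one place like this would at least make it easy to regenerate the request if the groupings change.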
Any tips or recommendations welcome. I'd really like to avoid
configuring multiple queues for one app.