[GE users] advice re: network topology aware MPI host selection

Daniel Templeton Dan.Templeton at Sun.COM
Fri Oct 3 15:32:37 BST 2008


Why not just add the PEs to the queue instances instead of to the queue, 
e.g.:

% qconf -sq bladeCluster.q | grep pe_list
pe_list     make,[@chassis1=make MPI1],[@chassis2=make MPI2]

or more clearly:

% qconf -sq bladeCluster.q@@chassis1 | grep pe_list
pe_list     make MPI1
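
With more than a couple of chassis, typing that pe_list value by hand gets error-prone. Below is a minimal sketch that only prints the line; it does not touch the Grid Engine configuration. It assumes hostgroups named @chassis1..@chassisN and PEs named MPI1..MPIN already exist, following the naming in the example above. The function name build_pe_list is made up for illustration.

```shell
#!/bin/sh
# Sketch: print a pe_list line that gives each chassis hostgroup its own PE.
# Assumption: hostgroups @chassis1..@chassisN and PEs MPI1..MPIN already
# exist; the names follow the example above and are otherwise hypothetical.
build_pe_list() {
    count=$1
    list="make"                      # default value for hosts with no override
    i=1
    while [ "$i" -le "$count" ]; do
        # one bracketed override per hostgroup: [@group=pe pe ...]
        list="$list,[@chassis$i=make MPI$i]"
        i=$((i + 1))
    done
    printf 'pe_list     %s\n' "$list"
}

build_pe_list 2
```

The printed line could then be pasted in while editing the queue with "qconf -mq bladeCluster.q". Jobs would still be submitted with a wildcard PE request; quoting it, as in -pe "MPI*", keeps the shell from expanding the asterisk.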

Daniel

Chris Dagdigian wrote:
>
> Hi folks,
>
> I'm dealing with a latency-sensitive MPI application that is having 
> some trouble due to network topology issues.
>
> The scenario:
>
>  - Standard blade cluster made up of many 14 blade chassis
>  - 14 server blades per chassis
>  - In each chassis there is a pair of 7-port Gigabit internal switch 
> modules with a 2Gb/sec interconnect between them
>  - In each chassis there is a 100Mbit external uplink to an 
> aggregation switch that aggregates all the blade chassis
>
> The cluster is (by design) not meant to support latency sensitive apps 
> so we are just trying to do the best we can with the network we have.
>
> What we see:
>
>  - Application performs great when all MPI nodes are within the 7-port 
> internal switch zone
>  - Application performs almost as well when it stays within the same 
> blade chassis and just spans the internal switching modules
>  - Application performance drops to unacceptable levels when it has to 
> span multiple blade chassis (the 100Mbit link is killing us)
>
> This falls into the standard sort of SGE question "how do I pin my MPI 
> jobs to particular groups of machines?"
>
> The normal best practice in this case would be:
>
>  - Define multiple PEs
>  - Create multiple queues, each with a custom tuned hostlist
>  - Associate a PE to each of the new queues
>  - Wildcard PE submission "qsub -pe MPI* "
>
> ... with wildcard PEs we'll be dispatched to a cluster queue that just 
> so happens to be configured with an ideal hostlist from a topology 
> perspective.
>
>
> I've got a problem with this though.
>
> It requires multiple queues to be set up because PEs themselves no 
> longer have a hostlist associated with them so any host groupings need 
> to be done at the queue level. This particular user has bought 
> entirely into the SGE mindset of "less queues is better" and they are 
> already doing really well with a single "bladeCluster.q" queue.
>
> Forcing a user who has already worked hard to get down to a single 
> cluster queue to now deploy a bunch of extra queues just to get the 
> topology-aware MPI host selection is really not going to be an ideal 
> outcome.
>
> I know that the wildcard PE method would work. I would just rather not 
> configure a ton of new cluster queues just to get one application 
> performing better.
>
> Does anyone have any other methods that may work?
>
> How about hostgroups? Can I use wildcards with hostgroups? Can I 
> submit with multiple comma or space separated queue requests?
>
> I was thinking about doing something like this:
>
> (1) define hostgroups according to ideal network topology
> (2) submit jobs with multiple queue requests like so:
>
> qsub -q "all.q@@NodeGroup1 all.q@@NodeGroup2 all.q@@NodeGroup3 ... "
>
> Ideally wildcards would allow something like:
>
>  qsub -q all.q@@NodeGrouping*
>
> Any tips or recommendations welcome. I'd really like to avoid 
> configuring multiple queues for one app.
>
> -Chris
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
