Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Hideo Saito, Kenjiro Taura (University of Tokyo)

{h_saito, tau}@logos.ic.i.u-tokyo.ac.jp

Overview

  • Profiling-based optimizations for wide-area message passing systems

    • Locality-aware connection management

    • Locality-aware rank assignment

  • Multi-Cluster MPI (MC-MPI)

    • An adaptive wide-area message passing system that uses the proposed optimizations

    • C and Fortran bindings for most of MPI 1.1

  • Performance evaluation using the NAS Parallel Benchmarks

    • 256 real nodes distributed across 4 clusters

Experimental results

  • Experimental environment

    • The arrows indicate the directions in which connections could be established (Cluster B had a firewall that allowed outgoing connections but prevented incoming connections)

    • The times above the arrows indicate the inter-cluster RTT (the intra-cluster RTT was between 60 and 120 microseconds)

[Figure: Clusters A, B, C and D (64 nodes each), connected by arrows labeled with the inter-cluster RTTs: 10.8 ms, 6.8 ms, 6.9 ms, 4.4 ms, 4.3 ms, 0.3 ms]

Related work


  • Wide-area message passing systems

    • MPICH-G2 [Karonis et al., ’03], Grid MPI [Matsuda et al., ’05], MPICH/MADIII [Aumage et al., ’03]

  • P2P overlay networks

    • Bamboo [Rhea et al., ’01]

  • Performance of the NPB with varying numbers of connections

Profiling run

  • Obtain a traffic matrix T and a latency matrix L from a profiling run

  • Traffic matrix ($T = \{t_{ij}\}$)

    • $t_{ij}$: traffic (number of messages) between ranks $i$ and $j$

    • Execute the application for a short amount of time and make $t_{ij}$ the number of messages transmitted during that time

  • Latency matrix ($L = \{l_{ij}\}$)

    • $l_{ij}$: latency (measured or estimated RTT) between processes $i$ and $j$

    • Use the triangle inequality to estimate RTTs between faraway processes
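The profiling step above can be sketched as follows. This is a minimal illustration, not MC-MPI's code: the function names, the `(src, dst)` message log format, the symmetric counting of messages, and the use of the triangle-inequality upper bound min over k of (l_ik + l_kj) as the estimate for an unmeasured RTT are all assumptions made for the example.

```python
import math

def build_traffic_matrix(n, message_log):
    """Count messages between each pair of ranks.

    message_log is a hypothetical list of (src_rank, dst_rank) pairs
    recorded during a short profiling run. Counts are kept symmetric.
    """
    t = [[0] * n for _ in range(n)]
    for src, dst in message_log:
        t[src][dst] += 1
        t[dst][src] += 1
    return t

def estimate_latencies(measured):
    """Fill in unmeasured RTTs (stored as math.inf).

    For an unmeasured pair (i, j), the triangle inequality gives
    l_ij <= l_ik + l_kj for any intermediate process k, so the smallest
    such sum over the measured entries is used as the estimate.
    """
    n = len(measured)
    estimated = [row[:] for row in measured]
    for i in range(n):
        for j in range(n):
            if i != j and math.isinf(measured[i][j]):
                estimated[i][j] = min(
                    (measured[i][k] + measured[k][j]
                     for k in range(n) if k not in (i, j)),
                    default=math.inf,
                )
    return estimated
```

Pairs that have no measured two-hop path stay at infinity in this sketch; a real system would need more measurements or a multi-hop estimate for those.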

[Figure panels: (a) BT, (b) EP, (c) IS]

Locality-aware connection establishment

  • Establish connections between just a subset of all process pairs

    ($n$: number of processes, $\alpha$: parameter that controls connection density)

    • Select all $\alpha$ of the $\alpha$ processes with the shortest $l_{ij}$

    • Select $\alpha$ of the processes with the $(\alpha 2^{k-1}+1)$-st to the $(\alpha 2^{k})$-th shortest $l_{ij}$, where the probability that process $j$ will be selected is proportional to $t_{ij}$ ($k = 1, 2, \ldots, \log_2 (n/\alpha)$)

  • Satisfied properties

    (assume, for simplicity, that the $n$ processes are distributed equally among $c$ clusters)

    • Connections established by each process: $O(\log n)$

    • Inter-cluster connections established: $O(n \log c)$

  • Build a routing layer using the selected connections

  • Lazy connection establishment

    • Establish selected connections on demand

    • Further reduces the number of connections that are established for applications in which each process only communicates with a few other processes (e.g., SOR)
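The candidate selection described above can be sketched for a single process as follows. This is an illustration under stated assumptions, not the MC-MPI implementation: `alpha` stands for the connection-density parameter, the function names are hypothetical, sampling within a band is done without replacement, and a band with no recorded traffic falls back to uniform sampling.

```python
import random

def weighted_sample(items, weights, count, rng):
    """Draw up to `count` distinct items, each draw proportional to its weight."""
    items, weights = list(items), list(weights)
    chosen = []
    while items and len(chosen) < count:
        if sum(weights) > 0:
            idx = rng.choices(range(len(items)), weights=weights)[0]
        else:
            idx = rng.randrange(len(items))  # assumption: uniform if no traffic
        chosen.append(items.pop(idx))
        weights.pop(idx)
    return chosen

def select_candidates(i, latency, traffic, alpha, rng=random):
    """Return the processes that process i may connect to directly."""
    n = len(latency)
    # Other processes, nearest first.
    others = sorted((j for j in range(n) if j != i), key=lambda j: latency[i][j])
    selected = others[:alpha]  # all of the alpha nearest processes
    k = 1
    while alpha * 2 ** (k - 1) < len(others):
        # Band k: the (alpha*2^(k-1) + 1)-st to the (alpha*2^k)-th nearest.
        band = others[alpha * 2 ** (k - 1): alpha * 2 ** k]
        band_weights = [traffic[i][j] for j in band]
        selected += weighted_sample(band, band_weights, alpha, rng)
        k += 1
    return selected
```

Each band is twice as wide as the previous one but contributes at most `alpha` candidates, so a process keeps dense connectivity to nearby processes and progressively sparser connectivity to faraway ones. With n = 16 and alpha = 2 this yields 2 + 2·log2(16/2) = 8 candidates per process.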

[Figure panels: (d) LU, (e) MG, (f) SP]

  • Comparison of lazy connection establishment methods

  • MC-MPI

    • The connection-density parameter $\alpha$ was selected so that each process was allowed to establish at most 30% of its possible connections

  • MPICH-like

    • Established connections on demand without preselecting candidate connections (another way to think of this is that it preselected all connections)
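The difference between the two methods can be illustrated with a small sketch. The class name, the `connect` callback, and the convention of returning `None` for a non-candidate peer are hypothetical choices made for this example; they are not taken from MC-MPI or MPICH.

```python
class LazyConnector:
    """Opens connections on first use, restricted to a preselected candidate set."""

    def __init__(self, candidates, connect):
        self.candidates = set(candidates)  # peers this process may connect to directly
        self.connect = connect             # hypothetical callback that opens a connection
        self.open = {}                     # peer -> established connection

    def get(self, peer):
        """Return a direct connection to peer, or None if peer is not a candidate."""
        if peer not in self.candidates:
            return None  # the caller would forward the message via the routing layer
        if peer not in self.open:
            self.open[peer] = self.connect(peer)  # established on demand, only once
        return self.open[peer]
```

In this sketch, the MC-MPI method corresponds to passing only the preselected candidates, while the MPICH-like method corresponds to passing every other process as a candidate.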






Locality-aware rank assignment

  • Performance of the NPB with different rank assignments

  • Find a rank-process mapping with low communication overhead

    • Map the rank assignment problem to the Quadratic Assignment Problem

  • Quadratic Assignment Problem (QAP)

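    • Given two $n \times n$ cost matrices, $T$ and $L$, find a permutation $p$ of $\{0, 1, \ldots, n-1\}$ (rank $i$ is assigned to process $p(i)$) that minimizes:

      $\sum_{i=0}^{n-1} \sum_{j=0}^{n-1} t_{ij} \, l_{p(i)p(j)}$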

  • Solving QAPs

    • The QAP is NP-hard, but there are heuristics that find good suboptimal solutions

    • Library based on GRASP (Greedy Randomized Adaptive Search Procedure) [Resende et al., ’96]

    • Tested against QAPLIB [Burkard et al., ’97], a publicly available collection of QAP instances

      • Instances of up to $n = 256$

      • $n$ processors for problem size $n$

      • Found approximate solutions within one to two percent of the best known solution in under one second
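To make the QAP formulation concrete, the sketch below evaluates the cost of a rank-to-process mapping (traffic between ranks i and j multiplied by the latency between the processes they are mapped to, summed over all pairs) and improves it with random-restart pairwise-swap local search. This is a much simpler stand-in for GRASP, not the library used in MC-MPI; all function names and parameters are hypothetical.

```python
import random

def qap_cost(perm, traffic, latency):
    """Sum of traffic[i][j] * latency[perm[i]][perm[j]] over all rank pairs.

    perm[i] is the process that rank i is assigned to.
    """
    n = len(perm)
    return sum(traffic[i][j] * latency[perm[i]][perm[j]]
               for i in range(n) for j in range(n))

def local_search(perm, traffic, latency):
    """Swap the processes of two ranks for as long as a swap lowers the cost."""
    best = qap_cost(perm, traffic, latency)
    improved = True
    while improved:
        improved = False
        for a in range(len(perm)):
            for b in range(a + 1, len(perm)):
                perm[a], perm[b] = perm[b], perm[a]
                cost = qap_cost(perm, traffic, latency)
                if cost < best:
                    best, improved = cost, True
                else:
                    perm[a], perm[b] = perm[b], perm[a]  # undo the swap
    return perm, best

def assign_ranks(traffic, latency, restarts=20, seed=0):
    """Run local search from several random permutations and keep the best."""
    rng = random.Random(seed)
    n = len(traffic)
    best_perm, best_cost = None, float("inf")
    for _ in range(restarts):
        perm = list(range(n))
        rng.shuffle(perm)
        perm, cost = local_search(perm, traffic, latency)
        if cost < best_cost:
            best_perm, best_cost = perm[:], cost
    return best_perm, best_cost
```

Recomputing the full cost after every swap takes O(n^2) time per evaluation, which is acceptable for a small illustration; an implementation aimed at n = 256 would update the cost incrementally.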

  • QAP (MC-MPI)

    • Assigned ranks based on our locality-aware rank assignment scheme

  • Hostname

    • Sorted the processes by host name and assigned ranks in that order

  • Random

    • Assigned ranks randomly
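The two baseline assignments are simple enough to state in a few lines. The sketch below is an illustration only; the function names and the convention that entry r of the returned list is the process given rank r are assumptions.

```python
import random

def hostname_assignment(hostnames):
    """Rank r goes to the process whose host name is r-th in sorted order.

    hostnames[p] is the host name of process p.
    """
    return sorted(range(len(hostnames)), key=lambda p: hostnames[p])

def random_assignment(n, seed=None):
    """Ranks are given to processes in a random order."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm
```

Sorting by host name tends to give consecutive ranks to processes in the same cluster when nodes of a cluster share a naming prefix, which fits the observation that Hostname did well on benchmarks whose traffic is concentrated between neighboring ranks.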

  • BT, LU and SP

    • MC-MPI performed just as well as Hostname and much better than Random

  • MG

    • MC-MPI outperformed not only Random but also Hostname

    • Rank 0 communicated mostly with ranks 1, 3, 4, 28, 32 and 224

  • EP and IS

    • All three rank assignments performed the same

    • EP involved little communication

    • IS had a uniform communication pattern

Future work

  • An API to allow profiling to be performed within a single run

  • Full paper to appear in CCGRID’07 (Rio de Janeiro, May 14-17, 2007)