
Locality-aware Connection Management and Rank Assignment for Wide-area MPI

Hideo Saito, Kenjiro Taura (University of Tokyo)

{h_saito, tau}

Overview

  • Profiling-based optimizations for wide-area message passing systems
    • Locality-aware connection management
    • Locality-aware rank assignment
  • Multi-Cluster MPI (MC-MPI)
    • An adaptive wide-area message passing system that uses the proposed optimizations
    • C and Fortran bindings for most of MPI 1.1
  • Performance evaluation using the NAS Parallel Benchmarks
    • 256 real nodes distributed across 4 clusters

Experimental results

  • Experimental environment
    • The arrows indicate the directions in which connections could be established (Cluster B had a firewall that allowed outgoing connections but prevented incoming connections)
    • The times above the arrows indicate the inter-cluster RTT (the intra-cluster RTT was between 60 and 120 microseconds)

Figure: Clusters A, B, C and D (64 nodes each), connected by arrows labeled with the inter-cluster RTTs 10.8 ms, 6.8 ms, 6.9 ms, 4.4 ms, 4.3 ms and 0.3 ms.

Related work


  • Wide-area message passing systems
    • MPICH-G2 [Karonis et al., ’03], Grid MPI [Matsuda et al., ’05], MPICH/MADIII [Aumage et al., ’03]
  • P2P overlay networks
    • Bamboo [Rhea et al., ’01]

  • Experimental results: performance of the NPB with varying numbers of connections

Profiling run

  • Obtain a traffic matrix T and a latency matrix L from a profiling run
  • Traffic matrix (T = {tij})
    • tij: traffic (number of messages) between ranks i and j
    • Execute the application for a short amount of time and make tij the number of messages transmitted during that time
  • Latency matrix (L = {lij})
    • lij: latency (measured or estimated RTT) between processes i and j
    • Use the triangle inequality to estimate RTTs between faraway processes
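
The two matrices described above can be sketched as follows. This is only an illustration under stated assumptions, not MC-MPI's implementation: the function names, the message-log format (a list of `(src, dst)` pairs), and the symmetric counting of messages are hypothetical. For pairs with no measured RTT, the sketch takes the tightest bound that the triangle inequality gives over measured paths (a shortest-path computation); the slides do not say exactly how the estimate is formed.

```python
INF = float("inf")

def traffic_matrix(n, message_log):
    """Build T, where t[i][j] is the number of messages exchanged
    between ranks i and j during the profiling run."""
    t = [[0] * n for _ in range(n)]
    for src, dst in message_log:
        t[src][dst] += 1
        t[dst][src] += 1  # assumption: traffic is counted symmetrically
    return t

def estimate_latencies(measured):
    """Build L from a partially measured RTT matrix.

    measured[i][j] is an RTT, or None if the pair was not measured.
    Unmeasured entries are replaced by the smallest sum of measured RTTs
    along any path from i to j, which is an upper bound on l[i][j] if
    RTTs obey the triangle inequality. Measured entries are kept as-is.
    """
    n = len(measured)
    d = [[INF if x is None else x for x in row] for row in measured]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return [[d[i][j] if measured[i][j] is None else measured[i][j]
             for j in range(n)] for i in range(n)]

# Two clusters of two processes; RTTs in microseconds.
measured = [
    [0,    100,  5000, None],
    [100,  0,    None, None],
    [5000, None, 0,    100],
    [None, None, 100,  0],
]
L = estimate_latencies(measured)
T = traffic_matrix(3, [(0, 1), (1, 0), (0, 2)])
```

In this example only one inter-cluster RTT is measured (between processes 0 and 2); the remaining inter-cluster entries are filled in from it and the intra-cluster RTTs.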

Figure panels: (a) BT, (b) EP, (c) IS

Locality-aware connection establishment

  • Establish connections between just a subset of all process pairs

(n: number of processes,: parameter that controls connection density)

    • Select all  of the  processes with the shortest lij
    • Select  of the (2k-1+1)-st to the (2k)-th shortest lij, where the probability that process j will be selected is proportional to tij (k = 1, 2, ..., log2n/)
  • Satisfied properties

(assume, for simplicity, that the n processes are distributed equally among c clusters)

    • Connections established by each process: O(log n)
    • Inter-cluster connections established: O(n log c)
  • Build a routing layer using the selected connections
  • Lazy connection establishment
    • Establish selected connections on demand
    • Further reduces the number of connections that are established for applications in which each process only communicates with a few other processes (e.g., SOR)
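
The selection rule above can be sketched for a single process as follows. This is a minimal illustration, not MC-MPI's code: `select_candidates`, `weighted_sample`, and the parameter name `alpha` (standing for the connection-density parameter) are hypothetical, and the tiny weight floor that keeps zero-traffic peers selectable is an added assumption.

```python
import random

def weighted_sample(items, weights, count, rng):
    """Pick up to `count` distinct items, each draw proportional to its weight."""
    pool = list(items)
    # Assumption: a tiny floor so that peers with zero traffic can still be drawn.
    w = [x + 1e-9 for x in weights]
    chosen = []
    while pool and len(chosen) < count:
        idx = rng.choices(range(len(pool)), weights=w)[0]
        chosen.append(pool.pop(idx))
        w.pop(idx)
    return chosen

def select_candidates(i, latency, traffic, alpha, rng=random):
    """Return the set of peers that process i preselects for connections."""
    n = len(latency)
    # Peers of i, nearest first.
    others = sorted((j for j in range(n) if j != i),
                    key=lambda j: latency[i][j])
    selected = set(others[:alpha])  # all of the alpha nearest peers
    k = 1
    while (2 ** (k - 1)) * alpha < len(others):
        # Band k: the (2^(k-1)*alpha + 1)-st to the (2^k*alpha)-th nearest peers.
        band = others[(2 ** (k - 1)) * alpha:(2 ** k) * alpha]
        selected.update(
            weighted_sample(band, [traffic[i][j] for j in band], alpha, rng))
        k += 1
    return selected

# Toy input: 64 processes on a line, uniform traffic.
n, alpha = 64, 4
latency = [[abs(i - j) for j in range(n)] for i in range(n)]
traffic = [[1] * n for _ in range(n)]
candidates = select_candidates(0, latency, traffic, alpha, random.Random(1))
```

With n = 64 and alpha = 4 there are log2(64/4) = 4 bands, so process 0 preselects 4 + 4 × 4 = 20 peers, in line with the O(log n) bound on connections per process.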

Figure panels: (d) LU, (e) MG, (f) SP

  • Comparison of lazy connection establishment methods
  • MC-MPI
    • α (the connection-density parameter) was selected so that the maximum percentage of connections allowed by each process was 30%
  • MPICH-like
    • Established connections on demand without preselecting candidate connections (another way to think of this is that it preselected all connections)
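
The difference between the two methods compared above can be illustrated with a small sketch. The class `LazyConnections` and its return values are hypothetical; an actual system would open a socket where the sketch only records the peer, and would forward "routed" messages through the routing layer built over the selected connections.

```python
class LazyConnections:
    """Tracks which preselected connections have actually been opened."""

    def __init__(self, candidates):
        self.candidates = set(candidates)   # peers we are allowed to connect to
        self.established = set()            # peers we have connected to so far

    def send(self, dest):
        """Return "direct" if a direct connection is (or becomes) available,
        "routed" if the message must be forwarded through other processes."""
        if dest in self.established:
            return "direct"
        if dest in self.candidates:
            self.established.add(dest)      # stand-in for a real connect()
            return "direct"
        return "routed"

peers = range(1, 8)
destinations = [1, 1, 5, 2]

mc_mpi_like = LazyConnections({1, 2, 3})    # preselected subset of peers
mpich_like = LazyConnections(peers)         # every peer is a candidate

mc_results = [mc_mpi_like.send(d) for d in destinations]
mpich_results = [mpich_like.send(d) for d in destinations]
```

Both variants open connections only when a message is first sent; the MPICH-like variant simply never refuses a destination, so its connection count is bounded only by the application's communication pattern.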






Locality-aware rank assignment

  • Performance of the NPB with different rank assignments
  • Find a rank-process mapping with low communication overhead
    • Map the rank assignment problem to the Quadratic Assignment Problem
  • Quadratic Assignment Problem (QAP)
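    • Given two n × n cost matrices, T and L, find a permutation p of {0, 1, ..., n-1} that minimizes Σi Σj tij · lp(i)p(j) (the traffic between ranks i and j weighted by the latency between the processes p(i) and p(j) to which they are mapped)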
  • Solving QAPs
    • The QAP is NP-hard, but there are heuristics to find good suboptimal solutions
    • Library based on GRASP (Greedy, Randomized, Adaptive Search Procedure) [Resende et al., ’96]
    • Tested against QAPLIB [Burkard et al., ’97], a publicly available collection of QAPs
      • Instances of up to n = 256
      • n processors for problem size n
      • Found approximate solutions within one to two percent of the best known solution in under one second
  • QAP (MC-MPI)
    • Assigned ranks based on our locality-aware rank assignment scheme
  • Hostname
    • Sorted the processes by host name and assigned ranks in that order
  • Random
    • Assigned ranks randomly
  • BT, LU and SP
    • MC-MPI performed just as well as Hostname and much better than Random
  • MG
    • MC-MPI outperformed not only Random but also Hostname
    • Rank 0 communicated mostly with ranks 1, 3, 4, 28, 32 and 224
  • EP and IS
    • All three rank assignments performed the same
    • EP involved little communication
    • IS had a uniform communication pattern
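
The QAP formulation of rank assignment used in this section can be sketched as follows. The cost function is the standard QAP objective; the search, however, is a plain pairwise-swap local search substituted here for brevity, not the GRASP-based library the slides refer to. The function names and the toy matrices are hypothetical.

```python
def qap_cost(t, l, p):
    """Total communication cost when rank i runs on process p[i]:
    the sum over all i, j of t[i][j] * l[p[i]][p[j]]."""
    n = len(p)
    return sum(t[i][j] * l[p[i]][p[j]] for i in range(n) for j in range(n))

def swap_local_search(t, l, start):
    """Repeatedly swap the processes of two ranks while that lowers the cost.
    Stops at a local optimum, which is not necessarily the global one."""
    p = list(start)
    best = qap_cost(t, l, p)
    improved = True
    while improved:
        improved = False
        for a in range(len(p)):
            for b in range(a + 1, len(p)):
                p[a], p[b] = p[b], p[a]
                cost = qap_cost(t, l, p)
                if cost < best:
                    best = cost
                    improved = True
                else:
                    p[a], p[b] = p[b], p[a]  # undo a swap that did not help
    return p, best

# Toy instance: ranks 0 and 2 exchange many messages, as do ranks 1 and 3.
# Processes 0 and 1 share a cluster, as do processes 2 and 3.
t = [[0, 0, 10, 0],
     [0, 0, 0, 10],
     [10, 0, 0, 0],
     [0, 10, 0, 0]]
l = [[0, 1, 100, 100],
     [1, 0, 100, 100],
     [100, 100, 0, 1],
     [100, 100, 1, 0]]

identity = [0, 1, 2, 3]
mapping, cost = swap_local_search(t, l, identity)
```

Starting from the identity mapping, which places each heavily communicating pair in different clusters, the search ends with both pairs placed inside a single cluster each.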

Future work

  • An API to allow profiling to be performed within a single run
  • Full paper to appear in CCGRID’07 (Rio de Janeiro, May 14-17, 2007)