## How Scalable is Your Load Balancer?


Ümit V. Çatalyürek, Doruk Bozdağ

The Ohio State University

Karen D. Devine, Erik G. Boman

Sandia National Laboratories

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

[Figure labels: b, A, x]

Partitioning and Load Balancing

- Goal: assign data to processors to
  - minimize application runtime
  - maximize utilization of computing resources
- Metrics:
  - minimize processor idle time (balance workloads)
  - keep inter-processor communication costs low
- Impacts performance of a wide range of simulations:
  - Linear solvers & preconditioners
  - Adaptive mesh refinement
  - Contact detection
  - Particle simulations

Roadmap

- Background
  - Graph Partitioning and Partitioning Tools
  - Hypergraph Partitioning and Partitioning Tools
- State-of-the-art: Parallel Multilevel Partitioning
  - Data Layout/Decomposition
  - Recursive Bisection
  - Multilevel Partitioning
    - Coarsening
    - Coarse Partitioning
    - Refinement
- Results

Graph Partitioning

- Work-horse of load-balancing community.
- Highly successful model for PDE problems.
- Model problem as a graph:
  - vertices = work associated with data (computation)
  - edges = relationships between data/computation (communication)
- Goal: Evenly distribute vertex weight while minimizing weight of cut edges.
- Many successful algorithms
  - Kernighan, Lin, Simon, Hendrickson, Leland, Kumar, Karypis, et al.
- Excellent software available
  - Serial: Chaco (SNL), Jostle (U. Greenwich), METIS (U. Minn.), Party (U. Paderborn), Scotch (U. Bordeaux)
  - Parallel: ParMETIS (U. Minn.), PJostle (U. Greenwich)
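
The two quantities in the goal above can be made concrete with a small sketch. The grid graph, the unit weights, and the function names are illustrative assumptions, not taken from any of the listed tools.

```python
from collections import defaultdict

def edge_cut(edges, part):
    """Sum the weights of edges whose endpoints lie in different parts."""
    return sum(w for u, v, w in edges if part[u] != part[v])

def imbalance(vertex_weight, part, num_parts):
    """Heaviest part load divided by the average load (1.0 is perfect balance)."""
    load = defaultdict(float)
    for v, w in vertex_weight.items():
        load[part[v]] += w
    average = sum(vertex_weight.values()) / num_parts
    return max(load[p] for p in range(num_parts)) / average

# Hypothetical 2x3 grid graph with unit vertex and edge weights:
#   0-1-2
#   | | |
#   3-4-5
edges = [(0, 1, 1), (1, 2, 1), (3, 4, 1), (4, 5, 1),
         (0, 3, 1), (1, 4, 1), (2, 5, 1)]
vertex_weight = {v: 1.0 for v in range(6)}

# Split by rows: perfectly balanced, three cut edges.
part_rows = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
# Split off the left column: fewer cut edges, but unbalanced.
part_left = {0: 0, 3: 0, 1: 1, 2: 1, 4: 1, 5: 1}

for name, part in (("rows", part_rows), ("left column", part_left)):
    print(name, edge_cut(edges, part), imbalance(vertex_weight, part, 2))
```

The second partition trades balance for a smaller cut, which is the tension a partitioner has to resolve.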

Limited Applicability of Graph Models

[Figures: hexahedral finite element matrix; linear programming matrix for sensor placement]

- Assume symmetric square problems.
  - Symmetric = undirected graph.
  - Square = inputs and outputs of operation are same size.
- Do not naturally support:
  - Non-symmetric systems.
    - Require directed or bipartite graph.
    - Partition $A + A^T$.
  - Rectangular systems.
    - Require decompositions for differently sized inputs and outputs.
    - Partition $AA^T$.
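
The two workarounds named above can be sketched on sparsity patterns alone. This is a minimal illustration assuming a matrix is given as a list of (row, column) nonzero positions; the helper names are hypothetical.

```python
def symmetrized_pattern(nonzeros):
    """Off-diagonal pattern of A + A^T as a set of undirected edges (i, j), i < j."""
    return {(min(i, j), max(i, j)) for i, j in nonzeros if i != j}

def row_graph_pattern(nonzeros):
    """Off-diagonal pattern of A A^T: rows i and k are adjacent if they share a column."""
    rows_in_column = {}
    for i, j in nonzeros:
        rows_in_column.setdefault(j, set()).add(i)
    edges = set()
    for rows in rows_in_column.values():
        for i in rows:
            for k in rows:
                if i < k:
                    edges.add((i, k))
    return edges

# Non-symmetric square pattern: (2, 0) is present but (0, 2) is not.
square = [(0, 1), (1, 0), (2, 0), (1, 1)]
print(symmetrized_pattern(square))

# Rectangular 2x3 pattern: rows 0 and 1 share column 1.
rectangular = [(0, 0), (0, 1), (1, 1), (1, 2)]
print(row_graph_pattern(rectangular))
```

Both constructions yield an undirected graph that a graph partitioner can accept, but the direction of each dependency and the distinction between rows and columns are no longer represented.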

Approximate Communication Metric in Graph Models

[Figures: example graph with vertices Vh, Vi, Vj, Vk, Vl, Vm and parts P1, P3, P4; hexahedral finite element matrix; Xyce ASIC matrix]

- Graph models assume
  - Weight of edge cuts = Communication volume.
- But edge cuts only approximate communication volume.
- Good enough for many PDE applications.
- Not good enough for irregular problems.
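
A small sketch shows where the approximation breaks down. It assumes each cut edge stands for one value exchanged in each direction, and that a vertex's value actually needs to be sent only once to each remote part holding a neighbor. The star graph and function names are illustrative.

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])

def comm_volume(edges, part):
    """Count (vertex, remote part) pairs: one send per remote part holding a neighbor."""
    remote_parts = {}
    for u, v in edges:
        if part[u] != part[v]:
            remote_parts.setdefault(u, set()).add(part[v])
            remote_parts.setdefault(v, set()).add(part[u])
    return sum(len(parts) for parts in remote_parts.values())

# Hypothetical star: vertex 0 sits in part 0, its three neighbors in part 1.
edges = [(0, 1), (0, 2), (0, 3)]
part = {0: 0, 1: 1, 2: 1, 3: 1}

graph_estimate = 2 * edge_cut(edges, part)  # one value each way per cut edge
actual = comm_volume(edges, part)           # vertex 0 is sent to part 1 only once
print(graph_estimate, actual)
```

The gap grows with the number of neighbors a vertex has in the same remote part, which is why highly connected, irregular problems are affected most.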

Hypergraph Partitioning

- Work-horse of VLSI community
- Recently adopted by scientific community
- Model problem as a hypergraph (Çatalyürek & Aykanat):
  - vertices = work associated with data (computation)
  - edges = data elements, computation-to-data dependencies (communication)
- Goal: Evenly distribute vertex weight while minimizing cut size.
- Many successful algorithms
  - Kernighan, Schweikert, Fiduccia, Mattheyses, Sanchis, Alpert, Kahng, Hauck, Borriello, Çatalyürek, Aykanat, Karypis, et al.
- Excellent serial software available
  - hMETIS (Karypis)
  - PaToH (Çatalyürek)
  - Mondriaan (Bisseling)
- Parallel partitioners needed for large, dynamic problems.
  - Zoltan PHG (Sandia)
  - Parkway (Trifunovic)
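
The slide does not say which cut-size definition is meant. A common choice in the hypergraph partitioning literature is the connectivity-1 metric, where each net contributes the number of parts it touches minus one. The sketch below assumes that metric with unit net costs; the nets and partition are made up for illustration.

```python
def connectivity_cut(nets, part):
    """Sum over nets of (number of distinct parts among the net's pins) - 1."""
    return sum(len({part[v] for v in pins}) - 1 for pins in nets.values())

# Hypothetical hypergraph: four nets over six vertices, split into three parts.
nets = {
    "a": [0, 1, 2],
    "b": [2, 3],
    "c": [3, 4, 5],
    "d": [0, 5],
}
part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}

# Net "b" is internal to part 1; nets "a", "c", and "d" each span two parts.
print(connectivity_cut(nets, part))
```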

[Figure: example hypergraph with vertices Vh, Vi, Vj, Vk, Vl, Vm and nets nh, ni, nk, nl, nm across parts P3, P4]

Impact of Hypergraph Models

- Greater expressiveness => greater applicability.
  - Structurally non-symmetric systems
    - circuits, biology
  - Rectangular systems
    - linear programming, least-squares methods
  - Non-homogeneous, highly connected topologies
    - circuits, nanotechnology, databases
  - Multiple models for different granularity partitioning
    - Owner compute, fine-grain, checkerboard/cartesian, Mondriaan
- Accurate communication model => lower application communication costs.

Mondriaan Partitioning

Courtesy of Rob Bisseling

Design Decisions

- Single-level vs Multi-level [serial/parallel]
- Direct k-way vs Recursive Bisection [serial/parallel]
- Data Layout [parallel]
- Chicken-and-egg Problem

Recursive Bisection

- Recursive bisection approach:
  - Partition data into two sets.
  - Recursively subdivide each set into two sets.
  - Only minor modifications needed to allow $P \neq 2^n$.
- How to parallelize:
  - Start on single node; use additional nodes in each recursion
    - Doesn’t solve Memory Problem
  - Solve each recursion (hence each level of multilevel) in parallel
    - At each recursion two split options:
      - Split only the data into two sets; use all processors to compute each branch.
      - Split both the data and processors into two sets; solve branches in parallel.
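
The "minor modification" for a part count that is not a power of two can be sketched serially: split the parts unevenly and give each side a proportional share of the weight. The bisector here is a deliberately naive stand-in (it cuts an ordered list by cumulative weight), not the multilevel bisection the talk describes.

```python
def recursive_bisection(items, weights, num_parts, first_part=0, out=None):
    """Assign each item a part id in [first_part, first_part + num_parts)."""
    if out is None:
        out = {}
    if num_parts == 1:
        for item in items:
            out[item] = first_part
        return out
    # Uneven split of the parts, e.g. 3 -> 1 + 2, with a matching weight target.
    left_parts = num_parts // 2
    target = sum(weights[i] for i in items) * left_parts / num_parts
    # Placeholder bisector: take items in order until the target weight is reached.
    acc, split = 0.0, 0
    while split < len(items) and acc < target:
        acc += weights[items[split]]
        split += 1
    recursive_bisection(items[:split], weights, left_parts, first_part, out)
    recursive_bisection(items[split:], weights, num_parts - left_parts,
                        first_part + left_parts, out)
    return out

items = list(range(10))
weights = {i: 1 for i in items}
part = recursive_bisection(items, weights, num_parts=3)
sizes = [list(part.values()).count(p) for p in range(3)]
print(sizes)
```

In the parallel setting, the two recursive calls correspond to the second split option above when the processors are divided along with the data.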

Parallel Data Layout: Hypergraph

- Zoltan uses 2D data layout within hypergraph partitioner.
- Matrix representation of Hypergraphs (Çatalyürek & Aykanat)
  - vertices == columns
  - nets (hyperedges) == rows
- Vertex/hyperedge broadcasts to only sqrt(P) processors.
- Maintain scalable memory usage.
- No “ghosting” of off-processor neighbor info.
- Differs from parallel graph partitioners and Parkway.
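
The sqrt(P) bound follows from placing processors on a 2D grid. The sketch below assumes a simple cyclic mapping of nets to processor rows and vertices to processor columns; the actual distribution used by Zoltan may differ, and the function names are hypothetical.

```python
import math

def owner(net, vertex, grid_rows, grid_cols):
    """Processor (row, col) holding the pin (net, vertex) under a cyclic 2D layout."""
    return (net % grid_rows, vertex % grid_cols)

def procs_holding_vertex(vertex, grid_rows, grid_cols):
    """A vertex is a matrix column, so its pins lie in a single processor column."""
    return [(r, vertex % grid_cols) for r in range(grid_rows)]

def procs_holding_net(net, grid_rows, grid_cols):
    """A net is a matrix row, so its pins lie in a single processor row."""
    return [(net % grid_rows, c) for c in range(grid_cols)]

num_procs = 16
grid_rows = grid_cols = math.isqrt(num_procs)  # 4 x 4 processor grid

# Information about vertex 5 or net 6 involves 4 processors, not all 16.
print(procs_holding_vertex(5, grid_rows, grid_cols))
print(procs_holding_net(6, grid_rows, grid_cols))
print(owner(6, 5, grid_rows, grid_cols))
```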

Coarsening via Parallel Matching

- Vertex-based greedy maximal weight matching algorithms
  - Graph: Heavy Edge Matching
  - Hypergraph: Heavy connectivity matching (Çatalyürek), Inner-product matching (Bisseling)
    - Match columns (vertices) with greatest inner product => greatest similarity in connectivity
- BSP style: with and without Coloring/Phases or Rounds
  - ParMETIS: 2C or 2S (S~=4) communication steps
    - Each processor concurrently decides possibly conflicting matching
    - 2 communications to resolve conflicts
  - Zoltan: 4R communication steps (each w/ max procs)
    - Broadcast subset of vertices (“candidates”) along processor row
    - Compute (partial) inner products of received candidates with local vertices
    - Accrue inner products in processor column
    - Identify best local matches for received candidates
    - Send best matches to candidates’ owners
    - Select best global match for each owned candidate
    - Send “match accepted” messages to processors owning matched vertices
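
A serial sketch of greedy inner-product matching may help fix the idea before the parallel rounds. It assumes unit net weights, so the inner product of two vertices is the number of nets they share, and it breaks ties by lowest vertex id. The candidate broadcasts and conflict resolution of the parallel version are not modeled.

```python
def inner_product_matching(vertex_nets):
    """Pair each unmatched vertex with the unmatched vertex sharing the most nets."""
    # Invert the vertex -> nets map so neighbors can be found through shared nets.
    net_pins = {}
    for v, nets in vertex_nets.items():
        for n in nets:
            net_pins.setdefault(n, []).append(v)

    match = {}
    for v in sorted(vertex_nets):
        if v in match:
            continue
        # Accumulate the inner product with every unmatched vertex sharing a net.
        score = {}
        for n in vertex_nets[v]:
            for u in net_pins[n]:
                if u != v and u not in match:
                    score[u] = score.get(u, 0) + 1
        if score:
            best = max(sorted(score), key=score.get)
            match[v], match[best] = best, v
    return match

# Hypothetical hypergraph: vertices 0 and 1 share two nets, 2 and 3 share one.
vertex_nets = {0: {"a", "b"}, 1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}}
match = inner_product_matching(vertex_nets)
print(match)
```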

Contraction

- Merge matched vertices
  - Graph/Hypergraph: communicate match array
  - Graph:
    - communicate adjacency lists of matched vertices from different processors
  - Hypergraph:
    - Merging vertices reduces only one dimension
    - To reduce the other dimension:
      - Remove size 1 nets
      - Remove identical nets
        - Compute hash via horizontal communication
        - Compare nets via vertical communication
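
The hypergraph contraction steps can be sketched serially. A dictionary keyed by the pin set plays the role of the hash-then-compare step. Adding the weights of identical nets is an assumption made here so that the cut size stays comparable across levels; the slide only says identical nets are removed.

```python
def representative(v, match):
    """Coarse vertex id: the smaller id of a matched pair, or v itself if unmatched."""
    return min(v, match.get(v, v))

def contract(nets, match):
    """Merge matched vertices, drop size-1 nets, and collapse identical nets."""
    coarse = {}
    for pins, weight in nets:
        merged = frozenset(representative(v, match) for v in pins)
        if len(merged) < 2:
            continue  # a net with a single pin can never be cut
        # Identical pin sets hash to the same key; their weights are summed (assumption).
        coarse[merged] = coarse.get(merged, 0) + weight
    return coarse

# Hypothetical input: (pins, weight) pairs and a matching of 0-1 and 2-3.
nets = [((0, 1), 1), ((0, 1, 2), 1), ((2, 3), 1), ((1, 2), 1)]
match = {0: 1, 1: 0, 2: 3, 3: 2}
coarse_nets = contract(nets, match)
print(coarse_nets)
```

Four nets over four vertices shrink to one net over two coarse vertices, so both dimensions of the matrix view are reduced.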

Coarse Partitioning

- Gathers coarsest graph/hypergraph to each processor
- Bisection/k-way (Zoltan):
  - Compute several different coarse partitions on each processor
  - Select best local partition
  - Compute best over all processors
- k-way (ParMETIS):
  - Do recursive bisection; each proc follows single path in the recursion
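
The "several tries per processor, keep the global best" scheme can be sketched as follows. Random balanced bisections stand in for whatever coarse partitioning heuristics are actually used, a loop over seeds stands in for processors working concurrently, and the final comparison stands in for a global reduction. All names and parameters are illustrative.

```python
import random

def cut_size(nets, part):
    """Number of nets with pins on both sides of the bisection."""
    return sum(1 for pins in nets if len({part[v] for v in pins}) > 1)

def random_balanced_bisection(vertices, rng):
    """Shuffle the vertices and put the first half in part 0, the rest in part 1."""
    order = list(vertices)
    rng.shuffle(order)
    half = len(order) // 2
    return {v: (0 if i < half else 1) for i, v in enumerate(order)}

def best_coarse_partition(vertices, nets, num_ranks=4, tries_per_rank=8, seed=0):
    best = None
    for rank in range(num_ranks):  # stands in for processors working concurrently
        rng = random.Random(seed + rank)
        candidates = [random_balanced_bisection(vertices, rng)
                      for _ in range(tries_per_rank)]
        local_best = min(candidates, key=lambda p: cut_size(nets, p))
        # Stands in for a global min-reduction over all processors.
        if best is None or cut_size(nets, local_best) < cut_size(nets, best):
            best = local_best
    return best

vertices = range(6)
nets = [(0, 1, 2), (2, 3), (3, 4, 5), (0, 5)]
best = best_coarse_partition(vertices, nets)
print(cut_size(nets, best))
```

Because the coarsest hypergraph is small and replicated, these tries need no communication until the final comparison.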

Uncoarsening/Refinement

- Project coarse partition to finer graph/hypergraph
- Use local optimization (KL/FM) to improve balance and reduce cuts
- Repeat until you reach finest graph/hypergraph
- Each FM round consists of a number of phases (colors)
  - Graph: C or S=2 communication steps per round (max 2 rounds)
  - Hypergraph: 3 communication steps per round (max 10 rounds) but concurrency is limited to
    - Compute “root” processor in each processor column: processor with most nonzeros
    - Compute and communicate pin distribution
    - Compute and communicate gains
    - Root processor computes moves for vertices in processor column
    - Communicate best moves and update pin distribution
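
The pin distribution and gain computations named above can be sketched for a serial bisection. This assumes unit net costs and a cut-net objective, and it omits the balance constraint, move ordering, and rollback that a full FM pass needs.

```python
def pin_distribution(nets, part):
    """For each net, the number of its pins in part 0 and in part 1."""
    return [[sum(1 for v in pins if part[v] == side) for side in (0, 1)]
            for pins in nets]

def gain(v, nets, part, pins_in):
    """Reduction in the number of cut nets if vertex v moves to the other part."""
    src, dst = part[v], 1 - part[v]
    total = 0
    for k, pins in enumerate(nets):
        if v not in pins:
            continue
        if pins_in[k][src] == 1 and pins_in[k][dst] > 0:
            total += 1  # v is the last pin on its side: the net becomes uncut
        elif pins_in[k][dst] == 0 and pins_in[k][src] > 1:
            total -= 1  # the net is internal now and would become cut
    return total

# Hypothetical bisection with two cut nets: (1, 2) and (1, 3).
nets = [(0, 1), (1, 2), (2, 3), (1, 3)]
part = {0: 0, 1: 0, 2: 1, 3: 1}
pins_in = pin_distribution(nets, part)

# Moving vertex 1 uncuts two nets and cuts one; moving vertex 0 only cuts one.
print(gain(1, nets, part, pins_in), gain(0, nets, part, pins_in))
```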

Experimental Results

[Figure: Cage Electrophoresis]

- Experiments on
  - Thunderbird cluster
    - 4,480 compute nodes connected with Infiniband
    - Dual 3.6 GHz Intel EM64T processors with 6 GB RAM
  - BMI-RI cluster
    - 64 compute nodes connected with Infiniband
    - Dual 2.4 GHz AMD Opteron processors with 8 GB RAM
- Zoltan v2.1 hypergraph partitioner & ParMETIS v3.1 graph partitioner
- Test problems:
  - Xyce ASIC Stripped: 680K x 680K; 2.3M nonzeros
  - Cage15 Electrophoresis: 5.1M x 5.1M; 99M nonzeros
  - Mrng4: 7.5M x 7.5M; ~67M nonzeros
  - 2DGrid: 16M x 16M; ~88M nonzeros

Results: Weak Scaling

- On BMI-RI cluster
- cage12 has ~130K vertices; cage13, cage14, and cage15 are 3.4x, 11.5x, and 39.5x the size of cage12

Results: Weak Scaling

- On BMI-RI cluster
- ParMETIS on Cray T3E uses SHMEM
- mrng1 has ~500K vertices (p=2); mrng2, mrng3, and mrng4 are 4x, 16x, and 32x the size of mrng1

Results: Weak Scaling

- On BMI-RI cluster
- On average, 15% smaller communication volume with hypergraph!?

Results: Strong Scaling

- On Thunderbird cluster
- On average, 17% smaller communication volume with hypergraph

Results: Strong Scaling

- On Thunderbird cluster
- Balance problems with ParMETIS

Conclusion

- Do we need parallel partitioning?
  - It is a must, not for speed but because of memory! Big problems need big machines!
- Do load-balancers scale?
  - Yes, they do scale, but they need to be improved to scale to Petascale
  - Memory usage becomes a problem for larger problems and #Procs
- One size won’t fit all: many things depend on the application
  - Model depends on application
  - Trade-Offs
  - Techniques may depend on application/data
  - Performance depends on application/data => bring us more applications/data :)

CSCAPES: Combinatorial Scientific Computing and Petascale Simulation

- A SciDAC Institute Funded by DOE’s Office of Science

Alex Pothen, Florin Dobrian, Assefaw Gebremedhin

Erik Boman, Karen Devine, Bruce Hendrickson

Paul Hovland, Boyana Norris, Jean Utke

Michelle Strout

Umit Catalyurek

www.cscapes.org
