
How Scalable is Your Load Balancer?



  1. How Scalable is Your Load Balancer? Ümit V. Çatalyürek, Doruk Bozdağ The Ohio State University Karen D. Devine, Erik G. Boman Sandia National Laboratories Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

  2. Partitioning and Load Balancing (figure: Ax = b) • Goal: assign data to processors to • minimize application runtime • maximize utilization of computing resources • Metrics: • minimize processor idle time (balance workloads) • keep inter-processor communication costs low • Impacts performance of a wide range of simulations: • linear solvers & preconditioners • adaptive mesh refinement • contact detection • particle simulations
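
A tiny sketch of these two metrics in plain Python; the weights, part assignment, and edges are invented for illustration and are not from the talk:

    # Minimal sketch of the two load-balancing metrics (hypothetical data).
    # weights[v] = computational work of vertex v; part[v] = owning processor.
    weights = {0: 2.0, 1: 1.0, 2: 3.0, 3: 1.0, 4: 2.0, 5: 1.0}
    part    = {0: 0,   1: 0,   2: 1,   3: 1,   4: 1,   5: 0}
    edges   = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 5)]

    nparts = max(part.values()) + 1
    loads = [0.0] * nparts
    for v, w in weights.items():
        loads[part[v]] += w

    # Imbalance: heaviest part relative to the average load (1.0 is perfect).
    imbalance = max(loads) / (sum(loads) / nparts)

    # Communication proxy: number (weight) of edges whose endpoints are
    # assigned to different processors.
    cut_edges = sum(1 for u, v in edges if part[u] != part[v])

    print(f"loads={loads} imbalance={imbalance:.2f} cut_edges={cut_edges}")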

  3. Roadmap • Background • Graph Partitioning and Partitioning Tools • Hypergraph Partitioning and Partitioning Tools • State-of-the-art: Parallel Multilevel Partitioning • Data Layout/Decomposition • Recursive Bisection • Multilevel Partitioning • Coarsening • Coarse Partitioning • Refinement • Results

  4. Graph Partitioning • Work-horse of load-balancing community. • Highly successful model for PDE problems. • Model problem as a graph: • vertices = work associated with data (computation) • edges = relationships between data/computation (communication) • Goal: Evenly distribute vertex weight while minimizing weight of cut edges. • Many successful algorithms • Kernighan, Lin, Simon, Hendrickson, Leland, Kumar, Karypis, et al. • Excellent software available • Serial: Chaco (SNL), Jostle (U. Greenwich), METIS (U. Minn.), Party (U. Paderborn), Scotch (U. Bordeaux) • Parallel: ParMETIS (U. Minn.), PJostle (U. Greenwich)

  5. Limited Applicability of Graph Models (figures: hexahedral finite element matrix; linear programming matrix for sensor placement) • Assume symmetric, square problems. • Symmetric = undirected graph. • Square = inputs and outputs of the operation are the same size. • Do not naturally support: • Non-symmetric systems. • Require a directed or bipartite graph. • Partition A + A^T. • Rectangular systems. • Require decompositions for differently sized inputs and outputs. • Partition AA^T.
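
As a sketch of the symmetrization workaround above, the following plain-Python snippet builds an undirected graph from the nonzero pattern of A + A^T for an invented 4x4 nonsymmetric matrix:

    # Build an undirected graph from the sparsity pattern of A + A^T
    # (toy nonsymmetric pattern; nonzeros stored as (row, col) pairs).
    nonzeros = [(0, 1), (1, 2), (2, 0), (3, 1)]   # A is 4x4, not symmetric

    adjacency = {v: set() for v in range(4)}
    for i, j in nonzeros:
        if i != j:                 # ignore the diagonal
            adjacency[i].add(j)    # edge from the nonzero a_ij ...
            adjacency[j].add(i)    # ... and its transpose entry a_ji

    for v, nbrs in adjacency.items():
        print(v, sorted(nbrs))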

  6. Approximate Communication Metric in Graph Models (figures: partitioned graph with vertices Vh–Vm across parts P1–P4; hexahedral finite element matrix; Xyce ASIC matrix) • Graph models assume • weight of cut edges = communication volume. • But edge cuts only approximate communication volume. • Good enough for many PDE applications. • Not good enough for irregular problems.
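
A small hypothetical example of the gap (plain Python; the vertex names echo the slide's figure, but the edges and parts are invented): a vertex with several cut edges into the same remote part sends its value there only once, so the edge-cut metric overestimates the true volume.

    # Hypothetical partitioned graph: Vi has three neighbors in part 1,
    # yet its value crosses the boundary only once.
    part  = {"Vi": 0, "Vj": 0, "Vk": 1, "Vl": 1, "Vm": 1}
    edges = [("Vi", "Vk"), ("Vi", "Vl"), ("Vi", "Vm"), ("Vj", "Vk")]

    # Edge-cut proxy: assume one value moves in each direction per cut edge.
    edge_cut = sum(1 for u, v in edges if part[u] != part[v])
    estimated_volume = 2 * edge_cut

    # Actual volume: each vertex sends its value once to every remote part
    # that owns at least one of its neighbors, regardless of how many cut
    # edges lead there.
    remote_parts = {}
    for u, v in edges:
        if part[u] != part[v]:
            remote_parts.setdefault(u, set()).add(part[v])
            remote_parts.setdefault(v, set()).add(part[u])
    actual_volume = sum(len(p) for p in remote_parts.values())

    print(f"edge cut = {edge_cut}, edge-cut estimate = {estimated_volume}, "
          f"actual volume = {actual_volume}")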

  7. Hypergraph Partitioning • Work-horse of the VLSI community • Recently adopted by the scientific community • Model problem as a hypergraph (Çatalyürek & Aykanat): • vertices = work associated with data (computation) • hyperedges = data elements, computation-to-data dependencies (communication) • Goal: Evenly distribute vertex weight while minimizing cut size. • Many successful algorithms • Kernighan, Schweikert, Fiduccia, Mattheyses, Sanchis, Alpert, Kahng, Hauck, Borriello, Çatalyürek, Aykanat, Karypis, et al. • Excellent serial software available • hMETIS (Karypis), PaToH (Çatalyürek), Mondriaan (Bisseling) • Parallel partitioners needed for large, dynamic problems. • Zoltan PHG (Sandia), Parkway (Trifunovic)
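
A minimal sketch of the cut metric hypergraph partitioners commonly minimize, the connectivity-1 (lambda - 1) metric, on an invented three-net hypergraph (plain Python):

    # Hypothetical hypergraph: each net (hyperedge) is a set of vertices.
    part = {"Vi": 0, "Vj": 0, "Vk": 1, "Vl": 1, "Vm": 2}
    nets = {
        "n1": {"Vi", "Vj"},             # internal net: spans one part
        "n2": {"Vi", "Vk", "Vl"},       # spans parts {0, 1}
        "n3": {"Vj", "Vk", "Vm"},       # spans parts {0, 1, 2}
    }

    # Connectivity-1 cut: each net contributes (number of parts it touches) - 1.
    # This equals the communication volume in the column-net model of
    # sparse matrix-vector multiplication.
    cut = 0
    for vertices in nets.values():
        lam = len({part[v] for v in vertices})   # connectivity of the net
        cut += lam - 1

    print("connectivity-1 cut =", cut)            # 0 + 1 + 2 = 3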

  8. Impact of Hypergraph Models (figures: partitioned hypergraph with vertices Vh–Vm, nets nh–nm, parts P1–P4; Mondriaan partitioning courtesy of Rob Bisseling) • Greater expressiveness → greater applicability. • Structurally non-symmetric systems • circuits, biology • Rectangular systems • linear programming, least-squares methods • Non-homogeneous, highly connected topologies • circuits, nanotechnology, databases • Multiple models for different partitioning granularities • owner-computes, fine-grain, checkerboard/Cartesian, Mondriaan • Accurate communication model → lower application communication costs.

  9. Parallel Graph/Hypergraph Partitioning

  10. Design Decisions • Single-level vs Multi-level [serial/parallel] • Direct k-way vs Recursive Bisection [serial/parallel] • Data Layout [parallel] • Chicken-and-egg Problem

  11. Multilevel Partitioning Scheme

  12. Recursive Bisection • Recursive bisection approach: • Partition data into two sets. • Recursively subdivide each set into two sets. • Only minor modifications needed to allow P ≠ 2^n. • How to parallelize: • Start on a single node; use additional nodes in each recursion • Doesn't solve the memory problem • Solve each recursion (hence each level of the multilevel scheme) in parallel • At each recursion, two split options: • Split only the data into two sets; use all processors to compute each branch. • Split both the data and the processors into two sets; solve branches in parallel.
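
A serial sketch of recursive bisection into P parts, including the small modification that handles P ≠ 2^n by splitting the target part count proportionally (plain Python; the bisect routine is a naive greedy stand-in for a real bisection algorithm, not the partitioners' method):

    def bisect(items, ratio):
        """Naive stand-in for a bisection routine: greedily fill the left
        side with (vertex, weight) items until it reaches ratio * total."""
        target = ratio * sum(w for _, w in items)
        left, acc = [], 0.0
        for v, w in items:
            if acc < target:
                left.append((v, w))
                acc += w
        left_set = {v for v, _ in left}
        right = [(v, w) for v, w in items if v not in left_set]
        return left, right

    def recursive_bisection(items, nparts, first_part, labels):
        if nparts == 1:
            for v, _ in items:
                labels[v] = first_part
            return
        # Split the part count (and the weight) proportionally, so that
        # part counts that are not powers of two work as well.
        left_parts = nparts // 2
        left, right = bisect(items, left_parts / nparts)
        recursive_bisection(left, left_parts, first_part, labels)
        recursive_bisection(right, nparts - left_parts, first_part + left_parts, labels)

    labels = {}
    items = [(v, 1.0) for v in range(10)]          # 10 unit-weight vertices
    recursive_bisection(items, 3, 0, labels)       # partition into P = 3 parts
    print(labels)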

  13. Parallel Data Layout: Graph

  14. Parallel Data Layout: Graph

  15. Parallel Data Layout: Hypergraph • Zoltan uses 2D data layout within hypergraph partitioner. • Matrix representation of Hypergraphs (Çatalyürek & Aykanat) • vertices == columns • nets (hyperedges) == rows • Vertex/hyperedge broadcasts to only sqrt(P) processors. • Maintain scalable memory usage. • No “ghosting” of off-processor neighbor info. • Differs from parallel graph partitioners and Parkway.
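
A sketch of how a 2D layout of the hypergraph's matrix representation assigns pins (nonzeros) to a sqrt(P) x sqrt(P) processor grid (plain Python); the cyclic row/column maps below are an assumption for illustration, not Zoltan's actual distribution:

    # 2D (checkerboard) layout sketch: nets are rows, vertices are columns,
    # and each pin (net, vertex) lands on one processor of a q x q grid.
    import math

    P = 16
    q = int(math.isqrt(P))                # sqrt(P) x sqrt(P) processor grid

    def owner(net, vertex):
        """Processor (row, col) that stores the pin (net, vertex)."""
        return (net % q, vertex % q)

    pins = [(0, 5), (0, 9), (3, 5), (7, 2)]   # toy (net, vertex) pairs
    for net, vertex in pins:
        print(f"pin ({net},{vertex}) -> processor {owner(net, vertex)}")

    # With this layout, any message about vertex 5 only involves the
    # q = sqrt(P) processors in its processor column.
    procs_for_vertex_5 = {(r, 5 % q) for r in range(q)}
    print("processors holding pins of vertex 5:", sorted(procs_for_vertex_5))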

  16. Coarsening via Parallel Matching • Vertex-based greedy maximal-weight matching algorithms • Graph: heavy-edge matching • Hypergraph: heavy-connectivity matching (Çatalyürek); inner-product matching (Bisseling) • Match columns (vertices) with greatest inner product → greatest similarity in connectivity • BSP style: with and without coloring/phases (C/S) or rounds (R) • ParMETIS: 2C or 2S (S ≈ 4) communication steps • Each processor concurrently decides possibly conflicting matchings • 2 communications to resolve conflicts • Zoltan: 4R communication steps (each with max procs) • Broadcast a subset of vertices (“candidates”) along the processor row • Compute (partial) inner products of received candidates with local vertices • Accrue inner products in the processor column • Identify best local matches for received candidates • Send best matches to candidates’ owners • Select best global match for each owned candidate • Send “match accepted” messages to processors owning matched vertices
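
A serial sketch of greedy inner-product matching on toy data (plain Python); the parallel candidate broadcasts, partial products, and conflict resolution described above are omitted:

    # Greedy inner-product matching: match each unmatched vertex (column)
    # to the unmatched vertex whose set of nets overlaps most.
    nets_of = {                      # vertex -> set of nets containing it
        "a": {1, 2, 3}, "b": {2, 3}, "c": {3, 4}, "d": {4, 5}, "e": {5},
    }

    match = {}
    for v in nets_of:
        if v in match:
            continue
        best, best_ip = None, 0
        for u in nets_of:
            if u == v or u in match:
                continue
            ip = len(nets_of[v] & nets_of[u])     # inner product = shared nets
            if ip > best_ip:
                best, best_ip = u, ip
        if best is not None:                      # greedy: accept best match now
            match[v] = best
            match[best] = v

    print(match)   # a-b matched (2 shared nets), c-d matched, e left unmatched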

  17. Contraction • Merge matched vertices • Graph/Hypergraph: communicate the match array • Graph: • communicate adjacency lists of matched vertices from different processors • Hypergraph: • Merging vertices reduces only one dimension • To reduce the other dimension: • Remove size-1 nets • Remove identical nets • Compute hashes via horizontal communication • Compare nets via vertical communication
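
A serial stand-in for the net-reduction step (plain Python, toy nets): hash each net's vertex set and compare only nets whose hashes collide; in the parallel code the hashing and the comparison are split over the horizontal and vertical communication mentioned above.

    nets = {
        "n1": frozenset({1, 2, 3}),
        "n2": frozenset({2, 3}),
        "n3": frozenset({1, 2, 3}),   # identical to n1 after contraction
        "n4": frozenset({7}),         # size-1 net: cannot be cut, drop it
    }

    kept = {}
    buckets = {}                       # hash value -> names of kept nets
    for name, verts in nets.items():
        if len(verts) <= 1:
            continue                   # remove size-1 nets
        h = hash(verts)
        if any(nets[other] == verts for other in buckets.get(h, [])):
            continue                   # an identical net is already kept
        buckets.setdefault(h, []).append(name)
        kept[name] = verts

    print(sorted(kept))                # ['n1', 'n2']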

  18. Coarse Partitioning • Gather the coarsest graph/hypergraph onto each processor • Bisection/k-way (Zoltan): • Compute several different coarse partitions on each processor • Select the best local partition • Compute the best over all processors • k-way (ParMETIS): • Do recursive bisection; each processor follows a single path in the recursion
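
A toy sketch of the "compute several, keep the best" idea (plain Python; the random bisection is an invented stand-in for a real coarse partitioner, and the cross-processor reduction to the global best is omitted):

    import random

    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    nvert = 4

    def random_bisection(rng):
        """Shuffle the vertices and split them into two equal halves."""
        verts = list(range(nvert))
        rng.shuffle(verts)
        half = nvert // 2
        return {v: (0 if i < half else 1) for i, v in enumerate(verts)}

    def cut(part):
        return sum(1 for u, v in edges if part[u] != part[v])

    best = None
    for seed in range(8):                       # several different attempts
        part = random_bisection(random.Random(seed))
        if best is None or cut(part) < cut(best):
            best = part                         # keep the best local partition

    print("best local coarse partition:", best, "cut =", cut(best))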

  19. Uncoarsening/Refinement • Project the coarse partition to the finer graph/hypergraph • Use local optimization (KL/FM) to improve balance and reduce cuts • Repeat until the finest graph/hypergraph is reached • Each FM round consists of a number of phases (colors) • Graph: C or S = 2 communication steps per round (max 2 rounds) • Hypergraph: 3 communication steps per round (max 10 rounds), but concurrency is limited to processor columns: • Compute a “root” processor in each processor column: the processor with the most nonzeros • Compute and communicate pin distribution • Compute and communicate gains • Root processor computes moves for vertices in its processor column • Communicate best moves and update pin distribution
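
A minimal sketch of the gain computation at the heart of KL/FM refinement, shown for the graph case on an invented example (plain Python); a full FM pass would move the best-gain vertex, lock it, update its neighbors' gains, and keep the best prefix of moves:

    # Gain of moving a vertex to the other part =
    #   (edges to the other part) - (edges within its own part).
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    part  = {0: 0, 1: 0, 2: 1, 3: 1}

    def gain(v):
        internal = sum(1 for a, b in edges if v in (a, b) and part[a] == part[b])
        external = sum(1 for a, b in edges if v in (a, b) and part[a] != part[b])
        return external - internal

    for v in part:
        print(f"vertex {v}: gain {gain(v)}")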

  20. Experimental Results (figures: Xyce ASIC Stripped matrix; Cage electrophoresis matrix) • Experiments on • Thunderbird cluster • 4,480 compute nodes connected with InfiniBand • dual 3.6 GHz Intel EM64T processors with 6 GB RAM • BMI-RI cluster • 64 compute nodes connected with InfiniBand • dual 2.4 GHz AMD Opteron processors with 8 GB RAM • Zoltan v2.1 hypergraph partitioner & ParMETIS v3.1 graph partitioner • Test problems: • Xyce ASIC Stripped: 680K x 680K; 2.3M nonzeros • Cage15 electrophoresis: 5.1M x 5.1M; 99M nonzeros • mrng4: 7.5M x 7.5M; ~67M nonzeros • 2DGrid: 16M x 16M; ~88M nonzeros

  21. Results: Weak Scaling • On BMI-RI cluster • cage12 ~130K vertices, cage13 is 3.4x, cage14 is 11.5x, and cage15 is 39.5x of cage12

  22. Results: Weak Scaling • On BMI-RI cluster • ParMETIS on Cray T3E uses SHMEM • mrng1 ~500K vertices (p=2), mrng2 is 4x, mrng3 is 16x, mrng4 is 32x of mrng1

  23. Results: Weak Scaling • On BMI-RI cluster • On the average 15% smaller communication volume with hypergraph!?

  24. Results: Strong Scaling • On Thunderbird cluster • On the average 17% smaller communication volume with hypergraph

  25. Results: Strong Scaling • On Thunderbird cluster • Balance problems with ParMETIS

  26. Conclusion • Do we need parallel partitioning? • It is a must, but not for speed: because of memory! Big problems need big machines! • Do load balancers scale? • Yes, they do scale, but they need to be improved to scale to petascale • Memory usage becomes a problem for larger problems and processor counts • One size won't fit all: many things depend on the application • Model depends on the application • Trade-offs • Techniques may depend on application/data • Performance depends on application/data => bring us more applications/data :)

  27. CSCAPES: Combinatorial Scientific Computing and Petascale Simulation • A SciDAC Institute funded by DOE's Office of Science • Alex Pothen, Florin Dobrian, Assefaw Gebremedhin • Erik Boman, Karen Devine, Bruce Hendrickson • Paul Hovland, Boyana Norris, Jean Utke • Michelle Strout • Umit Catalyurek • www.cscapes.org
