How Scalable is Your Load Balancer?

Ümit V. Çatalyürek, Doruk Bozdağ

The Ohio State University

Karen D. Devine, Erik G. Boman

Sandia National Laboratories

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.


[Figure: partitioned sparse linear system Ax = b]

Partitioning and Load Balancing
  • Goal: assign data to processors to
    • minimize application runtime
    • maximize utilization of computing resources
  • Metrics:
    • minimize processor idle time (balance workloads)
    • keep inter-processor communication costs low
  • Impacts performance of a wide range of simulations

    • Linear solvers & preconditioners
    • Adaptive mesh refinement
    • Contact detection
    • Particle simulations

Roadmap
  • Background
    • Graph Partitioning and Partitioning Tools
    • Hypergraph Partitioning and Partitioning Tools
  • State-of-the-art: Parallel Multilevel Partitioning
    • Data Layout/Decomposition
    • Recursive Bisection
    • Multilevel Partitioning
      • Coarsening
      • Coarse Partitioning
      • Refinement
    • Results
Graph Partitioning
  • Work-horse of load-balancing community.
  • Highly successful model for PDE problems.
  • Model problem as a graph:
    • vertices = work associated with data (computation)
    • edges = relationships between data/computation (communication)
  • Goal: Evenly distribute vertex weight while minimizing weight of cut edges.
  • Many successful algorithms
    • Kernighan, Lin, Simon, Hendrickson, Leland, Kumar, Karypis, et al.
  • Excellent software available
    • Serial: Chaco (SNL), Jostle (U. Greenwich), METIS (U. Minn.), Party (U. Paderborn), Scotch (U. Bordeaux)
    • Parallel: ParMETIS (U. Minn.), PJostle (U. Greenwich)
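
As a concrete illustration of the goal above, here is a minimal Python sketch (not from the slides) that evaluates a graph partition by the two metrics it combines: load balance and the weight of cut edges. The function name `evaluate_partition` and the toy graph are illustrative only.

```python
# Minimal sketch (illustrative, not from the presentation): evaluating a graph
# partition by load balance and weight of cut edges.

def evaluate_partition(vertex_weights, edges, part):
    """vertex_weights: {v: w}, edges: [(u, v, w_edge)], part: {v: part_id}."""
    # Load balance: max part weight / average part weight (1.0 is perfect).
    loads = {}
    for v, w in vertex_weights.items():
        loads[part[v]] = loads.get(part[v], 0) + w
    avg = sum(loads.values()) / len(loads)
    imbalance = max(loads.values()) / avg

    # Edge cut: total weight of edges whose endpoints land in different parts.
    cut = sum(w for u, v, w in edges if part[u] != part[v])
    return imbalance, cut

# Tiny made-up example: 4 unit-weight vertices on a cycle, split into two parts.
weights = {0: 1, 1: 1, 2: 1, 3: 1}
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
part = {0: 0, 1: 0, 2: 1, 3: 1}
print(evaluate_partition(weights, edges, part))  # (1.0, 2)
```
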

[Figures: hexahedral finite element matrix; linear programming matrix for sensor placement]

Limited Applicability of Graph Models
  • Assume symmetric square problems.
    • Symmetric = undirected graph.
    • Square = inputs and outputs of operation are same size.
  • Do not naturally support:
    • Non-symmetric systems.
      • Require directed or bipartite graph.
      • Partition A + Aᵀ.
    • Rectangular systems.
      • Require decompositions for differently sized inputs and outputs.
      • Partition AAᵀ.

[Figure: example graph with vertices Vh–Vm distributed across processors P1–P4]

[Figures: hexahedral finite element matrix; Xyce ASIC matrix]

Approximate Communication Metric in Graph Models
  • Graph models assume
    • Weight of edge cuts = Communication volume.
  • But edge cuts only approximate communication volume.
  • Good enough for many PDE applications.
  • Not good enough for irregular problems (see the sketch below).
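
To make the gap concrete, here is a small illustrative Python sketch (not part of the original presentation) comparing the edge cut with the actual communication volume of a halo exchange on a tiny star graph; `edge_cut` and `comm_volume` are made-up helper names.

```python
# Minimal sketch (illustrative): edge cut vs. actual communication volume.
# A vertex with several cut edges into the same remote part sends its data
# only once, so the two metrics can disagree.

from collections import defaultdict

def edge_cut(edges, part):
    return sum(1 for u, v in edges if part[u] != part[v])

def comm_volume(edges, part):
    # Each vertex sends its data once per distinct remote part it neighbors.
    remote = defaultdict(set)
    for u, v in edges:
        if part[u] != part[v]:
            remote[u].add(part[v])
            remote[v].add(part[u])
    return sum(len(parts) for parts in remote.values())

# Star: vertex 0 on part 0, three neighbors on part 1.
edges = [(0, 1), (0, 2), (0, 3)]
part = {0: 0, 1: 1, 2: 1, 3: 1}
print(edge_cut(edges, part))     # 3 cut edges
print(comm_volume(edges, part))  # 4 words of data actually moved
```
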
Hypergraph Partitioning
  • Work-horse of VLSI community
  • Recently adopted by scientific community
  • Model problem as a hypergraph (Çatalyürek & Aykanat):
    • vertices = work associated with data (computation)
    • hyperedges (nets) = data elements, computation-to-data dependencies (communication)
  • Goal: Evenly distribute vertex weight while minimizing cut size.
  • Many successful algorithms
    • Kernighan, Schweikert, Fiduccia, Mattheyses, Sanchis, Alpert, Kahng, Hauck, Borriello, Çatalyürek, Aykanat, Karypis, et al.
  • Excellent serial software available
    • hMETIS (Karypis) – PaToH (Çatalyürek)
    • Mondriaan (Bisseling)
  • Parallel partitioners needed for large, dynamic problems.
    • Zoltan PHG (Sandia) – Parkway (Trifunovic)
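
As a concrete example of a hypergraph cut metric, the sketch below (illustrative Python, not taken from any of the tools above) computes the widely used connectivity-minus-one cut: each cut net contributes the number of parts it spans minus one, which matches the communication volume of an owner-computes data distribution.

```python
# Minimal sketch (illustrative): the "connectivity - 1" hypergraph cut metric.

def connectivity_cut(nets, part):
    """nets: {net_id: iterable of pin vertices}, part: {vertex: part_id}."""
    cut = 0
    for pins in nets.values():
        spanned = {part[v] for v in pins}   # parts touched by this net
        cut += len(spanned) - 1             # 0 if the net is internal
    return cut

# Example: net n0 spans parts {0, 1, 2} -> contributes 2; n1 is internal -> 0.
nets = {"n0": [0, 1, 2], "n1": [3, 4]}
part = {0: 0, 1: 1, 2: 2, 3: 1, 4: 1}
print(connectivity_cut(nets, part))  # 2
```
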

[Figure: example hypergraph with vertices Vh–Vm and nets nh–nm distributed across processors]

Impact of Hypergraph Models
  • Greater expressiveness → greater applicability.
    • Structurally non-symmetric systems
      • circuits, biology
    • Rectangular systems
      • linear programming, least-squares methods
    • Non-homogeneous, highly connected topologies
      • circuits, nanotechnology, databases
    • Multiple models for different granularity partitioning
      • Owner compute, fine-grain, checkerboard/cartesian, Mondriaan
  • Accurate communication model → lower application communication costs.

[Figure: Mondriaan partitioning, courtesy of Rob Bisseling]

Design Decisions
  • Single-level vs Multi-level [serial/parallel]
  • Direct k-way vs Recursive Bisection [serial/parallel]
  • Data Layout [parallel]
    • Chicken-and-egg Problem
Recursive Bisection
  • Recursive bisection approach:
    • Partition data into two sets.
    • Recursively subdivide each set into two sets.
    • Only minor modifications needed to allow P ≠ 2ⁿ.
  • How to parallelize:
    • Start on single node; use additional nodes in each recursion
      • Doesn’t solve Memory Problem
    • Solve each recursion (hence each level of multilevel) in parallel
      • At each recursion two split options:
        • Split only the data into two sets; use all processors to compute each branch.
        • Split both the data and processors into two sets; solve branches in parallel.
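
The following minimal Python sketch (illustrative only, not Zoltan's or ParMETIS's implementation) shows the structure of the second split option: both the data and the part count are split at each level, so the two branches could be solved in parallel. `bisect` is a made-up stand-in for a real 2-way partitioner.

```python
# Minimal sketch (illustrative): recursive bisection that splits data
# proportionally to the number of parts on each side, so P need not be 2^n.

def bisect(items, fraction):
    """Stand-in for a real 2-way partitioner: put `fraction` of the items left."""
    k = round(len(items) * fraction)
    return items[:k], items[k:]

def recursive_bisection(items, num_parts, first_part=0):
    """Return {item: part_id} for num_parts parts."""
    if num_parts == 1:
        return {item: first_part for item in items}
    left_parts = num_parts // 2
    left, right = bisect(items, left_parts / num_parts)   # handles P != 2^n
    mapping = recursive_bisection(left, left_parts, first_part)
    mapping.update(recursive_bisection(right, num_parts - left_parts,
                                       first_part + left_parts))
    return mapping

print(recursive_bisection(list(range(10)), 3))  # part sizes 3 / 4 / 3
```
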
Parallel Data Layout: Hypergraph
  • Zoltan uses 2D data layout within hypergraph partitioner.
    • Matrix representation of Hypergraphs (Çatalyürek & Aykanat)
      • vertices == columns
      • nets (hyperedges) == rows
    • Vertex/hyperedge broadcasts to only sqrt(P) processors.
    • Maintain scalable memory usage.
      • No “ghosting” of off-processor neighbor info.
      • Differs from parallel graph partitioners and Parkway.
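
To illustrate the 2D layout idea, here is a tiny Python sketch (an assumption-laden toy, not Zoltan's actual distribution code): with the hypergraph stored as a sparse matrix (nets == rows, vertices == columns), pins are spread over a sqrt(P) x sqrt(P) processor grid, so any single vertex or net is shared by only sqrt(P) processors. The function `owner_of_pin` and the modulo mapping are made up for illustration.

```python
# Minimal sketch (illustrative): 2D distribution of hypergraph pins over a
# sqrt(P) x sqrt(P) processor grid.

import math

def owner_of_pin(net, vertex, P):
    q = int(math.isqrt(P))          # assume P is a perfect square here
    proc_row = net % q              # all pins of a net stay in one processor row
    proc_col = vertex % q           # all pins of a vertex stay in one processor column
    return proc_row * q + proc_col  # rank in row-major order

# Example with P = 16: every pin of net 5 lands in processor row 1,
# every pin of vertex 7 lands in processor column 3.
print(owner_of_pin(5, 7, 16))   # 7  (row 1, col 3)
print(owner_of_pin(5, 2, 16))   # 6  (row 1, col 2)
```
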
Coarsening via Parallel Matching
  • Vertex-based greedy maximal weight matching algorithms
    • Graph: Heavy Edge Matching
    • Hypergraph: heavy-connectivity matching (Çatalyürek); inner-product matching (Bisseling)
      • Match columns (vertices) with greatest inner product → greatest similarity in connectivity
  • BSP style: with and without coloring/phases or rounds
  • ParMETIS: 2C or 2S (S ≈ 4) communication steps
    • Each processor concurrently decides a possibly conflicting matching
    • 2 communications to resolve conflicts
  • Zoltan: 4R communication steps (each w/ max procs)
    • Broadcast subset of vertices (“candidates”) along processor row
    • Compute (partial) inner products of received candidates with local vertices
    • Accrue inner products in processor column
    • Identify best local matches for received candidates
    • Send best matches to candidates’ owners
    • Select best global match for each owned candidate
    • Send “match accepted” messages to processors owning matched vertices
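
The serial Python sketch below (illustrative only) shows just the matching criterion named above — greedy inner-product matching on the columns of the matrix; the parallel candidate broadcasts and conflict-resolution rounds listed above are omitted, and `inner_product_matching` is a made-up name.

```python
# Minimal sketch (illustrative, serial): greedy inner-product matching.
# Columns with the largest inner product share the most nets, so merging them
# shrinks the hypergraph while preserving its structure.

def inner_product_matching(columns):
    """columns: {vertex: set of net ids}. Returns a list of matched pairs."""
    matched, pairs = set(), []
    for v, nets_v in columns.items():
        if v in matched:
            continue
        best, best_ip = None, 0
        for u, nets_u in columns.items():
            if u == v or u in matched:
                continue
            ip = len(nets_v & nets_u)        # inner product = number of shared nets
            if ip > best_ip:
                best, best_ip = u, ip
        if best is not None:
            matched.update({v, best})
            pairs.append((v, best))
    return pairs

columns = {0: {0, 1, 2}, 1: {1, 2, 3}, 2: {4}, 3: {4, 5}}
print(inner_product_matching(columns))  # [(0, 1), (2, 3)]
```
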
Contraction
  • Merge matched vertices
    • Graph/Hypergraph: communicate match array
  • Graph:
    • communicate adjacency lists of matched vertices from different processors
  • Hypergraph:
    • Merging vertices reduces only one dimension
    • To reduce the other dimension:
      • Remove size 1 nets
      • Remove identical nets
        • Compute hash via horizontal communication
        • Compare nets via vertical communication
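
A serial, illustrative Python sketch of the contraction step is given below (not the Zoltan implementation; `contract` is a made-up name and the hash check stands in for the horizontal/vertical communication described above): matched vertices are merged, then size-1 nets and identical nets are dropped.

```python
# Minimal sketch (illustrative, serial): contracting a hypergraph after matching.

def contract(nets, pairs):
    """nets: {net: set of vertices}; pairs: list of (u, v) matches."""
    # Map each vertex to its coarse vertex (here: the smaller id of the pair).
    coarse = {}
    for u, v in pairs:
        coarse[u] = coarse[v] = min(u, v)

    seen, coarse_nets = set(), {}
    for name, pins in nets.items():
        new_pins = frozenset(coarse.get(p, p) for p in pins)
        if len(new_pins) <= 1:
            continue                 # size-1 nets can never be cut: drop them
        if new_pins in seen:
            continue                 # identical net already kept (hash lookup)
        seen.add(new_pins)
        coarse_nets[name] = set(new_pins)
    return coarse_nets

nets = {"n0": {0, 1}, "n1": {0, 2, 3}, "n2": {1, 2, 3}}
print(contract(nets, [(0, 1), (2, 3)]))  # {'n1': {0, 2}}: n0 shrinks away, n2 duplicates n1
```
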
Coarse Partitioning
  • Gathers coarsest graph/hypergraph to each processor
    • Bisection/k-way (Zoltan):
      • Compute several different coarse partitions on each processor
      • Select best local partition
      • Compute best over all processors
    • k-way (ParMETIS):
      • Do recursive bisection; each proc follows single path in the recursion
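
The sketch below (illustrative Python; `greedy_bisection` and `best_of` are made-up names, and the cross-processor reduction is omitted) shows the "compute several coarse partitions and keep the best" idea: a few randomized bisections of the small coarsest hypergraph are scored by cut, and the cheapest one wins.

```python
# Minimal sketch (illustrative, serial): best-of-several coarse bisections.

import random

def greedy_bisection(vertex_weights, seed):
    rng = random.Random(seed)
    order = list(vertex_weights)
    rng.shuffle(order)
    half = sum(vertex_weights.values()) / 2
    part, load = {}, 0
    for v in order:                  # fill part 0 up to half the total weight
        part[v] = 0 if load < half else 1
        if part[v] == 0:
            load += vertex_weights[v]
    return part

def best_of(vertex_weights, nets, trials=4):
    def cut(part):                   # connectivity - 1 cut of a partition
        return sum(len({part[v] for v in pins}) - 1 for pins in nets.values())
    candidates = [greedy_bisection(vertex_weights, s) for s in range(trials)]
    return min(candidates, key=cut)

weights = {v: 1 for v in range(6)}
nets = {"a": {0, 1}, "b": {2, 3}, "c": {4, 5}, "d": {1, 2}}
print(best_of(weights, nets))
```
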
Uncoarsening/Refinement
  • Project coarse partition to finer graph/hypergraph
  • Use local optimization (KL/FM) to improve balance and reduce cuts
  • Repeat until you reach finest graph/hypergraph
  • Each FM round consists of a number of phases (colors)
  • Graph: C or S=2 communication steps per round (max 2 rounds)
  • Hypergraph: 3 communication steps per round (max 10 rounds), but concurrency is limited as follows:
      • Compute “root” processor in each processor column: processor with most nonzeros
      • Compute and communicate pin distribution
      • Compute and communicate gains
      • Root processor computes moves for vertices in processor column
      • Communicate best moves and update pin distribution
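
To illustrate the gains mentioned above, here is a minimal serial Python sketch (not the Zoltan/ParMETIS code; `move_gain` is a made-up name) of the FM-style gain of moving one vertex across a bisection, using the connectivity-1 cut. Real KL/FM keeps gains in a bucket structure and applies tentative moves; only the gain computation itself is shown.

```python
# Minimal sketch (illustrative, serial): FM-style gain of moving one vertex.

def move_gain(vertex, nets_of, part):
    """nets_of: {vertex: list of net pin lists}, part: {vertex: 0 or 1}."""
    gain = 0
    for pins in nets_of[vertex]:
        others = [part[u] for u in pins if u != vertex]
        if all(p != part[vertex] for p in others):
            gain += 1   # net becomes internal to the other part: cut drops by 1
        elif all(p == part[vertex] for p in others):
            gain -= 1   # net was internal and becomes cut: cut grows by 1
    return gain

nets = {"a": [0, 1], "b": [0, 2, 3]}
part = {0: 0, 1: 1, 2: 1, 3: 1}
nets_of = {0: [nets["a"], nets["b"]], 1: [nets["a"]],
           2: [nets["b"]], 3: [nets["b"]]}
print(move_gain(0, nets_of, part))  # 2: moving vertex 0 uncuts both nets
```
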

[Figures: Xyce ASIC stripped matrix; cage electrophoresis matrix]

Experimental Results
  • Experiments on
    • Thunderbird cluster
      • 4,480 compute nodes connected with Infiniband
      • Dual 3.6 GHz Intel EM64T processors with 6 GB RAM
    • BMI-RI cluster
      • 64 compute nodes connected with Infiniband
      • Dual 2.4 GHz AMD Opteron processors with 8 GB RAM
  • Zoltan v2.1 hypergraph partitioner & ParMETIS v3.1 graph partitioner
  • Test problems:
    • Xyce ASIC Stripped: 680K x 680K; 2.3M nonzeros
    • Cage15 Electrophoresis: 5.1M x 5.1M; 99M nonzeros
    • Mrng4: 7.5M x 7.5M; ~67M nonzeros
    • 2DGrid: 16M x 16M; ~88M nonzeros
Results: Weak Scaling
  • On BMI-RI cluster
  • cage12 has ~130K vertices; cage13 is 3.4x, cage14 is 11.5x, and cage15 is 39.5x the size of cage12
Results: Weak Scaling
  • On BMI-RI cluster
  • ParMETIS on Cray T3E uses SHMEM
  • mrng1 has ~500K vertices (p=2); mrng2 is 4x, mrng3 is 16x, and mrng4 is 32x the size of mrng1
Results: Weak Scaling
  • On BMI-RI cluster
  • On average, 15% smaller communication volume with the hypergraph partitioner!?
Results: Strong Scaling
  • On Thunderbird cluster
  • On average, 17% smaller communication volume with the hypergraph partitioner
Results: Strong Scaling
  • On Thunderbird cluster
  • Balance problems with ParMETIS
Conclusion
  • Do we need parallel partitioning?
    • It is a must, but not for speed: because of memory! Big problems need big machines!
  • Do load-balancers scale?
    • Yes, they do scale, but they need to be improved to scale to petascale
      • Memory usage becomes a problem for larger problems and processor counts
  • One size won't fit all: many things depend on the application
    • Model depends on application
    • Trade-Offs
    • Techniques may depend on application/data
    • Performance depends on application/data => bring us more applications/data :)
CSCAPES: Combinatorial Scientific Computing and Petascale Simulation
  • A SciDAC Institute Funded by DOE’s Office of Science

Alex Pothen, Florin Dobrian, Assefaw Gebremedhin

Erik Boman, Karen Devine, Bruce Hendrickson

Paul Hovland, Boyana Norris, Jean Utke

Michelle Strout

Umit Catalyurek

www.cscapes.org