1 / 32

High-Performance Computation for Path Problems in Graphs

High-Performance Computation for Path Problems in Graphs. Aydin Buluç John R. Gilbert University of California, Santa Barbara SIAM Conf. on Applications of Dynamical Systems May 20, 2009. Support: DOE Office of Science, MIT Lincoln Labs, NSF, DARPA, SGI.

nitza
Download Presentation

High-Performance Computation for Path Problems in Graphs

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. High-Performance Computation for Path Problems in Graphs Aydin Buluç John R. Gilbert University of California, Santa Barbara SIAM Conf. on Applications of Dynamical Systems May 20, 2009 Support: DOE Office of Science, MIT Lincoln Labs, NSF, DARPA, SGI

  2. Horizontal-vertical decomposition [Mezic et al.] Slide courtesy of Igor Mezic group, UCSB

  3. Combinatorial Scientific Computing “I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were ‘sparse’ in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care” - Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize for Economics

  4. A few directions in CSC • Hybrid discrete & continuous computations • Multiscale combinatorial computation • Analysis, management, and propagation of uncertainty • Economic & game-theoretic considerations • Computational biology & bioinformatics • Computational ecology • Knowledge discovery & machine learning • Relationship analysis • Web search and information retrieval • Sparse matrix methods • Geometric modeling • . . .

  5. The Parallel Computing Challenge Two Nvidia 8800 GPUs > 1 TFLOPS LANL / IBM Roadrunner > 1 PFLOPS Intel 80-core chip > 1 TFLOPS • Parallelism is no longer optional… • … in every part of a computation.

  6. The Parallel Computing Challenge • Efficient sequential algorithms for graph-theoretic problems often follow long chains of dependencies • Several parallelization strategies, but no silver bullet: • Partitioning (e.g. for preconditioning PDE solvers) • Pointer-jumping (e.g. for connected components) • Sometimes it just depends on what the input looks like • A few simple examples . . .

  7. Sample kernel: Sort logically triangular matrix • Used in sparse linear solvers (e.g. Matlab’s) • Simple kernel, abstracts many other graph operations (see next) • Sequential: linear time, simple greedy topological sort • Parallel: no known method is efficient in both work and span: one parallel step per level; arbitrarily long dependent chains Permuted to unit upper triangular form Original matrix

  8. Bipartite matching 1 2 3 4 5 1 4 1 5 2 2 3 3 3 1 4 2 4 5 PA 5 1 2 3 4 5 • Perfect matching: set of edges that hits each vertex exactly once • Matrix permutation to place nonzeros (or heavy elements) on diagonal • Efficient sequential algorithms based on augmenting paths • No known work/span efficient parallel algorithms 1 2 3 4 5 A

  9. Strongly connected components 1 2 4 7 5 3 6 1 2 4 7 5 3 6 2 1 • Symmetric permutation to block triangular form • Diagonal blocks are strong Hall (irreducible / strongly connected) • Sequential: linear time by depth-first search [Tarjan] • Parallel:divide & conquer, work and span depend on input [Fleischer, Hendrickson, Pinar] 4 5 7 6 3 G(A) PAPT

  10. Horizontal - vertical decomposition 1 2 3 4 5 6 7 8 9 • Defined and studied by Mezic et al. in a dynamical systems context • Strongly connected components, ordered by levels of DAG • Efficient linear-time sequential algorithms • No work/span efficient parallel algorithms known 9 level 4 8 1 2 3 6 7 level 3 4 5 3 2 4 6 5 level 2 7 8 level 1 9 1

  11. Strong components of 1M-vertex RMAT graph

  12. Dulmage-Mendelsohn decomposition 1 2 1 2 3 4 5 6 7 8 9 10 11 1 1 3 HR HC 2 2 4 3 3 5 4 4 6 5 5 7 6 SR SC 7 6 8 8 7 9 9 8 10 10 9 11 VR VC 11 10 12 11 12

  13. Applications of D-M decomposition • Strongly connected components of directed graphs • Connected components of undirected graphs • Permutation to block triangular form for Ax=b • Minimum-size vertex cover of bipartite graphs • Extracting vertex separators from edge cuts for arbitrary graphs • Nonzero structure prediction for sparse matrix factorizations

  14. Strong Hall components are independent of choice of matching 4 1 7 2 5 6 3 1 2 4 7 5 3 6 1 1 1 2 2 2 3 3 4 4 4 7 5 5 5 3 6 6 6 7 7 1 1 1 2 4 7 5 3 6 2 2 3 3 4 4 5 5 6 6 7 7

  15. The Primitives Challenge Basic Linear Algebra Subroutines (BLAS): Speed (MFlops) vs. Matrix Size (n) C = A*B y = A*x μ = xT y • By analogy to numerical linear algebra. . . • What should the “combinatorial BLAS” look like?

  16. Primitives for HPC graph programming • Visitor-based multithreaded[Berry, Gregor, Hendrickson, Lumsdaine] + search templates natural for many algorithms + relatively simple load balancing – complex thread interactions, race conditions – unclear how applicable to standard architectures • Array-based data parallel [G, Kepner, Reinhardt, Robinson, Shah] + relatively simple control structure + user-friendly interface – some algorithms hard to express naturally – load balancing not so easy • Scan-based vectorized[Blelloch] • We don’t know the right set of primitives yet!

  17. Array-based graph algorithms study [Kepner, Fineman, Kahn, Robinson]

  18. Multiple-source breadth-first search 2 1 4 5 7 6 3 AT X

  19. Multiple-source breadth-first search 2 1 4 5 7 6 3  AT X ATX

  20. Multiple-source breadth-first search 2 1 4 5 7 6 3  • Sparse array representation => space efficient • Sparse matrix-matrix multiplication => work efficient • Span & load balance depend on matrix-mult implementation AT X ATX

  21. Matrices over semirings • Matrix multiplication C = AB(or matrix/vector): Ci,j = Ai,1B1,j+Ai,2B2,j+· · ·+Ai,nBn,j • Replace scalar operations and + by  : associative, distributes over, identity 1  : associative, commutative, identity 0 annihilates under  • ThenCi,j = Ai,1B1,j Ai,2B2,j  · · · Ai,nBn,j • Examples:(,+);(and,or);(+,min); . . . • Same data reference pattern and control flow

  22. SpGEMM: Sparse Matrix x Sparse Matrix [Buluc, G] • Shortest path calculations (APSP) • Betweenness centrality • BFS from multiple source vertices • Subgraph / submatrix indexing • Graph contraction • Cycle detection • Multigrid interpolation & restriction • Colored intersection searching • Applying constraints in finite element modeling • Context-free parsing

  23. Distributed-memory parallel sparse matrix multiplication • 2D block layout • Outer product formulation • Sequential “hypersparse” kernel j k k * • Asynchronous MPI-2 implementation • Experiments: TACC Lonestar cluster • Good scaling to 256 processors = i Cij Cij+= Aik* Bkj Time vs Number of cores -- 1M-vertex RMAT

  24. All-Pairs Shortest Paths • Directed graph with “costs” on edges • Find least-cost paths between all reachable vertex pairs • Several classical algorithms with • Work ~ matrix multiplication • Span ~ log2 n • Case study of implementation on multicore architecture: • graphics processing unit (GPU)

  25. GPU characteristics • Powerful: two Nvidia 8800s > 1 TFLOPS • Inexpensive: $500 each But: • Difficult programming model: One instruction stream drives 8 arithmetic units • Performance is counterintuitive and fragile: Memory access pattern has subtle effects on cost • Extremely easy to underutilize the device: Doing it wrong easily costs 100x in time

  26. Recursive All-Pairs Shortest Paths Based on R-Kleene algorithm Well suited for GPU architecture: • Fast matrix-multiply kernel • In-place computation => low memory bandwidth • Few, large MatMul calls => low GPU dispatch overhead • Recursion stack on host CPU, not on multicore GPU • Careful tuning of GPU code C A D B + is “min”, × is “add” A = A*; % recursive call B = AB; C = CA; D = D + CB; D = D*; % recursive call B = BD; C = DC; A = A + BC;

  27. Execution of Recursive APSP

  28. APSP: Experiments and observations 128-core Nvidia 8800 Speedup relative to. . . 1-core CPU: 120x – 480x 16-core CPU: 17x – 45x Iterative, 128-core GPU: 40x – 680x MSSSP, 128-core GPU: ~3x Time vs. Matrix Dimension Conclusions: • High performance is achievable but not simple • Carefully chosen and optimized primitives will be key

  29. H-V decomposition 1 2 3 4 5 6 7 8 9 A span-efficient, but not work-efficient, method for H-V decomposition uses APSP to determine reachability… 9 level 4 8 1 2 3 6 7 level 3 4 5 3 2 4 6 5 level 2 7 8 level 1 9 1

  30. Reachability: Transitive closure 1 2 3 4 5 6 7 8 9 9 level 4 8 1 2 3 6 7 level 3 4 5 3 2 4 6 5 level 2 7 8 level 1 9 1 APSP => transitive closure of adjacency matrix Strong components identified by symmetric nonzeros

  31. H-V structure: Acyclic condensation 9 level 4 8 1 2 3 6 8 9 1 2 67 level 3 3 6 8 2 345 level 2 9 level 1 1 Acyclic condensation is a sparse matrix-matrix product Levels identified by “APSP” for longest paths Practically speaking, a parallel method would compromise between work and span efficiency

  32. Remarks • Combinatorial algorithms are pervasive in scientific computing and will become more so. • Path computations on graphs are powerful tools, but efficiency is a challenge on parallel architectures. • Carefully chosen and implemented primitive operations are key. • Lots of exciting opportunities for research!

More Related