1 / 159

Katherine Yelick Lawrence Berkeley National Laboratory and U. C. Berkeley, EECS Dept.

Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS). Katherine Yelick Lawrence Berkeley National Laboratory and U. C. Berkeley, EECS Dept. November 2004. Outline. Motivation for Automatic Performance Tuning

lynda
Download Presentation

Katherine Yelick Lawrence Berkeley National Laboratory and U. C. Berkeley, EECS Dept.

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS) Katherine Yelick Lawrence Berkeley National Laboratory and U. C. Berkeley, EECS Dept. November 2004

  2. Outline • Motivation for Automatic Performance Tuning • Recent results for sparse matrix kernels • Application to T3P, Omega3P • OSKI = Optimized Sparse Kernel Interface • Future Work

  3. Prizes • Best Paper, Intern. Conf. Parallel Processing, 2004 • “Performance models for evaluation and automatic performance tuning of symmetric sparse matrix-vector multiply” • Best Student Paper, Intern. Conf. Supercomputing, Workshop on Performance Optimization via High-Level Languages and Libraries,2003 • Best Student Presentation too, to Richard Vuduc • “Automatic performance tuning and analysis of sparse triangular solve” • Finalist, Best Student Paper, Supercomputing 2002 • To Richard Vuduc • “Performance Optimization and Bounds for Sparse Matrix-vector Multiply” • Best Presentation Prize, MICRO-33: 3rd ACM Workshop on Feedback-Directed Dynamic Optimization, 2000 • To Richard Vuduc • “Statistical Modeling of Feedback Data in an Automatic Tuning System”

  4. Motivation for Automatic Performance Tuning • Historical trends • Sparse matrix-vector multiply (SpMV): 10% of peak or less • 2x faster than CSR with “hand-tuning” • Tuning becoming more difficult over time • Performance depends on machine, kernel, matrix • Matrix known at run-time • Best data structure + implementation can be surprising • Our approach: empirical modeling and search • Up to 4x speedups and 31% of peak for SpMV • Many optimization techniques for SpMV • Several other kernels: triangular solve, ATA*x, Ak*x • Proof-of-concept: Integrate with Omega3P • Release OSKI Library, integrate into PETSc

  5. Example: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem • 8x8 dense substructure

  6. Best: 4x2 Reference Speedups on Itanium 2: The Need for Search Mflop/s Mflop/s

  7. Ultra 2i - 9% 63 Mflop/s Ultra 3 - 6% 109 Mflop/s SpMV Performance (Matrix #2): Generation 2 35 Mflop/s 53 Mflop/s Pentium III - 19% Pentium III-M - 15% 96 Mflop/s 120 Mflop/s 42 Mflop/s 58 Mflop/s

  8. Power3 - 13% 195 Mflop/s Power4 - 14% 703 Mflop/s SpMV Performance (Matrix #2): Generation 1 100 Mflop/s 469 Mflop/s Itanium 1 - 7% Itanium 2 - 31% 225 Mflop/s 1.1 Gflop/s 103 Mflop/s 276 Mflop/s

  9. Opteron Performance Profile

  10. Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • Fill-in explicit zeros • Unroll 3x3 block multiplies • “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup!

  11. Summary of Performance Optimizations • Optimizations for SpMV • Register blocking (RB): up to 4x over CSR • Variable block splitting: 2.1x over CSR, 1.8x over RB • Diagonals: 2x over CSR • Reordering to create dense structure + splitting: 2x over CSR • Symmetry: 2.8x over CSR, 2.6x over RB • Cache blocking: 2.2x over CSR • Multiple vectors (SpMM): 7x over CSR • And combinations… • Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels • AAT*x, ATA*x: 4x over CSR, 1.8x over RB • A2*x: 2x over CSR, 1.5x over RB

  12. Potential Impact on Applications: T3P • Source: SLAC [Ko] • 80% of time spent in SpMV • Relevant optimization techniques • Symmetric storage • Register blocking • On Single Processor Itanium 2 • 1.68x speedup • 532 Mflops, or 15% of 3.6 GFlop peak • 4.4x speedup with 8 multiple vectors • 1380 Mflops, or 38% of peak

  13. Potential Impact on Applications: Omega3P • Application: accelerator cavity design [Ko] • Relevant optimization techniques • Symmetric storage • Register blocking • Reordering • Reverse Cuthill-McKee ordering to reduce bandwidth • Traveling Salesman Problem-based ordering to create blocks • Nodes = columns of A • Weights(u, v) = no. of nz u, v have in common • Tour = ordering of columns • Choose maximum weight tour • See [Pinar & Heath ’97] • 2x speedup on Itanium 2, but SPMV not dominant

  14. Source: Accelerator Cavity Design Problem (Ko via Husbands)

  15. 100x100 Submatrix Along Diagonal

  16. Post-RCM Reordering

  17. “Microscopic” Effect of RCM Reordering Before: Green + Red After: Green + Blue

  18. “Microscopic” Effect of Combined RCM+TSP Reordering Before: Green + Red After: Green + Blue

  19. Optimized Sparse Kernel Interface - OSKI • Provides sparse kernels automatically tuned for user’s matrix & machine • BLAS-style functionality: SpMV.,TrSV, … • Hides complexity of run-time tuning • Includes new, faster locality-aware kernels: ATA*x, … • Faster than standard implementations • Up to 4x faster matvec, 1.8x trisolve, 4x ATA*x • For “advanced” users & solver library writers • Available as stand-alone library (Oct ’04) • Available as PETSc extension (Dec ’04) • Lines of code: ?? written by us, ?? generated

  20. How the OSKI Tunes (Overview) Application Run-Time Library Install-Time (offline) 1. Build for Target Arch. 2. Benchmark Workload from program monitoring History Matrix Generated code variants Benchmark data 1. Evaluate Models Heuristic models 2. Select Data Struct. & Code To user: Matrix handle for kernel calls Extensibility: Advanced users may write & dynamically add “Code variants” and “Heuristic models” to system.

  21. How the OSKI Tunes (Overview) • At library build/install-time • Pre-generate and compile code variants into dynamic libraries • Collect benchmark data • Measures and records speed of possible sparse data structure and code variants on target architecture • Installation process uses standard, portable GNU AutoTools • At run-time • Library “tunes” using heuristic models • Models analyze user’s matrix & benchmark data to choose optimized data structure and code • Non-trivial tuning cost: up to ~40 mat-vecs • Library limits the time it spends tuning based on estimated workload • provided by user or inferred by library • User may reduce cost by save tuning results for application on future runs with same or similar matrix

  22. Optimizations in the Initial OSKI Release • Fully automatic heuristics for • Sparse matrix-vector multiply • Register-level blocking • Register-level blocking + symmetry + multiple vectors • Cache-level blocking • Sparse triangular solve with register-level blocking and “switch-to-dense” optimization • Sparse ATA*x with register-level blocking • User may select other optimizations manually • Diagonal storage optimizations, reordering, splitting; tiled matrix powers kernel (Ak*x) • All available in dynamic libraries • Accessible via high-level embedded script language • “Plug-in” extensibility • Very advanced users may write their own heuristics, create new data structures/code variants and dynamically add them to the system

  23. Extra Slides

  24. Example: Combining Optimizations • Register blocking, symmetry, multiple (k) vectors • Three low-level tuning parameters: r, c, v X k v * r c += Y A

  25. Example: Combining Optimizations • Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB] • Symmetric, blocked, 1 vector • Up to 2.6x over nonsymmetric, blocked, 1 vector • Symmetric, blocked, k vectors • Up to 2.1x over nonsymmetric, blocked, k vecs. • Up to 7.3x over nonsymmetric, nonblocked, 1, vector • Symmetric Storage: 64.7% savings

  26. Current Work • Public software release • Impact on library designs: Sparse BLAS, Trilinos, PETSc, … • Integration in large-scale applications • DOE: Accelerator design; plasma physics • Geophysical simulation based on Block Lanczos (ATA*X; LBL) • Systematic heuristics for data structure selection? • Evaluation of emerging architectures • Revisiting vector micros • Other sparse kernels • Matrix triple products, Ak*x • Parallelism • Sparse benchmarks (with UTK) [Gahvari & Hoemmen] • Automatic tuning of MPI collective ops [Nishtala, et al.]

  27. Review of Tuning by Illustration (Extra Slides)

  28. Splitting for Variable Blocks and Diagonals • Decompose A = A1 + A2 + … At • Detect “canonical” structures (sampling) • Split • Tune each Ai • Improve performance and save storage • New data structures • Unaligned block CSR • Relax alignment in rows & columns • Row-segmented diagonals

  29. Example: Variable Block Row (Matrix #12) 2.1x over CSR 1.8x over RB

  30. Example: Row-Segmented Diagonals 2x over CSR

  31. Mixed Diagonal and Block Structure

  32. Raefsky4 (structural problem) + SuperLU + colmmd N=19779, nnz=12.6 M Dense trailing triangle: dim=2268, 20% of total nz Can be as high as 90+%! 1.8x over CSR Example: Sparse Triangular Factor

  33. “axpy” dot product Cache Optimizations for AAT*x • Cache-level: Interleave multiplication by A, AT • Register-level: aiT to be r´c block row, or diag row • Algorithmic-level transformations for A2*x, A3*x, …

  34. Summary • Automated block size selection • Empirical modeling and search • Register blocking for SpMV, triangular solve, ATA*x • Not fully automated • Given a matrix, select splittings and transformations • Lots of combinatorial problems • TSP reordering to create dense blocks (Pinar ’97; Moon, et al. ’04)

  35. Extra Slides

  36. A Sparse Matrix You Encounter Every Day Who am I? I am a Big Repository Of useful And useless Facts alike. Who am I? (Hint: Not your e-mail inbox.)

  37. Problem Context • Sparse kernels abound • Models of buildings, cars, bridges, economies, … • Google PageRank algorithm • Historical trends • Sparse matrix-vector multiply (SpMV): 10% of peak • 2x faster with “hand-tuning” • Tuning becoming more difficult over time • Promise of automatic tuning: PHiPAC/ATLAS, FFTW, … • Challenges to high-performance • Not dense linear algebra! • Complex data structures: indirect, irregular memory access • Performance depends strongly on run-time inputs

  38. Key Questions, Ideas, Conclusions • How to tune basic sparse kernels automatically? • Empirical modeling and search • Up to 4x speedups for SpMV • 1.8x for triangular solve • 4x for ATA*x; 2x for A2*x • 7x for multiple vectors • What are the fundamental limits on performance? • Kernel-, machine-, and matrix-specific upper bounds • Achieve 75% or more for SpMV, limiting low-level tuning • Consequences for architecture? • General techniques for empirical search-based tuning? • Statistical models of performance

  39. Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • Upper bounds on performance • Statistical models of performance

  40. Compressed Sparse Row (CSR) Storage Matrix-vector multiply kernel: y(i) y(i) + A(i,j)*x(j) Matrix-vector multiply kernel: y(i) y(i) + A(i,j)*x(j) for each row i for k=ptr[i] to ptr[i+1] do y[i] = y[i] + val[k]*x[ind[k]] Matrix-vector multiply kernel: y(i) y(i) + A(i,j)*x(j) for each row i for k=ptr[i] to ptr[i+1] do y[i] = y[i] + val[k]*x[ind[k]]

  41. Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • Upper bounds on performance • Statistical models of performance

  42. Historical Trends in SpMV Performance • The Data • Uniprocessor SpMV performance since 1987 • “Untuned” and “Tuned” implementations • Cache-based superscalar micros; some vectors • LINPACK

  43. SpMV Historical Trends: Mflop/s

  44. SpMV Historical Trends: Fraction of Peak

  45. Example: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem

  46. Still More Surprises • More complicated non-zero structure in general

  47. Still More Surprises • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells

  48. Historical Trends: Mixed News • Observations • Good news: Moore’s law like behavior • Bad news: “Untuned” is 10% peak or less, worsening • Good news: “Tuned” roughly 2x better today, and improving • Bad news: Tuning is complex • (Not really news: SpMV is not LINPACK) • Questions • Application: Automatic tuning? • Architect: What machines are good for SpMV?

  49. Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • SpMV [SC’02; IJHPCA ’04b] • Sparse triangular solve (SpTS) [ICS/POHLL ’02] • ATA*x [ICCS/WoPLA ’03] • Upper bounds on performance • Statistical models of performance

More Related