Katherine Yelick Lawrence Berkeley National Laboratory and U. C. Berkeley, EECS Dept.

Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS) Katherine Yelick Lawrence Berkeley National Laboratory and U. C. Berkeley, EECS Dept. November 2004

Outline • Motivation for Automatic Performance Tuning • Recent results for sparse matrix kernels • Application to T3P, Omega3P • OSKI = Optimized Sparse Kernel Interface • Future Work

Prizes • Best Paper, Intern. Conf. Parallel Processing, 2004 • “Performance models for evaluation and automatic performance tuning of symmetric sparse matrix-vector multiply” • Best Student Paper, Intern. Conf. Supercomputing, Workshop on Performance Optimization via High-Level Languages and Libraries, 2003 • Best Student Presentation too, to Richard Vuduc • “Automatic performance tuning and analysis of sparse triangular solve” • Finalist, Best Student Paper, Supercomputing 2002 • To Richard Vuduc • “Performance Optimization and Bounds for Sparse Matrix-vector Multiply” • Best Presentation Prize, MICRO-33: 3rd ACM Workshop on Feedback-Directed Dynamic Optimization, 2000 • To Richard Vuduc • “Statistical Modeling of Feedback Data in an Automatic Tuning System”

Motivation for Automatic Performance Tuning • Historical trends • Sparse matrix-vector multiply (SpMV): 10% of peak or less • 2x faster than CSR with “hand-tuning” • Tuning becoming more difficult over time • Performance depends on machine, kernel, matrix • Matrix known at run-time • Best data structure + implementation can be surprising • Our approach: empirical modeling and search • Up to 4x speedups and 31% of peak for SpMV • Many optimization techniques for SpMV • Several other kernels: triangular solve, ATA*x, Ak*x • Proof-of-concept: Integrate with Omega3P • Release OSKI Library, integrate into PETSc

Example: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem • 8x8 dense substructure

Speedups on Itanium 2: The Need for Search • [Figure: performance in Mflop/s by block size] • Reference • Best: 4x2

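SpMV Performance (Matrix #2): Generation 2 • [Figure: one performance profile per platform; each entry gives the platform, its percentage of peak, and the highest and lowest Mflop/s labels on its plot] • Ultra 2i - 9%: 63 Mflop/s, 35 Mflop/s • Ultra 3 - 6%: 109 Mflop/s, 53 Mflop/s • Pentium III - 19%: 96 Mflop/s, 42 Mflop/s • Pentium III-M - 15%: 120 Mflop/s, 58 Mflop/s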

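SpMV Performance (Matrix #2): Generation 1 • [Figure: one performance profile per platform; each entry gives the platform, its percentage of peak, and the highest and lowest Mflop/s labels on its plot] • Power3 - 13%: 195 Mflop/s, 100 Mflop/s • Power4 - 14%: 703 Mflop/s, 469 Mflop/s • Itanium 1 - 7%: 225 Mflop/s, 103 Mflop/s • Itanium 2 - 31%: 1.1 Gflop/s, 276 Mflop/s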

Opteron Performance Profile

Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • Fill-in explicit zeros • Unroll 3x3 block multiplies • “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup!
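The “fill ratio” on this slide can be made concrete with a small sketch. It assumes the matrix is held in CSR arrays `ptr` and `ind` and that blocks are aligned to a logical r x c grid, as in the 3x3 example; the function name and the toy matrix are illustrative, not taken from the slides.

```python
def fill_ratio(ptr, ind, r, c):
    """Stored entries after aligned r x c blocking, divided by true nonzeros."""
    n_rows = len(ptr) - 1
    nnz = ptr[n_rows]
    blocks = set()
    for i in range(n_rows):
        for k in range(ptr[i], ptr[i + 1]):
            # Each nonzero marks the grid cell (block) it falls into.
            blocks.add((i // r, ind[k] // c))
    # Every touched block is stored densely, explicit zeros included.
    return len(blocks) * r * c / nnz


# Toy 6x6 matrix: two 3x3 diagonal blocks, each holding a lower triangle
# (6 nonzeros per block, 12 in total).
ptr = [0, 1, 3, 6, 7, 9, 12]
ind = [0, 0, 1, 0, 1, 2, 3, 3, 4, 3, 4, 5]
ratio_3x3 = fill_ratio(ptr, ind, 3, 3)  # 2 blocks * 9 entries / 12 nonzeros
ratio_1x1 = fill_ratio(ptr, ind, 1, 1)
```

With this toy input the 3x3 blocking stores 18 values for 12 true nonzeros, the same 1.5 fill ratio quoted on the slide; whether the extra flops pay off still depends on the machine.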

Summary of Performance Optimizations • Optimizations for SpMV • Register blocking (RB): up to 4x over CSR • Variable block splitting: 2.1x over CSR, 1.8x over RB • Diagonals: 2x over CSR • Reordering to create dense structure + splitting: 2x over CSR • Symmetry: 2.8x over CSR, 2.6x over RB • Cache blocking: 2.2x over CSR • Multiple vectors (SpMM): 7x over CSR • And combinations… • Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels • AAT*x, ATA*x: 4x over CSR, 1.8x over RB • A2*x: 2x over CSR, 1.5x over RB

Potential Impact on Applications: T3P • Source: SLAC [Ko] • 80% of time spent in SpMV • Relevant optimization techniques • Symmetric storage • Register blocking • On Single Processor Itanium 2 • 1.68x speedup • 532 Mflops, or 15% of 3.6 GFlop peak • 4.4x speedup with 8 multiple vectors • 1380 Mflops, or 38% of peak

Potential Impact on Applications: Omega3P • Application: accelerator cavity design [Ko] • Relevant optimization techniques • Symmetric storage • Register blocking • Reordering • Reverse Cuthill-McKee ordering to reduce bandwidth • Traveling Salesman Problem-based ordering to create blocks • Nodes = columns of A • Weights(u, v) = no. of nz u, v have in common • Tour = ordering of columns • Choose maximum weight tour • See [Pinar & Heath ’97] • 2x speedup on Itanium 2, but SPMV not dominant
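The graph behind the TSP-based ordering can be sketched as follows. The weight definition follows the slide (number of nonzeros two columns have in common, read here as the number of rows where both are nonzero). The greedy nearest-neighbour path is only a simple stand-in chosen for brevity; it is not the solver used in the work cited, and all names are hypothetical.

```python
from itertools import combinations


def column_overlap_weights(ptr, ind):
    """weights[(u, v)] = number of rows where columns u < v are both nonzero."""
    weights = {}
    for i in range(len(ptr) - 1):
        cols = sorted(set(ind[ptr[i]:ptr[i + 1]]))
        for u, v in combinations(cols, 2):
            weights[(u, v)] = weights.get((u, v), 0) + 1
    return weights


def greedy_column_order(n_cols, weights, start=0):
    """Greedy heavy-edge path over the columns (stand-in for a TSP solver)."""
    order, left = [start], set(range(n_cols)) - {start}
    while left:
        u = order[-1]
        # Heaviest edge to an unvisited column; ties go to the lowest index.
        nxt = max(sorted(left),
                  key=lambda v: weights.get((min(u, v), max(u, v)), 0))
        order.append(nxt)
        left.remove(nxt)
    return order


# Toy 3x4 matrix: rows 0 and 1 use columns {0, 2}; row 2 uses columns {1, 3}.
ptr = [0, 2, 4, 6]
ind = [0, 2, 0, 2, 1, 3]
w = column_overlap_weights(ptr, ind)
order = greedy_column_order(4, w)
```

In the toy case the resulting order places columns 0 and 2 next to each other, so rows 0 and 1 gain a dense 2x2 block after permutation.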

Source: Accelerator Cavity Design Problem (Ko via Husbands)

100x100 Submatrix Along Diagonal

Post-RCM Reordering

“Microscopic” Effect of RCM Reordering Before: Green + Red After: Green + Blue

“Microscopic” Effect of Combined RCM+TSP Reordering Before: Green + Red After: Green + Blue

Optimized Sparse Kernel Interface - OSKI • Provides sparse kernels automatically tuned for user’s matrix & machine • BLAS-style functionality: SpMV, TrSV, … • Hides complexity of run-time tuning • Includes new, faster locality-aware kernels: ATA*x, … • Faster than standard implementations • Up to 4x faster matvec, 1.8x trisolve, 4x ATA*x • For “advanced” users & solver library writers • Available as stand-alone library (Oct ’04) • Available as PETSc extension (Dec ’04) • Lines of code: ?? written by us, ?? generated

How OSKI Tunes (Overview) • Library Install-Time (offline) • 1. Build for Target Arch. • 2. Benchmark • Outputs: Generated code variants, Benchmark data • Application Run-Time • Inputs: Matrix, Workload from program monitoring, History • 1. Evaluate Models (Heuristic models) • 2. Select Data Struct. & Code • To user: Matrix handle for kernel calls • Extensibility: Advanced users may write & dynamically add “Code variants” and “Heuristic models” to system.

How OSKI Tunes (Overview) • At library build/install-time • Pre-generate and compile code variants into dynamic libraries • Collect benchmark data • Measures and records speed of possible sparse data structure and code variants on target architecture • Installation process uses standard, portable GNU AutoTools • At run-time • Library “tunes” using heuristic models • Models analyze user’s matrix & benchmark data to choose optimized data structure and code • Non-trivial tuning cost: up to ~40 mat-vecs • Library limits the time it spends tuning based on estimated workload • provided by user or inferred by library • User may reduce cost by saving tuning results for application on future runs with same or similar matrix
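One plausible shape for such a heuristic model is sketched below. It combines install-time benchmark data with a property of the user’s matrix, then checks whether tuning pays for itself. The scoring rule (benchmark Mflop/s divided by fill ratio), the break-even test, and every number in the example are assumptions for illustration; this is not OSKI’s API or its actual model.

```python
def choose_block_size(bench_mflops, fill):
    """Pick the (r, c) with the best estimated Mflop/s on the user's matrix.

    bench_mflops: install-time speed of each r x c kernel (hypothetical data).
    fill: fill ratio of the user's matrix for each r x c.
    """
    best, best_est = None, 0.0
    for rc, mflops in bench_mflops.items():
        est = mflops / fill[rc]  # explicit zeros dilute the useful flops
        if est > best_est:
            best, best_est = rc, est
    return best, best_est


def worth_tuning(expected_spmv_calls, speedup, tuning_cost_in_spmvs=40):
    """True if time saved over the workload exceeds the tuning cost,
    both measured in units of one untuned SpMV."""
    saved = expected_spmv_calls * (1.0 - 1.0 / speedup)
    return saved > tuning_cost_in_spmvs


# Made-up benchmark and fill data, purely for illustration.
bench = {(1, 1): 100.0, (2, 2): 180.0, (3, 3): 240.0, (4, 2): 260.0}
fill = {(1, 1): 1.0, (2, 2): 1.3, (3, 3): 1.5, (4, 2): 2.0}
choice, estimate = choose_block_size(bench, fill)
speedup = estimate / bench[(1, 1)]
```

The default of 40 mirrors the “up to ~40 mat-vecs” tuning cost on the slide: a workload of only 100 multiplies at a 1.6x speedup would not recover it, while 1000 multiplies would.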

Optimizations in the Initial OSKI Release • Fully automatic heuristics for • Sparse matrix-vector multiply • Register-level blocking • Register-level blocking + symmetry + multiple vectors • Cache-level blocking • Sparse triangular solve with register-level blocking and “switch-to-dense” optimization • Sparse ATA*x with register-level blocking • User may select other optimizations manually • Diagonal storage optimizations, reordering, splitting; tiled matrix powers kernel (Ak*x) • All available in dynamic libraries • Accessible via high-level embedded script language • “Plug-in” extensibility • Very advanced users may write their own heuristics, create new data structures/code variants and dynamically add them to the system

Extra Slides

Example: Combining Optimizations • Register blocking, symmetry, multiple (k) vectors • Three low-level tuning parameters: r, c, v • [Figure: Y += A * X, with r and c marked on a block of A, and k and v marked on the vectors X]

Example: Combining Optimizations • Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB] • Symmetric, blocked, 1 vector • Up to 2.6x over nonsymmetric, blocked, 1 vector • Symmetric, blocked, k vectors • Up to 2.1x over nonsymmetric, blocked, k vecs. • Up to 7.3x over nonsymmetric, nonblocked, 1 vector • Symmetric Storage: 64.7% savings
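The symmetry optimization can be illustrated with a minimal, unblocked sketch for one vector. It assumes only the upper triangle (diagonal included) of a symmetric matrix is stored in CSR; the function name is hypothetical, and register blocking and multiple vectors are left out to keep the idea visible.

```python
def sym_csr_spmv(ptr, ind, val, x):
    """y = A*x where only the upper triangle of symmetric A is stored in CSR."""
    n = len(ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(ptr[i], ptr[i + 1]):
            j = ind[k]
            y[i] += val[k] * x[j]
            if j != i:
                # The same stored value also stands for A(j, i).
                y[j] += val[k] * x[i]
    return y


# A = [[2, 1, 0],
#      [1, 3, 4],
#      [0, 4, 5]]   stored as its upper triangle only.
ptr = [0, 2, 4, 5]
ind = [0, 1, 1, 2, 2]
val = [2.0, 1.0, 3.0, 4.0, 5.0]
y = sym_csr_spmv(ptr, ind, val, [1.0, 2.0, 3.0])
```

Each off-diagonal value is loaded once and used twice, which is where both the storage savings and the reduced memory traffic come from.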

Current Work • Public software release • Impact on library designs: Sparse BLAS, Trilinos, PETSc, … • Integration in large-scale applications • DOE: Accelerator design; plasma physics • Geophysical simulation based on Block Lanczos (ATA*X; LBL) • Systematic heuristics for data structure selection? • Evaluation of emerging architectures • Revisiting vector micros • Other sparse kernels • Matrix triple products, Ak*x • Parallelism • Sparse benchmarks (with UTK) [Gahvari & Hoemmen] • Automatic tuning of MPI collective ops [Nishtala, et al.]

Review of Tuning by Illustration (Extra Slides)

Splitting for Variable Blocks and Diagonals • Decompose A = A1 + A2 + … + At • Detect “canonical” structures (sampling) • Split • Tune each Ai • Improve performance and save storage • New data structures • Unaligned block CSR • Relax alignment in rows & columns • Row-segmented diagonals

Example: Variable Block Row (Matrix #12) 2.1x over CSR 1.8x over RB

Example: Row-Segmented Diagonals 2x over CSR

Mixed Diagonal and Block Structure

Example: Sparse Triangular Factor • Raefsky4 (structural problem) + SuperLU + colmmd • N=19779, nnz=12.6 M • Dense trailing triangle: dim=2268, 20% of total nz • Can be as high as 90+%! • 1.8x over CSR
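The hybrid sparse/dense idea behind this example can be sketched as a lower-triangular solve split into a sparse leading part and a dense trailing triangle. The partitioning into `L11`, `L21`, `L22`, the convention that each sparse row stores its diagonal last, and all names are assumptions made for this sketch; in practice the dense part would be handed to a tuned dense routine.

```python
def sparse_lower_solve(ptr, ind, val, b):
    """Forward substitution for a CSR lower-triangular matrix.
    Assumes each row stores its diagonal entry last."""
    x = []
    for i in range(len(ptr) - 1):
        s = b[i]
        for k in range(ptr[i], ptr[i + 1] - 1):
            s -= val[k] * x[ind[k]]
        x.append(s / val[ptr[i + 1] - 1])
    return x


def dense_lower_solve(L, b):
    """Forward substitution for a dense lower triangle (list of rows)."""
    x = []
    for i, row in enumerate(L):
        s = b[i] - sum(row[j] * x[j] for j in range(i))
        x.append(s / row[i])
    return x


def hybrid_lower_solve(L11, L21, L22, b):
    """Solve [[L11, 0], [L21, L22]] x = b.
    L11 and L21 are CSR triples (ptr, ind, val); L22 is dense."""
    n1 = len(L11[0]) - 1
    x1 = sparse_lower_solve(*L11, b[:n1])
    ptr, ind, val = L21
    # Fold the sparse coupling block into the right-hand side.
    b2 = [b[n1 + i] - sum(val[k] * x1[ind[k]] for k in range(ptr[i], ptr[i + 1]))
          for i in range(len(ptr) - 1)]
    return x1 + dense_lower_solve(L22, b2)


# L = [[2, 0, 0, 0],
#      [1, 1, 0, 0],
#      [0, 3, 4, 0],
#      [5, 0, 6, 2]]  split after row/column 2; b chosen so that x = [1, 2, 3, 4].
L11 = ([0, 1, 3], [0, 0, 1], [2.0, 1.0, 1.0])
L21 = ([0, 1, 2], [1, 0], [3.0, 5.0])
L22 = [[4.0, 0.0], [6.0, 2.0]]
x = hybrid_lower_solve(L11, L21, L22, [2.0, 3.0, 18.0, 31.0])
```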

Cache Optimizations for AAT*x • [Figure labels: “axpy”, dot product] • Cache-level: Interleave multiplication by A, AT • Register-level: aiT to be r×c block row, or diag row • Algorithmic-level transformations for A2*x, A3*x, …
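The cache-level interleaving can be sketched in a few lines. The version below computes ATA*x rather than AAT*x, because CSR gives direct access to rows; for AAT*x the same loop would run over columns instead. It is an unblocked illustration with a hypothetical function name, not the tuned kernel.

```python
def csr_ata_x(ptr, ind, val, x, n_cols):
    """y = A^T * (A * x) in a single pass over the CSR rows of A."""
    y = [0.0] * n_cols
    for i in range(len(ptr) - 1):
        # Dot product: t = (row i of A) . x
        t = 0.0
        for k in range(ptr[i], ptr[i + 1]):
            t += val[k] * x[ind[k]]
        # "axpy": y += t * (row i of A), reusing the row while it is in cache.
        for k in range(ptr[i], ptr[i + 1]):
            y[ind[k]] += t * val[k]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
ptr = [0, 2, 3, 5]
ind = [0, 2, 1, 0, 2]
val = [1.0, 2.0, 3.0, 4.0, 5.0]
y = csr_ata_x(ptr, ind, val, [1.0, 1.0, 1.0], 3)
```

Compared with two separate multiplies, each row of A is read from memory once instead of twice, and the intermediate vector A*x is never stored.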

Summary • Automated block size selection • Empirical modeling and search • Register blocking for SpMV, triangular solve, ATA*x • Not fully automated • Given a matrix, select splittings and transformations • Lots of combinatorial problems • TSP reordering to create dense blocks (Pinar ’97; Moon, et al. ’04)

Extra Slides

A Sparse Matrix You Encounter Every Day Who am I? I am a Big Repository Of useful And useless Facts alike. Who am I? (Hint: Not your e-mail inbox.)

Problem Context • Sparse kernels abound • Models of buildings, cars, bridges, economies, … • Google PageRank algorithm • Historical trends • Sparse matrix-vector multiply (SpMV): 10% of peak • 2x faster with “hand-tuning” • Tuning becoming more difficult over time • Promise of automatic tuning: PHiPAC/ATLAS, FFTW, … • Challenges to high-performance • Not dense linear algebra! • Complex data structures: indirect, irregular memory access • Performance depends strongly on run-time inputs

Key Questions, Ideas, Conclusions • How to tune basic sparse kernels automatically? • Empirical modeling and search • Up to 4x speedups for SpMV • 1.8x for triangular solve • 4x for ATA*x; 2x for A2*x • 7x for multiple vectors • What are the fundamental limits on performance? • Kernel-, machine-, and matrix-specific upper bounds • Achieve 75% or more for SpMV, limiting low-level tuning • Consequences for architecture? • General techniques for empirical search-based tuning? • Statistical models of performance

Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • Upper bounds on performance • Statistical models of performance

Compressed Sparse Row (CSR) Storage • Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)
for each row i
  for k = ptr[i] to ptr[i+1]-1 do
    y[i] = y[i] + val[k]*x[ind[k]]
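A runnable version of the CSR kernel, as a minimal sketch: it assumes 0-based indexing, with `ptr[i]` pointing at the first stored nonzero of row i and `ptr[i+1]` one past the last. The function name and the example matrix are illustrative.

```python
def csr_spmv(ptr, ind, val, x):
    """Return y = A*x for A stored in CSR form (ptr, ind, val)."""
    n_rows = len(ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Nonzeros of row i occupy positions ptr[i] .. ptr[i+1]-1.
        for k in range(ptr[i], ptr[i + 1]):
            y[i] += val[k] * x[ind[k]]  # indirect access to x via ind
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
ptr = [0, 2, 3, 5]
ind = [0, 2, 1, 0, 2]
val = [1.0, 2.0, 3.0, 4.0, 5.0]
y = csr_spmv(ptr, ind, val, [1.0, 1.0, 1.0])
```

The indirect load `x[ind[k]]` is the irregular memory access that makes this kernel hard to tune.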

Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • Upper bounds on performance • Statistical models of performance

Historical Trends in SpMV Performance • The Data • Uniprocessor SpMV performance since 1987 • “Untuned” and “Tuned” implementations • Cache-based superscalar micros; some vectors • LINPACK

SpMV Historical Trends: Mflop/s

SpMV Historical Trends: Fraction of Peak

Example: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem

Still More Surprises • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells

Historical Trends: Mixed News • Observations • Good news: Moore’s law like behavior • Bad news: “Untuned” is 10% peak or less, worsening • Good news: “Tuned” roughly 2x better today, and improving • Bad news: Tuning is complex • (Not really news: SpMV is not LINPACK) • Questions • Application: Automatic tuning? • Architect: What machines are good for SpMV?

Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques • SpMV [SC’02; IJHPCA ’04b] • Sparse triangular solve (SpTS) [ICS/POHLL ’02] • ATA*x [ICCS/WoPLA ’03] • Upper bounds on performance • Statistical models of performance

Katherine Yelick, BIPS Director Lawrence Berkeley National Laboratory and