Performance Engineering Research Institute (DOE SciDAC)

Presentation Transcript

  1. Performance Engineering Research Institute (DOE SciDAC)
  Katherine Yelick, LBNL and UC Berkeley

  2. Performance Engineering Enabling Petascale Science
  • Performance engineering is getting harder:
    • Systems are more complicated: O(100K) processing nodes; multi-core with SIMD extensions
    • Applications are more complicated: multi-disciplinary and multi-scale
  • PERI approach:
    • Modeling: performance prediction
    • Application engagement: assist in performance engineering
    • Automatic performance tuning: tools to improve performance
  [Images: IBM BlueGene at LLNL; Cray XT3 at ORNL; POP model of El Niño; Beam3D accelerator modeling]

  3. Engaging SciDAC Software Developers
  • Application Engagement
    • Work with DOE computational scientists
    • Ensure successful performance porting of scientific software
    • Focus PERI research on real problems
  • Application Liaisons: build long-term personal relationships
  • Tiger Teams: focus on DOE's highest priorities
    • SciDAC-2 applications
    • LCF Pioneering applications
    • INCITE applications
  [Images: optimizing arithmetic kernels; maximizing scientific throughput]

  4. Automatic Performance Tuning of Scientific Code
  Long-term goals of PERI:
  • Automate the process of tuning
  • Improve performance portability
  • Address the performance-expert shortage: replace human time with computer time
  • Build on 40 years of human experience and on recent success with auto-tuned libraries (ATLAS, FFTW, OSKI)
  [Diagram: PERI automatic tuning framework. Guidance from external software (measurements, models, hardware information, sample input, annotations, assertions) feeds source code through triage, analysis, domain-specific code generation, and transformations, then code generation, code selection, and application assembly; training runs and production execution supply runtime performance data and drive runtime adaptation via a persistent database.]

  5. Participating Institutions
  Lead PI: Bob Lucas
  Institutions:
  • Argonne National Laboratory
  • Lawrence Berkeley National Laboratory
  • Lawrence Livermore National Laboratory
  • Oak Ridge National Laboratory
  • Rice University
  • University of California at San Diego
  • University of Maryland
  • University of North Carolina
  • University of Southern California
  • University of Tennessee

  6. Major Tuning Activities in PERI
  • Triage: discover tuning targets
    • Identifying bottlenecks (HPCToolkit)
    • Using hardware events (PAPI)
  • Library-based tuning
    • Dense linear algebra (ATLAS)
    • Sparse linear algebra (OSKI)
    • Stencil operations
  • Application-based tuning
    • Parameterized applications (Active Harmony)
    • Automatic source-based tuning (ROSE and CG)

  7. Triage Tools: HPCToolkit
  Goal: discover tuning opportunities (Rice)
  Features of HPCToolkit:
  • Ease of use: no manual code instrumentation; handles large multi-lingual codes
  • Detailed measurements of both communication and computation, at many granularities: node, core, procedure, loop, and statement
  • Identifies inefficiencies in code:
    • Parallel inefficiencies: load imbalance, communication overhead, etc.
    • Computation inefficiencies: pipeline stalls, memory bottlenecks, etc.

  8. On-line Hardware Monitoring: PAPI
  Goal: a machine-independent Performance API (UTK)
  • Multi-substrate support, recently added to PAPI, enables simultaneous monitoring of:
    • On-processor counters
    • Off-processor counters (e.g., network counters)
    • Temperature sensors
    • Heterogeneous multi-core hybrid systems
  • Online monitoring will help enable runtime tuning
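
  Reading counters through PAPI is a few calls around the code of interest. A minimal sketch using the classic high-level counter API of the PAPI generation contemporary with this work (event availability varies by platform, and the kernel and array size here are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    #define N (1 << 20)

    /* Work to be measured. */
    static void kernel(double *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * a[i] + 1.0;
    }

    int main(void) {
        int events[2] = { PAPI_TOT_CYC, PAPI_L2_TCM };  /* cycles, L2 misses */
        long long counts[2];
        double *a = calloc(N, sizeof(double));

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        if (PAPI_start_counters(events, 2) != PAPI_OK)  /* begin counting */
            return 1;
        kernel(a, N);
        PAPI_stop_counters(counts, 2);                  /* read and stop */
        printf("cycles = %lld, L2 misses = %lld\n", counts[0], counts[1]);
        free(a);
        return 0;
    }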

  9. Major Tuning Activities in PERI
  • Triage: discover tuning targets
    • Identifying bottlenecks (HPCToolkit)
    • Using hardware events (PAPI)
  • Library-based tuning
    • Dense linear algebra (ATLAS)
    • Sparse linear algebra (OSKI)
    • Stencil operations
  • Application-based tuning
    • Parameterized applications (Active Harmony)
    • Automatic source-based tuning (ROSE and CG)

  10. Dense Linear Algebra: ATLAS
  Goal: auto-tuning for dense linear algebra (UTK)
  ATLAS features and plans:
  • Performance portability across processors
  • Massively multi-threaded and multi-core architectures, which require:
    • Asynchrony (e.g., lookahead)
    • Modern vectorization (SIMD extensions)
    • Hiding of memory latency
    • Overlap of communication with computation
  • Hand techniques being automated
  • Better search algorithms and parallel search
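
  The core of ATLAS's approach is empirical search: generate candidate kernels over tuning parameters, time each on the target machine, and keep the best. A minimal sketch of that idea for a single parameter, the matrix-multiply tile size (this is an illustration of the search, not ATLAS's actual code generator):

    #include <stdio.h>
    #include <time.h>

    #define N 512
    static double A[N][N], B[N][N], C[N][N];

    /* Tiled matrix multiply; NB is the tuning parameter under search
     * (candidates below all divide N evenly). */
    static void mm_tiled(int NB) {
        for (int ii = 0; ii < N; ii += NB)
            for (int kk = 0; kk < N; kk += NB)
                for (int jj = 0; jj < N; jj += NB)
                    for (int i = ii; i < ii + NB; i++)
                        for (int k = kk; k < kk + NB; k++)
                            for (int j = jj; j < jj + NB; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }

    int main(void) {
        int candidates[] = { 16, 32, 64, 128 }, best_nb = 0;
        double best_time = 1e30;
        /* Time each candidate; C is not reset between trials since we
         * only compare run times, not results. */
        for (int t = 0; t < 4; t++) {
            clock_t t0 = clock();
            mm_tiled(candidates[t]);
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (secs < best_time) { best_time = secs; best_nb = candidates[t]; }
        }
        printf("best tile size: %d (%.3f s)\n", best_nb, best_time);
        return 0;
    }

  The real system searches many interacting parameters (unrolling, prefetch, register blocking) and caches the winner at install time.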

  11. Sparse Linear Algebra
  • OSKI: Optimized Sparse Kernel Interface (Berkeley)
  • Extra work can improve performance
  • Decisions cannot be made offline: the matrix structure is needed
  • Example:
    • Pad 3x3 blocks with zeros
    • "Fill ratio" = 1.5
    • Pentium III speedup: 1.5x
  Joint work with the BeBOP group
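
  Register blocking stores the matrix as small dense r-by-c blocks, zero-padded where needed, so the inner loop reuses registers instead of doing an indexed load per nonzero. A minimal sketch of a 3x3 block-CSR (BCSR) SpMV kernel, assuming the matrix has already been converted to blocked format:

    /* y += A*x for a matrix in 3x3 block-CSR (BCSR) format.
     * brow_ptr[i]..brow_ptr[i+1] index the blocks of block-row i;
     * bcol_ind gives each block's column (in block units);
     * val stores each 3x3 block contiguously, row-major (9 doubles). */
    void bcsr_spmv_3x3(int n_brows, const int *brow_ptr, const int *bcol_ind,
                       const double *val, const double *x, double *y) {
        for (int i = 0; i < n_brows; i++) {
            /* Keep the three outputs of this block row in registers. */
            double y0 = y[3*i], y1 = y[3*i + 1], y2 = y[3*i + 2];
            for (int b = brow_ptr[i]; b < brow_ptr[i + 1]; b++) {
                const double *v  = &val[9 * b];
                const double *xp = &x[3 * bcol_ind[b]];
                y0 += v[0]*xp[0] + v[1]*xp[1] + v[2]*xp[2];
                y1 += v[3]*xp[0] + v[4]*xp[1] + v[5]*xp[2];
                y2 += v[6]*xp[0] + v[7]*xp[1] + v[8]*xp[2];
            }
            y[3*i] = y0; y[3*i + 1] = y1; y[3*i + 2] = y2;
        }
    }

  The padding zeros cost extra flops (the fill ratio of 1.5 above), but the regular access pattern more than pays for them on the Pentium III.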

  12. Optimizations Available in OSKI
  • Optimizations for SpMV:
    • Register blocking (RB): up to 4x over CSR
    • Variable block splitting: 2.1x over CSR, 1.8x over RB
    • Diagonals: 2x over CSR
    • Reordering to create dense structure, plus splitting: 2x over CSR
    • Symmetry: 2.8x over CSR, 2.6x over RB
    • Cache blocking: 3x over CSR
    • Multiple vectors (SpMM): 7x over CSR
  • Sparse triangular solve:
    • Hybrid sparse/dense data structure: 1.8x over CSR
  • Higher-level kernels (focus for new work):
    • AA^T·x and A^T·A·x: 4x over CSR, 1.8x over RB
    • A^k·x (matrix powers): 2x over CSR, 1.5x over RB
  • New: vector and multicore support, better code generation
  Joint work with the BeBOP group; see R. Vuduc's PhD thesis

  13. OSKI-PETSc Proof-of-Concept Results
  • Recent work by Rich Vuduc: integration of OSKI into PETSc
  • Example matrix: accelerator cavity design (SLAC)
    • N ~ 1 M, ~40 M non-zeros
    • 2x2 dense block substructure
    • Uses register blocking and symmetry
  • Improves performance of the local computation
  • Preliminary speedup numbers: 1.6x on an 8-node Xeon cluster
  Joint work with the BeBOP group; see R. Vuduc's PhD thesis

  14. Stencil Computations
  • Stencils have simple inner loops
    • Typically ~1 FLOP per load
    • Run at a small fraction of peak (<15%)!
  • Strategies: minimize cache misses (a sketch of single-sweep blocking follows)
    • Cache blocking within one sweep
    • Time skewing (and blocking): merge across iterations
    • Cache-oblivious: recursive blocking across iterations
  • Observations:
    • Iteration merging only works in some algorithms!
    • Reducing misses does not always minimize time
    • Prefetch is as important as caches (unit-stride runs)
    • 1D, 2D, and 3D behave differently in practice
  Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
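
  A minimal sketch of cache blocking a 7-point 3D stencil within a single sweep. The grid size, blocking factors, and the choice to leave the unit-stride dimension unblocked (which matches the summary findings later in this deck) are illustrative; real tuning searches over them:

    #define NX 256
    #define NY 256
    #define NZ 256
    #define IDX(i,j,k) ((size_t)(i)*NY*NZ + (size_t)(j)*NZ + (size_t)(k))

    /* One blocked sweep of a 7-point Laplacian-style stencil.
     * Blocking i and j shrinks the working set of planes held in cache;
     * the unit-stride k loop is left unblocked so the hardware
     * prefetcher sees long contiguous runs. */
    void stencil_blocked(const double *in, double *out, int bi, int bj) {
        for (int ii = 1; ii < NX - 1; ii += bi)
          for (int jj = 1; jj < NY - 1; jj += bj)
            for (int i = ii; i < ii + bi && i < NX - 1; i++)
              for (int j = jj; j < jj + bj && j < NY - 1; j++)
                for (int k = 1; k < NZ - 1; k++)
                  out[IDX(i,j,k)] =
                      in[IDX(i-1,j,k)] + in[IDX(i+1,j,k)] +
                      in[IDX(i,j-1,k)] + in[IDX(i,j+1,k)] +
                      in[IDX(i,j,k-1)] + in[IDX(i,j,k+1)] -
                      6.0 * in[IDX(i,j,k)];
    }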

  15. Cache-Optimized Stencils

  16. Tuning for the Cell Architecture
  • Cell will be used in the PS3, hence high volume
  • Current system problems:
    • Off-chip bandwidth and power
    • Double-precision floating-point interface: ~14x slower than single precision
      • A problem for computationally intensive kernels (BLAS3)
      • Consider a variation, called Cell+, that fixes this
  • Memory system:
    • Software-controlled memory (like explicit out-of-core)
    • Improves bandwidth and power usage, but increases programming complexity
  Joint work with S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil

  17. Scientific Kernels on Cell (double precision)
  Joint work with S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil

  18. Major Tuning Activities in PERI
  • Triage: discover tuning targets
    • Identifying bottlenecks (HPCToolkit)
    • Using hardware events (PAPI)
  • Library-based tuning
    • Dense linear algebra (ATLAS)
    • Sparse linear algebra (OSKI)
    • Stencil operations
  • Application-based tuning
    • Parameterized applications (Active Harmony)
    • Automatic source-based tuning (ROSE and CG)

  19. User-Assisted Runtime Performance Optimization
  Active Harmony: runtime optimization (UMD)
  • Automatic library selection (code): monitor library performance; switch libraries if necessary
  • Automatic performance tuning (parameters): monitor system performance; adjust runtime parameters (a generic sketch of this loop follows)
  • Results:
    • Cluster-based web service: up to 16% improvement
    • POP: up to 17% improvement
    • GS2: up to 3.4x faster
  • New: improved search algorithms
  • Tuning of component-based software (ANL)
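
  The parameter-tuning side is a measure-and-adjust feedback loop between application timesteps. A hypothetical sketch of the idea using simple hill climbing on one runtime parameter; this is not Active Harmony's actual client API, and its real search algorithms are considerably more sophisticated:

    #include <stdio.h>
    #include <time.h>

    #define GRID (1 << 20)

    /* Hypothetical application timestep whose speed depends on a runtime
     * parameter (a stand-in for, e.g., POP's block dimension). */
    static void run_timestep(int block_size) {
        static double grid[GRID];
        for (int jj = 0; jj < GRID; jj += block_size)     /* blocked traversal */
            for (int j = jj; j < jj + block_size && j < GRID; j++)
                grid[j] = 0.5 * grid[j] + 1.0;
    }

    static double timed_step(int block_size) {
        clock_t t0 = clock();
        run_timestep(block_size);
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void) {
        int bs = 64, stride = 64;          /* current parameter, search step */
        double prev = timed_step(bs);
        for (int s = 0; s < 20; s++) {     /* tune across timesteps */
            int trial = bs + stride;
            if (trial < 16 || trial > 4096) { stride = -stride; continue; }
            double t = timed_step(trial);
            if (t < prev) { bs = trial; prev = t; }  /* keep improvements */
            else          { stride = -stride; }      /* else reverse direction */
        }
        printf("selected block_size = %d (%.4f s/step)\n", bs, prev);
        return 0;
    }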

  20. Active Harmony Example: Parallel Ocean Program (POP)
  • Parameterized over block dimension
  • Problem size: 3600x2400 on 480 processors (NERSC IBM SP, Seaborg)
  • Up to 15% improvement in execution time

  21. Kernel Extraction (Code Isolator)
  Original program:
    void main() {
      OutlineFunc(<InputParameters>);
    }
    void OutlineFunc(<InputParameters>) {
      /* code fragment to be executed */
    }
  Isolated program:
    <InputParameters> = SetInitialDataValues;
    StoreInitialDataValues;
    CaptureMachineState;
    SetMachineState;
    /* isolated code */
  Status labels from the diagram: automated in SUIF; manual; to be automated in Open64. [LCPC '04]

  22. ROSE Project
  • Software analysis and optimization for scientific applications
  • A tool for building source-to-source translators
    • Support for C and C++; F90 in development
  • Loop optimizations, performance analysis, software engineering
  • Lab, academic, and industry use
  • Domain-specific analysis and optimizations
  • Development of new optimization approaches
  • Optimization of object-oriented abstractions

  23. Source-Based Optimizations in ROSE
  • Robustness (handles real lab applications):
    • Kull (ASC), ALE3D (ASC), Ares (ASC, in progress), hypre, IRS (ASC benchmark), Khola (ASC benchmark), CHOMBO (LBL AMR framework), ROSE (compiles itself, in progress)
    • Ten separate million-line applications (current work)
  • Custom application analysis:
    • Analysis to find and remove static class initialization
    • Custom loop classification for Kull loop structures
    • Analysis for combined procedure inlining, code motion, and loop fusion for Kull loop structures; the optimization was done by hand in one loop by Brian Miller, and the search for other opportunities in Kull is being automated (current work)
  • Optimization:
    • Demonstrated data-structure splitting for Kull (2x improvement on a Kull benchmark; currently implemented by hand in Kull)

  24. Empirical Optimization (loop fusion, with Ken Kennedy and students)
  • Empirical evaluation of hundreds of loop fusion options for the hyperbolic PPM scheme, ~50 loops (PPM by Woodward/Colella)
  • Uses the ROSE loop optimizer with parameters to control loop fusion
  • Static evaluation: faster performance, slower to generate
  • Dynamic evaluation: slower performance, correlated to cleanly generated code, 10x faster to evaluate the search space
  Un-fused loops:
    for (i = 0; i < size; i++)
      a[i] = c[i] + d;
    for (i = 0; i < size; i++)
      b[i] = c[i] - d;
  Fused loops:
    for (i = 0; i < size; i++) {
      a[i] = c[i] + d;
      b[i] = c[i] - d;
    }

  25. Matrix Multiply: Comparison of ECO, ATLAS, Vendor BLAS, and Compiler
  [Figure: matrix-multiply performance on SGI R10K; series are vendor BLAS, ATLAS BLAS, native compiler, and ECO]

  26. Summary
  • Many solved and open problems in automatic tuning
  • Berkeley-specific activities:
    • OSKI: extra floating-point work can save time
    • Stencil tuning: beware of prefetch
    • New architectures: vectors, Cell; integration with PETSc for clusters
  • PERI:
    • Basic auto-tuning framework
    • Library- and application-level tuning, online and offline
    • Source transformations and domain-specific generators
    • Many forms of "guidance" to control optimizations
    • Performance modeling and application engagement as well
  • Opportunities to collaborate

  27. PERI Automatic Tuning Tools
  [Diagram: the PERI automatic tuning framework of slide 4, with guidance from external software (models, hardware information, annotations, assertions) feeding source code through triage, analysis, domain-specific code generation, and transformations, then code generation, code selection, and application assembly; training runs and production execution supply runtime performance data and drive runtime adaptation via a persistent database.]
  www.peri-scidac.org

  28. Runtime Tuning with Components
  Tuning of component-based software (Norris & Hovland, ANL):
  • Initial implementation of intra-component performance analysis for CQoS (FY08, Q1)
  • Intra-component analysis for generating performance models of single components (FY08, Q4)
  • Define a specification for CQoS support for component SciDAC apps (FY09, Q1)

  29. Source-Based Empirical Optimization
  Source-based optimization (Quinlan/LLNL, Hall/ISI):
  • Combine model-guided and empirical optimization:
    • Compiler models prune unprofitable solutions
    • Empirical data provide an accurate measure of optimization impact
  • Supporting framework:
    • Kernel extraction tools (code isolator); prototypes for C/C++ and F90
    • Experience base to maintain previous results (later years)
  • More talks on these projects later today

  30. FY07 Plan for Source Tuning (USC)
  1. From the proposal: "Develop optimizations for imperfectly nested loops"
     Status: new transformation framework underway; uses Omega
  2. Nearer-term milestone for an out-year deliverable: frontend to the kernel extraction tool in Open64
     Plan: instrument original application code to collect loop bounds, control flow, and input data
  3. New: targeting locality plus multimedia-extension architectures (AltiVec and SSE3)
     Status: preliminary matrix-multiply results on AltiVec; working on SSE3
  4. Help needed for an out-year milestone: apply to "selected loops in SciDAC applications"
     Is there a plan for identifying these?

  31. Source-Based Tuning (LLNL)

  32. Tuning (UNC)

  33. Pop Quiz
  • What are: HPCToolkit, ROSE, BeBOP, Active Harmony, PAPI, ATLAS, ECO, OSKI?
  • Should we have an index for the PERI portal?
    • A one-sentence description of each tool and its relationship to PERI (if any)
    • Or is Google good enough?

  34. Challenges
  • Technical challenges:
    • Multicore, etc.: this is under control (modulo the inability to control SMPs); we would do well to target key application kernels
    • Scaling, communication, load imbalance: less experience here, but some results for communication tuning; load imbalance is likely to be an application-level problem
  • Management challenges:
    • The core tuning community is as described
    • Minor: Mary and Dan need to work closely
    • Lots of "outer circle" tuning activities
    • Relationship to modeling: identify specific opportunities

  35. PERI Tuning
  • Motivation: hand-tuning is too time-consuming and is not robust, especially as we move toward petascale: topology may matter, multi-core memory systems are complicated, and memory and network latency are not getting better
  • Solution: automatic performance tuning
    • Use tools to identify tuning opportunities
    • Build apps to be auto-tunable via parameters plus a tool
    • Use auto-tuned libraries in applications
    • Tune full applications using source-to-source transforms

  36.-39. How OSKI Tunes (Overview) [a progressive build across four slides]
  Install-time (offline), in the library:
  1. Build for the target architecture
  2. Benchmark, producing generated code variants and benchmark data
  Run-time, in the application:
  1. Evaluate heuristic models, using the benchmark data together with the workload from program monitoring, history, and the matrix itself
  2. Select the data structure and code; the user receives a matrix handle for kernel calls
  Extensibility: advanced users may write and dynamically add code variants and heuristic models to the system.
  Joint work with the BeBOP group; see R. Vuduc's PhD thesis
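
  In code, the run-time half of this flow is a handle/hint/tune/call sequence. A sketch based on the published OSKI interface; the constant names and signatures follow the OSKI user documentation, but details may differ across versions:

    #include <oski/oski.h>

    /* CSR arrays Aptr, Aind, Aval for an m-by-n matrix, and dense
     * vectors x, y, are assumed to be set up by the caller. */
    void tuned_spmv(int m, int n, int *Aptr, int *Aind, double *Aval,
                    double *x, double *y, int num_calls) {
        oski_Init();

        /* Wrap the user's CSR data in a tunable matrix handle. */
        oski_matrix_t A = oski_CreateMatCSR(Aptr, Aind, Aval, m, n,
                                            SHARE_INPUTMAT,
                                            1, INDEX_ZERO_BASED);
        oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
        oski_vecview_t yv = oski_CreateVecView(y, m, STRIDE_UNIT);

        /* Declare the expected workload, then let OSKI pick a data
         * structure and code variant using its heuristic models. */
        oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, num_calls);
        oski_TuneMat(A);

        for (int i = 0; i < num_calls; i++)
            oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);  /* y = A*x */

        oski_DestroyVecView(xv);
        oski_DestroyVecView(yv);
        oski_DestroyMat(A);
        oski_Close();
    }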

  40. OSKI-PETSc Performance: Accel. Cavity

  41. Stanza Triad
  • An even smaller benchmark for prefetching, derived from STREAM Triad
  • The stanza length L is the length of a unit-stride run:
    while i < array_length:
      do an L-element stanza: A[i] = scalar * X[i] + Y[i]
      skip k elements
  • Access pattern: 1) do L triads, 2) skip k elements, 3) do L triads, ...
  Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
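
  A minimal sketch of the benchmark's inner loop as described above; the array size is illustrative, and L and k are swept in the actual study:

    #define LEN (1 << 24)
    static double A[LEN], X[LEN], Y[LEN];

    /* One pass of Stanza Triad: unit-stride runs of length L separated
     * by gaps of k elements. Larger L lets the hardware prefetcher
     * stream; each gap forces a non-prefetched restart. */
    void stanza_triad(long L, long k, double scalar) {
        long i = 0;
        while (i + L <= LEN) {
            for (long j = i; j < i + L; j++)   /* the L-element stanza */
                A[j] = scalar * X[j] + Y[j];
            i += L + k;                        /* skip k elements */
        }
    }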

  42. Stanza Triad Results
  • Without prefetching, performance would be independent of stanza length: a flat line at the STREAM peak
  • Our results show that performance depends on stanza length
  Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  43. Cost Model for Stanza Triad
  • The first cache line in every L-element stanza is not prefetched
    • Assign it cost C_non-prefetched: the value from Stanza Triad with L = cache line size
  • The rest of the cache lines are prefetched
    • Assign them cost C_prefetched: the value from Stanza Triad with large L
  • Total cost:
    Cost = (#non-prefetched lines) * C_non-prefetched + (#prefetched lines) * C_prefetched
  Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
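
  A minimal sketch of evaluating this two-point model for one stanza; the per-line costs are the measured values described above, and summing over all stanzas in a sweep reproduces the total-cost formula on the slide:

    /* Predicted time for one stanza of length L (in elements), given the
     * measured per-line costs and the number of elements per cache line. */
    double stanza_cost(long L, long elems_per_line,
                       double c_nonprefetched, double c_prefetched) {
        long lines = (L + elems_per_line - 1) / elems_per_line; /* lines touched */
        /* The first line of the stanza misses the prefetcher; the rest stream. */
        return c_nonprefetched + (lines - 1) * c_prefetched;
    }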

  44. Stanza Triad Model
  The model works well, except on the Itanium 2
  Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  45. Stanza Triad Memory Model 2
  • Instead of a two-point piecewise function, use three points
  • This models all three architectures
  Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  46. Stencil Cost Model for Cache Blocking Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  47. Stencil Probe Cost Model Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  48. Stencil Probe Cost Model Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  49. Stencil Cache Blocking Summary
  • Speedups occur only with:
    • Large grid sizes
    • An unblocked unit-stride dimension
  • Currently applying this to cross-iteration optimizations
  Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams