Presentation Transcript


  1. Science Pipeline Allen D. Malony, University of Oregon, May 6, 2014 Bob Lucas, University of Southern California, Sept. 23, 2011 Support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research

  2. Fundamental Objectives • SUPER funnels the rich intellectual products borne from a history of research and development in performance areas into an effective performance engineering center of mass for the SciDAC program • SUPER pulls from prior investments by ASCR and others the technology and expertise that past efforts produced, especially with respect to methodologies, tools, and integration across performance engineering areas • measurement, analysis, modeling • program analysis, optimization and tuning • resilience • SUPER focuses on integration of expertise for addressing performance engineering problems across the SciDAC landscape, leveraging the robust performance tools available

  3. Pipeline to Tools/Technology Integration and Application [Diagram: DOE and other funding feed a pipeline of tools and technologies (TAU, PAPI, mpiP, GPTL, RCRToolkit, PBound, Roofline, PEBIL, PSiNtracer, ROSE, CHiLL, Active Harmony, Orio) spanning autotuning, performance measurement, end-to-end optimization, modeling, reliability, code analysis, resilience, and energy, integrated and applied to SciDAC applications: a center of mass for performance engineering]

  4. Performance Engineering Tools/Tech Integration SUPER focuses on integrating developed tools and technologies to build enhanced capabilities

  5. End-to-End Performance Optimization SUPER is establishing processes for applying integrated tools for end-to-end optimization

  6. Tools and Technologies • Performance • TAU Performance System, PAPI, mpiP, GPTL • Power / Energy • PEBIL, PSiNtracer • Autotuning • Active Harmony, CHiLL, Orio • Resilience and source analysis • Modeling • Pbound, Roofline • Optimization

  7. TAU Performance System • Tuning and Analysis Utilities (20+ year project) • Performance problem solving framework for HPC • Integrated performance toolkit • Multi-level performance instrumentation • Flexible and configurable performance measurement • Widely-ported performance profiling / tracing system • Performance data management and data mining • Open source (BSD-style license) • Broad use in complex software, systems, applications • Long history of funding by DOE, NSF, and DoD
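
A minimal sketch of manual (user-level) instrumentation with TAU in C is shown below. It assumes a TAU installation and compilation through the tau_cc.sh wrapper; the macro names follow TAU's user-defined timer API, and the region name "solve_step" is purely illustrative.

    #include <TAU.h>

    /* illustrative sketch: a user-defined TAU timer around a region of interest */
    void solve_step(void)
    {
        TAU_START("solve_step");           /* start user-defined timer */
        /* ... computation of interest ... */
        TAU_STOP("solve_step");            /* stop user-defined timer  */
    }

    int main(void)
    {
        TAU_PROFILE_SET_NODE(0);           /* identify the process (non-MPI case) */
        for (int i = 0; i < 10; i++)
            solve_step();
        return 0;                          /* profile files (profile.*) written at exit */
    }

Sampling and automatic source instrumentation avoid even this small code change; the manual API is shown only because it is the shortest self-contained example.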

  8. TAU’s Funding and Development Pipeline [Diagram: funding pipeline 2001–2011 spanning DOE and NSF projects (CCA, ZeptoOS, PRIMA, MOGO, Vancouver, POINT, Glassbox), driving the evolution of TAU capabilities: flexible performance measurement, performance mapping in software layers, automated source instrumentation, modeling and computational QoS, kernel-level measurement, runtime scalable monitoring, source code analysis (PDT), performance data management (PerfDMF), performance data mining (PerfExplorer), measurement infrastructure refactoring (TAU + Scalasca, Score-P), parallel performance visualization, automatic library wrapping, heterogeneous and accelerator performance analysis, open source interoperation, and cross-layer integration]

  9. TAU Technologies [Diagram: TAU + Scalasca, Score-P, TAUdb, PerfExplorer, ParaProf]

  10. Impact of TAUdb and PerfExplorer [Diagram: TAUdb and PerfExplorer connect tools and technologies (ROSE, CHiLL + Active Harmony, Orio, OpenCL, CUDA) with applications (MPAS-O, Geant4, XGC1, CESM)]

  11. End-to-End Performance Variability Analysis (CESM) • Use of GPTL (General Purpose Timing Library) • Lightweight profiling to bundle with app • NSF + DOE funding • Couple with platform systems information • TAUdb extended to support this data
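
As a rough illustration of the lightweight GPTL profiling described above, the sketch below times two nested regions with GPTL's C API; the region names and the loop are placeholders, and the code assumes an installed GPTL linked with -lgptl.

    #include <gptl.h>

    /* illustrative sketch of GPTL region timers */
    int main(void)
    {
        GPTLinitialize();

        GPTLstart("total");
        for (int step = 0; step < 100; step++) {
            GPTLstart("ocean_step");
            /* ... one model timestep would run here ... */
            GPTLstop("ocean_step");
        }
        GPTLstop("total");

        GPTLpr(0);                         /* write the timing report (timing.0) */
        GPTLfinalize();
        return 0;
    }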

  12. Geant4 Performance Analysis and Tuning • Geant4 is extremely important to the design and execution of HEP experiments • How to evolve its design to best exploit current/future architectures? • Geant4 team, HEP, and ASCR partnership • Not a standard performance analysis/tuning scenario • Quantifying performance impact of OO design choices • Class-based performance analysis • polymorphism (same function name, many implementations) • virtual functions (what object types are functions invoked on?)

  13. Using TAU in Geant4 • TAU collects data for Simplified Calorimeter experiment • Sampling profiles: low-overhead measurements of full-scale experiments • Instrumentation-based: selectively instrumented classes and functions to collect precise measurements for functions (and whole classes) identified through sampling • Data stored in TAUdb (shared with physics collaborators) • New analysis enabled by TAUdb and PerfExplorer • Class-based profiles: hardware counters and derived metrics • Compare impacts of high-level and low-level optimizations • changing inheritance structure (design) (high) • performance metrics (cache misses, vectorization, …) (low)

  14. Performance API (PAPI) • PAPI is middleware that provides a consistent interface and methodology for the performance counter hardware in major microprocessors • PAPI enables software engineers to see the relation between software performance and hardware events • PAPI component architecture provides access to a collection of components that expose performance measurement opportunities across the system • network, I/O system, accelerators, power/energy
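
A minimal sketch of PAPI's low-level C API is shown below, counting two common preset events around a kernel. Event availability varies by processor, and the loop is only a stand-in for the code under study; link with -lpapi.

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        int events[2] = { PAPI_TOT_INS, PAPI_TOT_CYC };   /* preset events */
        long long counts[2];
        int evset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);
        PAPI_create_eventset(&evset);
        PAPI_add_events(evset, events, 2);

        PAPI_start(evset);
        volatile double s = 0.0;                          /* stand-in kernel */
        for (int i = 0; i < 1000000; i++)
            s += i * 0.5;
        PAPI_stop(evset, counts);

        printf("instructions: %lld  cycles: %lld\n", counts[0], counts[1]);
        return 0;
    }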

  15. PAPI Pipeline PaRSEC (UTK), TAU (UO), PerfSuite (NCSA), HPCToolkit (Rice), SCALASCA (FZJ, UTK), Vampir (TUD), Open|SpeedShop (LLNL), SvPablo (RENCI) • DOE support • ASCR (2002-05) • PERC (2001-06) • PERI (2006-11) • PAPI is widely available on processors and is heavily used in SUPER across areas

  16. Performance Analysis for Communication (mpiP) [Figures: LAMMPS, LULESH] • Lightweight and scalable profiling tool for MPI applications • DOE funding history • ASC, PERC, PERI • SUPER is extending mpiP to collect communication topology information for point-to-point and collective communication • SciDAC application characterization studies • Benchmarks and applications from DOE-funded Oxbow project • Developing an automated approach for characterizing the communication topology
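
mpiP needs no source changes: an MPI application is linked against the mpiP library and a report is written at MPI_Finalize. The ring exchange below is a toy example of the kind of point-to-point pattern the topology extension records; the link line in the comment is typical, but the exact support libraries depend on how mpiP was built.

    /* typical build: mpicc ring.c -o ring -lmpiP -lbfd -lunwind */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, out, in = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        out = rank;
        /* nearest-neighbor ring exchange: a simple communication topology */
        MPI_Sendrecv(&out, 1, MPI_INT, (rank + 1) % size, 0,
                     &in,  1, MPI_INT, (rank + size - 1) % size, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %d\n", rank, in);

        MPI_Finalize();                    /* mpiP report (*.mpiP) written here */
        return 0;
    }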

  17. Analyzing and Modeling Performance and Power • How can we get energy-efficient HPC? • Understand and model how computation and communication affect the overall performance and energy requirements of HPC applications • Use performance and power models to design software- and hardware-aware "green" techniques to optimize energy footprint • PEBIL and PSiNtracer (PMaC Labs) • RCRToolkit (RENCI)

  18. Analysis and Modeling with PEBIL and PSiNstracer • Capture fundamental operations used by the application • Requires low-level, specific details of the application • Analysis required on large-scale production codes • PEBIL binary instrumentation • Static analysis (memory, FP counts, op parallelism, …) • Dynamic (cache hits, execution counts, loop length, …) • PSiNstracer communication characterization • Profiles all communication routines during a run • Funding heritage • DOE (ASCR, PERC, PERI) • DoD, NSF

  19. RCRToolkit for Runtime Resource Monitoring • Resource Centric Reflection (RCR) Toolkit • Node-wide performance monitoring and analysis • Uncore (“outside the core”) • Access through shared blackboard (RCRblackboard) • Funding pipeline • DoD ACS MAESTRO and ATPER • DOE (XGC, XPRESS) • NSF GENI • Impact • Adaptive scheduling for power and energy • Target deterministic strategies for (auto)tuning • SciDAC end applications amenable to using it

  20. Autotuning Pipeline • SUPER brings several research efforts together to enable the use and integration of automatic tuning methods and tools • Active Harmony (University of Maryland) • CHiLL (Utah, USC) • Orio (Argonne, UO) • Powerful capability for performance engineering • Parameter exploration automation • Couple with code transformation techniques • Impact can be significant in improving ability to explore multi-dimensional performance space

  21. Active Harmony [Diagram: the client application reports evaluated performance to the search strategy and fetches candidate points from it] • Active Harmony (AH) is an auto-tuning framework that supports online and offline auto-tuning • Flexible, plugin-based architecture • How does it work? • Measures program performance • Adapts tunable parameters • Search heuristics explore options • Development funding pipeline • NSF (1997–2000) • DoD (1997–2000, 2010–present) • DOE (ASCR, 2001–2012) • DOE (SciDAC, 2001–present)
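
The fetch/measure/report cycle in the diagram can be sketched as follows. This is a self-contained stand-in (a random search over two parameters with a synthetic cost function), not Active Harmony's actual client API; it only shows the respective roles of the search strategy and the client.

    #include <stdio.h>
    #include <stdlib.h>

    /* stand-in search strategy: random sampling over tile/unroll ranges */
    static void tuner_fetch(int *tile, int *unroll)
    {
        *tile   = 16 * (1 + rand() % 8);   /* 16, 32, ..., 128 */
        *unroll = 1 + rand() % 8;          /* 1..8 */
    }

    /* stand-in "measurement": synthetic cost with its best point at (64, 4) */
    static double run_kernel(int tile, int unroll)
    {
        double dt = tile - 64.0, du = unroll - 4.0;
        return 1.0 + 0.0001 * dt * dt + 0.01 * du * du;
    }

    int main(void)
    {
        int best_tile = 0, best_unroll = 0;
        double best = 1e30;

        for (int iter = 0; iter < 100; iter++) {
            int tile, unroll;
            tuner_fetch(&tile, &unroll);          /* 1. fetch candidate point     */
            double t = run_kernel(tile, unroll);  /* 2. evaluate its performance  */
            if (t < best) {                       /* 3. report back to the search */
                best = t; best_tile = tile; best_unroll = unroll;
            }
        }
        printf("best: tile=%d unroll=%d time=%.4f\n", best_tile, best_unroll, best);
        return 0;
    }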

  22. Active Harmony Integration • CHiLL integration • Plugin used to access AH search methods • Explores performance space from code generation • TAU integration • Plugin used within AH to read from / write to TAUdb • TAU used with CHiLL and AH to capture performance • Application • Used with MPAS-O (partitioning optimization) • Developed new auto-tuned FFT (1.8x faster than FFTW)

  23. CHiLL Autotuning Pipeline • CHiLL autotuning system developed in PERI (Utah) • Compiler framework for loop transformations • Integrated into the PERI autotuning framework • Integrated this in SciDAC with other research at Utah • Funding pipeline • NSF NGS (2002) • NSF CSR (2005) • DOE PERI (2006) • DOE ASCR XTUNE (2008) • Broadening the autotuning research agenda in SUPER • Heterogeneous systems • Other objectives, in particular energy and resilience
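
As an example of the loop transformations CHiLL's compiler framework generates, here is a matrix-multiply loop nest and a hand-written tiled variant. CHiLL produces such variants automatically from a transformation recipe; the code below only shows the before/after shape, with TILE as the tunable parameter (assumed to divide N evenly).

    #define N    512
    #define TILE 64

    /* baseline triple loop */
    void matmul(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    /* tiled variant: same computation, blocked for cache reuse */
    void matmul_tiled(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int kk = 0; kk < N; kk += TILE)
                    for (int i = ii; i < ii + TILE; i++)
                        for (int j = jj; j < jj + TILE; j++)
                            for (int k = kk; k < kk + TILE; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }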

  24. Orio Autotuning Framework • Express any properties of the computation that can possibly be exploited to optimize • Orio approach • Optimization specifications • capture typical optimizations • tiling, unrolling, … • specialized implementations • different input sizes • Transform code based on knowledge • CUDA, OpenCL, OpenMP, … • Empirical analysis of variants (different code output) • Search for best • Orio integration with TAU for empirical autotuning • SUPER impact on PETSc and other libraries
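
To illustrate the kind of variant Orio generates and times empirically, below is an AXPY loop and a hand-written 4-way unrolled version. Orio derives such variants from annotations in the source (and can emit CUDA, OpenCL, or OpenMP code); the unrolled function here only shows the shape of one generated candidate.

    /* baseline kernel */
    void axpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* hand-written 4-way unrolled variant of the same kernel */
    void axpy_unroll4(int n, double a, const double *x, double *y)
    {
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
        for (; i < n; i++)                 /* remainder loop */
            y[i] += a * x[i];
    }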

  25. Modeling through Source and Empirical Analysis • Performance bounds give the upper limit in performance that can be expected for a given application on a given system • Different existing approaches: • Fully automatic (ignores machine information) • Theoretical peak (based on FP units) • Fully dynamic (profiling-based, time, overhead) • PBound approach (Argonne) • Application signatures + architecture bounds • Roofline modeling (LBL)

  26. PBound • Developed under PERC, PERI, and SUPER • ROSE-based tool that generates performance bounds from source code (C, C++, Fortran) • Example: what is the best achievable execution time? • Based on static (source code) analysis • Produces parameterized closed-form expressions expressing the computational and data load/store requirements of application kernels • Coupled with architectural information • Produces upper bounds on the performance of the application
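
A worked sketch of the kind of bound this produces: count the flops and loads/stores of a kernel from its source (a triad-like loop a[i] = b[i] + s*c[i] is assumed here), then combine those closed-form counts with machine parameters to get a best-case execution time. The peak numbers below are illustrative, not taken from any particular system.

    #include <stdio.h>

    int main(void)
    {
        double n     = 1e8;                /* iterations of a[i] = b[i] + s*c[i]   */
        double flops = 2.0 * n;            /* one multiply + one add per iteration */
        double bytes = 3.0 * 8.0 * n;      /* two loads + one store of doubles     */

        double peak_flops = 500e9;         /* assumed peak FP rate (flop/s)        */
        double peak_bw    = 100e9;         /* assumed memory bandwidth (bytes/s)   */

        double t_compute = flops / peak_flops;
        double t_memory  = bytes / peak_bw;
        double t_bound   = (t_compute > t_memory) ? t_compute : t_memory;

        printf("compute bound %.4f s, memory bound %.4f s -> best case %.4f s\n",
               t_compute, t_memory, t_bound);
        return 0;
    }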

  27. Roofline Modeling • Roofline models characterize architectures and help visualize application performance within the architectural roofline • Shows the range of possible application performance • Determines how optimizations affect application performance • Performance space determined by either: • Static performance models • such as those generated by PBound • Empirical models based upon platform experiments
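
The roofline itself is just the minimum of machine peak and arithmetic intensity times memory bandwidth. The sketch below evaluates that formula over a range of intensities; the peak and bandwidth values are illustrative placeholders for the architectural parameters.

    #include <stdio.h>

    /* attainable GFLOP/s = min(peak, arithmetic intensity * bandwidth) */
    static double roofline(double ai, double peak_gflops, double bw_gbytes)
    {
        double mem_limited = ai * bw_gbytes;
        return mem_limited < peak_gflops ? mem_limited : peak_gflops;
    }

    int main(void)
    {
        double peak = 500.0, bw = 100.0;   /* assumed GFLOP/s and GB/s */
        double ai[] = { 0.125, 0.25, 0.5, 1, 2, 4, 8, 16 };

        for (int i = 0; i < 8; i++)
            printf("AI %6.3f flop/byte -> attainable %7.1f GFLOP/s\n",
                   ai[i], roofline(ai[i], peak, bw));
        return 0;
    }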

  28. Resilience Pipeline • Express knowledge of application requirements • Semiconductor Research Corporation (SRC) • Multiscale Systems (MUSYC) Focused Center Research Program (FCRP) • New grant from ARO • Transition technology into the ROSE compiler (LLNL) • Create runtime system based on JPL technology • Additional NSF and SRC funding with Utah • Automatic derivation of predicates • Help detect silent errors • Hardware component based FPGAs • Use FPGAs as co-processors • Originally funded by DARPA under the ACS (Adaptive Computing Systems) program • Work continues in SUPER • Collaborating with LLNL’s resilience research team • Broaden the space of applications and assertions
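
As a rough illustration of a predicate that can flag silent errors, the sketch below checks a cheap invariant: a checksum that a sum-preserving kernel must leave unchanged up to roundoff. The predicates on this slide are derived automatically through the ROSE compiler; this hand-written check only conveys the idea.

    #include <assert.h>
    #include <math.h>
    #include <stdio.h>

    static double checksum(const double *v, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += v[i];
        return s;
    }

    int main(void)
    {
        double v[1000];
        for (int i = 0; i < 1000; i++) v[i] = i * 0.5;

        double before = checksum(v, 1000);
        /* ... a sum-preserving kernel (e.g., a permutation) would run here ... */
        double after = checksum(v, 1000);

        /* predicate: the invariant must still hold, otherwise flag a silent error */
        assert(fabs(after - before) <= 1e-9 * fabs(before));
        printf("invariant held: checksum %.6f\n", after);
        return 0;
    }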

  29. SUPER Science Pipeline Impact and Outcomes • Tools continue to improve and are widely distributed and downloaded • 75 papers produced • 35 presentations among the institutions • 24 students matriculated and/or graduated • 4 postdocs • 10 internships at DOE national labs
