  1. SUPER Science Pipeline Allen D. Malony, University of Oregon; Bob Lucas, University of Southern California. Sept. 23, 2011. Support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program, funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research.

  2. Fundamental Objectives • SUPER funnels the rich intellectual products born of a history of research and development in performance areas into an effective performance engineering center of mass for the SciDAC program • SUPER pulls from prior investments by ASCR and others the technology and expertise that past efforts produced, especially with respect to methodologies, tools, and integration across performance engineering areas • measurement, analysis, modeling • program analysis, optimization, and tuning • resilience • SUPER focuses on integrating expertise to address performance engineering problems across the SciDAC landscape, leveraging the robust performance tools available

  3. Pipeline to Tools/Technology Integration and Application [Diagram: DOE and other funding flows through tools/technologies — TAU, PAPI, mpiP, GPTL, RCRToolkit, PBound, Roofline, PEBIL, PSINS tracer, ROSE, CHiLL, Active Harmony, Orio — into performance engineering areas (autotuning, integration, end-to-end optimization, modeling, reliability, code analysis, resilience, energy) and on to SciDAC applications, forming a center of mass for performance engineering]

  4. Performance Engineering Tools/Tech Integration SUPER focuses on integrating developed tools and technologies to build enhanced capabilities

  5. End-to-End Performance Optimization SUPER is establishing processes for applying integrated tools for end-to-end optimization

  6. TAU Performance System • Tuning and Analysis Utilities (20+ year project) • Performance problem solving framework for HPC • Integrated, scalable, flexible, portable • Target all parallel programming / execution paradigms • Integrated performance toolkit • Multi-level performance instrumentation • Flexible and configurable performance measurement • Widely-ported performance profiling / tracing system • Performance data management and data mining • Open source (BSD-style license) • Broad use in complex software, systems, applications • Long history of funding by DOE, NSF, and DoD
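The fragment below is a minimal sketch of what TAU's manual instrumentation looks like in C, assuming a TAU installation and the tau_cc.sh compiler wrapper (which defines the profiling macros); the region names and the work routine are illustrative.

```c
/* Minimal sketch of manual TAU instrumentation in C. Build with the
   tau_cc.sh wrapper; profiles can be viewed with pprof or ParaProf. */
#include <TAU.h>

double work(long n) {
  TAU_PROFILE_TIMER(t, "work", "double (long)", TAU_USER);
  TAU_PROFILE_START(t);                 /* begin timing this region */
  double s = 0.0;
  for (long i = 0; i < n; i++) s += i * 1e-9;
  TAU_PROFILE_STOP(t);                  /* end timing */
  return s;
}

int main(int argc, char **argv) {
  TAU_PROFILE_INIT(argc, argv);         /* initialize measurement */
  TAU_PROFILE_SET_NODE(0);              /* single-process run */
  return work(1000000) < 0.0;           /* always returns 0 */
}
```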

  7. TAU Development Pipeline • “Performance Engineering Technology for Scientific Component Software,” DOE, Office of Science / MICS, DE-FG02-03ER25561, A. Malony (PI), S. Shende (co-PI), 8/15/03–8/14/06 (extended to 2/1/07). • “Extreme Performance Scalable Operating Systems,” DOE, Office of Science, DE-FG02-08ER25846, P. Beckman (PI, Argonne National Laboratory), A. Malony (co-PI), 12/1/04–1/31/08. • “Application-Specific Performance Technology for Productive Parallel Computing,” DOE, Office of Science / MICS, DE-FG02-05ER25680, A. Malony (PI), S. Shende (co-PI), 5/1/05–4/30/08. • “Knowledge-based Parallel Performance Technology,” DOE, Office of Science, DE-FG02-07ER25826, A. Malony (PI), S. Shende (co-PI), 9/1/07–8/31/10. • “Performance Refactoring of Instrumentation, Measurement, and Analysis Technologies for Petascale Computing: the PRIMA Project,” DOE, Office of Science, DE-FG02-09ER25873, A. Malony (PI), F. Wolf (co-PI), S. Shende (co-PI), B. Mohr (co-PI, Research Centre Juelich), 6/1/09–5/31/12. • “MOGO: Model Oriented Global Optimization of Petascale Scientific Applications,” DOE, Office of Science, DE-SC0001777, J. Vetter (PI), A. Malony (co-PI), P. Beckman (co-PI), 9/1/09–8/31/12. • “Vancouver: Designing a Next-Generation Software Infrastructure for Productive Heterogeneous Exascale Computing,” DOE, DE-SC0005360, J. Vetter (PI), A. Malony (co-PI), S. Shende (co-PI), R. Vuduc (co-PI), W.-M. Hwu (co-PI), 9/1/10–8/31/13. • “Autotuning Large Computational Chemistry Codes,” DOE, Lawrence Berkeley National Laboratory, 6939279, D. Bailey (PI), A. Malony (co-PI), S. Shende (co-PI), 11/3/10–3/31/12. • “Data Management, Performance Tuning and Analysis for the Global Cloud Resolving Model,” DOE, Pacific Northwest National Laboratory, 113907, S. Shende (PI), A. Malony (co-PI), $167,524, 5/12/10–9/30/11. SUPER

  8. [Diagram: TAU development pipeline — project contributions feeding SUPER: CCA (automated source instrumentation; modeling and computational QoS), ZeptoOS (kernel-level measurement; runtime scalable monitoring), PRIMA, Glassbox, MOGO, POINT, Vancouver, ASC]

  9. TAU NSF Funding • “SI2-SSE: Collaborative Research: A Glass Box Approach to Enabling Open, Deep Interactions in the HPC Toolchain,” NSF Office of Cyberinfrastructure, K. Schwan (PI, Georgia Tech), A. Malony (co-PI), B. Chapman (co-PI, U. Houston). • “SDCI HPC Improvement: High-Productivity Performance Engineering (Tools, Methods, Training) for NSF HPC Applications,” NSF, Software Development for Cyberinfrastructure (SDCI), OCI-0722072, A. Malony (PI), S. Shende (co-PI), N. Nystrom (co-PI, Pittsburgh Supercomputing Center), S. Moore (co-PI, University of Tennessee), R. Kufrin (co-PI, National Center for Supercomputing Applications, University of Illinois), 11/1/07–10/31/10. • “ST-HEC: Collaborative Research: Scalable, Interoperable Tools to Support Autonomic Optimization of High-End Applications,” NSF High-End Computing (HEC), NSF CCF-0444475, S. McKee (PI, Cornell University), A. Malony (co-PI), G. Tyson (co-PI, Florida State University), 11/1/04–10/31/07. • “MRI-R2: Acquisition of an Applied Computational Instrument for Scientific Synthesis (ACISS),” NSF, Office of Cyberinfrastructure, OCI-0960354, A. Malony (PI), D. Tucker (co-PI), J. Conery (co-PI), M. Guenza (co-PI), S. Lockery (co-PI), 5/1/10–4/30/13. • “Acquisition of the Oregon ICONIC Grid for Integrated Cognitive Neuroscience, Informatics, and Computation,” NSF Major Research Instrumentation, NSF BCS-0321388, A. Malony (PI), D. Tucker (co-PI), M. Posner (co-PI), J. Conery (co-PI), R. Nunnally (co-PI), 9/1/03–8/31/06 (extended to 12/1/06). SUPER

  10. Model Oriented Global Optimization (MOGO) • DOE ASCR (base funding) • ORNL, ANL, UO

  11. Performance Refactoring: Instrumentation, Measurement, and Analysis (PRIMA) • DOE ASCR (base funding) • University of Oregon and Research Centre Juelich • Focus on TAU and Scalasca • Refactor instrumentation, measurement, and analysis • Build next-generation tools on new common foundation • Strong need to integrate technology • Integration of instrumentation and measurement • Create core infrastructure (avoid duplication, share) • Extend to involve the SILC project • Juelich, TU Dresden, TU Munich • Fully-integrated measurement infrastructure – Score-P

  12. Score-P Architecture [Diagram: analysis tools (TAU, Vampir, Scalasca, Periscope) sit atop the Score-P measurement infrastructure, which produces event traces (OTF2), call-path profiles (CUBE4), and an online interface, and reads hardware counters (PAPI); a TAU adaptor provides supplemental instrumentation and measurement support; below, the application (MPI, OpenMP, hybrid) is instrumented via the MPI wrapper, compiler, TAU instrumentor, OPARI2, and COBI]

  13. Heterogeneous Exascale Software (Vancouver) • DOE X-stack program • ORNL, UO, UIUC, GT • Compilers • Runtime resource management • Libraries • Performance measurement, analysis, modeling

  14. High-Productivity Performance Engineering (POINT) • Testbed apps: ENZO, NAMD, NEMO3D

  15. Performance API (PAPI) • PAPI is middleware that provides a consistent interface and methodology for the performance counter hardware in major microprocessors • PAPI enables software engineers to see the relation between software performance and hardware events • PAPI component architecture provides access to a collection of components that expose performance measurement opportunities across the system • network, I/O system, accelerators, power/energy
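A minimal sketch of PAPI's low-level C API in the spirit of this slide: create an event set and count cycles and cache misses around a region of interest. Event availability varies by processor, and the measured loop is a stand-in.

```c
/* Minimal sketch of counting hardware events with PAPI's low-level
   C API. Link with -lpapi; error handling is abbreviated. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
  int evset = PAPI_NULL;
  long long counts[2];

  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
    exit(1);
  PAPI_create_eventset(&evset);
  PAPI_add_event(evset, PAPI_TOT_CYC);   /* total cycles */
  PAPI_add_event(evset, PAPI_L2_TCM);    /* L2 total cache misses */

  PAPI_start(evset);
  volatile double x = 0.0;
  for (int i = 0; i < 1000000; i++) x += i * 0.5;  /* region of interest */
  PAPI_stop(evset, counts);

  printf("cycles: %lld, L2 misses: %lld\n", counts[0], counts[1]);
  return 0;
}
```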

  16. Funding Support of PAPI • DOE support • PERC • PERI • Contributed to the PAPI technology now being used in the SUPER project http://icl.cs.utk.edu/papi/

  17. Tools Applying PAPI • PaRSEC (UTK) http://icl.cs.utk.edu/parsec/ • TAU (U. Oregon) http://www.cs.uoregon.edu/research/tau/ • PerfSuite (NCSA) http://perfsuite.ncsa.uiuc.edu/ • HPCToolkit (Rice University) http://hpctoolkit.org/ • KOJAK and Scalasca (FZ Juelich, UTK) http://icl.cs.utk.edu/kojak/ • VampirTrace and Vampir (TU Dresden) http://www.vampir.eu • Open|SpeedShop (SGI) http://oss.sgi.com/projects/openspeedshop/ • SvPablo (UNC Renaissance Computing Institute) http://www.renci.org/research/pablo/ • ompP (UTK) http://www.ompp-tool.com

  18. The PAPI performance monitoring library has provided consistent, platform- and operating-system-independent access to CPU hardware performance counters for more than a decade. The primary application of PAPI as middleware has been enabling parallel application performance analysis tools (e.g., TAU, HPCToolkit, CrayPat, Vampir) to gather performance counter data on large-scale DOE computing systems through a common and coherent interface. The events that can be monitored span a wide range of performance-relevant architectural features, such as cache misses, floating-point operations, retired instructions, executed cycles, and many others.

  19. Over time, other system components beyond the processor have gained performance interfaces, e.g., GPU accelerators, network interfaces, and I/O systems. To address this change, PAPI was redesigned around a component architecture that allows modular access to these new sources of performance data. With this redesign, additional PAPI components have been developed that address subsets of data and communication domains, measure I/O performance, and monitor synchronization and data exchange between computing elements.
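A short sketch of how a tool can enumerate the PAPI components described above; the output format is illustrative.

```c
/* Sketch: list the PAPI components (CPU, and, where built, GPU,
   network, I/O, power, ...) exposed by the component API. */
#include <stdio.h>
#include <papi.h>

int main(void) {
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
    return 1;
  int ncomp = PAPI_num_components();
  for (int cid = 0; cid < ncomp; cid++) {
    const PAPI_component_info_t *info = PAPI_get_component_info(cid);
    if (info)
      printf("component %d: %s (%d native events)\n",
             cid, info->name, info->num_native_events);
  }
  return 0;
}
```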

  20. These advances were supported by the DOE SciDAC projects PERC (2001–2006) and PERI (2006–2011), as well as by a DOE ASCR basic research grant (2002–2005), all of which contributed to the PAPI technology we are now using in the SUPER project.

  21. [Timeline chart: ASCR and SciDAC funding for PAPI, 2001–2012, spanning the PERC era]

  22. Modeling Performance and Power (UCSD/SDSC) • How can we get energy-efficient HPC? • Understand and model how computation and communication patterns affect the overall performance and energy requirements of HPC applications • Use performance and power models to design software- and hardware-aware green optimization techniques to reduce HPC's energy footprint • Funding heritage • DOE (ASCR, PERC, PERI) • DoD, NSF

  23. Application Characterization with PEBIL and PSINS tracer • Capture fundamental operations used by the application • Requires low-level details of the application • Details attached to specific structures within the application • Analysis required on large-scale production codes • PEBIL binary instrumentation • Static analysis tools • memory, FP counts, operation parallelism, program structure • Dynamic (runtime) analysis tools • cache hit rates, execution counts, loop length • PSINS tracer communication characterization • Profiles all communication routines during a run

  24. Active Harmony [Diagram: Active Harmony client/application loop — (1) the search strategy sends candidate points, (2) the client FETCHes and runs them, (3) evaluated performance is REPORTed back] • Active Harmony (AH) is an auto-tuning framework that supports online and offline auto-tuning • Flexible, plugin-based architecture • How does it work? (see the sketch below) • Measures program performance • Adapts tunable parameters • Search heuristics explore options • AH funding • NSF (1997–2000) • DoD (1997–2000, 2010–present) • DOE (base projects, 2001–2012) • DOE (SciDAC, 2001–present)
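The sketch below is a self-contained illustration of the fetch/measure/report loop in the diagram, with a plain exhaustive scan standing in for AH's search heuristics; it does not use the Active Harmony API, and run_kernel is a synthetic stand-in.

```c
/* Illustrative sketch (NOT the Active Harmony API) of the online
   fetch -> measure -> report auto-tuning loop. */
#include <stdio.h>

/* Synthetic stand-in for running the application with one tunable. */
static double run_kernel(int tile) {
  double d = tile - 48;                /* pretend 48 is the optimum */
  return 1.0 + d * d * 1e-4;           /* "runtime" in seconds */
}

int main(void) {
  int best_tile = -1;
  double best_time = 1e30;
  for (int tile = 8; tile <= 128; tile += 8) {   /* FETCH candidates */
    double t = run_kernel(tile);                 /* measure */
    printf("tile=%3d  time=%.4f\n", tile, t);    /* REPORT */
    if (t < best_time) { best_time = t; best_tile = tile; }
  }
  printf("best tile: %d (%.4f s)\n", best_tile, best_time);
  return 0;
}
```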

  25. AH FFT Auto-tuning • Our new AH-tuned method is 1.76x faster than FFTW for 2048³ complex numbers on 256 cores of Hopper • The TH approach is not as fast • The new AH method and the TH method overlap computation and communication; FFTW does not

  26. AH Tool Integration • CHiLL plugin • TAU plugin

  27. mpiP • mpiP is a lightweight and scalable profiling tool for MPI applications • Originally developed with ASC (then ASCI) funding • PERC and PERI funding helped maintain and improve it • SUPER is extending mpiP to collect communication topology information for point-to-point and collective communication • SciDAC application characterization studies • Additional benchmarks and applications in the DOE-funded Oxbow project • With SUPER and other DOE funding, developed an automated approach for characterizing the communication topology data collected by mpiP (see the wrapper sketch below) http://mpip.sourceforge.net
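The sketch below illustrates the PMPI interposition technique that lightweight MPI profilers such as mpiP build on; it is not mpiP source code, just a hand-written wrapper that counts MPI_Send traffic and forwards to the real implementation.

```c
/* Hedged sketch of PMPI interposition (MPI-3 signatures): intercept
   MPI_Send, record statistics, forward via the PMPI_ entry points.
   Compile into a library linked ahead of the MPI library. */
#include <mpi.h>
#include <stdio.h>

static long send_calls = 0;
static long send_bytes = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype dt,
             int dest, int tag, MPI_Comm comm) {
  int size;
  PMPI_Type_size(dt, &size);            /* bytes per element */
  send_calls++;
  send_bytes += (long)count * size;
  return PMPI_Send(buf, count, dt, dest, tag, comm);
}

int MPI_Finalize(void) {
  int rank;
  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("rank %d: %ld MPI_Send calls, %ld bytes\n",
         rank, send_calls, send_bytes);
  return PMPI_Finalize();
}
```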

  28. GPTL • General Purpose Timing Library from ORNL • J. Rosinski (originally) and P. Worley (CESM version) • Developed initially for the Community Climate System Model • NSF funded initially, then DOE • Motivation is a lightweight profiling tool that can be bundled with application codes and used during production runs to get general performance data (see the sketch below) • More informative, much lighter weight, and much more robust than hand-rolled instrumentation layers • Now used with climate and fusion codes
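A minimal sketch of timing regions with GPTL's C API, assuming the library is installed and linked with -lgptl; the region names are arbitrary strings chosen for illustration.

```c
/* Minimal sketch of region timing with GPTL. Nested regions are
   allowed; GPTLpr(0) writes a per-process report (timing.0). */
#include <gptl.h>

int main(void) {
  GPTLinitialize();            /* set up the library */
  GPTLstart("total");
  GPTLstart("compute");
  double s = 0.0;
  for (int i = 0; i < 1000000; i++) s += i * 1e-6;  /* work to time */
  GPTLstop("compute");
  GPTLstop("total");
  GPTLpr(0);                   /* write the timing report */
  GPTLfinalize();
  return (int)(s < 0.0);       /* always 0 */
}
```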

  29. RCRToolkit • Resource Centric Performance Reflection technology: node-wide continuous monitoring and analysis of performance data from "outside the core," for use in introspective, dynamic adaptation (a.k.a. Resource Centric Reflection) • Data sources: uncore hardware on chip, I/O devices, power and energy monitors, OS events (scheduling, networking, I/O), global data from other nodes • Multiple clients (sources and users) share data through a shared blackboard structure (the RCRblackboard)

  30. RCRToolkit • Initial funding through the DoD ACS MAESTRO project • Additional current funding from the DoD ACS ATPER project • Some DOE effort from projects that use RCR • Sandia XGC project • XStack XPRESS • Distributed system monitoring work funded by and used by NSF GENI • Translating to SUPER • Power / energy / performance variability studies • Energy adaptation • Impact • Real-time data for adaptive scheduling for power and energy • Inter-socket and inter-node variation are sufficiently large to matter • Impact on deterministic strategies for (auto)tuning via incremental gradient methods • Looking at SciDAC end applications amenable to using RCR

  31. [Diagram: RCRToolkit architecture — applications (App 1–3), the job scheduler, performance tools, and power control tools all exchange data through the RCR Blackboard, which is populated by RCR Daemons and served by the RCR Logger and RCR Viewer]

  32. RCRToolkit • Resource Centric Performance Reflection • Performance measurement and analysis tool that focuses on shared resources in a system • Information and analysis should help applications and system code adapt introspectively, in real time, to contention at shared resources • RCRToolkit consists of the RCRblackboard and several clients for it • RCRblackboard • Shared memory region (currently, Google protocol buffers resident in memory) for real-time use by producers and consumers of node- and system-wide performance information • Information organized in a hierarchy that reflects hardware structure • Coordination managed by the RCRblackboard protocol: a single writer for 'owned' regions and multiple readers (illustrated in the sketch below)
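The sketch below illustrates the single-writer/multiple-reader idea behind the RCRblackboard using a POSIX shared-memory segment and a simplified seqlock-style protocol; it is not RCRToolkit code, and rcr_slot_t and the /rcr_demo segment name are hypothetical.

```c
/* Hedged sketch of a blackboard slot: one writer owns the slot, any
   number of readers take consistent snapshots. Simplified; a real
   seqlock needs careful memory ordering. Link with -lrt if required. */
#include <fcntl.h>
#include <stdio.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
  atomic_ulong seq;      /* even = stable, odd = write in progress */
  double joules;         /* example metric published by the owner */
} rcr_slot_t;

void publish(rcr_slot_t *s, double joules) {   /* writer only */
  atomic_fetch_add(&s->seq, 1);                /* mark unstable */
  s->joules = joules;
  atomic_fetch_add(&s->seq, 1);                /* mark stable */
}

double snapshot(rcr_slot_t *s) {               /* any reader */
  unsigned long a, b; double v;
  do {
    a = atomic_load(&s->seq);
    v = s->joules;
    b = atomic_load(&s->seq);
  } while (a != b || (a & 1));                 /* retry if torn */
  return v;
}

int main(void) {
  int fd = shm_open("/rcr_demo", O_CREAT | O_RDWR, 0600);
  ftruncate(fd, sizeof(rcr_slot_t));
  rcr_slot_t *slot = mmap(NULL, sizeof(rcr_slot_t),
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  publish(slot, 42.0);
  printf("energy: %g J\n", snapshot(slot));
  shm_unlink("/rcr_demo");
  return 0;
}
```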

  33. Autotuning Pipeline: CHiLL (USC + Utah) • Funding history • NSF Next Generation Software Program, Federica Darema, 2002 • NSF CSR, 2005 • PERI, 2006 • ASCR XTUNE, 2008 • Initial empirical optimization work • Integrated this initial work in SciDAC with other research • CHiLL autotuning system • Integrated into the PERI autotuning framework (an example of a CHiLL-style transformation appears below) • Broadening the autotuning research agenda in SUPER • Heterogeneous systems • Other objectives, in particular energy and resilience
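The example below is a hand-written illustration of the kind of source-to-source transformation CHiLL generates — tiling a matrix-multiply loop nest for cache reuse. In practice CHiLL derives such code from a transformation recipe; here the tiling is written out by hand, and TILE is the sort of parameter an auto-tuner searches over.

```c
/* Illustrative tiled matrix multiply: the i/j/k nest is blocked so
   each TILE x TILE working set stays cache-resident. */
#include <stdio.h>

#define N 256
#define TILE 32   /* tunable parameter an auto-tuner would search */

static double A[N][N], B[N][N], C[N][N];

int main(void) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

  for (int ii = 0; ii < N; ii += TILE)
    for (int kk = 0; kk < N; kk += TILE)
      for (int jj = 0; jj < N; jj += TILE)
        /* original loop nest, restricted to one tile */
        for (int i = ii; i < ii + TILE; i++)
          for (int k = kk; k < kk + TILE; k++)
            for (int j = jj; j < jj + TILE; j++)
              C[i][j] += A[i][k] * B[k][j];

  printf("C[0][0] = %g\n", C[0][0]);  /* expect 2*N = 512 */
  return 0;
}
```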

  34. Resilience Pipeline (USC) • Initial work to create a user API for expressing knowledge of application requirements • Supported by the Semiconductor Research Corporation (SRC) • Part of the Multiscale Systems (MuSyC) Focus Center Research Program (FCRP) • Work continues in SUPER • Collaborating with LLNL's resilience research team • Broaden the space of applications and assertions • New grant from ARO • Transition technology into the ROSE compiler (LLNL) • Create a runtime system based on JPL technology • Make it more broadly available

  35. Resilience Pipeline (Utah) • Additional NSF and SRC funding with Utah • Automatic derivation of predicates • Help detect silent errors (see the sketch below) • FPGA-based hardware components • Use FPGAs as co-processors • Originally funded by DARPA under the ACS (Adaptive Computing Systems) program • During the time Mary Hall was at ISI
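A hedged sketch of the predicate idea: an application-level invariant checked at run time to flag silent data corruption. The invariant, its thresholds, and the check_invariant name are all hypothetical, not derived predicates from the Utah work.

```c
/* Hedged sketch of an error-detecting predicate for an iterative
   solver: the residual should never be NaN, negative, or explosive. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void check_invariant(double residual, double prev_residual) {
  if (isnan(residual) || residual < 0.0 ||
      residual > 1e3 * prev_residual) {
    fprintf(stderr, "silent-error predicate violated: r=%g (prev %g)\n",
            residual, prev_residual);
    abort();   /* or trigger rollback / recomputation */
  }
}

int main(void) {
  double prev = 1.0;
  for (int it = 1; it <= 5; it++) {
    double r = prev * 0.5;          /* residual shrinking normally */
    check_invariant(r, prev);
    prev = r;
  }
  printf("all predicates held\n");
  return 0;
}
```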

  36. Modeling Performance through Source Analysis • Performance bounds give the upper limit on performance that can be expected for a given application on a given system • Different existing approaches: • Fully automatic (ignores machine information) • Theoretical peak (based on FP units) • Fully dynamic (profiling-based; costs time and overhead) • PBound approach • Application signatures + architecture → bounds

  37. PBound • Developed under PERC, PERI, and SUPER • ROSE-based tool that generates performance bounds from source code (C, C++, Fortran) • Example: what is the best achievable execution time? • Based on static (source code) analysis • Produces parameterized closed-form expressions that capture the computational and data load/store requirements of application kernels • Coupled with architectural information, produces upper bounds on the performance of the application
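A hedged illustration of the kind of closed-form expressions a PBound-style analysis yields and how they bound best-case execution time when combined with machine rates; the counts are hand-derived for the kernel in the comment, and the machine numbers are made-up placeholders, not PBound output.

```c
/* Hedged sketch: parameterized operation counts for an axpy-like
   kernel, combined with machine rates to bound execution time. */
#include <stdio.h>

int main(void) {
  /* Kernel: for (i = 0; i < n; i++) y[i] = a * x[i] + y[i];
     Closed-form requirements: flops(n) = 2n, bytes(n) = 24n
     (load x[i], load y[i], store y[i]; 8 bytes each). */
  double n = 1e8;
  double flops = 2.0 * n, bytes = 24.0 * n;

  double peak_flops = 100e9;   /* hypothetical: 100 GFLOP/s */
  double peak_bw    = 40e9;    /* hypothetical: 40 GB/s */

  double t_compute = flops / peak_flops;
  double t_memory  = bytes / peak_bw;
  double bound = t_compute > t_memory ? t_compute : t_memory;
  printf("best achievable time >= %.4f s (memory-bound: %s)\n",
         bound, t_memory > t_compute ? "yes" : "no");
  return 0;
}
```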

  38. Roofline Modeling • Roofline models characterize architectures and help visualize application performance within the architectural roofline [Williams 2009] • Shows the range of possible application performance • Determines how optimizations affect application performance • The performance space is bounded by either • Static performance models, such as those generated by PBound • Empirical models based upon platform experiments
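A small sketch of the roofline formula itself: attainable performance is min(compute peak, bandwidth × arithmetic intensity), swept here over a range of intensities. The peak numbers are illustrative placeholders, not measurements.

```c
/* Hedged sketch of the roofline bound across arithmetic intensities. */
#include <stdio.h>

int main(void) {
  double peak_gflops = 100.0;   /* hypothetical compute roof, GFLOP/s */
  double peak_bw     = 40.0;    /* hypothetical memory roof, GB/s */

  for (double ai = 0.0625; ai <= 16.0; ai *= 2.0) {
    double roof = peak_bw * ai;                  /* bandwidth-limited */
    if (roof > peak_gflops) roof = peak_gflops;  /* compute-limited */
    printf("intensity %7.4f flop/byte -> %6.2f GFLOP/s\n", ai, roof);
  }
  return 0;
}
```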

  39. SUPER Augmentation Funding for Roofline [Chart: roofline plot comparing the best bound (PBound) with measured performance]

  40. References • Weaver, V., Terpstra, D., McCraw, H., Johnson, M., Kasichayanula, K., Ralph, J., Nelson, J., Mucci, P., Mohan, T., Moore, S., "PAPI 5: Measuring Power, Energy, and the Cloud," Poster Abstract, 2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, April 21-23, 2013. • McCraw, H., Terpstra, D., Dongarra, J., Davis, K., Musselman, R., "Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q," International Supercomputing Conference 2013 (ISC'13), Leipzig, Germany, J.M. Kunkel, T. Ludwig, and H. Meuer (eds.), LNCS 7905, pp. 213-225, Springer, Heidelberg, 2013. • Johnson, M., McCraw, H., Moore, S., Mucci, P., Nelson, J., Terpstra, D., Weaver, V., Mohan, T., "PAPI-V: Performance Monitoring for Virtual Machines," CloudTech-HPC 2012, Pittsburgh, PA, September 10-13, 2012.

  41. References • [1] Tiwari, A., Laurenzano, M., Peraza, J., Carrington, L., and Snavely, A., "Green Queue: Customized Large-scale DVFS," Cloud and Green Computing (CGC), 2012, China. • [2] Peraza, J., Tiwari, A., Laurenzano, M., Carrington, L., and Snavely, A., "PMaC's Green Queue: A Framework for Selecting Energy Optimal DVFS Configurations in Large Scale MPI Applications," Concurrency and Computation: Practice and Experience, 2013. • [3] Laurenzano, M., Tikir, M., Carrington, L., and Snavely, A., "PEBIL: Efficient Static Binary Instrumentation for Linux," IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), White Plains, NY, 2010. • [4] Laurenzano, M., Peraza, J., Carrington, L., Tiwari, A., Ward, W., and Campbell, R., "PEBIL: Binary Instrumentation for Practical Data-Intensive Program Analysis," Cluster Computing, Special Issue on Data-Intensive High Performance Computing, 2013. • [5] Tikir, M., Laurenzano, M., Carrington, L., and Snavely, A., "PSINS: An Open Source Event Tracer and Execution Simulator for MPI Applications," Euro-Par 2009, Delft, The Netherlands.

  42. References • Hollingsworth, J.K. and Song, S., "Designing and Auto-Tuning Parallel 3-D FFT for Computation-Communication Overlap," PPoPP'14, Feb. 2014. • Hollingsworth, J.K. and Chen, R., "Towards Fully Automatic Auto-tuning: Leveraging Language Features of Chapel," International Journal of High Performance Computing Applications, 27(4), Nov. 2013. • Tiwari, A. and Hollingsworth, J.K., "Online Adaptive Code Generation and Tuning," IPDPS 2011, Anchorage, AK, May 2011 (Best Paper - Software).

  43. mpiP References • Vetter, J.S. and McCracken, M.O., "Statistical Scalability Analysis of Communication Operations in Distributed Applications," 8th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'01), Snowbird, Utah, June 2001, pp. 123-132. • Vetter, J.S., Lee, S., Li, D., Marin, G., McCurdy, C., Meredith, J., Roth, P.C., and Spafford, K., "Quantifying Architectural Requirements of Contemporary Extreme-Scale Scientific Applications," International Workshop on Performance Modeling, Benchmarking and Simulation of HPC Systems (PMBS13), Denver, Colorado, November 2013.

  44. Geant4 Performance Analysis and Tuning • Geant4 is extremely important to the design and execution of HEP experiments • How to evolve the design to best exploit current/future architectures? • A Geant4, HEP, and ASCR partnership • Not a standard performance analysis/tuning scenario • Quantifying the performance impact of OO design choices • Class-based performance analysis • Polymorphism (same function name, many implementations) • Virtual functions (what object types are functions invoked on?)

  45. Using TAU in Geant4 • TAU used to collect data for Simplified Calorimeter experiment • Sampling profiles: low-overhead measurements of full-scale experiments • Instrumentation-based: selectively instrumented classes and functions to collect precise measurements for functions (and whole classes) identified through sampling • Data stored in TAUdb (publicly shared with physics collaborators) • New types of analysis enabled by TAUdb and PerfExplorer • Create class-based profiles for performance hardware counters and derived metrics • Compare impacts of both high-level (design) optimizations, e.g., changing inheritance structure and low-level optimizations on performance metrics (cache misses, vectorization, etc.)
