
TAU Parallel Performance System for High-End Parallel Computing

Explore the TAU Parallel Performance System for productive and high-end parallel computing. Learn about its instrumentation, measurement, analysis tools, and performance data management.


Presentation Transcript


  1. Performance Technology for Productive, High-End Parallel Computing: the TAU Parallel Performance System Allen D. Malony malony@cs.uoregon.edu http://www.cs.uoregon.edu/research/tau Performance Research Laboratory (PRL) Neuroinformatics Center (NIC) Department of Computer and Information Science University of Oregon

  2. Outline • Research interests and motivation • TAU performance system • Instrumentation • Measurement • Analysis tools • Parallel profile analysis (ParaProf) • Performance data management (PerfDMF) • Performance data mining (PerfExplorer) • TAU on Solaris 10 • ZeptoOS and KTAU

  3. Research Motivation • Tools for performance problem solving • Empirical-based performance optimization process: performance observation → characterization → performance diagnosis (hypotheses, properties) → performance experimentation → performance tuning • Performance technology concerns • Instrumentation, measurement, analysis, and visualization • Experiment management and performance data storage

  4. Challenges in Performance Problem Solving • How to make the process more effective (productive)? • Process likely to change as parallel systems evolve • What are the important events and performance metrics? • Tied to application structure and computational model • Tied to application domain and algorithms • What are the significant issues that will affect the technology used to support the process? • Enhance application development and optimization • Process and tools can/must be more application-aware • Tools have poor support for application-specific aspects • Integrate performance technology and process

  5. Performance Process, Technology, and Scale • How does our view of this process change when we consider very large-scale parallel systems? • Scaling complicates observation and analysis • Performance data size • standard approaches deliver a lot of data with little value • Measurement overhead and intrusion • tradeoff with analysis accuracy • “noise” in the system • Analysis complexity increases • What will enhance productive application development? • Process and technology evolution • Nature of application development may change

  6. Role of Intelligence, Automation, and Knowledge • Scale forces the process to become more intelligent • Even with intelligent and application-specific tools, the decisions of what to analyze is difficult and intractable • More automation and knowledge-based decision making • Build automatic/autonomic capabilities into the tools • Support broader experimentation methods and refinement • Access and correlate data from several sources • Automate performance data analysis / mining / learning • Include predictive features and experiment refinement • Knowledge-driven adaptation and optimization guidance • Address scale issues through increased expertise

  7. TAU Performance System • Tuning and Analysis Utilities (14+ year project effort) • Performance system framework for HPC systems • Integrated, scalable, flexible, and parallel • Targets a general complex system computation model • Entities: nodes / contexts / threads • Multi-level: system / software / parallelism • Measurement and analysis abstraction • Integrated toolkit for performance problem solving • Instrumentation, measurement, analysis, and visualization • Portable performance profiling and tracing facility • Performance data management and data mining • Partners: LLNL, ANL, Research Center Jülich, LANL

  8. TAU Parallel Performance System Goals • Portable (open source) parallel performance system • Computer system architectures and operating systems • Different programming languages and compilers • Multi-level, multi-language performance instrumentation • Flexible and configurable performance measurement • Support for multiple parallel programming paradigms • Multi-threading, message passing, mixed-mode, hybrid, object oriented (generic), component • Support for performance mapping • Integration of leading performance technology • Scalable (very large) parallel performance analysis

  9. General Complex System Computation Model • Node: physically distinct shared memory machine • Message passing node interconnection network • Context: distinct virtual memory space within node • Thread: execution threads (user/system) in context [Diagram: physical view of SMP nodes with local memory on an interconnection network with inter-node message communication; model view of contexts (VM spaces) containing threads]

  10. TAU Performance System Architecture

  11. TAU Performance System Architecture

  12. TAU Instrumentation Approach • Support for standard program events • Routines, classes and templates • Statement-level blocks • Support for user-defined events • Begin/End events (“user-defined timers”) • Atomic events (e.g., size of memory allocated/freed) • Selection of event statistics • Support definition of “semantic” entities for mapping • Support for event groups (aggregation, selection) • Instrumentation optimization • Eliminate instrumentation in lightweight routines

  13. TAU Instrumentation Mechanisms • Source code • Manual (TAU API, TAU component API) • Automatic (robust) • C, C++, F77/90/95 (Program Database Toolkit (PDT)) • OpenMP (directive rewriting (Opari), POMP2 spec) • Object code • Pre-instrumented libraries (e.g., MPI using PMPI) • Statically-linked and dynamically-linked • Executable code • Dynamic instrumentation (pre-execution) (DynInstAPI) • Virtual machine instrumentation (e.g., Java using JVMPI) • TAU_COMPILER to automate instrumentation process

  14. Multi-Level Instrumentation and Mapping • Multiple interfaces • Information sharing • Between interfaces • Event selection • Within/between levels • Mapping • Associate performance data with high-level semantic abstractions [Diagram: instrumentation points across levels, from source code (preprocessor, compiler) through object code and libraries (linker), executable image, runtime, and virtual machine, down to the OS; all produce performance data for the run, mapped back to user-level abstractions in the problem domain]

  15. Program Database Toolkit (PDT) [Diagram: C/C++ and Fortran (F77/90/95) parsers for application and library sources emit IL; IL analyzers populate Program Database Files, accessed via DUCTAPE; clients include tau_instrumentor (automatic source instrumentation), PDBhtml (program documentation), SILOON (application component glue), and CHASM (C++ / F90/95 interoperability)]

  16. Program Database Toolkit (PDT) • Program code analysis framework • Develop source-based tools • High-level interface to source code information • Integrated toolkit for source code parsing, database creation, and database query • Commercial-grade front-end parsers • Portable IL analyzer, database format, and access API • Open software approach for tool development • Multiple source languages • Implement automatic performance instrumentation tools (e.g., tau_instrumentor)

  17. TAU Measurement Approach • Portable and scalable parallel profiling solution • Multiple profiling types and options • Event selection and control (enabling/disabling, throttling) • Online profile access and sampling • Online performance profile overhead compensation • Portable and scalable parallel tracing solution • Trace translation to Open Trace Format (OTF) • Trace streams and hierarchical trace merging • Robust timing and hardware performance support • Multiple counters (hardware, user-defined, system) • Performance measurement for CCA component software

  18. TAU Measurement Mechanisms • Parallel profiling • Function-level, block-level, statement-level • Supports user-defined events and mapping events • TAU parallel profile stored (dumped) during execution • Support for flat, callgraph/callpath, phase profiling • Support for memory profiling • Tracing • All profile-level events • Inter-process communication events • Inclusion of multiple counter data in traced events

  19. Types of Parallel Performance Profiling • Flat profiles • Metric (e.g., time) spent in an event (callgraph nodes) • Exclusive/inclusive, # of calls, child calls • Callpath profiles (calldepth profiles) • Time spent along a calling path (edges in callgraph) • “main => f1 => f2 => MPI_Send” (event name) • TAU_CALLPATH_LENGTH environment variable • Phase profiles • Flat profiles under a phase (nested phases are allowed) • Default “main” phase • Supports static or dynamic (per-iteration) phases

  20. Performance Analysis and Visualization • Analysis of parallel profile and trace measurement • Parallel profile analysis • ParaProf: parallel profile analysis and presentation • ParaVis: parallel performance visualization package • Profile generation from trace data (tau2pprof) • Performance data management framework (PerfDMF) • Parallel trace analysis • Translation to VTF (V3.0), EPILOG, OTF formats • Integration with VNG (Technical University of Dresden) • Online parallel analysis and visualization • Integration with CUBE browser (KOJAK, UTK, FZJ)

  21. ParaProf Parallel Performance Profile Analysis [Diagram: raw profile files (TAU, HPMToolkit, MpiP) and PerfDMF-managed database storage, organized with metadata into an Application / Experiment / Trial hierarchy]

  22. Example Applications • sPPM • ASCI benchmark, Fortran, C, MPI, OpenMP or pthreads • Miranda • research hydrodynamics code, Fortran, MPI • GYRO • tokamak turbulence simulation, Fortran, MPI • FLASH • physics simulation, Fortran, MPI • WRF • weather research and forecasting, Fortran, MPI • S3D • 3D combustion, Fortran, MPI

  23. ParaProf – Flat Profile (Miranda, BG/L) • Miranda: hydrodynamics, Fortran + MPI, LLNL • Flat profile shown per node/context/thread on 8K processors • Runs to 64K

  24. ParaProf – Stacked View (Miranda)

  25. ParaProf – Callpath Profile (Flash) • Flash: thermonuclear flashes, Fortran + MPI, Argonne

  26. ParaProf – Histogram View (Miranda) • 8K processors vs. 16K processors

  27. NAS BT – Flat Profile • How is MPI_Wait() distributed relative to solver direction? • Application routine names reflect phase semantics

  28. NAS BT – Phase Profile (Main and X, Y, Z) Main phase shows nested phases and immediate events

  29. ParaProf – 3D Full Profile (Miranda) 16k processors

  30. ParaProf – 3D Full Profile (Flash) 128 processors

  31. ParaProf – 3D Scatterplot (Miranda) • Each point is a “thread” of execution • A total of four metrics shown in relation • ParaVis 3D profile visualization library • JOGL

  32. ParaProf – Callgraph Zoom (Flash) Zoom in (+) Zoom out (-)

  33. Performance Tracing on Miranda • Use TAU to generate VTF3 traces for Vampir analysis • MPI calls with HW counter information (not shown) • Detailed code behavior to focus optimization efforts

  34. S3D on Lemieux (TAU-to-VTF3, Vampir) • S3D: 3D combustion, Fortran + MPI, PSC

  35. S3D on Lemieux (Zoomed)

  36. Hypothetical Mapping Example • Particles distributed on surfaces of a cube

  Particle* P[MAX];  /* array of particles */
  int GenerateParticles() {
    /* distribute particles over all faces of the cube */
    for (int face = 0, last = 0; face < 6; face++) {
      /* particles on this face */
      int particles_on_this_face = num(face);
      for (int i = last; i < last + particles_on_this_face; i++) {
        /* particle properties are a function of face */
        P[i] = ... f(face); ...
      }
      last += particles_on_this_face;
    }
  }

  37. Hypothetical Mapping Example (continued) • How much time (flops) is spent processing face i particles? • What is the distribution of performance among faces? • How is this determined if execution is parallel?

  int ProcessParticle(Particle* p) {
    /* perform some computation on p */
  }
  int main() {
    GenerateParticles();  /* create a list of particles */
    for (int i = 0; i < N; i++)  /* iterate over the list */
      ProcessParticle(P[i]);
  }

  [Diagram: work packets fed in parallel to a processing engine]

  38. No Performance Mapping versus Mapping • Typical performance tools report performance with respect to routines • No support for mapping to higher-level abstractions • TAU’s performance mapping can observe performance with respect to the scientist’s programming and problem abstractions [Screenshots: TAU without mapping vs. TAU with mapping]

  39. Component-Based Scientific Applications • How to support performance analysis and tuning process consistent with application development methodology? • Common Component Architecture (CCA) applications • Performance tools should integrate with software • Design performance observation component • Measurement port and measurement interfaces • Build support for application component instrumentation • Interpose a proxy component for each port • Inside the proxy, track caller/callee invocations, timings • Automate the process of proxy component creation • using PDT for static analysis of components • include support for selective instrumentation

  40. Flame Reaction-Diffusion (Sandia) CCAFFEINE

  41. Earth Systems Modeling Framework • Coupled modeling with modular software framework • Instrumentation for ESMF framework and applications • PDT automatic instrumentation • Fortran 95 code modules • C / C++ code modules • MPI wrapper library for MPI calls • ESMF component instrumentation (using CCA) • CCA measurement port manual instrumentation • Proxy generation using PDT and runtime interposition • Significant callpath profiling used by ESMF team

  42. Using TAU Component in ESMF/CCA

  43. Important Questions for Application Developers • How does performance vary with different compilers? • Is poor performance correlated with certain OS features? • Has a recent change caused unanticipated performance? • How does performance vary with MPI variants? • Why is one application version faster than another? • What is the reason for the observed scaling behavior? • Did two runs exhibit similar performance? • How are performance data related to application events? • Which machines will run my code the fastest and why? • Which benchmarks predict my code performance best?

  44. Performance Problem Solving Goals • Answer questions at multiple levels of interest • Data from low-level measurements and simulations • use to predict application performance • High-level performance data spanning dimensions • machine, applications, code revisions, data sets • examine broad performance trends • Discover general correlations between application performance and features of the external environment • Develop methods to predict application performance from lower-level metrics • Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system

  45. Automatic Performance Analysis Tool (Concept) [Diagram: build application → execute application → build information and environment/performance data stored in a performance database → offline analysis → simple analysis feedback (“105% faster!”, “72% faster!”)]

  46. Performance Data Management (PerfDMF) K. Huck, A. Malony, R. Bell, A. Morris, “Design and Implementation of a Parallel Performance Data Management Framework,” ICPP 2005. (awarded best paper)

  47. Performance Data Mining (Objectives) • Conduct parallel performance analysis in a systematic, collaborative and reusable manner • Manage performance complexity • Discover performance relationships and properties • Automate the process • Multi-experiment performance analysis • Large-scale performance data reduction • Summarize characteristics of large processor runs • Implement extensible analysis framework • Abstraction / automation of data mining operations • Interface to existing analysis and data mining tools

  48. Performance Data Mining (PerfExplorer) • Performance knowledge discovery framework • Data mining analysis applied to parallel performance data • comparative, clustering, correlation, dimension reduction, … • Use the existing TAU infrastructure • TAU performance profiles, PerfDMF • Client-server based system architecture • Technology integration • Java API and toolkit for portability • PerfDMF • R-project/Omegahat, Octave/Matlab statistical analysis • WEKA data mining package • JFreeChart for visualization, vector output (EPS, SVG)

  49. Performance Data Mining (PerfExplorer) K. Huck and A. Malony, “PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing,” SC 2005.

  50. PerfExplorer Analysis Methods • Data summaries, distributions, scatterplots • Clustering • k-means • Hierarchical • Correlation analysis • Dimension reduction • PCA • Random linear projection • Thresholds • Comparative analysis • Data management views
