
TAU Parallel Performance System for High-End Parallel Computing

Explore the TAU Parallel Performance System for productive and high-end parallel computing. Learn about its instrumentation, measurement, analysis tools, and performance data management.


Presentation Transcript


  1. Performance Technology for Productive, High-End Parallel Computing: the TAU Parallel Performance System Allen D. Malony malony@cs.uoregon.edu http://www.cs.uoregon.edu/research/tau Performance Research Laboratory (PRL) Neuroinformatics Center (NIC) Department of Computer and Information Science University of Oregon

  2. Outline • Research interests and motivation • TAU performance system • Instrumentation • Measurement • Analysis tools • Parallel profile analysis (ParaProf) • Performance data management (PerfDMF) • Performance data mining (PerfExplorer) • TAU on Solaris 10 • ZeptoOS and KTAU

  3. Research Motivation • Tools for performance problem solving • Empirical-based performance optimization process: performance observation → characterization → performance diagnosis (hypotheses, properties) → performance experimentation → performance tuning • Performance technology concerns • Instrumentation, measurement, analysis, and visualization • Experiment management and performance data storage

  4. Challenges in Performance Problem Solving • How to make the process more effective (productive)? • Process likely to change as parallel systems evolve • What are the important events and performance metrics? • Tied to application structure and computational model • Tied to application domain and algorithms • What are the significant issues that will affect the technology used to support the process? • Enhance application development and optimization • Process and tools can/must be more application-aware • Tools have poor support for application-specific aspects • Integrate performance technology and process

  5. Performance Process, Technology, and Scale • How does our view of this process change when we consider very large-scale parallel systems? • Scaling complicates observation and analysis • Performance data size • standard approaches deliver a lot of data with little value • Measurement overhead and intrusion • tradeoff with analysis accuracy • “noise” in the system • Analysis complexity increases • What will enhance productive application development? • Process and technology evolution • Nature of application development may change

  6. Role of Intelligence, Automation, and Knowledge • Scale forces the process to become more intelligent • Even with intelligent and application-specific tools, the decisions of what to analyze is difficult and intractable • More automation and knowledge-based decision making • Build automatic/autonomic capabilities into the tools • Support broader experimentation methods and refinement • Access and correlate data from several sources • Automate performance data analysis / mining / learning • Include predictive features and experiment refinement • Knowledge-driven adaptation and optimization guidance • Address scale issues through increased expertise

  7. TAU Performance System • Tuning and Analysis Utilities (14+ year project effort) • Performance system framework for HPC systems • Integrated, scalable, flexible, and parallel • Targets a general complex system computation model • Entities: nodes / contexts / threads • Multi-level: system / software / parallelism • Measurement and analysis abstraction • Integrated toolkit for performance problem solving • Instrumentation, measurement, analysis, and visualization • Portable performance profiling and tracing facility • Performance data management and data mining • Partners: LLNL, ANL, Research Center Jülich, LANL

  8. TAU Parallel Performance System Goals • Portable (open source) parallel performance system • Computer system architectures and operating systems • Different programming languages and compilers • Multi-level, multi-language performance instrumentation • Flexible and configurable performance measurement • Support for multiple parallel programming paradigms • Multi-threading, message passing, mixed-mode, hybrid, object oriented (generic), component • Support for performance mapping • Integration of leading performance technology • Scalable (very large) parallel performance analysis

  9. General Complex System Computation Model • Node: physically distinct shared memory machine • Message passing node interconnection network • Context: distinct virtual memory space within node • Thread: execution threads (user/system) in context [Diagram: physical view of SMP nodes with local memory on an interconnection network with inter-node message communication; model view of contexts (VM spaces) containing threads]

  10. TAU Performance System Architecture

  11. TAU Performance System Architecture

  12. TAU Instrumentation Approach • Support for standard program events • Routines, classes and templates • Statement-level blocks • Support for user-defined events • Begin/End events (“user-defined timers”) • Atomic events (e.g., size of memory allocated/freed) • Selection of event statistics • Support definition of “semantic” entities for mapping • Support for event groups (aggregation, selection) • Instrumentation optimization • Eliminate instrumentation in lightweight routines

  13. TAU Instrumentation Mechanisms • Source code • Manual (TAU API, TAU component API) • Automatic (robust) • C, C++, F77/90/95 (Program Database Toolkit (PDT)) • OpenMP (directive rewriting (Opari), POMP2 spec) • Object code • Pre-instrumented libraries (e.g., MPI using PMPI) • Statically-linked and dynamically-linked • Executable code • Dynamic instrumentation (pre-execution) (DynInstAPI) • Virtual machine instrumentation (e.g., Java using JVMPI) • TAU_COMPILER to automate instrumentation process

  14. Multi-Level Instrumentation and Mapping • Multiple interfaces • Information sharing • Between interfaces • Event selection • Within/between levels • Mapping • Associate performance data with high-level semantic abstractions [Diagram: instrumentation points across levels, from source code (preprocessor, compiler) through object code and libraries (linker), executable image, runtime, and virtual machine, down to the OS; all produce performance data for the run, mapped back to user-level abstractions in the problem domain]

  15. Program Database Toolkit (PDT) [Diagram: C/C++ and Fortran (F77/90/95) parsers for application and library sources emit IL; IL analyzers populate Program Database Files, accessed via DUCTAPE; clients include tau_instrumentor (automatic source instrumentation), PDBhtml (program documentation), SILOON (application component glue), and CHASM (C++ / F90/95 interoperability)]

  16. Program Database Toolkit (PDT) • Program code analysis framework • Develop source-based tools • High-level interface to source code information • Integrated toolkit for source code parsing, database creation, and database query • Commercial-grade front-end parsers • Portable IL analyzer, database format, and access API • Open software approach for tool development • Multiple source languages • Implement automatic performance instrumentation tools (e.g., tau_instrumentor)

  17. TAU Measurement Approach • Portable and scalable parallel profiling solution • Multiple profiling types and options • Event selection and control (enabling/disabling, throttling) • Online profile access and sampling • Online performance profile overhead compensation • Portable and scalable parallel tracing solution • Trace translation to Open Trace Format (OTF) • Trace streams and hierarchical trace merging • Robust timing and hardware performance support • Multiple counters (hardware, user-defined, system) • Performance measurement for CCA component software

  18. TAU Measurement Mechanisms • Parallel profiling • Function-level, block-level, statement-level • Supports user-defined events and mapping events • TAU parallel profile stored (dumped) during execution • Support for flat, callgraph/callpath, phase profiling • Support for memory profiling • Tracing • All profile-level events • Inter-process communication events • Inclusion of multiple counter data in traced events

  19. Types of Parallel Performance Profiling • Flat profiles • Metric (e.g., time) spent in an event (callgraph nodes) • Exclusive/inclusive, # of calls, child calls • Callpath profiles (calldepth profiles) • Time spent along a calling path (edges in callgraph) • “main => f1 => f2 => MPI_Send” (event name) • TAU_CALLPATH_LENGTH environment variable • Phase profiles • Flat profiles under a phase (nested phases are allowed) • Default “main” phase • Supports static or dynamic (per-iteration) phases

  20. Performance Analysis and Visualization • Analysis of parallel profile and trace measurement • Parallel profile analysis • ParaProf: parallel profile analysis and presentation • ParaVis: parallel performance visualization package • Profile generation from trace data (tau2pprof) • Performance data management framework (PerfDMF) • Parallel trace analysis • Translation to VTF (V3.0), EPILOG, OTF formats • Integration with VNG (Technical University of Dresden) • Online parallel analysis and visualization • Integration with CUBE browser (KOJAK, UTK, FZJ)

  21. ParaProf Parallel Performance Profile Analysis [Diagram: raw profile files (TAU, HPMToolkit, MpiP) and PerfDMF-managed database storage, organized with metadata into an Application / Experiment / Trial hierarchy]

  22. Example Applications • sPPM • ASCI benchmark, Fortran, C, MPI, OpenMP or pthreads • Miranda • research hydrodynamics code, Fortran, MPI • GYRO • tokamak turbulence simulation, Fortran, MPI • FLASH • physics simulation, Fortran, MPI • WRF • weather research and forecasting, Fortran, MPI • S3D • 3D combustion, Fortran, MPI

  23. ParaProf – Flat Profile (Miranda, BG/L) • Miranda: hydrodynamics, Fortran + MPI, LLNL • Flat profile shown per node/context/thread on 8K processors • Runs to 64K

  24. ParaProf – Stacked View (Miranda)

  25. ParaProf – Callpath Profile (Flash) • Flash: thermonuclear flashes, Fortran + MPI, Argonne

  26. ParaProf – Histogram View (Miranda) • 8K processors vs. 16K processors

  27. NAS BT – Flat Profile • How is MPI_Wait() distributed relative to solver direction? • Application routine names reflect phase semantics

  28. NAS BT – Phase Profile (Main and X, Y, Z) Main phase shows nested phases and immediate events

  29. ParaProf – 3D Full Profile (Miranda) 16k processors

  30. ParaProf – 3D Full Profile (Flash) 128 processors

  31. ParaProf – 3D Scatterplot (Miranda) • Each point is a “thread” of execution • A total of four metrics shown in relation • ParaVis 3D profile visualization library • JOGL

  32. ParaProf – Callgraph Zoom (Flash) Zoom in (+) Zoom out (-)

  33. Performance Tracing on Miranda • Use TAU to generate VTF3 traces for Vampir analysis • MPI calls with HW counter information (not shown) • Detailed code behavior to focus optimization efforts

  34. S3D on Lemieux (TAU-to-VTF3, Vampir) • S3D: 3D combustion, Fortran + MPI, PSC

  35. S3D on Lemieux (Zoomed)

  36. Hypothetical Mapping Example • Particles distributed on surfaces of a cube

  Particle* P[MAX];  /* array of particles */
  int GenerateParticles() {
    /* distribute particles over all faces of the cube */
    for (int face = 0, last = 0; face < 6; face++) {
      /* particles on this face */
      int particles_on_this_face = num(face);
      for (int i = last; i < last + particles_on_this_face; i++) {
        /* particle properties are a function of face */
        P[i] = ... f(face); ...
      }
      last += particles_on_this_face;
    }
  }

  37. Hypothetical Mapping Example (continued) • How much time (flops) is spent processing face i particles? • What is the distribution of performance among faces? • How is this determined if execution is parallel?

  int ProcessParticle(Particle* p) {
    /* perform some computation on p */
  }
  int main() {
    GenerateParticles();  /* create a list of particles */
    for (int i = 0; i < N; i++)  /* iterate over the list */
      ProcessParticle(P[i]);
  }

  [Diagram: work packets fed in parallel to a processing engine]

  38. No Performance Mapping versus Mapping • Typical performance tools report performance with respect to routines • No support for mapping to higher-level abstractions • TAU’s performance mapping can observe performance with respect to the scientist’s programming and problem abstractions [Screenshots: TAU without mapping vs. TAU with mapping]

  39. Component-Based Scientific Applications • How to support performance analysis and tuning process consistent with application development methodology? • Common Component Architecture (CCA) applications • Performance tools should integrate with software • Design performance observation component • Measurement port and measurement interfaces • Build support for application component instrumentation • Interpose a proxy component for each port • Inside the proxy, track caller/callee invocations, timings • Automate the process of proxy component creation • using PDT for static analysis of components • include support for selective instrumentation

  40. Flame Reaction-Diffusion (Sandia) CCAFFEINE

  41. Earth Systems Modeling Framework • Coupled modeling with modular software framework • Instrumentation for ESMF framework and applications • PDT automatic instrumentation • Fortran 95 code modules • C / C++ code modules • MPI wrapper library for MPI calls • ESMF component instrumentation (using CCA) • CCA measurement port manual instrumentation • Proxy generation using PDT and runtime interposition • Significant callpath profiling used by ESMF team

  42. Using TAU Component in ESMF/CCA

  43. Important Questions for Application Developers • How does performance vary with different compilers? • Is poor performance correlated with certain OS features? • Has a recent change caused unanticipated performance? • How does performance vary with MPI variants? • Why is one application version faster than another? • What is the reason for the observed scaling behavior? • Did two runs exhibit similar performance? • How are performance data related to application events? • Which machines will run my code the fastest and why? • Which benchmarks predict my code performance best?

  44. Performance Problem Solving Goals • Answer questions at multiple levels of interest • Data from low-level measurements and simulations • use to predict application performance • High-level performance data spanning dimensions • machine, applications, code revisions, data sets • examine broad performance trends • Discover general correlations between application performance and features of the external environment • Develop methods to predict application performance from lower-level metrics • Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system

  45. Automatic Performance Analysis Tool (Concept) [Diagram: build application → execute application → build information and environment/performance data stored in a performance database → offline analysis → simple analysis feedback (“105% faster!”, “72% faster!”)]

  46. Performance Data Management (PerfDMF) K. Huck, A. Malony, R. Bell, A. Morris, “Design and Implementation of a Parallel Performance Data Management Framework,” ICPP 2005. (awarded best paper)

  47. Performance Data Mining (Objectives) • Conduct parallel performance analysis in a systematic, collaborative and reusable manner • Manage performance complexity • Discover performance relationships and properties • Automate the process • Multi-experiment performance analysis • Large-scale performance data reduction • Summarize characteristics of large processor runs • Implement extensible analysis framework • Abstraction / automation of data mining operations • Interface to existing analysis and data mining tools

  48. Performance Data Mining (PerfExplorer) • Performance knowledge discovery framework • Data mining analysis applied to parallel performance data • comparative, clustering, correlation, dimension reduction, … • Use the existing TAU infrastructure • TAU performance profiles, PerfDMF • Client-server based system architecture • Technology integration • Java API and toolkit for portability • PerfDMF • R-project/Omegahat, Octave/Matlab statistical analysis • WEKA data mining package • JFreeChart for visualization, vector output (EPS, SVG)

  49. Performance Data Mining (PerfExplorer) K. Huck and A. Malony, “PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing,” SC 2005.

  50. PerfExplorer Analysis Methods • Data summaries, distributions, scatterplots • Clustering • k-means • Hierarchical • Correlation analysis • Dimension reduction • PCA • Random linear projection • Thresholds • Comparative analysis • Data management views
