
The TAU Performance System



Presentation Transcript


  1. The TAU Performance System Allen D. Malony Sameer S. Shende Robert Bell {malony,shende,bell}@cs.uoregon.edu Department of Computer and Information Science Computational Science Institute University of Oregon

  2. Overview • Motivation and goals • TAU architecture and toolkit • Instrumentation • Measurement • Analysis • Performance mapping • Application case studies • … • TAU Integration • Work in progress • Conclusions

  3. Motivation • Tools for performance problem solving • Empirical-based performance optimization process • Versatile performance technology • Portable performance analysis methods [Diagram: performance observation feeds performance experimentation (characterization), which feeds performance diagnosis (hypotheses) and performance tuning; performance technology (properties) spans instrumentation, measurement, analysis, and visualization]

  4. Problems • Diverse performance observability requirements • Multiple levels of software and hardware • Different types and detail of performance data • Alternative performance problem solving methods • Multiple targets of software and system application • Demands more robust performance technology • Broad scope of performance observation • Flexible and configurable mechanisms • Technology integration and extension • Cross-platform portability • Open, layered, and modular framework architecture

  5. Complexity Challenges for Performance Tools • Computing system environment complexity • Observation integration and optimization • Access, accuracy, and granularity constraints • Diverse/specialized observation capabilities/technology • Restricted modes limit performance problem solving • Sophisticated software development environments • Programming paradigms and performance models • Performance data mapping to software abstractions • Uniformity of performance abstraction across platforms • Rich observation capabilities and flexible configuration • Common performance problem solving methods

  6. General Problems (Performance Technology) How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges? How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems? 

  7. Computation Model for Performance Technology • How to address dual performance technology goals? • Robust capabilities + widely available methods • Contend with problems of system diversity • Flexible tool composition/configuration/integration • Approaches • Restrict computation types / performance problems • machines, languages, instrumentation technique, … • limited performance technology coverage and application • Base technology on abstract computation model • general architecture and software execution features • map features/methods to existing complex system types • develop capabilities that can be adapted and optimized

  8. General Complex System Computation Model • Node: physically distinct shared memory machine • Message passing node interconnection network • Context: distinct virtual memory space within node • Thread: execution threads (user/system) in context [Diagram: physical view shows SMP nodes, each with its own memory, joined by an interconnection network carrying inter-node message communication; model view shows each node's VM spaces (contexts) containing threads]

  9. TAU Performance System • Tuning and Analysis Utilities • Performance system framework for scalable parallel and distributed high-performance computing • Targets a general complex system computation model • nodes / contexts / threads • Multi-level: system / software / parallelism • Measurement and analysis abstraction • Integrated toolkit for performance instrumentation, measurement, analysis, and visualization • Portable performance profiling and tracing facility • Open software approach with technology integration • University of Oregon, Forschungszentrum Jülich, LANL

  10. TAU Performance System Architecture [Architecture diagram; trace output connects to external formats/tools including EPILOG and Paraver]

  11. TAU Performance System Goals • Multi-level performance instrumentation • Multi-language automatic source instrumentation • Flexible and configurable performance measurement • Widely-ported parallel performance profiling system • Computer system architectures and operating systems • Different programming languages and compilers • Support for multiple parallel programming paradigms • Multi-threading, message passing, mixed-mode, hybrid • Support for performance mapping • Support for object-oriented and generic programming • Integration in complex software systems and applications

  12. How To Use TAU? • Instrumentation • Application code and libraries • Selective instrumentation • Install, compile, and link with TAU measurement library • % configure; make clean install • Multiple configurations for different measurement options • Does not require change in instrumentation • Selective measurement control • Execute “experiments” to produce performance data • Performance data generated at end of or during execution • Use analysis tools to look at performance results

  13. TAU Instrumentation Approach • Support for standard program events • Routines • Classes and templates • Statement-level blocks • Support for user-defined events • Begin/End events (“user-defined timers”) • Atomic events • Selection of event statistics • Support definition of “semantic” entities for mapping • Support for event groups • Instrumentation optimization

  14. TAU Instrumentation • Flexible instrumentation mechanisms at multiple levels • Source code • manual • automatic • C, C++, F77/90 (Program Database Toolkit (PDT)) • OpenMP (directive rewriting (Opari)) • Object code • pre-instrumented libraries (e.g., MPI using PMPI) • statically-linked and dynamically-linked • fast breakpoints (compiler generated) • Executable code • dynamic instrumentation (pre-execution) (DynInstAPI) • virtual machine instrumentation (e.g., Java using JVMPI)

  15. Multi-Level Instrumentation • Targets common measurement interface • TAU API • Multiple instrumentation interfaces • Simultaneously active • Information sharing between interfaces • Utilizes instrumentation knowledge between levels • Selective instrumentation • Available at each level • Cross-level selection • Targets a common performance model • Presents a unified view of execution • Consistent performance events

  16. Program Database Toolkit (PDT) • Program code analysis framework • develop source-based tools • High-level interface to source code information • Integrated toolkit for source code parsing, database creation, and database query • Commercial grade front-end parsers • Portable IL analyzer, database format, and access API • Open software approach for tool development • Multiple source languages • Implement automatic performance instrumentation tools • tau_instrumentor

  17. PDT Architecture and Tools [Diagram: application/library sources pass through C/C++ and Fortran 77/90 parsers to IL; C/C++ and Fortran 77/90 IL analyzers produce Program Database Files, accessed via DUCTAPE by tools such as PDBhtml (program documentation), SILOON (application component glue), CHASM (C++/F90 interoperability), and tau_instrumentor (automatic source instrumentation)]

  18. PDT Components • Language front end • Edison Design Group (EDG): C, C++, Java • Mutek Solutions Ltd.: F77, F90 • IL Analyzer • Processes intermediate language (IL) tree from front end • Creates “program database” (PDB) formatted file • DUCTAPE (Bernd Mohr, FZJ/ZAM, Germany) • C++ program Database Utilities and Conversion Tools APplication Environment • Processes and merges PDB files • C++ library to access the PDB for PDT applications

  19. Instrumentation Control • Selection of which performance events to observe • Could depend on scope, type, level of interest • Could depend on instrumentation overhead • How is selection supported in instrumentation system? • No choice • Include / exclude lists (TAU) • Environment variables • Static vs. dynamic • Controlling the instrumentation of small routines • High relative measurement overhead • Significant intrusion and possible perturbation

  20. Selective Instrumentation
% tau_instrumentor
Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file>]
For selective instrumentation, use the -f option:
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be the only
# routines that are instrumented.
# To specify an include list (a list of routines that will be instrumented)
# remove the leading # to uncomment the following lines
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST

  21. Overhead Analysis for Automatic Selection • Analyze the performance data to determine events with high relative measurement overhead • Create a select list for excluding those events • Rule grammar (used in tau_reduce tool) [GroupName:] Field Operator Number • GroupName indicates the rule applies only to events in that group • Field is an event metric attribute (from profile statistics) • numcalls, numsubs, percent, usec, cumusec, count, totalcount, stdev, usecs/call, counts/call • Operator is one of >, <, or = • Number is any number • Compound rules possible using “&” between simple rules

  22. Example Rules
#Exclude all events that are members of TAU_USER
#and use less than 1000 microseconds
TAU_USER:usec < 1000
#Exclude all events that use less than 1000
#microseconds and are called only once
usec < 1000 & numcalls = 1
#Exclude all events that have less than 1000 usecs per
#call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5
• Rules on separate lines combine with OR • Scientific notation can be used
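A rule file like the one above is evaluated mechanically against each event's profile statistics. The sketch below is hypothetical code, not tau_reduce's actual implementation; the EventStats fields mirror the Field names from the grammar. It shows how the example rules combine: “&” within a rule means AND, rules on separate lines combine with OR, and a matching event is excluded.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-event profile statistics (fields follow the grammar's Field names).
struct EventStats {
    std::string group;    // e.g. "TAU_USER"
    double      usec;     // exclusive microseconds
    long        numcalls;
    double      percent;  // total inclusive percent
    double usecs_per_call() const { return numcalls ? usec / numcalls : 0.0; }
};

// Rule 1: TAU_USER:usec < 1000   (GroupName-qualified rule)
bool rule1(const EventStats& e) {
    return e.group == "TAU_USER" && e.usec < 1000;
}

// Rule 2: usec < 1000 & numcalls = 1   (compound rule: "&" means AND)
bool rule2(const EventStats& e) {
    return e.usec < 1000 && e.numcalls == 1;
}

// Rules on separate lines combine with OR:
//   usecs/call < 1000
//   percent < 5
bool rule3(const EventStats& e) {
    return e.usecs_per_call() < 1000 || e.percent < 5;
}

// An event is excluded from instrumentation if any rule matches.
bool exclude(const EventStats& e) {
    return rule1(e) || rule2(e) || rule3(e);
}
```

A lightweight routine such as a tiny TAU_USER helper with 500 usec total would match rule 1 and be excluded, while a long-running routine dominating the profile would match nothing and stay instrumented.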

  23. TAU Measurement • Performance information • Performance events • High-resolution timer library (real-time / virtual clocks) • General software counter library (user-defined events) • Hardware performance counters • PCL (Performance Counter Library) (ZAM, Germany) • PAPI (Performance API) (UTK, Ptools Consortium) • consistent, portable API • Organization • Node, context, thread levels • Profile groups for collective events (runtime selective) • Performance data mapping between software levels

  24. TAU Measurement Options • Parallel profiling • Function-level, block-level, statement-level • Supports user-defined events • TAU parallel profile data stored during execution • Hardware counter values • Support for multiple counters • Support for callpath profiling • Tracing • All profile-level events • Inter-process communication events • Timestamp synchronization • Trace merging and format conversion

  25. TAU Measurement System Configuration • configure [OPTIONS] • {-c++=<CC>, -cc=<cc>} Specify C++ and C compilers • {-pthread, -sproc, -smarts} Use pthread, SGI sproc, or SMARTS threads • -openmp Use OpenMP threads • -opari=<dir> Specify location of Opari OpenMP tool • {-papi=<dir>, -pcl=<dir>} Specify location of PAPI or PCL • -pdt=<dir> Specify location of PDT • {-mpiinc=<d>, -mpilib=<d>} Specify MPI library instrumentation • -TRACE Generate TAU event traces • -PROFILE Generate TAU profiles • -PROFILECALLPATH Generate callpath profiles (1-level) • -MULTIPLECOUNTERS Use more than one hardware counter • -CPUTIME Use user time + system time • -PAPIWALLCLOCK Use PAPI to access wallclock time • -PAPIVIRTUAL Use PAPI for virtual (user) time …

  26. TAU Measurement API • Initialization and runtime configuration • TAU_PROFILE_INIT(argc, argv); TAU_PROFILE_SET_NODE(myNode); TAU_PROFILE_SET_CONTEXT(myContext); TAU_PROFILE_EXIT(message); TAU_REGISTER_THREAD(); • Function and class methods • TAU_PROFILE(name, type, group); • Template • TAU_TYPE_STRING(variable, type); TAU_PROFILE(name, type, group); CT(variable); • User-defined timing • TAU_PROFILE_TIMER(timer, name, type, group); TAU_PROFILE_START(timer); TAU_PROFILE_STOP(timer);
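The timing macros above are scope-based: TAU_PROFILE starts a timer when the routine is entered and stops it when the enclosing scope exits. A simplified stand-in (plain C++, not TAU's implementation; the profile store here is a bare map) illustrates that RAII behavior and the per-event accumulation:

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Accumulated inclusive time per event name (a stand-in for TAU's profile data).
std::map<std::string, double>& profile_store() {
    static std::map<std::string, double> store;
    return store;
}

// RAII stand-in for TAU_PROFILE(name, type, group): the constructor starts the
// timer on scope entry, the destructor stops it and accumulates on scope exit.
class ScopedTimer {
    std::string name_;
    std::chrono::steady_clock::time_point start_;
public:
    explicit ScopedTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto stop = std::chrono::steady_clock::now();
        profile_store()[name_] +=
            std::chrono::duration<double, std::micro>(stop - start_).count();
    }
};

void work() {
    ScopedTimer t("void work()");  // plays the role of TAU_PROFILE(...)
    // ... routine body being measured ...
}
```

Every call to work() adds its elapsed microseconds to the same "void work()" entry, which is why profiles report totals and call counts per event rather than per invocation.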

  27. TAU Measurement API (continued) • User-defined events • TAU_REGISTER_EVENT(variable, event_name); TAU_EVENT(variable, value); TAU_PROFILE_STMT(statement); • Mapping • TAU_MAPPING(statement, key); TAU_MAPPING_OBJECT(funcIdVar); TAU_MAPPING_LINK(funcIdVar, key); • TAU_MAPPING_PROFILE(funcIdVar); TAU_MAPPING_PROFILE_TIMER(timer, funcIdVar); TAU_MAPPING_PROFILE_START(timer); TAU_MAPPING_PROFILE_STOP(timer); • Reporting • TAU_REPORT_STATISTICS(); TAU_REPORT_THREAD_STATISTICS();

  28. Grouping Performance Data in TAU • Profile Groups • A group of related routines forms a profile group • Statically defined • TAU_DEFAULT, TAU_USER[1-5], TAU_MESSAGE, TAU_IO, … • Dynamically defined • group name based on string, such as “adlib” or “particles” • runtime lookup in a map to get unique group identifier • uses tau_instrumentor to instrument • Ability to change group names at runtime • Group-based instrumentation and measurement control

  29. TAU Group Instrumentation Control API • Enabling Profile Groups • TAU_ENABLE_INSTRUMENTATION(); • TAU_ENABLE_GROUP(TAU_GROUP); • TAU_ENABLE_GROUP_NAME(“group name”); • TAU_ENABLE_ALL_GROUPS(); • Disabling Profile Groups • TAU_DISABLE_INSTRUMENTATION(); • TAU_DISABLE_GROUP(TAU_GROUP); • TAU_DISABLE_GROUP_NAME(“group name”); • TAU_DISABLE_ALL_GROUPS(); • Obtaining Profile Group Identifier • Runtime Switching of Profile Groups

  30. TAU Pre-execution Control • Dynamic groups defined at file scope • Group names and group associations runtime-modifiable • Controlling groups at pre-execution time • -profile <group1+group2+…+groupN> option % tau_instrumentor app.pdb app.cpp \ -o app.i.cpp -g “particles” % mpirun -np 4 application \ -profile particles+field+mesh+io • Examples: • POOMA (LANL) uses static groups • VTF (Caltech) uses dynamic groups in Python-based execution instrumentation control

  31. Configuring TAU Measurement Library • Profiling with wallclock time (on a quad PIII Linux machine) • % configure -mpiinc=/usr/local/packages/mpich/include -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit/ -useropt=-O2 -LINUXTIMERS • Tracing • % configure -TRACE -mpiinc=/usr/local/packages/mpich/include -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit -useropt=-O2 -LINUXTIMERS • Profiling with PAPI • % configure -mpiinc=/usr/local/packages/mpich/include -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit/ -useropt=-O2 -papi=/usr/local/packages/papi • % setenv PAPI_EVENT PAPI_FP_INS • % setenv PAPI_EVENT PAPI_L1_DCM

  32. Compiling with TAU Makefiles • Include TAU Stub Makefile (<arch>/lib) in the user’s Makefile • Variables: • TAU_CXX Specify the C++ compiler used by TAU • TAU_CC, TAU_F90 Specify the C, F90 compilers • TAU_DEFS Defines used by TAU. Add to CFLAGS • TAU_LDFLAGS Linker options. Add to LDFLAGS • TAU_INCLUDE Header files include path. Add to CFLAGS • TAU_LIBS Statically linked TAU library. Add to LIBS • TAU_SHLIBS Dynamically linked TAU library • TAU_MPI_LIBS TAU’s MPI wrapper library for C/C++ • TAU_MPI_FLIBS TAU’s MPI wrapper library for F90 • TAU_FORTRANLIBS Must be linked in with C++ linker for F90. • TAU_DISABLE TAU’s dummy F90 stub library
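Pulling these variables into a build might look like the fragment below. This is a sketch: the stub makefile name and install path are illustrative, while the variable names are the ones listed above.

```makefile
# Hypothetical fragment of a user's Makefile; the included stub makefile path
# is illustrative -- configure generates one per measurement configuration.
include /usr/local/tau/i386_linux/lib/Makefile.tau-mpi-pdt

CXX      = $(TAU_CXX)
F90      = $(TAU_F90)
CFLAGS  += $(TAU_DEFS) $(TAU_INCLUDE)
LDFLAGS += $(TAU_LDFLAGS)
LIBS    += $(TAU_MPI_LIBS) $(TAU_LIBS)
```

Switching measurement configurations (profiling vs. tracing, different counters) then only requires including a different stub makefile, not changing the instrumentation.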

  33. TAU Analysis • Parallel profile analysis • Pprof • parallel profiler with text-based display • Racy • graphical interface to pprof (Tcl/Tk) • jRacy • Java implementation of Racy • Trace analysis and visualization • Trace merging and clock adjustment (if necessary) • Trace format conversion (ALOG, SDDF, VTF, Paraver) • Trace visualization using Vampir (Pallas)

  34. Pprof Command • pprof [-c|-b|-m|-t|-e|-i|-v] [-r] [-s] [-n num] [-f file] [-l] [nodes] • -c Sort according to number of calls • -b Sort according to number of subroutines called • -m Sort according to msecs (exclusive time total) • -t Sort according to total msecs (inclusive time total) • -e Sort according to exclusive time per call • -i Sort according to inclusive time per call • -v Sort according to standard deviation (exclusive usec) • -r Reverse sorting order • -s Print only summary profile information • -n num Print only first number of functions • -f file Specify full path and filename without node ids • -l List all functions and exit • nodes Print information only about all contexts/threads of the given node numbers

  35. Pprof Output (NAS Parallel Benchmark – LU) • Intel Quad PIII Xeon • F90 + MPICH • Profile per node / context / thread • Events: code + MPI

  36. jRacy (NAS Parallel Benchmark – LU) [Screenshots: routine profile across all nodes (n: node, c: context, t: thread), global profiles, event legend, and an individual profile]

  37. TAU + PAPI (NAS Parallel Benchmark – LU) • Floating point operations • Re-link to alternate library • Can use multiple counter support

  38. TAU + Vampir (NAS Parallel Benchmark – LU) [Screenshots: callgraph display, timeline display, parallelism display, communications display]

  39. tau_reduce Example • tau_reduce implements overhead reduction in TAU • Consider the klargest example • Find the kth largest element among N elements • Compare two methods: quicksort, select_kth_largest • Un-instrumented testcase: i = 2324, N = 1000000 • quicksort: (wall clock) = 0.188511 secs • select_kth_largest: (wall clock) = 0.149594 secs • Total: (PIII/1.2GHz time) = 0.340u 0.020s 0:00.37 • Execute with all routines instrumented • Execute with rule-based selective instrumentation usec>1000 & numcalls>400000 & usecs/call<30 & percent>25

  40. Simple sorting example on one processor

Before selective instrumentation reduction:

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time  Exclusive  Inclusive        #Call       #Subrs  Inclusive  Name
          msec       msec                               usec/call
---------------------------------------------------------------------------------------
100.0        13      4,982            1            4     4982030  int main
 93.5     3,223      4,659  4.20241E+06  1.40268E+07           1  void quicksort
 62.9   0.00481      3,134            5            5      626839  int kth_largest_qs
 36.4       137      1,813           28       450057       64769  int select_kth_largest
 33.6       150      1,675       449978       449978           4  void sort_5elements
 28.8     1,435      1,435  1.02744E+07            0           0  void interchange
  0.4        20         20            1            0       20668  void setup
  0.0    0.0118     0.0118           49            0           0  int ceil

After selective instrumentation reduction:

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time  Exclusive  Inclusive        #Call       #Subrs  Inclusive  Name
          msec       msec                               usec/call
---------------------------------------------------------------------------------------
100.0        14        383            1            4      383333  int main
 50.9       195        195            5            0       39017  int kth_largest_qs
 40.0       153        153           28           79        5478  int select_kth_largest
  5.4        20         20            1            0       20611  void setup
  0.0      0.02       0.02           49            0           0  int ceil

  41. TAU Performance System Status • Computing platforms • IBM SP / Power4, SGI Origin 2K/3K, ASCI Red, Cray T3E / SV-1 (X-1 planned), HP (Compaq) SC (Tru64), HP Superdome (HP-UX), Sun, Hitachi SR8000, NEC SX-5 (SX-6 underway), Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power), Apple (OS X), Windows • Programming languages • C, C++, Fortran 77, F90, HPF, Java, OpenMP, Python • Communication libraries • MPI, PVM, Nexus, shmem, Tulip, ACLMPL, MPIJava • Thread libraries • pthreads, SGI sproc, Java, Windows, OpenMP, SMARTS

  42. TAU Performance System Status (continued) • Compilers • Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC, Intel • Application libraries (selected) • Blitz++, A++/P++, PETSc, SAMRAI, Overture, PAWS • Application frameworks (selected) • POOMA, MC++, Conejo, Uintah, VTF, UPS, GrACE • Performance projects using TAU • Aurora / SCALEA: ACPC, University of Vienna • TAU full distribution (Version 2.12, web download) • TAU performance system toolkit and user’s guide • Automatic software installation and examples

  43. PDT Status • Program Database Toolkit (Version 2.2, web download) • EDG C++ front end (Version 2.45.2) • Mutek Fortran 90 front end (Version 2.4.1) • C++ and Fortran 90 IL Analyzer • DUCTAPE library • Standard C++ system header files (KCC Version 4.0f) • PDT-constructed tools • TAU instrumentor (C/C++/F90) • Program analysis support for SILOON and CHASM • Platforms • Same as for TAU with a few exceptions

  44. Performance Mapping • High-level semantic abstractions • Associate performance measurements with these abstractions • Performance mapping • performance measurement system support to assign data correctly

  45. Semantic Entities/Attributes/Associations • New dynamic mapping scheme (SEAA) • Contrast with ParaMap (Miller and Irvin) • Entities defined at any level of abstraction • Attribute entity with semantic information • Entity-to-entity associations • Two association types (implemented in TAU API) • Embedded – extends associated object to store performance measurement entity • External – creates an external look-up table using address of object as key to locate performance measurement entity
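The external association type can be sketched in a few lines of C++ (hypothetical code; TAU's internals differ): the application object is left untouched, and its address keys a side table holding the measurement entity. The embedded alternative would instead add the PerfData member to the application type itself.

```cpp
#include <cassert>
#include <unordered_map>

// A performance measurement entity to be associated with an application object.
struct PerfData {
    long   calls = 0;
    double usec  = 0.0;
};

// External association (SEAA): the application object is not modified; its
// address serves as the look-up key into a side table of measurement entities.
std::unordered_map<const void*, PerfData>& assoc_table() {
    static std::unordered_map<const void*, PerfData> table;
    return table;
}

// Look up (creating on first use) the measurement entity for an object.
PerfData& perf_of(const void* obj) { return assoc_table()[obj]; }

struct Particle { double x, y, z; };  // unchanged application type
```

Measurement code can then attribute time to any semantic entity it can address, e.g. `perf_of(&p).calls += 1;` for a particular Particle, without changing the application's data layout.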

  46. Hypothetical Mapping Example • Particles distributed on surfaces of a cube

Particle* P[MAX];  /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < last + particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face); ...
    }
    last += particles_on_this_face;
  }
}

  47. Hypothetical Mapping Example (continued) • How much time is spent processing face i particles? • What is the distribution of performance among faces?

int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles();  /* create a list of particles */
  for (int i = 0; i < N; i++)  /* iterates over the list */
    ProcessParticle(P[i]);
}

  48. No Performance Mapping versus Mapping • Typical performance tools report performance with respect to routines • Does not provide support for mapping • Performance tools with SEAA mapping can observe performance with respect to scientist’s programming and problem abstractions [Screenshots: TAU without mapping vs. TAU with mapping]

  49. Performance Mapping in Callpath Profiling • Consider callgraph (callpath) profiling • Measure time (metric) along an edge (path) of callgraph • Incident edge gives parent / child view • Edge sequence (path) gives parent / descendant view • Callpath profiling when callgraph is unknown • Must determine callgraph dynamically at runtime • Map performance measurement to dynamic call path state • Callpath levels • 0-level: current callgraph node • 1-level: immediate parent (descendant) • k-level: kth calling parent (call descendant)

  50. 1-Level Callpath Implementation in TAU • TAU maintains a performance event (routine) callstack • Profiled routine (child) looks in callstack for parent • Previous profiled performance event is the parent • A callpath profile structure created first time parent calls • TAU records parent in a callgraph map for child • String representing 1-level callpath used as its key • “a( )=>b( )” : name for time spent in “b” when called by “a” • Map returns pointer to callpath profile structure • 1-level callpath is profiled using this profiling data • Build upon TAU’s performance mapping technology • Measurement is independent of instrumentation • Use -PROFILECALLPATH to configure TAU
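The parent lookup and map-keying steps above can be sketched as follows (a simplified, single-threaded illustration, not TAU's code): routine entry consults the callstack for the parent, forms the “a( )=>b( )” key, and creates the callpath profile structure the first time that parent/child pair occurs.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for a callpath profile structure (TAU's actual structure differs).
struct CallpathProfile { long calls = 0; };

std::vector<std::string>               callstack;     // performance event (routine) callstack
std::map<std::string, CallpathProfile> callpath_map;  // keyed by "parent( )=>child( )"

// On routine entry: the previous profiled event on the callstack is the parent;
// the 1-level callpath string is the map key, created on first use.
CallpathProfile& enter(const std::string& routine) {
    std::string parent = callstack.empty() ? "<root>" : callstack.back();
    callstack.push_back(routine);
    std::string key = parent + "( )=>" + routine + "( )";
    CallpathProfile& prof = callpath_map[key];  // created first time parent calls
    prof.calls += 1;
    return prof;
}

// On routine exit: pop the callstack (timing accumulation omitted for brevity).
void leave() { callstack.pop_back(); }
```

Calling enter("a"); enter("b"); leave(); leave(); yields a map entry named "a( )=>b( )", i.e. time spent in "b" when called by "a"; the same child called from a different parent gets a separate entry.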
