
TAU: A Framework for Parallel Performance Analysis



  1. TAU: A Framework for Parallel Performance Analysis Allen D. Malony malony@cs.uoregon.edu ParaDucks Research Group Computer & Information Science Department Computational Science Institute University of Oregon

  2. Outline • Goals and challenges • Targeted research areas • TAU (Tuning and Analysis Utilities) • computation model, architecture, toolkit framework • performance system technology • examples of TAU use • Tools associated with TAU • PDT (Program Database Toolkit) • distributed runtime monitoring • Future plans • Conclusions Ptools Annual Meeting

  3. Goals and Challenges Create robust (performance) technology for the analysis and tuning of parallel software and systems • Challenges • different scalable computing platforms • different programming languages and systems • common, portable framework for analysis • extensible, retargetable tool technology • complex set of requirements

  4. Targeted Research Areas • Performance analysis for scalable parallel systems targeting multiple programming and system levels and the mapping between levels • Program code analysis for multiple languages enabling development of new source-based tools • Integration and interoperation support for building analysis tool frameworks and environments • Runtime tool interaction for dynamic applications

  5. TAU (Tuning and Analysis Utilities) • Performance analysis framework for scalable parallel and distributed high-performance computing • Target a general parallel computation model • computer nodes • shared address space contexts • threads of execution • multi-level parallelism • Integrated toolkit for performance instrumentation, measurement, analysis, and visualization • portable performance profiling/tracing facility • open software approach

  6. TAU Architecture

  7. TAU Instrumentation • Flexible, multiple instrumentation mechanisms • source code • manual • automatic using PDT (tau_instrumentor) • object code • pre-instrumented libraries • statically linked: MPI wrapper library using the MPI Profiling Interface (libTauMpi.a) • dynamically linked: Java instrumentation using JVMPI and TAU shared object dynamically loaded in VM • executable code • dynamic instrumentation using DyninstAPI (tau_run)

  8. TAU Instrumentation (continued) • Common target measurement interface (TAU API) • C++ (object-based) instrumentation • macro-based, using constructor/destructor techniques • functions, classes, and templates • uniquely identify functions and templates • name and type signature (name registration) • static object creates performance entry • dynamic object receives static object pointer • runtime type identification for template instantiations • with C and Fortran instrumentation variants • Instrumentation optimization

  9. TAU Measurement • Performance information • high resolution timer library (real-time clock) • generalized software counter library • hardware performance counters • PCL (Performance Counter Library) (ZAM, Germany) • PAPI (Performance API) (UTK, Ptools) • consistent, portable API • Organization • node, context, thread levels • profile groups for collective events (runtime selective) • mapping between software levels

  10. TAU Measurement (continued) • Profiling • function-level, block-level, statement-level • supports user-defined events • TAU profile (function) database (PD) • function callstack • hardware counts instead of time • Tracing • profile-level events • interprocess communication events • timestamp synchronization • User-controlled configuration (configure)

  11. Timing of Multi-threaded Applications • Capture timing information on per-thread basis • Two alternatives • wall clock time • works on all systems • user-level measurement • OS-maintained CPU time (e.g., Solaris, Linux) • thread virtual time measurement • TAU supports both alternatives • CPUTIME module profiles user+system time % configure -pthread -CPUTIME

  12. TAU Analysis • Profile analysis • pprof • parallel profiler with text-based display • racy • graphical interface to pprof • Trace analysis • trace merging and clock adjustment (if necessary) • trace format conversion (ALOG, SDDF, PV, Vampir) • Vampir • trace analysis and visualization tool (Pallas)

  13. TAU Status • Usage • platforms • IBM SP, SGI Origin 2K, Intel Teraflop, Cray T3E, HP, Sun, Windows 95/98/NT, Alpha/Pentium Linux cluster • languages • C, C++, Fortran 77/90, HPF, pC++, HPC++, Java • communication libraries • MPI, PVM, Nexus, Tulip, ACLMPL • thread libraries • pthreads, Tulip, SMARTS, Java, Windows • compilers • KAI, PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray

  14. TAU Status (continued) • application libraries • Blitz++, A++/P++, ACLVIS, PAWS • application frameworks • POOMA, POOMA-2, MC++, Conejo, PaRP • other projects • ACPC, University of Vienna: Aurora • UC Berkeley (Culler): Millennium, sensitivity analysis • KAI and Pallas • TAU profiling and tracing toolkit (Version 2.7) • LANL ACL Fall 1999 CD-ROM distributed at SC'99 • Extensive 70-page TAU User’s Guide • http://www.acl.lanl.gov/tau

  15. TAU Examples • Instrumentation • C++ template profiling (PETE, Blitz++) • Java and MPI • PAPI • Measurement • mapping of asynchronous execution (SMARTS) • hybrid execution (Opus/HPF) • Analysis • SMARTS scheduling

  16. C++ Template Instrumentation (Blitz++, PETE) • High-level objects • array classes • templates • Optimizations • array processing • expressions (PETE) • Relate performance data to high-level statement • Complexity of template evaluation Array expressions

  17. Standard Template Instrumentation Difficulties • Instantiated templates result in mangled identifiers • Standard profiling techniques and tools are deficient • integrated with proprietary compilers • specific system platforms and programming models Uninterpretable routine names

  18. TAU Template Instrumentation and Profiling Profile of expression types Performance data presented with respect to high-level array expression types Graphical pprof

  19. Parallel Java Performance Instrumentation • Multi-language applications (Java, C, C++, Fortran) • Hybrid execution models (Java threads, MPI) • Java Virtual Machine Profiler Interface (JVMPI) • event instrumentation in JVM • profiler agent (libTAU.so) fields events • Java Native Interface (JNI) • invoke JVMPI control routines to control Java threads and access thread information • MPI profiling interface • “Performance Tools for Parallel Java Environments,” Java Workshop, ICS 2000, May 2000.

  20. TAU Java Instrumentation Architecture (architecture diagram: Java program, mpiJava package, TAU package, JNI, thread API, MPI profiling interface, event notification, TAU wrapper, native MPI library, JVMPI, TAU profile DB)

  21. mpiJava testcase (Parallel Java Game of Life) • 4 nodes, 28 threads • node/process grouping • thread message pairing • multi-level event grouping • Vampir display

  22. TAU and PAPI: NAS Parallel LU Benchmark • SGI Power Onyx (4 processors, R10K), MPI • Floating point operations • Cross-node full / routine profiles • Full FP profile for each node • Percentage profile

  23. TAU and PAPI: Matrix Multiply • Data cache miss comparison, “regular” vs. “strip-mining” execution • 512x512 • 32 KB (P), 2 MB (S) • Regular causes 4.5 times more misses

  24. Asynchronous Performance Analysis (SMARTS) • Scalable Multithreaded Asynchronous Runtime System • user-level threads, light-weight virtual processors • macro-dataflow, asynchronous execution interleaving iterates from data-parallel statements • integrated with POOMA II • TAU measurement of asynchronous parallel execution • utilized the TAU mapping API • associate iterate performance with data parallel statement • evaluate different scheduling policies • “SMARTS: Exploiting Temporal Locality & Parallelism through Vertical Execution,” ICS '99, August 1999.

  25. TAU Mapping of Asynchronous Execution Without mapping Two threads executing With mapping POOMA / SMARTS

  26. With and without mapping (Thread 0) Without mapping Thread 0 blocks waiting for iterates Iterates get lumped together With mapping Iterates distinguished

  27. With and without mapping (Thread 1) Without mapping Array initialization performance lumped Performance associated with ExpressionKernel object With mapping Iterate performance mapped to array statement Array initialization performance correctly separated

  28. TAU and Hybrid Execution in Opus/HPF • Fortran 77, Fortran 90, HPF • Vienna Fortran Compiling System • Opus / HPF • combined data (HPF) and task (Opus) parallelism • HPF compiler produces Fortran 90 modules • processes interoperate using Opus runtime system • producer / consumer model • MPI and pthreads • performance influence at multiple software levels

  29. TAU Profiling of Opus/HPF Application Multiple producers Multiple consumers Parallelism View

  30. TAU Profiling of SMARTS Iteration scheduling for two array expressions

  31. SMARTS Tracing (SOR) – Vampir Visualization • SCVE scheduler used in Red/Black SOR running on 32 processors of SGI Origin 2000 Asynchronous, overlapped parallelism

  32. Program Database Toolkit (PDT) • Program code analysis framework for developing source-based tools • High-level interface to source code information • Integrated toolkit for source code parsing, database creation, and database query • commercial grade front end parsers • portable IL analyzer, database format, and access API • open software approach for tool development • Target and integrate multiple source languages • http://www.acl.lanl.gov/pdtoolkit

  33. PDT Architecture and Tools

  34. PDT Summary • Program Database Toolkit (Version 1.1) • LANL ACL Fall 1999 CD-ROM distributed at SC'99 • EDG C++ Front End (Version 2.41.2) • C++ IL Analyzer and DUCTAPE library • tools: pdbmerge, pdbconv, pdbtree, pdbhtml • standard C++ system header files (KAI KCC 3.4c) • Fortran 90 IL Analyzer in progress • Automated TAU performance instrumentation • Program analysis support for SILOON (ACL CD) • “A Tool Framework for Static and Dynamic Analysis of Object-Oriented Software,” submitted to SC ’00.

  35. Distributed Monitoring Framework • Extend usability of TAU performance analysis • Access TAU performance data during execution • Framework model • each application context is a performance data server • monitor agent thread is created within each context • client processes attach to agents and request data • server thread synchronization for data consistency • pull mode of interaction • Distributed TAU performance data space • “A Runtime Monitoring Framework for the TAU Profiling System,” ISCOPE ’99, Nov. 1999.

  36. TAU Distributed Monitor Architecture TAU profile database • Each context has a monitor agent • Client in separate thread directs agent • Pull model of interaction • Initial HPC++ implementation

  37. Java Implementation of TAU Monitor • Motivations • more portable monitor middleware system (RMI) • more flexible and programmable server interface (JNI) • more robust client development (EJB, JDBC, Swing)

  38. Future Plans • TAU • platforms: SGI Itanium, Sun Starfire, IBM Linux, ... • languages: Java (Java Grande), OpenMP • instrument: automatic (F90, Java), Dyninst • measurement: hardware counter, support PAPI • displays: “beyond bargraphs” performance views • performance database and technology • support for multiple runs • open API for analysis tool development • PDT • complete F90 and Java IL Analyzer • source browsers: function, class, template • tools for aiding in data marshalling and translation

  39. Future Plans (continued) • Distributed monitoring framework • application and system monitoring • ACL Supermon and SGI Performance Co-Pilot • scalable SMP clusters and distributed systems • performance monitoring clients • Performance evaluation • numerical libraries and frameworks • scalable runtime systems • ASCI application developers (benchmark codes) • Investigate performance issues in Linux kernel • Investigate integration with CCA

  40. Conclusions • Complex parallel computing environments require robust program analysis tools • portable, cross-platform, multi-level, integrated • able to bridge and reuse existing technology • technology savvy • TAU offers a robust performance technology framework for complex parallel computing systems • flexible instrumentation and measurement • extendable profile and trace performance analysis • integration with other performance technology • Opportunities exist for open performance technology

  41. Open Performance Technology (OPT) • Performance problem is complex • diverse platforms, software development, applications • things evolve • History of incompatible and competing tools • instrumentation / measurement technology reinvention • lack of common, reusable software foundations • Need “value added” (open) approach • technology for high-level performance tool development • layered performance tool architecture • portable, flexible, programmable, integrative technology • Opportunity for Industry/National Labs/PACI sites
