TAU Evaluation Report

Adam Leko,

Hung-Hsun Su

UPC Group

HCS Research Laboratory

University of Florida

Color encoding key:

Blue: Information

Red: Negative note

Green: Positive note

Basic Information
  • Name: Tuning and Analysis Utilities (TAU)
  • Developer: University of Oregon
  • Current version:
    • TAU 2.14.4
    • Program database toolkit 3.3.1
  • Website: http://www.cs.uoregon.edu/research/paracomp/tau/tautools/
  • Contact:
TAU Overview
  • Performance tool suite that offers profiling and tracing of programs
    • Available instrumentation methods: source (manual), source (automatic), binary (DynInst)
    • Supported languages: C, C++, Fortran, Python, Java, SHMEM (TurboSHMEM and Cray SHMEM), OpenMP, MPI, Charm
    • Hardware counter support
  • Relies on existing toolkits and libraries for some functionality
    • PDToolkit and Opari for automatic source instrumentation
    • DynInst for runtime binary instrumentation
    • PCL and PAPI for hardware counter information
    • libvtf3, slog2sdk, and EPILOG for exporting trace files
Configuring & Installing TAU
  • TAU relies on several existing toolkits for efficient usage, but some of these toolkits are time-consuming to install
    • PDToolkit, PAPI, etc
  • Users must choose between modes at compile time using ./configure script
    • Profiling via -PROFILE, tracing via -TRACE
    • TAU must also be notified about the location of supported languages and compilers
      • -mpilib=/path/to/mpi/lib
      • -dyninst=/path/to/dyninst
      • -pdt=/path/to/pdt
      • Other supported languages/libraries handled in a similar manner
  • This results in a very flexible installation process
    • Users can easily install different configurations of TAU in their home directory
    • However, several configuration options are mutually exclusive, such as
      • Profiling and tracing
      • Using PAPI counters vs. gettimeofday or TSC counters
      • Profiling w/callpaths vs. profiling with extra statistics
    • Unfortunately, mutually exclusive nature of things proves to be annoying
      • Would be nice if TAU supported (for instance) tracing and profiling without compiling & installing twice!
      • Luckily, software compiles quickly on modern machines, so this is not fatal
      • However, TAU relies on several environment variables, which makes switching between installations cumbersome
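The "one mode per installation" constraint above can be sketched as a small helper that assembles the argument list for each build. The flag spellings are the ones quoted on this slide; the helper name `tau_configure_args` and the example paths are hypothetical, and only the profiling/tracing choice is modelled.

```python
# Sketch only: builds argument lists for TAU's ./configure script.
# Flag spellings come from the slide; the paths are placeholders.

MODES = {"profile": "-PROFILE", "trace": "-TRACE"}


def tau_configure_args(mode, mpilib=None, pdt=None, dyninst=None):
    """Return the ./configure argument list for one TAU installation."""
    if mode not in MODES:
        # Profiling and tracing are mutually exclusive, so exactly one is chosen.
        raise ValueError("mode must be 'profile' or 'trace'")
    args = ["./configure", MODES[mode]]
    if mpilib:
        args.append("-mpilib=" + mpilib)
    if pdt:
        args.append("-pdt=" + pdt)
    if dyninst:
        args.append("-dyninst=" + dyninst)
    return args


# Getting both profiles and traces means configuring and installing twice.
profile_cmd = tau_configure_args("profile", mpilib="/path/to/mpi/lib", pdt="/path/to/pdt")
trace_cmd = tau_configure_args("trace", mpilib="/path/to/mpi/lib", pdt="/path/to/pdt")
print(" ".join(profile_cmd))
print(" ".join(trace_cmd))
```

Each command line would be run in its own build, which is why switching between the resulting installations (and their environment variables) becomes the cumbersome part.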
The Many Faces of TAU
  • Two main methods of operation: profiling and tracing
  • Profile mode
    • Reports aggregate time spent in each function for each node/thread
    • Several profile recording options
      • Report min/max/std. dev of times using the -PROFILESTATS configure option
      • Attempt to compensate for profiling overhead (-COMPENSATE)
      • Record memory stats while profiling (-PROFILEMEMORY, -PROFILEHEADROOM)
      • Stop profiling after a certain function depth (-DEPTHLIMIT)
      • Record call trees in profile (-PROFILECALLPATH)
      • Record phase of program in profiles (-PROFILEPHASE, requires manual instrumentation of phases)
    • If instrumented code uses the TAU_INIT macros, can also pass arguments to compiled, instrumented program to restrict what is recorded at runtime
      • --profile main+func2
    • Metrics that can be recorded: wall clock time (via gettimeofday or several hardware-specific timers) or hardware counter metrics (via PAPI or PCL)
    • Data visualized using pprof (text-based) or paraprof (Java-based GUI)
    • Profile data can be exported to KOJAK’s cube viewer
    • Profile data can be imported from Vampir VTF traces
The Many Faces of TAU (2)
  • Trace mode
    • Records timestamps for function entry/exit points
      • Or arbitrary code section points via manual instrumentation
    • Also records messages sent/received for MPI programs
    • No trace visualizer, but can export to
      • ALOG: Upshot/nupshot
      • Paraver’s trace format
      • SLOG-2: Jumpshot
      • VTF: Vampir/Intel Trace Analyzer 5
      • SDDF: Format used by Pablo/SvPablo
      • EPILOG: KOJAK’s trace format
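The difference between the two modes can be illustrated by collapsing a trace (timestamped entry/exit events) into a profile (aggregate time per function). This is a minimal sketch and not TAU's implementation: the event tuples, the function name `profile_from_trace`, and the timings are all made up, and recursion is not handled.

```python
from collections import defaultdict

# Hypothetical trace for one thread: (timestamp in usec, event kind, function name).
trace = [
    (0, "enter", "main"),
    (10, "enter", "MPI_Init"),
    (60, "exit", "MPI_Init"),
    (70, "enter", "MPI_Recv"),
    (90, "exit", "MPI_Recv"),
    (100, "exit", "main"),
]


def profile_from_trace(events):
    """Collapse entry/exit timestamps into per-function call counts and times."""
    stack = []  # each frame: [name, entry timestamp, time spent in callees]
    profile = defaultdict(lambda: {"calls": 0, "inclusive": 0, "exclusive": 0})
    for timestamp, kind, name in events:
        if kind == "enter":
            stack.append([name, timestamp, 0])
            continue
        top_name, start, child_time = stack.pop()
        if top_name != name:
            raise ValueError("unbalanced trace at " + name)
        inclusive = timestamp - start
        row = profile[name]
        row["calls"] += 1
        row["inclusive"] += inclusive
        row["exclusive"] += inclusive - child_time
        if stack:
            stack[-1][2] += inclusive  # charge this call to the caller's child time
    return dict(profile)


profile = profile_from_trace(trace)
for name, row in profile.items():
    print(name, row)
```

The profile keeps only totals, so the ordering of events (and therefore communication patterns) is lost; that ordering is exactly what a trace retains.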
TAU Instrumentation: Profile Mode
  • Source-level instrumentation
    • tau_instrument (which requires PDToolkit) is used to produce an instrumented source code for C, C++, and Fortran files
    • For OpenMP code, TAU can use OPARI (from KOJAK)
    • Users may insert instrumentation using TAU’s simple API (TAU_PROFILE_START, TAU_PROFILE_STOP)
    • When compiling, must use stub Makefiles which define compilation macros like CFLAGS, LDFLAGS, etc.
      • This can complicate the compile & link cycle greatly, especially if fully automatic source instrumentation is desired
    • Selective instrumentation is supported through a flag to tau_instrument
      • Give a file containing which functions to include or exclude from instrumentation
      • Can use tau_reduce in conjunction with existing profiles to exclude functions matching certain criteria, like
        • numcalls > 10000 & usecs/call < 2
  • Binary-level instrumentation
    • Based on DynInst, considered “experimental” according to documentation
    • Use tau_run wrapper script with instrumentation file in same format as selective instrumentation file
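The tau_reduce rule quoted above (numcalls > 10000 & usecs/call < 2) amounts to a filter over an existing profile. The sketch below mimics that filter on made-up data; it is not tau_reduce itself. The row layout and the name `functions_to_exclude` are hypothetical, and it assumes usecs/call is computed from exclusive time, which the slides do not state.

```python
# Hypothetical profile summary: (function name, number of calls, total exclusive usec).
profile_rows = [
    ("main", 1, 5_000_000),
    ("tiny_helper", 250_000, 300_000),
    ("MPI_Send", 704, 543_000),
    ("rarely_called_tiny", 50, 40),
]


def functions_to_exclude(rows, min_calls=10000, max_usecs_per_call=2):
    """Select functions where numcalls > min_calls and usecs/call < max_usecs_per_call."""
    excluded = []
    for name, numcalls, total_usecs in rows:
        if numcalls == 0:
            continue  # never called, so there is no per-call cost to evaluate
        usecs_per_call = total_usecs / numcalls
        if numcalls > min_calls and usecs_per_call < max_usecs_per_call:
            excluded.append(name)
    return excluded


exclude_list = functions_to_exclude(profile_rows)
print(exclude_list)
```

Only functions that are both very cheap and called very often are dropped; a cheap function that is rarely called contributes little overhead and stays instrumented.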
TAU Instrumentation: Trace Mode
  • Source-level instrumentation
    • Same procedure as in profile mode
  • Binary instrumentation
    • Can link against MPI wrapper library (only re-linking necessary)
    • Runtime instrumentation for trace mode is not supported using DynInst
Instrumentation Test Suite: Problems
  • Problem with using selective instrumentation + MPI wrapper library + PAPI metrics
    • Only instrumenting main in CAMEL caused several floating point instructions to be attributed to MPI_Send and MPI_Recv instead of main
    • For timing measurements and overhead measurements, used wallclock time with the low-overhead -LINUXTIMERS option
  • Some code had to be modified before feeding it through PDToolkit’s cparse
    • cparse uses the Edison Design Group’s parser, which is stricter about some things than other compilers
    • ANSI C/standard Fortran code poses no problems, though
  • NAS NPB LU benchmark (NPBv3.1-MPI) would not run with TAU libraries
    • Segfaults, “signal 11s” when using either LAM or MPICH with only MPI wrapper libraries (profiling & tracing)
    • Modified, updated version (3.2) of LU comes with TAU
      • Had problems compiling and running this
    • Gave TAU the benefit of the doubt for the rest of the evaluations
      • Guessed at what TAU profile would tell us had it been working with LU for bottleneck tests
      • LU timing overheads omitted from overhead measurements
Instrumentation Overhead: Notes
  • Performed automatic instrumentation of CAMEL using tau_instrument
    • Like KOJAK, program execution time was several orders of magnitude slower
    • This is likely due to the use of very small functions which normally get inlined by the compiler
    • For profile measurements on the following slides, only main was instrumented
      • Under this scenario, profiling and tracing overhead was almost nonexistent (<1%)
  • Instrumentation points chosen for overhead measurements
    • Profiling
      • CAMEL: all MPI calls, main enter + exit
      • PPerfMark suite: all MPI calls, all function calls
      • Used -PROFILECALLPATH configuration option
        • Using other profile flavors (without call paths, with extra stats) made a negligible difference on overall profile overhead
    • Tracing
      • CAMEL: all MPI calls
      • PPerfMark suite: all MPI calls
      • Similar to what we have done for other tools
  • Benchmarks marked with * had high variability in runtimes
Instrumentation Overhead: Notes (2)
  • Used LAM for all measurements
    • Some benchmarks with high overhead (small-messages, wrong-way, ping-pong) had smaller overhead using MPICH (figures below are MPICH vs. LAM)
      • Small messages: 54.2% vs. 483.316%
      • Wrong way: 24.5% vs. 28.573%
      • Ping-pong: 51.5% vs. 56.259%
    • Probably due to LAM running faster (especially on small-messages) and execution time being limited by I/O time for writing trace file
      • Same I/O time, smaller execution time -> higher % overhead
  • In general, overhead for profiling and tracing extremely low except for a few cases
    • High profile overhead for programs with small functions that get called a lot
      • small-messages, wrong-way, ping-pong, CAMEL with everything instrumented
    • High trace overhead for programs with large traces generated very quickly
      • small-messages, wrong-way, ping-pong
    • tau_reduce provides a nice way to reduce instrumentation overhead, although an initial profile must first be gathered
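The "same I/O time, smaller execution time -> higher % overhead" argument can be checked with simple arithmetic. The numbers below are invented for illustration and are not measurements from this evaluation; `percent_overhead` is a hypothetical helper.

```python
def percent_overhead(base_seconds, instrumented_seconds):
    """Relative slowdown of the instrumented run, as a percentage of the base run."""
    return 100.0 * (instrumented_seconds - base_seconds) / base_seconds


# Assume writing the trace file costs a fixed 5 s regardless of the MPI library.
trace_io_seconds = 5.0
slow_base = 20.0  # uninstrumented runtime with the slower MPI implementation
fast_base = 10.0  # uninstrumented runtime with the faster MPI implementation

slow_overhead = percent_overhead(slow_base, slow_base + trace_io_seconds)
fast_overhead = percent_overhead(fast_base, fast_base + trace_io_seconds)
print(slow_overhead, fast_overhead)  # 25.0 50.0
```

The absolute cost is identical in both cases; only the denominator changes, so the faster baseline reports the larger percentage.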
Visualizations: pprof
  • Gives text-based dump of profile files, similar to gprof/prof output
  • Example (partial) output:

NumSamples  MaxValue  MinValue  MeanValue  Std. Dev.  Event Name
         4      2016         4        516      866.2  Message size received from all nodes
        86        28        28         28          0  Message size sent to all nodes

%Time  Exclusive   Inclusive  #Call  #Subrs  Inclusive  Name
            msec  total msec                 usec/call
100.0   5:44.295    5:45.407      8    1440   43175935  main()
  0.2        541         553      8     232      69161  MPI_Init()
  0.2        541         553      8     232      69161  main() => MPI_Init()
  0.2        543         543    704       0        772  MPI_Recv()
  0.2        543         543    704       0        772  main() => MPI_Recv()
  0.0          9           9      8       0       1136  MPI_Barrier()
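The relationship between the columns in the output above can be sketched as follows: the last numeric column is roughly the inclusive time divided by the number of calls. The parsing rule is an assumption (plain numbers are taken as milliseconds and colon-separated values as minutes:seconds.milliseconds), and both function names are hypothetical.

```python
def pprof_time_to_msec(field):
    """Convert a time field such as '553' or '5:45.407' to whole milliseconds."""
    if ":" not in field:
        return int(field.replace(",", ""))
    *higher, seconds = field.split(":")
    whole, _, frac = seconds.partition(".")
    total_seconds = int(whole)
    multiplier = 60
    for unit in reversed(higher):  # minutes, then hours if present
        total_seconds += int(unit) * multiplier
        multiplier *= 60
    return total_seconds * 1000 + int(frac.ljust(3, "0")[:3])


def inclusive_usec_per_call(inclusive_field, calls):
    """Average inclusive time per call in microseconds."""
    return pprof_time_to_msec(inclusive_field) * 1000 / calls


main_msec = pprof_time_to_msec("5:45.407")
main_per_call = inclusive_usec_per_call("5:45.407", 8)
init_per_call = inclusive_usec_per_call("553", 8)
print(main_msec, main_per_call, init_per_call)
```

The computed values land close to, but not exactly on, the usec/call figures in the listing, which is consistent with the displayed millisecond totals being rounded.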

Visualizations: paraprof
  • paraprof provides visual representations of same data given by pprof
  • Used to be a Tcl/Tk application known as “racy”
    • Racy has been deprecated, but is still included with TAU for historical reasons
  • Java application with three main views
    • Main profile view
    • Histograms (next slides)
    • Three-dimensional visualization (next slides)
  • Main profile view (right top)
    • “Function ledger” maps colors to function names (right, bottom left)
    • Overall time for each function displayed as a stacked bar chart
    • Can click on each function to get detailed information (right, bottom right)
  • No line-level source code correlation
    • Can infer this information indirectly if call paths are used

[Figure: paraprof screenshots showing the main profile view, the function ledger, and the function details view]

Visualizations: paraprof (2)
  • paraprof can also show histogram views for each function of the main profile view (right)
  • Simply show histogram of aggregate time for a function across all threads
    • Histogram to right shows that most threads spent around 75.8 seconds (midpoint between min and max) in MPI_Barrier
Visualizations: paraprof (3)
  • paraprof also can display three-dimensional displays of profile data
  • Axes for bar and triangle mesh plots
    • Time spent in each function (height)
    • Which function (width)
    • Which node (depth)
  • Scatter plot lets you pick axes
  • Plots support transparency, rotation, and highlighting a particular function or node
  • Surprisingly responsive for a Java application!
Bottleneck Identification Test Suite
  • Testing metric: what did pprof/paraprof tell us from wallclock time profiles?
    • Since no built-in trace visualizer, we ignored what could be done with other trace tools
  • Program correctness not affected by instrumentation
    • Except for our version of LU
  • CAMEL
    • Showed work evenly distributed among nodes
    • When full tracing used, can easily show which functions take the most wall clock time
  • LU
    • Could not run, got segfaults using MPICH or LAM
    • Even if it worked, it would be very difficult/impossible to garner communication patterns from profile views
  • Big messages: PASSED
    • Profile showed most of application time dominated by MPI calls to send and receive
  • Diffuse procedure: TOSS-UP
    • Profile showed most time taken by MPI_Barrier calls
    • However, profile also showed the bottleneck procedure (which is dispersed across all nodes) taking up a negligible amount of overall time
    • Really need a trace view to see diffuse behavior of program
  • Hot procedure: PASSED
    • Profile clearly shows that one function is responsible for most execution time
Bottleneck Identification Test Suite (2)
  • Intensive server: PASSED
    • Profile showed most time spent in MPI_Recv for all nodes except first node
    • Profile also illustrated most time for first node spent in waste_time
  • Ping-pong: PASSED
    • Easy to see from profile that most time is being spent in MPI_Send and MPI_Recv
    • pprof and paraprof also showed a large number of MPI calls
  • Random barrier: TOSS-UP
    • Profile showed most time being spent in MPI_Barrier
    • However, random nature of barrier not shown by profile
    • Trace view is necessary to see random barrier behavior
  • Small messages: PASSED
    • Profile showed one process spending most time in MPI_Send and the other process in MPI_Recv
  • System time: FAILED
    • No built-in way to separate wall clock time into system time vs. user time
    • PAPI metrics can’t record system time vs. user time either
  • Wrong way: FAILED
    • Impossible to see communication behavior without a trace
TAU General Comments
  • Good things
    • Supports profiling & tracing
    • Very portable
    • Wide range of software support
      • Several programming models & libraries supported
    • Visualization tools seem very stable
    • Good support for exporting data to other tools
  • Things that could use improvement
    • Dependence on other software for basic functionality (instrumentation via PDToolkit or DynInst) makes installation difficult
    • Source code correlation could be better
      • Only at the function or function call level (with call paths)
    • Export is nice, but lots of things are easier to do directly in other tools
      • For example, mpicc -mpilog to get a trace for Jumpshot instead of cparse, tau_instrument, wrapper Makefiles, …
      • TAU does add automatic instrumentation for profiling functions, which is an added benefit
    • Three-dimensional visualizations are nice but “Cube” viewer from KOJAK is easier to use and displays data in a very concise manner
      • Text is also hard to read on three-dimensional views for function names
    • Some interoperability features (export to SLOG-2 and ALOG) do not work well in version we tested
  • TAU could potentially serve as a base for our UPC and SHMEM performance tool
TAU: Adding UPC & SHMEM
  • SHMEM
    • Not much extra work needed
    • Have already created weak binding patches for GPSHMEM & created a wrapper library that calls the appropriate TAU functions
  • UPC
    • If we have source code instrumentation, then just put in TAU* instrumentation calls in the appropriate places
    • If we do binary instrumentation, we’ll probably have to make major modifications to DynInst
    • In any case, once the UPC instrumentation problem is solved, adding support for UPC into TAU will not be too hard
      • However, how to instrument UPC programs while retaining low overhead?
      • Also, how to extend TAU to support more advanced analyses?
  • Support for profiles and traces a nice bonus
Evaluation (1)
  • Available metrics: 4/5
    • Supports recording execution time (broken down into call trees)
    • Supports several methods of gathering profile data
    • Supports all PAPI metrics for profiles
  • Cost: 5/5
    • Free!
  • Documentation quality: 3.5/5
    • User’s manual very good, but out of date
    • For example, three-dimensional visualizations not covered in manual
  • Extensibility: 4/5
    • Open source, uses documented APIs
    • Can add support for new languages using source instrumentation
  • Filtering and aggregation: 2.5/5
    • Filtering & aggregation available through profile view
    • No advanced filter or custom aggregation methods built in for traces
Evaluation (2)
  • Hardware support: 5/5
    • Many platforms supported: 64-bit Linux (Opteron, Itanium, Alpha, SPARC); IBM SP2 (AIX); IBM BlueGene/L; AlphaServer (Tru64); SPARC-based clusters (Solaris); SGI (IRIX 6.x) systems, including Indy, Power Challenge, Onyx, Onyx2, Origin 200, 2000, 3000 series; NEC SX-5; Cray X1, T3E; Apple OS X; HP RISC systems (HP-UX)
  • Heterogeneity support: 0/5 (not supported)
  • Installation: 2.5/5
    • As simple as ./configure with options, then make install
    • However, dependence on other software for source or binary instrumentation makes installation time-consuming
  • Interoperability: 5/5
    • Profile files use simple ASCII format; trace files use documented binary format
    • Can export to VAMPIR, Jumpshot/upshot (ALOG & SLOG-2), CUBE, SDDF, Paraver
  • Learning curve: 2.5/5
    • Learning how to use the different Makefile wrappers and command-line programs takes a while
    • After a short period, instrumentation & tool usage relatively easy
Evaluation (3)
  • Manual overhead: 4/5
    • Automatic instrumentation of MPI calls on all platforms
    • Automatic instrumentation of all functions or a selected group of functions
    • Call path support gives almost the same information as instrumenting call sites
    • MPI and OpenMP instrumentation support
  • Measurement accuracy: 5/5
    • CAMEL overhead < 1% for profiling and tracing when a few functions were instrumented
    • Overall, accuracy pretty good except for a few cases
  • Multiple executions: 3/5
    • Can relate profile metrics between runs in paraprof
    • Can store performance data in DBMS (PerfDB)
      • Seems like PerfDB is in a preliminary state, though
  • Multiple analyses & views: 4/5
    • Both profiling and tracing are supported (although no built-in trace viewer)
    • Profile view has stacked bar charts, “regular” views, three-dimensional views, and histograms
Evaluation (4)
  • Performance bottleneck identification: 3.5/5
    • No automatic bottleneck identification
    • Profile viewer helpful for identifying methods that take most time
    • Lack of built-in trace viewer makes identification of some bottlenecks impossible, but trace export means TAU can be combined with several other viewers to cover just about anything
  • Profiling/tracing support: 4/5
    • Tracing & profiling supported
    • Default trace file format size reasonable but not most compact
  • Response time: 3/5
    • Loading profiles after run almost instantaneous using paraprof viewer
    • Exporting traces to other tools time consuming (have to run tau_merge, tau_convert, etc; a few extra disk I/Os)
  • Software support: 5/5
    • Supports OpenMP, MPI, and several other programming models
    • A wide range of compilers are supported
    • Can support linking against any library, but does not instrument library functions
  • Source code correlation: 2/5
    • Supported down to the function and function call site level (when collecting call paths is enabled)
  • Searching: 0/5 (not supported)
Evaluation (5)
  • System stability: 3/5
    • Software is generally stable
    • Bugs encountered:
      • Segfaults on instrumented version of our LU code
      • SLOG-2 export seems to give Jumpshot-4 some trouble (several “unsupported event” messages on a few exported traces)
      • Exporting to ALOG format puts stray “: %d” lines in ALOG file
  • Technical support: 5/5
    • Good response from our contact (Sameer), most emails answered within 48 hours with useful information