TAU Evaluation Report

Adam Leko,

Hung-Hsun Su

UPC Group

HCS Research Laboratory

University of Florida

Color encoding key:

Blue: Information

Red: Negative note

Green: Positive note

Basic Information

  • Name: Tuning and Analysis Utilities (TAU)

  • Developer: University of Oregon

  • Current version:

    • TAU 2.14.4

    • Program database toolkit 3.3.1

  • Website: http://www.cs.uoregon.edu/research/paracomp/tau/tautools/

  • Contact:

    • Sameer Shende: sameer@cs.uoregon.edu

TAU Overview

  • Performance tool suite that offers profiling and tracing of programs

    • Available instrumentation methods: source (manual), source (automatic), binary (DynInst)

    • Supported languages: C, C++, Fortran, Python, Java, SHMEM (TurboSHMEM and Cray SHMEM), OpenMP, MPI, Charm

    • Hardware counter support

  • Relies on existing toolkits and libraries for some functionality

    • PDToolkit and Opari for automatic source instrumentation

    • DynInst for runtime binary instrumentation

    • PCL and PAPI for hardware counter information

    • libvtf3, slog2sdk, and EPILOG for exporting trace files

Configuring & Installing TAU

  • TAU relies on several existing toolkits for efficient usage, but some of these toolkits are time-consuming to install

    • PDToolkit, PAPI, etc

  • Users must choose between modes at compile time using the ./configure script

    • Profiling via -PROFILE, tracing via -TRACE

    • TAU must also be notified about the location of supported languages and compilers

      • -mpilib=/path/to/mpi/lib

      • -dyninst=/path/to/dyninst

      • -pdt=/path/to/pdt

      • Other supported languages/libraries handled in a similar manner

  • This results in a very flexible installation process

    • Users can easily install different configurations of TAU in their home directory

    • However, several configuration options are mutually exclusive, such as

      • Profiling and tracing

      • Using PAPI counters vs. gettimeofday or TSC counters

      • Profiling w/callpaths vs. profiling with extra statistics

    • Unfortunately, the mutually exclusive nature of these options proves to be annoying

      • Would be nice if TAU supported (for instance) tracing and profiling without compiling & installing twice!

      • Luckily, software compiles quickly on modern machines, so this is not fatal

      • However, TAU relies on several environment variables, which makes switching between installations cumbersome
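
Since profiling and tracing need separate builds, one way to keep the variants apart is to generate one configure command per mode. The sketch below is only an illustration: it reuses the flags quoted on this slide (-PROFILE, -TRACE, -mpilib=, -pdt=) with the slide's placeholder paths. The helper name configure_command and the mode labels are hypothetical, and no other TAU options are assumed.

```python
# Hypothetical helper: builds one ./configure command line per TAU variant.
# Only flags quoted on the slide are used; paths are the slide's placeholders.

MODE_FLAGS = {"profile": "-PROFILE", "trace": "-TRACE"}


def configure_command(mode, mpilib=None, pdt=None):
    """Return the argument list for a single-mode TAU build."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    args = ["./configure", MODE_FLAGS[mode]]
    if mpilib is not None:
        args.append(f"-mpilib={mpilib}")
    if pdt is not None:
        args.append(f"-pdt={pdt}")
    return args


if __name__ == "__main__":
    # One command per mutually exclusive mode.
    for mode in ("profile", "trace"):
        cmd = configure_command(mode, mpilib="/path/to/mpi/lib", pdt="/path/to/pdt")
        print(" ".join(cmd))
```

Keeping the two command lines side by side makes it explicit that each mode is a separate build and install.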

The Many Faces of TAU

  • Two main methods of operation: profiling and tracing

  • Profile mode

    • Reports aggregate time spent in each function for each node/thread

    • Several profile recording options

      • Report min/max/std. dev of times using the -PROFILESTATS configure option

      • Attempt to compensate for profiling overhead (-COMPENSATE)

      • Record memory stats while profiling (-PROFILEMEMORY, -PROFILEHEADROOM)

      • Stop profiling after a certain function depth (-DEPTHLIMIT)

      • Record call trees in profile (-PROFILECALLPATH)

      • Record phase of program in profiles (-PROFILEPHASE, requires manual instrumentation of phases)

    • If instrumented code uses the TAU_INIT macros, can also pass arguments to compiled, instrumented program to restrict what is recorded at runtime

      • --profile main+func2

    • Metrics that can be recorded: wall clock time (via gettimeofday or several hardware-specific timers) or hardware counter metrics (via PAPI or PCL)

    • Data visualized using pprof (text-based) or paraprof (Java-based GUI)

    • Profile data can be exported to KOJAK’s cube viewer

    • Profile data can be imported from Vampir VTF traces

The Many Faces of TAU (2)

  • Trace mode

    • Records timestamps for function entry/exit points

      • Or arbitrary code section points via manual instrumentation

    • Also records messages sent/received for MPI programs

    • No built-in trace visualizer, but can export to

      • ALOG: Upshot/nupshot

      • Paraver’s trace format

      • SLOG-2: Jumpshot

      • VTF: Vampir/Intel Trace Analyzer 5

      • SDDF: Format used by Pablo/SvPablo

      • EPILOG: KOJAK’s trace format
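
The difference between the two modes can be made concrete: a trace keeps every timestamped entry/exit event, while a profile keeps only per-function totals. The sketch below reduces a single thread's event list to inclusive and exclusive times. It is a generic illustration, not TAU's implementation; the event tuple layout and the name profile_from_trace are assumptions made for the example.

```python
from collections import defaultdict


def profile_from_trace(events):
    """Reduce (timestamp, kind, name) events from one thread to a profile.

    kind is "enter" or "exit". Returns a dict mapping each function name to
    its call count, inclusive time, and exclusive time (inclusive minus time
    spent in callees). Recursive calls would be double counted in the
    inclusive total; this sketch does not handle that case.
    """
    profile = defaultdict(lambda: {"calls": 0, "inclusive": 0.0, "exclusive": 0.0})
    stack = []  # each entry: [name, enter_time, time_spent_in_callees]
    for timestamp, kind, name in events:
        if kind == "enter":
            stack.append([name, timestamp, 0.0])
        elif kind == "exit":
            top_name, start, child_time = stack.pop()
            if top_name != name:
                raise ValueError(f"unbalanced exit for {name!r}")
            elapsed = timestamp - start
            entry = profile[name]
            entry["calls"] += 1
            entry["inclusive"] += elapsed
            entry["exclusive"] += elapsed - child_time
            if stack:
                # Charge this call's duration to the caller's callee time.
                stack[-1][2] += elapsed
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return dict(profile)


if __name__ == "__main__":
    trace = [(0, "enter", "main"), (1, "enter", "f"), (4, "exit", "f"), (10, "exit", "main")]
    for name, stats in profile_from_trace(trace).items():
        print(name, stats)
```

The reduction is one-way: once events are folded into totals, ordering information such as communication patterns cannot be recovered from the profile.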

TAU Instrumentation: Profile Mode

  • Source-level instrumentation

    • tau_instrument (which requires PDToolkit) is used to produce instrumented source code for C, C++, and Fortran files

    • For OpenMP code, TAU can use OPARI (from KOJAK)

    • Users may insert instrumentation using TAU’s simple API (TAU_PROFILE_START, TAU_PROFILE_STOP)

    • When compiling, must use stub Makefiles which define compilation macros like CFLAGS, LDFLAGS, etc.

      • This can complicate the compile & link cycle greatly, especially if fully automatic source instrumentation is desired

    • Selective instrumentation is supported through a flag to tau_instrument

      • Give a file containing which functions to include or exclude from instrumentation

      • Can use tau_reduce in conjunction with existing profiles to exclude functions matching certain criteria, like

        • numcalls > 10000 & usecs/call < 2

  • Binary-level instrumentation

    • Based on DynInst, considered “experimental” according to documentation

    • Use tau_run wrapper script with instrumentation file in same format as selective instrumentation file
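
The exclusion rule quoted above (numcalls > 10000 & usecs/call < 2) amounts to a simple filter over an existing profile. The sketch below applies the same thresholds to a dictionary of per-function call counts and total times. It does not reproduce tau_reduce's rule syntax or file formats; the function name, the data layout, and the sample numbers are all hypothetical.

```python
def functions_to_exclude(profile, min_calls=10000, max_usecs_per_call=2.0):
    """Return names of functions that are called often but are very short.

    profile maps a function name to (numcalls, total_usecs). A function is
    selected when numcalls > min_calls and its mean time per call is below
    max_usecs_per_call, mirroring the rule quoted on the slide.
    """
    excluded = []
    for name, (numcalls, total_usecs) in profile.items():
        if numcalls == 0:
            continue  # never called, nothing to decide
        if numcalls > min_calls and total_usecs / numcalls < max_usecs_per_call:
            excluded.append(name)
    return sorted(excluded)


if __name__ == "__main__":
    # Hypothetical profile data: (number of calls, total microseconds).
    sample = {
        "main": (1, 5_000_000.0),
        "tiny_helper": (50_000, 40_000.0),
        "busy_kernel": (20_000, 1_000_000.0),
    }
    print(functions_to_exclude(sample))
```

Functions selected this way are the ones whose instrumentation cost is most likely to dominate their own run time.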

TAU Instrumentation: Trace Mode

  • Source-level instrumentation

    • Same procedure as in profile mode

  • Binary instrumentation

    • Can link against MPI wrapper library (only re-linking necessary)

    • Runtime instrumentation for trace mode is not supported using DynInst

Instrumentation Test Suite: Problems

  • Problem with using selective instrumentation + MPI wrapper library + PAPI metrics

    • Only instrumenting main in CAMEL caused several floating point instructions to be attributed to MPI_Send and MPI_Recv instead of main

    • For timing measurements and overhead measurements, used wallclock time with the low-overhead -LINUXTIMERS option

  • Some code had to be modified before feeding it through PDToolkit’s cparse

    • cparse uses the Edison Design Group’s parser, which is stricter about some things than other compilers

    • ANSI C/standard Fortran code poses no problems, though

  • NAS NPB LU benchmark (NPBv3.1-MPI) would not run with TAU libraries

    • Segfaults, “signal 11s” when using either LAM or MPICH with only MPI wrapper libraries (profiling & tracing)

    • Modified, updated version (3.2) of LU comes with TAU

      • Had problems compiling and running this

    • Gave TAU the benefit of the doubt for the rest of the evaluations

      • Guessed at what TAU profile would tell us had it been working with LU for bottleneck tests

      • LU timing overheads omitted from overhead measurements

Instrumentation Overhead: Notes

  • Performed automatic instrumentation of CAMEL using tau_instrument

    • As with KOJAK, program execution was several orders of magnitude slower

    • This is likely due to the use of very small functions which normally get inlined by the compiler

    • For profile measurements on the following slides, only main was instrumented

      • Under this scenario, profiling and tracing overhead was almost nonexistent (<1%)

  • Instrumentation points chosen for overhead measurements

    • Profiling

      • CAMEL: all MPI calls, main enter + exit

      • PPerfMark suite: all MPI calls, all function calls

      • Used -PROFILECALLPATH configuration option

        • Using other profile flavors (without call paths, with extra stats) made a negligible difference on overall profile overhead

    • Tracing

      • CAMEL: all MPI calls

      • PPerfMark suite: all MPI calls

      • Similar to what we have done for other tools

  • Benchmarks marked with * had high variability in runtimes

Instrumentation Overhead: Notes (2)

  • Used LAM for all measurements

    • Some benchmarks with high overhead (small-messages, wrong-way, ping-pong) had slightly smaller overhead using MPICH (figures below are MPICH vs. LAM)

      • Small messages: 54.2% vs. 483.316%

      • Wrong way: 24.5% vs. 28.573%

      • Ping-pong: 51.5% vs. 56.259%

    • Probably due to LAM running faster (especially on small-messages) and execution time being limited by I/O time for writing trace file

      • Same I/O time, smaller execution time -> higher % overhead

  • In general, overhead for profiling and tracing extremely low except for a few cases

    • High profile overhead for programs with small functions that get called a lot

      • small-messages, wrong-way, ping-pong, CAMEL with everything instrumented

    • High trace overhead for programs with large traces generated very quickly

      • small-messages, wrong-way, ping-pong

    • tau_reduce provides a nice way to help reduce instrumentation overhead, although an initial profile must be first gathered
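
The "same I/O time, smaller execution time -> higher % overhead" argument above is plain arithmetic. The sketch below works through it with made-up numbers: a fixed trace-writing cost is added to two baseline run times, and the percentage overhead is computed for each. The function name and all timings are hypothetical and are not measurements from this evaluation.

```python
def percent_overhead(base_seconds, instrumented_seconds):
    """Overhead of the instrumented run relative to the uninstrumented run."""
    return 100.0 * (instrumented_seconds - base_seconds) / base_seconds


if __name__ == "__main__":
    trace_io_seconds = 5.0  # hypothetical fixed cost of writing the trace file
    for label, base in (("faster MPI library", 10.0), ("slower MPI library", 20.0)):
        overhead = percent_overhead(base, base + trace_io_seconds)
        print(f"{label}: base {base:.1f} s, overhead {overhead:.1f}%")
```

With an identical absolute cost, the faster baseline reports the larger percentage, which is the effect the slide attributes to LAM.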

Visualizations: pprof

  • Gives text-based dump of profile files, similar to gprof/prof output

  • Example (partial) output:



NumSamples  MaxValue  MinValue  MeanValue  Std. Dev.  Event Name
         4      2016         4        516      866.2  Message size received from all nodes
        86        28        28         28          0  Message size sent to all nodes

%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive  Name
              msec   total msec                          usec/call
100.0     5:44.295     5:45.407           8        1440   43175935  main()
  0.2          541          553           8         232      69161  MPI_Init()
  0.2          541          553           8         232      69161  main() => MPI_Init()
  0.2          543          543         704           0        772  MPI_Recv()
  0.2          543          543         704           0        772  main() => MPI_Recv()
  0.0            9            9           8           0       1136  MPI_Barrier()
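
The last numeric column in the listing relates to the others in a simple way: inclusive usec/call is the inclusive total divided by the number of calls. The sketch below recomputes it from the printed fields. It assumes that plain numbers are milliseconds and that colon-separated values read as minutes:seconds (with an optional leading hours field); the function names are hypothetical and this is not part of pprof.

```python
def pprof_time_to_msec(field):
    """Convert a printed time field to milliseconds.

    Assumption: a plain number is already in msec, and a colon-separated
    value reads as [hours:]minutes:seconds.
    """
    if ":" not in field:
        return float(field)
    seconds = 0.0
    for part in field.split(":"):
        seconds = seconds * 60.0 + float(part)
    return seconds * 1000.0


def usec_per_call(inclusive_field, calls):
    """Inclusive microseconds per call from a printed inclusive total."""
    return pprof_time_to_msec(inclusive_field) * 1000.0 / calls


if __name__ == "__main__":
    rows = [("main()", "5:45.407", 8), ("MPI_Init()", "553", 8), ("MPI_Recv()", "543", 704)]
    for name, inclusive, calls in rows:
        print(f"{name:12s} {usec_per_call(inclusive, calls):14.0f} usec/call")
```

The recomputed values land close to the printed column but not exactly on it, presumably because the printed millisecond totals have lost their fractional part.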

Visualizations: paraprof

  • paraprof provides visual representations of same data given by pprof

  • Used to be a Tcl/Tk application known as “racy”

    • Racy has been deprecated, but is still included with TAU for historical reasons

  • Java application with three main views

    • Main profile view

    • Histograms (next slides)

    • Three-dimensional visualization (next slides)

  • Main profile view (right top)

    • “Function ledger” maps colors to function names (right, bottom left)

    • Overall time for each function displayed as a stacked bar chart

    • Can click on each function to get detailed information (right, bottom right)

  • No line-level source code correlation

    • Can infer this information indirectly if call paths are used

[Figure: paraprof main profile view (top), function ledger (bottom left), and function details view (bottom right)]

Visualizations: paraprof (2)

  • paraprof can also show histogram views for each function of the main profile view (right)

  • Simply shows a histogram of aggregate time for a function across all threads

    • Histogram to right shows that most threads spent around 75.8 seconds (midpoint between min and max) in MPI_Barrier
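
A histogram of this kind can be sketched by binning one function's per-thread times into equal-width intervals between the minimum and maximum. The code below is a generic illustration with invented per-thread times. It does not claim to match paraprof's binning, and the function name and bin count are assumptions.

```python
def histogram(values, num_bins=10):
    """Bin values into num_bins equal-width intervals between min and max.

    Returns a list of (bin_start, bin_end, count) tuples.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [(lo, hi, len(values))]  # all threads identical: a single bin
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for value in values:
        # The maximum value would index one past the end, so clamp it.
        index = min(int((value - lo) / width), num_bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, count) for i, count in enumerate(counts)]


if __name__ == "__main__":
    # Hypothetical seconds spent in one function on each of eight threads.
    per_thread_seconds = [70.0, 74.9, 75.2, 75.8, 76.1, 76.4, 77.0, 81.6]
    for start, end, count in histogram(per_thread_seconds, num_bins=4):
        print(f"{start:6.1f} - {end:6.1f}: {'#' * count}")
```

A tall bar in the middle with short bars at the edges is what a well-balanced function looks like in such a view.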

Visualizations: paraprof (3)

  • paraprof can also show three-dimensional views of profile data

  • Axes for bar and triangle mesh plots

    • Time spent in each function (height)

    • Which function (width)

    • Which node (depth)

  • Scatter plot lets you pick axes

  • Plots support transparency, rotation, and highlighting a particular function or node

  • Surprisingly responsive for a Java application!

Bottleneck Identification Test Suite

  • Testing metric: what did pprof/paraprof tell us from wallclock time profiles?

    • Since TAU has no built-in trace visualizer, we ignored what could be done with other trace tools

  • Program correctness not affected by instrumentation

    • Except for our version of LU

  • CAMEL

    • Showed work evenly distributed among nodes

    • When full tracing used, can easily show which functions take the most wall clock time

  • NAS LU

    • Could not run, got segfaults using MPICH or LAM

    • Even if it worked, it would be very difficult/impossible to garner communication patterns from profile views

  • Big messages: PASSED

    • Profile showed most of application time dominated by MPI calls to send and receive

  • Diffuse procedure: TOSS-UP

    • Profile showed most time taken by MPI_Barrier calls

    • However, profile also showed the bottleneck procedure (which is dispersed across all nodes) taking up a negligible amount of overall time

    • Really need a trace view to see diffuse behavior of program

  • Hot procedure: PASSED

    • Profile clearly shows that one function is responsible for most execution time

Bottleneck Identification Test Suite (2)

  • Intensive server: PASSED

    • Profile showed most time spent in MPI_Recv for all nodes except first node

    • Profile also illustrated most time for first node spent in waste_time

  • Ping-pong: PASSED

    • Easy to see from profile that most time is being spent in MPI_Send and MPI_Recv

    • pprof and paraprof also showed a large number of MPI calls

  • Random barrier: TOSS-UP

    • Profile showed most time being spent in MPI_Barrier

    • However, random nature of barrier not shown by profile

    • Trace view is necessary to see random barrier behavior

  • Small messages: PASSED

    • Profile showed one process spending most time in MPI_Send and the other process in MPI_Recv

  • System time: FAILED

    • No built-in way to separate wall clock time into system time vs. user time

    • PAPI metrics can’t record system time vs. user time either

  • Wrong order: FAILED

    • Impossible to see communication behavior without a trace

TAU General Comments

  • Good things

    • Supports profiling & tracing

    • Very portable

    • Wide range of software support

      • Several programming models & libraries supported

    • Visualization tools seem very stable

    • Good support for exporting data to other tools

  • Things that could use improvement

    • Dependence on other software for basic functionality (instrumentation via PDToolkit or DynInst) makes installation difficult

    • Source code correlation could be better

      • Only at the function or function call level (with call paths)

    • Export is nice, but lots of things are easier to do directly in other tools

      • For example, mpicc -mpilog to get a trace for Jumpshot instead of cparse, tau_instrument, wrapper Makefiles, …

      • TAU does add automatic instrumentation for profiling functions, which is an added benefit

    • Three-dimensional visualizations are nice but “Cube” viewer from KOJAK is easier to use and displays data in a very concise manner

      • Text is also hard to read on three-dimensional views for function names

    • Some interoperability features (export to SLOG-2 and ALOG) do not work well in the version we tested

  • TAU could potentially serve as a base for our UPC and SHMEM performance tool

TAU: Adding UPC & SHMEM

  • SHMEM

    • Not much extra work needed

    • Have already created weak binding patches for GPSHMEM & created a wrapper library that calls the appropriate TAU functions

  • UPC

    • If we have source code instrumentation, then just put in TAU* instrumentation calls in the appropriate places

    • If we do binary instrumentation, we’ll probably have to make major modifications to DynInst

    • In any case, once the UPC instrumentation problem is solved, adding support for UPC into TAU will not be too hard

      • However, how to instrument UPC programs while retaining low overhead?

      • Also, how to extend TAU to support more advanced analyses?

  • Support for profiles and traces a nice bonus
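
The wrapper-library idea mentioned above (intercept each communication call, start a timer, run the real call, stop the timer) can be sketched generically. The code below uses a Python decorator and a stand-in function instead of a real SHMEM call or the TAU API. Every name in it (instrumented, CALLS, SECONDS, shmem_put_stub) is hypothetical and chosen only for illustration.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function totals collected by the wrappers.
CALLS = defaultdict(int)
SECONDS = defaultdict(float)


def instrumented(func):
    """Wrap func so every call is counted and timed before returning."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call returns or raises, like a timer stop.
            CALLS[func.__name__] += 1
            SECONDS[func.__name__] += time.perf_counter() - start
    return wrapper


@instrumented
def shmem_put_stub(dest, values):
    """Stand-in for a communication call: copies values into dest."""
    dest[:len(values)] = values
    return len(values)


if __name__ == "__main__":
    buffer = [0] * 4
    shmem_put_stub(buffer, [1, 2])
    print(buffer, dict(CALLS))
```

The caller's code is unchanged apart from which symbol it resolves to, which is why only re-linking is needed when the same idea is applied through weak bindings.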

Evaluation (1)

  • Available metrics: 4/5

    • Supports recording execution time (broken down into call trees)

    • Supports several methods of gathering profile data

    • Supports all PAPI metrics for profiles

  • Cost: 5/5

    • Free!

  • Documentation quality: 3.5/5

    • User’s manual very good, but out of date

    • For example, three-dimensional visualizations not covered in manual

  • Extensibility: 4/5

    • Open source, uses documented APIs

    • Can add support for new languages using source instrumentation

  • Filtering and aggregation: 2.5/5

    • Filtering & aggregation available through profile view

    • No advanced filter or custom aggregation methods built in for traces

Evaluation (2)

  • Hardware support: 5/5

    • Many platforms supported: 64-bit Linux (Opteron, Itanium, Alpha, SPARC); IBM SP2 (AIX); IBM BlueGene/L; AlphaServer (Tru64); SPARC-based clusters (Solaris); SGI (IRIX 6.x) systems, including Indy, Power Challenge, Onyx, Onyx2, Origin 200, 2000, 3000 series; NEC SX-5; Cray X1, T3E; Apple OS X; HP RISC systems (HP-UX)

  • Heterogeneity support: 0/5 (not supported)

  • Installation: 2.5/5

    • As simple as ./configure with options, then make install

    • However, dependence on other software for source or binary instrumentation makes installation time-consuming

  • Interoperability: 5/5

    • Profile files use simple ASCII format; trace files use documented binary format

    • Can export to VAMPIR, Jumpshot/upshot (ALOG & SLOG-2), CUBE, SDDF, Paraver

  • Learning curve: 2.5/5

    • Learning how to use the different Makefile wrappers and command-line programs takes a while

    • After a short period, instrumentation & tool usage relatively easy

Evaluation (3)

  • Manual overhead: 4/5

    • Automatic instrumentation of MPI calls on all platforms

    • Automatic instrumentation of all functions or a selected group of functions

    • Call path support gives almost the same information as instrumenting call sites

    • MPI and OpenMP instrumentation support

  • Measurement accuracy: 5/5

    • CAMEL overhead < 1% for profiling and tracing when a few functions were instrumented

    • Overall, accuracy pretty good except for a few cases

  • Multiple executions: 3/5

    • Can relate profile metrics between runs in paraprof

    • Can store performance data in DBMS (PerfDB)

      • Seems like PerfDB is in a preliminary state, though

  • Multiple analyses & views: 4/5

    • Both profiling and tracing are supported (although no built-in trace viewer)

    • Profile view has stacked bar charts, “regular” views, three-dimensional views, and histograms

Evaluation (4)

  • Performance bottleneck identification: 3.5/5

    • No automatic bottleneck identification

    • Profile viewer helpful for identifying methods that take most time

    • Lack of built-in trace viewer makes identification of some bottlenecks impossible within TAU itself, but exported traces can be combined with several other viewers to cover just about anything

  • Profiling/tracing support: 4/5

    • Tracing & profiling supported

    • Default trace file format size is reasonable, but not the most compact

  • Response time: 3/5

    • Loading profiles after run almost instantaneous using paraprof viewer

    • Exporting traces to other tools time consuming (have to run tau_merge, tau_convert, etc; a few extra disk I/Os)

  • Software support: 5/5

    • Supports OpenMP, MPI, and several other programming models

    • A wide range of compilers are supported

    • Can support linking against any library, but does not instrument library functions

  • Source code correlation: 2/5

    • Supported down to the function and function call site level (when collecting call paths is enabled)

  • Searching: 0/5 (not supported)

Evaluation (5)

  • System stability: 3/5

    • Software is generally stable

    • Bugs encountered:

      • Segfaults on instrumented version of our LU code

      • SLOG-2 export seems to give Jumpshot-4 some trouble (several “unsupported event” messages on a few exported traces)

      • Exporting to ALOG format puts stray “: %d” lines in ALOG file

  • Technical support: 5/5

    • Good response from our contact (Sameer), most emails answered within 48 hours with useful information