1 / 74

Tool Visualizations, Metrics, and Profiled Entities Overview

Tool Visualizations, Metrics, and Profiled Entities Overview . Adam Leko HCS Research Laboratory University of Florida. Summary. Give characteristics of existing tools to aid our design discussions Metrics (what is recorded, any hardware counters, etc) Profiled entities Visualizations

Download Presentation

Tool Visualizations, Metrics, and Profiled Entities Overview

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Tool Visualizations, Metrics, and Profiled Entities Overview Adam Leko HCS Research Laboratory University of Florida

  2. Summary • Give characteristics of existing tools to aid our design discussions • Metrics (what is recorded, any hardware counters, etc) • Profiled entities • Visualizations • Most information & some slides taken from tool evaluations • Tools overviewed • TAU • Paradyn • MPE/Jumpshot • Dimemas/Paraver/MPITrace • mpiP • Dynaprof • KOJAK • Intel Cluster Tools (old Vampir/VampirTrace) • Pablo • MPICL/Paragraph

  3. TAU • Metrics recorded • Two modes: profile, trace • Profile mode • Inclusive/exclusive time spent in functions • Hardware counter information • PAPI/PCL: L1/2/3 cache reads/writes/misses, TLB misses, cycles, integer/floating point/load/store/stalls executed, wall clock time, virtual time • Other OS timers (gettimeofday, getrusage) • MPI message size sent • Trace mode • Same as profile (minus hardware counters?) • Message send time, message receive time, message size, message sender/recipient(?) • Profiled entities • Functions (automatic & dynamic), loops + regions (manual instrumentation)

  4. TAU • Visualizations • Profile mode • Text-based: pprof (example next slide), shows a summary of profile information • Graphical: racy (old), jracy a.k.a. paraprof • Trace mode • No built-in visualizations • Can export to CUBE (see KOJAK), Jumpshot (see MPE), and Vampir format (see Intel Cluster Tools)

  5. TAU – pprof output Reading Profile files in profile.* NODE 0;CONTEXT 0;THREAD 0: --------------------------------------------------------------------------------------- %Time Exclusive Inclusive #Call #Subrs Inclusive Name msec total msec usec/call --------------------------------------------------------------------------------------- 100.0 0.207 20,011 1 2 20011689 main() (calls f1, f5) 75.0 1,001 15,009 1 2 15009904 f1() (sleeps 1 sec, calls f2, f4) 75.0 1,001 15,009 1 2 15009904 main() (calls f1, f5) => f1() (sleeps 1 sec, calls f2, f4) 50.0 4,003 10,007 2 2 5003524 f2() (sleeps 2 sec, calls f3) 45.0 4,001 9,005 1 1 9005230 f1() (sleeps 1 sec, calls f2, f4) => f4() (sleeps 4 sec, calls f2) 45.0 4,001 9,005 1 1 9005230 f4() (sleeps 4 sec, calls f2) 30.0 6,003 6,003 2 0 3001710 f2() (sleeps 2 sec, calls f3) => f3() (sleeps 3 sec) 30.0 6,003 6,003 2 0 3001710 f3() (sleeps 3 sec) 25.0 2,001 5,003 1 1 5003546 f4() (sleeps 4 sec, calls f2) => f2() (sleeps 2 sec, calls f3) 25.0 2,001 5,003 1 1 5003502 f1() (sleeps 1 sec, calls f2, f4) => f2() (sleeps 2 sec, calls f3) 25.0 5,001 5,001 1 0 5001578 f5() (sleeps 5 sec) 25.0 5,001 5,001 1 0 5001578 main() (calls f1, f5) => f5() (sleeps 5 sec)

  6. TAU – paraprof

  7. Paradyn • Metrics recorded • Number of CPUs, number of active threads, CPU and inclusive CPU time • Function calls to and by • Synchronization (# operations, wait time, inclusive wait time) • Overall communication (# messages, bytes sent and received), collective communication (# messages, bytes sent and received), point-to-point communication (# messages, bytes sent and received) • I/O (# operations, wait time, inclusive wait time, total bytes) • All metrics recorded as “time histograms” (fixed-size data structure) • Profiled entities • Functions only (but includes functions linked to in existing libraries)

  8. Paradyn • Visualizations • Time histograms • Tables • Barcharts • “Terrains” (3-D histograms)

  9. Paradyn – time histograms

  10. Paradyn – terrain plot (histogram across multiple hosts)

  11. Paradyn – table (current metric values)

  12. Paradyn – bar chart (current or average metric values)

  13. MPE/Jumpshot • Metrics collected • MPI message send time, receive time, size, message sender/recipient • User-defined event entry & exit • Profiled entities • All MPI functions • Functions or regions via manual instrumentation and custom events • Visualization • Jumpshot: timeline view (space-time diagram overlaid on Gantt chart), histogram

  14. Jumpshot – timeline view

  15. Jumpshot – histogram view

  16. Dimemas/Paraver/MPITrace • Counter 2 • Cycles • Graduated instructions, loads, stores, store conditionals, floating point instructions • TLB misses • Mispredicted branches • Primary/secondary data cache miss rates • Data mispredictions from scache way prediction table(?) • External intervention/invalidation (cache coherency?) • Store/prefetch exclusive to clean/shared block • Metrics recorded (MPITrace) • All MPI functions • Hardware counters (2 from the following two lists, uses PAPI) • Counter 1 • Cycles • Issued instructions, loads, stores, store conditionals • Failed store conditionals • Decoded branches • Quadwords written back from scache(?) • Correctible scache data array errors(?) • Primary/secondary I-cache misses • Instructions mispredicted from scache way prediction table(?) • External interventions (cache coherency?) • External invalidations (cache coherency?) • Graduated instructions

  17. Dimemas/Paraver/MPITrace • Profiled entities (MPITrace) • All MPI functions (message start time, message end time, message size, message recipient/sender) • User regions/functions via manual instrumentation • Visualization • Timeline display (like Jumpshot) • Shows Gantt chart and messages • Also can overlay hardware counter information • Clicking on timeline brings up a text listing of events near where you clicked • 1D/2D analysis modules

  18. Paraver – timeline (standard)

  19. Paraver – timeline (HW counter)

  20. Paraver – text module

  21. Paraver – 1D analysis

  22. Paraver – 2D analysis

  23. mpiP • Metrics collected • Start time, end time, message size for each MPI call • Profiled entities • MPI function calls + PMPI wrapper • Visualization • Text-based output, with graphical browser that displays statistics in-line with source • Displayed information: • Overall time (%) for each MPI node • Top 20 callsites for time (MPI%, App%, variance) • Top 20 callsites for message size (MPI%, App%, variance) • Min/max/average/MPI%/App% time spent at each call site • Min/max/average/sum of message sizes at each call site • App time = wall clock time between MPI_Init and MPI_Finalize • MPI time = all time consumed by MPI functions • App% = % of metric in relation to overall app time • MPI% = % of metric in relation to overall MPI time

  24. mpiP – graphical view

  25. Dynaprof • Metrics collected • Wall clock time or PAPI metric for each profiled entity • Collects inclusive, exclusive, and 1-level call tree % information • Profiled entities • Functions (dynamic instrumentation) • Visualizations • Simple text-based • Simple GUI (shows same info as text-based)

  26. Dynaprof – output [leko@eta-1 dynaprof]$ wallclockrpt lu-1.wallclock.16143 Exclusive Profile. Name Percent Total Calls ------------- ------- ----- ------- TOTAL 100 1.436e+11 1 unknown 100 1.436e+11 1 main 3.837e-06 5511 1 Inclusive Profile. Name Percent Total SubCalls ------------- ------- ----- ------- TOTAL 100 1.436e+11 0 main 100 1.436e+11 5 1-Level Inclusive Call Tree. Parent/-Child Percent Total Calls ------------- ------- ----- -------- TOTAL 100 1.436e+11 1 main 100 1.436e+11 1 - f_setarg.0 1.414e-05 2.03e+04 1 - f_setsig.1 1.324e-05 1.902e+04 1 - f_init.2 2.569e-05 3.691e+04 1 - atexit.3 7.042e-06 1.012e+04 1 - MAIN__.4 0 0 1

  27. KOJAK • Metrics collected • MPI: message start time, receive time, size, message sender/recipient • Manual instrumentation: start and stop times • 1 PAPI metric / run (only FLOPS and L1 data misses visualized) • Profiled entities • MPI calls (MPI wrapper library) • Function calls (automatic instrumentation, only available on a few platforms) • Regions and function calls via manual instrumentation • Visualizations • Can export traces to Vampir trace format (see ICT) • Shows profile and analyzed data via CUBE (described on next few slides)

  28. CUBE overview: simple description • Uses a 3-pane approach to display information • Metric pane • Module/calltree pane • Right-clicking brings up source code location • Location pane (system tree) • Each item is displayed along with a color to indicate severity of condition • Severity can be expressed 4 ways • Absolute (time) • Percentage • Relative percentage (changes module & location pane) • Comparative percentage (differences between executions) • Despite documentation, interface is actually quite intuitive

  29. CUBE example: CAMEL After opening the .cube file (default metric shown = absolute time take in seconds)

  30. CUBE example: CAMEL After expanding all 3 root nodes; color shown indicates metric “severity” (amount of time)

  31. CUBE example: CAMEL Selecting “Execution” shows execution time, broken down into part of code & machine

  32. CUBE example: CAMEL Selecting mainloop adjusts system tree to only show time spent in mainloop per each processor

  33. CUBE example: CAMEL Expanded nodes show exclusive metric (only time spent by node)

  34. CUBE example: CAMEL Collapsed nodes show inclusive metric (time spent by node and all children nodes)

  35. CUBE example: CAMEL Metric pane also shows detected bottlenecks; here, shows “Late Sender” in MPI_Recv within main spread across all nodes

  36. Intel Cluster Tools (ICT) • Metrics collected • MPI functions: start time, end time, message size, message sender/recipient • User-defined events: counter, start & end times • Code location for source-code correlation • Instrumented entities • MPI functions via wrapper library • User functions via binary instrumentation(?) • User functions & regions via manual instrumentation • Visualizations • Different types: timelines, statistics & counter info • Described in next slides

  37. ICT visualizations – timelines & summaries • Summary Chart Display • Allows the user to see how much work is spent in MPI calls • Timeline Display • Zoomable, scrollable timeline representation of program execution Fig. 1 Summary Chart Fig. 2 Timeline Display

  38. ICT visualizations – histogram & counters • Summary Timeline • Timeline/histogram representation showing the number of processes in each activity per time bin • Counter Timeline • Value over time representation (behavior depends on counter definition in trace) Fig. 3 Summary TImeline Fig 4. Counter Timeline

  39. ICT visualizations – message stats & process profiles • Message Statistics Display • Message data to/from each process (count,length, rate, duration) • Process Profile Display • Per process data regarding activities Fig. 5 Message Statistics Fig. 6 Process Profile Display

  40. ICT visualizations – general stats & call tree • Statistics Display • Various statistics regarding activities in histogram, table, or text format • Call Tree Display Fig. 7 Statistics Display Fig. 8 Call Tree Display

  41. ICT visualizations – source & activity chart • Source View • Source code correlation with events in Timeline • Activity Chart • Per Process histograms of Application and MPI activity Fig 9. Source View Fig. 10 Activity Chart

  42. ICT visualizations – process timeline & activity chart • Process Timeline • Activity timeline and counter timeline for a single process • Process Activity Chart • Same type of informartion as Global Summary Chart • Process Call Tree • Same type of information as Global Call Tree Figure 11. Process Timeline Figure 12. Process Activity Chart & Call Tree

  43. Pablo • Metrics collected • Time inclusive/exclusive of a function • Hardware counters via PAPI • Summary metrics computed from timing info • Min/max/avg/stdev/count • Profiled entities • Functions, function calls, and outer loops • All selected via GUI • Visualizations • Displays derived summary metrics color-coded and inline with source code • Shown on next slide

  44. SvPablo

  45. MPICL/Paragraph • Metrics collected • MPI functions: start time, end time, message size, message sender/recipient • Manual instrumentation: start time, end time, “work” done (up to user to pass this in) • Profiled entities • MPI function calls via PMPI interface • User functions/regions via manual instrumentation • Visualizations • Many, separated into 4 categories: utilization, communication, task, “other” • Described in following slides

  46. ParaGraph visualizations • Utilization visualizations • Display rough estimate of processor utilization • Utilization broken down into 3 states: • Idle – When program is blocked waiting for a communication operation (or it has stopped execution) • Overhead – When a program is performing communication but is not blocked (time spent within MPI library) • Busy – if execution part of program other than communication • “Busy” doesn’t necessarily mean useful work is being done since it assumes (not communication) := busy • Communication visualizations • Display different aspects of communication • Frequency, volume, overall pattern, etc. • “Distance” computed by setting topology in options menu • Task visualizations • Display information about when processors start & stop tasks • Requires manually instrumented code to identify when processors start/stop tasks • Other visualizations • Miscellaneous things

  47. Utilization visualizations – utilization count • Displays # of processors in each state at a given moment in time • Busy shown on bottom, overhead in middle, idle on top

  48. Utilization visualizations – Gantt chart • Displays utilization state of each processor as a function of time

  49. Utilization visualizations – Kiviat diagram • Shows our friend, the Kiviat diagram • Each spoke is a single processor • Dark green shows moving average, light green shows current high watermark • Timing parameters for each can be adjusted • Metric shown can be “busy” or “busy + overhead”

  50. Utilization visualizations – streak • Shows “streak” of state • Similar to winning/losing streaks of baseball teams • Win = overhead or busy • Loss = idle • Not sure how useful this is

More Related