
Lecture 4: Parallel Tools Landscape – Part 1



  1. Lecture 4: Parallel Tools Landscape – Part 1. Allen D. Malony, Department of Computer and Information Science

  2. Performance and Debugging Tools
Performance Measurement and Analysis:
• Scalasca
• Vampir
• HPCToolkit
• Open|SpeedShop
• Periscope
• mpiP
• Paraver
• PerfExpert
Modeling and prediction:
• Prophesy
• MuMMI
Autotuning frameworks:
• Active Harmony
• Orio and PBound
Debugging:
• STAT

  3. Performance Tools Matrix

  4. Scalasca Jülich Supercomputing Centre (Germany) German Research School for Simulation Sciences http://www.scalasca.org

  5. Scalable performance-analysis toolset for parallel codes
• Focus on communication & synchronization
• Integrated performance analysis process
  • Performance overview on call-path level via call-path profiling
  • In-depth study of application behavior via event tracing
• Supported programming models
  • MPI-1, MPI-2 one-sided communication
  • OpenMP (basic features)
• Available for all major HPC platforms

  6. The Scalasca project: Overview
• Project started in 2006
  • Initial funding by Helmholtz Initiative & Networking Fund
  • Many follow-up projects
• Follow-up to pioneering KOJAK project (started 1998)
  • Automatic pattern-based trace analysis
• Now joint development of
  • Jülich Supercomputing Centre
  • German Research School for Simulation Sciences

  7. The Scalasca project: Objective
• Development of a scalable performance analysis toolset for the most popular parallel programming paradigms
• Specifically targeting large-scale parallel applications
  • such as those running on IBM Blue Gene or Cray XT systems with one million or more processes/threads
• Latest release: Scalasca v2.0 with Score-P support (August 2013)

  8. Scalasca: Automatic trace analysis
(Figure: analysis turns a low-level event trace into a high-level result, attributing each property to a call path and location.)
• Idea
  • Automatic search for patterns of inefficient behavior
  • Classification of behavior & quantification of significance
  • Guaranteed to cover the entire event trace
  • Quicker than manual/visual trace analysis
• Parallel replay analysis exploits available memory & processors to deliver scalability

  9. Scalasca 2.0 features
• Open source, New BSD license
• Fairly portable
  • IBM Blue Gene, IBM SP & blade clusters, Cray XT, SGI Altix, Solaris & Linux clusters, ...
• Uses Score-P instrumenter & measurement libraries
  • Scalasca 2.0 core package focuses on trace-based analyses
• Supports common data formats
  • Reads event traces in OTF2 format
  • Writes analysis reports in CUBE4 format
• Current limitations:
  • No support for nested OpenMP parallelism and tasking
  • Unable to handle OTF2 traces containing CUDA events

  10. Scalasca workflow
(Figure: the instrumenter (compiler/linker) turns source modules into an instrumented executable; the instrumented target application runs against the measurement library (optionally reading hardware counters, HWC) to produce either a summary report or local event traces; a parallel wait-state search over the traces yields a wait-state report; report manipulation feeds an optimized measurement configuration back into measurement. The reports answer: Which problem? Where in the program? Which process?)
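A minimal command-line sketch of this workflow with Scalasca 2.0 and the Score-P instrumenter (the compiler, application name, rank count, and experiment-directory name are illustrative, not from the slides):

% scorep mpicc -O2 -o app app.c              # instrument while compiling/linking
% scalasca -analyze mpiexec -n 64 ./app      # runtime summary (call-path profile)
% scalasca -analyze -t mpiexec -n 64 ./app   # collect traces + parallel wait-state search
% scalasca -examine scorep_app_64_trace      # explore the analysis report

The -t flag switches the measurement from runtime summarization to trace collection, after which the wait-state search runs automatically as a parallel replay over the local event traces.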

  11. Wait-state analysis
• Classification
• Quantification
(Figure: time-line diagrams of the wait-state patterns (a) Late Sender, (b) Late Sender / Wrong Order, and (c) Late Receiver.)
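To make the Late Sender pattern concrete, here is a minimal two-rank MPI program (an illustration, not from the slides) that provokes it: rank 1 posts its receive immediately, while rank 0 computes first, so the time rank 1 spends blocked in MPI_Recv is Late Sender waiting time.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        sleep(2);                 /* simulated computation: the send happens "late" */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocks for ~2 s waiting for the late sender */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}

Run with at least two ranks (e.g. mpiexec -n 2 ./late_sender); in a Scalasca trace analysis, rank 1's blocking time would be attributed to this pattern.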

  12. Call-path profile: Computation
(Screenshot: execution time excluding MPI communication is just 30% of the simulation and is widely spread in the code.)

  13. Call-path profile: P2P messaging
(Screenshot: MPI point-to-point communication time accounts for 66% of the simulation, primarily in scatter & gather.)

  14. Call-path profile: P2P sync. ops.
(Screenshot: point-to-point messages without data reveal masses of P2P synchronization operations, with all processes equally responsible.)

  15. Trace analysis: Late Sender
(Screenshot: wait time of receivers blocked by a late sender; half of the send time is waiting, with significant process imbalance.)

  16. Scalasca approach to performance dynamics (new)

  17. Time-series call-path profiling
• Instrumentation of the main loop to distinguish individual iterations
• Complete call tree with multiple metrics recorded for each iteration
• Challenge: storage requirements proportional to #iterations

#include "epik_user.h"

void initialize() {}
void read_input() {}
void do_work() {}
void do_additional_work() {}
void finish_iteration() {}
void write_output() {}

int main() {
    int iter;
    PHASE_REGISTER(iter, "ITER");
    int t;
    initialize();
    read_input();
    for (t = 0; t < 5; t++) {
        PHASE_START(iter);
        do_work();
        do_additional_work();
        finish_iteration();
        PHASE_END(iter);
    }
    write_output();
    return 0;
}

(Figure: the resulting call tree and process topology.)

  18. Online compression
• Exploits similarities between iterations
• Summarizes similar iterations in a single iteration via clustering and structural comparisons
• On-line to save memory at run-time
• Process-local to
  • avoid communication
  • adjust to local temporal patterns
• The number of clusters never exceeds a predefined maximum
  • merging of the two closest ones
(Screenshots: MPI P2P time for 147.l2wrf2 and 143.dleslie, original vs. compressed with 64 clusters.)
Zoltán Szebenyi et al.: Space-Efficient Time-Series Call-Path Profiling of Parallel Applications. In Proc. of the SC09 Conference, Portland, Oregon, ACM, November 2009.
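To illustrate the merging rule, here is a simplified sketch in C (my illustration, not the actual Scalasca implementation; NMETRICS and MAX_CLUSTERS are arbitrary): each iteration's profile is reduced to a vector of metrics, every new iteration enters as its own cluster, and whenever the predefined maximum is exceeded the two closest clusters are merged into a weighted centroid.

#include <math.h>

#define NMETRICS     4        /* metrics recorded per iteration (assumed) */
#define MAX_CLUSTERS 8        /* predefined maximum number of clusters */

typedef struct {
    double centroid[NMETRICS];   /* average profile of the summarized iterations */
    int    weight;               /* number of iterations summarized */
} Cluster;

static Cluster clusters[MAX_CLUSTERS + 1];   /* +1 slot for the incoming iteration */
static int     nclusters = 0;

static double cluster_dist(const Cluster *a, const Cluster *b) {
    double d = 0.0;
    for (int i = 0; i < NMETRICS; i++) {
        double diff = a->centroid[i] - b->centroid[i];
        d += diff * diff;
    }
    return sqrt(d);
}

/* Merge cluster j into cluster i as a weighted average, then drop j. */
static void merge_clusters(int i, int j) {
    int w = clusters[i].weight + clusters[j].weight;
    for (int k = 0; k < NMETRICS; k++)
        clusters[i].centroid[k] = (clusters[i].centroid[k] * clusters[i].weight
                                 + clusters[j].centroid[k] * clusters[j].weight) / w;
    clusters[i].weight = w;
    clusters[j] = clusters[--nclusters];   /* fill the gap with the last cluster */
}

void add_iteration(const double metrics[NMETRICS]) {
    /* The new iteration starts as its own cluster... */
    for (int k = 0; k < NMETRICS; k++)
        clusters[nclusters].centroid[k] = metrics[k];
    clusters[nclusters].weight = 1;
    nclusters++;

    /* ...and if the maximum is exceeded, the two closest clusters are merged. */
    if (nclusters > MAX_CLUSTERS) {
        int bi = 0, bj = 1;
        double best = cluster_dist(&clusters[0], &clusters[1]);
        for (int i = 0; i < nclusters; i++)
            for (int j = i + 1; j < nclusters; j++) {
                double d = cluster_dist(&clusters[i], &clusters[j]);
                if (d < best) { best = d; bi = i; bj = j; }
            }
        merge_clusters(bi, bj);
    }
}

The real tool clusters complete call-path profiles per process and also uses structural comparison; the plain vector distance here merely stands in for that similarity measure.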

  19. Reconciling sampling and direct instrumentation
• Semantic compression needs direct instrumentation to capture communication metrics and to track the call path
• Direct instrumentation may result in excessive overhead
• New hybrid approach
  • Applies low-overhead sampling to user code
  • Intercepts MPI calls via direct instrumentation
  • Relies on efficient stack unwinding
  • Integrates measurements in a statistically sound manner
(Screenshot: the DROPS application, IGPM & SC, RWTH.)
Joint work with Zoltán Szebenyi et al.: Reconciling sampling and direct instrumentation for unintrusive call-path profiling of MPI programs. In Proc. of IPDPS, Anchorage, AK, USA, IEEE Computer Society, May 2011.

  20. Delay analysis
• Classification of waiting times into
  • direct vs. indirect
  • propagating vs. terminal
• Attributes costs of wait states to delay intervals
• Scalable through parallel forward and backward replay of traces
(Figure: process time lines illustrating delay, direct waiting time, and indirect waiting time.)
David Böhme et al.: Identifying the root causes of wait states in large-scale parallel applications. In Proc. of ICPP, San Diego, CA, IEEE Computer Society, September 2010 (Best Paper Award).

  21. HPCToolkit Rice University (USA) http://hpctoolkit.org

  22. HPCToolkit
• Rice University (USA), http://hpctoolkit.org
• Integrated suite of tools for measurement and analysis of program performance
• Works with multilingual, fully optimized applications that are statically or dynamically linked
• Sampling-based measurement
• Serial, multiprocess, multithread applications

  23. HPCToolkit / Rice University
• Performance analysis through call-path sampling
• Designed for low overhead
• Hot path analysis
• Recovery of program structure from binary
(Image by John Mellor-Crummey.)

  24. HPCToolkit DESIGN PRINCIPLES
• Employ binary-level measurement and analysis
  • observe fully optimized, dynamically linked executions
  • support multi-lingual codes with external binary-only libraries
• Use sampling-based measurement (avoid instrumentation)
  • controllable overhead
  • minimize systematic error and avoid blind spots
  • enable data collection for large-scale parallelism
• Collect and correlate multiple derived performance metrics
  • diagnosis typically requires more than one species of metric
• Associate metrics with both static and dynamic context
  • loop nests, procedures, inlined code, calling context
• Support top-down performance analysis
  • natural approach that minimizes burden on developers

  25. HPCToolkit WORKFLOW
(Figure: app. source → compile & link → optimized binary; profile execution [hpcrun] → call stack profile; binary analysis [hpcstruct] → program structure; interpret profile / correlate with source [hpcprof/hpcprof-mpi] → database; presentation [hpcviewer/hpctraceviewer].)

  26. HPCToolkit WORKFLOW
• For dynamically-linked executables on stock Linux: compile and link as you usually do, nothing special needed
• For statically-linked executables (e.g. for Blue Gene, Cray): add monitoring by using hpclink as a prefix to your link line
  • uses "linker wrapping" to catch "control" operations: process and thread creation, finalization, signals, ...
(Workflow figure repeated, highlighting the compile & link step.)
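For example, a sketch of such a link line (the compiler and object files are placeholders):

% hpclink mpicc -o app main.o solver.o

The statically-linked binary then contains the measurement library and, as slide 27 notes, measurement is controlled through environment variable settings rather than the hpcrun launcher.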

  27. HPCToolkit WORKFLOW
• Measure execution unobtrusively
  • launch optimized application binaries
    • dynamically-linked applications: launch with hpcrun to measure
    • statically-linked applications: measurement library added at link time
  • control with environment variable settings
• Collect statistical call path profiles of events of interest
(Workflow figure repeated, highlighting profile execution [hpcrun].)

  28. HPCToolkit WORKFLOW
• Analyze binary with hpcstruct: recover program structure
  • analyze machine code, line map, debugging information
  • extract loop nesting & identify inlined procedures
  • map transformed loops and procedures to source
(Workflow figure repeated, highlighting binary analysis [hpcstruct].)

  29. HPCToolkit WORKFLOW
• Combine multiple profiles: multiple threads; multiple processes; multiple executions
• Correlate metrics to static & dynamic program structure
(Workflow figure repeated, highlighting interpret profile / correlate with source [hpcprof/hpcprof-mpi].)

  30. HPCToolkit WORKFLOW
• Presentation
  • explore performance data from multiple perspectives
  • rank order by metrics to focus on what's important
  • compute derived metrics to help gain insight (e.g. scalability losses, waste, CPI, bandwidth)
  • graph thread-level metrics for contexts
  • explore evolution of behavior over time
(Workflow figure repeated, highlighting presentation [hpcviewer/hpctraceviewer].)

  31. Analyzing results with hpcviewer
(Screenshot: the call path to a hotspot and the associated source code. Image by John Mellor-Crummey.)

  32. PRINCIPAL VIEWS
• Calling context tree view - "top-down" (down the call chain)
  • associate metrics with each dynamic calling context
  • high-level, hierarchical view of distribution of costs
  • example: quantify initialization, solve, post-processing
• Caller's view - "bottom-up" (up the call chain)
  • apportion a procedure's metrics to its dynamic calling contexts
  • understand costs of a procedure called in many places
  • example: see where PGAS put traffic is originating
• Flat view - ignores the calling context of each sample point
  • aggregate all metrics for a procedure, from any context
  • attribute costs to loop nests and lines within a procedure
  • example: assess the overall memory hierarchy performance within a critical procedure

  33. HPCToolkit DOCUMENTATION
• http://hpctoolkit.org/documentation.html
• Comprehensive user manual: http://hpctoolkit.org/manual/HPCToolkit-users-manual.pdf
  • Quick start guide: essential overview that almost fits on one page
  • Using HPCToolkit with statically linked programs: a guide for using hpctoolkit on BG/P and Cray XT
  • The hpcviewer user interface
  • Effective strategies for analyzing program performance with HPCToolkit: analyzing scalability, waste, multicore performance, ...
  • HPCToolkit and MPI
  • HPCToolkit troubleshooting
  • Installation guide

  34. USING HPCToolkit
• Add hpctoolkit's bin directory to your path
  • Download, build and usage instructions at http://hpctoolkit.org
• Perhaps adjust your compiler flags for your application
  • sadly, most compilers throw away the line map unless -g is on the command line; add the -g flag after any optimization flags if using anything but the Cray compilers (Cray compilers provide attribution to source without -g)
• Decide what hardware counters to monitor
  • dynamically-linked executables (e.g., Linux)
    • use hpcrun -L to learn about counters available for profiling
    • use papi_avail (you can sample any event listed as "profilable")

  35. USING HPCToolkit
• Profile execution:
  • hpcrun -e <event1@period1> [-e <event2@period2> ...] <command> [command-arguments]
  • Produces one .hpcrun results file per thread
• Recover program structure:
  • hpcstruct <command>
  • Produces one .hpcstruct file containing the loop structure of the binary
• Interpret profile / correlate measurements with source code:
  • hpcprof [-S <hpcstruct_file>] [-M thread] [-o <output_db_name>] <hpcrun_files>
  • Creates a performance database
• Use hpcviewer to visualize the performance database
  • Download hpcviewer for your platform from https://outreach.scidac.gov/frs/?group_id=22
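Putting the steps together, an end-to-end sketch (the application name, event, and sampling period are illustrative; the output directory names follow hpcrun's and hpcprof's defaults as I understand them):

% hpcrun -e PAPI_TOT_CYC@4000000 ./app    # one .hpcrun file per thread, in hpctoolkit-app-measurements/
% hpcstruct ./app                         # writes app.hpcstruct
% hpcprof -S app.hpcstruct -M thread -o hpctoolkit-app-database hpctoolkit-app-measurements
% hpcviewer hpctoolkit-app-database       # explore the performance database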

  36. Vampir ZIH, Technische Universität Dresden (Germany) http://www.vampir.eu

  37. Mission
• Visualization of the dynamics of complex parallel processes
• Requires two components
  • Monitor/Collector (Score-P)
  • Charts/Browser (Vampir)
Typical questions that Vampir helps to answer:
• What happens in my application execution during a given time in a given process or thread?
• How do the communication patterns of my application execute on a real system?
• Are there any imbalances in computation, I/O or memory usage and how do they affect the parallel execution of my application?

  38. Event Trace Visualization with Vampir
• Alternative and supplement to automatic analysis
• Show dynamic run-time behavior graphically at any level of detail
• Provide statistics and performance metrics
• Timeline charts: show application activities and communication along a time axis
• Summary charts: provide quantitative results for the currently selected time interval

  39. Vampir – Visualization Modes (1)
• Directly on the front end or a local machine: % vampir
(Figure: a thread-parallel multi-core program monitored by Score-P produces a small/medium-sized OTF2 trace file, which Vampir 8 opens directly.)

  40. Vampir – Visualization Modes (2)
• On a local machine with remote VampirServer:
  % vampirserver start -n 12
  % vampir
(Figure: a many-core, MPI-parallel application monitored by Score-P produces a large OTF2 trace file that stays on the remote machine; VampirServer reads it there and serves the local Vampir 8 client over LAN/WAN.)

  41. Vampir Performance Analysis Toolset: Usage
• Instrument your application with Score-P
• Run your application with an appropriate test set
• Analyze your trace file with Vampir
  • Small trace files can be analyzed on your local workstation
    • Start your local Vampir
    • Load trace file from your local disk
  • Large trace files should be stored on the HPC file system
    • Start VampirServer on your HPC system
    • Start your local Vampir
    • Connect local Vampir with the VampirServer on the HPC system
    • Load trace file from the HPC file system
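A command-line sketch of these steps (the compiler, application, rank count, and trace directory name are illustrative; SCOREP_ENABLE_TRACING and SCOREP_ENABLE_PROFILING are Score-P's standard switches for selecting tracing over profiling):

% scorep mpicc -O2 -o app app.c       # 1. instrument with Score-P
% export SCOREP_ENABLE_TRACING=true   #    request an OTF2 trace
% export SCOREP_ENABLE_PROFILING=false
% mpiexec -n 16 ./app                 # 2. run; writes a scorep-* experiment directory
% vampir scorep-*/traces.otf2         # 3a. small trace: open with your local Vampir
% vampirserver start -n 12            # 3b. large trace: start VampirServer on the HPC
                                      #     system, connect from local Vampir, load trace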

  42. The main displays of Vampir
• Timeline charts:
  • Master Timeline
  • Process Timeline
  • Counter Data Timeline
  • Performance Radar
• Summary charts:
  • Function Summary
  • Message Summary
  • Process Summary
  • Communication Matrix View

  43. Visualization of the NPB-MZ-MPI / BT trace
% vampir scorep_bt-mz_B_4x4_trace
(Screenshot: navigation toolbar, Function Summary, Function Legend, and Master Timeline.)

  44. Visualization of the NPB-MZ-MPI / BT trace
Master Timeline: detailed information about functions, communication and synchronization events for a collection of processes.

  45. Visualization of the NPB-MZ-MPI / BT trace
Process Timeline: detailed information about different levels of function calls in a stacked bar chart for an individual process.

  46. Visualization of the NPB-MZ-MPI / BT trace
Typical program phases: initialisation phase and computation phase.

  47. Visualization of the NPB-MZ-MPI / BT trace
Counter Data Timeline: detailed counter information over time for an individual process.

  48. Visualization of the NPB-MZ-MPI / BT trace
Performance Radar: detailed counter information over time for a collection of processes.

  49. Visualization of the NPB-MZ-MPI / BT trace
Zoom into the initialisation phase. Context View: detailed information about the function "initialize_".

  50. Visualization of the NPB-MZ-MPI / BT trace
Feature: Find Function. Execution of the function "initialize_" results in higher page fault rates.
