Tools for Engineering Analysis of High Performance Parallel Programs

David Culler, Frederick Wong, Alan Mainwaring. Computer Science Division, U.C. Berkeley. http://www.cs.berkeley.edu/~culler/talks

Presentation Transcript


  1. Tools for Engineering Analysis of High Performance Parallel Programs
     David Culler, Frederick Wong, Alan Mainwaring
     Computer Science Division, U.C. Berkeley
     http://www.cs.berkeley.edu/~culler/talks

  2. Traditional Parallel Programming Tools
     • Focus on showing “what program did” and “when it did it”
       • microscopic analysis of deterministic events
       • oriented towards initial development of small programs on small data sets and small machines
     • Instrumentation
       • traces, counters, profiles
     • Visualization
     • Examples
       • AIMS, PTOOLS, PPP
       • pablo + paradyn + ... => delphi
       • ACTS TAU - tuning and analysis util.
     LLNL ASCI III

  3. Example: Pablo

  4. Beyond Zeroth-order Analysis
     • Basic level: get to a system design that is reasonable and behaves properly under “ideal conditions”
     • Subject the system to various stresses to understand its operating regime and gain deeper insight into its dynamic behavior
     • Combine empirical data with analytical models
     • Iterate
     • from What? to What if?
     [Figure: max displacement vs. wind speed]

  5. Approach: Framework for Parameterized Sensitivity Analysis
     • framework performs analysis over numerous runs
       • statistical filtering
       • vary parameter of interest
     • provides means of combining data to isolate effects of interest
     => ROBUSTNESS
     [Diagram: Problem Data Set Generator; Well-developed Parallel Program; Instrumentation Tools; Study Parameter; Machine Characterizers (Procs, Comm. perf., Cache, Scheduling, ...); visualization, modeling]
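
The loop that slide 5 describes (vary one parameter, repeat each run, filter the samples statistically) might be sketched as follows. This is a minimal illustration, not the authors' framework: the names `time_once`, `sweep`, `filter_outliers`, and `summarize` are hypothetical, and the median-absolute-deviation filter is one assumed choice of "statistical filtering" among many.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def time_once(workload: Callable[[int], object], value: int) -> float:
    """Return the wall-clock seconds taken by one call of workload(value)."""
    start = time.perf_counter()
    workload(value)
    return time.perf_counter() - start


def sweep(workload: Callable[[int], object],
          values: Iterable[int],
          repeats: int = 5) -> Dict[int, List[float]]:
    """Run the workload `repeats` times at each parameter value."""
    return {v: [time_once(workload, v) for _ in range(repeats)] for v in values}


def filter_outliers(samples: List[float], k: float = 3.0) -> List[float]:
    """Drop samples more than k median-absolute-deviations from the median.

    Assumed filtering rule, chosen for illustration only.
    """
    med = statistics.median(samples)
    mad = statistics.median(abs(s - med) for s in samples)
    if mad == 0.0:
        # No spread to judge by, so keep every sample.
        return list(samples)
    return [s for s in samples if abs(s - med) <= k * mad]


def summarize(results: Dict[int, List[float]]) -> Dict[int, float]:
    """Median of the filtered samples at each parameter value."""
    return {v: statistics.median(filter_outliers(s)) for v, s in results.items()}


if __name__ == "__main__":
    # Hypothetical stand-in for a parallel program: cost grows with n.
    curve = summarize(sweep(lambda n: sum(range(n)), [10_000, 100_000, 1_000_000]))
    for n, seconds in curve.items():
        print(f"n={n}: {seconds:.6f} s")
```

In a real study the parameter would be something like processor count or cache size, and `workload` would launch the instrumented program rather than a local function.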

  6. Simplest Example: Performance( P )
     • NPB2.2 on NOW and Origin 2000 (250)

  7. Where Time is Spent ( P )
     • Reveal basic Processor and network loading (vs P)
     • Basis for model derivation - comm(P)

  8. Where Time is Spent ( P ) - cont
     • Reveal basic Processor and network loading (vs P)

  9. Communication Volume ( P )

  10. Communication Structure ( P )

  11. Understanding Efficiency ( P, M )
      • Want to understand both what load the program is placing on the system and how well the system is handling that load
        => characterize the capability of the system via simple benchmarks (rather than advertised peaks)
        => combine with measured load for predictive model, & compare
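
One way to read "combine with measured load for predictive model, & compare" is sketched below. The linear latency-plus-bandwidth cost model is an assumption made for illustration (the slides do not state which model the authors used), and `MachineProfile`, `CommLoad`, `predicted_comm_time`, and `comm_efficiency` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineProfile:
    """Capability measured by simple benchmarks, not advertised peaks."""
    latency_s: float       # per-message cost, e.g. from a ping-pong test
    bandwidth_bps: float   # sustained bytes per second, e.g. from a bulk transfer


@dataclass(frozen=True)
class CommLoad:
    """Load the program places on the network, taken from instrumentation."""
    messages: int
    bytes_sent: int


def predicted_comm_time(machine: MachineProfile, load: CommLoad) -> float:
    """Assumed linear model: one latency per message plus bytes over bandwidth."""
    return load.messages * machine.latency_s + load.bytes_sent / machine.bandwidth_bps


def comm_efficiency(machine: MachineProfile, load: CommLoad, measured_s: float) -> float:
    """Ratio of predicted to measured communication time (1.0 = as predicted)."""
    if measured_s <= 0:
        raise ValueError("measured time must be positive")
    return predicted_comm_time(machine, load) / measured_s


if __name__ == "__main__":
    # All numbers here are made up for illustration.
    machine = MachineProfile(latency_s=0.001, bandwidth_bps=1_000_000)
    load = CommLoad(messages=1000, bytes_sent=2_000_000)
    print(comm_efficiency(machine, load, measured_s=6.0))
```

An efficiency well below 1.0 under this model would suggest the system is handling the load worse than its benchmarked capability predicts, which is the comparison the slide asks for.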

  12. Communication Efficiency

  13. Tools => Improvements in Run Time
      • Efficiency analysis (vs parameters) gives insight into where to improve the system or the program
        • use traditional profiling to see where in the program the ‘bad stuff’ happens
        • or go back and tune the system to do better

  14. Cache Behavior (P, $)
      • Combining trace generation with simulation provides new structural insight
      • Here: clear knees in program working set ($); these shift with machine size (P)
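
The idea of feeding an address trace to a cache simulator and looking for knees in the miss-rate curve can be illustrated with a small sketch. It assumes a fully associative LRU cache and a synthetic trace; the slides do not say which simulator or cache organization the authors used, and `lru_miss_rate` and `miss_curve` are hypothetical names.

```python
from collections import OrderedDict
from typing import Iterable, List, Sequence, Tuple


def lru_miss_rate(trace: Sequence[int], capacity_lines: int, line_bytes: int = 64) -> float:
    """Fraction of accesses in `trace` that miss a fully associative LRU cache."""
    if capacity_lines <= 0:
        raise ValueError("capacity must be positive")
    cache: "OrderedDict[int, None]" = OrderedDict()
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        if line in cache:
            cache.move_to_end(line)          # mark as most recently used
        else:
            misses += 1
            cache[line] = None
            if len(cache) > capacity_lines:
                cache.popitem(last=False)    # evict least recently used
    return misses / len(trace) if trace else 0.0


def miss_curve(trace: Sequence[int], capacities: Iterable[int]) -> List[Tuple[int, float]]:
    """Miss rate at each cache capacity (in lines); a sharp drop marks a knee."""
    return [(c, lru_miss_rate(trace, c)) for c in capacities]


if __name__ == "__main__":
    # Synthetic trace: 10 passes over a working set of 8 cache lines.
    trace = [i * 64 for i in range(8)] * 10
    for capacity, rate in miss_curve(trace, [2, 4, 8, 16]):
        print(f"{capacity:>2} lines: miss rate {rate:.2f}")
```

For this synthetic trace the miss rate stays at 1.0 until the cache holds all 8 lines, then drops to the cold-miss floor: a knee at the working-set size. Repeating the sweep with traces gathered at different processor counts would show whether the knee moves with P.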

  15. Cache Behavior (P, $)
      • Clear knees in program working set ($) not affected by P

  16. Sensitivity to Multiprogramming
      • Parallel machines are increasingly general purpose
        • multiprogramming, at least interrupts and daemons
      • Many ‘ideal’ programs very sensitive to perturbations
      • Msg Passing is loosely coupled, but implementation may not be!

  17. Tools => Improvements in Run Time
      • MPI implementation spin-waits on send till network available (or queue not full) or on recv-complete
      • Should use two-phase spin-block
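
Two-phase spin-block waiting, as recommended on slide 17, means polling for a short bounded time and then yielding the processor by blocking. The sketch below shows the control flow only, using Python's `threading.Event` as a stand-in for "network available" or "receive complete". The function name `two_phase_wait` and the default spin budget are assumptions; a real MPI library would implement this in its own runtime against its own queue state.

```python
import threading
import time
from typing import Optional


def two_phase_wait(event: threading.Event,
                   spin_seconds: float = 0.0005,
                   timeout: Optional[float] = None) -> str:
    """Wait for `event`: poll briefly, then block.

    Returns "spin" if the event was seen while polling, "block" if it was
    seen during the blocking phase, or "timeout" if the blocking wait expired.
    """
    # Phase 1: spin. Cheap when the event arrives almost immediately.
    deadline = time.perf_counter() + spin_seconds
    while time.perf_counter() < deadline:
        if event.is_set():
            return "spin"
    # Phase 2: block. Frees the processor for other work while waiting.
    if event.wait(timeout):
        return "block"
    return "timeout"


if __name__ == "__main__":
    ready = threading.Event()
    threading.Timer(0.05, ready.set).start()   # event arrives after the spin budget
    print(two_phase_wait(ready, spin_seconds=0.001, timeout=2.0))
```

The spin budget is the tuning knob: a pure spin-wait wastes the processor whenever a co-scheduled job or daemon delays the partner, while blocking immediately pays a wake-up cost even when the event was about to arrive.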

  18. Sensitivity to Seemingly Unrelated Activity
      • The mechanism for doing parameter studies is naturally extended to get statistically valid data through multiple samples at each point
        • tend to get crisp, fast results in the wee hours
      • Extend study outside the app
      • Example: two programs on big Origin

                                  alone (sec)    together on 64 P (sec)
        8 processor IS run        4.71           6.18
        36 processor SP run       26.36          65.28
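
The size of the interference on slide 18 is easier to see as a slowdown ratio. The short calculation below uses the four timings from the slide, reading its two columns as "alone" and "together"; the helper name `slowdown` is hypothetical.

```python
def slowdown(alone_s: float, together_s: float) -> float:
    """Run time when sharing the machine divided by run time when alone."""
    if alone_s <= 0:
        raise ValueError("baseline time must be positive")
    return together_s / alone_s


# Timings in seconds, copied from the slide: (alone, together).
runs = {
    "IS, 8 processors": (4.71, 6.18),
    "SP, 36 processors": (26.36, 65.28),
}

for name, (alone, together) in runs.items():
    print(f"{name}: {slowdown(alone, together):.2f}x")
# IS, 8 processors: 1.31x
# SP, 36 processors: 2.48x
```

So IS ran about 31% longer when sharing the machine, while SP took roughly two and a half times as long.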

  19. Repeatability
      • The variance for the repeated runs is a key result for production codes - the real world is not ideal

  20. Plans
      • Integrate our instrumentation and analysis tools with ACTS TAU
        • port to UCB Millennium environment
        • experiment with ASCI platforms
      • Refine and complete the automated sensitivity analysis framework
      • Backend performance data storage
        • Pablo SPPF?
      • Next Year
        • integrate performance model development, prediction
