
Engineering Analysis of High Performance Parallel Programs




  1. Engineering Analysis of High Performance Parallel Programs
  David Culler, Computer Science Division, U.C. Berkeley
  http://www.cs.berkeley.edu/~culler

  2. Traditional Parallel Programming Tools
  • Focus on showing “what program did” and “when it did it”
  • microscopic analysis of deterministic events
  • oriented towards initial development of small programs on small data sets and small machines
  • Instrumentation: traces, counters, profiles
  • Visualization
  • Examples: AIMS, PTOOLS, PPP; pablo + paradyn + ... => delphi; ACTS TAU - Tuning and Analysis Utilities

  3. Example: Pablo

  4. Beyond Zeroth-order Analysis
  • Basic level: get to a system design that is reasonable and behaves properly under “ideal conditions”
  • Subject the system to various stresses to understand its operating regime and gain deeper insight into its dynamic behavior
  • Combine empirical data with analytical models
  • Iterate: from What? to What if?
  [Figure: max displacement vs. wind speed]

  5. Approach: Framework for Parameterized Sensitivity Analysis
  • framework performs analysis over numerous runs
  • statistical filtering
  • vary parameter of interest
  • provides means of combining data to isolate effects of interest => ROBUSTNESS
  [Diagram: problem data set generator -> well-developed parallel program + instrumentation tools -> visualization, modeling; study parameters from machine characterizers: procs, comm. perf., cache, scheduling, ...]
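
  The transcript does not include the framework's driver; below is a minimal sketch of such a parameter sweep, assuming a generic mpirun launcher and a placeholder ./app binary (both hypothetical):

    /* Hypothetical sweep driver: vary the processor count and repeat
       each configuration several times for statistical filtering. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int procs[] = {1, 2, 4, 8, 16, 32};
        const int reps = 5;               /* samples per data point */
        char cmd[256];
        for (int i = 0; i < 6; i++) {
            for (int r = 0; r < reps; r++) {
                snprintf(cmd, sizeof cmd,
                         "mpirun -np %d ./app > run_p%d_r%d.log",
                         procs[i], procs[i], r);
                if (system(cmd) != 0)     /* keep going, note the failure */
                    fprintf(stderr, "run failed: %s\n", cmd);
            }
        }
        return 0;
    }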

  6. Example: NAS Parallel Benchmarks
  • Fix problem size (NPB2.2 class A)
  • Two different architectures: NOW UltraSPARC Cluster (170 MHz), SGI Origin (250 MHz)
  • Six application kernels:
    • BT - Block Tridiagonal solve
    • SP - Scalar Pentadiagonal solve
    • LU - Sparse LU
    • MG - Multigrid
    • IS - Integer Sort
    • FT - 3D FFT
  • Examine sensitivity to P (# procs): time(P), speedup(P) = Time(1)/Time(P)
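
  The timing methodology isn't shown in the transcript; a minimal MPI sketch for measuring time(P) of a region, assuming the kernel body is supplied elsewhere:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Barrier(MPI_COMM_WORLD);       /* align start times */
        double t0 = MPI_Wtime();
        /* ... kernel under study goes here ... */
        MPI_Barrier(MPI_COMM_WORLD);       /* wait for the slowest process */
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("time(%d) = %g s\n", p, t1 - t0);
            /* speedup(P) = time(1) / time(P), using a separate P=1 run */
        MPI_Finalize();
        return 0;
    }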

  7. Single Processor Performance

  8. Simplest Example: Performance( P )
  • NPB2.2 on NOW and Origin 2000 (250 MHz)

  9. Understanding Speedup
  Speedup(p) = T1 / MAXp (Tcompute + Tcomm + Twait)
  Tcompute = (work/p + extra) x efficiency
  • With message passing (e.g., MPI), communication time and wait time are indistinguishable
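
  A worked illustration with hypothetical numbers (not from the slides): if T1 = 100 s and p = 8, then work/p = 12.5 s; with 1 s of extra work and perfect efficiency, Tcompute = 13.5 s. If the slowest process also spends 4 s in communication and waiting, Speedup(8) = 100 / (13.5 + 4) ≈ 5.7, well short of the ideal 8.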

  10. A more austere metric...
  • Time spent doing thing X
  • Total TimeX(P) = Σ (i = 1..P) TimeX(i)
  • Constant for perfect speedup
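
  Continuing the hypothetical numbers above: with perfect speedup, 1 processor x 100 s and 8 processors x 12.5 s both give Total TimeX = 100 s, so the curve is flat; any growth in Total TimeX(P) is pure overhead (extra work, communication, or waiting).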

  11. Where Time is Spent ( P ) • Reveal basic processor and network loading (vs P)

  12. Where Time is Spent ( P ) • Reveal basic processor and network loading (vs P) • Basis for model derivation - comm(P)

  13. Why do comm. costs increase?
  • total volume?
  • volume per processor?
  • message overhead?
  • contention?

  14. Communication Volume ( P )

  15. Communication Structure ( P )

  16. Understanding Efficiency ( P, M )
  • Want to understand both what load the program is placing on the system and how well the system is handling that load
  => characterize the capability of the system via simple benchmarks (rather than advertised peaks)
  => combine with measured load for predictive model, & compare
  [Figure: measured bandwidths, 30 MB/s vs. 150 MB/s]
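
  The characterization benchmarks themselves aren't shown; a minimal MPI ping-pong bandwidth sketch (message size and repetition count are illustrative; run with at least two processes):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank;
        const int BYTES = 1 << 20, REPS = 100;
        char *buf = malloc(BYTES);
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < REPS; r++) {
            if (rank == 0) {            /* bounce the message off rank 1 */
                MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)                  /* two transfers per round trip */
            printf("bandwidth: %.1f MB/s\n",
                   2.0 * REPS * BYTES / (t1 - t0) / 1e6);
        MPI_Finalize();
        free(buf);
        return 0;
    }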

  17. Communication Efficiency

  18. Tools => Improvements in Run Time
  • Efficiency analysis (vs parameters) gives insight into where to improve the system or the program
  • use traditional profiling to see where in the program the ‘bad stuff’ happens
  • or go back and tune the system to do better

  19. Why does comp. time decrease?
  • Combining trace generation with simulation provides new structural insight
  • Here: clear knees in the program working set ($ = cache) shift with machine size (P)
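
  The simulator itself isn't shown in the talk; a minimal sketch of the idea, assuming a direct-mapped cache with 32-byte lines fed by an address trace (the synthetic trace below is a stand-in for real trace output):

    #include <stdio.h>
    #include <stdlib.h>

    #define LINE 32   /* assumed cache line size in bytes */

    /* Miss rate of an address trace on a direct-mapped cache. */
    double miss_rate(const unsigned long *trace, size_t n, size_t cache_bytes) {
        size_t nlines = cache_bytes / LINE;
        unsigned long *tags = calloc(nlines, sizeof *tags);
        size_t misses = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned long block = trace[i] / LINE;
            size_t idx = block % nlines;
            if (tags[idx] != block + 1) {   /* +1 so that 0 means "empty" */
                misses++;
                tags[idx] = block + 1;
            }
        }
        free(tags);
        return (double)misses / n;
    }

    int main(void) {
        enum { N = 1 << 20 };
        unsigned long *trace = malloc(N * sizeof *trace);
        for (size_t i = 0; i < N; i++)
            trace[i] = (i * 64) % (1 << 19);   /* 512 KB working set */
        /* sweep cache size to expose the knee where the WS is captured */
        for (size_t c = 32 << 10; c <= 1 << 20; c *= 2)
            printf("%6zu KB: miss rate %.4f\n",
                   c >> 10, miss_rate(trace, N, c));
        free(trace);
        return 0;
    }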

  20. Constant Problem Size Scaling
  [Figure; axis values: 4, 8, 16, 32, 64, 128, 256]

  21. LU Working Sets
  • Sharp drop in miss rate from 512 KB to 1024 KB: the working set is captured by a 1024 KB cache per processor
  • For small caches (< 32 KB), miss rate decreases at a constant rate as cache size increases
  • New effect in the 100s-of-KB-to-MB cache range

  22. LU Working Sets
  • CPS (constant problem size) scaling means a smaller and smaller problem per processor
  • Smaller working-set requirement
  • Miss rate curve “moves” to the left with P

  23. LU Working Sets • Given a fixed machine, we only observe a vertical slice of the graph

  24. LU Working Sets: Cluster vs. Origin
  [Figure: panels for Cluster and Origin]

  25. Working Sets
  • There is a Cost to scaling when, at larger machine size, miss rate increases
  • There is a Benefit to scaling when, at larger machine size, miss rate decreases
  • Processing efficiency is determined by the interaction between the changes in working set and the size of the machine
  [Table: Cost / Benefit / No Effect of scaling on the working sets of IS, LU, BT, FT, MG, SP]

  26. Sensitivity to Multiprogramming
  • Parallel machines are increasingly general purpose: multiprogramming, at least interrupts and daemons
  • Many ‘ideal’ programs are very sensitive to perturbations
  • Msg passing is loosely coupled, but the implementation may not be!

  27. Tools => Improvements in Run Time
  • The MPI implementation spin-waits on send until the network is available (or the queue is not full), and on receive until completion
  • Should use two-phase spin-block
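
  A minimal sketch of the two-phase idea, assuming sched_yield() as the back-off step (a real implementation would block on an event); the spin threshold is a hypothetical tuning knob:

    #include <mpi.h>
    #include <sched.h>
    #include <stdio.h>

    /* Phase 1: spin-poll for a bounded number of iterations.
       Phase 2: yield the CPU between polls instead of spinning forever. */
    static void two_phase_wait(MPI_Request *req) {
        const int SPIN_LIMIT = 10000;
        int done = 0;
        for (int i = 0; i < SPIN_LIMIT && !done; i++)
            MPI_Test(req, &done, MPI_STATUS_IGNORE);
        while (!done) {
            sched_yield();              /* let a competing process run */
            MPI_Test(req, &done, MPI_STATUS_IGNORE);
        }
    }

    int main(int argc, char **argv) {   /* run with 2 processes */
        int rank, msg = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Request req;
            MPI_Irecv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            two_phase_wait(&req);
            printf("received %d\n", msg);
        } else if (rank == 1) {
            msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }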

  28. Sensitivity to Seemingly Unrelated Activity
  • The mechanism for doing parameter studies is naturally extended to get statistically valid data through multiple samples at each point
  • tend to get crisp, fast results in the wee hours
  • Extend the study outside the app
  • Example: two programs on a big Origin (64 P), alone vs. together:
    8-processor IS run: 4.71 sec alone, 6.18 sec together
    36-processor SP run: 26.36 sec alone, 65.28 sec together

  29. Repeatability • The variance for the repeated runs is a key result for production codes - the real world is not ideal
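
  A small sketch of the statistic in question; the run times below are made-up placeholders, not measurements from the talk:

    #include <stdio.h>

    /* Mean and sample variance over repeated run times (seconds). */
    void stats(const double *t, int n, double *mean, double *var) {
        double s = 0.0, ss = 0.0;
        for (int i = 0; i < n; i++) s += t[i];
        *mean = s / n;
        for (int i = 0; i < n; i++) ss += (t[i] - *mean) * (t[i] - *mean);
        *var = ss / (n - 1);            /* sample variance */
    }

    int main(void) {
        double runs[] = {26.4, 27.1, 65.3, 26.9, 30.2};  /* placeholders */
        double m, v;
        stats(runs, 5, &m, &v);
        printf("mean %.2f s, variance %.2f\n", m, v);
        return 0;
    }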

  30. Understanding the Platform
  • A very simple example: broadcast(M, P)
  • vary M, P; repeat
  [Diagram: timing loop - MPI barrier; start time; MPI bcast; MPI barrier; end time]
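
  A sketch of the bcast(M, P) measurement loop; the barrier and timing placement follow the slide's diagram, while the message size and repetition count are assumptions:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank, m = 1024;            /* message size in bytes */
        const int REPS = 1000;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = malloc(m);
        for (int r = 0; r < REPS; r++) {
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();   /* start time */
            MPI_Bcast(buf, m, MPI_CHAR, 0, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);
            double t1 = MPI_Wtime();   /* end time */
            if (rank == 0)             /* keep per-repetition samples */
                printf("%d %g\n", r, t1 - t0);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }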

  31. NOW bcast (m, p)

  32. Origin mean bcast (m, p)

  33. NOW bcast (1024, p)

  34. Origin bcast (1024, p)

  35. NOW bcast(1024, 16) repetitions (first iteration discarded)

  36. Origin bcast(1024, 16) repetitions (first iteration discarded)

  37. Origin bcast(1024, 16) repetitions - 10x

  38. Origin bcast(1024, 16) repetitions

  39. Origin bcast(1M, 16) repetitions

  40. Discussion
  • Apply engineering analysis to your parallel engineering analysis codes!
  • Isolate components
  • Introduce controlled variations: processors, data set, communication rate, repetition
  • Identify trouble spots

  41. To read more
  • Parallel Computer Architecture: A Hardware/Software Approach, Culler and Singh, Morgan Kaufmann
  • Architectural Requirements and Scalability of the NAS Parallel Benchmarks, Wong, Martin, Arpaci-Dusseau, and Culler, Proc. of SC99
  • Building MPI for Multi-Programming Systems Using Implicit Information, Wong, Arpaci-Dusseau, and Culler, 6th European PVM/MPI User's Group Meeting
  • http://www.cs.berkeley.edu/~culler/papers
