Online Performance Monitoring, Analysis, and Visualization of Large-Scale Parallel Applications



  1. Online Performance Monitoring, Analysis, and Visualization of Large-Scale Parallel Applications Allen D. Malony, Sameer Shende, Robert Bell malony@cs.uoregon.edu Department of Computer and Information Science Computational Science Institute, NeuroInformatics Center University of Oregon

  2. Outline • Problem description • Scaling and performance observation • Concern for measurement intrusion • Interest in online performance analysis • General online performance system architecture • Access models • Profiling and tracing issues • Experiments with the TAU performance system • Online profiling • Online tracing • Conclusions and future work

  3. Problem Description • Need for parallel performance observation • Instrumentation, measurement, analysis, visualization • In general, there is the concern for intrusion • Seen as a tradeoff with accuracy of performance diagnosis • Scaling complicates observation and analysis • Issues of data size, processing time, and presentation • Online approaches add capabilities as well as problems • Performance interaction, but at what cost? • Tools for large-scale performance observation online • Supporting performance system architecture • Tool integration, effective usage, and portability

  4. Scaling and Performance Observation • Consider “traditional” measurement methods • Profiling: summary statistics calculated during execution • Tracing: time-stamped sequence of execution events • More parallelism → more performance data overall • Performance specific to each thread of execution • Possible increase in number of interactions between threads • Harder to manage the data (memory, transfer, storage, …) • Instrumentation more difficult with greater parallelism? • More parallelism / performance data → harder analysis • More time consuming to analyze • More difficult to visualize (meaningful displays)
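
To make the profiling/tracing distinction concrete, here is a minimal sketch (illustrative names only, not TAU's API): a profile keeps only summary statistics per event, while a trace appends a time-stamped record for every occurrence of an event.

    // Profiling vs. tracing in miniature (illustrative, not TAU's API).
    #include <chrono>
    #include <map>
    #include <string>
    #include <vector>

    struct ProfileEntry { long calls = 0; double inclusive_sec = 0.0; };
    static std::map<std::string, ProfileEntry> profile;    // profiling: summary only
    struct TraceRecord { double t; std::string event; char kind; };
    static std::vector<TraceRecord> trace;                  // tracing: every event

    static double now_sec() {
        using namespace std::chrono;
        return duration<double>(steady_clock::now().time_since_epoch()).count();
    }

    void measured_region(const std::string& name) {
        double t0 = now_sec();
        trace.push_back({t0, name, 'E'});        // trace: entry record
        /* ... user code being measured ... */
        double t1 = now_sec();
        trace.push_back({t1, name, 'X'});        // trace: exit record
        profile[name].calls += 1;                // profile: update summary statistics
        profile[name].inclusive_sec += t1 - t0;
    }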

  5. Concern for Measurement Intrusion • Performance measurement can affect the execution • Perturbation of “actual” performance behavior • Minor intrusion can lead to major execution effects • Problems exist even with small degree of parallelism • Intrusion is an accepted consequence of standard practice • Consider intrusion (perturbation) of trace buffer overflow • Scale exacerbates the problem … or does it? • Traditional measurement techniques tend to be localized • Suggests scale may not compound local intrusion globally • Measuring parallel interactions likely will be affected • Use accepted measurement techniques intelligently

  6. Why Complicate Matters with Online Methods? • Adds interactivity to performance analysis process • Opportunity for dynamic performance observation • Instrumentation change • Measurement change • Allows for control of performance data volume • Post-mortem analysis may be “too late” • View on status of long running jobs • Allow for early termination • Computation steering to achieve “better” results • Performance steering to achieve “better” performance • Hmm, isn’t online performance observation intrusive?

  7. Related Ideas • Computational steering • Falcon (Schwan, Vetter): computational steering • Dynamic instrumentation and performance search • Paradyn (Miller): online performance bottleneck analysis • Adaptive control and performance steering • Autopilot (Reed): performance steering • Peridot (Gerndt): automatic online performance analysis • OMIS/OCM (Ludwig): monitoring system infrastructure • Cedar (Malony): system/hardware monitoring • Virtue (Reed): immersive performance visualization • …

  8. General Online Performance Observation System • Instrumentation and measurement components • Analysis and visualization components • Performance control and access • Monitoring = measurement + access [Diagram components: performance instrumentation, performance measurement, performance data, performance control, performance analysis, performance visualization]

  9. Models of Performance Data Access (Monitoring) • Push Model • Producer/consumer style of access and transfer • Application decides when/what/how much data to send • External analysis tools only consume performance data • Availability of new data is signaled passively or actively • Pull Model • Client/server style of performance data access and transfer • Application is a performance data server • Access decisions are made externally by analysis tools • Two-way communication is required • Push/Pull Models
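
A hypothetical sketch of the push model, assuming a file-based transport: the application decides when to publish a profile snapshot, and the external analysis tool consumes whatever files appear; the new file itself signals availability passively. None of these names come from TAU.

    #include <cstdio>

    struct TimerSummary { const char* name; long calls; double seconds; };

    // Push model: the producer (application) writes a snapshot when it chooses.
    void push_profile_snapshot(int rank, int step,
                               const TimerSummary* timers, int n) {
        char path[256];
        std::snprintf(path, sizeof(path), "profile.%d.step%d", rank, step);
        if (std::FILE* f = std::fopen(path, "w")) {
            for (int i = 0; i < n; ++i)
                std::fprintf(f, "%s %ld %.6f\n",
                             timers[i].name, timers[i].calls, timers[i].seconds);
            std::fclose(f);   // producer never waits on the consumer (push semantics)
        }
    }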

  10. Online Profiling Issues • Profiles are summary statistics of performance • Kept with respect to some unit of parallel execution • Profiles are distributed across the machine (in memory) • Must be gathered and delivered to profile analysis tool • Profile merging must take place (possibly in parallel) • Consistency checking of profile data • Callstack must be updated to generate correct profile data • Correct communication statistics may require completion • Event identification (not necessary if event names are saved) • Sequence of profile samples allows interval analysis • Interval frequency depends on profile collection delay
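
The interval analysis mentioned above can be sketched as differencing successive profile samples (illustrative data layout): subtracting the previous snapshot from the current one yields the work performed during that collection interval.

    #include <map>
    #include <string>

    struct Sample { long calls = 0; double seconds = 0.0; };
    using ProfileSnapshot = std::map<std::string, Sample>;

    ProfileSnapshot interval_delta(const ProfileSnapshot& prev,
                                   const ProfileSnapshot& curr) {
        ProfileSnapshot delta;
        for (const auto& [event, now] : curr) {
            auto it = prev.find(event);   // an event may first appear in this interval
            Sample before = (it != prev.end()) ? it->second : Sample{};
            delta[event] = { now.calls - before.calls, now.seconds - before.seconds };
        }
        return delta;
    }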

  11. Online Tracing Issues • Tracing gathers time sequence of events • Possibly includes performance data in event record • Trace buffers distributed across the machine • Must be gathered and delivered to trace analysis tool • Trace merging is necessary (possibly in parallel) • Trace buffers overflow to files (happens even offline) • Consistency checking of trace data • May need to generate “ghost events” before and after • What portion of the trace to access (since last access) • Trace analysis may be in parallel • Trace buffer storage volume can be controlled
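
A sketch of a per-thread trace buffer that spills to a file when it fills; the flush is exactly the intrusion point discussed above, since it occurs asynchronously with respect to other threads. Names and layout are illustrative.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct TraceEvent { double timestamp; std::uint32_t event_id; std::uint32_t data; };

    class TraceBuffer {
    public:
        TraceBuffer(const char* path, std::size_t capacity)
            : file_(std::fopen(path, "wb")), capacity_(capacity) { buf_.reserve(capacity); }
        ~TraceBuffer() { flush(); if (file_) std::fclose(file_); }

        void record(const TraceEvent& e) {
            buf_.push_back(e);
            if (buf_.size() == capacity_) flush();   // overflow: write out, then reuse buffer
        }
        void flush() {                               // also callable on demand for online access
            if (file_ && !buf_.empty())
                std::fwrite(buf_.data(), sizeof(TraceEvent), buf_.size(), file_);
            buf_.clear();
        }
    private:
        std::FILE* file_;
        std::size_t capacity_;
        std::vector<TraceEvent> buf_;
    };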

  12. Performance Control • Instrumentation control • Dynamic instrumentation • Inserts / removes instrumentation at runtime • Measurement control • Dynamic measurement • Enabling / disabling / changing of measurement code • Dynamic instrumentation or measurement variables • Data access control • Selection of what performance data to access • Control of frequency of access
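
A minimal sketch of measurement control, assuming a runtime flag that an external controller can set: measurement code is enabled or disabled without re-instrumenting, so a disabled event costs only a branch.

    #include <atomic>

    static std::atomic<bool> measurement_enabled{true};

    // Called from the control path (e.g., on a request from the analysis tool).
    void set_measurement(bool on) { measurement_enabled.store(on); }

    // Every instrumentation point checks the flag before doing measurement work.
    inline void on_event_enter(int event_id) {
        if (!measurement_enabled.load(std::memory_order_relaxed)) return;
        /* ... start timer / record event for event_id ... */
        (void)event_id;
    }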

  13. TAU Performance System Framework • Tuning and Analysis Utilities (aka Tools Are Us) • Performance system framework for scalable parallel and distributed high-performance computing • Targets a general complex system computation model • nodes / contexts / threads • Multi-level: system / software / parallelism • Measurement and analysis abstraction • Integrated toolkit for performance instrumentation, measurement, analysis, and visualization • Portable performance profiling/tracing facility • Open software approach

  14. TAU Performance System Architecture [Architecture diagram; tools shown include Paraver, EPILOG, and ParaProf]

  15. Online Profile Measurement and Analysis in TAU • Standard TAU profiling • Per node/context/thread • Profile “dump” routine • Context-level • Profile file per thread in context • Appends to profile file • Selective event dumping • Analysis tools access files through shared file system • Application-level profile “access” routine
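
A minimal usage sketch, assuming TAU's profiling macros (TAU_PROFILE and the runtime profile dump call TAU_DB_DUMP; check the TAU documentation for the exact names). The time-step loop and the dump interval are illustrative.

    #include <TAU.h>

    void simulate(int n_steps) {
        TAU_PROFILE("simulate", "void (int)", TAU_USER);
        for (int step = 0; step < n_steps; ++step) {
            /* ... one application time step ... */
            if (step % 10 == 0) {
                // Append the current profile snapshot to the per-context files,
                // where an online analysis tool can pick it up through the
                // shared file system (the push model described earlier).
                TAU_DB_DUMP();
            }
        }
    }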

  16. ParaProf Framework Architecture • Portable, extensible, and scalable tool for profile analysis • Offer “best of breed” capabilities to performance analysts • Build as profile analysis framework for extensibility

  17. ParaProf Profile Display (VTF) • Virtual Testshock Facility (VTF), Caltech, ASCI Center • Dynamic measurement, online analysis, visualization

  18. Full Profile Display (SAMRAI++) • Structured AMR toolkit (SAMRAI++), LLNL • 512 processes

  19. Online Performance Profile Analysis (K. Li, UO) • Performance steering with SCIRun (Univ. of Utah) • [Dataflow diagram: Application + TAU Performance System → performance data streams → file system → Performance Data Reader (sample sequencing, reader synchronization) → Performance Data Integrator (accumulated samples) → Performance Analyzer → Performance Visualizer]

  20. Performance Visualization in SCIRun [Screenshot: SCIRun program]

  21. Uintah Computational Framework (UCF) • University of Utah • UCF analysis • Scheduling • MPI library • Components • 500 processes • Use for online and offline visualization • Apply SCIRun steering

  22. Online Uintah Performance Profiling • Demonstration of online profiling capability • Colliding elastic disks • Test material point method (MPM) code • Executed on 512 processors of ASCI Blue Pacific at LLNL • Example 1 (Terrain visualization) • Exclusive execution time across event groups • Multiple time steps • Example 2 (Bargraph visualization) • MPI execution time and performance mapping • Example 3 (Domain visualization) • Task time allocation to “patches”

  23. Example 1

  24. Example 2

  25. Example 2 (continued)

  26. Example 3

  27. Online Trace Analysis and Visualization • Tracing is more challenging to do online • Trace buffer overflow can already be viewed as “online” • Write to file system (local/remote) on overflow • Causes large intrusion of execution (not synchronized) • There is potentially a lot more data to move around • TAU does dynamic event registration • Requires trace merging to make event ids consistent • Track events that actually occur • Static schemes must predefine all possible events • Decision on whether to keep trace data • Traces can be analyzed to produce statistics
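
The event-id consistency problem created by dynamic event registration can be sketched as follows (illustrative only): each process numbers events in first-occurrence order, so the merger remaps local ids to a global table keyed by event name.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct GlobalEventTable {
        std::map<std::string, std::uint32_t> by_name;   // global id per event name
        std::uint32_t lookup(const std::string& name) {
            auto it = by_name.try_emplace(
                name, static_cast<std::uint32_t>(by_name.size())).first;
            return it->second;
        }
    };

    // Build one process's translation table: local event id -> global event id.
    std::vector<std::uint32_t> remap(const std::vector<std::string>& local_names,
                                     GlobalEventTable& global) {
        std::vector<std::uint32_t> local_to_global;
        local_to_global.reserve(local_names.size());
        for (const auto& name : local_names)
            local_to_global.push_back(global.lookup(name));
        return local_to_global;
    }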

  28. VNG Parallel Distributed Trace Analysis • Holger Brunst, Technical University of Dresden • In association with Wolfgang Nagel (ASCI PathForward) • Brunst currently visiting University of Oregon • Based on experience in development and use of Vampir • Client-server model with parallel analysis servers • Allow parallel analysis servers and remote visualization • Keep trace data close to where it was produced • Utilize parallel computing and storage resources • Hope to gain speedup efficiencies • Split analysis and visualization functionality • Accepts VTF, STF, and TAU trace formats

  29. VNG System Architecture • Client-server model with parallel analysis servers • Allow parallel analysis servers and remote analysis • [Architecture diagram components: vgn, vgnd, MPI, sockets, pthreads]

  30. Online Trace Analysis with TAU and VNG • TAU measurement of application to generate traces • Write traces (currently) to NFS files and unify with taumerge (needed for event consistency) • Trace access control (not yet) • [Diagram components: TAU measurement system, taumerge, vgnd, vgn]

  31. Experimental Online Tracing Setup • 32-processor Linux cluster

  32. Online Trace Analysis of PERC EVH1 Code • Enhanced Virginia Hydrodynamics #1 (EVH1) • Strange behavior seen on Linux platforms

  33. Evaluation of Experimental Approaches • Currently only supporting push model • File system solution for moving performance data • Is this a scalable solution? • Robust solution that can leverage high-performance I/O • May result in high intrusion • However, does not require IPC • Resolving identifiers in trace events is a real problem • Should be relatively portable

  34. Possible Improvements • Profile merging at context level to reduce number of files • Merging at node level may require explicit processing • Concurrent trace merging could also reduce files • Hierarchical merge tree • Will require explicit processing • Could consider IPC transfer • MPI (e.g., used in mpiP for profile merging) • Create own communicators • Sockets • PACX between computer server and performance analyzer • …
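
A sketch of in-band profile merging over MPI, in the spirit of the mpiP-style approach listed above: each rank contributes its timer values and rank 0 receives the aggregate, instead of writing one profile file per process. The flat timer layout and the dedicated communicator are assumptions.

    #include <mpi.h>
    #include <vector>

    // Reduce per-rank inclusive times (one slot per instrumented event) to rank 0.
    std::vector<double> merge_profiles(const std::vector<double>& local_times) {
        MPI_Comm perf_comm;
        MPI_Comm_dup(MPI_COMM_WORLD, &perf_comm);   // own communicator keeps the tool's
                                                    // traffic out of the application's
        std::vector<double> merged(local_times.size(), 0.0);
        MPI_Reduce(local_times.data(), merged.data(),
                   static_cast<int>(local_times.size()),
                   MPI_DOUBLE, MPI_SUM, 0, perf_comm);
        MPI_Comm_free(&perf_comm);
        return merged;                              // meaningful on rank 0 only
    }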

  35. Large-Scale System Support • Larger parallel systems will have better infrastructure • Higher performance I/O system and multiple I/O nodes • Faster, higher-bandwidth networks (possibly several) • Processors devoted to system operations • Hitachi SR8000 • System processor per node (8 computational processors) • Remote DMA (RDMA) • RDMA becoming available on InfiniBand • Blue Gene/L • 1024 I/O nodes (one per 64 processors) with large memory • Tree network for I/O operations and GigE as well

  36. Concluding Remarks • Interest in online performance monitoring, analysis, and visualization for large-scale parallel systems • Need to use online capabilities intelligently • Benefit from other scalability considerations of the system software and system architecture • See as an extension to the parallel system architecture • Avoid solutions that have portability difficulties • In part, this is an engineering problem • Need to work with the system configuration you have • Need to understand if approach is applicable to problem • Not clear if there is a single solution

  37. Future Work • Build online support in TAU performance system • Extend to support PULL model capabilities • Hierarchical data access solutions • Performance studies • Integrate with SuperMon (Matt Sottile, LANL) • Scalable system performance monitor • Integration with other performance tools • …
