1 / 43

Kernel-level Measurement for Integrated Parallel Performance Views KTAU: Kernel - TAU

Kernel-level Measurement for Integrated Parallel Performance Views KTAU: Kernel - TAU. Aroon Nataraj Performance Research Lab University of Oregon. KTAU: Outline. Introduction Motivations Objectives Architecture / Implementation Choices Experimentation – the performance views

caelan
Download Presentation

Kernel-level Measurement for Integrated Parallel Performance Views KTAU: Kernel - TAU

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Kernel-level Measurement for Integrated Parallel Performance ViewsKTAU: Kernel - TAU Aroon Nataraj Performance Research Lab University of Oregon

  2. KTAU: Outline • Introduction • Motivations • Objectives • Architecture / Implementation Choices • Experimentation – the performance views • Perturbation Study • Future work and directions • Acknowledgements

  3. Introduction : ZeptoOS and TAU • DOE OS/RTS for Extreme Scale Scientific Computation(Fastos) • Conduct OS research to provide effective OS/Runtime for petascale systems • ZeptoOS (under Fastos) • Scalable components for petascale architectures • Joint project Argonne National Lab and University of Oregon • ANL: Putting light-weight kernel (based on Linux) on BG/L and other platforms (XT3) • University of Oregon • Kernel performance monitoring, tuning • KTAU • Integration of TAU infrastructure with Linux Kernel • Integration with ZeptoOS, installation on BG/L • Port to 32-bit and 64-bit Linux platforms

  4. KTAU: Motivation • Application Performance • user-level execution performance + • OS-level operations performance • Different Domains: E.g. Time, Hardware Perf. Metrics • PAPI (Performance Application Programming Interface) • Exposes virtualized hardware counters • TAU (Tuning and Analysis Utility) • Measures a lot of interesting user-level entities: parallel application, MPI, libraries … • Time domain • Uses PAPI to correlate counter information with source

  5. KTAU: MotivationSimple Parallel Model

  6. KTAU: MotivationSimple Parallel Model - Scale

  7. KTAU: MotivationEffects of Scale • As HPC systems continue to scale to larger processor counts • Application performance more sensitive • New OS factors become performance bottlenecks (E.g. [Petrini’03, Jones’03, other works…]) • Isolating these system-level issues as bottlenecks is non-trivial • Comprehensive performance understanding • Observation of all performance factors • Relative contributions and interrelationship: can we correlate?

  8. KTAU: MotivationProgram - OS Interactions • Program OS Interactions - Direct vs. Indirect Entry Points • Direct - Applications invoke the OS for certain services • Syscalls (and internal OS routines called directly from syscalls) • Indirect - OS takes actions without explicit invocation by application • Preemptive Scheduling • (HW) Interrupt handling • OS-background activity (keeping track of time and timers, bottom-half handling, etc) • Indirect interactions can occur at any OS entry (not just when entering through Syscalls)

  9. KTAU: MotivationProgram - OS Interactions

  10. KTAU: MotivationProgram - OS Interactions • Direct Interactions easier to handle • Synchronous with user-code and in process-context • Indirect Interactions more difficult to handle • Usually asynchronous and in interrupt-context: Hard to measure and harder to correlate/integrate with app. measurements • Indirect interactions may be unrelated to current task • E.g. Kernel-level packet processing for another process • But related in terms of time to current process

  11. KTAU: MotivationProgram - OS Interactions(Partial)

  12. KTAU: MotivationKernel-wide vs. Process-centric • Kernel-wide - Aggregate kernel activity of all active processes in system • Understand overall OS behavior, identify and remove kernel hot spots. • Cannot show what parts of app. spend time in OS and why • Process-centric perspective - OS performance within context of a specific application’s execution • Virtualization and Mapping performance to process • Interactions between programs, daemons, and system services • Tune OS for specific workload or tune application to better conform to OS config. • Expose real source of performance problems (in the OS or the application)

  13. KTAU: MotivationKernel-wide vs. Process-centric

  14. KTAU: MotivationExisting Approaches • User-space Only measurement tools • Many tools only work at user-level and cannot observe system-level performance influences • Kernel-level Only measurement tools • Most only provide the kernel-wide perspective – lack proper mapping/virtualization • Some provide process-centric views but cannot integrate OS and user-level measurements • Combined or Integrated User/Kernel Measurement Tools • A few powerful tools allow fine-grained measurement and correlation of kernel and user-level performance • Typically these focus only on Direct OS interactions. Indirect interactions not merged. • Using Combinations of above tools • Without better integration, does not allow fine-grained correlation between OS and App. • Many kernel tools do not explicitly recognize Parallel workloads (e.g. MPI ranks) • Need an integrated approach to parallel perf. observation, analyses

  15. Support low-overhead OS performance measurement at multiple levels of function and detail Provide both kernel-wide and process-centric perspectives of OS performance Integrate user and kernel-level performance information across all program-OS interactions Provide online information and the ability to function without a daemon where possible Support both profiling and tracing for kernel-wide and process-centric views in parallel systems Leverage existing parallel performance analysis/viz tools Support for observing, collecting and analyzing parallel data KTAU: High-Level Objectives

  16. KTAU: Outline • Introduction • Motivations • Objectives • Architecture / Implementation Choices • Experimentation – the performance views • Perturbation Study • ZeptoOS – KTAU on Blue Gene / L • Future work and directions • Acknowledgements

  17. KTAU Architecture

  18. KTAU: Arch. / Impl. Choices • Instrumentation • Static Source instrumentation • Macro Map-ID: Map block of code and process-context to unique index (dense id-space) – easy array lookup. • Macro Start, Stop – provide the mapping index and process-context is implicit • Measurement • Differentiate between ‘local/self’ and ‘inter-context’ access. HPC codes primarily use ‘self’. • Store performance data in PCB (task_struct) • Integrating Kernel/User Performance state • Don’t assume synchronous kernel-entry or process-context • Have to use memory mapping between kernel and appl. State • Pinning shared state in memory • Kernel Call Groups – program-OS interactions summary • Analyses and Visualization – Use TAU facilities

  19. KTAU: Controlled Experiments • Controlled Experiments • Exercise kernel in controlled fashion • Check if KTAU produces the expected correct and meaningful views • Test machines • Neutron: 4-CPU Intel P3 Xeon 550MHz, 1GB RAM, Linux 2.6.14.3(ktau) • Neuronic: 16-node 2-CPU Intel P4 Xeon 2.8GHz, 2GB RAM/node, Redhat Enterprise Linux 2.4(ktau) • Benchmarks • NPB LU, SP applications[NPB] • Simulated computational fluid dynamics (CFD) applications. A regular-sparse, block lower and upper triangular system solution. • LMBENCH[LMBENCH] • Suite of micro-benchmarks exercising Linux kernel • A few others not shown (e.g. SKAMPI)

  20. KTAU: Controlled Examples continued… Profiling

  21. KTAU: Controlled ExperimentsObserving Interrupts User-level Inclusive Time User-level Exclusive Time Approx. 25 secs dilation in Total Inclusive time. Why? Approx. 16 secs dilation in MPSP() Exclusive time. Why? Benchmark: NPB-SP application on 16 nodes User-level profile does not tell the whole story!

  22. KTAU: Controlled ExperimentsObserving Interrupts User+OS Inclusive Time User+OS Exclusive Time MPSP excl. time difference only 4 secs. Excl-time view clearly identifies the culprits. 1. schedule() 2. do_IRQ() 3. icmp_reply() 4. do_softirq() Pings cause interrupts (do_IRQ). Which in turn handled after interrupt by soft-interrupts (do_softirq). Actual routine is icmp_reply/rcv. Large number of softirqs causes ksoftirqd to be scheduled-in, causing SP to be scheduled-out. Kernel-Space Time Taken by: 1.do_softirq() 2. schedule() 3. do_IRQ() 4. sys_poll() 5. icmp_rcv() 6. icmp_reply()

  23. KTAU: Controlled ExperimentsObserving Scheduling • NPB LU application on 8 CPU (neuronic.nic.uoregon.edu) • Simulate daemon interference using “toy” daemon • Daemon periodically wakes-up and performs activity • What does KTAU show - different views… A: Aggregated Kernel-wide View (Each row is single host)

  24. KTAU: Controlled ExperimentsObserving Scheduling • Select node 8 and take a closer look … B: Process-level View (Each row is single process on host 8) ‘Toy’ Daemon activity 2 NPB LU processes

  25. KTAU: Controlled ExperimentsObserving Scheduling • Instrumentation to differentiate voluntary/involuntary schedule • Experiment re-run on 4-processor SMP • Local slowdown - preemptive scheduling • Remote slowdown - voluntary scheduling (waiting!) C: Voluntary / Involuntary Scheduling (Each row is single MPI rank) Pre-empted out by ‘Toy’ daemon Other 3 yield cpu voluntarily and wait!

  26. LMBENCH Page-Fault KTAU: Controlled ExperimentsObserving Exceptions Call Group Relations Program-OS Call Graph

  27. Merging App / OS Traces MPI_Send OS Routines KTAU: Controlled ExperimentsTracing Fine-grained Tracing Shows detail inside interrupts and bottom halves Using VAMPIR Trace Visualization [VAMPIR]

  28. KTAU: Controlled Examples continued…Tracing Correlating CIOD and RPC-IOD Activity

  29. KTAU: Larger-Scale Runs • Run parallel benchmarks on larger-scale (128 dual-cpu nodes) • Identify (and remove) system-level performance issues • Understand perturbation overheads introduced by KTAU • NPB benchmark: LU Application[NPB] • Simulated computational fluid dynamics (CFD) application. A regular-sparse, block lower and upper triangular system solution. • ASC benchmark: Sweep3D[Sweep3d] • Solves a 3-D, time-independent, neutron particle transport equation on an orthogonal mesh. • Test machine: Chiba-City Linux cluster (ANL) • 128 dual-CPU Pentium III, 450MHz, 512MB RAM/node, Linux 2.6.14.2 (ktau) kernel, connected by Ethernet

  30. KTAU: Larger-Scale Runs • Experienced problems on Chiba by chance • Initially ran NPB-LU and Sweep3D codes on 128x1 configuration • Then ran on 64x2 configuration • Extreme performance hit (72% slower!) with the 64x2 runs • Used KTAU views to identify and solve issues iteratively • Eventually brought performance gap to 13% for LU and 9% for Sweep.

  31. KTAU: Larger-scale Runs User-level MPI_Recv MPI_Recv OS Interactions Two ranks - relatively very low MPI_Recv() time. Two ranks - MPI_Recv() diff. from Mean in OS-SCHED.

  32. KTAU: Larger-scale Runs Voluntary Scheduling Preemptive Scheduling Note: x-axis log scale Two ranks have very low voluntary scheduling durations. (Same) Two ranks have very large preemptive scheduling.

  33. KTAU Larger-scale Runs ccn10 Node-level View Interrupt Activity NPB LU processes PID:4066, PID:4068 active. No other significant activity! Why the Pre-emption? 64x2 Pinned: Interrupt Activity Bimodal across MPI ranks.

  34. Approx. 100% longer Many more OS-TCP Calls KTAU Larger-scale Runs Use ‘Merged’ performance data to identify imbalance.Why does purely compute bound region have lots of I/O? TCP within Compute : Time TCP within Compute : Calls 100% More background OS-TCP activity in Compute phase. More imbalance!

  35. KTAU Larger-scale Runs Cost / Call of OS-level TCP OS-TCP in SMP Costlier • IRQ-Balancing blindly distributes interrupts and bottom-halves. • E.g.: Handling TCP related BH in CPU-0 for LU-process on CPU-1 • Cache issues! [COMSWARE]

  36. KTAU Perturbation Study • Five different Configurations • Base: Vanilla kernel, un-instrumented benchmark • Ktau-Off: Kernel patched with Ktau and instrumentations compiled-in. But all instrumentations turned Off (boot-time control) • Prof-All: All kernel instrumentations turned On. • Prof-Sched: Only scheduler subssystem’s instrumentations turned on • Prof-All+TAU: ProfAll, but also with user-level Tau instrumentation enabled • NPB LU application benchmark: • 16 nodes, 5 different configurations, Mean over 5 runs each • ASC Sweep3D: • 128 nodes, Base and Prof-All+TAU, Mean over 5 runs each. • Test machine: Chiba-City ANL

  37. KTAU Perturbation Study Sweep3d on 128 Nodes Base ProfAll+TAU Elapsed Time: 368.25 369.9 % Avg Slow.: 0.49% Complete Integrated Profiling Cost under 3% on Avg. and as low as 1.58%. Disabled probe effect. Single instrumentation very cheap. E.g. Scheduling.

  38. KTAU: Outline • Introduction • Motivations • Objectives • Architecture / Implementation Choices • Experimentation – the performance views • Perturbation Study • Future work and directions • Acknowledgements

  39. KTAU: Future Work • Dynamic measurement control - enable/disable events w/o recompilation or reboot • Improve performance data sources that KTAU can access - E.g. PAPI • Improve integration with TAU’s user-space capabilities to provide even better correlation of user and kernel performance information • full callpaths, • phase-based profiling, • merged user/kernel traces • Integration of Tau, Ktau with Supermon (possibly MRNet?), TAUg (next) • Porting efforts: IA-64, PPC-64 and AMD Opteron • ZeptoOS: Planned characterization efforts • BGL I/O node • Dynamically adaptive kernels

  40. Acknowledgements • Prof. Allen D Malony • Dr. Sameer Shende, Senior Scientist • Alan Morris, Senior Software Engineer, PRL • Suravee Suthikulpanit , MS Student (Graduated)

  41. Support Acknowledgements • Department of Energy’s Office of Science (contract no. DE-FG02-05ER25663) and • National Science Foundation (grant no. NSF CCF 0444475)

  42. References • [petrini’03]:F. Petrini, D. J. Kerbyson, and S. Pakin, “The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of asci q,” in SC ’03 • [jones’03]: T. Jones and et al., “Improving the scalability of parallel jobs by adding parallel awareness to the operating system,” in SC ’03 • [PAPI]: S. Browne et al., “A Portable Programming Interface for Performance Evaluation on Modern Processors”. The International Journal of High Performance Computing Applications, 14(3):189--204, Fall 2000. • [VAMPIR]: W. E. Nagel et. al., “VAMPIR: Visualization and analysis of MPI resources,” Supercomputer, vol. 12, no. 1, pp. 69–80, 1996. • [ZeptoOS]: “ZeptoOS: The small linux for big computers,” http://www.mcs.anl.gov/zeptoos/ • [NPB]: D.H. Bailey et. al., “The nas parallel benchmarks,” The International Journal of Supercomputer Applications, vol. 5, no. 3, pp. 63–73, Fall 1991.

  43. References • [Sweep3d]: A. Hoise et. al., “A general predictive performance model for wavefront algorithms on clusters of SMPs,” in International Conference on Parallel Processing, 2000 • [LMBENCH]: L. W. McVoy and C. Staelin, “lmbench: Portable tools for performance analysis,” in USENIX Annual Technical Conference, 1996, pp. 279–294 • [TAU]: “TAU: Tuning and Analysis Utilities,” http://www.cs.uoregon.edu/research/paracomp/tau/ • [KTAU-BGL]: A. Nataraj, A. Malony, A. Morris, and S. Shende, “Early experiences with ktau on the ibm bg/l,” in EuroPar’06, European Conference on Parallel Processing, 2006. • [KTAU]: A. Nataraj et al., “Kernel-Level Measurement for Integrated Parallel Performance Views: the KTAU Project” (under submission)

More Related