1 / 40

Kernel-level Measurement for Integrated Parallel Performance Views

Kernel-level Measurement for Integrated Parallel Performance Views. A. Nataraj, A. Malony, S. Shende, A. Morris {anataraj,malony,sameer,amorris}@cs.uoregon.edu Performance Research Lab University of Oregon. Outline. Motivations Objectives ZeptoOS project KTAU Architecture

Download Presentation

Kernel-level Measurement for Integrated Parallel Performance Views

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Kernel-level Measurement for Integrated Parallel Performance Views A. Nataraj, A. Malony, S. Shende, A. Morris {anataraj,malony,sameer,amorris}@cs.uoregon.edu Performance Research Lab University of Oregon

  2. Outline • Motivations • Objectives • ZeptoOS project • KTAU Architecture • Case Studies - the Performance Views • Perturbation Study • KTAU improvements • Future work • Acknowledgements

  3. Motivation • Application performance is a consequence of • User-level execution • OS-level operation • Good tools exist for observing user-level performance • User-level events • Communication events • Execution time measures • Hardware performance • Fewer tools exist to observe OS-level aspects • Ideally would like to do both simultaneously • OS-level influences on application performance

  4. Scale and Performance Sensitivity • HPC systems continue to scale to larger processor counts • Application performance more performance sensitive • OS factors can lead to performance bottlenecks[Petrini’03, Jones’03, …] • System/application performance effects are complex • Isolating system-level factors is non-trivial • Require comprehensive performance understanding • Observation of all performance factors • Relative contributions and interrelationship • Can we correlate OS and application performance?

  5. Phase Performance Effects Waiting timedue to OS Overhead accumulates

  6. Program - OS Interactions • Direct • applications invoke the OS for certain services • syscalls and internal OS routines called from syscalls • Indirect • OS operations w/o explicit invocation by application • preemptive scheduling • (HW) interrupt handling • OS-background activity • can occur at any time

  7. Program - OS Interactions (continued) • Direct interactions easier to handle • Synchronous with user-code • In process-context • Indirect interactions more difficult • Usually asynchronous • Usually in interrupt-context • Harder to measure • where are the boundaries? • Harder to correlate and integrate with application measurements

  8. Performance Perspectives • Kernel-wide • Aggregate kernel activity of all active processes • Understand overall OS behavior • Identify and remove kernel hot spots • Cannot show application-specific OS actions • Process-centric • OS performance in specific application context • Virtualization and mapping performance to process • Programs, daemons, and system services interactions • Expose sources of performance problems • Tune OS for specific workload and application for OS

  9. Existing Approaches • User-space only measurement tools • Many tools only work at user-level • Cannot observe system-level performance influences • Kernel-level only measurement tools • Most only provide the kernel-wide perspective • lack proper mapping/virtualization • Some provide process-centric views • cannot integrate OS and user-level measurements

  10. Existing Approaches (continued) • Combined or integrated user/kernel measurement tools • A few tools allow fine-grained measurement • Can correlate kernel and user-level performance • Typically focus only on direct OS interactions • Indirect interactions not normally merged • Do not explicitly recognize parallel workloads • MPI ranks, OpenMP threads, … • Need an integrated approach to parallel performance observation and analyses that support both perspectives

  11. High-Level Objectives • Support low-overhead OS performance measurement at multiple levels of function and detail • Provide both kernel-wide and process-centric perspectives of OS performance • Merge user-level and kernel-level performance information across all program-OS interactions • Provide online information and the ability to function without a daemon where possible • Support both profiling and tracing for kernel-wide and process-centric views in parallel systems • Leverage existing parallel performance analysis tools • Support for observing, collecting and analyzing parallel data

  12. ZeptoOS • DOE OS/RTS for Extreme Scale Scientific Computation • Effective OS/Runtime for petascale systems • Funded ZeptoOS project • Argonne National Lab and University of Oregon • What are the fundamental limits and advanced designs required for petascale Operating System Suites? • Behaviour at large scales • Management and optimization of OS suites • Collective operations • Fault tolerance • OS performance analysis

  13. ZeptoOS and TAU/KTAU • Lots of fine-grained OS measurement is required for each component of the ZeptoOS work • How and why do the various OS source and configuration changes affect parallel applications? • How do we correlate performance data between • OS components • Parallel application and OS • Solution: TAU/KTAU • An integrated methodology and framework to measure performance of applications and OS kernel

  14. ZeptoOS Strategy • “Small Linux on big computers” • IBM BG/L and other systems (e.g., Cray XT3) • Argonne • Modified Linux on BG/L I/O nodes (ION) • Modified Linux for BG/L compute nodes (TBD) • Specialized I/O daemon on I/O node (ZOID) (TBD) • Oregon • KTAU • integration of TAU infrastructure in Linux Kernel • integration with ZeptoOS and installation on BG/L ION • port to other 32-bit and 64-bit Linux platforms

  15. KTAU Architecture

  16. Controlled Experiments • Controlled Experiments • Exercise kernel in controlled fashion • Check if KTAU produces the expected correct and meaningful views • Test machines • Neutron: 4-CPU Intel P3 Xeon 550MHz, 1GB RAM, Linux 2.6.14.3(ktau) • Neuronic: 16-node 2-CPU Intel P4 Xeon 2.8GHz, 2GB RAM/node, Redhat Enterprise Linux 2.4(ktau) • Benchmarks • NPB LU, SP applications[NPB] • Simulated computational fluid dynamics (CFD) applications. A regular-sparse, block lower and upper triangular system solution. • LMBENCH[LMBENCH] • Suite of micro-benchmarks exercising Linux kernel • Scenarios - Interrupt load, Scheduling load, Exceptions

  17. Controlled ExperimentsObserving Interrupts User-level Inclusive Time User-level Exclusive Time Approx. 25 secs dilation in Total Inclusive time. Why? Approx. 16 secs dilation in MPSP() Exclusive time. Why? Benchmark: NPB-SP application on 16 nodes User-level profile does not tell the whole story!

  18. Controlled ExperimentsObserving Interrupts User+OS Inclusive Time User+OS Exclusive Time MPSP excl. time difference only 4 secs. Excl-time view clearly identifies the culprits. 1. schedule() 2. do_IRQ() 3. icmp_reply() 4. do_softirq() Pings cause interrupts (do_IRQ). Which in turn handled after interrupt by soft-interrupts (do_softirq). Actual routine is icmp_reply/rcv. Large number of softirqs causes ksoftirqd to be scheduled-in, causing SP to be scheduled-out. Kernel-Space Time Taken by: 1.do_softirq() 2. schedule() 3. do_IRQ() 4. sys_poll() 5. icmp_rcv() 6. icmp_reply()

  19. Controlled ExperimentsObserving Scheduling • NPB LU application on 8 CPU (neuronic.nic.uoregon.edu) • Simulate daemon interference using “toy” daemon • Daemon periodically wakes-up and performs activity • What does KTAU show - different views… A: Aggregated Kernel-wide View (Each row is single host)

  20. Controlled ExperimentsObserving Scheduling • Select node 8 and take a closer look … B: Process-level View (Each row is single process on host 8) ‘Toy’ Daemon activity 2 NPB LU processes

  21. Controlled ExperimentsObserving Scheduling • Instrumentation to differentiate voluntary/involuntary schedule • Experiment re-run on 4-processor SMP • Local slowdown - preemptive scheduling • Remote slowdown - voluntary scheduling (waiting!) C: Voluntary / Involuntary Scheduling (Each row is single MPI rank) Pre-empted out by ‘Toy’ daemon Other 3 yield cpu voluntarily and wait!

  22. LMBENCH Page-Fault Controlled ExperimentsObserving Exceptions Call Group Relations Program-OS Call Graph

  23. Merging App / OS Traces MPI_Send OS Routines Controlled ExperimentsTracing Fine-grained Tracing Shows detail inside interrupts and bottom halves Using VAMPIR Trace Visualization [VAMPIR]

  24. Larger-Scale Runs • Run parallel benchmarks on larger-scale (128 dual-cpu nodes) • Identify (and remove) system-level performance issues • Understand perturbation overheads introduced by KTAU • NPB benchmark: LU Application[NPB] • Simulated computational fluid dynamics (CFD) application. A regular-sparse, block lower and upper triangular system solution. • ASC benchmark: Sweep3D[Sweep3d] • Solves a 3-D, time-independent, neutron particle transport equation on an orthogonal mesh. • Test machine: Chiba-City Linux cluster (ANL) • 128 dual-CPU Pentium III, 450MHz, 512MB RAM/node, Linux 2.6.14.2 (ktau) kernel, connected by Ethernet

  25. Larger-Scale Runs • Experienced problems on Chiba by chance • Initially ran NPB-LU and Sweep3D codes on 128x1 configuration • Then ran on 64x2 configuration • Extreme performance hit (72% slower!) with the 64x2 runs • Used KTAU views to identify and solve issues iteratively • Eventually brought performance gap to 13% for LU and 9% for Sweep.

  26. Larger-scale Runs 64x2 Configuration MPI_Recv OS Interactions User-level MPI_Recv Two ranks - relatively very low MPI_Recv() time. Two ranks - MPI_Recv() diff. from Mean in OS-SCHED.

  27. Larger-scale Runs Voluntary Scheduling Preemptive Scheduling Note: x-axis log scale Two ranks have very low voluntary scheduling durations. (Same) Two ranks have very large preemptive scheduling.

  28. Larger-scale Runs ccn10 Node-level View Interrupt Activity NPB LU processes PID:4066, PID:4068 active. No other significant activity! Why the Pre-emption? 64x2 Pinned: Interrupt Activity Bimodal across MPI ranks.

  29. Approx. 100% longer Many more OS-TCP Calls Larger-scale Runs Use ‘Merged’ performance data to identify overlap, imbalance. Why does purely compute bound region have lots of I/O? TCP within Compute : Time TCP within Compute : Calls 100% More background OS-TCP activity in Compute phase. More imbalance! • I/O Compute Overlap - Explicit MPI or OS buffering - Tcp Sends • Imbalance in workload - Tcp Recvs from ontime finishers

  30. Larger-scale Runs Cost / Call of OS-level TCP OS-TCP in SMP Costlier • IRQ-Balancing blindly distributes interrupts and bottom-halves. • E.g.: Handling TCP related BH in CPU-0 for LU-process on CPU-1 • Cache issues! [COMSWARE]

  31. Perturbation Study • Five different Configurations • Baseline: Vanilla kernel, un-instrumented benchmark • Ktau-Off: Kernel instrumentations compiled-in. But turned Off - (disabled probe effect) • Prof-All: All kernel instrumentations turned On • Prof-Sched: Only scheduler subssystem’s instrumentations turned on • Prof-All+TAU: ProfAll, + user-level Tau instrumentation enabled • NPB LU application benchmark: • 16 nodes, 5 different configurations, Mean over 5 runs each • ASC Sweep3D: • 128 nodes, Base and Prof-All+TAU, Mean over 5 runs each. • Test machine: Chiba-City ANL

  32. Perturbation Study Sweep3d on 128 Nodes Base ProfAll+TAU Elapsed Time: 368.25 369.9 % Avg Slow.: 0.49% Complete Integrated Profiling Cost under 3% on Avg. and as low as 1.58%. Disabled probe effect. Single instrumentation very cheap. E.g. Scheduling.

  33. Future Work • Dynamic measurement control • Improve performance data sources • Improve integration with TAU’s user-space capabilities • Better correlation of user and kernel performance • Full callpaths and phase-based profiling • Merged user/kernel traces (already available) • Integration of TAU and KTAU with Supermon • Porting efforts to IA-64, PPC-64, and AMD Opteron • ZeptoOS characterization efforts • BGL I/O node • Dynamically adaptive kernels

  34. Recent Work on ZeptoOS Project • Accurate Identification of “noise” sources • Modified Linux on BG/L should be efficient • Effect of OS “noise” on synchronization / collectives • What OS aspects induce what types of interference • code paths • configurations • devices attached • Requires user-level and OS measurement • If can identify noise sources, then can remove or alleviate interference

  35. Approach • ANL Selfish benchmark to identify “detours” • Noise events in user-space • Shows durations and frequencies of events • Does NOT show cause or source • Runs a tight loop with an expected (ideal) duration • logs times and duration of detours • Use KTAU OS-tracing to record OS activity • Correlate time of occurrence • uses same time source as Selfish benchmark • Infer type of OS-activity (if any) causing the “detour”

  36. OS/User Performance View of Scheduling preemptivescheduling

  37. OS/User View of OS Background Activity

  38. OS/User View of OS Background Activity

  39. Other Applications of KTAU • Need to fill this in … (is there time)

  40. Acknowledgements • Department of Energy’s Office of Science • National Science Foundation • University of Oregon (UO) Core Team • Aroon Nataraj, PhD Student • Prof. Allen D Malony • Dr. Sameer Shende, Senior Scientist • Alan Morris, Senior Software Engineer • Suravee Suthikulpanit , MS Student (Graduated) • Argonne National Lab (ANL) Contributors • Pete Beckman • Kamil Iskra • Kazutomo Yoshii

More Related