400 likes | 514 Views
Kernel-level Measurement for Integrated Parallel Performance Views. A. Nataraj, A. Malony, S. Shende, A. Morris {anataraj,malony,sameer,amorris}@cs.uoregon.edu Performance Research Lab University of Oregon. Outline. Motivations Objectives ZeptoOS project KTAU Architecture
E N D
Kernel-level Measurement for Integrated Parallel Performance Views A. Nataraj, A. Malony, S. Shende, A. Morris {anataraj,malony,sameer,amorris}@cs.uoregon.edu Performance Research Lab University of Oregon
Outline • Motivations • Objectives • ZeptoOS project • KTAU Architecture • Case Studies - the Performance Views • Perturbation Study • KTAU improvements • Future work • Acknowledgements
Motivation • Application performance is a consequence of • User-level execution • OS-level operation • Good tools exist for observing user-level performance • User-level events • Communication events • Execution time measures • Hardware performance • Fewer tools exist to observe OS-level aspects • Ideally would like to do both simultaneously • OS-level influences on application performance
Scale and Performance Sensitivity • HPC systems continue to scale to larger processor counts • Application performance more performance sensitive • OS factors can lead to performance bottlenecks[Petrini’03, Jones’03, …] • System/application performance effects are complex • Isolating system-level factors is non-trivial • Require comprehensive performance understanding • Observation of all performance factors • Relative contributions and interrelationship • Can we correlate OS and application performance?
Phase Performance Effects Waiting timedue to OS Overhead accumulates
Program - OS Interactions • Direct • applications invoke the OS for certain services • syscalls and internal OS routines called from syscalls • Indirect • OS operations w/o explicit invocation by application • preemptive scheduling • (HW) interrupt handling • OS-background activity • can occur at any time
Program - OS Interactions (continued) • Direct interactions easier to handle • Synchronous with user-code • In process-context • Indirect interactions more difficult • Usually asynchronous • Usually in interrupt-context • Harder to measure • where are the boundaries? • Harder to correlate and integrate with application measurements
Performance Perspectives • Kernel-wide • Aggregate kernel activity of all active processes • Understand overall OS behavior • Identify and remove kernel hot spots • Cannot show application-specific OS actions • Process-centric • OS performance in specific application context • Virtualization and mapping performance to process • Programs, daemons, and system services interactions • Expose sources of performance problems • Tune OS for specific workload and application for OS
Existing Approaches • User-space only measurement tools • Many tools only work at user-level • Cannot observe system-level performance influences • Kernel-level only measurement tools • Most only provide the kernel-wide perspective • lack proper mapping/virtualization • Some provide process-centric views • cannot integrate OS and user-level measurements
Existing Approaches (continued) • Combined or integrated user/kernel measurement tools • A few tools allow fine-grained measurement • Can correlate kernel and user-level performance • Typically focus only on direct OS interactions • Indirect interactions not normally merged • Do not explicitly recognize parallel workloads • MPI ranks, OpenMP threads, … • Need an integrated approach to parallel performance observation and analyses that support both perspectives
High-Level Objectives • Support low-overhead OS performance measurement at multiple levels of function and detail • Provide both kernel-wide and process-centric perspectives of OS performance • Merge user-level and kernel-level performance information across all program-OS interactions • Provide online information and the ability to function without a daemon where possible • Support both profiling and tracing for kernel-wide and process-centric views in parallel systems • Leverage existing parallel performance analysis tools • Support for observing, collecting and analyzing parallel data
ZeptoOS • DOE OS/RTS for Extreme Scale Scientific Computation • Effective OS/Runtime for petascale systems • Funded ZeptoOS project • Argonne National Lab and University of Oregon • What are the fundamental limits and advanced designs required for petascale Operating System Suites? • Behaviour at large scales • Management and optimization of OS suites • Collective operations • Fault tolerance • OS performance analysis
ZeptoOS and TAU/KTAU • Lots of fine-grained OS measurement is required for each component of the ZeptoOS work • How and why do the various OS source and configuration changes affect parallel applications? • How do we correlate performance data between • OS components • Parallel application and OS • Solution: TAU/KTAU • An integrated methodology and framework to measure performance of applications and OS kernel
ZeptoOS Strategy • “Small Linux on big computers” • IBM BG/L and other systems (e.g., Cray XT3) • Argonne • Modified Linux on BG/L I/O nodes (ION) • Modified Linux for BG/L compute nodes (TBD) • Specialized I/O daemon on I/O node (ZOID) (TBD) • Oregon • KTAU • integration of TAU infrastructure in Linux Kernel • integration with ZeptoOS and installation on BG/L ION • port to other 32-bit and 64-bit Linux platforms
Controlled Experiments • Controlled Experiments • Exercise kernel in controlled fashion • Check if KTAU produces the expected correct and meaningful views • Test machines • Neutron: 4-CPU Intel P3 Xeon 550MHz, 1GB RAM, Linux 2.6.14.3(ktau) • Neuronic: 16-node 2-CPU Intel P4 Xeon 2.8GHz, 2GB RAM/node, Redhat Enterprise Linux 2.4(ktau) • Benchmarks • NPB LU, SP applications[NPB] • Simulated computational fluid dynamics (CFD) applications. A regular-sparse, block lower and upper triangular system solution. • LMBENCH[LMBENCH] • Suite of micro-benchmarks exercising Linux kernel • Scenarios - Interrupt load, Scheduling load, Exceptions
Controlled ExperimentsObserving Interrupts User-level Inclusive Time User-level Exclusive Time Approx. 25 secs dilation in Total Inclusive time. Why? Approx. 16 secs dilation in MPSP() Exclusive time. Why? Benchmark: NPB-SP application on 16 nodes User-level profile does not tell the whole story!
Controlled ExperimentsObserving Interrupts User+OS Inclusive Time User+OS Exclusive Time MPSP excl. time difference only 4 secs. Excl-time view clearly identifies the culprits. 1. schedule() 2. do_IRQ() 3. icmp_reply() 4. do_softirq() Pings cause interrupts (do_IRQ). Which in turn handled after interrupt by soft-interrupts (do_softirq). Actual routine is icmp_reply/rcv. Large number of softirqs causes ksoftirqd to be scheduled-in, causing SP to be scheduled-out. Kernel-Space Time Taken by: 1.do_softirq() 2. schedule() 3. do_IRQ() 4. sys_poll() 5. icmp_rcv() 6. icmp_reply()
Controlled ExperimentsObserving Scheduling • NPB LU application on 8 CPU (neuronic.nic.uoregon.edu) • Simulate daemon interference using “toy” daemon • Daemon periodically wakes-up and performs activity • What does KTAU show - different views… A: Aggregated Kernel-wide View (Each row is single host)
Controlled ExperimentsObserving Scheduling • Select node 8 and take a closer look … B: Process-level View (Each row is single process on host 8) ‘Toy’ Daemon activity 2 NPB LU processes
Controlled ExperimentsObserving Scheduling • Instrumentation to differentiate voluntary/involuntary schedule • Experiment re-run on 4-processor SMP • Local slowdown - preemptive scheduling • Remote slowdown - voluntary scheduling (waiting!) C: Voluntary / Involuntary Scheduling (Each row is single MPI rank) Pre-empted out by ‘Toy’ daemon Other 3 yield cpu voluntarily and wait!
LMBENCH Page-Fault Controlled ExperimentsObserving Exceptions Call Group Relations Program-OS Call Graph
Merging App / OS Traces MPI_Send OS Routines Controlled ExperimentsTracing Fine-grained Tracing Shows detail inside interrupts and bottom halves Using VAMPIR Trace Visualization [VAMPIR]
Larger-Scale Runs • Run parallel benchmarks on larger-scale (128 dual-cpu nodes) • Identify (and remove) system-level performance issues • Understand perturbation overheads introduced by KTAU • NPB benchmark: LU Application[NPB] • Simulated computational fluid dynamics (CFD) application. A regular-sparse, block lower and upper triangular system solution. • ASC benchmark: Sweep3D[Sweep3d] • Solves a 3-D, time-independent, neutron particle transport equation on an orthogonal mesh. • Test machine: Chiba-City Linux cluster (ANL) • 128 dual-CPU Pentium III, 450MHz, 512MB RAM/node, Linux 2.6.14.2 (ktau) kernel, connected by Ethernet
Larger-Scale Runs • Experienced problems on Chiba by chance • Initially ran NPB-LU and Sweep3D codes on 128x1 configuration • Then ran on 64x2 configuration • Extreme performance hit (72% slower!) with the 64x2 runs • Used KTAU views to identify and solve issues iteratively • Eventually brought performance gap to 13% for LU and 9% for Sweep.
Larger-scale Runs 64x2 Configuration MPI_Recv OS Interactions User-level MPI_Recv Two ranks - relatively very low MPI_Recv() time. Two ranks - MPI_Recv() diff. from Mean in OS-SCHED.
Larger-scale Runs Voluntary Scheduling Preemptive Scheduling Note: x-axis log scale Two ranks have very low voluntary scheduling durations. (Same) Two ranks have very large preemptive scheduling.
Larger-scale Runs ccn10 Node-level View Interrupt Activity NPB LU processes PID:4066, PID:4068 active. No other significant activity! Why the Pre-emption? 64x2 Pinned: Interrupt Activity Bimodal across MPI ranks.
Approx. 100% longer Many more OS-TCP Calls Larger-scale Runs Use ‘Merged’ performance data to identify overlap, imbalance. Why does purely compute bound region have lots of I/O? TCP within Compute : Time TCP within Compute : Calls 100% More background OS-TCP activity in Compute phase. More imbalance! • I/O Compute Overlap - Explicit MPI or OS buffering - Tcp Sends • Imbalance in workload - Tcp Recvs from ontime finishers
Larger-scale Runs Cost / Call of OS-level TCP OS-TCP in SMP Costlier • IRQ-Balancing blindly distributes interrupts and bottom-halves. • E.g.: Handling TCP related BH in CPU-0 for LU-process on CPU-1 • Cache issues! [COMSWARE]
Perturbation Study • Five different Configurations • Baseline: Vanilla kernel, un-instrumented benchmark • Ktau-Off: Kernel instrumentations compiled-in. But turned Off - (disabled probe effect) • Prof-All: All kernel instrumentations turned On • Prof-Sched: Only scheduler subssystem’s instrumentations turned on • Prof-All+TAU: ProfAll, + user-level Tau instrumentation enabled • NPB LU application benchmark: • 16 nodes, 5 different configurations, Mean over 5 runs each • ASC Sweep3D: • 128 nodes, Base and Prof-All+TAU, Mean over 5 runs each. • Test machine: Chiba-City ANL
Perturbation Study Sweep3d on 128 Nodes Base ProfAll+TAU Elapsed Time: 368.25 369.9 % Avg Slow.: 0.49% Complete Integrated Profiling Cost under 3% on Avg. and as low as 1.58%. Disabled probe effect. Single instrumentation very cheap. E.g. Scheduling.
Future Work • Dynamic measurement control • Improve performance data sources • Improve integration with TAU’s user-space capabilities • Better correlation of user and kernel performance • Full callpaths and phase-based profiling • Merged user/kernel traces (already available) • Integration of TAU and KTAU with Supermon • Porting efforts to IA-64, PPC-64, and AMD Opteron • ZeptoOS characterization efforts • BGL I/O node • Dynamically adaptive kernels
Recent Work on ZeptoOS Project • Accurate Identification of “noise” sources • Modified Linux on BG/L should be efficient • Effect of OS “noise” on synchronization / collectives • What OS aspects induce what types of interference • code paths • configurations • devices attached • Requires user-level and OS measurement • If can identify noise sources, then can remove or alleviate interference
Approach • ANL Selfish benchmark to identify “detours” • Noise events in user-space • Shows durations and frequencies of events • Does NOT show cause or source • Runs a tight loop with an expected (ideal) duration • logs times and duration of detours • Use KTAU OS-tracing to record OS activity • Correlate time of occurrence • uses same time source as Selfish benchmark • Infer type of OS-activity (if any) causing the “detour”
OS/User Performance View of Scheduling preemptivescheduling
Other Applications of KTAU • Need to fill this in … (is there time)
Acknowledgements • Department of Energy’s Office of Science • National Science Foundation • University of Oregon (UO) Core Team • Aroon Nataraj, PhD Student • Prof. Allen D Malony • Dr. Sameer Shende, Senior Scientist • Alan Morris, Senior Software Engineer • Suravee Suthikulpanit , MS Student (Graduated) • Argonne National Lab (ANL) Contributors • Pete Beckman • Kamil Iskra • Kazutomo Yoshii