1 / 15

Reference-Driven Performance Anomaly Identification

Reference-Driven Performance Anomaly Identification. University of Rochester. Kai Shen, Christopher Stewart, Chuanpeng Li, and Xin Li. Performance Anomalies. Complex software systems (like operating systems and distributed systems): Many system features and configuration settings

dard
Download Presentation

Reference-Driven Performance Anomaly Identification

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Reference-Driven Performance Anomaly Identification University of Rochester Kai Shen, Christopher Stewart, Chuanpeng Li, and Xin Li SIGMETRICS 2009

  2. Performance Anomalies • Complex software systems (like operating systems and distributed systems): • Many system features and configuration settings • Wide-ranging workload behaviors and concurrency • Their interactions • Performance anomalies: • Low performance against expectation • Due to implementation errors, mis-configurations, or mis-managed interactions, … • Anomalies degrade the system performance; make system behaviors undependable SIGMETRICS 2009

  3. An Example Identified by Our Research • Linux anticipatory I/O scheduler • HZ is number of timer ticks per second, so (HZ/150) ticks is around 6.7ms. • However, inaccurate integer divisions: • HZ defaults to 1000 at earlier Linux versions, so anticipation timeout is 6 ticks. • It defaults to 250 at Linux 2.6.23, so timeout becomes one tick. Premature timeouts lead to additional disk seeks. /* max time we may wait to anticipate a read (default around 6ms) */ #define default_antic_expire ((HZ / 150) ? HZ / 150 : 1) SIGMETRICS 2009

  4. Challenges and Goals • Challenges: • Often involving semantics of multiple system components • No obvious failure symptoms; normal performance isn’t always known or even clearly defined • Performance anomaly identifications relatively rare: • 4% of resolved Linux 2.4/2.6 I/O bugs are performance-oriented • Goals: • Systematic techniques to identify performance anomalies; improve performance dependability • Consider wide-ranging configurations and workload conditions SIGMETRICS 2009

  5. Reference-driven Anomaly Identification • Given two executions T (target) and R (reference): • If T performs much worse than R against expectation, we identify T as anomalous to R. • Examples: • How to systematically derive the expectations? SIGMETRICS 2009

  6. Change Profiles • Goal – derive expected performance deviations between reference and target (or with a change of system parameters) • Approach – inference from real system measurements • Change profile – probabilistic distribution of performance deviations • p-value(–0.5) = 0.039 SIGMETRICS 2009

  7. Scalable Anomaly Quantification • Approach: • Construct single-para. profiles through real system measurements • Analytically synthesize multiple single-para. profiles for scalability • Convolution-like synthesis • Assuming independent performance effects of different parameters • Assemble multi-para. performance deviation distribution using convolutions of single-para. change profiles • Generally applicable bounding analysis • Bound multi-para. p-value anomaly from single-para. p-values (no need for parameter independence) • Find a tight bound (small p-value) through Monte Carlo method SIGMETRICS 2009

  8. Evaluation • Linux I/O case study: • Five workload parameters and three system conf. parameters • Performance measurements at 300 sampled executions; use each other as references to identify anomalies • Anomalies are target executions with p-values 0.05 or less • Validate through cause analysis; probable false positive without validated cause • Results • Linux 2.6.10 – 35 identified; 34 validated; 1 probable false positive • Linux 2.6.23 – 12 identified; 9 validated; 3 probable false positives • Linux 2.6.23 (target) vs. 2.6.10 (reference) – 15 identified; all validated SIGMETRICS 2009

  9. Comparison • Bounding analysis for multi-parameter anomaly quantification • Convolution synthesis assuming parameter independence • Rank target-reference anomaly using raw perf. difference • Convolution identifies more anomalies, but higher false positives SIGMETRICS 2009

  10. Anomaly Cause Analysis • Given symptom (anomalous perf. degradation from reference to target), root cause analysis is still challenging • Root cause sometimes lies in complex component interactions • Most useful hints often relate to low-level system activities • Efficient mechanisms available to acquire large amount of system metrics (some anomaly-related); but difficult to sift through • Approach: reference-driven filtering of anomaly-related metrics • Compare metric manifestations of an anomalous target and its normal reference • Those that differ significantly may be anomaly-related SIGMETRICS 2009

  11. System Events and Metrics Traced Events: • Process management • creation of a kernel thread; process fork or clone; process exit; process wait; process signal; wake up a process; CPU context switch • System call • enter a system call; exit a system call • Memory system • allocating pages; freeing pages; swapping pages in; swapping pages out • File system • file exec; file open; file close; file read; file write; file seek; file ioctl; file prefetch operation; starting to wait for a data buffer; end to wait for a data buffer • IO scheduling • I/O request arrival at the block level; re-queue an I/O request; dispatch an I/O request; remove an I/O request; I/O request completion • SCSI device • SCSI read request; SCSI write request • Interrupt • enter an interrupt handler; Exit an interrupt handler • Network socket • socket call; socket send; socket receive; socket creation Derived System Metrics: • Inter-arrival time of each type of events • Delays between causal events • delay between a system call enter and exit • delay between file system buffer wait start and end • delay between a block-level I/O request arrival and is dispatch • delay between a block-level I/O request dispatch and its completion • Parameter of events • file prefetch size • SCSI I/O request size • file offset of each I/O operation to block device • I/O concurrency • system call level • block level • SCSI device level up to 1361 metrics in Linux 2.6.23 SIGMETRICS 2009

  12. A Case Result • Anomaly cause: incorrect timeout setting when timer ticks per second (HZ) changes from 1000 to 250 in Linux 2.6.23 • Top ranked metrics – anticipatory I/O timeouts and anticipation breaks #define default_antic_expire ((HZ / 150) ? HZ / 150 : 1) SIGMETRICS 2009

  13. Effects of Anomaly Corrections • Anomaly corrections lead to predictable performance behavior patterns SIGMETRICS 2009

  14. Related Work • Peer differencing for debugging • Delta debugging [Zeller’02]: differencing program runs of various inputs • PeerPressure [Wang et al.’04]: differencing Windows registry settings • Triage [Tucek et al.’07]: differencing basic block execution frequency →Target program/system failures; failure symptoms easily identifiable; correct peers presumably known • Our performance anomaly identification • Challenge: both anomalous and normal performance behaviors are hard to identify in complex systems • Key contribution: scalable construction of performance deviation profiles SIGMETRICS 2009

  15. Summary • Principled use of references in performance anomaly identification • Scalable construction of performance deviation profiles to identify anomaly symptoms • Target-reference differencing of system metric manifestations to help identify anomaly causes • Identified real performance problems in Linux and J2EE-based distributed system SIGMETRICS 2009

More Related