Automatic Misconfiguration Diagnosis with PeerPressure Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang Microsoft Research OSDI 2004, San Francisco, CA
Misconfiguration Diagnosis • Technical support contributes 17% of TCO [Tolly2000] • Many application malfunctions stem from misconfigurations • Why? • Shared configuration data (e.g., the Registry) with uncoordinated access and updates from different applications • How about maintaining a golden config state? • Very hard [Larsson2001] • Complex software components and compositions • Third-party applications • …
Outline • Motivation • Goals • Design • Prototype • Evaluation results • Future work • Concluding remarks
Goals • Effectiveness • Small set of sick configuration candidates that contain the root-cause entries • Automation • No second party involvement • No need to remember or identify what is healthy
Intuition behind PeerPressure • Assumption • Applications function correctly on most machines -- malfunctioning is anomaly • Succumb to the peer pressure
An Example • Is R1 sick? Most likely • Is R2 sick? Probably not • Is R3 sick? Maybe not • R3 looks like an operational state • We use Bayesian statistics to estimate the sick probability of a suspect -- our ranking metric
PeerPressure System Overview (figure): the App Tracer records the Registry entries touched while running the faulty app, producing a suspect list of (entry, data) pairs such as HKLM\Software\Msft\... The Canonicalizer normalizes these entries; the Statistical Analyzer then searches and fetches matching entries from the sample database (or from a peer-to-peer troubleshooting community) and outputs the troubleshooting result: a sick probability for each suspect entry (e.g., HKLM\Software\Msft\... with probability 0.6).
The Sick Probability • P(sick) = (N + c) / (N + c·t + c·m·(t − 1)) • N: the number of samples • c: the cardinality of the entry (the number of possible values) • t: the number of suspects • m: the number of samples whose value matches the suspect's value • Properties: • As m increases, P decreases • As c increases, P decreases; this holds even when m = 0, so a larger c implies a smaller P • Sanity check: when t = 1, P = 1, since a lone suspect must be the sick entry
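The ranking metric above can be exercised with a small sketch. This is an illustrative assumption, not the prototype's C# code: the registry entry names and peer values below are made up, and cardinality is approximated by the number of distinct values observed among the peers.

```python
# Sketch of PeerPressure's ranking metric, using the slide's formula
#   P(sick) = (N + c) / (N + c*t + c*m*(t - 1))
# N: number of peer samples, c: cardinality of the entry's values,
# t: number of suspects, m: peer samples matching the suspect's value.

def sick_probability(N, c, t, m):
    return (N + c) / (N + c * t + c * m * (t - 1))

def rank_suspects(suspects, peer_samples):
    """suspects: {entry: local value}; peer_samples: {entry: list of peer values}."""
    t = len(suspects)
    ranked = []
    for entry, value in suspects.items():
        samples = peer_samples[entry]
        N = len(samples)
        c = len(set(samples))      # cardinality, approximated from observed values
        m = samples.count(value)   # how many peers share the suspect's value
        ranked.append((sick_probability(N, c, t, m), entry))
    return sorted(ranked, reverse=True)  # highest sick probability first

# Hypothetical example: three suspect entries, four peer machines.
peers = {
    r"HKLM\Software\App\Enabled": ["1", "1", "1", "1"],  # peers all agree
    r"HKLM\Software\App\Mode":    ["a", "b", "a", "c"],  # high cardinality
    r"HKLM\Software\App\Path":    ["x", "x", "x", "x"],  # peers all agree
}
local = {
    r"HKLM\Software\App\Enabled": "0",  # deviates from every peer: suspicious
    r"HKLM\Software\App\Mode":    "b",
    r"HKLM\Software\App\Path":    "x",
}
for p, entry in rank_suspects(local, peers):
    print(f"{p:.3f}  {entry}")
```

The deviant `Enabled` value ranks first, matching the slide's intuition: a mismatch on a low-cardinality entry that all peers agree on is the strongest anomaly.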
The PeerPressure Prototype • Database of 87 live Windows XP registry snapshots as our sample pool • The Registry: hierarchical persistent storage for named, typed entries • PeerPressure troubleshooter implemented in C# • Needed to "sanitize" the entry values: 1, "1", and "#1" should be treated as the same value • Heuristics: unifying values of entries with different types
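A minimal sketch of the kind of value-sanitization heuristic described above. The prototype's exact rules are not given on the slide, so the normalization steps here (stripping a "#" prefix, parsing numeric strings, case-folding) are illustrative assumptions that happen to unify 1, "1", and "#1".

```python
# Illustrative canonicalizer: unify differently typed encodings of the same
# value so that equivalent registry entries compare equal across machines.
# The specific rules below are assumptions, not the prototype's heuristics.

def canonicalize(value):
    s = str(value).strip()
    if s.startswith("#"):      # e.g. "#1" -> "1"
        s = s[1:]
    try:
        return str(int(s, 0))  # normalize numeric strings, incl. hex like "0x10"
    except ValueError:
        return s.lower()       # fall back to case-insensitive string comparison

assert canonicalize(1) == canonicalize("1") == canonicalize("#1")
```

With entries canonicalized this way, the statistical analyzer can count value matches across machines without spurious mismatches caused by type or encoding differences.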
Outline • Motivation • Goals • Design • Prototype • Evaluation results • Future work • Concluding remarks
Windows Registry Characteristics • Snapshot size: max 333,193; min 77,517; average 198,376; median 198,608 • Cardinality: 87% of entries have cardinality 1; 94% have cardinality <= 2 • Distinct canonicalized entries in GeneBank: 1,476,665 • Common canonicalized entries: 43,913 • Distinct entries after data sanitization: 1,820,706
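The cardinality statistics above can be computed along these lines. This is a hypothetical sketch: it assumes each snapshot is a mapping from canonicalized entry name to canonicalized value, which is not a format the slides specify.

```python
from collections import defaultdict

# Given one registry snapshot per machine (entry -> canonicalized value),
# compute each entry's cardinality (number of distinct values observed
# across machines) and the fractions the slide reports: entries with
# cardinality 1 and with cardinality <= 2.

def cardinality_distribution(snapshots):
    values = defaultdict(set)
    for snap in snapshots:
        for entry, value in snap.items():
            values[entry].add(value)
    cards = [len(v) for v in values.values()]
    total = len(cards)
    return {
        "card==1": sum(c == 1 for c in cards) / total,
        "card<=2": sum(c <= 2 for c in cards) / total,
    }

# Toy data, three machines: entry "a" has cardinality 1, "b" has 2, "c" has 3,
# so card==1 is 1/3 and card<=2 is 2/3 here.
snaps = [
    {"a": "1", "b": "x", "c": "p"},
    {"a": "1", "b": "y", "c": "q"},
    {"a": "1", "b": "x", "c": "r"},
]
print(cardinality_distribution(snaps))
```

The dominance of cardinality-1 entries is what makes the peer-pressure idea work: for most entries, nearly all machines agree on a single value, so a deviation stands out.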
Evaluation Data Set • 87 live Windows XP registry snapshots (in the database) • Half of these snapshots are from three diverse organizations within Microsoft: the Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond • The other half are from machines across Microsoft that were reported to have potential Registry problems • 20 real-world troubleshooting cases with known root causes
Response Time • Number of suspects: 8 to 26,308, with a median of 1,171 • 45 seconds on average, with the SQL Server database hosted on a 2.4 GHz workstation with 1 GB RAM • Sequential database queries dominate the time
Troubleshooting Effectiveness • Metric: rank of the root-cause entry • Results (20 cases): • Rank = 1 for 12 cases • Rank = 2 for 3 cases • Rank = 3, 9, 12, and 16 for one case each • One case could not be solved
Source of False Positives • Nature of the root-cause entry: the root-cause entry itself may have a large cardinality • How unique the other suspects are: a highly customized machine likely produces more noise • The database is not pristine
Impact of the Sample Set Size • A larger sample set does not necessarily yield better accuracy • Strong conformity does not depend on the number of samples • Operational state does not depend on the number of samples • More samples only help when the sample set is not pristine • 10 samples are enough for most cases
Related Work • Blackbox-based techniques • Strider: needs manual identification of the healthy state [Wang '03] • Hardware and software component dependencies [Brown '01] • Much prior work leverages statistics to pinpoint anomalies • Bugs as deviant behavior [Engler et al., SOSP '01] • Host-based intrusion detection based on system calls [Forrest '96] and on registry behavior [Apap et al., '99]
Future Work • We have only scratched the surface! • Multiple root-cause entries • Cross-application troubleshooting • Database maintenance • Privacy • Friends Troubleshooting Network
Concluding Remarks • Automatic misconfiguration diagnosis is possible • Use statistics from the mass to replace manual identification of the healthy state • Initial results are promising