Automatic Misconfiguration Diagnosis with PeerPressure


Presentation Transcript


  1. Automatic Misconfiguration Diagnosis with PeerPressure Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang Microsoft Research OSDI 2004, San Francisco, CA

  2. Misconfiguration Diagnosis • Technical support contributes 17% of TCO [Tolly2000] • Much application malfunctioning comes from misconfigurations • Why? • Shared configuration data (e.g., the Registry) with uncoordinated accesses and updates from different applications • How about maintaining a golden configuration state? • Very hard [Larsson2001] • Complex software components and compositions • Third-party applications • …

  3. Outline • Motivation • Goals • Design • Prototype • Evaluation results • Future work • Concluding remarks

  4. Goals • Effectiveness • Small set of sick configuration candidates that contain the root-cause entries • Automation • No second party involvement • No need to remember or identify what is healthy

  5. Intuition behind PeerPressure • Assumption: applications function correctly on most machines -- malfunctioning is an anomaly • Succumb to the peer pressure

  6. An Example • Is R1 sick? Most likely • Is R2 sick? Probably not • Is R3 sick? Maybe not • R3 looks like an operational state • We use Bayesian statistics to estimate the sick probability of a suspect -- our ranking metric

  7. PeerPressure System Overview [diagram] • App Tracer: run the faulty app and record the registry entry suspects with their data (e.g., HKLM\Software\Msft\... = On, HKLM\System\Setup\... = 0, HKCU\%\Software\... = null) • Canonicalizer: unify entry values • Search & Fetch: retrieve peer samples from the database (or from a peer-to-peer troubleshooting community) • Statistical Analyzer: compute each entry's sick probability and emit the ranked troubleshooting result (e.g., 0.6, 0.2, 0.003)
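To make the data flow concrete, here is a minimal, self-contained sketch of the pipeline in the diagram. Stage names follow the diagram, but every type, method, and value is invented for illustration, and the analyzer stage is a placeholder (slide 8 gives the real metric):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Data-flow sketch of the pipeline in the diagram above. Stage names follow
// the diagram; all types, methods, and values here are invented.
record Suspect(string Entry, string Value);
record Ranked(string Entry, double SickProb);

static class Pipeline
{
    // App Tracer: registry entries touched while running the faulty app.
    static IEnumerable<Suspect> Trace() => new[]
    {
        new Suspect(@"HKLM\Software\Msft\...", "On"),
        new Suspect(@"HKLM\System\Setup\...", "0"),
    };

    // Canonicalizer: unify value spellings before comparing against peers.
    static Suspect Canonicalize(Suspect s) =>
        s with { Value = s.Value.Trim().ToLowerInvariant() };

    // Statistical Analyzer stand-in: a fixed placeholder score keeps the
    // sketch self-contained; the real ranking metric appears on slide 8.
    static Ranked Analyze(Suspect s) => new(s.Entry, 0.5);

    static void Main()
    {
        // Trace -> canonicalize -> (search & fetch peer samples) -> analyze -> rank.
        var report = Trace().Select(Canonicalize).Select(Analyze)
                            .OrderByDescending(r => r.SickProb);
        foreach (var r in report) Console.WriteLine($"{r.SickProb:F3}  {r.Entry}");
    }
}
```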

  8. The Sick Probability • P(sick) = (N + c) / (N + c·t + c·m·(t − 1)) • N: the number of samples • c: the cardinality (number of distinct values of the entry across the samples) • t: the number of suspects • m: the number of samples whose value matches the suspect entry's value • Properties: • As m increases, P decreases • As c increases, P decreases • When m = 0, smaller c implies larger P (strong conformity makes a mismatch more suspicious)
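A minimal sketch of this ranking metric as given on the slide; the entries, cardinalities, and match counts below are invented for illustration, and this is not the paper's code:

```csharp
using System;
using System.Linq;

// Sketch of the slide's ranking metric. Entry names, cardinalities, and
// match counts are hypothetical; this is not the PeerPressure prototype.
class SickProbabilitySketch
{
    // P(sick) = (N + c) / (N + c*t + c*m*(t - 1)), per the slide:
    // N = #samples, c = cardinality, t = #suspects, m = #matching samples.
    static double SickProbability(int n, int c, int t, int m) =>
        (double)(n + c) / (n + c * t + (double)c * m * (t - 1));

    static void Main()
    {
        const int n = 87;  // sample machines in the database
        var suspects = new[]
        {
            // (entry, cardinality c, matches m) -- hypothetical values
            (Entry: @"HKLM\Software\...\UrlEncoding", C: 1,  M: 0),  // strong conformity, no match: very suspicious
            (Entry: @"HKCU\...\MRU\timestamp",        C: 80, M: 0),  // high cardinality: naturally varies
            (Entry: @"HKLM\System\Setup\...",         C: 2,  M: 60), // most peers match: probably healthy
        };
        int t = suspects.Length;

        // Rank suspects by descending sick probability; the root cause
        // should surface at (or near) the top of the list.
        foreach (var s in suspects.OrderByDescending(x => SickProbability(n, x.C, t, x.M)))
            Console.WriteLine($"{SickProbability(n, s.C, t, s.M):F3}  {s.Entry}");
    }
}
```

With these hypothetical numbers, the strong-conformity mismatch (c = 1, m = 0) ranks far above the high-cardinality entry, which is exactly the behavior the property list above describes.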

  9. The PeerPressure Prototype • Database of 87 live Windows XP Registry snapshots as the sample pool (the Registry is hierarchical persistent storage for named, typed entries) • PeerPressure troubleshooter implemented in C# • Entry values needed to be "sanitized" so that, e.g., 1, "1", and #1 compare equal • Heuristics: unify the values of entries that carry different types
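The slide only names the sanitization problem; the following is one plausible guess, not the prototype's actual rules, at a heuristic that makes 1, "1", and #1 compare equal:

```csharp
using System;

// One plausible guess at the value-sanitization heuristic named on the
// slide -- not the prototype's actual rules. Goal: 1, "1", and #1 (the same
// logical value stored under different Registry types) compare equal.
static class Canonicalizer
{
    public static string Canonicalize(object value)
    {
        string s = value?.ToString() ?? "null";
        s = s.Trim().Trim('"');                     // "1" -> 1
        if (s.StartsWith("#")) s = s.Substring(1);  // #1  -> 1
        // Unify numeric spellings (e.g., 01 and 1) into one decimal form.
        if (long.TryParse(s, out long n)) return n.ToString();
        return s.ToLowerInvariant();                // case-insensitive strings
    }

    static void Main()
    {
        // All three spellings canonicalize to the same token: "1".
        foreach (var v in new object[] { 1, "\"1\"", "#1" })
            Console.WriteLine(Canonicalize(v));
    }
}
```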

  10. Outline • Motivation • Goals • Design • Prototype • Evaluation results • Future work • Concluding remarks

  11. Windows Registry Characteristics • Registry size (entries per machine): max 333,193; min 77,517; average 198,376; median 198,608 • Cardinality: 87% of entries have cardinality 1, 94% have cardinality ≤ 2 • Distinct canonicalized entries in GeneBank: 1,476,665 • Common canonicalized entries: 43,913 • Distinct data-sanitized entries: 1,820,706

  12. Evaluation Data Set • 87 live Windows XP registry snapshots (in the database) • Half of these snapshots are from three diverse organizations within Microsoft: Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond. • The other half are from machines across Microsoft that were reported to have potential Registry problems • 20 real-world troubleshooting cases with known root-causes

  13. Response Time • Number of suspects: 8 to 26,308, with a median of 1,171 • 45 seconds on average, with the SQL Server database hosted on a 2.4 GHz workstation with 1 GB RAM • Sequential database queries dominate the running time

  14. Troubleshooting Effectiveness • Metric: rank of the root-cause entry • Results (20 cases): • Rank = 1 for 12 cases • Rank = 2 for 3 cases • Ranks 3, 9, 12, and 16 for the remaining 4 cases • One case could not be solved

  15. Sources of False Positives • Nature of the root-cause entry: a root cause with large cardinality ranks lower • Uniqueness of the other suspects: a highly customized machine likely produces more noise • The database is not pristine

  16. Impact of the Sample Set Size • A larger sample set does not necessarily yield better accuracy • Strong conformity does not depend on the number of samples • Whether a state is operational does not depend on the number of samples • More samples only help when the sample set is not pristine • 10 samples are enough for most cases

  17. Related Work • Blackbox-based techniques • Strider: requires identifying a healthy state [Wang '03] • Hardware and software component dependencies [Brown '01] • Much prior work on leveraging statistics to pinpoint anomalies • Bugs as deviant behavior [Engler et al., SOSP '01] • Host-based intrusion detection based on system calls [Forrest '96] and on registry behavior [Apap et al. '99]

  18. Future Work • We have only scratched the surface! • Multiple root-cause entries • Cross-application troubleshooting • Database maintenance • Privacy • Friends Troubleshooting Network

  19. Concluding Remarks • Automatic misconfiguration diagnosis is possible • Statistics from the masses automate what used to be manual identification of the healthy state • Initial results are promising
