Automatic Misconfiguration Diagnosis with PeerPressure Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang Microsoft Research OSDI 2004, San Francisco, CA
Misconfiguration Diagnosis • Technical support contributes 17% of TCO [Tolly2000] • Many application malfunctions stem from misconfigurations • Why? • Shared configuration data (e.g., the Registry) with uncoordinated access and updates from different applications • How about maintaining a golden config state? • Very hard [Larsson2001] • Complex software components and compositions • Third-party applications • …
Outline • Motivation • Goals • Design • Prototype • Evaluation results • Future work • Concluding remarks
Goals • Effectiveness • Small set of sick configuration candidates that contain the root-cause entries • Automation • No second party involvement • No need to remember or identify what is healthy
Intuition behind PeerPressure • Assumption • Applications function correctly on most machines -- malfunctioning is anomaly • Succumb to the peer pressure
An Example • Is R1 sick? Most likely • Is R2 sick? Probably not • Is R3 sick? Maybe not • R3 looks like an operational state • We use Bayesian statistics to estimate the sick probability of a suspect -- our ranking metric
PeerPressure System Overview (figure): the App Tracer records the Registry entries touched while running the faulty app, producing a suspect list of (entry, data) pairs such as HKLM\Software\Msft\... The Canonicalizer normalizes these entries; the Statistical Analyzer then searches and fetches matching entries from the sample database (or from a peer-to-peer troubleshooting community) and outputs the troubleshooting result: a sick probability for each suspect entry (e.g., HKLM\Software\Msft\... with probability 0.6).
The Sick Probability • P(sick) = (N + c) / (N + c·t + c·m·(t − 1)) • N: the number of samples • c: the cardinality of the entry (the number of possible values) • t: the number of suspects • m: the number of samples whose value matches the suspect's value • Properties: • As m increases, P decreases • As c increases, P decreases; this holds even when m = 0, so a larger c implies a smaller P • Sanity check: when t = 1, P = 1, since a lone suspect must be the sick entry
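The ranking metric above can be exercised with a small sketch. This is an illustrative assumption, not the prototype's C# code: the registry entry names and peer values below are made up, and cardinality is approximated by the number of distinct values observed among the peers.

```python
# Sketch of PeerPressure's ranking metric, using the slide's formula
#   P(sick) = (N + c) / (N + c*t + c*m*(t - 1))
# N: number of peer samples, c: cardinality of the entry's values,
# t: number of suspects, m: peer samples matching the suspect's value.

def sick_probability(N, c, t, m):
    return (N + c) / (N + c * t + c * m * (t - 1))

def rank_suspects(suspects, peer_samples):
    """suspects: {entry: local value}; peer_samples: {entry: list of peer values}."""
    t = len(suspects)
    ranked = []
    for entry, value in suspects.items():
        samples = peer_samples[entry]
        N = len(samples)
        c = len(set(samples))      # cardinality, approximated from observed values
        m = samples.count(value)   # how many peers share the suspect's value
        ranked.append((sick_probability(N, c, t, m), entry))
    return sorted(ranked, reverse=True)  # highest sick probability first

# Hypothetical example: three suspect entries, four peer machines.
peers = {
    r"HKLM\Software\App\Enabled": ["1", "1", "1", "1"],  # peers all agree
    r"HKLM\Software\App\Mode":    ["a", "b", "a", "c"],  # high cardinality
    r"HKLM\Software\App\Path":    ["x", "x", "x", "x"],  # peers all agree
}
local = {
    r"HKLM\Software\App\Enabled": "0",  # deviates from every peer: suspicious
    r"HKLM\Software\App\Mode":    "b",
    r"HKLM\Software\App\Path":    "x",
}
for p, entry in rank_suspects(local, peers):
    print(f"{p:.3f}  {entry}")
```

The deviant `Enabled` value ranks first, matching the slide's intuition: a mismatch on a low-cardinality entry that all peers agree on is the strongest anomaly.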
The PeerPressure Prototype • Database of 87 live Windows XP registry snapshots as our sample pool • The Registry: hierarchical persistent storage for named, typed entries • PeerPressure troubleshooter implemented in C# • Needed to "sanitize" the entry values: 1, "1", and "#1" should be treated as the same value • Heuristics: unifying values of entries with different types
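A minimal sketch of the kind of value-sanitization heuristic described above. The prototype's exact rules are not given on the slide, so the normalization steps here (stripping a "#" prefix, parsing numeric strings, case-folding) are illustrative assumptions that happen to unify 1, "1", and "#1".

```python
# Illustrative canonicalizer: unify differently typed encodings of the same
# value so that equivalent registry entries compare equal across machines.
# The specific rules below are assumptions, not the prototype's heuristics.

def canonicalize(value):
    s = str(value).strip()
    if s.startswith("#"):      # e.g. "#1" -> "1"
        s = s[1:]
    try:
        return str(int(s, 0))  # normalize numeric strings, incl. hex like "0x10"
    except ValueError:
        return s.lower()       # fall back to case-insensitive string comparison

assert canonicalize(1) == canonicalize("1") == canonicalize("#1")
```

With entries canonicalized this way, the statistical analyzer can count value matches across machines without spurious mismatches caused by type or encoding differences.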
Outline • Motivation • Goals • Design • Prototype • Evaluation results • Future work • Concluding remarks
Windows Registry Characteristics • Snapshot size: max 333,193; min 77,517; average 198,376; median 198,608 • Cardinality: 87% of entries have cardinality 1; 94% have cardinality <= 2 • Distinct canonicalized entries in GeneBank: 1,476,665 • Common canonicalized entries: 43,913 • Distinct entries after data sanitization: 1,820,706
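The cardinality statistics above can be computed along these lines. This is a hypothetical sketch: it assumes each snapshot is a mapping from canonicalized entry name to canonicalized value, which is not a format the slides specify.

```python
from collections import defaultdict

# Given one registry snapshot per machine (entry -> canonicalized value),
# compute each entry's cardinality (number of distinct values observed
# across machines) and the fractions the slide reports: entries with
# cardinality 1 and with cardinality <= 2.

def cardinality_distribution(snapshots):
    values = defaultdict(set)
    for snap in snapshots:
        for entry, value in snap.items():
            values[entry].add(value)
    cards = [len(v) for v in values.values()]
    total = len(cards)
    return {
        "card==1": sum(c == 1 for c in cards) / total,
        "card<=2": sum(c <= 2 for c in cards) / total,
    }

# Toy data, three machines: entry "a" has cardinality 1, "b" has 2, "c" has 3,
# so card==1 is 1/3 and card<=2 is 2/3 here.
snaps = [
    {"a": "1", "b": "x", "c": "p"},
    {"a": "1", "b": "y", "c": "q"},
    {"a": "1", "b": "x", "c": "r"},
]
print(cardinality_distribution(snaps))
```

The dominance of cardinality-1 entries is what makes the peer-pressure idea work: for most entries, nearly all machines agree on a single value, so a deviation stands out.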
Evaluation Data Set • 87 live Windows XP registry snapshots (in the database) • Half of these snapshots are from three diverse organizations within Microsoft: the Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond • The other half are from machines across Microsoft that were reported to have potential Registry problems • 20 real-world troubleshooting cases with known root causes
Response Time • Number of suspects: 8 to 26,308, with a median of 1,171 • 45 seconds on average, with the SQL Server database hosted on a 2.4 GHz workstation with 1 GB RAM • Sequential database queries dominate the time
Troubleshooting Effectiveness • Metric: rank of the root-cause entry • Results (20 cases): • Rank = 1 for 12 cases • Rank = 2 for 3 cases • Rank = 3, 9, 12, and 16 for one case each • One case could not be solved
Source of False Positives • Nature of the root-cause entry: the root-cause entry itself may have a large cardinality • How unique the other suspects are: a highly customized machine likely produces more noise • The database is not pristine
Impact of the Sample Set Size • A larger sample set does not necessarily yield better accuracy • Strong conformity does not depend on the number of samples • Operational state does not depend on the number of samples • More samples only help when the sample set is not pristine • 10 samples are enough for most cases
Related Work • Blackbox-based techniques • Strider: needs manual identification of the healthy state [Wang '03] • Hardware and software component dependencies [Brown '01] • Much prior work leverages statistics to pinpoint anomalies • Bugs as deviant behavior [Engler et al., SOSP '01] • Host-based intrusion detection based on system calls [Forrest '96] and on registry behavior [Apap et al., '99]
Future Work • We have only scratched the surface! • Multiple root-cause entries • Cross-application troubleshooting • Database maintenance • Privacy • Friends Troubleshooting Network
Concluding Remarks • Automatic misconfiguration diagnosis is possible • Use statistics from the mass to replace manual identification of the healthy state • Initial results are promising