
A Case for an Open Source Data Repository

Archana Ganapathi, Department of EECS, UC Berkeley (archanag@cs.berkeley.edu)


Presentation Transcript


  1. A Case for an Open Source Data Repository Archana Ganapathi Department of EECS, UC Berkeley (archanag@cs.berkeley.edu)

  2. Why do we study failure data? • Understand cause->effect relationship between configurations and system behavior • Still don’t have a complete understanding of failures in systems • Can’t worry about fixing problems if we don’t understand them in the first place • Gauge behavioral changes over time • Need realistic workload/faultload data to test/evaluate systems • Success stories…people have benefited from failure data analysis

  3. Crash data collection success stories • Berkeley EECS • BOINC • 2 Unnamed Companies

  4. …So Why Does Windows Crash?

  5. Definitions • Crash • Event caused by a problem in the operating system (OS) or an application (app) • Requires an OS or app restart. • Application Crash • A crash occurring at user level, caused by one or more components (.exe/.dll files) • Requires an application restart. • Application Hang • An application crash that results from the user terminating a process that is potentially deadlocked or running an infinite loop. • The component (.exe/.dll file routine) causing the loop/deadlock cannot be identified (yet) • OS Crash • A crash occurring at kernel level, caused by memory corruption, bad drivers or faulty system-level routines. • Blue-screen-generating crashes require a machine reboot • Windows Explorer crashes require restarting the explorer process. • Bluescreen • An OS crash that produces a user-visible blue screen followed by a non-optional machine reboot.

  6. Procedure • Collect crash dumps from two different sources • UC Berkeley EECS department • BOINC volunteers • Filter data/form crash clusters to avoid double-counting • Account for shared resources, dependent processes, system instability, user retry • Parse/interpret crash dumps using Debugging Tools for Windows • Study both application crashes and operating system crashes • Supplement crash data with usage data
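The "form crash clusters to avoid double-counting" step could be sketched roughly as below. This is only an illustration, not the procedure used in the study: the `Crash` record, its field names, and the 10-minute window are all assumptions. The idea is that crashes on the same machine that follow each other closely (for example, a user retrying a failed action) are counted as one cluster.

```python
from dataclasses import dataclass


@dataclass
class Crash:
    machine: str      # hypothetical field: identifier of the reporting machine
    app: str          # hypothetical field: name of the crashing application
    timestamp: float  # seconds since an arbitrary epoch


def cluster_crashes(crashes, window=600.0):
    """Group crashes per machine; a crash joins the machine's latest cluster
    if it occurs within `window` seconds of that cluster's last crash."""
    clusters = []
    open_cluster = {}  # machine -> most recent cluster for that machine
    for crash in sorted(crashes, key=lambda c: (c.machine, c.timestamp)):
        current = open_cluster.get(crash.machine)
        if current and crash.timestamp - current[-1].timestamp <= window:
            current.append(crash)
        else:
            current = [crash]
            clusters.append(current)
            open_cluster[crash.machine] = current
    return clusters


# Made-up events, for illustration only.
crashes = [
    Crash("m1", "app_a.exe", 0.0),
    Crash("m1", "app_a.exe", 30.0),    # plausibly a user retry
    Crash("m1", "app_b.exe", 5000.0),
    Crash("m2", "app_a.exe", 10.0),
]
clusters = cluster_crashes(crashes)
cluster_sizes = [len(c) for c in clusters]
```

A time window alone does not capture shared resources or dependent processes; those would need extra keys (for example, the faulting component) in the grouping.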

  7. EECS Dataset

  8. Crashes reported per month

  9. Usage/Crashes per day of week • EECS department users use their EECS computers Monday through Friday. • Few users use computers on weekends. • Crashes do not occur uniformly across the five days of the working week.

  10. Usage/Crashes per hour of day • Most people work during the typical hours of 9am to 5pm. • Our data set involves users of various affiliations to the department, hence the wider spectrum of work schedules

  11. Reboot Frequency

  12. Automatic Clustering Experiment for Categorizing Apps • Augment the crash data with information about usage patterns and program dependencies • Feed data into the k-means and agglomerative clustering algorithms to determine which applications are behaviorally related. • We determined that we did not have enough data to derive a method to categorize applications in our data set • Need several instances of every (application, component, error code) combo • As a last resort, we chose to categorize apps by application functionality
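For readers unfamiliar with agglomerative clustering, here is a minimal single-linkage sketch in plain Python. It is not the experiment's code: the application names, the two features, and their values are invented for illustration, and the linkage criterion used in the study is not stated in the slides.

```python
import math


def agglomerative(points, k):
    """Single-linkage agglomerative clustering.

    `points` maps a name to a numeric feature tuple. Start with one cluster
    per name and repeatedly merge the two closest clusters until k remain.
    """
    clusters = [[name] for name in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[x], points[y]) for x in a for y in b)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical features: (crashes per usage hour, distinct faulting components).
apps = {
    "browser_a": (0.9, 5.0),
    "browser_b": (1.0, 5.5),
    "editor_a": (0.1, 1.0),
    "editor_b": (0.2, 1.2),
}
groups = agglomerative(apps, 2)
```

With only a handful of observations per application, as the slide notes, groupings like this are not stable enough to trust, which is why a manual categorization was used instead.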

  13. Crash Cause by Application Category

  14. Application Hang vs Crashes due to Faulty Component

  15. Which applications hang?

  16. Which components cause crashes?

  17. BOINC http://winerror.cs.berkeley.edu/crashcollection/ • Berkeley Open Infrastructure for Network Computing • Users download the BOINC client app • Crash dumps are scraped/sent to BOINC servers • Currently 791 accounts created for crash collection + resource management • 492 users for crash collection

  18. OS Crashes • Driver faults • asynchronous events • code must follow kernel programming etiquette • exceedingly difficult to debug • Memory corruption • Hardware problems (e.g. non-ECC mem) • Software-related • 47 of these in our dataset so far…don’t have tools to analyze these in detail

  19. OS crash-causing images (based on 150 BOINC users, 562 crashes)

  20. Crash generating driver fault type

  21. Summary of crash analysis • Application crashes are caused both by faulty, non-robust .dll files and by impatient users • OS crashes are predominantly caused by poorly written device driver code • Commonly used core components are blamed for most crashes • Need to improve reliability of these components

  22. Practical techniques to reduce crashes • Software-Based Fault Isolation • Nooks • Separate protection level for drivers • Move driver code to user libraries • Virtual Machine for each unsafe/distrusted app

  23. Lessons from crash data study • Clearly people want to know what’s wrong and how to fix it • The more feedback we give, the more data sets we receive • ...but it’s not as easy as it sounds

  24. What kinds of data should we collect? • Failure data • Configuration information • Logs of normal behavior • Usage data • Performance logs • Annotations of data • Collect data for Individual Machines + Services

  25. Why are people afraid of sharing data? • Fear of public humiliation (reverse engineering what user was doing) • Revealing problems within their organization • Fear of competitors using data against them • Revealing loopholes through which malware can easily propagate. • Revealing dependability problems in third party products (MS)

  26. Non-technical challenges to getting data • Collecting (useful) data is tedious • What information is “necessary and sufficient” to understand data trends? • Privacy concerns • Especially with usage data • Finding the person with access to data • No central location that can be queried for data • Legal agreements take a long time to draft • Researchers are more willing to share data than lawyers • Publicity

  27. Technical solution • Amortize the cost of data collection by building an open source repository • Provide a set of tools to cleanse and mine the data

  28. What tools should we implement? • Collect • BOINC • Instrumentation (MS, Pinpoint) • Pre-aggregated data from companies • Anonymize/Preprocess • Pre-written anonymization tools • Company-specific privacy requirements • Hash values of certain fields • Drop irrelevant fields • Mask part of data
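The three anonymization operations listed above (hash, drop, mask) might look something like the following sketch. Everything specific here is an assumption made for illustration: the record layout, the field names, the keyed SHA-256 hash, and the choice to mask the last two octets of an IPv4 address. A keyed hash is used so that the same name always maps to the same token without being trivially reversible by hashing a list of known names.

```python
import hashlib
import hmac


def anonymize(record, key,
              hash_fields=("user", "machine"),
              drop_fields=("window_title",),
              mask_fields=("ip",)):
    """Return a copy of `record` with fields hashed, dropped, or masked."""
    out = {}
    for field, value in record.items():
        if field in drop_fields:
            continue  # drop irrelevant or sensitive fields entirely
        if field in hash_fields:
            # Keyed hash: stable per value, so records can still be correlated.
            digest = hmac.new(key, str(value).encode("utf-8"),
                              hashlib.sha256).hexdigest()
            out[field] = digest[:16]
        elif field in mask_fields:
            # Mask part of the data: keep the first two dotted components.
            parts = str(value).split(".")
            out[field] = ".".join(parts[:2] + ["x"] * (len(parts) - 2))
        else:
            out[field] = value
    return out


# Made-up record, for illustration only.
record = {
    "user": "alice",
    "machine": "pc-17",
    "ip": "10.1.2.3",
    "window_title": "private.doc",
    "app": "app_a.exe",
}
anon = anonymize(record, key=b"site-secret")
```

Company-specific privacy requirements would presumably be expressed by passing different field lists per data source.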

  29. Tools cont’d • Store • Open-source repository schema • Common log format/data descriptor headers • Tools to convert log metadata to common format to cross-link data tables • Sample queries: data mining ~ asking questions about data as it is • Analyze/Experiment • SLT algorithms • Visualization • Stream processing • Other tools (e.g. WinDbg)

  30. Thoughts on Collection/Anonymization • Defining necessary and sufficient • Bad example: Cannot correlate crashes if we get rid of all user/machine names • Good example: Hash user/machine names • Default: hide if not necessary? • What would it take for you not to invoke the legal dept?

  31. Thoughts on Storage/Analysis • Use time/data source as primary key? • How domain-specific should the common format be? • Management logistics… • Access control…
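To make the primary-key question concrete, here is a small sketch using Python's built-in `sqlite3` module. The table layout, column names, and rows are hypothetical; the slides do not propose a schema. It shows one consequence of using (data source, time) as the key: two events from the same source with the same timestamp cannot both be stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        source TEXT NOT NULL,   -- hypothetical: which data set the row came from
        ts     TEXT NOT NULL,   -- hypothetical: ISO 8601 timestamp
        kind   TEXT NOT NULL,   -- hypothetical: event type
        PRIMARY KEY (source, ts)
    )
    """
)

insert = "INSERT INTO events (source, ts, kind) VALUES (?, ?, ?)"
# Made-up rows, for illustration only.
conn.execute(insert, ("eecs", "2000-01-01T00:00:00", "app_crash"))
conn.execute(insert, ("boinc", "2000-01-01T00:00:00", "os_crash"))

try:
    # Same source and same timestamp as the first row: violates the key.
    conn.execute(insert, ("eecs", "2000-01-01T00:00:00", "app_hang"))
    collided = False
except sqlite3.IntegrityError:
    collided = True

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

If simultaneous events from one source are expected (for example, several dependent processes crashing together), the key would need an extra component such as a machine token or a sequence number.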

  32. Acronym Suggestions??? Open Source (Failure) Data Repository
