A Case for an Open Source Data Repository Archana Ganapathi Department of EECS, UC Berkeley (firstname.lastname@example.org)
Why do we study failure data? • Understand cause->effect relationship between configurations and system behavior • Still don’t have a complete understanding of failures in systems • Can’t worry about fixing problems if we don’t understand them in the first place • Gauge behavioral changes over time • Need realistic workload/faultload data to test/evaluate systems • Success stories…people have benefited from failure data analysis
Crash data collection success stories • Berkeley EECS • BOINC • 2 Unnamed Companies
Definitions • Crash • Event caused by a problem in the operating system (OS) or an application (app) • Requires an OS or app restart. • Application Crash • A crash occurring at user level, caused by one or more components (.exe/.dll files) • Requires an application restart. • Application Hang • An application crash that results from the user terminating a process that is potentially deadlocked or running an infinite loop. • The component (.exe/.dll file routine) causing the loop/deadlock cannot be identified (yet) • OS Crash • A crash occurring at kernel level, caused by memory corruption, bad drivers or faulty system-level routines. • Blue-screen-generating crashes require a machine reboot • Windows Explorer crashes require restarting the explorer process. • Bluescreen • An OS crash that produces a user-visible blue screen followed by a non-optional machine reboot.
Procedure • Collect crash dumps from two different sources • UC Berkeley EECS department • BOINC volunteers • Filter data/form crash clusters to avoid double-counting • Account for shared resources, dependent processes, system instability, user retry • Parse/interpret crash dumps using Debugging Tools for Windows • Study both application crashes and operating system crashes • Supplement crash data with usage data
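The filtering step above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used in the study: the record fields (`machine`, `app`, `time`), the 10-minute window, and the sample records are all assumptions made for the example. Crashes of the same application on the same machine that follow the previous one within the window (for instance, a user retrying) fall into one cluster, so each cluster is counted once.

```python
from datetime import datetime, timedelta

def cluster_crashes(crashes, window=timedelta(minutes=10)):
    """Group crashes by (machine, app); start a new cluster when the gap
    to the previous crash in the same group exceeds `window`."""
    clusters = []
    last_seen = {}  # (machine, app) -> (time of last crash, cluster index)
    for crash in sorted(crashes, key=lambda c: c["time"]):
        key = (crash["machine"], crash["app"])
        prev = last_seen.get(key)
        if prev is not None and crash["time"] - prev[0] <= window:
            idx = prev[1]
            clusters[idx].append(crash)
        else:
            idx = len(clusters)
            clusters.append([crash])
        last_seen[key] = (crash["time"], idx)
    return clusters

# Hypothetical sample records, for illustration only.
sample = [
    {"machine": "m1", "app": "browser.exe", "time": datetime(2024, 1, 1, 9, 0)},
    {"machine": "m1", "app": "browser.exe", "time": datetime(2024, 1, 1, 9, 4)},
    {"machine": "m1", "app": "browser.exe", "time": datetime(2024, 1, 1, 11, 0)},
    {"machine": "m2", "app": "browser.exe", "time": datetime(2024, 1, 1, 9, 1)},
]
clusters = cluster_crashes(sample)
cluster_sizes = [len(c) for c in clusters]
```

A fixed time window is only one possible grouping rule; accounting for shared resources or dependent processes would need extra fields that this sketch does not model.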
Usage/Crashes per day of week • EECS department users use their EECS computers Monday through Friday. • Few users use computers on weekends. • Crashes do not occur uniformly across the five days of the working week.
Usage/Crashes per hour of day • Most people work during the typical hours of 9am to 5pm. • Our data set involves users with various affiliations to the department, hence the wider spectrum of work schedules
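Per-weekday and per-hour tallies like the ones on these two slides need only crash timestamps. The sketch below assumes the timestamps are already available as Python `datetime` objects; the function names and sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def crashes_by_weekday(timestamps):
    """Count crashes per weekday name (datetime.weekday(): Monday is 0)."""
    return Counter(DAYS[ts.weekday()] for ts in timestamps)

def crashes_by_hour(timestamps):
    """Count crashes per hour of day (0-23)."""
    return Counter(ts.hour for ts in timestamps)

# Hypothetical crash timestamps, for illustration only.
sample = [
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 1, 14, 5),
    datetime(2024, 1, 2, 9, 45),
    datetime(2024, 1, 6, 23, 10),
]
by_day = crashes_by_weekday(sample)
by_hour = crashes_by_hour(sample)
```

Dividing each count by the matching usage count (hours of use per weekday or per hour) would separate "more crashes" from "more usage".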
Automatic Clustering Experiment for Categorizing Apps • Augment the crash data with information about usage patterns and program dependencies • Feed data into the k-means and agglomerative clustering algorithms to determine which applications are behaviorally related. • We determined that we did not have enough data to derive a method to categorize applications in our data set • Need several instances of every (application, component, error code) combo • As a last resort, we chose to categorize apps based on application functionality
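To make the k-means step concrete, here is a small pure-Python sketch of Lloyd's algorithm applied to per-application feature vectors. It is not the tooling used in the experiment, and the two features (crashes per usage hour, number of distinct components blamed) and the sample values are assumptions invented for the example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns one label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels

# Hypothetical features per app: (crashes per usage hour, components blamed).
apps = ["browser", "mail", "editor", "player"]
features = [(0.9, 12.0), (1.0, 11.0), (0.1, 2.0), (0.2, 3.0)]
labels = kmeans(features, k=2)
groups = dict(zip(apps, labels))
```

With only a handful of points per application, as the slide notes, the resulting groups are not reliable; the sketch shows the mechanics only.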
BOINC http://winerror.cs.berkeley.edu/crashcollection/ • Berkeley Open Infrastructure for Network Computing • Users download the BOINC client app • Crash dumps are scraped/sent to BOINC servers • Currently 791 accounts created for crash collection + resource management • 492 users for crash collection
OS Crashes • Driver faults • asynchronous events • code must follow kernel programming etiquette • exceedingly difficult to debug • Memory corruption • Hardware problems (e.g. non-ECC mem) • Software-related • 47 of these in our dataset so far…don’t have tools to analyze these in detail
Summary of crash analysis • Application crashes are caused both by faulty, non-robust DLL files and by impatient users • OS crashes are predominantly caused by poorly written device driver code • Commonly used core components are blamed for most crashes • Need to improve reliability of these components
Practical techniques to reduce crashes • Software-Based Fault Isolation • Nooks • Separate protection level for drivers • Move driver code to user libraries • Virtual Machine for each unsafe/distrusted app
Lessons from crash data study • Clearly people want to know what’s wrong and how to fix it • The more feedback we give, the more data sets we receive • ...but it’s not as easy as it sounds
What kinds of data should we collect? • Failure data • Configuration information • Logs of normal behavior • Usage data • Performance logs • Annotations of data • Collect data for Individual Machines + Services
Why are people afraid of sharing data? • Fear of public humiliation (reverse engineering what user was doing) • Revealing problems within their organization • Fear of competitors using data against them • Revealing loopholes through which malware can easily propagate. • Revealing dependability problems in third party products (MS)
Non-technical challenges to getting data • Collecting (useful) data is tedious • What information is “necessary and sufficient” to understand data trends? • Privacy concerns • Especially with usage data • Finding the person with access to data • No central location that can be queried for data • Legal agreements take a long time to draft • Researchers are more willing to share data than lawyers • Publicity
Technical solution • Amortize the cost of data collection by building an open source repository • Provide a set of tools to cleanse and mine the data
What tools should we implement? • Collect • BOINC • Instrumentation (MS, Pinpoint) • Pre-aggregated data from companies • Anonymize/Preprocess • Pre-written anonymization tools • Company-specific privacy requirements • Hash values of certain fields • Drop irrelevant fields • Mask part of data
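The three anonymization operations listed above (hash, drop, mask) could look roughly like this. The field names, the key, and the masking rule are hypothetical; a real tool would take them from each company's privacy requirements. A keyed hash (HMAC-SHA256 from the Python standard library) is used here as one reasonable choice, so that names cannot be recovered by hashing a list of guesses without the key.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-private-key"  # hypothetical; kept by the data owner

HASH_FIELDS = {"user", "machine"}   # hashed so records stay linkable
DROP_FIELDS = {"window_title"}      # dropped as irrelevant or sensitive
MASK_FIELDS = {"ip"}                # partially masked

def pseudonym(value):
    """Keyed hash of a name, truncated to 16 hex characters."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def mask_ip(ip):
    """Keep the first two octets of a dotted IPv4 address, mask the rest."""
    parts = ip.split(".")
    return ".".join(parts[:2] + ["x", "x"])

def anonymize(record):
    out = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field in HASH_FIELDS:
            out[field] = pseudonym(value)
        elif field in MASK_FIELDS:
            out[field] = mask_ip(value)
        else:
            out[field] = value
    return out

# Hypothetical record, for illustration only.
raw = {"user": "alice", "machine": "lab-pc-7", "ip": "10.1.2.3",
       "window_title": "draft.doc", "app": "editor.exe"}
clean = anonymize(raw)
```

Because the same name always maps to the same token, crashes from one user or machine can still be correlated after anonymization.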
Tools cont’d • Store • Open-source repository schema • Common log format/data descriptor headers • Tools to convert log metadata to common format to cross-link data tables • Sample queries: data mining ~ asking questions about data as it is • Analyze/Experiment • SLT algorithms • Visualization • Stream processing • Other tools (e.g. WinDbg)
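A converter to a common format might, in its simplest form, be a per-source field mapping plus timestamp normalization. In the sketch below the source labels reuse the two data sources named earlier, but every field name and the choice of ISO 8601 UTC timestamps are assumptions for illustration, not a proposed standard.

```python
from datetime import datetime, timezone

# Hypothetical per-source mappings: source field name -> common field name.
FIELD_MAPS = {
    "eecs": {"host": "machine", "ts": "time", "proc": "application"},
    "boinc": {"machine_id": "machine", "timestamp": "time",
              "app_name": "application"},
}

def to_common(source, record):
    """Rename source-specific fields and normalize the timestamp."""
    common = {"source": source}
    for src_field, common_field in FIELD_MAPS[source].items():
        common[common_field] = record[src_field]
    # Assumes timestamps arrive as epoch seconds; emit ISO 8601 in UTC.
    common["time"] = datetime.fromtimestamp(
        common["time"], tz=timezone.utc).isoformat()
    return common

a = to_common("eecs", {"host": "m1", "ts": 0, "proc": "editor.exe"})
b = to_common("boinc", {"machine_id": "m9", "timestamp": 60,
                        "app_name": "player.exe"})
```

Once records share field names and a time format, tables from different sources can be cross-linked on those fields.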
Thoughts on Collection/Anonymization • Defining necessary and sufficient • Bad example: Cannot correlate crashes if we get rid of all user/machine names • Good example: Hash user/machine names • Default: hide if not necessary? • What would it take for you not to invoke the legal dept?
Thoughts on Storage/Analysis • Use time/data source as primary key? • How domain-specific should the common format be? • Management logistics… • Access control…
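One way to explore the primary-key question is to try it in a throwaway schema. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table and column names are hypothetical. It shows one consequence of a (source, time) key: two events from the same source with the same timestamp collide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event (
        source TEXT NOT NULL,
        time   TEXT NOT NULL,
        detail TEXT,
        PRIMARY KEY (source, time)
    )
""")

def insert(row):
    """Insert one (source, time, detail) row; return False on a key collision."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO event VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# Hypothetical rows, for illustration only.
first = insert(("eecs", "2024-01-01T09:00:00", "crash A"))
same_key = insert(("eecs", "2024-01-01T09:00:00", "crash B"))
other_source = insert(("boinc", "2024-01-01T09:00:00", "crash C"))
count = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
```

If simultaneous events per source are plausible, the key would need another column (for example a machine identifier or a sequence number), which is part of what the question above leaves open.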
Acronym Suggestions??? Open Source (Failure) Data Repository