A Case for an Open Source Data Repository

Archana Ganapathi, Department of EECS, UC Berkeley (archanag@cs.berkeley.edu)

Why do we study failure data? • Understand the cause-and-effect relationship between configurations and system behavior • Still don’t have a complete understanding of failures in systems • Can’t worry about fixing problems if we don’t understand them in the first place • Gauge behavioral changes over time • Need realistic workload/faultload data to test/evaluate systems • Success stories…people have benefited from failure data analysis

Crash data collection success stories • Berkeley EECS • BOINC • 2 Unnamed Companies

…So Why Does Windows Crash?

Definitions
• Crash
  • Event caused by a problem in the operating system (OS) or an application (app)
  • Requires an OS or app restart
• Application Crash
  • A crash occurring at user level, caused by one or more components (.exe/.dll files)
  • Requires an application restart
• Application Hang
  • An application crash that results from the user terminating a process that is potentially deadlocked or running an infinite loop
  • The component (.exe/.dll file routine) causing the loop/deadlock cannot be identified (yet)
• OS Crash
  • A crash occurring at kernel level, caused by memory corruption, bad drivers, or faulty system-level routines
  • Blue-screen-generating crashes require a machine reboot
  • Windows Explorer crashes require restarting the explorer process
• Bluescreen
  • An OS crash that produces a user-visible blue screen followed by a non-optional machine reboot

Procedure • Collect crash dumps from two different sources • UC Berkeley EECS department • BOINC volunteers • Filter data/form crash clusters to avoid double-counting • Account for shared resources, dependent processes, system instability, user retry • Parse/interpret crash dumps using Debugging Tools for Windows • Study both application crashes and operating system crashes • Supplement crash data with usage data
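The filtering step can be illustrated with a minimal sketch: crashes from the same machine that fall within a short time window are merged into one cluster, so that a user retry or a cascade of dependent failures is counted once. The 10-minute window, the tuple layout, and the sample events are assumptions made for illustration; the slides do not state the actual clustering rule.

```python
from datetime import datetime, timedelta

def cluster_crashes(events, window=timedelta(minutes=10)):
    """Group (machine, timestamp, app) events into clusters.

    A crash joins the current cluster for its machine when it occurs
    within `window` of the previous crash on that machine.
    """
    clusters = []
    last_seen = {}  # machine -> (cluster index, time of latest crash)
    for machine, when, app in sorted(events, key=lambda e: (e[0], e[1])):
        prev = last_seen.get(machine)
        if prev is not None and when - prev[1] <= window:
            clusters[prev[0]].append((machine, when, app))
            last_seen[machine] = (prev[0], when)
        else:
            clusters.append([(machine, when, app)])
            last_seen[machine] = (len(clusters) - 1, when)
    return clusters

# Hypothetical sample events (not taken from the EECS or BOINC data).
events = [
    ("pc-01", datetime(2005, 3, 1, 9, 0), "iexplore.exe"),
    ("pc-01", datetime(2005, 3, 1, 9, 4), "iexplore.exe"),  # likely a user retry
    ("pc-01", datetime(2005, 3, 1, 15, 30), "winword.exe"),
    ("pc-02", datetime(2005, 3, 1, 9, 2), "iexplore.exe"),
]
clusters = cluster_crashes(events)
print(len(clusters), "clusters from", len(events), "crash reports")
```

A real filter would also need to account for shared resources and dependent processes, which a time window alone does not capture.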

EECS Dataset

Crashes reported per month

Usage/Crashes per day of week • EECS department users use their EECS computers Monday through Friday. • Few users use computers on weekends. • Crashes do not occur uniformly across the five days of the working week.

Usage/Crashes per hour of day • Most people work during the typical hours of 9am to 5pm. • Our data set includes users with various affiliations with the department, hence the wider spectrum of work schedules.

Reboot Frequency

Automatic Clustering Experiment for Categorizing Apps • Augment the crash data with information about usage patterns and program dependencies • Feed data into the k-means and agglomerative clustering algorithms to determine which applications are behaviorally related • We determined that we did not have enough data to derive a method to categorize applications in our data set • Need several instances of every (application, component, error code) combo • As a last resort, we chose to categorize apps by application functionality
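For readers unfamiliar with the technique, here is a minimal k-means sketch in plain Python. The feature vectors (fraction of crashes that are hangs, crashes per 100 usage hours) and the application names are invented for illustration; the slides do not list the features actually used, and the experiment itself was inconclusive for lack of data.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical features: (hang fraction, crashes per 100 usage hours).
apps = ["browser_a", "editor_a", "browser_b", "editor_b"]
features = [(0.8, 0.1), (0.1, 0.9), (0.9, 0.2), (0.2, 0.8)]
labels, centroids = kmeans(features, k=2)
for app, label in zip(apps, labels):
    print(app, "-> cluster", label)
```

With only a handful of observations per application, the resulting clusters depend heavily on the seed points, which is consistent with the conclusion that more instances of each (application, component, error code) combination are needed.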

Crash Cause by Application Category

Application Hang vs Crashes due to Faulty Component

Which applications hang?

Which components cause crashes?

BOINC http://winerror.cs.berkeley.edu/crashcollection/ • Berkeley Open Infrastructure for Network Computing • Users download the BOINC client app • Crash dumps are scraped/sent to BOINC servers • Currently 791 accounts created for crash collection + resource management • 492 users for crash collection

OS Crashes • Driver faults • asynchronous events • code must follow kernel programming etiquette • exceedingly difficult to debug • Memory corruption • Hardware problems (e.g., non-ECC memory) • Software-related • 47 of these in our dataset so far…don’t have tools to analyze these in detail

OS crash-causing images (based on 150 BOINC users, 562 crashes)

Crash generating driver fault type

Summary of crash analysis • Application crashes are caused by both faulty, non-robust .dll files and impatient users • OS crashes are predominantly caused by poorly written device driver code • Commonly used core components are blamed for most crashes • Need to improve reliability of these components

Practical techniques to reduce crashes • Software-Based Fault Isolation • Nooks • Separate protection level for drivers • Move driver code to user libraries • Virtual Machine for each unsafe/distrusted app

Lessons from crash data study • Clearly people want to know what’s wrong and how to fix it • The more feedback we give, the more data sets we receive • ...but it’s not as easy as it sounds

What kinds of data should we collect? • Failure data • Configuration information • Logs of normal behavior • Usage data • Performance logs • Annotations of data • Collect data for Individual Machines + Services

Why are people afraid of sharing data? • Fear of public humiliation (reverse engineering what user was doing) • Revealing problems within their organization • Fear of competitors using data against them • Revealing loopholes through which malware can easily propagate. • Revealing dependability problems in third party products (MS)

Non-technical challenges to getting data • Collecting (useful) data is tedious • What information is “necessary and sufficient” to understand data trends? • Privacy concerns • Especially with usage data • Finding the person with access to data • No central location that can be queried for data • Legal agreements take a long time to draft • Researchers are more willing to share data than lawyers • Publicity

Technical solution • Amortize the cost of data collection by building an open source repository • Provide a set of tools to cleanse and mine the data

What tools should we implement?
• Collect
  • BOINC
  • Instrumentation (MS, Pinpoint)
  • Pre-aggregated data from companies
• Anonymize/Preprocess
  • Pre-written anonymization tools
  • Company-specific privacy requirements
  • Hash values of certain fields
  • Drop irrelevant fields
  • Mask part of data
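The "drop" and "mask" preprocessing steps might look like the sketch below. The record layout, the field names, and the rule of masking the second half of a value are all hypothetical; a real tool would take these choices from each contributor's privacy requirements.

```python
def preprocess(record, drop=("window_title",), mask=("ip_address",)):
    """Return a copy of `record` with some fields dropped and others partially masked."""
    cleaned = {}
    for field, value in record.items():
        if field in drop:
            continue  # irrelevant or too sensitive: remove the field entirely
        if field in mask:
            # Keep the first half of the value and hide the rest.
            keep = len(value) // 2
            value = value[:keep] + "*" * (len(value) - keep)
        cleaned[field] = value
    return cleaned

# Hypothetical crash record.
record = {
    "app": "winword.exe",
    "error_code": "0xc0000005",
    "ip_address": "10.0.12.34",
    "window_title": "resume.doc",
}
cleaned = preprocess(record)
print(cleaned)
```

The input record is left unmodified, so the data owner can keep the raw copy while publishing only the cleaned one.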

Tools cont’d
• Store
  • Open-source repository schema
  • Common log format / data descriptor headers
  • Tools to convert log metadata to a common format to cross-link data tables
  • Sample queries: data mining ~ asking questions about data as it is
• Analyze/Experiment
  • SLT algorithms
  • Visualization
  • Stream processing
  • Other tools (e.g., WinDbg)

Thoughts on Collection/Anonymization • Defining necessary and sufficient • Bad example: Cannot correlate crashes if we get rid of all user/machine names • Good example: Hash user/machine names • Default: hide if not necessary? • What would it take for you not to invoke the legal dept?
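The "good example" can be sketched with a keyed hash: each machine name maps to a stable token, so crashes from the same machine can still be correlated without exposing the name. The secret key, the 16-character truncation, and the sample data are assumptions for illustration; the slides do not prescribe a specific hashing scheme.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret, kept by the data owner and never published.
SECRET_KEY = b"replace-with-a-site-specific-secret"

def pseudonym(name, key=SECRET_KEY):
    """Map a user or machine name to a stable token via HMAC-SHA256."""
    return hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# Hypothetical crash reports: (machine name, crashing application).
crashes = [
    ("alice-pc", "iexplore.exe"),
    ("bob-pc", "winword.exe"),
    ("alice-pc", "winword.exe"),
]
anonymized = [(pseudonym(machine), app) for machine, app in crashes]

# Correlation still works on the tokens: one machine has two crashes.
crashes_per_machine = Counter(machine for machine, _ in anonymized)
print(crashes_per_machine.most_common())
```

A keyed hash is used rather than a bare hash because machine and user names are easy to guess, and an unkeyed digest could be reversed by hashing a list of candidate names.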

Thoughts on Storage/Analysis • Use time/data source as primary key? • How domain-specific should the common format be? • Management logistics… • Access control…
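To make the primary-key question concrete, here is one possible layout using SQLite with a composite key of data source and timestamp. The table name, column names, and sample rows are hypothetical and are not a proposed schema for the repository.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        source     TEXT NOT NULL,   -- e.g. 'eecs' or 'boinc'
        timestamp  TEXT NOT NULL,   -- ISO 8601, UTC
        kind       TEXT NOT NULL,   -- 'app_crash', 'app_hang', 'os_crash'
        detail     TEXT,
        PRIMARY KEY (source, timestamp)
    )
""")

# Hypothetical rows for illustration only.
rows = [
    ("eecs", "2005-03-01T09:00:00Z", "app_crash", "iexplore.exe"),
    ("eecs", "2005-03-01T15:30:00Z", "app_hang", "winword.exe"),
    ("boinc", "2005-03-01T09:00:00Z", "os_crash", "driver fault"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Sample query: how many events did each data source contribute?
counts = dict(conn.execute("SELECT source, COUNT(*) FROM events GROUP BY source"))
print(counts)
```

One caveat with this key: two events from the same source with the same timestamp would collide, so a finer-grained source identifier (per machine, say) or a sequence number may be needed.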

Acronym Suggestions??? Open Source (Failure) Data Repository
