A Case for an Open Source Data Repository

Archana Ganapathi

Department of EECS, UC Berkeley (archanag@cs.berkeley.edu)


Why do we study failure data?

  • Understand the cause-and-effect relationship between configurations and system behavior

  • Still don’t have a complete understanding of failures in systems

    • Can’t worry about fixing problems if we don’t understand them in the first place

  • Gauge behavioral changes over time

  • Need realistic workload/faultload data to test/evaluate systems

  • Success stories…people have benefited from failure data analysis

Crash data collection success stories

  • Berkeley EECS


  • 2 Unnamed Companies


  • Crash

    • Event caused by a problem in the operating system (OS) or application (app)

    • Requires OS or app restart.

  • Application Crash

    • A crash occurring at user-level, caused by one or more components (.exe/.dll files)

    • Requires an application restart.

  • Application Hang

    • An application crash that results from the user terminating a process that is potentially deadlocked or stuck in an infinite loop.

    • Component (.exe/.dll file routine) causing the loop/deadlock cannot be identified (yet)

  • OS Crash

    • A crash occurring at kernel-level, caused by memory corruption, bad drivers or faulty system-level routines.

    • Blue-screen-generating crashes require a machine reboot

    • Windows Explorer crashes require restarting the Explorer process.

  • Bluescreen

    • An OS crash that produces a user-visible blue screen followed by a non-optional machine reboot.


  • Collect crash dumps from two different sources

    • UC Berkeley EECS department

    • BOINC volunteers

  • Filter data/form crash clusters to avoid double-counting

    • Account for shared resources, dependent processes, system instability, user retry

  • Parse/interpret crash dumps using Debugging Tools for Windows

  • Study both application crash behavior and operating system crashes

    • Supplement crash data with usage data
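The filtering step above can be sketched as follows. This is a minimal illustration, not the study's actual procedure: it assumes each crash is a (machine, timestamp, application) tuple and that crashes on the same machine within a 10-minute window belong to one cluster. The record layout, the window length, and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical window: the slides do not say how close two crashes must be
# to count as one incident.
WINDOW = timedelta(minutes=10)


def cluster_crashes(events, window=WINDOW):
    """Group crashes on the same machine that occur close together in time.

    A crash joins the current cluster if it happens within `window` of the
    previous crash on that machine; otherwise it starts a new cluster.
    """
    clusters = []
    last_seen = {}  # machine -> (timestamp of latest crash, its cluster)
    for machine, ts, app in sorted(events, key=lambda e: (e[0], e[1])):
        previous = last_seen.get(machine)
        if previous is not None and ts - previous[0] <= window:
            cluster = previous[1]
        else:
            cluster = []
            clusters.append(cluster)
        cluster.append((machine, ts, app))
        last_seen[machine] = (ts, cluster)
    return clusters


# Made-up events for illustration only.
events = [
    ("m1", datetime(2005, 3, 1, 9, 0), "iexplore.exe"),
    ("m1", datetime(2005, 3, 1, 9, 2), "iexplore.exe"),   # possible user retry
    ("m1", datetime(2005, 3, 1, 9, 5), "explorer.exe"),   # possible dependent process
    ("m1", datetime(2005, 3, 1, 15, 0), "winword.exe"),
    ("m2", datetime(2005, 3, 1, 9, 1), "iexplore.exe"),
]
clusters = cluster_crashes(events)
print(len(events), "raw crashes ->", len(clusters), "clusters")
```

Counting clusters instead of raw dumps keeps a burst of retries or cascading failures from being counted as several independent crashes.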

Usage/Crashes per day of week

  • EECS department users use their EECS computers Monday through Friday.

  • Few users use computers on weekends.

  • Crashes do not occur uniformly across the five days of the working week.
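One way to compare days fairly is to divide crash counts by usage for each day of the week. The sketch below assumes crash dates and per-day usage hours are available; the function name, the data layout, and all the numbers are illustrative assumptions rather than figures from the study.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def crashes_per_usage_hour(crash_dates, usage_hours):
    """Return crashes divided by usage hours for each day of the week.

    Days with no recorded usage map to None, since the rate is undefined.
    """
    crashes = Counter(WEEKDAYS[d.weekday()] for d in crash_dates)
    rates = {}
    for day in WEEKDAYS:
        hours = usage_hours.get(day, 0)
        rates[day] = crashes[day] / hours if hours else None
    return rates


# Made-up numbers for illustration only.
crash_dates = [date(2005, 3, 7), date(2005, 3, 7), date(2005, 3, 9), date(2005, 3, 12)]
usage_hours = {"Mon": 40, "Tue": 40, "Wed": 50, "Thu": 40, "Fri": 30, "Sat": 2}

rates = crashes_per_usage_hour(crash_dates, usage_hours)
for day in WEEKDAYS:
    print(day, rates[day])
```

Normalizing this way separates "more crashes because more people were working" from "more crashes per hour of use".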

Usage/Crashes per hour of day

  • Most people work during the typical hours of 9am to 5pm.

  • Our data set includes users with various affiliations with the department, hence the wider spectrum of work schedules

Automatic Clustering Experiment for Categorizing Apps

  • Augment the crash data with information about usage patterns and program dependencies

  • Feed data into the k-means and agglomerative clustering algorithms to determine which applications are behaviorally related.

  • We determined that we did not have enough data to derive a method to categorize applications in our data set

    • Need several instances of every (application, component, error code) combo

  • As a last resort, we chose to categorize apps based on application functionality
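The data-sufficiency check behind the conclusion above can be sketched as a simple count per (application, component, error code) combination. The slides say "several" instances are needed without giving a number, so the threshold of 3 below is an assumption, as are the function name and the sample records.

```python
from collections import Counter

# Hypothetical threshold: "several" is not quantified in the slides.
MIN_INSTANCES = 3


def combos_with_enough_data(crashes, min_instances=MIN_INSTANCES):
    """Split (application, component, error code) combos by sample count."""
    counts = Counter(crashes)
    enough = {combo: n for combo, n in counts.items() if n >= min_instances}
    sparse = {combo: n for combo, n in counts.items() if n < min_instances}
    return enough, sparse


# Made-up records for illustration only.
crashes = [
    ("iexplore.exe", "mshtml.dll", "0xC0000005"),
    ("iexplore.exe", "mshtml.dll", "0xC0000005"),
    ("iexplore.exe", "mshtml.dll", "0xC0000005"),
    ("winword.exe", "mso.dll", "0xC0000005"),
    ("acrord32.exe", "acrord32.dll", "0xC00000FD"),
]
enough, sparse = combos_with_enough_data(crashes)
print(len(enough), "combos with enough data,", len(sparse), "too sparse to cluster")
```

When most combinations land in the sparse group, a clustering algorithm has too little signal to group applications by behavior, which is the situation the slide describes.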

BOINC http://winerror.cs.berkeley.edu/crashcollection/

  • Berkeley Open Infrastructure for Network Computing

  • Users download the BOINC client app

  • Crash dumps are scraped/sent to BOINC servers

  • Currently 791 accounts created for crash collection + resource management

    • 492 users for crash collection

OS Crashes

  • Driver faults

    • asynchronous events

    • code must follow kernel programming etiquette

    • exceedingly difficult to debug

  • Memory corruption

    • Hardware problems (e.g., non-ECC memory)

    • Software-related

    • 47 of these in our dataset so far…don’t have tools to analyze these in detail

Summary of crash analysis

  • Application crashes are caused both by faulty, non-robust .dll files and by impatient users

  • OS crashes are predominantly caused by poorly-written device driver code

  • Commonly used core components are blamed for most crashes

    • need to improve reliability of these components

Practical techniques to reduce crashes

  • Software-Based Fault Isolation

  • Nooks

  • Separate protection level for drivers

  • Move driver code to user libraries

  • Virtual Machine for each unsafe/distrusted app

Lessons from crash data study

  • Clearly people want to know what’s wrong and how to fix it

  • The more feedback we give, the more data sets we receive

  • ...but it’s not as easy as it sounds

What kinds of data should we collect?

  • Failure data

  • Configuration information

  • Logs of normal behavior

  • Usage data

  • Performance logs

  • Annotations of data

  • Collect data for Individual Machines + Services

Why are people afraid of sharing data?

  • Fear of public humiliation (reverse engineering what user was doing)

  • Revealing problems within their organization

  • Fear of competitors using data against them

  • Revealing loopholes through which malware can easily propagate.

  • Revealing dependability problems in third party products (MS)

Non-technical challenges to getting data

  • Collecting (useful) data is tedious

    • What information is “necessary and sufficient” to understand data trends?

  • Privacy concerns

    • Especially with usage data

  • Finding the person with access to data

    • No central location that can be queried for data

  • Legal agreements take a long time to draft

    • Researchers are more willing to share data than lawyers

  • Publicity

Technical solution

  • Amortize the cost of data collection by building an open source repository

  • Provide a set of tools to cleanse and mine the data

What tools should we implement?

  • Collect

    • BOINC

    • Instrumentation (MS, Pinpoint)

    • Pre-aggregated data from companies

  • Anonymize/Preprocess

    • Pre-written anonymization tools

    • Company-specific privacy requirements

      • Hash values of certain fields

      • Drop irrelevant fields

      • Mask part of data
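The three anonymization operations listed above (hash, drop, mask) might look like the sketch below. The field names, the per-field policy, the secret key, and the choice of HMAC-SHA-256 are all assumptions made for illustration; a real deployment would follow each company's own privacy requirements.

```python
import hashlib
import hmac

# Illustrative policy only; none of these field names come from the slides.
HASH_FIELDS = {"user", "machine"}   # replace with a keyed hash
DROP_FIELDS = {"window_title"}      # remove entirely
MASK_FIELDS = {"ip"}                # keep only a prefix


def keyed_hash(value, secret):
    """Return a stable pseudonym: equal inputs give equal outputs."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_ip(ip):
    """Keep the first two octets of a dotted IPv4 address."""
    parts = ip.split(".")
    return ".".join(parts[:2] + ["x"] * (len(parts) - 2))


def anonymize(record, secret):
    """Apply the hash/drop/mask policy to one record (a dict)."""
    out = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field in HASH_FIELDS:
            out[field] = keyed_hash(value, secret)
        elif field in MASK_FIELDS:
            out[field] = mask_ip(value)
        else:
            out[field] = value
    return out


secret = b"example-secret"  # hypothetical key, kept by the data owner
first = anonymize({"user": "alice", "machine": "pc-17", "ip": "10.1.2.3",
                   "window_title": "budget.xls", "app": "excel.exe"}, secret)
second = anonymize({"user": "alice", "machine": "pc-42", "ip": "10.1.9.9",
                    "window_title": "notes.doc", "app": "winword.exe"}, secret)
print(first)
print(first["user"] == second["user"])
```

Because the hash is deterministic, two crashes from the same user still share a pseudonym, so they can be correlated without exposing the original name.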

Tools cont’d

  • Store

    • Open-source repository schema

    • Common log format/data descriptor headers

    • Tools to convert log metadata to common format to cross-link data tables

    • Sample queries: data mining ~ asking questions about data as it is

  • Analyze/Experiment

    • SLT algorithms

    • Visualization

    • Stream processing

    • Other tools (e.g., WinDbg)

Thoughts on Collection/Anonymization

  • Defining necessary and sufficient

    • Bad example: Cannot correlate crashes if we get rid of all user/machine names

    • Good example: Hash user/machine names

  • Default: hide if not necessary?

  • What would it take for you not to invoke the legal dept?

Thoughts on Storage/Analysis

  • Use time/data source as primary key?

  • How domain-specific should the common format be?

  • Management logistics…

  • Access control…
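To make the primary-key question concrete, here is a small sketch using an in-memory SQLite table keyed on (timestamp, data source). The table name, the columns, and the rows are hypothetical; this is one possible layout for discussion, not a proposed repository schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        ts      TEXT NOT NULL,   -- ISO 8601 timestamp
        source  TEXT NOT NULL,   -- hypothetical data-source identifier
        kind    TEXT NOT NULL,   -- e.g. 'crash', 'usage', 'perf'
        detail  TEXT,
        PRIMARY KEY (ts, source)
    )
    """
)

# Made-up rows for illustration only.
rows = [
    ("2005-03-01T09:00:00", "eecs", "crash", "iexplore.exe"),
    ("2005-03-01T09:05:00", "eecs", "crash", "explorer.exe"),
    ("2005-03-01T09:00:00", "boinc", "crash", "winword.exe"),
    ("2005-03-01T10:00:00", "boinc", "usage", "login"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# The composite key rejects a second event with the same time and source.
try:
    conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)", rows[0])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Sample query: crashes per data source.
crash_counts = conn.execute(
    "SELECT source, COUNT(*) FROM events "
    "WHERE kind = 'crash' GROUP BY source ORDER BY source"
).fetchall()
print(duplicate_rejected, crash_counts)
```

The rejected insert shows the trade-off: the key makes cross-source joins on time straightforward, but two distinct events from one source with the same timestamp would collide, so a finer-grained source identifier or an extra sequence column may be needed.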

Acronym Suggestions???

Open Source (Failure) Data Repository