A Case for an Open Source Data Repository

Archana Ganapathi

Department of EECS, UC Berkeley

([email protected])


Why do we study failure data?

  • Understand cause->effect relationship between configurations and system behavior

  • Still don’t have a complete understanding of failures in systems

    • Can’t worry about fixing problems if we don’t understand them in the first place

  • Gauge behavioral changes over time

  • Need realistic workload/faultload data to test/evaluate systems

  • Success stories…people have benefited from failure data analysis


Crash data collection success stories

  • Berkeley EECS

  • BOINC

  • 2 Unnamed Companies


…So Why Does Windows Crash?


Definitions

  • Crash

    • Event caused by a problem in the operating system (OS) or application (app)

    • Requires OS or app restart.

  • Application Crash

    • A crash occurring at user-level, caused by one or more components (.exe/.dll files)

    • Requires an application restart.

  • Application Hang

    • An application crash that results from the user terminating a process that is potentially deadlocked or stuck in an infinite loop.

    • Component (.exe/.dll file or routine) causing the loop/deadlock cannot be identified (yet)

  • OS Crash

    • A crash occurring at kernel-level, caused by memory corruption, bad drivers or faulty system-level routines.

    • Blue-screen-generating crashes require a machine reboot

    • Windows Explorer crashes require restarting the Explorer process.

  • Bluescreen

    • An OS crash that produces a user-visible blue screen followed by a non-optional machine reboot.


Procedure

  • Collect crash dumps from two different sources

    • UC Berkeley EECS department

    • BOINC volunteers

  • Filter data/form crash clusters to avoid double-counting

    • Account for shared resources, dependent processes, system instability, user retry

  • Parse/interpret crash dumps using Debugging Tools for Windows

  • Study both application crash behavior and operating systems crashes

    • Supplement crash data with usage data
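
The slides do not state the exact filtering rule, so the following is only a minimal sketch of one way to form crash clusters: crashes on the same machine that follow each other within a short window are treated as one incident (for example, a user retrying a failing application). The ten-minute window, the machine names, and the timestamps are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def cluster_crashes(crashes, window=timedelta(minutes=10)):
    """Group crashes from the same machine that occur close together in time.

    `crashes` is an iterable of (machine_id, timestamp) pairs. A crash joins
    the current cluster when it follows the previous crash on that machine
    by at most `window`; otherwise it starts a new cluster.
    """
    clusters = []
    last_seen = {}  # machine_id -> (timestamp of latest crash, its cluster)
    for machine, ts in sorted(crashes, key=lambda c: (c[0], c[1])):
        prev = last_seen.get(machine)
        if prev is not None and ts - prev[0] <= window:
            cluster = prev[1]
        else:
            cluster = []
            clusters.append(cluster)
        cluster.append((machine, ts))
        last_seen[machine] = (ts, cluster)
    return clusters

# Hypothetical crash reports, not taken from the EECS or BOINC data.
crashes = [
    ("hostA", datetime(2005, 3, 1, 9, 0)),
    ("hostA", datetime(2005, 3, 1, 9, 4)),   # plausibly a retry of the same failure
    ("hostA", datetime(2005, 3, 1, 15, 30)),
    ("hostB", datetime(2005, 3, 1, 9, 2)),
]
clusters = cluster_crashes(crashes)
print(len(crashes), "raw crashes,", len(clusters), "clusters")
```

Counting clusters rather than raw dumps keeps a burst of retries from inflating the totals. A real filter would also need to account for shared resources and dependent processes, which this sketch ignores.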


EECS Dataset


Crashes reported per month


Usage/Crashes per day of week

  • EECS department users use their EECS computers Monday through Friday.

  • Few users use computers on weekends.

  • Crashes do not occur uniformly across the five days of the working week.
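
Because usage differs so much between weekdays and weekends, raw crash counts per day can mislead. One way to compare days, sketched here under the assumption that usage is available as hours of use per weekday, is to divide crash counts by usage. The function name and all numbers are hypothetical.

```python
from collections import Counter
from datetime import date

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def crashes_per_usage_hour(crash_dates, usage_hours):
    """Return crashes per hour of use for each weekday listed in usage_hours."""
    counts = Counter(DAYS[d.weekday()] for d in crash_dates)
    return {
        day: counts[day] / hours
        for day, hours in usage_hours.items()
        if hours > 0  # skip days with no recorded usage
    }

# Hypothetical inputs: two Monday crashes, one Wednesday, one Saturday.
crash_dates = [date(2005, 3, 7), date(2005, 3, 7), date(2005, 3, 9), date(2005, 3, 12)]
usage_hours = {"Mon": 40.0, "Wed": 50.0, "Sat": 5.0}

rates = crashes_per_usage_hour(crash_dates, usage_hours)
for day, rate in rates.items():
    print(day, round(rate, 3))
```

With these made-up numbers, Saturday has the fewest crashes after Wednesday but the highest crash rate per hour of use, which is the kind of effect that raw counts would hide.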


Usage/Crashes per hour of day

  • Most people work during the typical hours of 9am to 5pm.

  • Our data set includes users with various affiliations to the department, hence the wider spectrum of work schedules


Reboot Frequency


Automatic Clustering Experiment for Categorizing Apps

  • Augment the crash data with information about usage patterns and program dependencies

  • Feed data into the k-means and agglomerative clustering algorithms to determine which applications are behaviorally related.

  • We determined that we did not have enough data to derive a method to categorize applications in our data set

    • Need several instances of every (application, component, error code) combo

  • As a last resort, we chose to categorize apps by application functionality
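
To make the clustering step concrete, here is a small sketch of agglomerative clustering with average linkage, written with the standard library only. It is not the code used in the study; the application names and the two-dimensional feature vectors (imagine usage hours per day and a normalized dependency count) are invented for illustration, and k-means is not shown.

```python
from itertools import combinations
from math import dist

def agglomerative(points, k):
    """Merge the two closest clusters (average linkage) until k remain.

    `points` maps an application name to its feature vector.
    """
    clusters = [[name] for name in points]

    def linkage(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        total = sum(dist(points[x], points[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters

# Hypothetical feature vectors for four made-up applications.
features = {
    "browser_a": (6.0, 0.9),
    "browser_b": (5.5, 0.8),
    "editor_a": (2.0, 0.2),
    "editor_b": (2.5, 0.3),
}
groups = agglomerative(features, k=2)
print(groups)
```

On clean toy data like this the grouping is obvious. With only a handful of observations per (application, component, error code) combination, the feature vectors become too noisy for the merges to be meaningful, which matches the outcome reported above.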


Crash Cause by Application Category


Application Hang vs Crashes due to Faulty Component


Which applications hang?


Which components cause crashes?


BOINC (http://winerror.cs.berkeley.edu/crashcollection/)

  • Berkeley Open Infrastructure for Network Computing

  • Users download BOINC client app

  • Crash dumps are scraped/sent to BOINC servers

  • Currently 791 accounts created for crash collection + resource management

    • 492 users for crash collection


OS Crashes

  • Driver faults

    • asynchronous events

    • code must follow kernel programming etiquette

    • exceedingly difficult to debug

  • Memory corruption

    • Hardware problems (e.g. non-ECC mem)

    • Software-related

    • 47 of these in our dataset so far…don’t have tools to analyze these in detail


OS crash-causing images (based on 150 BOINC users, 562 crashes)


Crash-generating driver fault type


Summary of crash analysis

  • Application crashes are caused both by faulty, non-robust .dll files and by impatient users

  • OS crashes are predominantly caused by poorly-written device driver code

  • Commonly used core components are blamed for most crashes

    • need to improve reliability of these components


Practical techniques to reduce crashes

  • Software-Based Fault Isolation

  • Nooks

  • Separate protection level for drivers

  • Move driver code to user libraries

  • Virtual Machine for each unsafe/distrusted app


Lessons from crash data study

  • Clearly people want to know what’s wrong and how to fix it

  • The more feedback we give, the more data sets we receive

  • ...but it’s not as easy as it sounds


What kinds of data should we collect?

  • Failure data

  • Configuration information

  • Logs of normal behavior

  • Usage data

  • Performance logs

  • Annotations of data

  • Collect data for Individual Machines + Services


Why are people afraid of sharing data?

  • Fear of public humiliation (reverse engineering what user was doing)

  • Revealing problems within their organization

  • Fear of competitors using data against them

  • Revealing loopholes through which malware can easily propagate.

  • Revealing dependability problems in third party products (MS)


Non-technical challenges to getting data

  • Collecting (useful) data is tedious

    • What information is “necessary and sufficient” to understand data trends?

  • Privacy concerns

    • Especially with usage data

  • Finding the person with access to data

    • No central location that can be queried for data

  • Legal agreements take a long time to draft

    • Researchers are more willing to share data than lawyers

  • Publicity


Technical solution

  • Amortize the cost of data collection by building an open source repository

  • Provide a set of tools to cleanse and mine the data


What tools should we implement?

  • Collect

    • BOINC

    • Instrumentation (MS, Pinpoint)

    • Pre-aggregated data from companies

  • Anonymize/Preprocess

    • Pre-written anonymization tools

    • Company-specific privacy requirements

      • Hash values of certain fields

      • Drop irrelevant fields

      • Mask part of data
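
The three operations listed above (hash, drop, mask) could be driven by a per-source policy table. The sketch below is one possible shape for such a tool, not an existing one: the field names, the policy format, the secret key, and the choice of HMAC-SHA256 truncated to 16 hex characters are all assumptions.

```python
import hashlib
import hmac

def anonymize(record, policy, key):
    """Apply a per-field policy: 'hash', 'drop', or 'mask'; other fields are kept."""
    out = {}
    for field, value in record.items():
        action = policy.get(field, "keep")
        if action == "drop":
            continue
        if action == "hash":
            # Keyed hash: the same input always maps to the same pseudonym,
            # but the mapping cannot be recomputed without the key.
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()[:16]
        elif action == "mask":
            # Keep the first four characters and star out the rest.
            text = str(value)
            out[field] = text[:4] + "*" * max(len(text) - 4, 0)
        else:
            out[field] = value
    return out

# Hypothetical record and company-specific policy.
record = {
    "user": "alice",
    "machine": "eecs-pc-17",
    "ip": "192.168.1.20",
    "window_title": "draft.doc",
    "error_code": "0xC0000005",
}
policy = {"user": "hash", "machine": "hash", "ip": "mask", "window_title": "drop"}

clean = anonymize(record, policy, key=b"site-secret")
print(clean)
```

Keeping the policy as data rather than code means each contributing organization can supply its own table without modifying the tool.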


Tools cont’d

  • Store

    • Open-source repository schema

    • Common log format/ data descriptor headers

    • Tools to convert log metadata to common format to cross-link data tables

    • Sample queries: data mining ~ asking questions about data as it is

  • Analyze/Experiment

    • SLT algorithms

    • Visualization

    • Stream processing

    • Other tools (e.g. WinDbg)
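
As an illustration of converting source-specific log metadata to a common format, the sketch below renames fields through a per-source mapping and normalizes numeric timestamps to ISO 8601 in UTC. The common field list, the source names, and the raw field names are hypothetical; the slides do not define a schema.

```python
from datetime import datetime, timezone

# Assumed common format; a real repository schema would be agreed on by contributors.
COMMON_FIELDS = ("timestamp", "source", "machine", "event", "detail")

# Hypothetical mappings from each source's field names to the common names.
FIELD_MAPS = {
    "eecs": {"time": "timestamp", "host": "machine", "type": "event", "msg": "detail"},
    "boinc": {"ts": "timestamp", "client_id": "machine", "kind": "event", "info": "detail"},
}

def to_common(source, raw):
    """Convert one raw record from `source` into the common field layout."""
    mapping = FIELD_MAPS[source]
    row = {field: None for field in COMMON_FIELDS}
    row["source"] = source
    for raw_name, value in raw.items():
        common_name = mapping.get(raw_name)
        if common_name is not None:  # unmapped fields are dropped
            row[common_name] = value
    ts = row["timestamp"]
    if isinstance(ts, (int, float)):
        # Treat numeric timestamps as Unix epoch seconds.
        row["timestamp"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    return row

a = to_common("eecs", {"time": "2005-03-01T09:00:00+00:00", "host": "pc-17", "type": "app_crash"})
b = to_common("boinc", {"ts": 1109635200, "client_id": "c-991", "kind": "os_crash", "build": "x"})
print(a)
print(b)
```

Once every source produces rows with the same keys, tables from different contributors can be cross-linked on shared columns such as `machine` or `timestamp`.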


Thoughts on Collection/Anonymization

  • Defining necessary and sufficient

    • Bad example: Cannot correlate crashes if we get rid of all user/machine names

    • Good example: Hash user/machine names

  • Default: hide if not necessary?

  • What would it take for you not to invoke the legal dept?
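
The difference between the bad and the good example can be shown on a few made-up crash records. This is only a sketch: the `pseudonym` helper and the salt are hypothetical, and a salted hash is used for brevity where a keyed hash would resist guessing attacks better.

```python
import hashlib
from collections import Counter

def pseudonym(name, salt="repository-salt"):
    """Map a user/machine name to a stable, opaque identifier."""
    return hashlib.sha256((salt + name).encode("utf-8")).hexdigest()[:12]

# Hypothetical crash records: (machine name, error code).
crashes = [
    ("eecs-pc-17", "0xC0000005"),
    ("eecs-pc-17", "0xC0000005"),
    ("eecs-pc-42", "0x0000007E"),
]

# Bad example: dropping names leaves no way to tell which crashes share a machine.
dropped = [code for _, code in crashes]

# Good example: hashing names hides them but keeps records from one machine linked.
hashed = [(pseudonym(machine), code) for machine, code in crashes]
per_machine = Counter(machine for machine, _ in hashed)

print(dropped)
print(sorted(per_machine.values()))
```

After hashing, it is still visible that one machine crashed twice with the same error code, which is exactly the correlation that removing the names would destroy.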


Thoughts on Storage/Analysis

  • Use time/data source as primary key?

  • How domain-specific should the common format be?

  • Management logistics…

  • Access control…
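
One reading of the primary-key question is a composite key of (timestamp, data source). The sketch below tries that in an in-memory SQLite database to show the trade-off; the table name, columns, and rows are assumptions, not a proposed repository schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        timestamp TEXT NOT NULL,
        source    TEXT NOT NULL,
        machine   TEXT,
        event     TEXT,
        PRIMARY KEY (timestamp, source)
    )
    """
)

# Hypothetical rows: the same instant is fine as long as the source differs.
rows = [
    ("2005-03-01T09:00:00Z", "eecs", "m1", "app_crash"),
    ("2005-03-01T09:00:00Z", "boinc", "m2", "os_crash"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# A second event from the same source at the same instant violates the key.
try:
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        ("2005-03-01T09:00:00Z", "eecs", "m3", "app_hang"),
    )
    collided = False
except sqlite3.IntegrityError:
    collided = True

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print("rows stored:", count, "| collision on duplicate key:", collided)
```

The rejected insert suggests why the question is open: if a source can report two events with the same timestamp, the key would need another component (such as a machine identifier or sequence number) or a finer time resolution.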


Acronym Suggestions???

Open Source (Failure) Data Repository

