Reduce Instrumentation Predictors Using Random Forests

Presented By Bin Zhao

Department of Computer Science

University of Maryland

May 3, 2005


Motivation

  • Crash reports – by the time the program crashes, it is too late to collect program information

  • Testing – large number of test cases. Can we focus on the failing cases?


Motivation – failure prediction

  • Instrument program to monitor behavior

  • Predict if the program is going to fail

  • Collect program data if the program is predicted likely to fail

  • Stop running the test if the test program is not likely to fail


The problem

  • Large number of instrumentation predictors

  • Which instrumentation predictors should be picked?


The questions to answer

  • Can a good model be found for predicting failing runs based on all available data?

  • Can an equally good model be created based on a random selection of k% of the predictors?


Experiment

  • Instrumentation on a calculator program

  • 295 predictors

  • Instrumentation data collected every 50 milliseconds

  • 100 runs – 81 successes, 19 failures

  • Predictors: 275, 250, 225, 200, 175, 150, 125, 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10


Sample data

  • Pass Run

    Run   Res   Rec   0x40bda0   DataItem3   MSF-0x40bda0   MSF-DataItem3
      1   pass    1       3244           0           3244               0
      1   pass    2       3206           0           3206               0
      1   pass    3       3232           0           3232               0
      1   pass    4       3203           0           3203               0
      1   pass    5       3243           0           3243               0

  • Failure Run

    Run   Res   Rec   0x40bda0   DataItem3   MSF-0x40bda0   MSF-DataItem3
     10   fail    1       3200           0           3200               0
     10   fail    2       3200           0           3200               0
     10   fail    3       3251           0           3251               0
     10   fail    4       3251           0           3251               0
     10   fail    5       3248           0           3248               0


Background – Random Forests

  • Many classification trees

  • Each tree gives a classification – a vote

  • The classification with the most votes is chosen


Background – Random Forests

  • Need a training set to grow the forests

  • At each node, m predictors (the mtry parameter) are randomly selected to split the node

  • About one-third of the training data (the out-of-bag, or OOB, sample) is left out of each tree and used to estimate the error


Background – Random Forests

  • To classify a test run as pass or fail

  • Sample model estimation

    OOB error rate: 0.0044

            fail   pass   class.error
    fail     933     17        0.0179
    pass       5   4045        0.0012


Background - R

  • A software environment for data manipulation, analysis, and calculation

  • Provides scripting capability

  • Provides an implementation of Random Forests (the randomForest package)


Experiment steps

  • Determine which slice of the data to use for modeling and which for testing

  • Find which parameters (ntree, mtry) affect the model

  • Find the optimal parameter values for all the random models

  • Build the random models by randomly picking N predictors

  • Verify the random models by prediction



Influential parameters in Random Forest

  • Two possible parameters – ntree and mtry

  • Build models by fixing either ntree or mtry and varying the other

  • ntree: 200 – 1000

  • mtry: 10 – 295

  • Only mtry matters
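
A sketch of that parameter sweep, reusing the runs data frame from the earlier sketch; the oob() helper that extracts the final OOB error from a fit is an added convenience, not something from the deck:

    # Final OOB error of a fitted forest (helper, assumed).
    oob <- function(fit) fit$err.rate[nrow(fit$err.rate), "OOB"]

    # Fix mtry at its default and vary ntree ...
    for (nt in seq(200, 1000, by = 200))
      print(c(ntree = nt, err = oob(randomForest(Res ~ ., runs, ntree = nt))))

    # ... then fix ntree at its default and vary mtry.
    for (m in c(10, 50, 100, 200, 295))
      print(c(mtry = m, err = oob(randomForest(Res ~ ., runs, mtry = m))))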


Optimal mtry

  • Need to decide the optimal mtry for different numbers of predictors (N)

  • The default mtry is the square root of N

  • For different numbers of predictors (295 – 10): N/2 – 3N
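
One way to search for the optimal mtry is a plain grid search over candidates, reusing oob() from the sketch above (the candidate grid here is an assumption):

    # Return the candidate mtry with the lowest OOB error.
    best.mtry <- function(data, candidates) {
      errs <- sapply(candidates, function(m)
        oob(randomForest(Res ~ ., data, mtry = m)))
      candidates[which.min(errs)]
    }

    best.mtry(runs, c(5, 10, 17, 30, 60, 120))   # example candidates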


Random model

  • Randomly pick the predictors from the full set of predictors

  • Generate 5 sets of data for each number of predictors

  • Use the 5 sets of data to build the random forest models and average the results
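
A sketch of that averaging procedure; the function name and the use of randomForest's x/y interface are additions beyond the deck's description:

    # Average OOB error over 5 forests, each built on k random predictors.
    random.model.error <- function(runs, k, reps = 5) {
      preds <- setdiff(names(runs), "Res")
      mean(replicate(reps, {
        cols <- sample(preds, k)
        oob(randomForest(runs[, cols], runs$Res))
      }))
    }

    sapply(c(100, 50, 25, 10), function(k) random.model.error(runs, k))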


Random prediction

  • For each trained random forest, predict on a completely different set of test data (records 401 – 450)
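
A minimal sketch of that held-out prediction; the deck does not say which records formed the training slice, so the 1 – 400 split below is an assumption:

    train <- runs[1:400, ]     # assumed training slice
    test  <- runs[401:450, ]   # the held-out records named in the deck
    fit   <- randomForest(Res ~ ., train)
    table(predicted = predict(fit, test), actual = test$Res)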




Important predictors

  • Random Forests can assign an importance to each predictor – here, the number of correct votes involving the predictor

  • Top 20 important predictors

    DataItem11 RT-DataItem11 PC-DataItem11 MSF-DataItem11

    AC-DataItem11 RT-DataItem9 RT-DataItem6 PC-DataItem6

    AC-DataItem6 MSF-DataItem9 MSF-DataItem6 PC-DataItem9

    DataItem9 AC-DataItem9 DataItem6 DataItem12

    MSF-DataItem12 AC-DataItem12 RT-DataItem12 PC-DataItem12
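
The randomForest package exposes importance through its built-in measures (mean decrease in accuracy or in Gini index), which may be computed differently from the vote-count measure described above; a sketch of extracting a top-20 list:

    fit <- randomForest(Res ~ ., runs, importance = TRUE)
    imp <- importance(fit, type = 1)   # type 1 = mean decrease in accuracy
    head(rownames(imp)[order(imp, decreasing = TRUE)], 20)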


Top model

  • Pick the most important predictors from the full set to build the model (top 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10)
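
A sketch of rebuilding with only the top k predictors, reusing imp and oob() from the sketches above:

    # OOB error of a forest built on the k most important predictors.
    top.k.error <- function(runs, imp, k) {
      cols <- rownames(imp)[order(imp, decreasing = TRUE)][1:k]
      oob(randomForest(runs[, cols], runs$Res))
    }

    sapply(c(100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10),
           function(k) top.k.error(runs, imp, k))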



Observation and analysis

  • The fail error rate is still high (> 30%)

  • Not all the runs fail at the same time

  • Fail:Success = 19:81 (too few fail cases to build a good model)

  • Some predictors are raw, while others are derived – MSF, AC, PC, RT


Improvements

  • Use only the last N records of a particular run

  • For a set of data, randomly drop some pass data and duplicate the fail data (see the sketch after this list)

  • Randomly pick the raw predictors, then include all of their derived predictors
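
A sketch of the rebalancing step from the second bullet; the keep and duplication ratios are assumptions:

    # Randomly drop pass records and duplicate fail records to rebalance.
    rebalance <- function(runs, keep.pass = 0.5, dup.fail = 2) {
      pass <- runs[runs$Res == "pass", ]
      fail <- runs[runs$Res == "fail", ]
      pass <- pass[sample(nrow(pass), floor(keep.pass * nrow(pass))), ]
      rbind(pass, fail[rep(seq_len(nrow(fail)), dup.fail), ])
    }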




Conclusion so far

  • Random selection does not achieve a good error rate

  • Some predictors have a stronger prediction power

  • A small set of important predictors can achieve a good error rate


Future work

  • Why do some predictors have stronger prediction power?

  • Is there any pattern among the important predictors?

  • How many important predictors should we pick?

  • How soon can we predict a failing run before it actually fails?




