Reduce Instrumentation Predictors Using Random Forests

### Reduce Instrumentation Predictors Using Random Forests

Presented By Bin Zhao

Department of Computer Science

University of Maryland

May 3, 2005

Motivation
• Crash reports – waiting until the program crashes is too late to collect program information
• Testing – large number of test cases. Can we focus on the failing cases?
Motivation – failure prediction
• Instrument program to monitor behavior
• Predict if the program is going to fail
• Collect program data if the program is predicted to likely fail
• Stop running the test if the test program is not likely to fail
The problem
• Large number of instrumentation predictors
• Which instrumentation predictors should be picked?
• Can a good model be found for predicting failing runs based on all available data?
• Can an equally good model be created based on a random selection of k% of the predictors?
Experiment
• Instrumentation on a calculator program
• 295 predictors
• Instrumentation data collected every 50 milliseconds
• 100 runs – 81 success, 19 failure
• Numbers of predictors tested: 275, 250, 225, 200, 175, 150, 125, 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10
Sample data
• Pass Run

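| Run | Res | Rec | 0x40bda0 | DataItem3 | MSF-0x40bda0 | MSF-DataItem3 |
|---|---|---|---|---|---|---|
| 1 | pass | 1 | 3244 | 0 | 3244 | 0 |
| 1 | pass | 2 | 3206 | 0 | 3206 | 0 |
| 1 | pass | 3 | 3232 | 0 | 3232 | 0 |
| 1 | pass | 4 | 3203 | 0 | 3203 | 0 |
| 1 | pass | 5 | 3243 | 0 | 3243 | 0 |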

• Failure Run

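| Run | Res | Rec | 0x40bda0 | DataItem3 | MSF-0x40bda0 | MSF-DataItem3 |
|---|---|---|---|---|---|---|
| 10 | fail | 1 | 3200 | 0 | 3200 | 0 |
| 10 | fail | 2 | 3200 | 0 | 3200 | 0 |
| 10 | fail | 3 | 3251 | 0 | 3251 | 0 |
| 10 | fail | 4 | 3251 | 0 | 3251 | 0 |
| 10 | fail | 5 | 3248 | 0 | 3248 | 0 |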

Background – Random Forests
• Many classification trees
• Each tree gives a classification – vote
• The classification is chosen by the most votes
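As a minimal sketch of the voting step (in Python for illustration; the presentation itself used R), each tree contributes one label and the most frequent label wins. The vote list below is hypothetical.

```python
from collections import Counter

def majority_vote(tree_votes):
    """Return the class label that received the most votes."""
    # most_common(1) yields [(label, count)] for the top label.
    # On a tie, the label seen first wins, which is an arbitrary choice here.
    label, _count = Counter(tree_votes).most_common(1)[0]
    return label

# Hypothetical votes from five trees for a single test record.
votes = ["pass", "fail", "pass", "pass", "fail"]
print(majority_vote(votes))
```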
Background – Random Forests
• Need a training set to grow the forests
• At each node, mtry predictors are randomly selected as candidates for splitting the node
• About one-third of the training data is left out of each tree's bootstrap sample (out-of-bag, oob) and is used to estimate the error
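The two sampling ideas on this slide can be sketched as follows (Python for illustration, not the R code used in the experiment). A bootstrap sample of n rows drawn with replacement leaves out roughly (1 - 1/n)^n, about 37%, of the rows, which is where the "one-third" comes from. The predictor names and the mtry value of 17 are placeholders.

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw a bootstrap sample of row indices; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]  # sample with replacement
    oob = sorted(set(range(n_rows)) - set(in_bag))           # rows never drawn
    return in_bag, oob

def candidate_predictors(predictors, mtry, rng):
    """Pick mtry distinct predictors to consider for one node split."""
    return rng.sample(predictors, mtry)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
in_bag, oob = bootstrap_split(5000, rng)
print(f"out-of-bag fraction: {len(oob) / 5000:.2f}")

# Hypothetical predictor names, not the ones from the experiment.
predictors = [f"P{i}" for i in range(295)]
candidates = candidate_predictors(predictors, 17, rng)
print(candidates)
```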
Background – Random Forests
• The forest is used to classify a test run as pass or fail
• Sample model estimate:

OOB error rate: 0.0044

| | fail | pass | class.error |
|---|---|---|---|
| fail | 933 | 17 | 0.0178947368421053 |
| pass | 5 | 4045 | 0.00123456790123455 |
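The error figures in this sample output follow directly from the counts. The sketch below (Python, for illustration) assumes rows are the true class and columns the predicted class, which is consistent with the class.error column: 17/950 for fail, 5/4050 for pass, and 22/5000 overall.

```python
# Counts copied from the sample output; rows = true class, columns = predicted class (assumed).
confusion = {
    "fail": {"fail": 933, "pass": 17},
    "pass": {"fail": 5, "pass": 4045},
}

def class_error(confusion, label):
    """Fraction of records of the given true class that were misclassified."""
    row = confusion[label]
    wrong = sum(count for predicted, count in row.items() if predicted != label)
    return wrong / sum(row.values())

def overall_error(confusion):
    """Fraction of all records that were misclassified."""
    total = sum(sum(row.values()) for row in confusion.values())
    wrong = sum(
        count
        for actual, row in confusion.items()
        for predicted, count in row.items()
        if predicted != actual
    )
    return wrong / total

print(class_error(confusion, "fail"))
print(class_error(confusion, "pass"))
print(round(overall_error(confusion), 4))
```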

Background – R
• Software for data manipulation, analysis and calculation
• Provides scripting capability
• Provides an implementation of Random Forests
Experiment steps
• Determine which slices of the data to use for modeling and for testing
• Find which parameters (ntree, mtry) affect the model
• Find the optimal parameter values for all the random models
• Build the random models by randomly picking N predictors
• Verify the random models by prediction
Influential parameters in Random Forest
• Two possible parameters – ntree and mtry
• Build models by fixing either ntree or mtry and varying the other
• ntree: 200 – 1000
• mtry: 10 – 295
• Only mtry matters
Optimal mtry
• Need to decide the optimal mtry for each number of predictors (N)
• The default mtry is the square root of N
• For the different numbers of predictors (295 – 10): N/2 – 3N
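For reference, the default can be tabulated for each predictor count used in the experiment. This Python sketch assumes the default is the floor of the square root of N, as in the R randomForest package for classification; the slide does not say which rounding was applied.

```python
import math

# Predictor counts from the experiment: the full set plus each reduced size.
sizes = [295, 275, 250, 225, 200, 175, 150, 125, 100,
         90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10]

def default_mtry(n_predictors):
    """Floor of sqrt(N), never below 1 (assumed default for classification)."""
    return max(1, math.isqrt(n_predictors))

for n in sizes:
    print(n, default_mtry(n))
```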
Random model
• Randomly pick the predictors from the full set of the predictors
• Generate 5 sets of data for each number of predictors
• Use the 5 sets of data to build the random forest models and average the results
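The selection-and-averaging procedure can be sketched as below (Python for illustration). The predictor names are placeholders, and `evaluate` is a hypothetical callback: in the real experiment it would train a random forest on the chosen predictors and return its error rate, whereas the demo passes a trivial stand-in.

```python
import random

def random_predictor_sets(all_predictors, k, n_sets=5, seed=0):
    """Draw n_sets independent random subsets of k predictors each."""
    rng = random.Random(seed)
    return [sorted(rng.sample(all_predictors, k)) for _ in range(n_sets)]

def average_over_sets(predictor_sets, evaluate):
    """Apply evaluate to every subset and return the mean score."""
    scores = [evaluate(subset) for subset in predictor_sets]
    return sum(scores) / len(scores)

# Hypothetical predictor names standing in for the 295 real ones.
all_predictors = [f"P{i}" for i in range(295)]
sets_of_50 = random_predictor_sets(all_predictors, 50)

# Stand-in evaluator: returns the subset size, only to show the call shape.
print(average_over_sets(sets_of_50, lambda subset: float(len(subset))))
```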
Random prediction
• For each trained random forest, predict on a totally different set of test data (records 401 – 450)
Important predictors
• Random Forests can assign an importance score to each predictor – the number of correct votes involving the predictor
• Top 20 important predictors

DataItem11, RT-DataItem11, PC-DataItem11, MSF-DataItem11,
AC-DataItem11, RT-DataItem9, RT-DataItem6, PC-DataItem6,
AC-DataItem6, MSF-DataItem9, MSF-DataItem6, PC-DataItem9,
DataItem9, AC-DataItem9, DataItem6, DataItem12,
MSF-DataItem12, AC-DataItem12, RT-DataItem12, PC-DataItem12

Top model
• Pick the top important predictors from the full set of the predictors to build the model (top 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10)
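Selecting the top-k predictors amounts to ranking by importance and truncating. In this Python sketch the predictor names are taken from the slides, but the importance scores are invented placeholders, not values from the experiment.

```python
def top_k_predictors(importance, k):
    """Return the k predictor names with the highest importance scores."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]

# Placeholder scores; only the names appear in the presentation.
importance = {
    "DataItem11": 0.9,
    "RT-DataItem9": 0.6,
    "AC-DataItem6": 0.5,
    "DataItem3": 0.1,
}

print(top_k_predictors(importance, 2))
```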
Observation and analysis
• The error rate on failing runs is still high (> 30%)
• Not all the runs fail at the same time
• Fail:Success = 19:81 (too few failing cases to build a good model)
• Some predictors are raw, while others are derived – MSF, AC, PC, RT
Improvements
• Get the last N records for a particular run
• For a set of data, randomly drop some pass data and duplicate the fail data
• Randomly pick raw predictors, then include all of their derived predictors
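The three improvements could be sketched roughly as follows (Python for illustration). Several details are assumptions: records are modeled as (run, record number, result) tuples, the keep fraction and duplication count are arbitrary, and derived predictors are assumed to be named by prefixing the raw name with MSF-, AC-, PC- or RT-, as in MSF-DataItem3.

```python
import random
from collections import defaultdict

def last_n_records(records, n):
    """Keep only the last n records of each run, ordered by record number."""
    by_run = defaultdict(list)
    for rec in records:
        by_run[rec[0]].append(rec)
    kept = []
    for run_id in sorted(by_run):
        kept.extend(sorted(by_run[run_id], key=lambda r: r[1])[-n:])
    return kept

def rebalance(records, keep_pass_fraction, fail_copies, rng):
    """Randomly drop some pass records and duplicate every fail record."""
    out = []
    for rec in records:
        if rec[2] == "fail":
            out.extend([rec] * fail_copies)
        elif rng.random() < keep_pass_fraction:
            out.append(rec)
    return out

DERIVED_PREFIXES = ("MSF-", "AC-", "PC-", "RT-")  # assumed naming scheme

def pick_raw_with_derived(raw_predictors, k, rng):
    """Pick k raw predictors and add the derived variants of each."""
    chosen = []
    for raw in rng.sample(raw_predictors, k):
        chosen.append(raw)
        chosen.extend(prefix + raw for prefix in DERIVED_PREFIXES)
    return chosen

rng = random.Random(0)
records = [(1, i, "pass") for i in range(1, 6)] + [(10, i, "fail") for i in range(1, 6)]
tail = last_n_records(records, 2)
balanced = rebalance(records, keep_pass_fraction=0.5, fail_copies=3, rng=rng)
picked = pick_raw_with_derived(["DataItem3", "DataItem6", "DataItem9"], 2, rng)
print(tail)
print(len(balanced), picked)
```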
Conclusion so far
• Random selection does not achieve a good error rate
• Some predictors have a stronger prediction power
• A small set of important predictors can achieve a good error rate
Future work
• Why do some predictors have stronger prediction power?
• Is there any pattern among the important predictors?
• How many important predictors should we pick?
• How early can we predict a failing run before it actually fails?