Using Data Mining and Bootstrapping to Develop Simple Models for Obtaining Confidence Intervals for the Percentage of Alcohol Related Crashes Joni Nunnery and Helmut Schneider
Why Data Mining? • The NHTSA estimate is for the USA • State estimates are not readily available • Need for reliable standard errors for states • Standard error: 0.3% for the USA, 2% for LA • State estimates may be affected by local variables • Non-crash independent variables may change over time • DWI versus pretrial diversion • Imputation (IM) estimates require a complicated statistical technique • Data mining tools are used in various applications
Approach • Analysis of Louisiana crash data, 1999-2002 • A data mining model is used to predict alcohol involvement • Estimation of the standard error via a bootstrap-type simulation
Classification Models • Logistic Regression • Naive Bayes • Neural Network • Classification Tree
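The four classifier families listed above can be compared side by side. A minimal sketch, assuming scikit-learn and synthetic data in place of the Louisiana crash records (the dataset, feature count, and model settings here are illustrative, not the authors' actual configuration):

```python
# Sketch: fit and score the four classifier families named on the slide.
# Synthetic data stands in for the crash records used in the study.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)  # hold out half, as in the study

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
    "Classification Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: holdout accuracy {model.score(X_test, y_test):.3f}")
```

Scoring each model on the same holdout split keeps the comparison fair.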
Classification Tree • Fit model to half the data • Tree model • What did we learn? • Importance of variables
Violation • Hour of Day • Vehicle Type • Age • Injury • Parish • Number of Vehicles • Belt Usage • Day of Week • Gender
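The variable ranking above comes from the fitted tree. A minimal sketch of how such a ranking is produced, assuming scikit-learn's `feature_importances_` and simulated data (the feature names echo the slide, but the data and which variables matter are invented here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature names echoing the slide's variable list
features = ["Violation", "Hour of Day", "Vehicle Type", "Age", "Injury",
            "Parish", "Number of Vehicles", "Belt Usage", "Day of Week",
            "Gender"]
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, len(features)))
# Simulated outcome driven mostly by the first two columns
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Fit the tree to half the data, as the slide describes
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[:500], y[:500])

# Rank variables by the tree's impurity-based importance
ranked = sorted(zip(features, tree.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name:20s} {imp:.3f}")
```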
Standard Error • Simulation on the second half of the data set gives the estimated error • Evaluate the combined standard error • The resulting standard error is 1% for 900 crashes
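A bootstrap-type simulation like the one described can be sketched as follows. The predicted probabilities here are simulated stand-ins for the model's output on 900 crashes; the resampling loop, `B`, and the Beta distribution are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the model's predicted probability of alcohol involvement
# on a holdout set of 900 crashes (matching the slide's sample size)
p_hat = rng.beta(2, 8, size=900)

B = 2000                          # number of bootstrap replications
boot_pcts = np.empty(B)
for b in range(B):
    # Resample crashes with replacement and recompute the percentage
    sample = rng.choice(p_hat, size=p_hat.size, replace=True)
    boot_pcts[b] = 100.0 * sample.mean()

se = boot_pcts.std(ddof=1)        # bootstrap standard error (in % points)
lo, hi = np.percentile(boot_pcts, [2.5, 97.5])
print(f"estimate {100 * p_hat.mean():.1f}%  "
      f"SE {se:.2f}  95% CI [{lo:.1f}, {hi:.1f}]")
```

The spread of the resampled percentages gives the standard error, and its percentiles give a confidence interval, which is the quantity the title of the talk is after.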
Conclusion • Data mining is a simple and useful tool for predicting missing observations • The best predictor of alcohol-related crashes is the judgment of a well-trained police officer on the scene