Using Data Mining and Bootstrapping to Develop Simple Models for Obtaining Confidence Intervals for the Percentage of Alcohol Related Crashes Joni Nunnery and Helmut Schneider
Why Data Mining? • The NHTSA estimate is for the USA • State estimates are not readily available • Need for reliable standard errors for states • Standard error: 0.3% for the USA, 2% for LA • State estimates may be affected by local variables • Non-crash independent variables may change over time • DWI versus pretrial diversion • Imputation (IM) estimates require a complicated statistical technique • Data mining tools are used in various applications
Approach • Analysis of Louisiana crash data, 1999-2002 • A data mining model is used to predict alcohol involvement • Estimation of the standard error via a bootstrap-type simulation
Classification Models • Logistic Regression • Naive Bayes • Neural Network • Classification Tree
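The four classifier families listed above can be compared side by side. A minimal sketch, assuming scikit-learn and synthetic data in place of the Louisiana crash records (the dataset, feature count, and model settings here are illustrative, not the authors' actual configuration):

```python
# Sketch: fit and score the four classifier families named on the slide.
# Synthetic data stands in for the crash records used in the study.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)  # hold out half, as in the study

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
    "Classification Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: holdout accuracy {model.score(X_test, y_test):.3f}")
```

Scoring each model on the same holdout split keeps the comparison fair.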
Classification Tree • Fit model to half the data • Tree model • What did we learn? • Importance of variables
Violation • Hour of Day • Vehicle Type • Age • Injury • Parish • Number of Vehicles • Belt Usage • Day of Week • Gender
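The variable ranking above comes from the fitted tree. A minimal sketch of how such a ranking is produced, assuming scikit-learn's `feature_importances_` and simulated data (the feature names echo the slide, but the data and which variables matter are invented here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature names echoing the slide's variable list
features = ["Violation", "Hour of Day", "Vehicle Type", "Age", "Injury",
            "Parish", "Number of Vehicles", "Belt Usage", "Day of Week",
            "Gender"]
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, len(features)))
# Simulated outcome driven mostly by the first two columns
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Fit the tree to half the data, as the slide describes
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[:500], y[:500])

# Rank variables by the tree's impurity-based importance
ranked = sorted(zip(features, tree.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name:20s} {imp:.3f}")
```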
Standard Error • Simulation on the second half of the data set gives the estimated error • Evaluate the combined standard error • The resulting standard error is 1% for 900 crashes
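A bootstrap-type simulation like the one described can be sketched as follows. The predicted probabilities here are simulated stand-ins for the model's output on 900 crashes; the resampling loop, `B`, and the Beta distribution are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the model's predicted probability of alcohol involvement
# on a holdout set of 900 crashes (matching the slide's sample size)
p_hat = rng.beta(2, 8, size=900)

B = 2000                          # number of bootstrap replications
boot_pcts = np.empty(B)
for b in range(B):
    # Resample crashes with replacement and recompute the percentage
    sample = rng.choice(p_hat, size=p_hat.size, replace=True)
    boot_pcts[b] = 100.0 * sample.mean()

se = boot_pcts.std(ddof=1)        # bootstrap standard error (in % points)
lo, hi = np.percentile(boot_pcts, [2.5, 97.5])
print(f"estimate {100 * p_hat.mean():.1f}%  "
      f"SE {se:.2f}  95% CI [{lo:.1f}, {hi:.1f}]")
```

The spread of the resampled percentages gives the standard error, and its percentiles give a confidence interval, which is the quantity the title of the talk is after.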
Conclusion • Data mining is a simple and useful tool for predicting missing observations • The best predictor of alcohol-related crashes is the judgment of a well-trained police officer on the scene