KDD Cup 2004
Winning Model for Task 1: Particle Physics Prediction
David S. Vogel: MEDai / AI Insight, University of Central Florida
Eric Gottschalk: MEDai / AI Insight
Morgan C. Wang: University of Central Florida
Orlando, FL
What did we know?
• Given 12 million numbers.
• No information given about what these numbers represent.
• No knowledge of particle physics.
• Predict 100,000 ones and zeros.
Unsuccessful Modeling Packages
• Software #1: Tree-based boosting algorithms
• Software #2: Logistic Regression and Neural Networks
• Software #3: Support Vector Machines
• Software #4: Rule-finding algorithms
Key Modeling Tools
• MITCH (Multiple Intelligent Tasking Computer Heuristics)
  • Used for its visualizations, variable analysis, transformations, Neural Networks, and scoring tools.
• NICA (Numerical Interaction CAlibrator)
  • Used to detect interactions within the data.
Category Analysis
• Nearly one tenth of records fall into categories that are 100% predictive.
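A minimal sketch of how such categories might be found. It assumes "100% predictive" means that every training record in a category shares one class. The function name `pure_categories`, the `min_count` threshold, and the toy data are illustrative and do not come from the slides.

```python
from collections import defaultdict

def pure_categories(categories, labels, min_count=1):
    """Return {category: label} for categories whose records all share one label.

    min_count is a hypothetical guard against trusting very small categories.
    """
    seen = defaultdict(set)    # labels observed per category
    counts = defaultdict(int)  # number of records per category
    for cat, y in zip(categories, labels):
        seen[cat].add(y)
        counts[cat] += 1
    return {
        cat: next(iter(ys))
        for cat, ys in seen.items()
        if len(ys) == 1 and counts[cat] >= min_count
    }

# Toy data: category "b" is mixed, "a" and "c" are pure.
cats = ["a", "a", "b", "b", "c", "c", "c"]
ys = [1, 1, 0, 1, 0, 0, 0]
pure = pure_categories(cats, ys, min_count=2)
```

Records that land in a pure category can then be scored directly from the lookup table instead of by the model.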
Investigation of Variables
• Group 1: 8 variables with values {-1,0,1}. Interactive and symmetric.
• Group 2: A key nominal variable.
• Group 3: 6 individually predictive variables.
• Group 4: All other variables, with no correlation to the dependent variable.
Predictor V01: r = .006
[Figure: Class 1 Probability vs. V01]
Predictor V01 where V04 = 1
[Figure: Class 1 Probability vs. V01]
Predictor V01 where V04 = -1
[Figure: Class 1 Probability vs. V01]
Predictor V04*(V01-0.75): r = .23
[Figure: Class 1 Probability vs. V01]
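The four slides above suggest that V01 has opposite effects depending on the sign of V04, so its overall correlation is near zero until the two are combined. The sketch below reproduces that effect on synthetic data only. The data-generating rule, the sample size, and the restriction of V04 to {-1, 1} are assumptions made for illustration, so the resulting correlations will not match the slide's values.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)
v01, v04, y = [], [], []
for _ in range(20000):
    a = rng.uniform(0.0, 1.5)   # synthetic stand-in for V01
    s = rng.choice([-1, 1])     # synthetic stand-in for V04 (0 omitted for simplicity)
    # Assumed rule: the slope of V01's effect flips with the sign of V04.
    p = 0.5 + 0.4 * s * (a - 0.75)
    v01.append(a)
    v04.append(s)
    y.append(1 if rng.random() < p else 0)

r_raw = pearson(v01, y)  # V01 alone: effects for V04 = 1 and V04 = -1 cancel
r_inter = pearson([s * (a - 0.75) for a, s in zip(v01, v04)], y)  # interaction term
```

Centering V01 at 0.75 before multiplying matters: it places the crossover point of the two opposite trends at zero, so the product has a consistent sign relationship with the class.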
Interactions between variables
• Red: Extremely strong
• Green: Strong
• Yellow: Moderate (p < .01)
Details of 639 Predictors
• Majority of original variables (after null value replacement)
• 100% predictive groups
• High volume categories of the nominal variable
• 2 variables indicating null values
• 72 first order interactions
• 185 second order interactions
• 301 third order interactions
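Two of the predictor types listed above, null indicators and interaction terms, can be sketched as follows. The slides do not say how nulls were replaced or how interaction order was counted, so the fill value and the choice to build interactions as plain products of `n_vars` variables are assumptions. The helper names are hypothetical.

```python
from itertools import combinations
from math import prod

def add_null_indicator(column, fill):
    """Replace None with `fill` and return (filled_column, null_flags)."""
    filled = [fill if v is None else v for v in column]
    flags = [1 if v is None else 0 for v in column]
    return filled, flags

def interaction_terms(row, names, n_vars):
    """Products of every combination of `n_vars` variables in one record."""
    return {
        "*".join(combo): prod(row[name] for name in combo)
        for combo in combinations(names, n_vars)
    }

filled, flags = add_null_indicator([1.0, None, 3.0], fill=2.0)
row = {"V01": 2.0, "V02": 3.0, "V03": -1.0}
pairs = interaction_terms(row, ["V01", "V02", "V03"], n_vars=2)
```

Generating every combination grows quickly with the number of variables, which is presumably why a screening step such as the interaction analysis shown earlier would be used to keep only the significant terms.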
Model Details
• 40,000 training cases
• 10,000 validation cases
• MITCH Self-Organizing Neural Network
• “Bernoulli” function optimization generally performed the best
• Generalized extremely well on validation set, considering the number of variables
• Small secondary model based on residuals
Customization
• Severe penalty for incorrect probabilities of 0 or 1: a “googol”!!!
• “Gimmees” forced to be at 0.995 or 0.005. Accept 9300 tiny penalties to avoid risking “disaster.”
• 14 teams had a “disaster.”
• Remaining predictions truncated at 0.01 and 0.99 to compensate for over-fitting at extremes.
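The trade-off described above can be sketched under the assumption that the penalized metric is a log-loss (cross-entropy) style score. The slides do not name the metric, so that choice, the function names, and the reading of "gimmees" as records in 100% predictive groups are all assumptions.

```python
import math

def log_loss_one(p, y):
    """Negative log-likelihood of one 0/1 outcome y under predicted probability p."""
    q = p if y == 1 else 1.0 - p
    # A confident wrong answer (q == 0) is unboundedly bad under this metric.
    return math.inf if q == 0.0 else -math.log(q)

def customize(p, gimme_class=None):
    """Apply the two adjustments from the slide to one predicted probability.

    gimme_class: 1 or 0 when the record is treated as a sure thing, else None.
    """
    if gimme_class is not None:
        return 0.995 if gimme_class == 1 else 0.005
    return min(max(p, 0.01), 0.99)  # truncate the remaining predictions

# Total cost of 9300 small penalties, if every forced prediction is right.
gimme_cost = 9300 * log_loss_one(0.995, 1)
# Cost of a single wrong prediction made at exactly 0 or 1.
disaster_cost = log_loss_one(1.0, 0)
```

Under this assumed metric the 9300 forced predictions add a small, bounded amount to the total, while a single wrong prediction at exactly 0 or 1 is unbounded.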
Customization (continued)
• Q-Score predictions were maximized by retraining with a “creative” optimization function: (Predicted – Actual) ^ 6.
• Predictions re-calibrated using the function: [formula not preserved in this transcript]
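A minimal sketch of the sixth-power objective named above, with its gradient with respect to the predictions. Averaging over records is an assumption, since the slide gives only the per-record expression. The re-calibration function is not shown because the slide text does not include it.

```python
def sixth_power_loss(predicted, actual):
    """Mean of (predicted - actual)^6 over all records (averaging is assumed)."""
    n = len(predicted)
    return sum((p - a) ** 6 for p, a in zip(predicted, actual)) / n

def sixth_power_grad(predicted, actual):
    """Gradient of the mean sixth-power loss with respect to each prediction."""
    n = len(predicted)
    return [6 * (p - a) ** 5 / n for p, a in zip(predicted, actual)]

# Compare how a small and a large error are weighted.
small = sixth_power_loss([0.9], [1])
large = sixth_power_loss([0.5], [1])
grad = sixth_power_grad([0.5], [1])
```

Compared with squared error, the sixth power makes small errors almost free and concentrates the training signal on records that are badly wrong.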
Where do we go from here?
• Accuracy -- independent of content
• Scientific & Industry Applications