KDD Cup 2004

KDD Cup 2004 Winning Model for Task 1: Particle Physics Prediction • David S. Vogel: MEDai / AI Insight, University of Central Florida • Eric Gottschalk: MEDai / AI Insight • Morgan C. Wang: University of Central Florida • Orlando, FL

What did we know? • Given 12 million numbers. • No information given about what these numbers represent. • No knowledge of particle physics. • Predict 100,000 ones and zeros.

Unsuccessful Modeling Packages • Software #1: Tree-based boosting algorithms • Software #2: Logistic Regression and Neural Networks • Software #3: Support Vector Machines • Software #4: Rule-finding algorithms

Key Modeling Tools • MITCH (Multiple Intelligent Tasking Computer Heuristics) • Used for its visualizations, variable analysis, transformations, Neural Networks, and scoring tools. • NICA (Numerical Interaction CAlibrator) • Used to detect interactions within the data.

Category Analysis • Nearly one tenth of records are 100% predictive.
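As a minimal sketch of how such fully predictive categories might be found: group the records by a nominal variable and flag every value whose records all share one class. The function name `pure_categories`, the `min_count` threshold, and the toy data are assumptions made for illustration; the slides do not describe the actual procedure.

```python
from collections import defaultdict

def pure_categories(categories, labels, min_count=2):
    """Return {category: class} for categories whose records all share one class."""
    counts = defaultdict(lambda: [0, 0])  # category -> [n_class0, n_class1]
    for cat, y in zip(categories, labels):
        counts[cat][y] += 1
    pure = {}
    for cat, (n0, n1) in counts.items():
        # Require a minimum number of records so singletons are not flagged.
        if n0 + n1 >= min_count and (n0 == 0 or n1 == 0):
            pure[cat] = 1 if n1 > 0 else 0
    return pure

# Toy data (hypothetical): "a" is always class 1, "c" is always class 0.
cats = ["a", "a", "b", "b", "c", "c", "c", "d"]
ys = [1, 1, 0, 1, 0, 0, 0, 1]
pure = pure_categories(cats, ys)
```

On real data the `min_count` threshold matters: a category seen only a handful of times can look 100% predictive by chance.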

Investigation of Variables • Group 1: 8 variables with values {-1,0,1}. Interactive and symmetric. • Group 2: A key nominal variable. • Group 3: 6 individually predictive variables. • Group 4: All other variables, no correlation with the dependent variable.

Complete Interaction Search

Predictor V01: r=.006 (plot: Class 1 Probability vs. V01)

Predictor V01 where V04=1 (plot: Class 1 Probability vs. V01)

Predictor V01 where V04=-1 (plot: Class 1 Probability vs. V01)

Predictor V04*(V01-0.75): r=.23 (plot: Class 1 Probability vs. V01)
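The effect in these four slides can be reproduced on synthetic data. The sketch below assumes a sign variable `v04` in {-1, 1} that flips the direction in which `v01` influences the class, with 0.75 as the pivot; the data-generating process, its coefficients, and the variable names are invented for illustration and are not the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic stand-ins (assumed): v04 is a random sign, v01 is uniform on [0, 1.5].
v04 = rng.choice([-1, 1], size=n)
v01 = rng.uniform(0.0, 1.5, size=n)

# The class probability rises with v01 when v04 = 1 and falls when v04 = -1.
p = 0.5 + 0.4 * v04 * (v01 - 0.75)
y = (rng.random(n) < p).astype(float)

# Alone, v01 looks useless because the two opposite trends cancel.
r_raw = np.corrcoef(v01, y)[0, 1]

# The interaction term aligns both trends in the same direction.
r_inter = np.corrcoef(v04 * (v01 - 0.75), y)[0, 1]
```

Here `r_raw` stays close to zero while `r_inter` is clearly positive, which mirrors the jump from r=.006 to r=.23 reported on the slides.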

Interactions between variables • Red: Extremely Strong • Green: Strong • Yellow: Moderate (p<.01)
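An exhaustive pairwise search of this kind can be sketched as follows. This is not NICA, whose internals the slides do not describe; it is a simple stand-in that scores every product of two centered variables by its absolute Pearson correlation with the target. The function name `rank_pairwise_interactions`, the `top_k` parameter, and the synthetic data are assumptions.

```python
from itertools import combinations
import numpy as np

def rank_pairwise_interactions(X, y, top_k=3):
    """Score each product of two centered columns by |Pearson r| with y."""
    Xc = X - X.mean(axis=0)  # center so a product reflects joint deviation
    scores = []
    for i, j in combinations(range(X.shape[1]), 2):
        prod = Xc[:, i] * Xc[:, j]
        if prod.std() == 0:
            continue  # a constant product carries no information
        r = np.corrcoef(prod, y)[0, 1]
        scores.append((abs(r), i, j))
    scores.sort(reverse=True)
    return [(i, j, s) for s, i, j in scores[:top_k]]

# Synthetic check (assumed data): only columns 0 and 3 interact.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 5))
y = (X[:, 0] * X[:, 3] + 0.5 * rng.normal(size=5000) > 0).astype(float)
top = rank_pairwise_interactions(X, y)
```

The cost grows quadratically with the number of variables for pairs and faster for higher orders, so a search over higher-order terms would normally be pruned using the strongest lower-order results.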

Details of 639 Predictors • Majority of original variables (after null value replacement) • 100% predictive groups • High volume categories of the nominal variable • 2 variables indicating null values • 72 first order interactions • 185 second order interactions • 301 third order interactions

Model Details • 40,000 training cases • 10,000 validation cases • MITCH Self-Organizing Neural Network • “Bernoulli” function optimization generally performed the best • Generalized extremely well on validation set, considering the number of variables • Small secondary model based on residuals

Customization • Severe penalty for incorrect probabilities of 0 or 1: a “googol”!!! • “Gimmees” forced to be at 0.995 or 0.005. Accept 9300 tiny penalties to avoid risking “disaster.” • 14 teams had a “disaster.” • Remaining predictions truncated at 0.01 and 0.99 to compensate for over-fitting at extremes.
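The trade-off behind these caps can be illustrated under the assumption that the penalty behaves like cross-entropy (log loss), where a wrong prediction of exactly 0 or 1 costs an unbounded amount. The slides describe the actual penalty only as a "googol", so the infinite value and the helper names `log_loss_one` and `customize` below are illustrative assumptions.

```python
import math

def log_loss_one(p, y):
    """Cross-entropy penalty for one record; infinite if p is exactly wrong."""
    q = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return math.inf if q <= 0.0 else -math.log(q)

def customize(p, is_gimme=False):
    """Apply the caps from the slide: gimmes at 0.995/0.005, others in [0.01, 0.99]."""
    if is_gimme:
        return 0.995 if p >= 0.5 else 0.005
    return min(max(p, 0.01), 0.99)

tiny = log_loss_one(customize(1.0, is_gimme=True), 1)    # gimme turns out right
capped = log_loss_one(customize(1.0, is_gimme=True), 0)  # gimme turns out wrong
disaster = log_loss_one(1.0, 0)                          # uncapped and wrong
```

Under this assumption, each correct gimme costs roughly 0.005 instead of 0, while a single wrong one costs about 5.3 instead of an unbounded penalty.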

Customization (continued) • Q-Score predictions were maximized by retraining with a “creative” optimization function: (Predicted – Actual) ^ 6. • Predictions re-calibrated using the function:
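The sixth-power objective named above can be written down directly. The sketch below gives the mean loss and its gradient with respect to the predictions; how MITCH used this objective during retraining is not described on the slides, so the function names and the toy values are assumptions.

```python
import numpy as np

def sixth_power_loss(pred, actual):
    """Mean of (pred - actual)^6 over all records."""
    return float(np.mean((pred - actual) ** 6))

def sixth_power_grad(pred, actual):
    """Gradient of the mean sixth-power loss with respect to each prediction."""
    return 6.0 * (pred - actual) ** 5 / pred.size

# Toy example (assumed values): one small error and one large error.
pred = np.array([0.1, 0.9])
actual = np.array([0.0, 0.0])
loss = sixth_power_loss(pred, actual)
grad = sixth_power_grad(pred, actual)
```

Compared with squared error, the sixth power makes small errors almost free and concentrates nearly all of the loss and gradient on the few badly wrong predictions.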

Where do we go from here? • Accuracy -- independent of content • Scientific & Industry Applications

Questions?
