Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 www-stat.wharton.upenn.edu/~bob
Overview • Critical stages of data mining process • Choosing the right data, people, and problems • Modeling • Validation • Automated modeling • Feature creation and selection • Exploiting expert knowledge, “insights” • Applications • Little detail – Biomedical: finding predictive risk factors • More detail – Financial: predicting returns on the market • Lots of detail – Credit: anticipating the onset of bankruptcy
Predicting Health Risk • Who is at risk for a disease? • Example: detect osteoporosis without expense of x-ray • Goals • Improving public health • Savings on medical care • Confirm an informal model with data mining • Many types of features, interested groups • Clinical observations of doctors • Laboratory measurements, “genetic” • Self-reported behavior • Missing data
Predicting the Stock Market • Small, “hands-on” example • Goals • Better retirement savings? • Money for that special vacation? College? • Trade-offs: risk vs return • Lots of “free” data • Access to accurate historical time trends, macro factors • Recent data more useful than older data • “Simple” modeling technique • Validation
Predicting the Market: Specifics • Build a regression model • Response is return on the value-weighted S&P • Use standard forward/backward stepwise • Battery of 12 predictors with interactions • Train the model during 1992-1996 (training data) • Model captures most of variation in 5 years of returns • Retain only the most significant features (Bonferroni) • Predict returns in 1997 (validation data) • Another version in Foster, Stine & Waterman
Fitted model predicts... Exceptional Feb return?
What happened? Training Period
Claimed versus Actual Error [Figure: squared prediction error, claimed vs. actual]
Over-confidence? • Over-fitting • Model fits the training data too well – better than it can predict the future. • Greedy fitting procedure “Optimization capitalizes on chance” • Some intuition • Coincidences • Cancer clusters, the “birthday problem” • Illustration with an auction • What is the value of the coins in this jar?
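A minimal sketch of “optimization capitalizes on chance”, not the model from the talk: the response and every candidate predictor are pure noise, yet the best of many candidates still looks predictive in the training sample. The sample sizes (60 training months, 240 fresh observations) and the count of 1000 candidates are illustrative assumptions.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_noise_predictor(n_train=60, n_valid=240, n_candidates=1000, seed=1):
    """Greedily pick the noise predictor with the largest training correlation.

    All sizes are hypothetical; nothing here is real market data.
    Returns (training correlation, correlation on fresh data).
    """
    rng = random.Random(seed)
    n = n_train + n_valid
    y = [rng.gauss(0, 1) for _ in range(n)]
    best_in, best_out = 0.0, 0.0
    for _ in range(n_candidates):
        x = [rng.gauss(0, 1) for _ in range(n)]
        r_in = correlation(x[:n_train], y[:n_train])
        if abs(r_in) > abs(best_in):
            best_in = r_in
            best_out = correlation(x[n_train:], y[n_train:])
    return best_in, best_out


in_sample, out_of_sample = best_noise_predictor()
print(f"training correlation of the selected predictor: {in_sample:.2f}")
print(f"same predictor on fresh data:                   {out_of_sample:.2f}")
```

The selected predictor has no real relationship to the response, so the gap between the two numbers is entirely due to the search.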
Auctions and Over-fitting What is the value of these coins?
Auctions and Over-fitting • Auction jar of coins to a class of MBA students • Histogram shows the bids of 30 students • Most were suspicious, but a few were not! • Actual value is $3.85 • Known as “Winner’s Curse” • Similar to over-fitting: best model like high bidder
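The Winner's Curse can be sketched with a small simulation. The jar value ($3.85) and the number of bidders (30) come from the slide; the assumption that each bid is an unbiased guess with normally distributed error of $1.00 is purely illustrative and does not reproduce the classroom histogram.

```python
import random

TRUE_VALUE = 3.85  # actual value of the coins, from the slide
N_BIDDERS = 30     # class size, from the slide


def simulate_auctions(n_auctions=2000, noise_sd=1.0, seed=0):
    """Return (average bid, average winning bid) over many simulated auctions.

    noise_sd is a hypothetical spread of the bidders' guesses.
    """
    rng = random.Random(seed)
    total_mean, total_max = 0.0, 0.0
    for _ in range(n_auctions):
        bids = [rng.gauss(TRUE_VALUE, noise_sd) for _ in range(N_BIDDERS)]
        total_mean += sum(bids) / N_BIDDERS
        total_max += max(bids)
    return total_mean / n_auctions, total_max / n_auctions


avg_bid, avg_winning_bid = simulate_auctions()
print(f"average bid:         {avg_bid:.2f}")
print(f"average winning bid: {avg_winning_bid:.2f}")
```

Even when the bidders are right on average, the highest bid is systematically too high, just as the best-fitting model among many candidates overstates its own accuracy.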
Profiting from data mining? • Where’s the profit in this? • “Mining the miners” vs getting value from your data • Lost opportunities • Importance of domain knowledge • Validation as a measure of success • Prediction provides an explicit check • Does your application predict something?
Pitfalls and Role of Management Over-fitting is dominated by other issues… • Management support • Life in silos • Coordination across domains • Responsibility and reward • Accountability • Who gets the credit when it succeeds? Who suffers if the project is not successful?
Specific Potholes • Moving targets • “Let’s try this with something else.” • Irrational expectations • “I could have done better than that.” • Not with my data • “It’s our data. You can’t use it.” • “You did not use our data properly.”
Back to a real application… Emphasis on the statistical issues…
Predicting Bankruptcy • Goal • Reduce losses stemming from personal bankruptcy • Possible strategies • If we can identify those with the highest risk of bankruptcy… take some action • Call them for a “friendly chat” about circumstances • Unilaterally reduce credit limit • Trade-off • Good customers borrow lots of money • Bad customers also borrow lots of money
Predicting Bankruptcy • “Needle in a haystack” • 3,000,000 months of credit-card activity • 2244 bankruptcies • Simple predictor that all are OK looks pretty good. • What factors anticipate bankruptcy? • Spending patterns? Payment history? • Demographics? Missing data? • Combinations of factors? • Cash Advance + Las Vegas = Problem • We consider more than 100,000 predictors!
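A quick piece of arithmetic shows why the “all are OK” predictor looks so good. Only the two counts come from the slide; the variable names are illustrative.

```python
months = 3_000_000    # months of credit-card activity, from the slide
bankruptcies = 2244   # bankruptcies observed, from the slide

base_rate = bankruptcies / months
accuracy_all_ok = 1 - base_rate  # share of months labeled correctly
found_by_all_ok = 0              # this rule never flags a bankruptcy

print(f"bankruptcy rate per month: {base_rate:.6f}")
print(f"accuracy of 'all are OK':  {accuracy_all_ok:.6f}")
print(f"bankruptcies identified:   {found_by_all_ok} of {bankruptcies}")
```

Raw accuracy is therefore a poor yardstick here: the rule is almost always right and still finds none of the cases that matter.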
Modeling: Predictive Models • Build the model: identify patterns in training data that predict future observations. • Which features are real? Coincidental? • Evaluate the model: how do you know that it works? • During the model construction phase • Only incorporate meaningful features • After the model is built • Validate by predicting new observations
Are all prediction errors the same? • Symmetry • Is over-predicting as costly as under-predicting? • Managing inventories and sales • Visible costs versus hidden costs • Does a false positive = a false negative? • Classification in data mining • Credit modeling, flagging “risky” customers • False positive: call a good customer “bad” • False negative: fail to identify a “bad” • Differential costs for different types of errors
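One way to make differential costs concrete is an expected-cost rule: flag a customer when the expected cost of flagging is lower than the expected cost of doing nothing. This is a generic decision-theory sketch, not a rule stated in the talk, and the dollar costs in the usage lines are hypothetical.

```python
def flag_threshold(cost_false_positive, cost_false_negative):
    """Probability of 'bad' above which flagging has the lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def should_flag(p_bad, cost_false_positive, cost_false_negative):
    """Compare the expected cost of flagging with the expected cost of ignoring."""
    expected_cost_if_flagged = (1 - p_bad) * cost_false_positive
    expected_cost_if_ignored = p_bad * cost_false_negative
    return expected_cost_if_flagged < expected_cost_if_ignored


# Hypothetical costs: annoying a good customer costs 10, a missed bankruptcy 990.
print(flag_threshold(10, 990))
print(should_flag(0.05, 10, 990))
print(should_flag(0.005, 10, 990))
```

With equal costs the cut-off is 0.5; the more a missed “bad” customer costs relative to a false alarm, the lower the probability at which flagging pays.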
Building a Predictive Model So many choices… • Structure: What type of model? • Neural net • CART, classification tree • Additive model or regression spline • Identification: Which features to use? • Time lags, “natural” transformations • Combinations of other features • Search: How does one find these features? • Brute force has become cheap.
Our Choices • Structure • Linear regression with nonlinearity via interactions • All 2-way and some 3-way, 4-way interactions • Missing data handled with indicators • Identification • Conservative standard error • Comparison of conservative t-ratio to adaptive threshold • Search • Forward stepwise regression • Coming: Dynamically changing list of features • Good choice affects where you search next.
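Handling missing data with indicators can be sketched as follows: each feature with gaps gets a filled-in copy plus a 0/1 column marking where the value was missing, so the regression can learn whether missingness itself is predictive. Filling with the mean of the observed values is an assumption made for this sketch; the slide does not say which fill value was used.

```python
def add_missing_indicator(values):
    """Return (filled values, 0/1 missing indicator) for one feature.

    Missing entries are given as None. The fill value (mean of the observed
    entries) is an assumption of this sketch.
    """
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed) if observed else 0.0
    filled = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


filled, was_missing = add_missing_indicator([1.0, None, 3.0])
print(filled, was_missing)
```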
Identifying Predictive Features • Classical problem of “variable selection” • Thresholding methods (compare t-ratio to threshold) • Akaike information criterion (AIC) • Bayes information criterion (BIC) • Hard thresholding and Bonferroni • Arguments for adaptive thresholds • Empirical Bayes • Information theory • Step-up/step-down tests
Adaptive Thresholding • Threshold changes to conform to attributes of data • Easier to add features as more are found. • Threshold for first predictor • Compare conservative t-ratio to Bonferroni. • Bonferroni is about Sqrt(2 log p) • If something significant is found, continue. • Threshold for second predictor • Compare t-ratio to reduced threshold • New threshold is about Sqrt(2 log(p/2))
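The threshold sequence can be sketched in a few lines, reading the threshold for the q-th predictor as sqrt(2 log(p/q)). This is a simplification: it applies the thresholds to a fixed list of t-ratios, whereas the stepwise procedure in the talk recomputes conservative t-ratios after each feature is added. The function names and the example t-ratios are hypothetical.

```python
import math


def adaptive_threshold(p, q):
    """Approximate threshold for the q-th predictor out of p candidates."""
    return math.sqrt(2 * math.log(p / q))


def count_selected(t_ratios, p):
    """Count how many t-ratios pass the decreasing sequence of thresholds.

    Assumes len(t_ratios) <= p. The largest |t| is compared with the
    Bonferroni-like threshold (q = 1), the next with q = 2, and so on,
    stopping at the first failure.
    """
    ordered = sorted((abs(t) for t in t_ratios), reverse=True)
    q = 0
    for t in ordered:
        if t > adaptive_threshold(p, q + 1):
            q += 1
        else:
            break
    return q


p = 100_000  # order of magnitude of the candidate predictors in the talk
example_t = [6.0, 5.0, 4.7, 3.0, 1.0]  # made-up t-ratios
fixed = sum(abs(t) > adaptive_threshold(p, 1) for t in example_t)
print("selected with fixed Bonferroni threshold:", fixed)
print("selected with adaptive thresholds:       ", count_selected(example_t, p))
```

In this made-up example the third t-ratio falls short of the fixed Bonferroni threshold but clears the reduced threshold for q = 3, which illustrates “easier to add features as more are found”.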
Adaptive Thresholding: Benefits • Easy: as easy and fast to implement as the standard criterion used in stepwise regression. • Theory: resulting model is provably as good as the best Bayes model for the problem at hand. • Real world: it works! Finds models with real signal, and stops when the signal runs out.
Bankruptcy Model: Construction • Data: reserve 80% for validation • Training data • 600,000 months • 458 bankruptcies • Validation data • 2,400,000 months • 1786 bankruptcies • Selection via adaptive thresholding • Compare sequence of t-statistics to Sqrt(2 log(p/q)) • Dynamic expansion of feature space
Bankruptcy Model: Preview • Predictors • Initial search identifies 39 • Validation SS monotonically falls to 1650 • Linear fit can do no better than 1735 • Expanded search of higher interactions finds a bit more • Nature of predictors comprising the interactions • Validation SS drops 10 more • Validation: Lift chart • Top 1000 candidates have 351 bankrupt • More validation: Calibration • Close to actual Pr(bankrupt) for most groups.
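To put the validation sums of squares in context, it may help to compute what a model with no predictors would score. The sketch below assumes that “Validation SS” is the sum of squared prediction errors on the validation data with the response coded 0/1; the slide does not state this explicitly. The counts are taken from the construction slide.

```python
n_valid = 2_400_000   # validation months
n_bankrupt = 1786     # bankruptcies in the validation data

# Predicting 0 for everyone: each bankrupt month contributes an error of 1.
ss_predict_zero = n_bankrupt * 1.0

# Predicting the overall rate p for everyone.
p = n_bankrupt / n_valid
ss_predict_rate = n_bankrupt * (1 - p) ** 2 + (n_valid - n_bankrupt) * p ** 2

print(f"SS when predicting 0 for all:        {ss_predict_zero:.1f}")
print(f"SS when predicting the overall rate: {ss_predict_rate:.1f}")
for label, ss in [("linear fit", 1735), ("initial search", 1650)]:
    print(f"{label}: {100 * (1 - ss / ss_predict_rate):.1f}% below the no-predictor SS")
```

Under that assumption, both baselines sit near 1786, so the reported values of 1735 and 1650 are modest reductions in absolute terms on a very rare event.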
Bankruptcy Model: Fitting • Where should the fitting process be stopped?
Bankruptcy Model: Fitting • Our adaptive selection procedure stops at a model with 39 predictors.
Bankruptcy Model: Validation • The validation indicates that the fit gets better while the model expands. Avoids over-fitting.
Bankruptcy Model: Linear? • Choosing from linear predictors (no interactions) does not match the performance of the full search.
Bankruptcy Model: More? • Searching higher-order interactions offers modest improvement.
Lift Chart • Measures how well model classifies sought-for group • Depends on rule used to label customers • Very high threshold: lots of lift, but few bankrupt customers are found. • Lower threshold: lift drops, but finds more bankrupt customers.
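Lift at a given cut-off can be computed as the hit rate among the top-scored customers divided by the overall hit rate. The function below is a generic sketch with made-up scores; the last lines apply the same ratio to figures reported in the talk (351 bankrupt among the top 1000 candidates), assuming those counts refer to the validation data of 2,400,000 months with 1786 bankruptcies.

```python
def lift_at(scores, labels, k):
    """Hit rate among the k highest-scored cases divided by the overall hit rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = sum(label for _, label in ranked[:k])
    overall_rate = sum(labels) / len(labels)
    return (found / k) / overall_rate


# Toy data: four customers, one of whom goes bankrupt.
toy_scores = [0.9, 0.8, 0.1, 0.2]
toy_labels = [1, 0, 0, 0]
print(lift_at(toy_scores, toy_labels, 1))  # strict rule: only the top customer
print(lift_at(toy_scores, toy_labels, 4))  # flag everyone: no lift

# Same ratio from the reported counts (assumed to be validation data).
slide_lift = (351 / 1000) / (1786 / 2_400_000)
print(f"lift for the top 1000 candidates: about {slide_lift:.0f}")
```

Moving k up and down traces out the lift chart: small k gives high lift but few bankruptcies found, while large k finds more of them at lower lift.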
Generic Lift Chart [Figure: lift curves for model and random selection]
Bankruptcy Model: Lift • Much better than diagonal!
Calibration • Classifier assigns Prob(“BR”) rating to a customer. • Weather forecast • Among those classified as 2/10 chance of “BR”, how many are BR? • Closer to diagonal is better.
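A calibration check can be sketched by grouping customers by predicted probability and comparing the average prediction in each group with the observed frequency. The equal-width bins and the function name are choices made for this sketch, not details from the talk.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Return (mean predicted prob, observed rate, count) for each non-empty bin.

    probs are predicted probabilities in [0, 1]; outcomes are 0/1.
    Bins are equal-width intervals of the predicted probability.
    """
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[b] += p
        hits[b] += y
        counts[b] += 1
    return [
        (sums[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


# Ten customers rated at a 2/10 chance, of whom two go bankrupt.
table = calibration_table([0.2] * 10, [1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
for mean_pred, observed, count in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

Plotting observed rate against mean prediction gives the calibration chart; points on the diagonal mean the claimed probabilities match what actually happened.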
Bankruptcy Model: Calibration • Over-predicts risk above claimed probability 0.4
Summary of Bankruptcy Model • Automatic, adaptive selection • Finds patterns that predict new observations • Predictive, but not easy to explain • Dynamic feature set • Current research • Information theory allows changing search space • Finds more structure than direct search could find • Validation • Essential only for judging fit. • Better than “hand-made models” that take years to create.
So, where’s the profit in DM? • Automated modeling has become very powerful, avoiding problems of over-fitting. • Role for expert judgment remains • What data to use? • Which features to try first? • What are the economics of the prediction errors? • Collaboration • Data sources • Data analysis • Strategic decisions