Profiting from Data Mining

157 Views

Download Presentation
## Profiting from Data Mining

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Profiting from Data Mining**Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 www-stat.wharton.upenn.edu/~bob**Overview**• Critical stages of data mining process • Choosing the right data, people, and problems • Modeling • Validation • Automated modeling • Feature creation and selection • Exploiting expert knowledge, “insights” • Applications • Little detail – Biomedical: finding predictive risk factors • More detail – Financial: predicting returns on the market • Lots of detail – Credit: anticipating the onset of bankruptcy**Predicting Health Risk**• Who is at risk for a disease? • Example: detect osteoporosis without expense of x-ray • Goals • Improving public health • Savings on medical care • Confirm an informal model with data mining • Many types of features, interested groups • Clinical observations of doctors • Laboratory measurements, “genetic” • Self-reported behavior • Missing data**Predicting the Stock Market**• Small, “hands-on” example • Goals • Better retirement savings? • Money for that special vacation? College? • Trade-offs: risk vs return • Lots of “free” data • Access to accurate historical time trends, macro factors • Recent data more useful than older data • “Simple” modeling technique • Validation**Predicting the Market: Specifics**• Build a regression model • Response is return on the value-weighted S&P • Use standard forward/backward stepwise • Battery of 12 predictors with interactions • Train the model during 1992-1996 (training data) • Model captures most of variation in 5 years of returns • Retain only the most significant features (Bonferroni) • Predict returns in 1997 (validation data) • Another version in Foster, Stine & Waterman**Fitted model predicts...**Exceptional Feb return?**What happened?**Training Period**Claimed versus Actual Error**Actual SquaredPredictionError Claimed**Over-confidence?**• Over-fitting • Model fits the training data too well – better than it can predict the future. • Greedy fitting procedure “Optimization capitalizes on chance” • Some intuition • Coincidences • Cancer clusters, the “birthday problem” • Illustration with an auction • What is the value of the coins in this jar?**Auctions and Over-fitting**What is the value of these coins?**Auctions and Over-fitting**• Auction jar of coins to a class of MBA students • Histogram shows the bids of 30 students • Most were suspicious, but a few were not! • Actual value is $3.85 • Known as “Winner’s Curse” • Similar to over-fitting:best model like high bidder**Profiting from data mining?**• Where’s the profit in this? • “Mining the miners” vs getting value from your data • Lost opportunities • Importance of domain knowledge • Validation as a measure of success • Prediction provides an explicit check • Does your application predict something?**Pitfalls and Role of Management**Over-fitting is dominated by other issues… • Management support • Life in silos • Coordination across domains • Responsibility and reward • Accountability • Who gets the credit when it succeeds?Who suffers if the project is not successful?**Specific Potholes**• Moving targets • “Let’s try this with something else.” • Irrational expectations • “I could have done better than that.” • Not with my data • “It’s our data. You can’t use it.” • “You did not use our data properly.”**Back to a real application…**Emphasis on the statistical issues…**Predicting Bankruptcy**• Goal • Reduce losses stemming from personal bankruptcy • Possible strategies • If can identify those with highest risk of bankruptcy…Take some action • Call them for a “friendly chat” about circumstances • Unilaterally reduce credit limit • Trade-off • Good customers borrow lots of money • Bad customers also borrow lots of money**Predicting Bankruptcy**• “Needle in a haystack” • 3,000,000 months of credit-card activity • 2244 bankruptcies • Simple predictor that all are OK looks pretty good. • What factors anticipate bankruptcy? • Spending patterns? Payment history? • Demographics? Missing data? • Combinations of factors? • Cash Advance + Las Vegas = Problem • We consider more than 100,000 predictors!**Modeling: Predictive Models**• Build the modelIdentify patterns in training data that predict future observations. • Which features are real? Coincidental? • Evaluate the modelHow do you know that it works? • During the model construction phase • Only incorporate meaningful features • After the model is built • Validate by predicting new observations**Are all prediction errors the same?**• Symmetry • Is over-predicting as costly as under-predicting? • Managing inventories and sales • Visible costs versus hidden costs • Does a false positive = a false negative? • Classification in data mining • Credit modeling, flagging “risky” customers • False positive: call a good customer “bad” • False negative: fail to identify a “bad” • Differential costs for different types of errors**Building a Predictive Model**So many choices… • Structure: What type of model? • Neural net • CART, classification tree • Additive model or regression spline • Identification: Which features to use? • Time lags, “natural” transformations • Combinations of other features • Search: How does one find these features? • Brute force has become cheap.**Our Choices**• Structure • Linear regression with nonlinearity via interactions • All 2-way and some 3-way, 4-way interactions • Missing data handled with indicators • Identification • Conservative standard error • Comparison of conservative t-ratio to adaptive threshold • Search • Forward stepwise regression • Coming: Dynamically changing list of features • Good choice affects where you search next.**Identifying Predictive Features**• Classical problem of “variable selection” • Thresholding methods (compare t-ratio to threshold) • Akaike information criterion (AIC) • Bayes information criterion (BIC) • Hard thresholding and Bonferroni • Arguments for adaptive thresholds • Empirical Bayes • Information theory • Step-up/step-down tests**Adaptive Thresholding**• Threshold changes to conform to attributes of data • Easier to add features as more are found. • Threshold for first predictor • Compare conservative t-ratio to Bonferroni. • Bonferroni is about Sqrt(2 log p) • If something significant is found, continue. • Threshold for second predictor • Compare t-ratio to reduced threshold • New threshold is about Sqrt(2 log p/2)**Adaptive Thresholding: Benefits**• EasyAs easy and fast as implementing the standard criterion that is used in stepwise regression. • TheoryResulting model provably as good as best Bayes model for the problem at hand. • Real worldIt works! Finds models with real signal, and stops when the signal runs out.**Bankruptcy Model: Construction**• Data: reserve 80% for validation • Training data • 600,000 months • 458 bankruptcies • Validation data • 2,400,000 months • 1786 bankruptcies • Selection via adaptive thresholding • Compare sequence of t-statistics to Sqrt(2 log p/q) • Dynamic expansion of feature space**Bankruptcy Model: Preview**• Predictors • Initial search identifies 39 • Validation SS monotonically falls to 1650 • Linear fit can do no better than 1735 • Expanded search of higher interactions finds a bit more • Nature of predictors comprising the interactions • Validation SS drops 10 more • Validation: Lift chart • Top 1000 candidates have 351 bankrupt • More validation: Calibration • Close to actual Pr(bankrupt) for most groups.**Bankruptcy Model: Fitting**• Where should the fitting process be stopped?**Bankruptcy Model: Fitting**• Our adaptive selection procedure stops at a model with 39 predictors.**Bankruptcy Model: Validation**• The validation indicates that the fit gets better while the model expands. Avoids over-fitting.**Bankruptcy Model: Linear?**• Choosing from linear predictors (no interactions) does not match the performance of the full search.**Bankruptcy Model: More?**• Searching higher-order interactions offers modest improvement.**Lift Chart**• Measures how well model classifies sought-for group • Depends on rule used to label customers • Very high thresholdLots of lift, but few bankrupt customers are found. • Lower thresholdLift drops, but finds more bankrupt customers.**Generic Lift Chart**Model Random**Bankruptcy Model: Lift**• Much better than diagonal!**Calibration**• Classifier assigns Prob(“BR”)rating to a customer. • Weather forecast • Among those classified as 2/10 chance of “BR”, how many are BR? • Closer to diagonal is better.**Bankruptcy Model: Calibration**• Over-predicts risk above claimed probability 0.4**Summary of Bankruptcy Model**• Automatic, adaptive selection • Finds patterns that predict new observations • Predictive, but not easy to explain • Dynamic feature set • Current research • Information theory allows changing search space • Finds more structure than direct search could find • Validation • Essential only for judging fit. • Better than “hand-made models” that take years to create.**So, where’s the profit in DM?**• Automated modeling has become very powerful, avoiding problems of over-fitting. • Role for expert judgment remains • What data to use? • Which features to try first? • What are the economics of the prediction errors? • Collaboration • Data sources • Data analysis • Strategic decisions