profiting from data mining l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Profiting from Data Mining PowerPoint Presentation
Download Presentation
Profiting from Data Mining

Loading in 2 Seconds...

play fullscreen
1 / 39

Profiting from Data Mining - PowerPoint PPT Presentation


  • 152 Views
  • Uploaded on

Profiting from Data Mining. Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 www-stat.wharton.upenn.edu/~bob. Overview. Critical stages of data mining process Choosing the right data, people, and problems Modeling Validation Automated modeling

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Profiting from Data Mining' - johana


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
profiting from data mining

Profiting from Data Mining

Bob Stine

Department of Statistics

The Wharton School, Univ of Pennsylvania

April 5, 2002

www-stat.wharton.upenn.edu/~bob

overview
Overview
  • Critical stages of data mining process
    • Choosing the right data, people, and problems
    • Modeling
    • Validation
  • Automated modeling
    • Feature creation and selection
    • Exploiting expert knowledge, “insights”
  • Applications
    • Little detail – Biomedical: finding predictive risk factors
    • More detail – Financial: predicting returns on the market
    • Lots of detail – Credit: anticipating the onset of bankruptcy
predicting health risk
Predicting Health Risk
  • Who is at risk for a disease?
    • Example: detect osteoporosis without expense of x-ray
  • Goals
    • Improving public health
    • Savings on medical care
    • Confirm an informal model with data mining
  • Many types of features, interested groups
    • Clinical observations of doctors
    • Laboratory measurements, “genetic”
    • Self-reported behavior
  • Missing data
predicting the stock market
Predicting the Stock Market
  • Small, “hands-on” example
  • Goals
    • Better retirement savings?
    • Money for that special vacation? College?
    • Trade-offs: risk vs return
  • Lots of “free” data
    • Access to accurate historical time trends, macro factors
    • Recent data more useful than older data
  • “Simple” modeling technique
  • Validation
predicting the market specifics
Predicting the Market: Specifics
  • Build a regression model
    • Response is return on the value-weighted S&P
    • Use standard forward/backward stepwise
    • Battery of 12 predictors with interactions
  • Train the model during 1992-1996 (training data)
    • Model captures most of variation in 5 years of returns
    • Retain only the most significant features (Bonferroni)
  • Predict returns in 1997 (validation data)
  • Another version in Foster, Stine & Waterman
fitted model predicts
Fitted model predicts...

Exceptional Feb return?

what happened
What happened?

Training Period

claimed versus actual error
Claimed versus Actual Error

Actual

SquaredPredictionError

Claimed

over confidence
Over-confidence?
  • Over-fitting
    • Model fits the training data too well – better than it can predict the future.
    • Greedy fitting procedure “Optimization capitalizes on chance”
  • Some intuition
    • Coincidences
      • Cancer clusters, the “birthday problem”
    • Illustration with an auction
      • What is the value of the coins in this jar?
auctions and over fitting
Auctions and Over-fitting

What is the value of these coins?

auctions and over fitting12
Auctions and Over-fitting
  • Auction jar of coins to a class of MBA students
  • Histogram shows the bids of 30 students
  • Most were suspicious, but a few were not!
  • Actual value is $3.85
  • Known as “Winner’s Curse”
  • Similar to over-fitting:best model like high bidder
profiting from data mining13
Profiting from data mining?
  • Where’s the profit in this?
    • “Mining the miners” vs getting value from your data
    • Lost opportunities
  • Importance of domain knowledge
  • Validation as a measure of success
    • Prediction provides an explicit check
    • Does your application predict something?
pitfalls and role of management
Pitfalls and Role of Management

Over-fitting is dominated by other issues…

  • Management support
    • Life in silos
    • Coordination across domains
  • Responsibility and reward
    • Accountability
    • Who gets the credit when it succeeds?Who suffers if the project is not successful?
specific potholes
Specific Potholes
  • Moving targets
    • “Let’s try this with something else.”
  • Irrational expectations
    • “I could have done better than that.”
  • Not with my data
    • “It’s our data. You can’t use it.”
    • “You did not use our data properly.”
back to a real application
Back to a real application…

Emphasis on the statistical issues…

predicting bankruptcy
Predicting Bankruptcy
  • Goal
    • Reduce losses stemming from personal bankruptcy
  • Possible strategies
    • If can identify those with highest risk of bankruptcy…Take some action
      • Call them for a “friendly chat” about circumstances
      • Unilaterally reduce credit limit
  • Trade-off
    • Good customers borrow lots of money
    • Bad customers also borrow lots of money
predicting bankruptcy18
Predicting Bankruptcy
  • “Needle in a haystack”
    • 3,000,000 months of credit-card activity
    • 2244 bankruptcies
    • Simple predictor that all are OK looks pretty good.
  • What factors anticipate bankruptcy?
    • Spending patterns? Payment history?
    • Demographics? Missing data?
    • Combinations of factors?
      • Cash Advance + Las Vegas = Problem
  • We consider more than 100,000 predictors!
modeling predictive models
Modeling: Predictive Models
  • Build the modelIdentify patterns in training data that predict future observations.
    • Which features are real? Coincidental?
  • Evaluate the modelHow do you know that it works?
    • During the model construction phase
      • Only incorporate meaningful features
    • After the model is built
      • Validate by predicting new observations
are all prediction errors the same
Are all prediction errors the same?
  • Symmetry
    • Is over-predicting as costly as under-predicting?
    • Managing inventories and sales
    • Visible costs versus hidden costs
  • Does a false positive = a false negative?
    • Classification in data mining
    • Credit modeling, flagging “risky” customers
    • False positive: call a good customer “bad”
    • False negative: fail to identify a “bad”
    • Differential costs for different types of errors
building a predictive model
Building a Predictive Model

So many choices…

  • Structure: What type of model?
      • Neural net
      • CART, classification tree
      • Additive model or regression spline
  • Identification: Which features to use?
      • Time lags, “natural” transformations
      • Combinations of other features
  • Search: How does one find these features?
      • Brute force has become cheap.
our choices
Our Choices
  • Structure
    • Linear regression with nonlinearity via interactions
    • All 2-way and some 3-way, 4-way interactions
    • Missing data handled with indicators
  • Identification
    • Conservative standard error
    • Comparison of conservative t-ratio to adaptive threshold
  • Search
    • Forward stepwise regression
    • Coming: Dynamically changing list of features
      • Good choice affects where you search next.
identifying predictive features
Identifying Predictive Features
  • Classical problem of “variable selection”
  • Thresholding methods (compare t-ratio to threshold)
    • Akaike information criterion (AIC)
    • Bayes information criterion (BIC)
    • Hard thresholding and Bonferroni
  • Arguments for adaptive thresholds
    • Empirical Bayes
    • Information theory
    • Step-up/step-down tests
adaptive thresholding
Adaptive Thresholding
  • Threshold changes to conform to attributes of data
    • Easier to add features as more are found.
  • Threshold for first predictor
    • Compare conservative t-ratio to Bonferroni.
    • Bonferroni is about Sqrt(2 log p)
    • If something significant is found, continue.
  • Threshold for second predictor
    • Compare t-ratio to reduced threshold
    • New threshold is about Sqrt(2 log p/2)
adaptive thresholding benefits
Adaptive Thresholding: Benefits
  • EasyAs easy and fast as implementing the standard criterion that is used in stepwise regression.
  • TheoryResulting model provably as good as best Bayes model for the problem at hand.
  • Real worldIt works! Finds models with real signal, and stops when the signal runs out.
bankruptcy model construction
Bankruptcy Model: Construction
  • Data: reserve 80% for validation
    • Training data
      • 600,000 months
      • 458 bankruptcies
    • Validation data
      • 2,400,000 months
      • 1786 bankruptcies
  • Selection via adaptive thresholding
    • Compare sequence of t-statistics to Sqrt(2 log p/q)
    • Dynamic expansion of feature space
bankruptcy model preview
Bankruptcy Model: Preview
  • Predictors
    • Initial search identifies 39
      • Validation SS monotonically falls to 1650
      • Linear fit can do no better than 1735
    • Expanded search of higher interactions finds a bit more
      • Nature of predictors comprising the interactions
      • Validation SS drops 10 more
  • Validation: Lift chart
    • Top 1000 candidates have 351 bankrupt
  • More validation: Calibration
    • Close to actual Pr(bankrupt) for most groups.
bankruptcy model fitting
Bankruptcy Model: Fitting
  • Where should the fitting process be stopped?
bankruptcy model fitting29
Bankruptcy Model: Fitting
  • Our adaptive selection procedure stops at a model with 39 predictors.
bankruptcy model validation
Bankruptcy Model: Validation
  • The validation indicates that the fit gets better while the model expands. Avoids over-fitting.
bankruptcy model linear
Bankruptcy Model: Linear?
  • Choosing from linear predictors (no interactions) does not match the performance of the full search.
bankruptcy model more
Bankruptcy Model: More?
  • Searching higher-order interactions offers modest improvement.
lift chart
Lift Chart
  • Measures how well model classifies sought-for group
  • Depends on rule used to label customers
    • Very high thresholdLots of lift, but few bankrupt customers are found.
    • Lower thresholdLift drops, but finds more bankrupt customers.
bankruptcy model lift
Bankruptcy Model: Lift
  • Much better than diagonal!
calibration
Calibration
  • Classifier assigns Prob(“BR”)rating to a customer.
  • Weather forecast
  • Among those classified as 2/10 chance of “BR”, how many are BR?
  • Closer to diagonal is better.
bankruptcy model calibration
Bankruptcy Model: Calibration
  • Over-predicts risk above claimed probability 0.4
summary of bankruptcy model
Summary of Bankruptcy Model
  • Automatic, adaptive selection
    • Finds patterns that predict new observations
    • Predictive, but not easy to explain
  • Dynamic feature set
    • Current research
    • Information theory allows changing search space
    • Finds more structure than direct search could find
  • Validation
    • Essential only for judging fit.
    • Better than “hand-made models” that take years to create.
so where s the profit in dm
So, where’s the profit in DM?
  • Automated modeling has become very powerful, avoiding problems of over-fitting.
  • Role for expert judgment remains
    • What data to use?
    • Which features to try first?
    • What are the economics of the prediction errors?
  • Collaboration
    • Data sources
    • Data analysis
    • Strategic decisions