
An Overview and Example of Data Mining

University of Rhode Island Department of Computer Science and Statistics March 30, 2007. An Overview and Example of Data Mining. Daniel T. Larose, Ph.D. Professor of Statistics Director, Data Mining @CCSU Editor, Wiley Series on Methods and Applications in Data Mining


Presentation Transcript


  1. University of Rhode Island Department of Computer Science and Statistics March 30, 2007 An Overview and Example of Data Mining Daniel T. Larose, Ph.D. Professor of Statistics Director, Data Mining @CCSU Editor, Wiley Series on Methods and Applications in Data Mining larosed@ccsu.edu www.math.ccsu.edu/larose URI Dept of Computer Science and Statistics - An Overview and Example of Data Mining - Larose

  2. Overview • Part One: • A Brief Overview of Data Mining • Part Two: • An Example of Data Mining: • Modeling Response to Direct Mail Marketing • But first, a shameless plug …

  3. Master of Science in DM at CCSU: Faculty • Dr. Roger Bilisoly (from Ohio State Univ., Statistics) • Text Mining, Intro to Data Mining • Dr. Darius Dziuda (from Warsaw Polytechnic Univ, CS) • Data Mining for Genomics and Proteomics, Biomarker Discovery • Dr. Zdravko Markov (from Sofia Univ, CS) • Data Mining (CS perspective), Machine Learning • Dr. Daniel Miller (from UConn, Statistics) • Applied Multivariate Analysis, Mathematical Statistics II, Intro to Data Mining • Dr. Krishna Saha (from Univ of Windsor, Statistics) • Intro to Data Mining using R • Dr. Daniel Larose (Program Director) (from UConn, Statistics) • Intro to Data Mining, Data Mining Methods, Applied Data Mining, Web Mining

  4. Master of Science in DM at CCSU: Program (36 credits) • Core Courses (27 credits). All available online. • Stat 521 Introduction to Data Mining (4 cr) • Stat 522 Data Mining Methods (4 cr) • Stat 523 Applied Data Mining (4 cr) • Stat 525 Web Mining • Stat 526 Data Mining for Genomics and Proteomics • Stat 527 Text Mining • Stat 416 Mathematical Statistics II • Stat 570 Applied Multivariate Analysis • Electives (6 credits; choose two) • CS 570 Topics in Artificial Intelligence: Machine Learning • CS 580 Topics in Advanced Database: Data Mining • Stat 455 Experimental Design • Stat 551 Applied Stochastic Processes • Stat 567 Linear Models • Stat 575 Mathematical Statistics III • Stat 529 Current Issues in Data Mining • Capstone Requirement: Stat 599 Thesis (3 credits)

  5. Master of Science in DM at CCSU • Only MS in DM that is entirely online. • Some courses available on campus. • Students must come to CCSU to present their thesis. • We reach students in about 30 US states and a dozen foreign countries. • Half of our students already have master’s degrees. • About 15% already have Ph.D.’s. • Typical student is a mid-career professional. • Backgrounds are diverse: Computer Science, Engineering, Finance, Chemistry, Database Admin, Statistics, etc. • www.ccsu.edu/datamining

  6. Graduate Certificate in Data Mining • 18 Credits: • Required Courses (12 credits) • Stat 521 Introduction to Data Mining • Stat 522 Data Mining Methods and Models • Stat 523 Applied Data Mining • Elective Courses (6 credits. Choose Two): • Stat 525 Web Mining • Stat 526 Data Mining for Genomics and Proteomics • Stat 527 Text Mining • Stat 529 Current Issues in Data Mining • Some other graduate-level data mining or statistics course, with approval of advisor. • No Mathematical Statistics requirement.

  7. Material for Part I Drawn From: Discovering Knowledge in Data: An Introduction to Data Mining (Wiley, 2005) • Chapter 1. An Introduction to Data Mining • Chapter 2. Data Preprocessing • Chapter 3. Exploratory Data Analysis • Chapter 4. Statistical Approaches to Estimation and Prediction • Chapter 5. K-Nearest Neighbor • Chapter 6. Decision Trees • Chapter 7. Neural Networks • Chapter 8. Hierarchical and K-Means Clustering • Chapter 9. Kohonen Networks • Chapter 10. Association Rules • Chapter 11. Model Evaluation Techniques

  8. Material for Part II Drawn From: Data Mining Methods and Models (Wiley, 2006) • Chapter 1. Dimension Reduction Methods • Chapter 2. Regression Modeling • Chapter 3. Multiple Regression and Model Building • Chapter 4. Logistic Regression • Chapter 5. Naïve Bayes Classification and Bayesian Networks • Chapter 6. Genetic Algorithms • Chapter 7. Case Study: Modeling Response to Direct-Mail Marketing

  9. No Material Drawn From: Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage (Wiley, April 2007) • Part One: Web Structure Mining • Information Retrieval and Web Search • Hyperlink-Based Ranking • Part Two: Web Content Mining • Clustering • Evaluating Clustering • Classification • Part Three: Web Usage Mining • Data Preprocessing • Exploratory Data Analysis • Association Rules, Clustering, and Classification for Web Usage Mining • With Dr. Zdravko Markov, Computer Science, CCSU

  10. Call for Book Proposals: Wiley Series on Methods and Applications in Data Mining • Suggested topics: • Data Mining in Bioinformatics • Emerging Techniques in Data Mining (e.g., SVM) • Data Mining with Evolutionary Algorithms • Drug Discovery Using Data Mining • Mining Data Streams • Visual Analysis in Data Mining • Books in press: • Data Mining for Genomics and Proteomics, by Darius Dziuda • Practical Text Mining Using Perl, by Roger Bilisoly • Contact Series Editor at larosed@ccsu.edu

  11. What is Data Mining? • “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.” • David Hand, Heikki Mannila & Padhraic Smyth, Principles of Data Mining, MIT Press, 2001

  12. Why Data Mining? • “We are drowning in information but starved for knowledge.” • John Naisbitt, Megatrends, 1984. • “The problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge, and thence up the taxonomy tree into wisdom.” • Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2005.

  13. Need for Human Direction • Automation is no substitute for human supervision and input. • Humans need to be actively involved at every phase of the data mining process. • “Rather than asking where humans fit into data mining, we should instead inquire about how we may design data mining into the very human process of problem solving.” • Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2005.

  14. “Data Mining is Easy to Do Badly” • Black box software • Powerful, “easy-to-use” data mining algorithms • Their ease of use makes misuse dangerous. • Too easy to point and click your way to disaster. • What is needed: • An understanding of the underlying algorithmic and statistical model structures. • An understanding of which algorithms are most appropriate in which situations and for which types of data.

  15. CRISP-DM: Cross-Industry Standard Process for Data Mining

  16. CRISP: DM as a Process • Business / Research Understanding Phase: Enunciate your objectives • Data Understanding Phase: EDA • Data Preparation Phase: Preprocessing • Modeling Phase: Fun and interesting! • Evaluation Phase: Confluence of results? Objectives met? • Deployment Phase: Use results to solve problem. If desired: Use lessons learned to reformulate business / research objective.

  17. What About Data Dredging? • “A sufficiently exhaustive search will certainly throw up patterns of some kind. Many of these patterns will simply be a product of random fluctuations, and will not represent any underlying structure.” • David J. Hand, Data Mining: Statistics and More? The American Statistician, May 1998.

  18. Guarding Against Data Dredging: Cross-Validation is the Key • Partition the data into a training set and a test set. • If the pattern shows up in both data sets, this decreases the probability that it represents noise. • More generally, one may use n-fold cross-validation.
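
The partitioning step on this slide can be sketched in a few lines. The following is only an illustration, not the procedure used in the talk: the function name `n_fold_splits`, the seed, and the record count are assumptions made for the example. Each record lands in exactly one test fold, so a pattern can be checked on data that played no part in finding it.

```python
import random

def n_fold_splits(n_records, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; every record is in exactly one test fold."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)          # random partition, reproducible via seed
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

# Hypothetical usage: 10 records, 5 folds.
for train_idx, test_idx in n_fold_splits(n_records=10, n_folds=5):
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

With `n_folds=2` this reduces to a single training/test partition used in both directions.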

  19. Inference and Huge Data Sets • Hypothesis testing becomes sensitive at the huge sample sizes prevalent in data mining applications. • Even very tiny effects will be found significant. • So, data mining tends to de-emphasize inference.
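
A small calculation makes the point concrete. The sketch below assumes a one-sample z-test with known standard deviation; the effect size of 0.01 standard deviations and the sample sizes are illustrative values, not figures from the talk. The effect stays fixed while only the sample size grows.

```python
import math

def two_sided_p_value(effect, sd, n):
    """Two-sided p-value of a one-sample z-test for a mean shift `effect` with known `sd`."""
    z = effect / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny effect (hypothetical: 0.01 standard deviations), growing sample size.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: p = {two_sided_p_value(effect=0.01, sd=1.0, n=n):.3g}")
```

At a million records the p-value is vanishingly small even though the effect is practically negligible, which is why statistical significance alone says little about usefulness at this scale.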

  20. Need for Transparency and Interpretability • Data mining models should be transparent • Results should be interpretable by humans • Decision Trees are transparent • Neural Networks tend to be opaque • If a customer complains about why he/she was turned down for credit, we should be able to explain why, without saying “Our neural net said so.”

  21. Part Two: Modeling Response to Direct Mail Marketing • Business Understanding Phase: • Clothing Store Purchase Data • Results of a direct mail marketing campaign • Task: Construct a classification model • For classifying customers as either responders or non-responders to the marketing campaign, • To reduce costs and increase return-on-investment

  22. Data Understanding: The Clothing Store Dataset • List of fields in the dataset (28,799 customers, 51 fields)

  23. Data Preparation and EDA Phase • Not covered in this presentation.

  24. Modeling Strategy • Apply principal components analysis to address multicollinearity. • Apply cluster analysis. Briefly profile clusters. • Balance the training data set. • Establish baseline model performance • In terms of expected profit per customer contacted. • Apply classification algorithms to training data set: • CART • C5.0 (C4.5) • Neural networks • Logistic regression.

  25. Modeling Strategy continued • Evaluate each model using test data set. • Apply misclassification costs in line with cost benefit table. • Apply overbalancing as a surrogate for misclassification costs. • Find best overbalancing proportion. • Combine predictions from four models • Using model voting. • Using mean response probabilities.

  26. Principal Components Analysis (PCA) • Multicollinearity does not degrade prediction accuracy. • But muddles individual predictor coefficients. • Interested in predictor characteristics, customer profiling, etc? • Then PCA is required. • But, if interested solely in classification (prediction, estimation), • PCA not strictly required.

  27. Report Two Model Sets: • Model Set A: • Includes principal components • All purpose model set • Model Set B: • Includes correlated predictors, not principal components • Use restricted to classification

  28. Principal Components Analysis (PCA) • Seven correlated variables. • Two components extracted • Account for 87% of variability

  29. Principal Components Analysis (PCA) • Principal Component 1: • Purchasing Habits • Customer general purchasing habits • Expect component to be strongly indicative of response • Principal Component 2: • Promotion Contacts • Unclear whether component will be associated with response • Components validated by test data set

  30. BIRCH Clustering Algorithm • Requires only one pass through data set • Scalable for large data sets • Benefit: Analyst need not pre-specify number of clusters • Drawback: Sensitive to initial records encountered • Leads to widely variable cluster solutions • Requires “outer loop” to find consistent cluster solution • Zhang, Ramakrishnan and Livny, BIRCH: A New Data Clustering Algorithm and Its Applications, Data Mining and Knowledge Discovery 1, 1997.

  31. BIRCH Clusters • Cluster 3 shows: • Higher response for flag predictors • Higher averages for numeric predictors

  32. BIRCH Clusters • Cluster 3 has the highest response rate (red). • Cluster 1: 7.6% • Cluster 2: 7.1% • Cluster 3: 33.0%

  33. Balancing the Data • For “rare” classes, provides more equitable distribution. • Drawback: Loss of data: • Here, 40% of non-responders randomly omitted • All responders retained • Proportion of responders increases from 16.58% to 24.76% • Test data set should never be balanced
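
The balancing step can be sketched as follows. This is a minimal illustration rather than the procedure used in the case study: the function names, the `is_responder` callback, and the seed are assumptions. Responders are always kept, and each non-responder is dropped independently with the given probability.

```python
import random

def balance(records, is_responder, drop_fraction, seed=0):
    """Keep every responder; drop each non-responder with probability `drop_fraction`."""
    rng = random.Random(seed)
    return [r for r in records if is_responder(r) or rng.random() >= drop_fraction]

def responder_share_after(p, drop_fraction):
    """Expected responder proportion after balancing, given the original proportion p."""
    kept_non_responders = (1 - p) * (1 - drop_fraction)
    return p / (p + kept_non_responders)

# Figures from the slide: 16.58% responders, 40% of non-responders omitted.
print(f"{responder_share_after(0.1658, 0.40):.4f}")
```

The expected share comes out at about 24.9%, close to (though not identical to) the 24.76% reported on the slide; a gap of that size would be unsurprising with random omission.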

  34. False Positive vs. False Negative: Which is Worse? • For direct mail marketing, a false negative error is probably worse than a false positive. • Generate misclassification costs based on the observed data. • Construct cost-benefit table

  35. Decision Cost / Benefit Analysis

  36. Establish Baseline Model Performance • Benchmarks • “Don’t Send a Marketing Promotion to Anyone” Model • “Send a Marketing Promotion to Everyone” Model • Will compare candidate models against this baseline error rate.

  37. Model Set A (With 50% Balancing) • No model beats benchmark of $2.63 profit per customer • Misclassification costs had not been applied • Now define FN cost = $28.40, FP cost = $2 • Outperformed baseline “Send to everyone” model
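
Profit per customer can be computed from a confusion matrix once every cell has a cost. In the sketch below, only the false negative and false positive costs come from the slide. The true positive and true negative values are assumptions, since the cost-benefit table itself is not in the transcript: a true positive is taken to earn the $28.40 less the $2 mailing cost, and a true negative is taken to cost nothing. The counts in the usage lines are hypothetical.

```python
FN_COST = 28.40                  # from the slide
FP_COST = 2.00                   # from the slide
TP_COST = -(FN_COST - FP_COST)   # assumption: a responder who is mailed nets $26.40
TN_COST = 0.0                    # assumption: not mailing a non-responder costs nothing

def profit_per_customer(tn, fp, fn, tp):
    """Mean profit per customer, taken as the negative of the mean cost."""
    n = tn + fp + fn + tp
    total_cost = tn * TN_COST + fp * FP_COST + fn * FN_COST + tp * TP_COST
    return -total_cost / n

# Hypothetical test set of 1,000 customers with 163 responders.
send_to_everyone = profit_per_customer(tn=0, fp=837, fn=0, tp=163)
send_to_no_one = profit_per_customer(tn=837, fp=0, fn=163, tp=0)
print(round(send_to_everyone, 2), round(send_to_no_one, 2))
```

Under these assumed values, a responder rate of about 16.3% gives roughly $2.63 per customer for the "send to everyone" baseline, in line with the benchmark quoted on the slide.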

  38. Model Set A: Effect of Misclassification Costs • For the 447 highlighted records: • Only 20.8% responded. • But model predicts positive response. • Due to high false negative misclassification cost.

  39. Model Set A: PCA Component 1 is Best Predictor • First principal component ($F-PCA-1), Purchasing Habits, represents both the root node split and the secondary split • Most important factor for predicting response

  40. Over-Balancing as a Surrogate for Misclassification Costs • Software limitation: • Neural network and logistic regression models in Clementine: • Lack methods for applying misclassification costs • Over-balancing is an alternate method which can achieve similar results • Starves the classifier of instances of non-response

  41. Over-Balancing as a Surrogate for Misclassification Costs • Neural network model results • Three over-balanced models outperform baseline • Properly applied, over-balancing can be used as a surrogate for misclassification costs

  42. Over-Balancing as a Surrogate for Misclassification Costs • Apply 80% - 20% over-balancing to the other models.
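
Over-balancing to a target proportion comes down to choosing what fraction of non-responders to keep. The sketch below reads "80% - 20%" as 80% responders and 20% non-responders, which is an assumption based on the earlier remark that the method starves the classifier of non-response instances. The function name and the counts are illustrative.

```python
def non_responder_keep_fraction(n_responders, n_non_responders, target_responder_share):
    """Fraction of non-responders to retain so responders reach the target share."""
    wanted_non_responders = n_responders * (1 - target_responder_share) / target_responder_share
    return min(1.0, wanted_non_responders / n_non_responders)

# Hypothetical training set: 1,658 responders and 8,342 non-responders.
keep = non_responder_keep_fraction(1658, 8342, target_responder_share=0.80)
kept_non = keep * 8342
print(f"keep {keep:.2%} of non-responders -> share {1658 / (1658 + kept_non):.2f}")
```

All responders are retained, as in ordinary balancing; only the severity of the non-responder reduction changes.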

  43. Combination Models: Voting • Smooths out strengths and weaknesses of each model • Each model supplies a prediction for each record • Count the votes for each record • Disadvantage of combination models: • Lack of easy interpretability • Four competing combination models…

  44. Combination Models: Voting Mail a Promotion only if: • All four models predict response • Protects against false positive • All four classification algorithms must agree on a positive prediction • At least three models predict response • At least two models predict response • Any model predicts response • Protects against false negatives
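
The four voting rules differ only in the number of positive votes required. A minimal sketch, in which the function name and the example predictions are illustrative:

```python
def mail_by_vote(predictions, min_votes):
    """Mail a promotion if at least `min_votes` of the models predict response."""
    return sum(predictions) >= min_votes

# Hypothetical record: three of the four models predict response.
votes = [True, True, False, True]   # e.g. CART, C5.0, neural network, logistic regression
for min_votes in (4, 3, 2, 1):
    print(min_votes, mail_by_vote(votes, min_votes))
```

Requiring all four votes is the most conservative rule (fewest false positives), while requiring only one is the most liberal (fewest false negatives).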

  45. Combination Models: Voting • None beat the logistic regression model: $2.96 profit per customer • Perhaps combination models will do better with Model Collection B…

  46. Model Collection B: Non-PCA Models • Models retain correlated variables • Use restricted to prediction only • Since the correlated variables are highly predictive • Expect Collection B will outperform the PCA models

  47. Model Collection B: CART and C5.0 • Using misclassification costs, and 50% balancing • Both models outperform the best PCA model

  48. Model Collection B: Over-Balancing • Apply over-balancing as a surrogate for misclassification costs for all models • Best performance thus far.

  49. Combination Models: Voting • Combine the four models via voting and 80%-20% over-balancing • Synergy: Combination model outperforms any individual model.

  50. Combining Models Using Mean Response Probabilities • Combine the confidences that each model reports for its decisions • Allows finer tuning of the decision space • Derive a new variable: • Mean Response Probability (MRP): • Average of response confidences of the four models.
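
The derived variable can be sketched as below. It assumes each model reports a predicted class together with a confidence in that class, so a confidence attached to a non-response prediction is converted to 1 minus that confidence before averaging. That conversion, the function names, and the example values are assumptions made for illustration.

```python
def response_probability(predicted_response, confidence):
    """Turn (prediction, confidence in that prediction) into a probability of response."""
    return confidence if predicted_response else 1.0 - confidence

def mean_response_probability(model_outputs):
    """Average the response probabilities implied by each model's output."""
    probs = [response_probability(pred, conf) for pred, conf in model_outputs]
    return sum(probs) / len(probs)

# Hypothetical outputs of the four models for one customer.
outputs = [(True, 0.9), (True, 0.6), (False, 0.7), (True, 0.8)]
mrp = mean_response_probability(outputs)
print(round(mrp, 2))
```

Because MRP is continuous, the mailing decision can use any cutoff between 0 and 1 instead of the four discrete vote counts, which is what allows the finer tuning mentioned on the slide.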
