
Data Mining Processes

Presentation Transcript


  1. Data Mining Processes
  • Identify actionable results

  2. CRISP-DM
  • Cross-Industry Standard Process for Data Mining
  • One of the first comprehensive attempts at a standard process model for data mining
  • Independent of industry sector & technology

  3. CRISP-DM Phases
  • Business (or problem) understanding
  • Data understanding
  • Data preparation
    • Transform & create the data set for modeling
  • Modeling
  • Evaluation
    • Check that the models are good; evaluate to ensure nothing was missed
  • Deployment

  4. Business Understanding
  • Solve a specific problem
  • A clear problem definition helps
  • Set measurable success criteria
  • Convert business objectives to a set of data-mining goals
    • What to achieve in technical terms

  5. Data Understanding
  • Related data can come from many sources
  • Internal
    • ERP (or MIS)
    • Data warehouse
  • External
    • Government data
    • Commercial data
  • Created
    • Research

  6. Data Preparation
  • Clean data
    • Fix formats and gaps; filter outliers & redundancies
  • Unify numerical scales
    • Nominal data: assign arbitrary codes
    • Ordinal data: nominal code or an ordered scale
    • Cardinal data: already numerical
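
As a concrete illustration of these codings, here is a minimal Python sketch using pandas; the column names and values are invented for the example, not taken from the slides.

    # Coding the three data types onto numerical scales (illustrative data)
    import pandas as pd

    df = pd.DataFrame({
        "color":  ["red", "blue", "red"],      # nominal: no order, code arbitrarily
        "rating": ["low", "high", "medium"],   # ordinal: order matters, code by rank
        "income": [48_000, 61_500, 52_250],    # cardinal: already numeric
    })

    # Nominal data: one-hot (dummy) coding avoids implying a false order
    nominal_coded = pd.get_dummies(df["color"], prefix="color")

    # Ordinal data: map categories onto an ordered integer scale
    rating_scale = {"low": 1, "medium": 2, "high": 3}
    df["rating_code"] = df["rating"].map(rating_scale)

    # Cardinal data can be rescaled to a unified range, e.g. min-max to [0, 1]
    df["income_scaled"] = (df["income"] - df["income"].min()) / (
        df["income"].max() - df["income"].min()
    )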

  7. Types of Data

  8. Modeling
  • Data treatment
    • Training set
    • Test set
    • Maybe others
  • Techniques
    • Association
    • Classification
    • Clustering
    • Prediction
    • Sequential patterns
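
A minimal sketch of the training/test treatment, assuming scikit-learn; the toy records and labels below are placeholders, not the slides' data.

    # Hold out part of the data so the model is scored on unseen records
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X = [[3, 3, 1, 1], [2, 3, 1, 2], [1, 2, 1, 1], [1, 1, 2, 1],
         [3, 2, 2, 2], [2, 1, 2, 1], [1, 3, 1, 2], [3, 1, 1, 1]]
    y = ["no", "yes", "yes", "no", "yes", "yes", "no", "no"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)   # classification technique
    print(model.score(X_test, y_test))                       # accuracy on the test set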

  9. Evaluation
  • Does the model meet the business objectives?
  • Are any important business objectives not addressed?
  • Does the model make sense?
  • Is the model actionable?

  10. Deployment
  • Ongoing monitoring & maintenance
  • Evaluate performance against success criteria
  • Market reaction & competitor changes

  11. Example
  • Training set for computer purchase
  • 16 records
  • 5 attributes
  • Goal: find a classifier for consumer behavior

  12. Database (1st half)

  13. Database (2nd half)

  14. Data Selection
  • Gender has a weak relationship with purchase (based on correlation)
  • Drop gender
  • Selected attribute set: {Age, Income, Student, Credit}
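
A sketch of the correlation check behind dropping Gender; the 0/1 codings and the 0.2 cutoff below are illustrative assumptions, not the slides' actual data or threshold.

    # Correlation-based attribute selection (illustrative data)
    import numpy as np

    gender = np.array([0, 1, 0, 1, 1, 0, 0, 1])   # e.g. 0 = female, 1 = male
    buys   = np.array([1, 1, 0, 0, 1, 1, 0, 0])   # 1 = buys, 0 = doesn't buy

    r = np.corrcoef(gender, buys)[0, 1]           # Pearson correlation
    print(round(r, 3))
    if abs(r) < 0.2:                              # weak relationship -> drop it
        print("Drop gender from the attribute set")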

  15. Data Preprocessing
  • Income unknown in Case 15
  • Credit not available in Case 16
  • Drop these noisy (incomplete) cases
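
A sketch of dropping the incomplete cases with pandas; NaN marks the unknown Income and missing Credit values.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "Case":   [14, 15, 16],
        "Income": [2, np.nan, 3],   # Income unknown in Case 15
        "Credit": [1, 2, np.nan],   # Credit not available in Case 16
    })
    clean = df.dropna()             # keeps only fully observed records
    print(clean)                    # Cases 15 and 16 are gone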

  16. Data Transformation
  • Assign numerical values to each attribute
    • Age: ≤30 = 3, 31-40 = 2, >40 = 1
    • Income: High = 3, Medium = 2, Low = 1
    • Student: Yes = 2, No = 1
    • Credit: Excellent = 2, Fair = 1
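
The slide's codings expressed as Python mapping tables, applied to one sample record for illustration.

    age_code     = {"<=30": 3, "31-40": 2, ">40": 1}
    income_code  = {"High": 3, "Medium": 2, "Low": 1}
    student_code = {"Yes": 2, "No": 1}
    credit_code  = {"Excellent": 2, "Fair": 1}

    record = {"Age": "31-40", "Income": "High", "Student": "No", "Credit": "Fair"}
    coded = [age_code[record["Age"]], income_code[record["Income"]],
             student_code[record["Student"]], credit_code[record["Credit"]]]
    print(coded)   # [2, 3, 1, 1]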

  17. Data Mining
  • Categorize the output: Buys = C1, Doesn't buy = C2
  • Conduct the analysis
  • The model says cases A8 and A12 don't buy; the rest do
    • Of the actual "yes" cases, 8 classified correctly and 1 not
    • Of the actual "no" cases, 4 classified correctly and 1 not

  18. Data Interpretation
  • Test on independent data

  19. Test Data Set

  20. Confusion Matrix

  21. Measures
  • Correct classification rate: 9/10 = 0.90
  • Cost function (cost of error):
    • Model says buy, actual no: $20
    • Model says no, actual buy: $200
    • Total: 1 x $20 + 0 x $200 = $20
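
The slide's two measures computed from a 2x2 confusion matrix. The error counts (1 false "buy", 0 missed buyers) come from the slide; the split of the nine correct cases between true positives and true negatives is an assumption for illustration.

    TP, FN = 6, 0    # actual buyers:     predicted buy / predicted no-buy (TP/TN split assumed)
    FP, TN = 1, 3    # actual non-buyers: predicted buy / predicted no-buy

    accuracy = (TP + TN) / (TP + TN + FP + FN)   # 9/10 = 0.90
    cost = FP * 20 + FN * 200                    # $20 per false "buy", $200 per missed buyer
    print(accuracy, cost)                        # 0.9 20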

  22. Goals
  • Avoid broad concepts:
    • "Gain insight", "discover meaningful patterns", "learn interesting things"
    • Their attainment can't be measured
  • Narrow and specify:
    • Identify customers likely to renew; reduce churn
    • Rank order by propensity to…

  23. Goals
  • Description: what is
    • Understand
    • Explain
    • Discover knowledge
  • Prescription: what should be done
    • Classify
    • Predict

  24. Goal
  • Method A: four rules, explains 70%
  • Method B: fifty rules, explains 72%
  • Which is best?
    • To gain understanding: Method A is better (minimum description length, MDL)
    • To reduce the cost of a mailing: Method B is better

  25. Measurement
  • Accuracy: how well does the model describe the observed data?
  • Confidence levels: the proportion of the time the true value falls between the lower and upper limits
  • Comprehensibility: of the whole model or of its parts?

  26. Measuring Predictive Accuracy
  • Classification & prediction: error rate = incorrect / total
    • Requires that the evaluation set be representative
  • Estimators based on predicted − actual:
    • MAD: Mean Absolute Deviation
    • MSE: Mean Squared Error
    • MAPE: Mean Absolute Percent Error
  • Variance = sum of (predicted − actual)² / n; standard deviation = square root of variance
  • Distance: how far off the prediction is
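
The estimator-based error measures as code; the predicted and actual values are made up for illustration.

    import numpy as np

    predicted = np.array([12.0, 15.0,  9.0, 20.0])
    actual    = np.array([10.0, 16.0, 11.0, 18.0])
    errors = predicted - actual

    mad  = np.mean(np.abs(errors))                  # Mean Absolute Deviation
    mse  = np.mean(errors ** 2)                     # Mean Squared Error (= variance of errors here)
    mape = np.mean(np.abs(errors / actual)) * 100   # Mean Absolute Percent Error
    std  = np.sqrt(mse)                             # standard deviation of the errors
    print(mad, mse, round(mape, 2), round(std, 3))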

  27. Statistics
  • Population: the entire group studied
  • Sample: a subset drawn from the population
  • Bias: difference between the sample average & the population average
  • Mean, median, mode
  • Distribution
  • Significance
  • Correlation, regression

  28. Classification Models
  • Lift = probability in the class for the sample divided by probability in the class for the population
    • If the population probability is 20% and the sample probability is 30%, lift = 0.3/0.2 = 1.5
  • The best lift is not necessarily the best model
    • A sufficient sample size is needed
    • A longer targeted list raises confidence but lowers lift
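
The slide's lift computation as code: lift compares the response rate in the targeted sample to the rate in the whole population.

    population_rate = 0.20   # 20% of the population is in the class
    sample_rate     = 0.30   # 30% of the selected sample is in the class

    lift = sample_rate / population_rate
    print(round(lift, 2))    # 1.5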

  29. Lift Chart

  30. Measuring Impact
  • Ideal measure: dollars generated (NPV, net present value) as a result of the expenditure
  • Mass mailing may be better than targeted mailing
  • Depends on:
    • Fixed cost
    • Cost per recipient
    • Cost per respondent
    • Value of a positive response
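
A sketch of that comparison; every number below is an assumed input for illustration, and which mailing wins depends entirely on these inputs.

    def campaign_profit(n_mailed, response_rate, fixed_cost,
                        cost_per_recipient, cost_per_respondent, value_per_response):
        """Net return of a mailing campaign from the four cost/value drivers above."""
        respondents = n_mailed * response_rate
        revenue = respondents * value_per_response
        cost = (fixed_cost + n_mailed * cost_per_recipient
                + respondents * cost_per_respondent)
        return revenue - cost

    # Targeted mailing: small list, high response rate, pricier per piece
    print(campaign_profit(10_000, 0.05, 5_000, 0.50, 2.00, 40.00))    # 9000.0
    # Mass mailing: huge list, low response rate, cheap bulk rate per piece
    print(campaign_profit(100_000, 0.01, 5_000, 0.20, 2.00, 40.00))   # 13000.0 -> mass wins here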

  31. Bottom Line
  • Return on investment

  32. Example Application
  • Telephone industry
  • Problem: unpaid bills
  • Data mining used to develop models that predict nonpayment as early as possible

  33. Knowledge Discovery Process

  34. Telephone Bill Study
  • Billing period sequence analyzed
    • Two months of usage, then the bill is received; payment is due in the month of billing; service is disconnected if the bill is unpaid within the given period
  • Hypothesis: insolvent customers would change calling habits & phone usage during a critical period before & immediately after the termination of the billing period

  35. 1: Business Understanding
  • Predict which customers will become insolvent
  • In time for the firm to take preventive measures (and avoid losing good customers)
  • Hypothesis: insolvent customers would change calling habits & phone usage during a critical period before & immediately after the termination of the billing period

  36. 2: Data Understanding
  • Static customer information available in files
  • Bills, payments, usage
  • Used a data warehouse to gather & organize the data
  • Coded to protect customer privacy

  37. Creating the Target Data Set
  • Customer files
    • Customer information
    • Disconnects
    • Reconnections
  • Time-dependent data
    • Bills
    • Payments
    • Usage
  • 100,000 customers over a 17-month period
  • Stratified sampling to ensure all groups were appropriately represented

  38. 3: Data Preparation
  • Filtered out incomplete data
  • Deleted inexpensive calls
    • Reduced data volume by about 50%
  • Low number of fraudulent cases
    • Cross-checked with phone disconnects
  • Lagged data made synchronization necessary

  39. Data Reduction & Projection
  • Information grouped by account
  • Customer data aggregated into 2-week periods
  • Discriminant analysis on 23 categories
    • Calculated average amount owed by category (significant)
    • Identified extra charges (significant)
    • Investigated payment by installments (not significant)

  40. Choosing the Data Mining Function
  • Classes:
    • Most probably solvent (99.3% of cases)
    • Most probably insolvent (0.7%)
  • Costs of the two kinds of error are widely different
  • A new data set was created through stratified sampling
    • Retained all insolvent cases
    • Altered the distribution to 90% solvent
    • Used 2,066 cases in total
  • Critical period identified
    • The last 15 two-week periods before service interruption
  • Variables defined by counting measures within two-week periods
    • 46 variables as candidate discriminant factors
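
A sketch of that re-balancing step: keep every insolvent case and down-sample solvent cases until the mix is 90/10. The record counts below are illustrative placeholders, not the study's figures.

    import random

    random.seed(0)
    insolvent = list(range(207))       # placeholder insolvent records (all retained)
    solvent   = list(range(100_000))   # placeholder solvent records

    n_solvent = 9 * len(insolvent)     # 9 solvent cases per insolvent one -> 90/10 mix
    balanced = insolvent + random.sample(solvent, n_solvent)
    print(len(balanced))               # new, deliberately skewed training set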

  41. 4: Modeling
  • Discriminant analysis: linear model
    • SPSS, stepwise forward selection
  • Decision trees: rule-based classifier
  • Neural networks: nonlinear model

  42. Data Mining
  • About two-thirds of the data used for training, the rest for testing
  • Discriminant analysis (used 17 variables)
    • Equal costs: 0.875 correct
    • Unequal costs: 0.930 correct
  • Rule-based classifier: 0.952 correct
  • Neural network: 0.929 correct

  43. 5: Evaluation
  • 1st objective: maximize accuracy of predicting insolvent customers
    • Decision tree classifier was best
  • 2nd objective: minimize the error rate for solvent customers
    • Neural network model was close to the decision tree
  • Used all 3 models on a case-by-case basis

  44. Coincidence Matrix – Combined Models

  45. 6: Implementation
  • Every customer examined using all 3 algorithms
  • If all 3 agreed, that classification was used
  • If they disagreed, the customer was categorized as unclassified
  • 0.898 correct on test data
  • Only 1 actually solvent customer would have been disconnected
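
A minimal sketch of that unanimous-vote rule: a label is accepted only when all three classifiers agree, otherwise the case is left unclassified.

    def combine(predictions):
        """predictions: one label per classifier, e.g. from the three models above."""
        first = predictions[0]
        if all(p == first for p in predictions):
            return first               # unanimous -> use that classification
        return "unclassified"          # any disagreement -> leave undecided

    print(combine(["insolvent", "insolvent", "insolvent"]))   # insolvent
    print(combine(["solvent", "insolvent", "solvent"]))       # unclassified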
