Data Mining – Best Practices

Data Mining – Best Practices. CAS 2008 Spring Meeting Quebec City, Canada Louise Francis, FCAS, MAAA Louise_francis@msn.com , www.data-mines.com. Topics in Data Mining Best Practices. Introduction: Data Mining for Management Data Quality Data augmentation Data adjustments

Presentation Transcript


  1. Data Mining – Best Practices CAS 2008 Spring Meeting Quebec City, Canada Louise Francis, FCAS, MAAA Louise_francis@msn.com, www.data-mines.com

  2. Topics in Data Mining Best Practices • Introduction: Data Mining for Management • Data Quality • Data augmentation • Data adjustments • Method/Software issues • Post deployment monitoring • References & Resources

  3. Introduction • Research by IBM indicates only 1% of data collected by organizations is used for analysis • Predictive Modeling and Data Mining are widely embraced by leading businesses • A 2002 Strategic Decision Making survey by Hackett Best Practices determined that world-class companies adopted predictive modeling technologies at twice the rate of other companies • An important commercial application is customer retention: 5% increase in retention → 95% increase in profit • It costs 5 to 10 times more to acquire new business than to retain existing business • Another study of 24 leading companies found that they needed to go beyond simple data collection

  4. Successful Implementation of Data Mining • Data Mining: Process of discovering previously unknown patterns in databases • Needs insights from many different areas • Multidisciplinary effort • Quantitative experts • IT • Business experts • Managers • Upper management

  5. Becoming a better practitioner • Manage the Human Side of Analytics • Pay greater attention to the interaction of models and humans: • Data Collection • Communicating business benefits • Belief, model understanding, model complexity • ‘Tribal’ Knowledge as model attributes • Behavioral change and Transparency • Disruption in ‘standard’ processes • Threat of obsolescence (automation) • Don’t over-rely on the technology, and recognize the disruptive role you play

  6. CRISP-DM • Cross Industry Standard Process for Data Mining • Standardized approach to data mining • www.crisp-dm.org

  7. Phases of CRISP-DM

  8. Data Quality • Scope of problem • How it is addressed • New educational resources for actuaries

  9. Survey of Actuaries • Data quality issues have a significant impact on the work of general insurance (P&C) actuaries • About a quarter of their time is spent on such issues • About a third of projects are adversely affected • See “Dirty Data on Both Sides of the Pond” – 2008 CAS Winter Forum • Data quality issues consume significantly more time on large predictive modeling projects

  10. Statistical Data Editing • Process of checking data for errors and correcting them • Uses subject matter experts • Uses statistical analysis of data • May include using methods to “fill in” missing values • Final result of SDE is clean data as well as a summary of the underlying causes of errors • See article in Encyclopedia of Data Warehousing and Data Mining
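
The editing steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the claim records, the validity rules (ages between 0 and 120, non-negative payments) and the choice of the median as the fill-in value are all hypothetical and not taken from the presentation.

```python
from statistics import median

# Hypothetical claim records; None marks a missing value.
claims = [
    {"claim_id": 1, "age": 34, "paid": 1200.0},
    {"claim_id": 2, "age": None, "paid": 850.0},
    {"claim_id": 3, "age": 210, "paid": 400.0},   # impossible age
    {"claim_id": 4, "age": 51, "paid": -75.0},    # negative payment
    {"claim_id": 5, "age": 46, "paid": 3000.0},
]


def age_is_valid(age):
    """Illustrative rule: a usable age is present and between 0 and 120."""
    return age is not None and 0 <= age <= 120


def find_errors(record):
    """Return the list of rule violations for one record."""
    errors = []
    if record["age"] is not None and not age_is_valid(record["age"]):
        errors.append("age out of range")
    if record["paid"] < 0:
        errors.append("negative paid amount")
    return errors


def fill_ages(records):
    """Replace missing or invalid ages with the median of the valid ages."""
    fill = median(r["age"] for r in records if age_is_valid(r["age"]))
    return [
        {**r, "age": r["age"] if age_is_valid(r["age"]) else fill}
        for r in records
    ]


# Summary of errors by record, plus the edited data set.
error_report = {r["claim_id"]: find_errors(r) for r in claims if find_errors(r)}
cleaned = fill_ages(claims)
```

The error report is kept alongside the cleaned data because the slide treats the summary of error causes as an output of the process, not a by-product. In practice, a subject matter expert would review the flagged records before any value is overwritten.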

  11. EDA: Overview • Typically first step in analyzing data • Purpose: • Explore structure of the data • Find outliers and errors • Uses simple statistics and graphical techniques • Examples include histograms, descriptive statistics and frequency tables [Diagram: Step 0 Data Requirements → Step 1 Data Collection → Step 2 Transformations & Aggregations → Step 3 Analysis → Step 4 Presentation of Results → Final Step Decisions]
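
As a rough illustration of the three EDA tools named on this slide (descriptive statistics, frequency tables and histograms), the sketch below uses only the Python standard library. The severities, claim types and the bin width of 1,000 are made-up values chosen so that one outlier stands out.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical claim severities and claim types (not from the presentation).
severities = [500, 750, 800, 1200, 1500, 1600, 2100, 2500, 3200, 48000]
claim_types = ["BI", "PD", "PD", "BI", "PD", "COMP", "PD", "BI", "PD", "BI"]


def describe(values):
    """Basic descriptive statistics for a numeric variable."""
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
        "stdev": stdev(values),
    }


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    return Counter((v // bin_width) * bin_width for v in values)


summary = describe(severities)
frequency_table = Counter(claim_types)
bins = histogram(severities, 1000)

# Text histogram: one row per non-empty bin.
for lower in sorted(bins):
    print(f"{lower:>6} to {lower + 999:<6} {'*' * bins[lower]}")
```

A mean far above the median, together with an isolated bar at the high end of the histogram, is the kind of signal that prompts a closer look at a record as a possible outlier or data error.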

  12. EDA: Histograms [Diagram: Step 0 Data Requirements → Step 1 Data Collection → Step 2 Transformations & Aggregations → Step 3 Analysis → Step 4 Presentation of Results → Final Step Decisions]

  13. Data Educational Materials Working Party Formation • The closest the CAS syllabus comes to data quality is its introductions to statistical plans • The CAS Data Management and Information Committee realized that SOX and Predictive Modeling have increased the need for quality data • So they formed the CAS Data Management Educational Materials Working Party to find and gather materials to educate actuaries

  14. CAS Data Management Educational Materials Working Party Publications • Book reviews of data management and data quality texts in the CAS Actuarial Review starting with the August 2006 edition • These reviews are combined and compared in “Survey of Data Management and Data Quality Texts,” CAS Forum, Winter 2007, www.casact.org • This presentation references our recently published paper: “Actuarial IQ (Information Quality)”, published in the Winter 2008 edition of the CAS Forum: http://www.casact.org/pubs/forum/08wforum/

  15. Data Flow • Information Quality involves all steps: • Data Requirements • Data Collection • Transformations & Aggregations • Actuarial Analysis • Presentation of Results • To improve Final Step: • Making Decisions [Diagram: Step 0 Data Requirements → Step 1 Data Collection → Step 2 Transformations & Aggregations → Step 3 Analysis → Step 4 Presentation of Results → Final Step Decisions]

  16. Data Augmentation • Add information from Internal data • Add information from external data • For overview of inexpensive sources of data see: “Free and Cheap Sources of Data”, 2007 Predictive modeling seminar and “External Data Sources” at 2008 Ratemaking Seminar

  17. Data Augmentation – Internal Data • Create aggregated statistics from internal data sources • Number of lawyers per zip • Claim frequency rate per zip • Frequency of back claims per state • Use unstructured data • Text Mining
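
One of the aggregated statistics listed here, claim frequency rate per zip, could be built and attached to each record roughly as follows. The policy rows, field layout and the new variable name `zip_claim_frequency` are hypothetical.

```python
from collections import defaultdict

# Hypothetical policy records: (policy_id, zip_code, exposure_years, claim_count)
policies = [
    ("P1", "19103", 1.0, 0),
    ("P2", "19103", 0.5, 1),
    ("P3", "19103", 1.0, 1),
    ("P4", "08540", 1.0, 0),
    ("P5", "08540", 1.0, 1),
]


def claim_frequency_by_zip(rows):
    """Claims per exposure year, aggregated to the zip code level."""
    exposure = defaultdict(float)
    claims = defaultdict(int)
    for _, zip_code, exposure_years, claim_count in rows:
        exposure[zip_code] += exposure_years
        claims[zip_code] += claim_count
    return {z: claims[z] / exposure[z] for z in exposure if exposure[z] > 0}


zip_frequency = claim_frequency_by_zip(policies)

# Attach the zip-level statistic to every policy as a new predictor.
augmented = [
    {"policy_id": pid, "zip": z, "zip_claim_frequency": zip_frequency[z]}
    for pid, z, _, _ in policies
]
```

If the aggregated rate is later used to predict claims on the same policies, it should be computed from a separate or earlier period, since otherwise each record's own outcome leaks into its predictor.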

  18. Data Adjustments • Trend • Adjust all records to common cost level • Use model to estimate trend • Development • Adjust all losses to ultimate • Adjust all losses to a common age • Use model to estimate future development
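
A minimal sketch of both adjustments follows, assuming a constant 5% annual trend, a 2008 target cost level and a small table of age-to-ultimate development factors. All of these numbers are invented for illustration; the slide also mentions estimating trend and development with a model, which is not shown here.

```python
# Assumed parameters (hypothetical, not from the presentation).
ANNUAL_TREND = 0.05
TARGET_YEAR = 2008
AGE_TO_ULTIMATE = {12: 2.00, 24: 1.25, 36: 1.10, 48: 1.00}  # keyed by age in months


def trend_factor(annual_trend, from_year, to_year):
    """Compound trend factor that moves costs from one year to another."""
    return (1.0 + annual_trend) ** (to_year - from_year)


def adjust_claim(claim):
    """Develop a reported loss to ultimate, then trend it to the target year."""
    ultimate = claim["reported"] * AGE_TO_ULTIMATE[claim["age_months"]]
    trended = ultimate * trend_factor(ANNUAL_TREND, claim["accident_year"], TARGET_YEAR)
    return {**claim, "ultimate": ultimate, "trended_ultimate": trended}


claims = [
    {"accident_year": 2005, "age_months": 36, "reported": 10000.0},
    {"accident_year": 2007, "age_months": 12, "reported": 4000.0},
]

adjusted = [adjust_claim(c) for c in claims]
```

After adjustment, records from different accident years and maturities sit on a common cost level and a common age, so a model fitted to them is not simply picking up inflation or immaturity.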

  19. KDnuggets Poll on Data

  20. Methods: What are data miners using? How well does it work?

  21. Major Kinds of Data Mining • Supervised learning • Most common situation • A dependent variable • Frequency • Loss ratio • Fraud/no fraud • Some methods • Regression • Trees/Machine Learning • Some neural networks • Unsupervised learning • No dependent variable • Group like records together • A group of claims with similar characteristics might be more likely to be fraudulent • Ex: Territory assignment, Text Mining • Some methods • Association rules • K-means clustering • Kohonen neural networks

  22. KDnuggets Poll on Methods

  23. KDnuggets Poll on Open Source Software

  24. The Supervised Methods and Software Evaluated (Research by Derrig and Francis) 1) TREENET 2) Iminer Tree 3) S-PLUS Tree 4) CART 5) S-PLUS Neural 6) Iminer Neural 7) Iminer Ensemble 8) MARS 9) Random Forest 10) Exhaustive CHAID 11) Naïve Bayes (Baseline) 12) Logistic reg (Baseline)

  25. TREENET ROC Curve – IME • Explain AUROC • AUROC = 0.701
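
AUROC is the area under the ROC curve. It equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below computes it directly from that definition; the labels and scores are invented and are unrelated to the 0.701 reported on the slide.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the share of (positive, negative) pairs in which the positive
    case has the higher score; tied scores count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model output: 1 = target event occurred, 0 = it did not.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.35]
example_auc = auroc(labels, scores)
```

The pairwise loop is quadratic in the number of records, which is fine for illustration; a rank-based calculation gives the same value more efficiently on large data sets.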

  26. Monitoring Models • Monitor use of model • Monitor data going into model • Monitor performance • This requires more mature data

  27. Novelty Detection • An example of model interaction with people to improve business outcomes • Problem Statements: • At the time of underwriting a risk, how different is the subject risk from the data used to build the model? • How are the differences, if any, logically grouped for business meaning?

  28. Clustering Methods Make Models • Select features that you are interested in clustering, e.g. Demographics, Risk, Auto, Employment • Run cluster algorithms within the grouped features to find homogeneous groups (let the data tell you the groupings). Each member has a distance to the ‘center’ of the cluster. • Explore each cluster and describe it statistically compared to the entire ‘world’ from the training data; create thresholds for the distance to the center that you care about; may add additional description and learning • Assign business meaning (names) to the clusters as homogeneous groups; deploy; score new data as it becomes available • Look at novelty within each cluster on the new sample: distance, single variable differences • Use the threshold to determine differences from the cluster membership • Investigate for business impact or unexpected changes
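
The workflow above can be sketched end to end with a small k-means routine. Everything specific here is an assumption made for illustration: the two features (driver age, vehicle age), the fixed starting centers, the use of the largest training distance as each cluster's threshold, and the 1.5 slack multiplier. The slide does not say which clustering algorithm or threshold rule was used.

```python
import math


def assign(point, centers):
    """Return (index of the nearest center, distance to that center)."""
    dists = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]


def kmeans(points, centers, iterations=20):
    """Plain k-means from given starting centers (no random initialization)."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            idx, _ = assign(p, centers)
            groups[idx].append(p)
        # Move each center to the mean of its members; keep it if it has none.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


# Hypothetical training data: (driver age, vehicle age).
training = [(22, 1), (25, 2), (24, 3), (58, 9), (61, 10), (60, 12)]
centers = kmeans(training, centers=[training[0], training[3]])

# Threshold per cluster: the largest distance to the center seen in training.
thresholds = [0.0] * len(centers)
for p in training:
    idx, d = assign(p, centers)
    thresholds[idx] = max(thresholds[idx], d)


def novelty(point, slack=1.5):
    """Score a new record: (nearest cluster, True if unusually far from it)."""
    idx, d = assign(point, centers)
    return idx, d > slack * thresholds[idx]


typical = novelty((23, 2))   # close to the first cluster's center
unusual = novelty((40, 6))   # between the clusters, far from both
```

With real data the features would normally be put on a comparable scale before clustering, since otherwise the variable with the largest numeric range dominates the distance.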

  29. Novelty Score Uses • Novelty Score: to detect ‘drift’ of aspects of clusters in predictor data over time • Dimensional Novelty • Market Cycles • Policy Limits • Exposure • Geography • Demographics • Operationalize • Book drift • Evaluation of pricing and marketing activities • Model refresh cycle • Regulatory Support

  30. Example – Automobile Insurance Data

  31. Demographic Features and Clusters • Six clusters with the following statistical profile and distribution in the sample set; look at the data and assign names to the groups (in this case 3 variables) • WORLD: the view of the current book

  32. Display the distribution of named clusters within the grouping of features (Demographic Cluster) in the test set • View of the clusters in the current book of business within Demographics

  33. Monitor the changes in distribution of the clusters in the data over time • Two clusters now show up in different percentages [Charts: Initial Customer Base; After 6 months]
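
One simple way to monitor this kind of shift is to compare each cluster's share of the book at two points in time and flag the clusters whose share moved by more than a chosen tolerance. The cluster names, counts and the 5-point tolerance below are hypothetical, and the presentation does not specify how the comparison was made.

```python
from collections import Counter


def cluster_shares(assignments):
    """Share of records falling in each named cluster."""
    counts = Counter(assignments)
    total = len(assignments)
    return {name: counts[name] / total for name in counts}


def share_changes(baseline, current):
    """Change in share per cluster (current minus baseline)."""
    names = sorted(set(baseline) | set(current))
    return {n: current.get(n, 0.0) - baseline.get(n, 0.0) for n in names}


def flag_drift(changes, tolerance):
    """Clusters whose share moved by more than the tolerance."""
    return [n for n, delta in changes.items() if abs(delta) > tolerance]


# Hypothetical cluster assignments for the book at two points in time.
initial = ["Young urban"] * 40 + ["Established family"] * 40 + ["Retired"] * 20
after_six_months = ["Young urban"] * 55 + ["Established family"] * 27 + ["Retired"] * 18

changes = share_changes(cluster_shares(initial), cluster_shares(after_six_months))
drifted = flag_drift(changes, tolerance=0.05)
```

In this made-up example two clusters are flagged, mirroring the slide's observation that two clusters show up in different percentages after six months.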

  34. Humility • Models incorporate significant uncertainties about parameters • When deployed, models will likely not be as good as they were on historic data • Need to appreciate the limitations of the models

  35. Additional References • www.kdnuggets.com • www.r-project.org • Encyclopedia of Data Warehousing and Data Mining, John Wang • For GLMs: 2004 CAS Discussion Paper Program • 2008 Discussion Paper Program on multivariate methods • “Distinguishing the Forest From the Trees” – 2006 Winter Forum, updated at www.data-mines.com • See other papers by Francis on CAS web site • Data Preparation for Data Mining Using SAS, Mamdouh Refaat • Data Mining for Business Intelligence: Concepts, Techniques and Applications in Microsoft Excel with XLMiner, Shmueli, Patel and Bruce

  36. Questions?
