1 / 36

Data Mining

This seminar led by data mining expert Louise Francis explores the main data mining methods, such as decision trees, neural networks, and clustering, and their application in ratemaking. Simulated data for automobile claim frequency is used to demonstrate data challenges and the benefits of data mining.

connj
Download Presentation

Data Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Mining CAS 2004 Ratemaking Seminar Philadelphia, Pa. Louise Francis, FCAS, MAAA Louise_francis@msn.com Francis Analytics and Actuarial Data Mining, Inc.

  2. Objectives • Answer the question: Why use data mining? • Introduce the main data mining methods • Decision Trees • Neural Networks • MARS • Clustering Francis Analytics and Actuarial Data Mining, Inc.

  3. The Data • Simulated Data for Automobile Claim Frequency • Three Factors • Territory • Four Territories • Age • Continuous Function • Mileage • High, Low Francis Analytics and Actuarial Data Mining, Inc.

  4. Data Challenges • Nonlinearities • Relation between dependent variable and independent variables is not linear or cannot be transformed to linear • Interactions • Relation between independent and dependent variable varies by one or more other variables • Correlations • Predictor variables are correlated with each other Francis Analytics and Actuarial Data Mining, Inc.

  5. Simulated Example: Probability of Claim vs. Ageby Territory and Mileage Group Francis Analytics and Actuarial Data Mining, Inc.

  6. Claim Frequency Data Francis Analytics and Actuarial Data Mining, Inc.

  7. Independent Probabilities for Each Variable Francis Analytics and Actuarial Data Mining, Inc.

  8. Decision Trees • Recursively partitions the data • Often sequentially bifurcates the data – but can split into more groups • Applies goodness of fit to select best partition at each step • Selects the partition which results in largest improvement to goodness of fit statistic Francis Analytics and Actuarial Data Mining, Inc.

  9. Goodness of Fit Statistics • Chi Square CHAID (Fish, Gallagher, Monroe- Discussion Paper Program, 1990) • Deviance CART Francis Analytics and Actuarial Data Mining, Inc.

  10. Goodness of Fit Statistics • Gini Measure  CART Francis Analytics and Actuarial Data Mining, Inc.

  11. Goodness of Fit Statistics • Entropy C4.5 Francis Analytics and Actuarial Data Mining, Inc.

  12. First Split All Policyholders P = 1.00 Territory = North / South P = 0.11 Territory = East / West P = 0.06 Francis Analytics and Actuarial Data Mining, Inc.

  13. Example of Goodness of Fit Calculation Francis Analytics and Actuarial Data Mining, Inc.

  14. Example of Fitted Tree Francis Analytics and Actuarial Data Mining, Inc.

  15. MARS • Multivariate Adaptive Regression Splines • An extension of regression which • Uses automated search procedures • Models nonlinearities • Models interactions • Produces a regression-like formula Francis Analytics and Actuarial Data Mining, Inc.

  16. Nonlinear Relationships • Fits piecewise regression to continuous variables Francis Analytics and Actuarial Data Mining, Inc.

  17. Interactions • Fits basis functions (which are like dummy variables) to model interactions • An interaction between Territory=East and Mileage can be modeled by a dummy variable which is 1 if the Territory=East and mileage =High and 0 otherwise. Francis Analytics and Actuarial Data Mining, Inc.

  18. Goodness of Fit Statistics • Generalized Cross-Validation Francis Analytics and Actuarial Data Mining, Inc.

  19. Fitted MARS Model Francis Analytics and Actuarial Data Mining, Inc.

  20. Neural Networks • Developed by artificial intelligence experts – but now used by statisticians also • Based on how neurons function in brain Francis Analytics and Actuarial Data Mining, Inc.

  21. Neural Network Structure Francis Analytics and Actuarial Data Mining, Inc.

  22. Neural Networks • Fit by minimizing squared deviation between fitted and actual values • Can be viewed as a non-parametric, non-linear regression • Often thought of as a “black box” • Due to complexity of fitted model it is difficult to understand relationship between dependent and predictor variables Francis Analytics and Actuarial Data Mining, Inc.

  23. Understanding the Model:Variable Importance • Look at weights to hidden layer • Compute sensitivities: • a measure of how much the predicted value’s error increases when the variables are excluded from the model one at a time Francis Analytics and Actuarial Data Mining, Inc.

  24. Importance Ranking • Neural Network and Mars ranked variables in same order Francis Analytics and Actuarial Data Mining, Inc.

  25. Visualizing Fitted Neural Network Francis Analytics and Actuarial Data Mining, Inc.

  26. ROC Curves for the Data Mining Methods Francis Analytics and Actuarial Data Mining, Inc.

  27. Correlation • Variable gender added • Its only impact on probability of a claim: correlation with mileage variable – males had higher mileage • MARS did not use the variable in model • CART used it in two places to split tree • Neural Network ranked gender as least important variable Francis Analytics and Actuarial Data Mining, Inc.

  28. How the Methods DidCorrelation with “True” Claim Frequency Francis Analytics and Actuarial Data Mining, Inc.

  29. Unsupervised Learning • Common Method: Clustering • No dependent variable – records are grouped into classes with similar values on the variable • Start with a measure of similarity or dissimilarity • Maximize dissimilarity between members of different clusters Francis Analytics and Actuarial Data Mining, Inc.

  30. Dissimilarity (Distance) Measure • Euclidian Distance • Manhattan Distance Francis Analytics and Actuarial Data Mining, Inc.

  31. Binary Variables Francis Analytics and Actuarial Data Mining, Inc.

  32. Binary Variables • Sample Matching • Rogers and Tanimoto Francis Analytics and Actuarial Data Mining, Inc.

  33. Example: Fraud Data • Data from 1993 closed claim study conducted by Automobile Insurers Bureau of Massachusetts • Claim files often have variables which may be useful in assessing suspicion of fraud, but a dependent variable is often not available • Variables used for clustering: • Injury type • Provider type • Legal representation • Prior Claim • SIU Investigation Francis Analytics and Actuarial Data Mining, Inc.

  34. Results for 2 Clusters Francis Analytics and Actuarial Data Mining, Inc.

  35. Beginners Library • Berry, Michael J. A., and Linoff, Gordon, Data Mining Techniques, John Wiley and Sons, 1997 • Kaufman, Leonard and Rousseeuw, Peter, Finding Groups in Data, John Wiley and Sons, 1990 • Smith, Murry, Neural Networks for Statistical Modeling, International Thompson Computer Press, 1996 Francis Analytics and Actuarial Data Mining, Inc.

  36. Data Mining CAS 2004 Ratemaking Seminar Philadelphia, Pa. Louise Francis, FCAS, MAAA Louise_francis@msn.com Francis Analytics and Actuarial Data Mining, Inc.

More Related