Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar

Prepared by Louise Francis, Francis Analytics and Actuarial Data Mining, Inc. April 1, 2006. Louise_francis@msn.com www.data-mines.com


Presentation Transcript


  1. Taming Text: An Introduction to Text Mining • CAS 2006 Ratemaking Seminar • Prepared by Louise Francis, Francis Analytics and Actuarial Data Mining, Inc. • April 1, 2006 • Louise_francis@msn.com • www.data-mines.com

  2. Objectives • Present a new data mining technology • Show how the technology uses a combination of • String processing functions • Common multivariate procedures available in most statistical software • Present a simple example of text mining • Discuss practical issues for implementing the methods

  3. Actuarial Rocket Science • Sophisticated predictive modeling methods are gaining acceptance for pricing, fraud detection and other applications • The methods are typically applied to large, complex databases • One of the newest of these is text mining

  4. Major Kinds of Modeling • Supervised learning • Most common situation • A dependent variable: frequency, loss ratio, fraud/no fraud • Some methods: regression, CART, some neural networks • Unsupervised learning • No dependent variable • Group like records together • A group of claims with similar characteristics might be more likely to be fraudulent • Ex: territory assignment, text mining • Some methods: association rules, k-means clustering, Kohonen neural networks

  5. Text Mining: Uses Growing in Many Areas ECHELON Program

  6. Lots of Information, but no Data

  7. Example: Claim Description Field

  8. Objective • Create a new variable from free form text • Use words in injury description to create an injury code • New injury code can be used in a predictive model or in other analysis

  9. A Two-Step Process • Step 1: Use string manipulation functions to parse the text • Search for blanks, commas, periods and other word separators • Use the separators to extract words • Eliminate stopwords • Step 2: Use multivariate techniques to cluster like terms together into the same injury code • Cluster analysis • Factor and principal components analysis
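
The parsing step can be sketched in Python, one of the languages the slides name for text manipulation. This is a minimal illustration, not the presenter's code: the claim descriptions and the `extract_words` helper are invented for the example, and the separator set is an assumption.

```python
import re

# Hypothetical claim descriptions, invented for illustration only.
claims = [
    "BROKEN ANKLE, FELL ON STAIRS",
    "cut hand on glass.",
]

def extract_words(text):
    """Split free-form text on blanks, commas, periods and similar separators."""
    # A run of separator characters counts as a single break between words.
    parts = re.split(r"[\s,.;:!?()/]+", text.lower())
    # re.split leaves an empty string when the text starts or ends with a
    # separator, so drop the empties.
    return [p for p in parts if p]

parsed = [extract_words(c) for c in claims]
```

Lower-casing before splitting is a design choice made here so that "BROKEN" and "broken" later count as the same term.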

  10. Parsing a Claim Description Field With Microsoft Excel String Functions

  11. Extraction Creates Binary Indicator Variables
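
As a rough sketch of what a binary indicator layout could look like, the Python below builds one 0/1 column per distinct word. The sample claims and variable names are hypothetical; the slide itself does not show how the indicators were produced.

```python
# Hypothetical parsed claims (lists of words), invented for illustration.
parsed_claims = [
    ["broken", "ankle"],
    ["cut", "hand"],
    ["broken", "hand"],
]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocabulary = sorted({word for claim in parsed_claims for word in claim})

# One row per claim, one column per vocabulary word:
# 1 if the word appears in the claim, 0 otherwise.
indicators = [
    [1 if word in claim else 0 for word in vocabulary]
    for claim in parsed_claims
]
```

Each row of `indicators` is then an ordinary numeric record that clustering or factor analysis can work with.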

  12. Eliminate Stopwords • Common words with no meaningful content
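
Stopword removal is essentially a set lookup. The sketch below assumes a tiny, made-up stopword list; the list actually used in the presentation is not given in the transcript.

```python
# A tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "on", "to", "in", "was"}

def remove_stopwords(words):
    """Keep only the words that are not in the stopword list."""
    return [w for w in words if w.lower() not in STOPWORDS]

kept = remove_stopwords(["fell", "on", "the", "stairs"])
```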

  13. Stemming: Identify Synonyms and Words with Common Stem
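
One simple way to fold synonyms and word variants into a single term is a lookup table. The table and the `normalize` helper below are hypothetical and are not taken from the slides; a full stemming algorithm would derive the common stem by rule rather than by lookup.

```python
# Hypothetical lookup table mapping word variants and synonyms to one term.
CANONICAL = {
    "laceration": "cut",
    "lacerated": "cut",
    "cuts": "cut",
    "fracture": "broken",
    "fractured": "broken",
    "broke": "broken",
}

def normalize(words):
    """Replace each word by its canonical term; unknown words pass through."""
    return [CANONICAL.get(w, w) for w in words]

normalized = normalize(["fractured", "ankle", "laceration"])
```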

  14. Dimension Reduction

  15. The Two Major Categories of Dimension Reduction • Variable reduction • Factor Analysis • Principal Components Analysis • Record reduction • Clustering • Other methods tend to be developments on these

  16. Correlated Dimensions

  17. Clustering • Common methods: k-means and hierarchical clustering • No dependent variable – records are grouped into classes with similar values on the variables • Start with a measure of similarity or dissimilarity • Maximize dissimilarity between members of different clusters

  18. Dissimilarity (Distance) Measure – Continuous Variables • Euclidean Distance • Manhattan Distance
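
The slide's formulas did not survive in the transcript. Under their standard definitions, the two distances can be written as below; the function names are chosen here for illustration.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

d_euclidean = euclidean([0, 0], [3, 4])   # 5.0
d_manhattan = manhattan([0, 0], [3, 4])   # 7
```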

  19. Binary Variables

  20. Binary Variables • Simple Matching • Rogers and Tanimoto
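
The two binary measures named on this slide are usually defined from the counts of matching and mismatching positions. The sketch below uses those textbook forms; whether the slides used exactly these versions is an assumption, and the function names are invented here.

```python
def match_counts(x, y):
    """Return (matches, mismatches) between two equal-length 0/1 records."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches, len(x) - matches

def simple_matching(x, y):
    """Share of positions on which the two records agree."""
    matches, mismatches = match_counts(x, y)
    return matches / (matches + mismatches)

def rogers_tanimoto(x, y):
    """Like simple matching, but mismatches carry double weight."""
    matches, mismatches = match_counts(x, y)
    return matches / (matches + 2 * mismatches)

x = [1, 1, 0, 0]
y = [1, 0, 0, 1]
sm = simple_matching(x, y)   # 2 matches out of 4 positions
rt = rogers_tanimoto(x, y)   # 2 / (2 + 2 * 2)
```

Both are similarity measures; subtracting either from 1 gives a dissimilarity suitable for clustering.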

  21. K-Means Clustering • Determine ahead of time how many clusters or groups you want • Use dissimilarity measure to assign all records to one of the clusters
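
The assign-and-update loop behind k-means can be sketched in plain Python. This is a bare-bones illustration under stated assumptions: the starting centers are supplied by the caller, squared Euclidean distance is the dissimilarity measure, and the records are invented. Statistical packages add smarter initialization and convergence checks.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, centers, iterations=10):
    """Repeat: assign each record to its nearest center, then move each
    center to the mean of the records assigned to it."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical binary word indicators for four claims.
points = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels, centers = kmeans(points, centers=[points[0], points[2]])
```

The number of clusters is fixed in advance by the number of starting centers passed in, which matches the first bullet of the slide.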

  22. Hierarchical Clustering • A stepwise procedure • At the beginning, each record is its own cluster • Combine the most similar records into a single cluster • Repeat the process until there is only one cluster with every record in it
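
The stepwise procedure can be sketched as follows. The slide does not say how the distance between two clusters is measured, so this sketch assumes single linkage (the closest pair of members) with Manhattan distance; the records and function names are invented for the example.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def single_linkage(c1, c2, records):
    """Cluster-to-cluster distance: the smallest distance between any pair
    of members (an assumed linkage rule)."""
    return min(manhattan(records[i], records[j]) for i in c1 for j in c2)

def agglomerate(records):
    """Start with one cluster per record and merge the closest pair until a
    single cluster remains. Returns the merged clusters in order."""
    clusters = [[i] for i in range(len(records))]
    merges = []
    while len(clusters) > 1:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: single_linkage(
            clusters[ab[0]], clusters[ab[1]], records))
        merged = clusters[a] + clusters[b]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical binary word indicators for four claims.
records = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]]
merge_history = agglomerate(records)
```

Cutting the merge history at an earlier step gives any intermediate number of clusters, which is why the choice of how many clusters to keep becomes a separate question.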

  23. Hierarchical Clustering Example

  24. How Many Clusters? • Use statistics on strength of relationship to variables of interest

  25. A Statistical Test for Number of Clusters • Schwarz Bayesian Information Criterion
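
In its general form the criterion is BIC = -2 ln(L) + k ln(n), where L is the model likelihood, k the number of parameters and n the number of records, and the candidate with the lowest value is preferred. The sketch below applies that formula to made-up log-likelihoods; how the presenter's software computed the likelihood for each cluster solution is not stated in the transcript.

```python
import math

def bic(log_likelihood, n_parameters, n_records):
    """Schwarz Bayesian Information Criterion: lower is better."""
    return -2.0 * log_likelihood + n_parameters * math.log(n_records)

# Hypothetical (log-likelihood, parameter count) pairs for 2, 3 and 4
# clusters; the numbers are invented purely to show the comparison.
candidates = {2: (-520.0, 8), 3: (-500.0, 12), 4: (-498.0, 16)}
n_records = 200

scores = {k: bic(ll, p, n_records) for k, (ll, p) in candidates.items()}
best_k = min(scores, key=scores.get)
```

With these invented inputs, the fourth cluster improves the fit too little to offset the penalty for its extra parameters.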

  26. Final Cluster Selection

  27. Use New Injury Code in a Logistic Regression to Predict Serious Claims

  28. Software for Text Mining – Commercial Software • Most major software companies, as well as some specialists, sell text mining software • These products tend to be for large, complicated applications, such as classifying academic papers • They also tend to be expensive • One inexpensive product reviewed in The American Statistician had disappointing performance

  29. Software for Text Mining – Free Software • A free product, TMSK, was used for much of the paper’s analysis • Parts of the analysis were done in widely available software packages, SPSS and S-PLUS (R) • Many of the text manipulation functions can be performed in Perl (www.perl.com) and Python (www.python.org)

  30. Software Used for Text Mining • Perl, TMSK, S-PLUS, SPSS • SPSS, S-PLUS, SAS

  31. Perl • Free open source programming language • www.perl.org • Used a lot for text processing • Perl for Dummies gives a good introduction

  32. Perl Functions for Parsing

      # Read a file of claim descriptions and split each line into words
      $TheFile = "GLClaims.txt";
      open(INFILE, $TheFile) or die "File not found";
      # Initialize variables
      $Linecount = 0;
      @alllines = ();
      while (<INFILE>) {
          $Theline = $_;
          chomp($Theline);                   # strip the trailing newline
          $Linecount = $Linecount + 1;
          $Linelength = length($Theline);    # length of the current line
          @Newitems = split(/ /, $Theline);  # split the line on blanks
          print "@Newitems \n";
          push(@alllines, [@Newitems]);      # store the words as one row
      } # end while
      close(INFILE);

  33. References • Hoffman, P., Perl for Dummies, Wiley, 2003 • Weiss, S., Indurkhya, N., Zhang, T. and Damerau, F., Text Mining, Springer, 2005

  34. Questions?
