Taming text an introduction to text mining cas 2006 ratemaking seminar
1 / 34

Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar - PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar. Prepared by Louise Francis Francis Analytics and Actuarial Data Mining, Inc. April 1, 2006 Louise_francis@msn.com www.data-mines.com. Objectives. Present a new data mining technology

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Taming text an introduction to text mining cas 2006 ratemaking seminar

Taming Text: An Introduction to Text MiningCAS 2006 Ratemaking Seminar

Prepared by

Louise Francis

Francis Analytics and Actuarial Data Mining, Inc.

April 1, 2006





  • Present a new data mining technology

  • Show how the technology uses a combination of

    • String processing functions

    • Common multivariate procedures available in statistical most statistical software

  • Present a simple example of text mining

  • Discuss practical issues for implementing the methods

Actuarial rocket science

Actuarial Rocket Science

  • Sophisticated predictive modeling methods are gaining acceptance for pricing, fraud detection and other applications

  • The methods are typically applied to large, complex databases

  • One of the newest of these is text mining

Major kinds of modeling

Supervised learning

Most common situation

A dependent variable


Loss ratio

Fraud/no fraud

Some methods



Some neural networks

Unsupervised learning

No dependent variable

Group like records together

A group of claims with similar characteristics might be more likely to be fraudulent

Ex: Territory assignment, Text Mining

Some methods

Association rules

K-means clustering

Kohonen neural networks

Major Kinds of Modeling

Text mining uses growing in many areas

Text Mining: Uses Growing in Many Areas


Lots of information but no data

Lots of Information, but no Data

Example claim description field

Example: Claim Description Field



  • Create a new variable from free form text

  • Use words in injury description to create an injury code

  • New injury code can be used in a predictive model or in other analysis

A two step process

A Two - Step Process

  • Use string manipulation functions to parse the text

    • Search for blanks, commas, periods and other word separators

    • Use the separators to extract words

    • Eliminate stopwords

  • Use multivariate techniques to cluster like terms together into the same injury code

    • Cluster analysis

    • Factor and Principal Components analysis

Parsing a claim description field with microsoft excel string functions

Parsing a Claim Description Field With Microsoft Excel String Functions

Extraction creates binary indicator variables

Extraction Creates Binary Indicator Variables

Eliminate stopwords

Eliminate Stopwords

  • Common words with no meaningful content

Stemming identify synonyms and words with common stem

Stemming: Identify Synonyms and Words with Common Stem

Dimension reduction

Dimension Reduction

The two major categories of dimension reduction

The Two Major Categories of Dimension Reduction

  • Variable reduction

    • Factor Analysis

    • Principal Components Analysis

  • Record reduction

    • Clustering

  • Other methods tend to be developments on these

Correlated dimensions

Correlated Dimensions



  • Common Method: k-means and hierarchical clustering

  • No dependent variable – records are grouped into classes with similar values on the variable

  • Start with a measure of similarity or dissimilarity

  • Maximize dissimilarity between members of different clusters

Dissimilarity distance measure continuous variables

Dissimilarity (Distance) Measure – Continuous Variables

  • Euclidian Distance

  • Manhattan Distance

Binary variables

Binary Variables

Binary variables1

Binary Variables

  • Sample Matching

  • Rogers and Tanimoto

K means clustering

K-Means Clustering

  • Determine ahead of time how many clusters or groups you want

  • Use dissimilarity measure to assign all records to one of the clusters

Hierarchical clustering

Hierarchical Clustering

  • A stepwise procedure

  • At beginning, each records is its own cluster

  • Combine the most similar records into a single cluster

  • Repeat process until there is only one cluster with every record in it

Hierarchical clustering example

Hierarchical Clustering Example

How many clusters

How Many Clusters?

  • Use statistics on strength of relationship to variables of interest

A statistical test for number of clusters

A Statistical Test for Number of Clusters

  • Swartz Bayesian Information Criterion

Final cluster selection

Final Cluster Selection

Use new injury code in a logistic regression to predict serious claims

Use New Injury Code in a Logistic Regression to Predict Serious Claims

Software for text mining commercial software

Software for Text Mining-Commercial Software

  • Most major software companies, as well as some specialists sell text mining software

    • These products tend to be for large complicated applications, such as classifying academic papers

    • They also tend to be expensive

  • One inexpensive product reviewed by American Statistician had disappointing performance

Software for text mining free software

Software for Text Mining – Free Software

  • A free product, TMSK, was used for much of the paper’s analysis

  • Parts of the analysis were done in widely available software packages, SPSS and S-Plus (R )

  • Many of the text manipulation functions can be performed in Perl (www.perl.com) and Python (www.python.org)

Software used for text mining

Software used for Text Mining



Taming text an introduction to text mining cas 2006 ratemaking seminar


  • Free open source programming language

  • www.perl.org

  • Used a lot for text processing

  • Perl for Dummies gives a good introduction

Perl functions for parsing

Perl Functions for Parsing

  • $TheFile ="GLClaims.txt";

  • $Linelength=length($TheFile);

  • open(INFILE, $TheFile) or die "File not found";

  • # Initialize variables

  • $Linecount=0;

  • @alllines=();

  • while(<INFILE>){

  • $Theline=$_;

  • chomp($Theline);

  • $Linecount = $Linecount+1;

  • $Linelength=length($Theline);

  • @Newitems = split(/ /,$Theline);

  • print "@Newitems \n";

  • push(@alllines, [@Newitems]);

  • } # end while



  • Hoffman, P, Perl for Dummies, Wiley, 2003

  • Weiss, Shalom, Indurkhya, Nitin, Zhang, Tong and Damerau, Fred, Text Mining, Springer, 2005



  • Login