Taming text an introduction to text mining cas 2006 ratemaking seminar
This presentation is the property of its rightful owner.
Sponsored Links
1 / 34

Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar PowerPoint PPT Presentation


  • 61 Views
  • Uploaded on
  • Presentation posted in: General

Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar. Prepared by Louise Francis Francis Analytics and Actuarial Data Mining, Inc. April 1, 2006 [email protected] www.data-mines.com. Objectives. Present a new data mining technology

Download Presentation

Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Taming text an introduction to text mining cas 2006 ratemaking seminar

Taming Text: An Introduction to Text MiningCAS 2006 Ratemaking Seminar

Prepared by

Louise Francis

Francis Analytics and Actuarial Data Mining, Inc.

April 1, 2006

[email protected]

www.data-mines.com


Objectives

Objectives

  • Present a new data mining technology

  • Show how the technology uses a combination of

    • String processing functions

    • Common multivariate procedures available in statistical most statistical software

  • Present a simple example of text mining

  • Discuss practical issues for implementing the methods


Actuarial rocket science

Actuarial Rocket Science

  • Sophisticated predictive modeling methods are gaining acceptance for pricing, fraud detection and other applications

  • The methods are typically applied to large, complex databases

  • One of the newest of these is text mining


Major kinds of modeling

Supervised learning

Most common situation

A dependent variable

Frequency

Loss ratio

Fraud/no fraud

Some methods

Regression

CART

Some neural networks

Unsupervised learning

No dependent variable

Group like records together

A group of claims with similar characteristics might be more likely to be fraudulent

Ex: Territory assignment, Text Mining

Some methods

Association rules

K-means clustering

Kohonen neural networks

Major Kinds of Modeling


Text mining uses growing in many areas

Text Mining: Uses Growing in Many Areas

ECHELON Program


Lots of information but no data

Lots of Information, but no Data


Example claim description field

Example: Claim Description Field


Objective

Objective

  • Create a new variable from free form text

  • Use words in injury description to create an injury code

  • New injury code can be used in a predictive model or in other analysis


A two step process

A Two - Step Process

  • Use string manipulation functions to parse the text

    • Search for blanks, commas, periods and other word separators

    • Use the separators to extract words

    • Eliminate stopwords

  • Use multivariate techniques to cluster like terms together into the same injury code

    • Cluster analysis

    • Factor and Principal Components analysis


Parsing a claim description field with microsoft excel string functions

Parsing a Claim Description Field With Microsoft Excel String Functions


Extraction creates binary indicator variables

Extraction Creates Binary Indicator Variables


Eliminate stopwords

Eliminate Stopwords

  • Common words with no meaningful content


Stemming identify synonyms and words with common stem

Stemming: Identify Synonyms and Words with Common Stem


Dimension reduction

Dimension Reduction


The two major categories of dimension reduction

The Two Major Categories of Dimension Reduction

  • Variable reduction

    • Factor Analysis

    • Principal Components Analysis

  • Record reduction

    • Clustering

  • Other methods tend to be developments on these


Correlated dimensions

Correlated Dimensions


Clustering

Clustering

  • Common Method: k-means and hierarchical clustering

  • No dependent variable – records are grouped into classes with similar values on the variable

  • Start with a measure of similarity or dissimilarity

  • Maximize dissimilarity between members of different clusters


Dissimilarity distance measure continuous variables

Dissimilarity (Distance) Measure – Continuous Variables

  • Euclidian Distance

  • Manhattan Distance


Binary variables

Binary Variables


Binary variables1

Binary Variables

  • Sample Matching

  • Rogers and Tanimoto


K means clustering

K-Means Clustering

  • Determine ahead of time how many clusters or groups you want

  • Use dissimilarity measure to assign all records to one of the clusters


Hierarchical clustering

Hierarchical Clustering

  • A stepwise procedure

  • At beginning, each records is its own cluster

  • Combine the most similar records into a single cluster

  • Repeat process until there is only one cluster with every record in it


Hierarchical clustering example

Hierarchical Clustering Example


How many clusters

How Many Clusters?

  • Use statistics on strength of relationship to variables of interest


A statistical test for number of clusters

A Statistical Test for Number of Clusters

  • Swartz Bayesian Information Criterion


Final cluster selection

Final Cluster Selection


Use new injury code in a logistic regression to predict serious claims

Use New Injury Code in a Logistic Regression to Predict Serious Claims


Software for text mining commercial software

Software for Text Mining-Commercial Software

  • Most major software companies, as well as some specialists sell text mining software

    • These products tend to be for large complicated applications, such as classifying academic papers

    • They also tend to be expensive

  • One inexpensive product reviewed by American Statistician had disappointing performance


Software for text mining free software

Software for Text Mining – Free Software

  • A free product, TMSK, was used for much of the paper’s analysis

  • Parts of the analysis were done in widely available software packages, SPSS and S-Plus (R )

  • Many of the text manipulation functions can be performed in Perl (www.perl.com) and Python (www.python.org)


Software used for text mining

Software used for Text Mining

Perl, TMSK, S-PLUS, SPSS

SPSS, SPLUS, SAS


Taming text an introduction to text mining cas 2006 ratemaking seminar

Perl

  • Free open source programming language

  • www.perl.org

  • Used a lot for text processing

  • Perl for Dummies gives a good introduction


Perl functions for parsing

Perl Functions for Parsing

  • $TheFile ="GLClaims.txt";

  • $Linelength=length($TheFile);

  • open(INFILE, $TheFile) or die "File not found";

  • # Initialize variables

  • $Linecount=0;

  • @alllines=();

  • while(<INFILE>){

  • $Theline=$_;

  • chomp($Theline);

  • $Linecount = $Linecount+1;

  • $Linelength=length($Theline);

  • @Newitems = split(/ /,$Theline);

  • print "@Newitems \n";

  • push(@alllines, [@Newitems]);

  • } # end while


References

References

  • Hoffman, P, Perl for Dummies, Wiley, 2003

  • Weiss, Shalom, Indurkhya, Nitin, Zhang, Tong and Damerau, Fred, Text Mining, Springer, 2005


Questions

Questions?


  • Login