taming text an introduction to text mining cas 2006 ratemaking seminar
Download
Skip this Video
Download Presentation
Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar

Loading in 2 Seconds...

play fullscreen
1 / 34

Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar - PowerPoint PPT Presentation


  • 105 Views
  • Uploaded on

Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar. Prepared by Louise Francis Francis Analytics and Actuarial Data Mining, Inc. April 1, 2006 [email protected] www.data-mines.com. Objectives. Present a new data mining technology

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar' - hamish-hartman


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
taming text an introduction to text mining cas 2006 ratemaking seminar

Taming Text: An Introduction to Text MiningCAS 2006 Ratemaking Seminar

Prepared by

Louise Francis

Francis Analytics and Actuarial Data Mining, Inc.

April 1, 2006

[email protected]

www.data-mines.com

objectives
Objectives
  • Present a new data mining technology
  • Show how the technology uses a combination of
    • String processing functions
    • Common multivariate procedures available in statistical most statistical software
  • Present a simple example of text mining
  • Discuss practical issues for implementing the methods
actuarial rocket science
Actuarial Rocket Science
  • Sophisticated predictive modeling methods are gaining acceptance for pricing, fraud detection and other applications
  • The methods are typically applied to large, complex databases
  • One of the newest of these is text mining
major kinds of modeling
Supervised learning

Most common situation

A dependent variable

Frequency

Loss ratio

Fraud/no fraud

Some methods

Regression

CART

Some neural networks

Unsupervised learning

No dependent variable

Group like records together

A group of claims with similar characteristics might be more likely to be fraudulent

Ex: Territory assignment, Text Mining

Some methods

Association rules

K-means clustering

Kohonen neural networks

Major Kinds of Modeling
objective
Objective
  • Create a new variable from free form text
  • Use words in injury description to create an injury code
  • New injury code can be used in a predictive model or in other analysis
a two step process
A Two - Step Process
  • Use string manipulation functions to parse the text
    • Search for blanks, commas, periods and other word separators
    • Use the separators to extract words
    • Eliminate stopwords
  • Use multivariate techniques to cluster like terms together into the same injury code
    • Cluster analysis
    • Factor and Principal Components analysis
eliminate stopwords
Eliminate Stopwords
  • Common words with no meaningful content
the two major categories of dimension reduction
The Two Major Categories of Dimension Reduction
  • Variable reduction
    • Factor Analysis
    • Principal Components Analysis
  • Record reduction
    • Clustering
  • Other methods tend to be developments on these
clustering
Clustering
  • Common Method: k-means and hierarchical clustering
  • No dependent variable – records are grouped into classes with similar values on the variable
  • Start with a measure of similarity or dissimilarity
  • Maximize dissimilarity between members of different clusters
binary variables1
Binary Variables
  • Sample Matching
  • Rogers and Tanimoto
k means clustering
K-Means Clustering
  • Determine ahead of time how many clusters or groups you want
  • Use dissimilarity measure to assign all records to one of the clusters
hierarchical clustering
Hierarchical Clustering
  • A stepwise procedure
  • At beginning, each records is its own cluster
  • Combine the most similar records into a single cluster
  • Repeat process until there is only one cluster with every record in it
how many clusters
How Many Clusters?
  • Use statistics on strength of relationship to variables of interest
a statistical test for number of clusters
A Statistical Test for Number of Clusters
  • Swartz Bayesian Information Criterion
software for text mining commercial software
Software for Text Mining-Commercial Software
  • Most major software companies, as well as some specialists sell text mining software
    • These products tend to be for large complicated applications, such as classifying academic papers
    • They also tend to be expensive
  • One inexpensive product reviewed by American Statistician had disappointing performance
software for text mining free software
Software for Text Mining – Free Software
  • A free product, TMSK, was used for much of the paper’s analysis
  • Parts of the analysis were done in widely available software packages, SPSS and S-Plus (R )
  • Many of the text manipulation functions can be performed in Perl (www.perl.com) and Python (www.python.org)
software used for text mining
Software used for Text Mining

Perl, TMSK, S-PLUS, SPSS

SPSS, SPLUS, SAS

slide31
Perl
  • Free open source programming language
  • www.perl.org
  • Used a lot for text processing
  • Perl for Dummies gives a good introduction
perl functions for parsing
Perl Functions for Parsing
  • $TheFile ="GLClaims.txt";
  • $Linelength=length($TheFile);
  • open(INFILE, $TheFile) or die "File not found";
  • # Initialize variables
  • $Linecount=0;
  • @alllines=();
  • while(<INFILE>){
  • $Theline=$_;
  • chomp($Theline);
  • $Linecount = $Linecount+1;
  • $Linelength=length($Theline);
  • @Newitems = split(/ /,$Theline);
  • print "@Newitems \n";
  • push(@alllines, [@Newitems]);
  • } # end while
references
References
  • Hoffman, P, Perl for Dummies, Wiley, 2003
  • Weiss, Shalom, Indurkhya, Nitin, Zhang, Tong and Damerau, Fred, Text Mining, Springer, 2005
ad