Taming text an introduction to text mining cas 2006 ratemaking seminar
Download
1 / 34

Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar - PowerPoint PPT Presentation


  • 102 Views
  • Uploaded on

Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar. Prepared by Louise Francis Francis Analytics and Actuarial Data Mining, Inc. April 1, 2006 [email protected] www.data-mines.com. Objectives. Present a new data mining technology

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar' - hamish-hartman


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Taming text an introduction to text mining cas 2006 ratemaking seminar

Taming Text: An Introduction to Text MiningCAS 2006 Ratemaking Seminar

Prepared by

Louise Francis

Francis Analytics and Actuarial Data Mining, Inc.

April 1, 2006

[email protected]

www.data-mines.com


Objectives
Objectives

  • Present a new data mining technology

  • Show how the technology uses a combination of

    • String processing functions

    • Common multivariate procedures available in statistical most statistical software

  • Present a simple example of text mining

  • Discuss practical issues for implementing the methods


Actuarial rocket science
Actuarial Rocket Science

  • Sophisticated predictive modeling methods are gaining acceptance for pricing, fraud detection and other applications

  • The methods are typically applied to large, complex databases

  • One of the newest of these is text mining


Major kinds of modeling

Supervised learning

Most common situation

A dependent variable

Frequency

Loss ratio

Fraud/no fraud

Some methods

Regression

CART

Some neural networks

Unsupervised learning

No dependent variable

Group like records together

A group of claims with similar characteristics might be more likely to be fraudulent

Ex: Territory assignment, Text Mining

Some methods

Association rules

K-means clustering

Kohonen neural networks

Major Kinds of Modeling





Objective
Objective

  • Create a new variable from free form text

  • Use words in injury description to create an injury code

  • New injury code can be used in a predictive model or in other analysis


A two step process
A Two - Step Process

  • Use string manipulation functions to parse the text

    • Search for blanks, commas, periods and other word separators

    • Use the separators to extract words

    • Eliminate stopwords

  • Use multivariate techniques to cluster like terms together into the same injury code

    • Cluster analysis

    • Factor and Principal Components analysis




Eliminate stopwords
Eliminate Stopwords String Functions

  • Common words with no meaningful content



Dimension reduction
Dimension Reduction String Functions


The two major categories of dimension reduction
The Two Major Categories of Dimension Reduction String Functions

  • Variable reduction

    • Factor Analysis

    • Principal Components Analysis

  • Record reduction

    • Clustering

  • Other methods tend to be developments on these


Correlated dimensions
Correlated Dimensions String Functions


Clustering
Clustering String Functions

  • Common Method: k-means and hierarchical clustering

  • No dependent variable – records are grouped into classes with similar values on the variable

  • Start with a measure of similarity or dissimilarity

  • Maximize dissimilarity between members of different clusters


Dissimilarity distance measure continuous variables
Dissimilarity (Distance) Measure – Continuous Variables String Functions

  • Euclidian Distance

  • Manhattan Distance


Binary variables
Binary Variables String Functions


Binary variables1
Binary Variables String Functions

  • Sample Matching

  • Rogers and Tanimoto


K means clustering
K-Means Clustering String Functions

  • Determine ahead of time how many clusters or groups you want

  • Use dissimilarity measure to assign all records to one of the clusters


Hierarchical clustering
Hierarchical Clustering String Functions

  • A stepwise procedure

  • At beginning, each records is its own cluster

  • Combine the most similar records into a single cluster

  • Repeat process until there is only one cluster with every record in it



How many clusters
How Many Clusters? String Functions

  • Use statistics on strength of relationship to variables of interest


A statistical test for number of clusters
A Statistical Test for Number of Clusters String Functions

  • Swartz Bayesian Information Criterion


Final cluster selection
Final Cluster Selection String Functions



Software for text mining commercial software
Software for Text Mining-Commercial Software Serious Claims

  • Most major software companies, as well as some specialists sell text mining software

    • These products tend to be for large complicated applications, such as classifying academic papers

    • They also tend to be expensive

  • One inexpensive product reviewed by American Statistician had disappointing performance


Software for text mining free software
Software for Text Mining – Free Software Serious Claims

  • A free product, TMSK, was used for much of the paper’s analysis

  • Parts of the analysis were done in widely available software packages, SPSS and S-Plus (R )

  • Many of the text manipulation functions can be performed in Perl (www.perl.com) and Python (www.python.org)


Software used for text mining
Software used for Text Mining Serious Claims

Perl, TMSK, S-PLUS, SPSS

SPSS, SPLUS, SAS


Perl Serious Claims

  • Free open source programming language

  • www.perl.org

  • Used a lot for text processing

  • Perl for Dummies gives a good introduction


Perl functions for parsing
Perl Functions for Parsing Serious Claims

  • $TheFile ="GLClaims.txt";

  • $Linelength=length($TheFile);

  • open(INFILE, $TheFile) or die "File not found";

  • # Initialize variables

  • $Linecount=0;

  • @alllines=();

  • while(<INFILE>){

  • $Theline=$_;

  • chomp($Theline);

  • $Linecount = $Linecount+1;

  • $Linelength=length($Theline);

  • @Newitems = split(/ /,$Theline);

  • print "@Newitems \n";

  • push(@alllines, [@Newitems]);

  • } # end while


References
References Serious Claims

  • Hoffman, P, Perl for Dummies, Wiley, 2003

  • Weiss, Shalom, Indurkhya, Nitin, Zhang, Tong and Damerau, Fred, Text Mining, Springer, 2005


Questions
Questions? Serious Claims


ad