
Data Mining – Best Practices Part #2

Richard Derrig, PhD,

Opal Consulting LLC

CAS Spring Meeting

June 16-18, 2008



Data Mining

  • Data Mining, also known as Knowledge-Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns. In order to achieve this, data mining uses computational techniques from statistics, machine learning and pattern recognition.

    • www.wikipedia.org



AGENDA

Predictive v Explanatory Models

Discussion of Methods

Example: Explanatory Models for Decision to Investigate Claims

The “Importance” of Explanatory and Predictive Variables

An Eight Step Program for Building a Successful Model



Predictive v Explanatory Models

  • Both are of the form: the Target (Dependent) Variable is a Function of Feature (Independent) Variables that are related to the Target

  • Explanatory Models assume all Variables are Contemporaneous and Known

  • Predictive Models assume all Variables are Contemporaneous and Estimable



Desirable Properties of a Data Mining Method:

  • Any nonlinear relationship between target and features can be approximated

  • A method that works when the form of the nonlinearity is unknown

  • The effect of interactions can be easily determined and incorporated into the model

  • The method generalizes well on out-of-sample data


Major Kinds of Data Mining Methods

Supervised learning

  • Most common situation

  • Target variable
    • Frequency
    • Loss ratio
    • Fraud/no fraud

  • Some methods
    • Regression
    • Decision Trees
    • Some neural networks

Unsupervised learning

  • No Target variable

  • Group like records together (Clustering)
    • A group of claims with similar characteristics might be more likely to have a similar risk of loss
    • Ex: Territory assignment

  • Some methods
    • PRIDIT
    • K-means clustering
    • Kohonen neural networks
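The distinction can be made concrete in a few lines of code. This is a minimal sketch on synthetic claim data, using scikit-learn stand-ins for the methods named above: a decision tree learns from a fraud label (supervised), while k-means groups the same claims with no target at all (unsupervised).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((500, 4))                    # synthetic claim features
fraud = (X[:, 0] > 0.8).astype(int)         # target variable (labels)

# Supervised: the tree is fit against the target variable
supervised = DecisionTreeClassifier(max_depth=3).fit(X, fraud)

# Unsupervised: k-means ignores the target and groups like records
unsupervised = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("tree accuracy on labels:", supervised.score(X, fraud))
print("cluster sizes:", np.bincount(unsupervised.labels_))
```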



The Supervised Methods and Software Evaluated

1) TREENET

2) Iminer Tree

3) S-PLUS Tree

4) CART

5) S-PLUS Neural

6) Iminer Neural

7) Iminer Ensemble

8) MARS

9) Random Forest

10) Exhaustive CHAID

11) Naïve Bayes (Baseline)

12) Logistic Regression (Baseline)



Decision Trees

  • In decision theory (for example, risk management), a decision tree is a graph of decisions and their possible consequences (including resource costs and risks), used to create a plan to reach a goal. Decision trees are constructed in order to help with making decisions. A decision tree is a special form of tree structure.

    • www.wikipedia.org


CART – Example of 1st Split on Provider 2 Bill, with Paid as Dependent

  • For the entire database, the total squared deviation of paid losses around the predicted value (i.e., the mean) is 4.95 × 10^13. The SSE declines to 4.66 × 10^13 after the data are partitioned using $5,021 as the cutpoint.

  • Any other partition of the provider bill produces a larger SSE than 4.66 × 10^13. For instance, if a cutpoint of $10,000 is selected, the SSE is 4.76 × 10^13.
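The split search CART performs here can be illustrated in a few lines of Python. This is a minimal sketch of the idea, not the CART implementation; the arrays `provider2_bill` and `paid` are synthetic stand-ins for the claim database.

```python
import numpy as np

def best_split_sse(x, y):
    """Find the cutpoint on x that minimizes total within-node SSE of y."""
    best_cut, best_sse = None, np.inf
    for cut in np.unique(x)[1:]:           # candidate cutpoints (both sides non-empty)
        left, right = y[x < cut], y[x >= cut]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_cut, best_sse = cut, sse
    return best_cut, best_sse

# Synthetic data standing in for provider bills and paid losses
rng = np.random.default_rng(0)
provider2_bill = rng.gamma(2.0, 2000.0, size=1000)
paid = 3 * provider2_bill + rng.normal(0, 5000, size=1000)

cut, sse = best_split_sse(provider2_bill, paid)
print(f"best cutpoint ${cut:,.0f}, SSE {sse:.3e}")
```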



Different Kinds of Decision Trees

  • Single Trees (CART, CHAID)

  • Ensemble Trees, a more recent development (TREENET, RANDOM FOREST)

    • A composite or weighted average of many trees (perhaps 100 or more)

    • There are many methods to fit the trees and prevent overfitting

      • Boosting: Iminer Ensemble and Treenet

      • Bagging: Random Forest
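A minimal sketch of the two ensemble styles, using scikit-learn stand-ins rather than the commercial packages named above: a Random Forest for bagging, and gradient boosting in place of TREENET. The data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: each of 100 trees is fit on a bootstrap resample, then averaged
bagged = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting: each tree is fit to the errors of the ensemble so far
boosted = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagged), ("boosting", boosted)]:
    model.fit(X_tr, y_tr)
    print(name, "out-of-sample accuracy:", model.score(X_te, y_te))
```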



Neural Networks




NEURAL NETWORKS

  • Self-Organizing Feature Maps

    • T. Kohonen 1982-1990 (Cybernetics)

    • Reference vectors of features map to an OUTPUT format in a topologically faithful way. Example: map onto a 40x40 two-dimensional square.

    • An iterative process adjusts all reference vectors in a "neighborhood" of the nearest one. The neighborhood size shrinks over iterations.
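The update rule can be sketched directly in numpy. This is a minimal, illustrative self-organizing map on synthetic feature vectors; the grid size follows the 40x40 example above, while the learning-rate and radius schedules are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h = grid_w = 40                                  # 40x40 output map
n_features = 5
weights = rng.random((grid_h, grid_w, n_features))    # reference vectors
data = rng.random((1000, n_features))                 # synthetic feature vectors

n_iters = 5000
for t in range(n_iters):
    x = data[rng.integers(len(data))]
    # best-matching unit: the reference vector nearest to x
    dist = np.linalg.norm(weights - x, axis=2)
    bi, bj = np.unravel_index(dist.argmin(), dist.shape)
    # learning rate and neighborhood radius both shrink over iterations
    lr = 0.5 * (1 - t / n_iters)
    radius = max(1.0, grid_w / 2 * (1 - t / n_iters))
    ii, jj = np.ogrid[:grid_h, :grid_w]
    neigh = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / (2 * radius ** 2))
    # move all reference vectors in the neighborhood toward x
    weights += lr * neigh[:, :, None] * (x - weights)
```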


FEATURE MAP: SUSPICION LEVELS


FEATURE MAP: SIMILARITY OF A CLAIM



DATA MODELING EXAMPLE: CLUSTERING

  • Data on 16,000 Medicaid providers analyzed by unsupervised neural net

  • Neural network clustered Medicaid providers based on 100+ features

  • Investigators validated a small set of known fraudulent providers

  • Visualization tool displays clustering, showing known fraud and abuse

  • Subset of 100 providers with similar patterns investigated: Hit rate > 70%

Cube size proportional to annual Medicaid revenues

© 1999 Intelligent Technologies Corporation
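A minimal sketch of the same workflow with k-means clustering; scikit-learn stands in for the proprietary neural-net tool, and the provider data and validated-fraud indices are synthetic placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((16000, 100))        # 16,000 providers x 100+ features

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(features)
labels = kmeans.labels_

# Providers that land in the same cluster as known fraudulent providers
# become investigation candidates.
known_fraud_idx = [12, 345, 678]           # hypothetical validated cases
suspect_clusters = set(labels[known_fraud_idx])
candidates = np.where(np.isin(labels, list(suspect_clusters)))[0]
print(len(candidates), "providers share a cluster with known fraud cases")
```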



Multiple Adaptive Regression Splines (MARS)

  • MARS fits a piecewise linear regression

    • BF1 = max(0, X – 1,401.00)

    • BF2 = max(0, 1,401.00 - X )

    • BF3 = max(0, X - 70.00)

    • Y = 0.336 + .145626E-03 * BF1 - .199072E-03 * BF2 - .145947E-03 * BF3

    • BF1, BF2, BF3 are basis functions

  • MARS uses statistical optimization to find the best basis function(s)

  • A basis function is similar to a dummy variable in regression: like a combination of a dummy indicator and a linear independent variable
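Evaluating the fitted model above takes only a few lines; the hinge (basis) functions and coefficients are copied from the slide, and the test inputs are arbitrary.

```python
import numpy as np

def predict_mars(x):
    """Piecewise-linear MARS fit with knots at 70 and 1,401."""
    bf1 = np.maximum(0, x - 1401.00)
    bf2 = np.maximum(0, 1401.00 - x)
    bf3 = np.maximum(0, x - 70.00)
    return 0.336 + 0.145626e-3 * bf1 - 0.199072e-3 * bf2 - 0.145947e-3 * bf3

x = np.array([0.0, 70.0, 1401.0, 5000.0])
print(predict_mars(x))   # response changes slope at each knot
```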


Baseline Methods: Naive Bayes Classifier, Logistic Regression

  • Naive Bayes assumes the feature (predictor) variables are independent conditional on each target category

  • Logistic Regression assumes the log odds of the target are linear in the feature (predictor) variables
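A minimal sketch of the two baselines on synthetic data, with scikit-learn in place of the packages compared in the study; the AUROC scoring anticipates the evaluation measure used below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

for name, model in [("Naive Bayes", GaussianNB()),
                    ("Logistic regression", LogisticRegression(max_iter=1000))]:
    # 5-fold cross-validated area under the ROC curve
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUROC {auc:.3f}")
```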


REAL CLAIM FRAUD DETECTION PROBLEM

  • Classify all claims

  • Identify valid classes

    • Pay the claim

    • No hassle

    • Visa Example

  • Identify (possible) fraud

    • Investigation needed

  • Identify “gray” classes

    • Minimize with “learning” algorithms



The Fraud Surrogates used as Target Decision Variables

  • Independent Medical Exam (IME) requested

  • Special Investigation Unit (SIU) referral

  • IME successful

  • SIU successful

  • DATA: Detailed Auto Injury Closed Claim Database for Massachusetts

  • Accident Years (1995-1997)


ROC Curve: Area Under the ROC Curve

  • Want good performance both on sensitivity and specificity

  • Sensitivity and specificity depend on cut points chosen for binary target (yes/no)

    • Choose a series of different cut points, and compute sensitivity and specificity for each of them

    • Graph results

      • Plot sensitivity vs 1-specificity

      • Compute an overall measure of “lift”, or area under the curve
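The procedure in the list above can be sketched directly: sweep cut points over a model score, compute sensitivity and 1-specificity at each, and integrate. The scores and target here are synthetic stand-ins for a model's output.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)              # 1 = IME requested
scores = y * 0.3 + rng.random(1000) * 0.7      # noisy model score

# Sweep cut points from above the max score (all "no") down to the min (all "yes")
cuts = np.concatenate(([scores.max() + 1], np.unique(scores)[::-1]))
tpr = np.array([((scores >= c) & (y == 1)).sum() / (y == 1).sum() for c in cuts])
fpr = np.array([((scores >= c) & (y == 0)).sum() / (y == 0).sum() for c in cuts])

# Area under the curve via the trapezoidal rule
auroc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(f"AUROC = {auroc:.3f}")
```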



True/False Positives and True/False Negatives: The “Confusion” Matrix

  • Choose a “cut point” in the model score.

  • Claims > cut point, classify “yes”.


TREENET ROC Curve – IME, AUROC = 0.701


Logistic ROC Curve – IME, AUROC = 0.643



Ranking of Methods/Software – IME Requested



Variable Importance (IME) Based on Average of Methods



Claim Fraud Detection Plan

  • STEP 1: SAMPLE: Systematic benchmark of a random sample of claims.

  • STEP 2: FEATURES: Isolate red flags and other sorting characteristics.

  • STEP 3: FEATURE SELECTION: Separate features into objective and subjective, early, middle and late arriving, acquisition cost levels, and other practical considerations.

  • STEP 4: CLUSTER: Apply unsupervised algorithms (Kohonen, PRIDIT, Fuzzy) to cluster claims; examine for needed homogeneity.



Claim Fraud Detection Plan

  • STEP 5: ASSESSMENT: Externally classify claims according to objectives for sorting.

  • STEP 6: MODEL: Supervised models relating selected features to objectives (logistic regression, Naïve Bayes, Neural Networks, CART, MARS).

  • STEP 7: STATIC TESTING: Model output versus expert assessment, and model output versus cluster homogeneity (PRIDIT scores), on one or more samples.

  • STEP 8: DYNAMIC TESTING: Real-time operation of an acceptable model; record outcomes and repeat steps 1-7 as needed to fine-tune the model and parameters. Use PRIDIT to show gain or loss of feature power and changing data patterns; tune investigative proportions to optimize detection and deterrence of fraud and abuse. A skeleton of the modeling core of this plan is sketched below.
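A minimal sketch of the modeling core of the plan (steps 4-7): cluster claims without labels, fit a supervised model on the externally assessed target, then statically test on a holdout sample. scikit-learn tools stand in for the Kohonen/PRIDIT/fuzzy methods named above, and the data and target are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.random((5000, 20))                            # red flags and other features
y = (X[:, 0] + rng.random(5000) > 1.2).astype(int)    # external assessment (STEP 5)

# STEP 4: cluster claims and inspect homogeneity (here, just cluster sizes)
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))

# STEP 6: supervised model relating features to the assessed objective
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# STEP 7: static test of model output on a holdout sample
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print("holdout AUROC:", round(auc, 3))
```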

