Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation




Lecture Notes for Chapter 4 of Introduction to Data Mining, by Tan, Steinbach, Kumar

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Classification: Definition
• Given a collection of records (training set)
• Each record is characterized by a tuple (x,y), where x is the attribute set and y is the class label
• x: attribute, predictor, independent variable, input
• y: class, response, dependent variable, output
• Learn a model that maps each attribute set x into one of the predefined class labels y
Classification Techniques
• Base Classifiers
  • Decision Tree based Methods
  • Rule-based Methods
  • Nearest-neighbor
  • Neural Networks
  • Naïve Bayes and Bayesian Belief Networks
  • Support Vector Machines
• Ensemble Classifiers
  • Boosting, Bagging, Random Forests
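Example of a Decision Tree

[Figure: training data with two categorical attributes, one continuous attribute, and a class label, shown beside the decision tree model built from it. Splitting attributes: Home Owner, MarSt, Income.]

Model: Decision Tree
• Home Owner = Yes: NO
• Home Owner = No: test MarSt
• MarSt = Married: NO
• MarSt = Single, Divorced: test Income
• Income < 80K: NO
• Income > 80K: YES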

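Another Example of a Decision Tree

[Figure: the same training data (two categorical attributes, one continuous attribute, and a class label) with a different decision tree.]

• MarSt = Married: NO
• MarSt = Single, Divorced: test Home Owner
• Home Owner = Yes: NO
• Home Owner = No: test Income
• Income < 80K: NO
• Income > 80K: YES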

There could be more than one tree that fits the same data!
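As a small illustration, the two example trees can be written as plain functions and compared. This is only a sketch: the function names, the argument encoding, and the handling of an income of exactly 80K (sent to the "< 80K" branch here) are assumptions, since the slides do not specify them.

```python
def tree_a(home_owner: bool, marital_status: str, income_k: float) -> str:
    """First tree: test Home Owner, then MarSt, then Income."""
    if home_owner:
        return "NO"
    if marital_status == "Married":
        return "NO"
    # Assumption: exactly 80K falls into the "< 80K" branch.
    return "YES" if income_k > 80 else "NO"


def tree_b(home_owner: bool, marital_status: str, income_k: float) -> str:
    """Second tree: test MarSt first, then Home Owner, then Income."""
    if marital_status == "Married":
        return "NO"
    if home_owner:
        return "NO"
    return "YES" if income_k > 80 else "NO"


# Enumerate a grid of hypothetical records and check that both trees agree.
records = [
    (owner, status, income)
    for owner in (True, False)
    for status in ("Single", "Married", "Divorced")
    for income in (60, 125)
]
trees_agree = all(tree_a(*r) == tree_b(*r) for r in records)
```

Both functions test the same three conditions in a different order, which is why they assign the same label to every record in the grid.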

Decision Tree Induction
• Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ, SPRINT
General Structure of Hunt’s Algorithm
• Let D_t be the set of training records that reach a node t
• General Procedure:
• If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t
• If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
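The general procedure can be sketched as a short recursive function. This is a minimal illustration, not the algorithm as published: the choice of attribute test (the first attribute that still takes more than one value), the majority-class fallback when no attribute can split the records, the dictionary-based tree format, and the toy records are all assumptions made for the example.

```python
from collections import Counter


def hunt(records, attributes):
    """records: list of (attribute_dict, label); attributes: names still usable."""
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Case 1: every record in D_t has the same class -> leaf node.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Case 2: pick an attribute test (assumption: first attribute with >1 value).
    for attr in attributes:
        values = {x[attr] for x, _ in records}
        if len(values) > 1:
            break
    else:
        # No attribute separates the records: fall back to the majority class.
        return {"leaf": majority}
    remaining = [a for a in attributes if a != attr]
    children = {}
    for v in sorted(values):
        subset = [(x, y) for x, y in records if x[attr] == v]
        children[v] = hunt(subset, remaining)  # recurse on each smaller subset
    return {"split": attr, "children": children, "default": majority}


def predict(tree, x):
    """Follow the attribute tests until a leaf is reached."""
    while "leaf" not in tree:
        child = tree["children"].get(x[tree["split"]])
        if child is None:  # unseen attribute value
            return tree["default"]
        tree = child
    return tree["leaf"]


# Hypothetical toy records, for illustration only.
toy = [
    ({"home_owner": "yes", "marital": "single"}, "NO"),
    ({"home_owner": "no", "marital": "married"}, "NO"),
    ({"home_owner": "no", "marital": "single"}, "YES"),
    ({"home_owner": "no", "marital": "divorced"}, "YES"),
]
tree = hunt(toy, ["home_owner", "marital"])
```

A practical implementation would choose the attribute test with an impurity measure rather than taking the first usable attribute.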


How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Which test condition is the best?

How to determine the Best Split
• Greedy approach:
• Nodes with purer class distribution are preferred
• Need a measure of node impurity:

[Figure: two example nodes, one with a high degree of impurity and one with a low degree of impurity.]

Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
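The three measures can be sketched from a node's class counts. The slides only name entropy and misclassification error, so the formulas below are the standard textbook definitions, stated here as an assumption about what the slides intend; the function names are illustrative.

```python
import math


def class_probs(counts):
    """Relative frequency p(j | t) of each class at a node."""
    n = sum(counts)
    return [c / n for c in counts]


def gini(counts):
    # 1 - sum_j p(j|t)^2
    return 1.0 - sum(p * p for p in class_probs(counts))


def entropy(counts):
    # -sum_j p(j|t) * log2 p(j|t), treating 0 * log(0) as 0
    return -sum(p * math.log2(p) for p in class_probs(counts) if p > 0)


def misclassification_error(counts):
    # 1 - max_j p(j|t)
    return 1.0 - max(class_probs(counts))
```

For a two-class node, all three measures are 0 when the node is pure and reach their maximum when the classes are evenly split.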
Measure of Impurity: GINI
• Gini Index for a given node t:

$$\mathrm{Gini}(t) = 1 - \sum_{j} [p(j \mid t)]^2$$

(NOTE: p(j | t) is the relative frequency of class j at node t.)

• Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information
Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighting partitions:
• Larger and purer partitions are sought.

[Figure: attribute test B? splits the records into Node N1 (Yes) and Node N2 (No), with 6 records in each node.]

Gini(N1) = 1 - (5/6)^2 - (1/6)^2 = 0.278

Gini(N2) = 1 - (2/6)^2 - (4/6)^2 = 0.444

Gini(Children) = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
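The arithmetic above can be checked with a few lines of code. The class counts (5, 1) for N1 and (2, 4) for N2 are read off the fractions in the computation; the function names are illustrative.

```python
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def weighted_gini(partitions):
    """Gini of a split: each child's Gini weighted by its share of the records."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)


n1 = [5, 1]  # class counts at Node N1
n2 = [2, 4]  # class counts at Node N2

gini_n1 = round(gini(n1), 3)                        # 0.278
gini_n2 = round(gini(n2), 3)                        # 0.444
gini_children = round(weighted_gini([n1, n2]), 3)   # 0.361
```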

Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

• Multi-way split
• Two-way split (find best partition of values)
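A sketch of both options follows: build the count matrix, score the multi-way split, and search the two-way groupings for the lowest weighted Gini. The attribute values, class labels, and function names are hypothetical and chosen only to make the example run.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_matrix(values, labels):
    """Per attribute value, count how many records fall in each class."""
    matrix = defaultdict(Counter)
    for v, y in zip(values, labels):
        matrix[v][y] += 1
    return matrix


def split_gini(groups, matrix):
    """Weighted Gini of a split; each group is a set of attribute values."""
    total = sum(sum(c.values()) for c in matrix.values())
    score = 0.0
    for group in groups:
        counts = Counter()
        for v in group:
            counts.update(matrix[v])
        n = sum(counts.values())
        score += n / total * (1.0 - sum((k / n) ** 2 for k in counts.values()))
    return score


def best_two_way(matrix):
    """Try every two-group partition of the values; keep the lowest Gini."""
    values = sorted(matrix)
    best = None
    for r in range(1, len(values) // 2 + 1):
        for left in combinations(values, r):
            right = set(values) - set(left)
            score = split_gini([set(left), right], matrix)
            if best is None or score < best[0]:
                best = (score, set(left), right)
    return best


# Hypothetical data, for illustration only.
car_type = ["Family", "Family", "Sports", "Sports", "Sports",
            "Luxury", "Luxury", "Luxury"]
label = ["C1", "C2", "C1", "C1", "C1", "C2", "C2", "C1"]

m = count_matrix(car_type, label)
multi_way = split_gini([{v} for v in m], m)
two_way = best_two_way(m)
```

On this toy data the multi-way split scores slightly lower than the best two-way split, which is expected because a multi-way split refines any two-way grouping of the same values.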

Decision Tree Based Classification
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many simple data sets
Rule-Based Classifier

Classify records by using a collection of “if…then…” rules

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds

R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes

R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals

R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles

R5: (Live in Water = sometimes) → Amphibians
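Rules R1 to R5 can be encoded as condition/label pairs and applied to a record. This sketch assumes the rules are tried in order and the first match wins, and that a record matching no rule gets a default of None; the slides do not state either choice. The two animal records are hypothetical.

```python
# Each rule: (conditions that must all hold, class label).
RULES = [
    ({"give_birth": "no", "can_fly": "yes"}, "Birds"),          # R1
    ({"give_birth": "no", "live_in_water": "yes"}, "Fishes"),   # R2
    ({"give_birth": "yes", "blood_type": "warm"}, "Mammals"),   # R3
    ({"give_birth": "no", "can_fly": "no"}, "Reptiles"),        # R4
    ({"live_in_water": "sometimes"}, "Amphibians"),             # R5
]


def classify(record, rules=RULES, default=None):
    """Return the label of the first rule whose conditions all match."""
    for conditions, label in rules:
        if all(record.get(key) == value for key, value in conditions.items()):
            return label
    return default  # assumption: no rule fired


# Hypothetical records, for illustration only.
hawk = {"give_birth": "no", "can_fly": "yes",
        "live_in_water": "no", "blood_type": "warm"}
bear = {"give_birth": "yes", "can_fly": "no",
        "live_in_water": "no", "blood_type": "warm"}
```

Here `classify(hawk)` fires R1 and `classify(bear)` fires R3. A record can satisfy several rules at once, so the ordering assumption matters.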

Nearest Neighbor Classifiers

Basic idea:

If it walks like a duck, quacks like a duck, then it’s probably a duck

[Figure: compute the distance from the test record to the training records, then choose k of the "nearest" records.]
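A minimal sketch of the idea, assuming numeric attributes, Euclidean distance, and a simple majority vote among the k nearest training records (the slides do not fix the distance measure or the voting rule). The training points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (point, label); returns the majority label of the k nearest."""
    # Compute the distance from the query to every training record and keep k.
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training records, for illustration only.
train = [
    ((1.0, 1.0), "duck"),
    ((1.2, 0.8), "duck"),
    ((0.9, 1.1), "duck"),
    ((5.0, 5.0), "goose"),
    ((5.2, 4.9), "goose"),
]
```

With these points, `knn_predict(train, (1.1, 1.0))` votes among three duck records, while `knn_predict(train, (5.1, 5.0))` finds two goose records and one duck record among its three nearest neighbors.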

Bayes Classifier

A probabilistic framework for solving classification problems

Key idea is that certain attribute values are more likely (probable) for some classes than for others

Example: Probability an individual is a male or female if the individual is wearing a dress

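Conditional Probability (with x the attribute set and y the class label, as defined earlier):

$$P(y \mid x) = \frac{P(x, y)}{P(x)}, \qquad P(x \mid y) = \frac{P(x, y)}{P(y)}$$

Bayes theorem:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$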

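Evaluating Classifiers

Confusion Matrix:

| | Predicted Class = Yes | Predicted Class = No |
|---|---|---|
| Actual Class = Yes | a: TP (true positive) | b: FN (false negative) |
| Actual Class = No | c: FP (false positive) | d: TN (true negative) |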
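For reference, the usual performance measures can be written in terms of the four counts a, b, c, d. These are the standard definitions; the symbols p, r, and F are introduced here for convenience and are not taken from the slides.

```latex
\begin{align*}
\text{Accuracy} &= \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + FN + FP + TN} \\
\text{Precision } (p) &= \frac{a}{a + c} = \frac{TP}{TP + FP} \\
\text{Recall } (r) &= \frac{a}{a + b} = \frac{TP}{TP + FN} \\
\text{F-measure } (F) &= \frac{2rp}{r + p} = \frac{2a}{2a + b + c}
\end{align*}
```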

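Methods for Classifier Evaluation

Holdout: reserve k% for training and (100-k)% for testing

Random subsampling: repeated holdout

Cross validation: partition data into k disjoint subsets

k-fold: train on k-1 partitions, test on the remaining one

Leave-one-out: k = n

Bootstrap: sampling with replacement

.632 bootstrap: a bootstrap sample of n records drawn from n records contains, on average, about 63.2% of the distinct original records, since 1 - (1 - 1/n)^n ≈ 1 - 1/e ≈ 0.632 for large n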
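The k-fold scheme can be sketched as a generator of index splits. This is an illustration under simple assumptions: records are assigned to folds round-robin without shuffling (in practice the data would usually be shuffled first), and the function name is hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs over range(n) using k disjoint folds."""
    # Assumption: round-robin assignment, no shuffling.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        # Train on the other k-1 partitions, test on the remaining one.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


five_fold = list(k_fold_indices(10, 5))     # 5 splits: 8 train / 2 test each
leave_one_out = list(k_fold_indices(4, 4))  # k = n: one test record per split
```

Every record appears in exactly one test fold, so each record is used for testing exactly once across the k rounds.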

Problem with Accuracy

Consider a 2-class problem

Number of Class 0 examples = 9990

Number of Class 1 examples = 10

If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%

This is misleading because the model does not detect any class 1 example

Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects)
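The imbalance example is easy to reproduce. The sketch below builds the 9990/10 label vector from the example, applies a model that predicts class 0 for everything, and compares accuracy with the fraction of class 1 records detected (variable names are illustrative).

```python
y_true = [0] * 9990 + [1] * 10   # class sizes from the example
y_pred = [0] * len(y_true)       # model that predicts everything as class 0

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)                       # 9990 / 10000

detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall_class1 = detected / y_true.count(1)             # no class 1 record found
```

Accuracy comes out at 99.9% while the recall for class 1 is zero, which is the mismatch the example describes.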

Example of classification accuracy measures

Accuracy = 0.8

For Yes class: precision = 0.875, recall = 0.875, F-measure = 0.875

For No class: precision = 0.5, recall = 0.5, F-measure = 0.5
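The confusion matrix for this example did not survive in the transcript. One set of counts that reproduces every figure above (with the Yes-class values read as 0.875, i.e. 87.5%) is TP = 7, FN = 1, FP = 1, TN = 1; these counts are inferred, not taken from the source. A quick check:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure for the class treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Inferred counts (assumption), with Yes as the positive class.
tp, fn, fp, tn = 7, 1, 1, 1

accuracy = (tp + tn) / (tp + fn + fp + tn)
yes_metrics = precision_recall_f(tp, fp, fn)
# For the No class the roles swap: correct No predictions are tn,
# wrong No predictions are fn, and missed No records are fp.
no_metrics = precision_recall_f(tn, fn, fp)
```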

02/14/2011

CSCI 8980: Spring 2011: Mining Biomedical Data

26

Example of classification accuracy measures

Accuracy = 0.9450

Sensitivity = 0.99

Specificity = 0.90
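These three numbers are mutually consistent if the positive and negative classes are the same size, which is an assumption here because the underlying counts are not in the transcript. With P actual positives and N actual negatives:

```latex
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP} \\
\text{Accuracy} &= \frac{P}{P + N}\,\text{Sensitivity} + \frac{N}{P + N}\,\text{Specificity} \\
&= 0.5 \times 0.99 + 0.5 \times 0.90 = 0.945 \quad (\text{assuming } P = N)
\end{align*}
```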


Measures of Classification Performance

α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).

β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).


The ROC (Receiver Operating Characteristic) curve is a graphical approach for displaying the trade-off between detection rate and false alarm rate

Developed in the 1950s for signal detection theory to analyze noisy signals

ROC curve plots True Positive Rate (TPR) against False Positive Rate (FPR)

Performance of a model is represented as a point on an ROC curve

Changing the threshold parameter of the classifier changes the location of the point


ROC Curve

(TPR,FPR):

• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal
• Diagonal line:
• Random guessing
• Below diagonal line:
• prediction is opposite of the true class


Using ROC for Model Comparison
• No model consistently outperforms the other
• M1 is better for small FPR
• M2 is better for large FPR
• Area Under the ROC curve
• Ideal: Area = 1
• Random guess: Area = 0.5
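The area under an ROC curve can be approximated with the trapezoidal rule over its points. In this sketch the points are (FPR, TPR) pairs sorted by increasing FPR; the function name and the two reference curves are illustrative.

```python
def auc(points):
    """Trapezoidal area under a curve given (FPR, TPR) points sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


ideal_curve = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # TPR reaches 1 while FPR is 0
random_curve = [(0.0, 0.0), (1.0, 1.0)]              # the diagonal line

ideal_area = auc(ideal_curve)     # 1.0
random_area = auc(random_curve)   # 0.5
```

A single area value summarizes a whole curve, so two models whose curves cross (as with M1 and M2) can have similar areas while behaving differently at small or large FPR.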


To draw ROC curve, classifier must produce continuous-valued output

Outputs are used to rank test records, from the most likely positive class record to the least likely positive class record

Many classifiers produce only discrete outputs (i.e., predicted class)

There are approaches for obtaining an ROC curve from such classifiers, for example decision trees

WEKA gives you ROC curves


ROC Curve Example

- 1-dimensional data set containing 2 classes (positive and negative)

- Any point located at x > t is classified as positive

At threshold t: TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88


How to Construct an ROC curve
• Use classifier that produces continuous-valued output for each test instance score(+|A)
• Sort the instances according to score(+|A) in decreasing order
• Apply threshold at each unique value of score(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TPR = TP/(TP+FN)
• FPR = FP/(FP + TN)
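The steps above can be sketched as follows. This is a minimal illustration, assuming labels are encoded as 1 (positive) and 0 (negative), that an instance is predicted positive when its score is greater than or equal to the threshold, and that both classes are present in the test set; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per unique score used as a threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    # Thresholds in decreasing order of score.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = pos - tp
        tn = neg - fp
        tpr = tp / (tp + fn)
        fpr = fp / (fp + tn)
        points.append((fpr, tpr))
    return points


# Hypothetical classifier scores and true labels, for illustration only.
scores = [0.95, 0.85, 0.75, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
```

The curve starts at (0, 0) and ends at (1, 1) once the threshold drops to the lowest score. Recounting at every threshold keeps the sketch close to the listed steps; a single pass over the sorted instances would give the same points more efficiently.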


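How to construct an ROC curve

[Table and figure: counts of TP, FP, TN, and FN with the resulting TPR and FPR at each threshold (instances with score >= threshold are predicted positive), followed by the ROC curve drawn from those points.]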
