AMCS/CS 340: Data Mining



Classification: Evaluation

AMCS/CS 340: Data Mining

Xiangliang Zhang

King Abdullah University of Science and Technology

Model Evaluation
• Metrics for Performance Evaluation
• How to evaluate the performance of a model?
• Methods for Performance Evaluation
• How to obtain reliable estimates?
• Methods for Model Comparison
• How to compare the relative performance among competing models?


Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Metrics for Performance Evaluation
• Focus on the predictive capability of a model
• Rather than how fast it takes to classify or build models, scalability, etc.
• Confusion Matrix: counts of TP, FP, TN, and FN (defined below)
• Most widely-used metric: Accuracy = (TP + TN) / (TP + TN + FP + FN)

TP: True Positive

FP: False Positive

TN: True Negative

FN: False Negative


Limitation of Accuracy
• Consider a 2-class problem
• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10

Unbalanced classes

• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
• Accuracy is misleading because model does not detect any class 1 example
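The unbalanced example above can be checked with a minimal sketch. The helper name `confusion_counts` and the choice of class 1 as the "positive" class are assumptions made for illustration; recall (TP / (TP + FN)) is shown alongside accuracy only to make the failure on class 1 visible.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem (hypothetical helper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

# 9990 examples of class 0 and 10 examples of class 1, as on the slide.
y_true = [0] * 9990 + [1] * 10
# A model that predicts everything to be class 0.
y_pred = [0] * 10000

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)  # fraction of class 1 examples that were detected
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

Accuracy comes out high while recall on class 1 is zero, which is the sense in which accuracy is misleading here.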


Other Measures


Methods for Performance Evaluation
• How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
• Class distribution
• Cost of misclassification
• Size of training and test sets


Methods of Estimation
• Holdout: reserve 2/3 for training and 1/3 for testing
• Random subsampling: repeated holdout
• Cross validation
• Partition data into k disjoint subsets
• k-fold: train on k-1 partitions, test on the remaining one
• Leave-one-out: k=n
• Bootstrap: sampling with replacement
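As a rough illustration of k-fold cross validation, the sketch below only generates the train/test index splits. The function name `k_fold_indices`, the shuffling step, and the fixed seed are assumptions for this example, not part of the slides.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Partition the shuffled indices into k disjoint subsets.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Train on the other k-1 partitions.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(sorted(test), len(train))

# Leave-one-out is the special case k = n.
loo_splits = list(k_fold_indices(4, 4))
```

Each record appears in exactly one test fold, so every record is used for testing once and for training k-1 times.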


ROC (Receiver Operating Characteristic)

Developed in the 1950s in signal detection theory to analyze noisy signals

Characterizes the trade-off between positive hits and false alarms

ROC curve plots TP rate (y-axis) against FP rate (x-axis)

Performance of each classifier is represented as a point on the ROC curve

Changing the threshold of the algorithm, or the sample distribution, changes the location of the point

Example: a 1-dimensional data set containing 2 classes (positive and negative)

- any point located at x > t is classified as positive

At threshold t:

TPR=0.5, FPR=0.12


ROC Curve

(FPR, TPR):

(0,0): declare everything to be negative class

(1,1): declare everything to be positive class

(0,1): ideal

Diagonal line:

Random guessing

Below diagonal line:

prediction is opposite of the true class


Using ROC for Model Comparison
• Neither model consistently outperforms the other
• M1 is better for small FPR
• M2 is better for large FPR
• Area Under the ROC curve (AUC)
• Ideal: Area = 1
• Random guess: Area = 0.5


How to construct an ROC curve

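• Use a classifier that produces a posterior probability for each test instance x
• For each threshold t, count the positive instances with probability >= t (# of + >= t, the true positives) and the negative instances with probability >= t (# of - >= t, the false positives)
• Each threshold gives one (FPR, TPR) point; connecting the points gives the ROC curve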
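A minimal sketch of this construction follows. The scores and labels are made-up illustration data, and `roc_points` and `auc` are hypothetical helper names; the area is computed with the trapezoidal rule, which is one common choice rather than something the slides specify.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: everything negative
    for t in sorted(set(scores), reverse=True):
        # Count positives and negatives whose score is >= t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoids)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical posterior probabilities and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

The lowest threshold classifies everything as positive, so the curve always ends at (1, 1).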


Confidence Interval for Accuracy
• Prediction can be regarded as a Bernoulli trial
• A Bernoulli trial has 2 possible outcomes
• Possible outcomes for prediction: correct or wrong
• Collection of Bernoulli trials has a Binomial distribution:
• x ∼ Bin(N, p), where x is the number of correct predictions
• e.g.: Toss a fair coin 50 times; how many heads would turn up? Expected number of heads = Np = 50 × 0.5 = 25
• Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?


Confidence Interval for Accuracy

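• For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1-p)/N:

$$P\left(Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2}\right) = 1 - \alpha$$

(the area under the standard normal curve between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is $1 - \alpha$)

• Confidence Interval for p:

$$p = \frac{2N\,acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4N\,acc - 4N\,acc^2}}{2\left(N + Z_{\alpha/2}^2\right)}$$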


Confidence Interval for Accuracy
• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
• N=100, acc = 0.8
• Let 1 − α = 0.95 (95% confidence)
• From the standard normal probability table, $Z_{\alpha/2}$ = 1.96
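The interval for this example can be computed with a short sketch. It assumes the usual interval obtained by solving the normal approximation (acc − p) / sqrt(p(1 − p)/N) = ±Z for p; the function name `accuracy_confidence_interval` is hypothetical.

```python
import math

def accuracy_confidence_interval(acc, n, z=1.96):
    """Interval for the true accuracy p, given observed acc on n test instances.

    Solves the quadratic in p that results from the normal approximation
    (acc - p) / sqrt(p * (1 - p) / n) = +/- z.
    """
    center = 2 * n * acc + z ** 2
    spread = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

# N = 100, acc = 0.8, 95% confidence (z = 1.96), as in the example above.
low, high = accuracy_confidence_interval(0.8, 100)
print(f"{low:.3f} <= p <= {high:.3f}")
```

For these inputs the interval runs from roughly 0.71 to 0.87; with a larger N and the same acc it narrows around 0.8.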


Test of Significance
• Given two models:
• Model M1: accuracy = 85%, tested on 30 instances
• Model M2: accuracy = 75%, tested on 5000 instances
• Can we say M1 is better than M2?
• How much confidence can we place on accuracy of M1 and M2?
• Can the difference in performance measure be explained as a result of random fluctuations in the test set?


Comparing Performance of 2 Models
• Given two models, say M1 and M2, which is better?
• M1 is tested on D1 (size=n1), found error rate = e1
• M2 is tested on D2 (size=n2), found error rate = e2
• Assume D1 and D2 are independent
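• If n1 and n2 are sufficiently large, then $e_1 \sim N(\mu_1, \sigma_1)$ and $e_2 \sim N(\mu_2, \sigma_2)$
• Approximate variance (Binomial distribution): $\hat{\sigma}_i^2 = \frac{e_i(1 - e_i)}{n_i}$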


Comparing Performance of 2 Models
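• To test if the performance difference is statistically significant:
• d = e1 – e2
• $d \sim N(d_t, \sigma_t)$, where $d_t$ is the true difference
• Since D1 and D2 are independent, their variances add up: $\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \approx \hat{\sigma}_t^2 = \frac{e_1(1-e_1)}{n_1} + \frac{e_2(1-e_2)}{n_2}$
• At (1 − α) confidence level, $d_t = d \pm Z_{\alpha/2}\,\hat{\sigma}_t$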


An Illustrative Example
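• Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
• d = |e2 – e1| = 0.1 (2-sided test)
• Estimated variance: $\hat{\sigma}_t^2 = \frac{0.15(1 - 0.15)}{30} + \frac{0.25(1 - 0.25)}{5000} = 0.0043$
• At 95% confidence level, $Z_{\alpha/2}$ = 1.96: $d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} = 0.100 \pm 0.128$
• => Interval contains 0 => difference may not be statistically significant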
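The arithmetic of this example can be reproduced with a small sketch. It assumes the variance of the difference is the sum of the two binomial variance estimates e(1 − e)/n, as stated for independent test sets; `difference_interval` is a hypothetical helper name.

```python
import math

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e1 - e2)
    # Independent test sets: the variances of the two estimates add up.
    variance = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * math.sqrt(variance)
    return d - margin, d + margin

# M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
low, high = difference_interval(0.15, 30, 0.25, 5000)
significant = not (low <= 0 <= high)
print(f"d_t in [{low:.3f}, {high:.3f}], significant: {significant}")
```

Almost all of the variance comes from M1's small test set of 30 instances, which is why the interval is wide enough to contain 0.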


Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Learning from Neighbors
• Bayesian Classification
• Neural Networks
• Ensemble Methods
• Support Vector Machines

Nearest Neighbor Classifiers
• Basic idea:
• If it walks like a duck, quacks like a duck, then it’s probably a duck

[Figure: compute the distance from the test record to the training records, then choose k of the “nearest” records]

Definition of Nearest Neighbor

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x


Nearest Neighbor Classifiers
• Requires three things
• The set of stored records
• Distance Metric to compute distance between records
• The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
• Compute distance to other training records
• Identify k nearest neighbors
• Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)


k Nearest Neighbor Classification

• Compute distance between unknown record and all training data:
• Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
• Find the k nearest neighbors
• Determine the class from the nearest neighbor list:
• take the majority vote of class labels among the k nearest neighbors
• or weight the vote according to distance
• weight factor: $w = 1/d^2$, $w = \exp(-d^2/t)$, etc.
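The steps above can be sketched as a brute-force classifier. The training records, the function names, and the small constant added to avoid division by zero in the 1/d² weight are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3, weighted=False):
    """Classify `query` from (point, label) pairs using its k nearest neighbors."""
    # Compute the distance to every training record and keep the k closest.
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    if not weighted:
        # Plain majority vote among the k nearest neighbors.
        return Counter(label for _, label in neighbors).most_common(1)[0][0]
    # Distance-weighted vote with w = 1 / d^2 (epsilon guards against d = 0).
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, query)
        votes[label] += 1.0 / (d * d + 1e-12)
    return max(votes, key=votes.get)

# Hypothetical 2-D training records with class labels "A" and "B".
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.8, 3.1), "B")]

print(knn_classify(train, (1.1, 1.0), k=3))
print(knn_classify(train, (3.0, 3.1), k=3, weighted=True))
```

Every query scans the whole training set, which is the cost that index structures such as kd-trees try to reduce.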


1 nearest-neighbor

Voronoi Diagram (nearest neighbor regions)

• Voronoi diagram
• The segments of the Voronoi diagram are all the points in the plane that are equidistant to the two nearest sites.
• The Voronoi nodes are the points equidistant to three (or more) sites.


k of k-nn


• Choosing the value of k:
• If k is too small, sensitive to noise points
• If k is too large, neighborhood may include points from other classes


Normalization of attributes
• Scaling issues
• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
• Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from \$10K to \$1M
• Solution: Normalize the vectors to unit length
• Problem with Euclidean measure:
• High dimensional data: curse of dimensionality
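To make the scaling issue concrete, here is a sketch of min-max scaling, which maps every attribute to [0, 1] so that income no longer dominates the distance. Min-max scaling is one common option chosen for this illustration; the slides themselves mention normalizing vectors to unit length. The three records are made up from the ranges quoted above.

```python
def min_max_scale(rows):
    """Rescale each attribute (column) of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# (height in m, weight in lb, income in $): hypothetical records.
people = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.65, 195, 505_000)]
scaled = min_max_scale(people)
for row in scaled:
    print(tuple(round(v, 3) for v in row))
```

Before scaling, the income gap between two people is several orders of magnitude larger than any height gap; after scaling, each attribute contributes on the same 0 to 1 range.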


k Nearest neighbor Classification

k-NN classifiers are lazy learners

They do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems

Classifying unknown records is relatively expensive

Robust to noisy data by averaging over the k nearest neighbors


k-dimensional tree (kd-tree)
• An efficient way to perform nearest neighbor searches
• space-partitioning data structure for organizing points in a k-dimensional space.


Example: 2d-tree
• A recursive space partitioning tree.
• Partition along x and y axis in an alternating fashion.
• Each internal node stores the splitting node along x (or y).
• e.g. the median of the points being put into the subtree, with respect to their coordinates in the axis being used to create the splitting plane.


k-dimensional tree (kd-tree)
• Searching for a nearest neighbor of p in a kd-tree
• Move down the tree recursively
• Reach a leaf → “current nearest”
• Unwind the recursion,
• check the parent’s other children: is there an intersection with a potential nearer neighbor?
• if no, go up a further level
• if yes, check the children
• if a point t is closer to p, t → “current nearest”
• Repeat until the root is reached

[Figure: query point p, its current nearest point, the parent node, and the parent's other children]
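The search can be sketched in a few lines. This is a simplified variant under stated assumptions: every node stores one point (the median along the splitting axis) instead of keeping points only in leaves, and the names `Node`, `build`, and `nearest` are hypothetical. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math

class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    """Build a kd-tree by splitting at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return the stored point closest to `target`."""
    if node is None:
        return best
    if best is None or math.dist(node.point, target) < math.dist(best, target):
        best = node.point  # new "current nearest"
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    # Move down the side of the split that contains the target first.
    best = nearest(near, target, best)
    # While unwinding: search the other side only if the splitting plane is
    # closer than the current nearest, so a nearer neighbor could lie there.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 2)))
```

The pruning test is what saves work compared with brute force: whole subtrees are skipped whenever the splitting plane is farther away than the best point found so far.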


Complexity

Building a static kd-tree from n points takes $O(n \log^2 n)$ time if an $O(n \log n)$ sort is used to compute the median at each level (this can be improved to $O(n \log n)$).

Inserting a new point into a balanced kd-tree takes $O(\log n)$ time.

Removing a point from a balanced kd-tree takes $O(\log n)$ time.

Querying an axis-parallel range in a balanced kd-tree takes $O(n^{1-1/k} + m)$ time, where m is the number of reported points and k is the dimension of the kd-tree.


Bayesian Classification


• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, Naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
• Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured


Bayes Classifier

• A probabilistic framework for solving classification problems
• Conditional Probability: $P(C \mid A) = \frac{P(A, C)}{P(A)}$, $P(A \mid C) = \frac{P(A, C)}{P(C)}$
• Bayes theorem: $P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$

Example of Bayes Theorem


• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time P(S|M)
• Prior probability of any patient having meningitis is 1/50,000 P(M)
• Prior probability of any patient having stiff neck is 1/2 P(S)
• If a patient has stiff neck, what’s the probability he/she has meningitis?
• Informally, this can be written as

posterior = likelihood × prior / evidence
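Plugging the numbers of the meningitis example into Bayes theorem, P(M|S) = P(S|M) P(M) / P(S), gives the answer to the question. The sketch uses the probabilities exactly as stated above; the function name `posterior` is hypothetical.

```python
def posterior(likelihood, prior, evidence):
    """Bayes theorem: posterior = likelihood * prior / evidence."""
    return likelihood * prior / evidence

p_s_given_m = 0.5      # P(S|M): meningitis causes stiff neck 50% of the time
p_m = 1 / 50000        # P(M): prior probability of meningitis
p_s = 1 / 2            # P(S): prior probability of stiff neck, as stated above

p_m_given_s = posterior(p_s_given_m, p_m, p_s)
print(p_m_given_s)
```

Even given the symptom, the posterior stays tiny because the prior P(M) is so small.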


Bayesian Classifiers


• Consider each attribute and class label as random variables
• Given a record with attributes (A1, A2,…,An)
• Goal is to predict class C, C=c1, or c2, or …..
• Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An )
• Can we estimate P(C| A1, A2,…,An ) directly from data?


Bayesian Classifiers


• Approach:
• compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem
• Choose value of C that maximizes P(C | A1, A2, …, An)
• Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C)
• How to estimate likelihood P(A1, A2, …, An | C )?


Naïve Bayes Classifier

• Assume independence among attributes Ai when class is given:
• P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
• Greatly reduces the computation cost: only counts the class distribution
• Can estimate P(Ai | Cj) for all Ai and Cj.
• New point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is maximal.

How to Estimate Probabilities from Data?


• For Class: P(C) = Nc/N

e.g., P(No) = 7/10, P(Yes) = 3/10

• For discrete attributes: P(Ai | Ck) = |Aik|/ Nck

where |Aik| is the number of instances that have attribute value Ai and belong to class Ck

• Examples:

P(Status=Married|No) = 4/7

P(Refund=Yes|Yes) = 0


How to Estimate Probabilities from Data?


For continuous attributes:

• Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution (e.g., mean μ and standard deviation σ)
• Once probability distribution is known, can use it to estimate the conditional probability P(Ai|Ci)


How to Estimate Probabilities from Data?

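• Normal distribution:

$$P(A_i \mid C_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$

One for each (Ai, Cj) pair

• e.g., for (Income, Class=No):
• If Class=No
• sample mean = 110
• sample variance = 2975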
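With the sample mean and variance above, the class-conditional density for a particular income can be evaluated from the normal density. The sketch below assumes an income of 120 (in thousands) purely as an illustration value; `gaussian_density` is a hypothetical helper name.

```python
import math

def gaussian_density(x, mean, variance):
    """Normal probability density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# (Income, Class=No): sample mean = 110, sample variance = 2975
p_income_120_given_no = gaussian_density(120, 110, 2975)
print(round(p_income_120_given_no, 4))
```

The value is a density rather than a probability, which is acceptable here because naive Bayes only compares the resulting scores across classes.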


Example of Naïve Bayes Classifier

Given a Test Record X = (Refund=No, Married, Income=120K):

• P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024
• P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes) = 1 × 0 × 1.2 × 10^-9 = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes),

therefore P(No|X) > P(Yes|X) => Class = No
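The decision above can be reproduced with a short sketch that multiplies each class prior by the per-attribute likelihoods. All numbers are the ones quoted in the slides (priors 7/10 and 3/10, and the three conditional probabilities per class); the dictionary layout is an assumption for illustration. `math.prod` requires Python 3.8 or later.

```python
import math

# Class priors P(C) from the training data.
priors = {"No": 7 / 10, "Yes": 3 / 10}

# P(Refund=No|C), P(Married|C), P(Income=120K|C) for each class.
likelihoods = {
    "No": [4 / 7, 4 / 7, 0.0072],
    "Yes": [1.0, 0.0, 1.2e-9],
}

# Naive Bayes score: P(C) times the product of the attribute likelihoods.
scores = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

The score for class Yes is exactly zero because a single conditional probability is zero, regardless of the other factors.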


Avoiding the 0-Probability Problem
• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation:

Original: $P(A_i \mid C) = \frac{N_{ic}}{N_c}$

Laplace: $P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$

m-estimate: $P(A_i \mid C) = \frac{N_{ic} + mp}{N_c + m}$

c: number of classes

p: prior probability

m: parameter

• E.g. Suppose a dataset with 1000 tuples,
• income = low (0),
• income = medium (990),
• income = high (10),
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case, c = 3

Prob(income = low) = 1/1003

Prob(income = medium) = 991/1003

Prob(income = high) = 11/1003
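A minimal sketch of the Laplacian correction for the income example (0 low, 990 medium, 10 high out of 1000 tuples): one is added to each count and the number of distinct values (3 here) is added to the total. The function name `laplace_estimates` is hypothetical, and exact fractions are used only to keep the results easy to read.

```python
from fractions import Fraction

def laplace_estimates(counts):
    """Add-one (Laplace) smoothed probability for each value in `counts`."""
    c = len(counts)                  # number of distinct values
    total = sum(counts.values())     # number of tuples
    return {value: Fraction(n + 1, total + c) for value, n in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
probs = laplace_estimates(income_counts)
for value, p in probs.items():
    print(value, p)
```

No estimate is zero any more, and the smoothed values stay close to the raw relative frequencies.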

Naïve Bayes Classifier: Advantages

• Easy to implement
• Good results obtained in most of the cases
• Robust to isolated noise points
• Robust to irrelevant attributes
• Handle missing values by ignoring the instance during probability estimate calculations

Naïve Bayes Classifier: Disadvantages

• Independence assumption may not hold for some attributes
• Practically, dependencies exist among variables
• E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by the Naïve Bayesian Classifier, which leads to loss of accuracy
• How to deal with these dependencies?
• Bayesian Belief Networks (BBN)


Bayesian Belief Networks
• A Bayesian belief network allows a subset of the variables to be conditionally independent
• A graphical model of causal relationships (directed acyclic graph)
• Represents dependency among the variables
• Gives a specification of joint probability distribution
• Nodes: random variables
• X and Y are the parents of Z, and Y is the parent of P
• No dependency between Z and P
• Has no loops or cycles

[Figure: directed acyclic graph with nodes X, Y, Z, P; X and Y point to Z, and Y points to P]

Bayesian Belief Network: An Example

[Figure: Bayesian belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

The conditional probability table (CPT) for variable LungCancer:

| | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S) |
|---|---|---|---|---|
| LC | 0.8 | 0.5 | 0.7 | 0.1 |
| ~LC | 0.2 | 0.5 | 0.3 | 0.9 |

CPT shows the conditional probability for each possible combination of its parents

P(LungCancer = YES | FH = YES, S = YES) = 0.8

P(LungCancer = NO | FH = NO, S = NO) = 0.9

Derivation of the probability of a particular combination of test tuple with values (x1, …, xn) from CPT:

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(x_i))$$
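A sketch of how a CPT feeds the joint probability, restricted to the three nodes FamilyHistory, Smoker, and LungCancer. It stores the ~LC row of the CPT and derives the LC entries as complements. The priors `P_FH` and `P_S` are hypothetical values invented for this example, since the slides do not give them, and the sketch assumes FamilyHistory and Smoker have no parents.

```python
from itertools import product

# ~LC row of the CPT, keyed by (family_history, smoker).
P_NOT_LC = {
    (True, True): 0.2,
    (True, False): 0.5,
    (False, True): 0.3,
    (False, False): 0.9,
}

# Hypothetical priors for the parent nodes (NOT given in the slides).
P_FH = 0.1
P_S = 0.3

def p_lung_cancer(lc, fh, s):
    """P(LungCancer = lc | FamilyHistory = fh, Smoker = s) from the CPT."""
    p_no = P_NOT_LC[(fh, s)]
    return 1.0 - p_no if lc else p_no

def joint(fh, s, lc):
    """P(fh, s, lc) as the product of each node's probability given its parents."""
    p_fh = P_FH if fh else 1.0 - P_FH
    p_s = P_S if s else 1.0 - P_S
    return p_fh * p_s * p_lung_cancer(lc, fh, s)

# The joint probabilities over all eight combinations should sum to 1.
total = sum(joint(fh, s, lc) for fh, s, lc in product([True, False], repeat=3))
print(joint(True, True, True), total)
```

Extending this to the full six-node network would need one CPT per remaining node, multiplied in the same way.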


Bayesian Belief Network: An Example

If the CPTs are known, a BBN can be used to:

• Compute the joint probability of a tuple, e.g. P(FH=Y, S=Y, LC=Y, E=N, PXR=Y, D=N)
• Take a node as an “output”, representing a class label attribute, e.g. PositiveXRay as the class attribute
• Predict the class of a tuple, e.g. PXR = ? given FH=N, S=Y, LC=N:
• compute a = P(PXR=Y | FH=N, S=Y, LC=N) and b = P(PXR=N | FH=N, S=Y, LC=N)
• if a > b, then PositiveXRay = Yes

[Figure: the same network (FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea), with PositiveXRay marked as the class attribute]

Training Bayesian Networks by training data instances
• Several scenarios:
• Given both the network structure and all variables observable: learn only the CPTs
• Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
• Network structure unknown, all variables observable: search through the model space to reconstruct network topology
• Unknown structure, all hidden variables: No good algorithms known for this purpose


References

Kd-tree

-- Kd-Trees: Another Range Searching Trees

http://www.cs.fsu.edu/~lifeifei/cis5930/kdtree.pdf

http://www.cs.uu.nl/docs/vakken/ga/slides5.pdf

http://3glab.cs.nthu.edu.tw/~spoon/courses/CS631100/Lecture06_handout.pdf

-- Animations of KD-tree searches

http://www.cs.cmu.edu/~awm/animations/kdtree/

Bayesian networks

-- BOOK: D. Heckerman, Bayesian networks for data mining

-- A Tutorial on Learning With Bayesian Networks

http://research.microsoft.com/pubs/69588/tr-95-06.pdf

-- Kevin Murphy, 1998: A Brief Introduction to Graphical Models and Bayesian Networks

http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html


Questions

What criteria can you use?

How to compare the performance of classifiers?

How can k-nearest neighbors be used for classification?

How to use Naïve Bayesian classifier?

What are the disadvantages of Naïve Bayesian classifier?

How does Bayesian Network work?
