
# AMCS/CS 340: Data Mining - PowerPoint PPT Presentation

Classification: Evaluation. AMCS/CS 340: Data Mining. Xiangliang Zhang King Abdullah University of Science and Technology. Model Evaluation. Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation


## PowerPoint Slideshow about ' AMCS/CS 340: Data Mining' - dunne

Presentation Transcript
AMCS/CS 340: Data Mining

Xiangliang Zhang

King Abdullah University of Science and Technology

• Metrics for Performance Evaluation

• How to evaluate the performance of a model?

• Methods for Performance Evaluation

• How to obtain reliable estimates?

• Methods for Model Comparison

• How to compare the relative performance among competing models?


Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

• Focus on the predictive capability of a model

• Rather than how fast it takes to classify or build models, scalability, etc.

• Confusion Matrix:

TP: True Positive

FP: False Positive

TN: True Negative

FN: False Negative

• Most widely-used metric: Accuracy = (TP + TN) / (TP + TN + FP + FN)

Unbalanced classes

• Consider a 2-class problem

• Number of Class 0 examples = 9990

• Number of Class 1 examples = 10

• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%

• Accuracy is misleading because the model does not detect any class 1 example
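The unbalanced-class example can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, accuracy is taken as (TP + TN) divided by the number of test instances, and class 1 is treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def true_positive_rate(tp, fn):
    # Fraction of positive examples that the model detects.
    return tp / (tp + fn) if (tp + fn) else 0.0


# 9990 examples of class 0, 10 of class 1; the model always predicts class 0.
y_true = [0] * 9990 + [1] * 10
y_pred = [0] * 10000

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
tpr = true_positive_rate(tp, fn)
print(f"accuracy={acc:.3f}, TP rate={tpr:.3f}")
```

The accuracy is high while the TP rate for class 1 is zero, which is why accuracy alone is a poor guide on unbalanced data.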


• How to obtain a reliable estimate of performance?

• Performance of a model may depend on other factors besides the learning algorithm:

• Class distribution

• Cost of misclassification

• Size of training and test sets


• Holdout

Reserve 2/3 for training and 1/3 for testing

• Random subsampling

Repeated holdout

• Cross validation

• Partition data into k disjoint subsets

• k-fold: train on k-1 partitions, test on the remaining one

• Leave-one-out: k=n

• Bootstrap

Sampling with replacement
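As a rough illustration of k-fold cross validation, the sketch below only produces the train/test index splits; fitting and scoring a model on each split is left out. The function name `k_fold_indices` and the `seed` parameter are assumptions made for this example.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Partition the shuffled indices into k disjoint subsets.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Calling `k_fold_indices(n, n)` gives leave-one-out: every test set then holds exactly one instance.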


ROC (Receiver Operating Characteristic) analysis was developed in the 1950s in signal detection theory to analyze noisy signals

It characterizes the trade-off between positive hits and false alarms

ROC curve plots TP rate (y-axis) against FP rate (x-axis)

Performance of each classifier is represented as a point on the ROC curve

Changing the threshold of the algorithm, or the sample distribution, changes the location of the point

Example: a 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive

At threshold t: TPR = 0.5, FPR = 0.12

(TPR,FPR):

(0,0): declare everything to be negative class

(1,1): declare everything to be positive class

(1,0): ideal (TPR = 1, FPR = 0)

Diagonal line:

Random guessing

Below diagonal line:

prediction is opposite of the true class

• Neither model consistently outperforms the other

• M1 is better for small FPR

• M2 is better for large FPR

• Area Under the ROC curve

• Ideal:

• Area = 1

• Random guess:

• Area = 0.5


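[Table: the posterior probability of each test instance x is used as a threshold t; for each t, count the # of + instances >= t and the # of - instances >= t. Figure: the resulting ROC curve.]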
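One possible way to build the curve from scored test instances is sketched below. It assumes each instance has a score (for example a posterior probability of the positive class) and a 0/1 label, and that both classes occur at least once; the names `roc_points` and `auc` and the sample data are illustrative only.

```python
def roc_points(scores, labels):
    """Return ROC points as (FPR, TPR) pairs, one per distinct threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for t in sorted(set(scores), reverse=True):
        # Instances scoring >= t are declared positive at threshold t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


scores = [0.95, 0.85, 0.80, 0.60, 0.40, 0.30]  # made-up scores
labels = [1, 1, 0, 1, 0, 0]                    # 1 = positive, 0 = negative
points = roc_points(scores, labels)
area = auc(points)
print(points, round(area, 3))
```

The curve starts at (0, 0), where everything is declared negative, and ends at (1, 1), where everything is declared positive.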


• Prediction can be regarded as a Bernoulli trial

• A Bernoulli trial has 2 possible outcomes

• Possible outcomes for prediction: correct or wrong

• Collection of Bernoulli trials has a Binomial distribution:

• x ~ Bin(N, p), where x is the number of correct predictions

• e.g.: Toss a fair coin 50 times, how many heads would turn up? Expected number of heads = Np = 50 × 0.5 = 25

• Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

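• For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1-p)/N:

$$P\left(Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2}\right) = 1 - \alpha$$

(The area under the standard normal density between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is $1 - \alpha$.)

• Confidence Interval for p:

$$p = \frac{2N \cdot acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4N \cdot acc - 4N \cdot acc^2}}{2\left(N + Z_{\alpha/2}^2\right)}$$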

• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:

• N=100, acc = 0.8

• Let $1 - \alpha = 0.95$ (95% confidence)

• From the probability table of the standard normal distribution, $Z_{\alpha/2} = 1.96$
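The interval for this example can be computed with a short sketch. It assumes the interval is obtained by solving the normal approximation of acc for p (a quadratic in p, also known as the Wilson score interval); the function name is illustrative.

```python
import math


def accuracy_confidence_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p.

    Solves |acc - p| <= z * sqrt(p * (1 - p) / n) for p.
    """
    z2 = z * z
    center = 2 * n * acc + z2
    spread = z * math.sqrt(z2 + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z2)
    return (center - spread) / denom, (center + spread) / denom


low, high = accuracy_confidence_interval(acc=0.8, n=100)
print(f"95% interval for p: [{low:.3f}, {high:.3f}]")
```

With N = 100 and acc = 0.8 the interval is roughly 0.711 to 0.867; it narrows as N grows, which is why the size of the test set matters when reporting accuracy.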


• Given two models:

• Model M1: accuracy = 85%, tested on 30 instances

• Model M2: accuracy = 75%, tested on 5000 instances

• Can we say M1 is better than M2?

• How much confidence can we place on accuracy of M1 and M2?

• Can the difference in performance measure be explained as a result of random fluctuations in the test set?


• Given two models, say M1 and M2, which is better?

• M1 is tested on D1 (size=n1), found error rate = e1

• M2 is tested on D2 (size=n2), found error rate = e2

• Assume D1 and D2 are independent

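• If n1 and n2 are sufficiently large, then the error rates are approximately normally distributed: $e_1 \sim N(\mu_1, \sigma_1^2)$, $e_2 \sim N(\mu_2, \sigma_2^2)$

• Approximation of the variance (Binomial distribution): $\hat{\sigma}_i^2 = \dfrac{e_i(1 - e_i)}{n_i}$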

• To test if performance difference is statistically significant:

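• d = e1 – e2

• $d \sim N(d_t, \sigma_t^2)$, where $d_t$ is the true difference

• Since D1 and D2 are independent, their variances add up: $\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \approx \hat{\sigma}_1^2 + \hat{\sigma}_2^2 = \dfrac{e_1(1-e_1)}{n_1} + \dfrac{e_2(1-e_2)}{n_2}$

• At the $(1-\alpha)$ confidence level, $d_t = d \pm Z_{\alpha/2}\,\hat{\sigma}_t$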

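• Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25

• d = |e2 – e1| = 0.1 (2-sided test)

• Estimated variance of d: $\hat{\sigma}_t^2 = \dfrac{0.15(1-0.15)}{30} + \dfrac{0.25(1-0.25)}{5000} = 0.0043$

• At 95% confidence level, $Z_{\alpha/2} = 1.96$: $d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} = 0.100 \pm 0.128$

=> Interval contains 0 => difference may not be statistically significant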
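The same comparison can be reproduced numerically. The sketch below assumes the two test sets are independent, so the variance of the difference is the sum of the two binomial variance estimates e(1-e)/n; the function name is illustrative.

```python
import math


def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e1 - e2)
    variance = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * math.sqrt(variance)
    return d - margin, d + margin


low, high = difference_interval(e1=0.15, n1=30, e2=0.25, n2=5000)
significant = not (low <= 0 <= high)
print(f"interval: [{low:.3f}, {high:.3f}], significant: {significant}")
```

Because the interval spans zero, the observed difference of 0.1 may be due to random fluctuation, driven mostly by the small test set of M1.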


Classification Techniques

• Rule-based Methods

• Learning from Neighbors

• Bayesian Classification

• Neural Networks

• Ensemble Methods

• Support Vector Machines

Nearest Neighbor Classifiers

• Basic idea:

• If it walks like a duck, quacks like a duck, then it’s probably a duck

[Figure: a test record is compared against the training records; choose k of the “nearest” records.]

The k nearest neighbors of a record x are the data points that have the k smallest distances to x

• Requires three things

• The set of stored records

• Distance Metric to compute distance between records

• The value of k, the number of nearest neighbors to retrieve

• To classify an unknown record:

• Compute distance to other training records

• Identify k nearest neighbors

• Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)

• Compute distance between unknown record and all training data:

• Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$

• Find k nearest neighbors

• Determine the class from nearest neighbor list

• take the majority vote of class labels among the k-nearest neighbors

• weight the vote according to distance

• weight factor, $w = 1/d^2$, $w = \exp(-d^2/t)$, etc.
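These steps might look as follows in Python. This is a brute-force sketch, not an optimized implementation: the function names and the toy data are illustrative, and the small epsilon added to the squared distance is an assumption to avoid division by zero.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(train, query, k=3, weighted=False):
    """Classify `query` from a list of (point, label) training records."""
    # Compute the distance to every training record and keep the k nearest.
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, query)
        # Weighted vote uses w = 1/d^2; the epsilon is an assumption.
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


train = [((0.0, 0.0), "A"), ((3.0, 0.0), "B"), ((0.0, 3.0), "B")]
query = (0.5, 0.5)
majority = knn_classify(train, query, k=3)
by_distance = knn_classify(train, query, k=3, weighted=True)
print(majority, by_distance)
```

In this toy case the plain majority vote picks B (two of the three neighbors), while the distance-weighted vote picks A because the single A record is much closer to the query.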

Voronoi Diagram (nearest neighbor regions)

• The segments of the Voronoi diagram are all the points in the plane that are equidistant from their two nearest sites.

• The Voronoi nodes are the points equidistant from three (or more) sites.

k of k-nn


• Choosing the value of k:

• If k is too small, sensitive to noise points

• If k is too large, neighborhood may include points from other classes


• Scaling issues

• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes

• Example:

• height of a person may vary from 1.5m to 1.8m

• weight of a person may vary from 90lb to 300lb

• income of a person may vary from \$10K to \$1M

• Solution: Normalize the vectors to unit length

• Problem with Euclidean measure:

• High-dimensional data: curse of dimensionality
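The scaling issue can be made concrete with a small sketch. It uses min-max scaling of each attribute to [0, 1], which is one common choice and is swapped in here for illustration (it is not the unit-length normalization named above); the three records are made up.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_max_scale(rows):
    """Rescale every attribute (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def nearest_other(rows, i):
    """Index of the record closest to rows[i], excluding itself."""
    return min((j for j in range(len(rows)) if j != i),
               key=lambda j: euclidean(rows[i], rows[j]))


# Hypothetical records: (height in m, weight in lb, income in $).
people = [(1.5, 90, 10_000), (1.8, 300, 12_000), (1.6, 100, 1_000_000)]

raw_neighbor = nearest_other(people, 0)
scaled_neighbor = nearest_other(min_max_scale(people), 0)
print(raw_neighbor, scaled_neighbor)
```

On the raw values income dominates the distance, so the nearest neighbor of the first person is the one with a similar income; after scaling, height and weight contribute as well and the nearest neighbor changes.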


k Nearest neighbor Classification

k-NN classifiers are lazy learners

They do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems

Robust to noisy data by averaging k-nearest neighbors

Classifying unknown records is relatively expensive

k-dimensional tree (kd-tree)

• An efficient way to perform nearest neighbor searches

• A space-partitioning data structure for organizing points in a k-dimensional space.

• A recursive space partitioning tree.

• Partition along x and y axis in an alternating fashion.

• Each internal node stores the splitting point along x (or y),

• e.g. the median of the points being put into the subtree, with respect to their coordinates in the axis being used to create the splitting plane.

• Searching for a nearest neighbor of p in a kd-tree

• Move down the tree recursively

• Reach a leaf → “current nearest”

• Unwind the recursion,

• check the parent’s other children: is there an intersection with a potential nearer neighbor?

• if no, go up to further level

• if yes, check the children

• if t is closer to p, t → “current nearest”

• Repeat until reach the root

[Figure: query point p, the current nearest point, its parent node, and the parent's other children.]

Building a static kd-tree from n points takes $O(n \log^2 n)$ time if an $O(n \log n)$ sort is used to compute the median at each level (this can be improved to $O(n \log n)$).

Inserting a new point into a balanced kd-tree takes $O(\log n)$ time.

Removing a point from a balanced kd-tree takes $O(\log n)$ time.

Querying an axis-parallel range in a balanced kd-tree takes $O(n^{1-1/k} + m)$ time, where m is the number of the reported points, and k the dimension of the kd-tree.
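A compact sketch of building a kd-tree and searching it for a nearest neighbor is given below. It is one common variant, assumed here for illustration: every node stores the median point of its subtree (so points also live in internal nodes), the median is found by sorting at each level, and the names `Node`, `build`, and `nearest` are not from the slides.

```python
from collections import namedtuple

Node = namedtuple("Node", "point axis left right")


def build(points, depth=0):
    """Recursively split on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(node, target, best=None):
    """Return the stored point closest to `target`."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    # Move down the side of the splitting plane that contains the target.
    best = nearest(near, target, best)
    # While unwinding, visit the other side only if the splitting plane is
    # closer to the target than the current nearest point.
    if diff * diff < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
result = nearest(tree, (9, 2))
print(result)
```

The pruning test on the splitting plane is what lets the search skip whole subtrees instead of computing the distance to every stored point.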

Bayesian Classification

• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities

• Foundation: Based on Bayes’ Theorem.

• Performance: A simple Bayesian classifier, Naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers

• Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data

• Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured


• A probabilistic framework for solving classification problems

• Conditional Probability: $P(C \mid A) = \dfrac{P(A, C)}{P(A)}$, $P(A \mid C) = \dfrac{P(A, C)}{P(C)}$

• Bayes theorem: $P(C \mid A) = \dfrac{P(A \mid C)\,P(C)}{P(A)}$

• Given:

• A doctor knows that meningitis causes stiff neck 50% of the time: P(S|M) = 0.5

• Prior probability of any patient having meningitis is 1/50,000: P(M) = 1/50,000

• Prior probability of any patient having stiff neck is 1/2: P(S) = 1/2

• If a patient has stiff neck, what’s the probability he/she has meningitis?

• Informally, this can be written as

posterior = likelihood × prior / evidence
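Plugging the stated numbers into Bayes' theorem gives the answer to the question. The sketch below takes the three probabilities exactly as given in the example; the function name is illustrative.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(M|S) = P(S|M) * P(M) / P(S)."""
    return likelihood * prior / evidence


p_s_given_m = 0.5       # meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000        # prior probability of meningitis
p_s = 1 / 2             # prior probability of stiff neck, as stated above

p_m_given_s = posterior(p_s_given_m, p_m, p_s)
print(p_m_given_s)
```

Even with a stiff neck, the probability of meningitis stays very small because the prior P(M) is so low.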


• Consider each attribute and class label as random variables

• Given a record with attributes (A1, A2,…,An)

• Goal is to predict class C, C=c1, or c2, or …..

• Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An )

• Can we estimate P(C| A1, A2,…,An ) directly from data?


• Approach:

• compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem

• Choose value of C that maximizes P(C | A1, A2, …, An)

• Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C)

• How to estimate likelihood P(A1, A2, …, An | C )?


• Assume independence among attributes Ai when class is given:

• P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

• Greatly reduces the computation cost: only counts the class distribution

• Can estimate P(Ai | Cj) for all Ai and Cj.

• New point is classified to Cj if $P(C_j) \prod_i P(A_i \mid C_j)$ is maximal.

• For Class: P(C) = Nc/N

e.g., P(No) = 7/10, P(Yes) = 3/10

• For discrete attributes: P(Ai | Ck) = |Aik| / Nck

where |Aik| is the number of instances that have attribute value Ai and belong to class Ck

• Examples:

P(Status=Married|No) = 4/7

P(Refund=Yes|Yes) = 0

For continuous attributes:

• Probability density estimation:

• Assume attribute follows a normal distribution

• Use data to estimate parameters of distribution (e.g., mean μ and standard deviation σ)

• Once probability distribution is known, can use it to estimate the conditional probability P(Ai|Ci)


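• Normal distribution:

$$P(A_i \mid C_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$

One for each (Ai, Cj) pair

• e.g., for (Income, Class=No):

• If Class=No

• sample mean = 110

• sample variance = 2975

$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left(-\frac{(120 - 110)^2}{2(2975)}\right) = 0.0072$$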

Given a Test Record: X = (Refund = No, Married, Income = 120K)

• P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024

• P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes) = 1 × 0 × 1.2 × 10^-9 = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes),

therefore P(No|X) > P(Yes|X) => Class = No
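The arithmetic of this worked example can be retraced in a few lines. The sketch assumes a normal density for Income with the sample mean 110 and sample variance 2975 quoted for Class=No, and reuses the other probabilities and the priors P(No) = 7/10, P(Yes) = 3/10 as given; the variable names are illustrative.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Normal density used as the class-conditional estimate."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


# Class = No: P(Refund=No|No) = 4/7, P(Married|No) = 4/7, Income ~ N(110, 2975)
p_income_no = gaussian_pdf(120, mean=110, variance=2975)
likelihood_no = (4 / 7) * (4 / 7) * p_income_no

# Class = Yes: values taken directly from the example above
likelihood_yes = 1 * 0 * 1.2e-9

# Multiply by the class priors and pick the larger score.
score_no = likelihood_no * 7 / 10
score_yes = likelihood_yes * 3 / 10
predicted = "No" if score_no > score_yes else "Yes"
print(round(p_income_no, 4), round(likelihood_no, 4), predicted)
```

Note that the Yes score is exactly zero only because one conditional probability, P(Married|Yes), is zero.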


• If one of the conditional probabilities is zero, then the entire expression becomes zero

• Probability estimation:

Original: $P(A_i \mid C) = \dfrac{N_{ic}}{N_c}$

Laplace: $P(A_i \mid C) = \dfrac{N_{ic} + 1}{N_c + c}$

m-estimate: $P(A_i \mid C) = \dfrac{N_{ic} + mp}{N_c + m}$

c: number of classes

p: prior probability

m: parameter

• E.g. Suppose a dataset with 1000 tuples,

• income = low (0),

• income = medium (990),

• income = high (10),

• Use Laplacian correction (or Laplacian estimator)

• Adding 1 to each case, c = 3

Prob(income = low) = 1/1003

Prob(income = medium) = 991/1003

Prob(income = high) = 11/1003
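The Laplacian correction from the income example can be sketched as below. The function name is illustrative; c is taken as the number of distinct values being counted (3 in the example), and 1 is added to each count.

```python
def laplace_estimates(counts):
    """Add 1 to every count so that no estimated probability is zero."""
    total = sum(counts.values())
    c = len(counts)  # number of cases receiving the +1 correction
    return {value: (n + 1) / (total + c) for value, n in counts.items()}


counts = {"low": 0, "medium": 990, "high": 10}
probs = laplace_estimates(counts)
for value, p in probs.items():
    print(f"Prob(income = {value}) = {p:.4f}")
```

The corrected estimates stay close to the raw frequencies, but the value with zero observations no longer forces the whole product of conditional probabilities to zero.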


• Easy to implement

• Good results obtained in most of the cases

• Robust to isolated noise points

• Robust to irrelevant attributes

• Handle missing values by ignoring the instance during probability estimate calculations


• Independence assumption may not hold for some attributes

• Practically, dependencies exist among variables

• E.g., hospital patients: Profile (age, family history, etc.), Symptoms (fever, cough, etc.), Disease (lung cancer, diabetes, etc.)

• Dependencies among these cannot be modeled by the Naïve Bayesian Classifier, which leads to a loss of accuracy

• How to deal with these dependencies?

• Bayesian Belief Networks (BBN)

Bayesian Belief Networks

• A Bayesian belief network allows a subset of the variables to be conditionally independent

• A graphical model of causal relationships (directed acyclic graph)

• Represents dependency among the variables

• Gives a specification of joint probability distribution

[Figure: example network with nodes X, Y, Z, and P.]

• Nodes: random variables

• X and Y are the parents of Z, and Y is the parent of P

• No dependency between Z and P

• Has no loops or cycles

Bayesian Belief Networks

[Figure: belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea.]

The conditional probability table (CPT) for variable LungCancer:

| | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S) |
|---|---|---|---|---|
| LC | 0.8 | 0.5 | 0.7 | 0.1 |
| ~LC | 0.2 | 0.5 | 0.3 | 0.9 |

CPT shows the conditional probability for each possible combination of its parents

P(LungCancer = YES | FH = YES, S = YES) = 0.8

P(LungCancer = NO | FH = NO, S = NO) = 0.9

Derivation of the probability of a particular combination of values (x1, …, xn) of a test tuple from the CPTs:

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(Y_i))$$

where $Y_i$ is the variable that takes the value $x_i$.

Bayesian Belief Networks

If the CPTs are known, a BBN can be used to:

Compute the joint probability of a tuple:

P(FH=Y, S=Y, LC=Y, E=N, PXR=Y, D=N)

Take a node as an “output”, representing a class label attribute

e.g., PositiveXRay as the class attribute

Predict the class of a tuple

e.g., PXR = ? given FH=N, S=Y, LC=N

compute P(PXR=Y | FH=N, S=Y, LC=N) = a

P(PXR=N | FH=N, S=Y, LC=N) = b

if a > b, PositiveXRay = Yes

Training Bayesian Networks by training data instances

• Several scenarios:

• Given both the network structure and all variables observable: learn only the CPTs

• Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning

• Network structure unknown, all variables observable: search through the model space to reconstruct network topology

• Unknown structure, all hidden variables: No good algorithms known for this purpose


Kd-tree

-- Kd-Trees: Another Range Searching Trees

http://www.cs.fsu.edu/~lifeifei/cis5930/kdtree.pdf

http://www.cs.uu.nl/docs/vakken/ga/slides5.pdf

http://3glab.cs.nthu.edu.tw/~spoon/courses/CS631100/Lecture06_handout.pdf

-- Animations of KD-tree searches

http://www.cs.cmu.edu/~awm/animations/kdtree/

Bayesian networks

-- BOOK: D. Heckerman, Bayesian networks for data mining

-- A Tutorial on Learning With Bayesian Networks

http://research.microsoft.com/pubs/69588/tr-95-06.pdf

-- Kevin Murphy, 1998: A Brief Introduction to Graphical Models and Bayesian Networks

http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

What criteria can you use to evaluate a classifier?

How to compare the performance of classifiers?

How can k-nearest neighbors be used for classification?

How to use the Naïve Bayesian classifier?

What are the disadvantages of the Naïve Bayesian classifier?

How does a Bayesian network work?