Classification evaluation

AMCS/CS 340: Data Mining - PowerPoint PPT Presentation



Classification: Evaluation. AMCS/CS 340: Data Mining. Xiangliang Zhang King Abdullah University of Science and Technology. Model Evaluation. Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation


Classification: Evaluation

AMCS/CS 340: Data Mining

Xiangliang Zhang

King Abdullah University of Science and Technology


Model Evaluation

  • Metrics for Performance Evaluation

    • How to evaluate the performance of a model?

  • Methods for Performance Evaluation

    • How to obtain reliable estimates?

  • Methods for Model Comparison

    • How to compare the relative performance among competing models?


Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining


Metrics for Performance Evaluation

  • Focus on the predictive capability of a model

    • Rather than how quickly it classifies or builds models, its scalability, etc.

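  • Confusion Matrix:

                       Predicted: Yes    Predicted: No
      Actual: Yes      TP                FN
      Actual: No       FP                TN

    TP: True Positive, FP: False Positive, TN: True Negative, FN: False Negative

  • Most widely-used metric: accuracy

    $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$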

Limitation of Accuracy

  • Consider a 2-class problem

    • Number of Class 0 examples = 9990

    • Number of Class 1 examples = 10

      Unbalanced classes

  • If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%

  • Accuracy is misleading because the model does not detect any class 1 example
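
The unbalanced example above can be checked numerically. The following is a minimal sketch using only the counts given on the slide; the helper name `confusion_counts` and the choice of class 1 as the "positive" class are assumptions made for illustration. It also reports recall (the fraction of class 1 examples detected), which exposes the problem that accuracy hides.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN), treating `positive` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

# 9990 examples of class 0 and 10 examples of class 1, as on the slide.
y_true = [0] * 9990 + [1] * 10
# A model that predicts class 0 for everything.
y_pred = [0] * 10000

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)
# Recall = TP / (TP + FN): share of class 1 examples that were detected.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy = {accuracy:.3f}, recall on class 1 = {recall:.3f}")
```

Accuracy comes out at 99.9% while recall on class 1 is zero, which is the sense in which accuracy is misleading here.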

Other Measures

Model Evaluation

  • Metrics for Performance Evaluation

    • How to evaluate the performance of a model?

  • Methods for Performance Evaluation

    • How to obtain reliable estimates?

  • Methods for Model Comparison

    • How to compare the relative performance among competing models?

Methods for Performance Evaluation

  • How to obtain a reliable estimate of performance?

  • Performance of a model may depend on other factors besides the learning algorithm:

    • Class distribution

    • Cost of misclassification

    • Size of training and test sets

Methods of Estimation

  • Holdout

    Reserve 2/3 for training and 1/3 for testing

  • Random subsampling

    Repeated holdout

  • Cross validation

    • Partition data into k disjoint subsets

    • k-fold: train on k-1 partitions, test on the remaining one

    • Leave-one-out: k=n

  • Bootstrap

    Sampling with replacement
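
As a rough illustration of k-fold cross validation, the sketch below partitions record indices into k disjoint subsets and yields one train/test split per fold. The function name `k_fold_indices` and the fixed shuffling seed are assumptions for this example, not part of the lecture material.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation on n records."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k disjoint subsets
    for i in range(k):
        test = folds[i]
        # Train on the other k-1 partitions.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Calling `k_fold_indices(n, n)` gives leave-one-out, since each fold then holds exactly one record.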

Model Evaluation

  • Metrics for Performance Evaluation

    • How to evaluate the performance of a model?

  • Methods for Performance Evaluation

    • How to obtain reliable estimates?

  • Methods for Model Comparison

    • How to compare the relative performance among competing models?

ROC (Receiver Operating Characteristic)

Developed in the 1950s for signal detection theory to analyze noisy signals

Characterizes the trade-off between positive hits and false alarms

ROC curve plots TP rate (y-axis) against FP rate (x-axis)

Performance of each classifier is represented as a point on the ROC curve

Changing the threshold of the algorithm, or the sample distribution, changes the location of the point

Example: 1-dimensional data set containing 2 classes (positive and negative)

- any point located at x > t is classified as positive

At threshold t:

TPR = 0.5, FPR = 0.12

ROC Curve

(FPR, TPR):

(0,0): declare everything to be negative class

(1,1): declare everything to be positive class

(0,1): ideal

Diagonal line:

Random guessing

Below diagonal line:

prediction is opposite of the true class

Using ROC for Model Comparison

  • Neither model consistently outperforms the other

    • M1 is better for small FPR

    • M2 is better for large FPR

  • Area Under the ROC curve

    • Ideal:

      • Area = 1

    • Random guess:

      • Area = 0.5

How to construct an ROC curve

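Use a classifier that produces a posterior probability P(+|x) for each test instance x

For each threshold t, count:

# of + with P(+|x) >= t (true positives)

# of - with P(+|x) >= t (false positives)

Each threshold gives one (FPR, TPR) point on the ROC curve

[Figure: ROC curve]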
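
A minimal sketch of this construction: sweep the threshold t over the observed scores, count positives and negatives with score >= t, and turn the counts into (FPR, TPR) points. The scores and labels below are hypothetical, and the function names `roc_points` and `auc` are chosen only for this example. The area under the curve is computed with the trapezoid rule.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points; labels are 1 (positive) or 0 (negative).

    Assumes at least one positive and one negative instance.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical posterior probabilities P(+|x) for four test instances.
scores = [0.9, 0.8, 0.7, 0.6]
perfect = roc_points(scores, [1, 1, 0, 0])   # positives ranked above negatives
mixed = roc_points(scores, [1, 0, 1, 0])     # one negative ranked too high

print(perfect, auc(perfect))
print(mixed, auc(mixed))
```

A classifier that ranks every positive above every negative reaches the ideal area of 1; the mixed ranking falls between the ideal and the 0.5 of random guessing.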

Confidence Interval for Accuracy

  • Prediction can be regarded as a Bernoulli trial

    • A Bernoulli trial has 2 possible outcomes

    • Possible outcomes for prediction: correct or wrong

    • A collection of Bernoulli trials has a Binomial distribution:

    • x ~ Bin(N, p), where x is the number of correct predictions

      • e.g., toss a fair coin 50 times; how many heads would turn up? Expected number of heads = Np = 50 × 0.5 = 25

  • Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

Confidence Interval for Accuracy

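  • For large test sets (N > 30),

    acc has a normal distribution with mean p and variance p(1-p)/N

    $$P\left( Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2} \right) = 1 - \alpha$$

    (the area under the standard normal curve between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is $1 - \alpha$)

  • Confidence Interval for p:

    $$p = \frac{2 N \cdot acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2} \sqrt{Z_{\alpha/2}^2 + 4 N \cdot acc - 4 N \cdot acc^2}}{2 \left( N + Z_{\alpha/2}^2 \right)}$$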

Confidence Interval for Accuracy

  • Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:

    • N=100, acc = 0.8

    • Let 1 - α = 0.95 (95% confidence)

    • From the standard normal probability table, Z_{α/2} = 1.96
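
To make the example concrete, here is a small sketch that evaluates the standard closed-form interval obtained by solving the normal approximation for p, using the values N = 100, acc = 0.8 and Z = 1.96 from the slide. The function name `accuracy_confidence_interval` is an assumption for this illustration.

```python
import math

def accuracy_confidence_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p, given observed acc on n instances."""
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

low, high = accuracy_confidence_interval(acc=0.8, n=100)
print(f"95% confidence interval for p: ({low:.3f}, {high:.3f})")

# The interval narrows as the test set grows.
for n in (50, 500, 5000):
    lo_n, hi_n = accuracy_confidence_interval(0.8, n)
    print(n, round(lo_n, 3), round(hi_n, 3))
```

With 100 test instances the true accuracy is only pinned down to roughly 0.71 to 0.87; larger test sets tighten the interval around 0.8.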

Test of Significance

  • Given two models:

    • Model M1: accuracy = 85%, tested on 30 instances

    • Model M2: accuracy = 75%, tested on 5000 instances

  • Can we say M1 is better than M2?

    • How much confidence can we place on accuracy of M1 and M2?

    • Can the difference in performance measure be explained as a result of random fluctuations in the test set?

Comparing Performance of 2 Models

  • Given two models, say M1 and M2, which is better?

    • M1 is tested on D1 (size=n1), found error rate = e1

    • M2 is tested on D2 (size=n2), found error rate = e2

    • Assume D1 and D2 are independent

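    • If n1 and n2 are sufficiently large, then e1 and e2 are approximately normally distributed:

      $$e_1 \sim N(\mu_1, \sigma_1), \qquad e_2 \sim N(\mu_2, \sigma_2)$$

    • Approximate variance (Binomial distribution):

      $$\hat{\sigma}_i^2 = \frac{e_i (1 - e_i)}{n_i}$$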

Comparing Performance of 2 Models

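  • To test if the performance difference is statistically significant:

  • d = e1 – e2

    • $d \sim N(d_t, \sigma_t)$, where $d_t$ is the true difference

    • Since D1 and D2 are independent, their variances add up:

      $$\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \approx \hat{\sigma}_1^2 + \hat{\sigma}_2^2 = \frac{e_1 (1 - e_1)}{n_1} + \frac{e_2 (1 - e_2)}{n_2}$$

    • At (1 - α) confidence level,

      $$d_t = d \pm Z_{\alpha/2}\, \hat{\sigma}_t$$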

An Illustrative Example

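  • Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25

  • d = |e2 – e1| = 0.1 (2-sided test)

    $$\hat{\sigma}_t^2 = \frac{0.15 (1 - 0.15)}{30} + \frac{0.25 (1 - 0.25)}{5000} = 0.0043$$

  • At 95% confidence level, Z_{α/2} = 1.96:

    $$d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} = 0.100 \pm 0.128$$

    => Interval contains 0 => difference may not be statistically significant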
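
The arithmetic of this example can be reproduced with a few lines. This is a sketch under the same assumptions as the slides (independent test sets, normal approximation); the function name `difference_interval` is hypothetical.

```python
import math

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e1 - e2)
    # Variances of independent estimates add up.
    variance = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * math.sqrt(variance)
    return d - margin, d + margin

low, high = difference_interval(e1=0.15, n1=30, e2=0.25, n2=5000)
significant = not (low <= 0 <= high)

print(f"d_t lies in ({low:.3f}, {high:.3f}); significant: {significant}")
```

Almost all of the variance comes from M1, which was tested on only 30 instances, so the interval is wide enough to include 0.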

Classification Techniques

  • Decision Tree based Methods

  • Rule-based Methods

  • Learning from Neighbors

  • Bayesian Classification

  • Neural Networks

  • Ensemble Methods

  • Support Vector Machines


Nearest Neighbor Classifiers

  • Basic idea:

    • If it walks like a duck, quacks like a duck, then it’s probably a duck

  • [Figure: compute the distance from the test record to the training records, then choose k of the “nearest” records]


Definition of Nearest Neighbor

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

Nearest Neighbor Classifiers

  • Requires three things

    • The set of stored records

    • Distance Metric to compute distance between records

    • The value of k, the number of nearest neighbors to retrieve

  • To classify an unknown record:

    • Compute distance to the training records

    • Identify k nearest neighbors

    • Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

k Nearest Neighbor Classification

  • Compute distance between unknown record and all training data:

    • Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$

  • Find k nearest neighbors

  • Determine the class from nearest neighbor list

    • take the majority vote of class labels among the k-nearest neighbors

    • weight the vote according to distance

      • weight factor, w = 1/d^2, w = exp(-d^2/t), etc.
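
The three steps above can be sketched in a few lines of Python. The training points, labels and the function name `knn_classify` are hypothetical; the weighted variant uses w = 1/d^2, with a small constant added to avoid division by zero when a query coincides with a training record.

```python
import math
from collections import defaultdict

def knn_classify(train, query, k=3, weighted=False):
    """Classify `query` from `train`, a list of (point, label) pairs."""
    # Step 1: Euclidean distance from the query to every training record.
    distances = sorted((math.dist(point, query), label) for point, label in train)
    # Step 2: keep the k nearest; Step 3: (optionally distance-weighted) vote.
    votes = defaultdict(float)
    for d, label in distances[:k]:
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical 2-D training records with two classes.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]

print(knn_classify(train, (1.5, 1.5), k=3))                 # near the "A" cluster
print(knn_classify(train, (5.5, 5.5), k=3))                 # near the "B" cluster
print(knn_classify(train, (4, 4), k=5, weighted=True))      # weighted vote
```

Note that every query scans all training records, which is why classification with k-NN is comparatively expensive.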

1-Nearest Neighbor

Voronoi diagram (nearest neighbor regions)

  • The segments of the Voronoi diagram are all the points in the plane that are equidistant to the two nearest sites.

  • The Voronoi nodes are the points equidistant to three (or more) sites.

k of k-NN

  • Choosing the value of k:

    • If k is too small, sensitive to noise points

    • If k is too large, neighborhood may include points from other classes

Normalization of attributes

  • Scaling issues

    • Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes

    • Example:

      • height of a person may vary from 1.5m to 1.8m

      • weight of a person may vary from 90lb to 300lb

      • income of a person may vary from $10K to $1M

  • Solution: Normalize the vectors to unit length

  • Problem with Euclidean measure:

    • High-dimensional data: curse of dimensionality
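
To illustrate the scaling issue, the sketch below rescales each attribute to the range [0, 1] (min-max scaling). This is one common choice and is an assumption here; the slide itself only names normalization to unit length. The three sample records are built from the attribute ranges quoted above, and the function name `min_max_scale` is hypothetical.

```python
def min_max_scale(rows):
    """Rescale every column of `rows` to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# (height in m, weight in lb, income in $): without scaling, income dominates
# any Euclidean distance between these records.
people = [
    [1.50, 90, 10_000],
    [1.80, 300, 1_000_000],
    [1.65, 195, 505_000],
]

scaled = min_max_scale(people)
for row in scaled:
    print([round(v, 3) for v in row])
```

After scaling, a difference across the full height range counts as much as a difference across the full income range.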

k Nearest Neighbor Classification

  • k-NN classifiers are lazy learners

    • They do not build models explicitly

    • Unlike eager learners such as decision tree induction and rule-based systems

    • Classifying unknown records is relatively expensive

  • Robust to noisy data by averaging over the k nearest neighbors

k-dimensional tree (kd-tree)

  • An efficient structure for nearest neighbor searches

  • space-partitioning data structure for organizing points in a k-dimensional space.

Example: 2d-tree

  • A recursive space partitioning tree.

    • Partition along x and y axis in an alternating fashion.

    • Each internal node stores the splitting node along x (or y).

    • e.g. the median of the points being put into the subtree, with respect to their coordinates in the axis being used to create the splitting plane.

k-dimensional tree (kd-tree)

  • Searching for a nearest neighbor of p in a kd-tree

  • Start with the root node

  • Move down the tree recursively

  • Reach a leaf → “current nearest”

  • Unwind the recursion,

    • check the parent’s other children: is there an intersection with a potential nearer neighbor?

      • if no, go up a further level

      • if yes, check the children

        • if a child t is closer to p, t → “current nearest”

    • Repeat until reaching the root
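
The search procedure can be sketched as follows. This is a minimal illustration, not a tuned implementation: the tree is built by median splits on alternating axes, nodes are plain dictionaries, and the search compares each visited node on the way down rather than only at the leaf (a common variant of the same idea). The sample points and the function names `build_kdtree` and `nearest` are hypothetical.

```python
import math

def build_kdtree(points, depth=0):
    """Build a kd-tree by splitting at the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return the stored point closest to `target`."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)      # descend towards the target first
    # While unwinding: the other subtree can only hold a nearer point if the
    # splitting plane is closer to the target than the current nearest.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))
```

The pruning test in `nearest` corresponds to the "is there an intersection?" check: a subtree is skipped whenever its region cannot contain a point closer than the current nearest.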

Complexity

Building a static kd-tree from n points takes O(n log² n) time if an O(n log n) sort is used to compute the median at each level (this can be improved to O(n log n)).

Inserting a new point into a balanced kd-tree takes O(log n) time.

Removing a point from a balanced kd-tree takes O(log n) time.

Querying an axis-parallel range in a balanced kd-tree takes O(n^(1-1/k) + m) time, where m is the number of reported points and k is the dimension of the kd-tree.


Classification Techniques

  • Decision Tree based Methods

  • Rule-based Methods

  • Learning from Neighbors

  • Bayesian Classification

  • Neural Networks

  • Ensemble Methods

  • Support Vector Machines

Bayesian Classification


  • A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities

  • Foundation: Based on Bayes’ Theorem.

  • Performance: A simple Bayesian classifier, Naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers

  • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data

  • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes Classifier

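  • A probabilistic framework for solving classification problems

  • Conditional Probability:

    $$P(C \mid A) = \frac{P(A, C)}{P(A)}, \qquad P(A \mid C) = \frac{P(A, C)}{P(C)}$$

  • Bayes theorem:

    $$P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}$$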

Example of Bayes Theorem

  • Given:

    • A doctor knows that meningitis causes stiff neck 50% of the time: P(S|M) = 0.5

    • Prior probability of any patient having meningitis is 1/50,000: P(M) = 1/50,000

    • Prior probability of any patient having stiff neck is 1/2: P(S) = 1/2

  • If a patient has stiff neck, what’s the probability he/she has meningitis?

    $$P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)}$$

  • Informally, this can be written as

    posterior = likelihood × prior / evidence
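
As a quick check of the informal rule, the sketch below plugs the three probabilities exactly as they are stated above into Bayes' theorem. The function name `posterior` is chosen for this illustration.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: posterior = likelihood * prior / evidence."""
    return likelihood * prior / evidence

p_s_given_m = 0.5        # P(S|M): meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000         # P(M): prior probability of meningitis
p_s = 1 / 2              # P(S): prior probability of stiff neck, as stated above

p_m_given_s = posterior(p_s_given_m, p_m, p_s)
print(f"P(M|S) = {p_m_given_s:.6f}")
```

With these inputs the posterior stays very small: even given a stiff neck, meningitis remains unlikely because its prior probability is so low.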

Bayesian Classifiers

  • Consider each attribute and class label as random variables

  • Given a record with attributes (A1, A2,…,An)

    • Goal is to predict class C, where C = c1, or c2, or …

    • Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An )

  • Can we estimate P(C| A1, A2,…,An ) directly from data?

Bayesian Classifiers


  • Approach:

    • compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem

    • Choose value of C that maximizes P(C | A1, A2, …, An)

    • Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C)

  • How to estimate likelihood P(A1, A2, …, An | C )?

Naïve Bayes Classifier

  • Assume independence among attributes Ai when class is given:

    • P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

    • greatly reduces the computation cost: only counts the class distribution

  • Can estimate P(Ai | Cj) for all Ai and Cj.

  • New point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is maximal.

How to Estimate Probabilities from Data?

  • For Class: P(C) = Nc/N

    e.g., P(No) = 7/10, P(Yes) = 3/10

  • For discrete attributes: P(Ai | Ck) = |Aik| / Nck

    where |Aik| is the number of instances that have attribute value Ai and belong to class Ck

    • Examples:

      P(Status=Married|No) = 4/7

      P(Refund=Yes|Yes) = 0

How to Estimate Probabilities from Data?


For continuous attributes:

  • Probability density estimation:

    • Assume attribute follows a normal distribution

    • Use data to estimate parameters of distribution (e.g., mean μ and standard deviation σ)

    • Once probability distribution is known, can use it to estimate the conditional probability P(Ai|Ci)



How to Estimate Probabilities from Data?

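  • Normal distribution:

    $$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left( -\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2} \right)$$

    One for each (Ai, cj) pair

  • e.g., for (Income, Class=No):

    • If Class=No

      • sample mean = 110

      • sample variance = 2975

    $$P(\text{Income}=120 \mid \text{No}) = \frac{1}{\sqrt{2\pi \cdot 2975}} \exp\left( -\frac{(120 - 110)^2}{2 \cdot 2975} \right) = 0.0072$$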
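
For the continuous Income attribute, the class-conditional value is the normal density evaluated with the sample mean and variance given above. A minimal sketch, with `gaussian_density` as a hypothetical helper name:

```python
import math

def gaussian_density(x, mean, variance):
    """Normal probability density with the given mean and variance, evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Income = 120 (in thousands) for Class = No: sample mean 110, sample variance 2975.
p_income_given_no = gaussian_density(120, mean=110, variance=2975)
print(f"P(Income=120 | No) = {p_income_given_no:.4f}")
```

Strictly speaking this is a density value rather than a probability, but it is used in the product of conditional probabilities in the same way.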

Example of Naïve Bayes Classifier

Given a Test Record: X = (Refund=No, Married, Income=120K)

  • P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024

  • P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes) = 1 × 0 × 1.2 × 10^-9 = 0

    Since P(X|No)P(No) > P(X|Yes)P(Yes)

    Therefore P(No|X) > P(Yes|X) => Class = No
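
The decision in this example can be replayed directly from the factors quoted above, together with the class priors P(No) = 7/10 and P(Yes) = 3/10 from the earlier slide. The variable names are assumptions for this sketch.

```python
from math import prod

priors = {"No": 7 / 10, "Yes": 3 / 10}

# Conditional factors for the test record (Refund=No, Married, Income=120K),
# taken from the worked example above.
factors = {
    "No": [4 / 7, 4 / 7, 0.0072],
    "Yes": [1.0, 0.0, 1.2e-9],
}

# Naive Bayes score: P(C) * product of P(Ai | C); proportional to P(C | X).
scores = {c: priors[c] * prod(factors[c]) for c in priors}
predicted = max(scores, key=scores.get)

print(scores)
print("Predicted class:", predicted)
```

The score for Class=Yes collapses to zero because a single factor, P(Married|Yes), is zero; this is the zero-probability problem.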

Avoiding the 0-Probability Problem

  • If one of the conditional probabilities is zero, then the entire expression becomes zero

  • Probability estimation:

    Original: P(Ai | C) = Nic / Nc

    Laplace: P(Ai | C) = (Nic + 1) / (Nc + c)

    m-estimate: P(Ai | C) = (Nic + m p) / (Nc + m)

    c: number of classes

    p: prior probability

    m: parameter

  • E.g., suppose a dataset with 1000 tuples:

    • income = low (0),

    • income = medium (990),

    • income = high (10)

  • Use Laplacian correction (or Laplacian estimator)

    • Adding 1 to each case, c = 3

      Prob(income = low) = 1/1003

      Prob(income = medium) = 991/1003

      Prob(income = high) = 11/1003
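
The income example with Laplacian correction can be reproduced as below. The function names `laplace_estimates` and `m_estimate` are hypothetical; the m-estimate is included as the usual generalization, (count + m·p) / (total + m), which coincides with the Laplacian correction when m = c and p = 1/c.

```python
def laplace_estimates(counts):
    """Add 1 to every count; the denominator grows by the number of cases c."""
    total = sum(counts.values())
    c = len(counts)
    return {value: (n + 1) / (total + c) for value, n in counts.items()}

def m_estimate(count, total, p, m):
    """m-estimate with prior probability p and parameter m."""
    return (count + m * p) / (total + m)

income_counts = {"low": 0, "medium": 990, "high": 10}   # 1000 tuples
probs = laplace_estimates(income_counts)

for value, prob in probs.items():
    print(f"Prob(income = {value}) = {prob:.4f}")

# Same value for 'low' via the m-estimate with m = 3 and p = 1/3.
print(m_estimate(0, 1000, p=1 / 3, m=3))
```

No estimate is zero any more, and the corrected values stay close to the uncorrected frequencies.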

Naïve Bayesian Classifier: Comments

  • Advantages

    • Easy to implement

    • Good results obtained in most of the cases

    • Robust to isolated noise points

    • Robust to irrelevant attributes

    • Handle missing values by ignoring the instance during probability estimate calculations

Naïve Bayesian Classifier: Comments

  • Disadvantages

    • Independence assumption may not hold for some attributes

    • In practice, dependencies exist among variables

      • E.g., in hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)

      • Dependencies among these cannot be modeled by Naïve Bayesian Classifier

    • Ignoring them causes a loss of accuracy

  • How to deal with these dependencies?

    • Bayesian Belief Networks (BBN)

Bayesian Belief Networks

  • A Bayesian belief network allows a subset of the variables to be conditionally independent

  • A graphical model of causal relationships (directed acyclic graph)

    • Represents dependency among the variables

    • Gives a specification of joint probability distribution

  • Nodes: random variables

  • Links: dependency

  • [Figure: nodes X, Y, Z, P with links X → Z, Y → Z, Y → P] X and Y are the parents of Z, and Y is the parent of P

  • No dependency between Z and P

  • Has no loops or cycles


Bayesian Belief Network: An Example

[Figure: network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

The conditional probability table (CPT) for variable LungCancer:

           (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
    LC     0.8       0.5        0.7        0.1
    ~LC    0.2       0.5        0.3        0.9

The CPT shows the conditional probability for each possible combination of values of its parents

P(LungCancer = YES | FH = YES, S = YES) = 0.8

P(LungCancer = NO | FH = NO, S = NO) = 0.9

Derivation of the probability of a particular combination of values (x1, …, xn) of a test tuple from the CPTs:

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{Parents}(Y_i))$$

where Parents(Y_i) are the parents of the variable Y_i that corresponds to x_i.
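
As a rough illustration of how a joint probability factorizes over a belief network, the sketch below restricts the network to three nodes (FamilyHistory, Smoker, LungCancer). The LungCancer CPT uses the entries of the table above, arranged so that each column sums to 1. The prior probabilities for FamilyHistory and Smoker are NOT given in the slides; the values used here are purely hypothetical placeholders.

```python
# Hypothetical priors (assumptions for illustration only, not from the slides).
P_FH = {True: 0.1, False: 0.9}
P_S = {True: 0.3, False: 0.7}

# P(LungCancer = yes | FamilyHistory, Smoker), keyed by (FH, S).
P_LC_GIVEN = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH=fh, S=s, LC=lc) = P(FH) * P(S) * P(LC | FH, S)."""
    p_lc_yes = P_LC_GIVEN[(fh, s)]
    return P_FH[fh] * P_S[s] * (p_lc_yes if lc else 1.0 - p_lc_yes)

p = joint(fh=True, s=True, lc=True)
total = sum(joint(fh, s, lc)
            for fh in (True, False)
            for s in (True, False)
            for lc in (True, False))

print(f"P(FH=Y, S=Y, LC=Y) = {p:.3f}")
print(f"Sum over all combinations = {total:.3f}")
```

FamilyHistory and Smoker have no parents, so their factors are plain priors; LungCancer contributes one CPT lookup conditioned on its two parents. Summing the joint over all combinations gives 1, as a valid distribution must.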

Bayesian Belief Network: An Example

If the CPTs are known, a BBN can be used to

  • Compute the joint probability of a tuple:

    P(FH=Y, S=Y, LC=Y, E=N, PXR=Y, D=N)

  • Take a node as an “output”, representing a class label attribute

    e.g., PositiveXRay as the class attribute

  • Predict the class of a tuple

    e.g., PXR = ? given FH=N, S=Y, LC=N

    compute P(PXR=Y | FH=N, S=Y, LC=N) = a

    P(PXR=N | FH=N, S=Y, LC=N) = b

    if a > b, then PositiveXRay = Yes

[Figure: the same network, with PositiveXRay marked as the class attribute]

Training Bayesian Networks by training data instances

  • Several scenarios:

    • Given both the network structure and all variables observable: learn only the CPTs

    • Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning

    • Network structure unknown, all variables observable: search through the model space to reconstruct network topology

    • Unknown structure, all hidden variables: No good algorithms known for this purpose

References

Kd-tree

-- Kd-Trees: Another Range Searching Trees

http://www.cs.fsu.edu/~lifeifei/cis5930/kdtree.pdf

http://www.cs.uu.nl/docs/vakken/ga/slides5.pdf

http://3glab.cs.nthu.edu.tw/~spoon/courses/CS631100/Lecture06_handout.pdf

-- Animations of KD-tree searches

http://www.cs.cmu.edu/~awm/animations/kdtree/

Bayesian networks

-- BOOK: D. Heckerman, Bayesian networks for data mining

-- A Tutorial on Learning With Bayesian Networks

http://research.microsoft.com/pubs/69588/tr-95-06.pdf

-- Kevin Murphy, 1998: A Brief Introduction to Graphical Models and Bayesian Networks

http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

Questions

How to evaluate your classifier?

What criteria can you use?

How to compare the performance of classifiers?

How can k-nearest neighbors be used for classification?

How to use Naïve Bayesian classifier?

What are the disadvantages of Naïve Bayesian classifier?

How does Bayesian Network work?
