Classification: Evaluation
AMCS/CS 340: Data Mining

Xiangliang Zhang

King Abdullah University of Science and Technology

Model Evaluation
  • Metrics for Performance Evaluation
    • How to evaluate the performance of a model?
  • Methods for Performance Evaluation
    • How to obtain reliable estimates?
  • Methods for Model Comparison
    • How to compare the relative performance among competing models?


Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Metrics for Performance Evaluation
  • Focus on the predictive capability of a model
    • Rather than how fast it takes to classify or build models, scalability, etc.
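  • Confusion Matrix:

                     Predicted positive   Predicted negative
  Actual positive    TP                   FN
  Actual negative    FP                   TN

TP: True Positive

FP: False Positive

TN: True Negative

FN: False Negative

  • Most widely-used metric:

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$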
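
As a rough illustration of how the four confusion-matrix counts and accuracy could be computed, here is a minimal Python sketch. The function names and the toy label lists are hypothetical, chosen only for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(*counts))
```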

Limitation of Accuracy
  • Consider a 2-class problem with unbalanced classes
    • Number of Class 0 examples = 9990
    • Number of Class 1 examples = 10
  • If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
  • Accuracy is misleading because the model does not detect any class 1 example
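
The numbers above can be reproduced with a short sketch. It assumes a trivial model that always predicts class 0 and reports both accuracy and the fraction of class 1 examples detected (the TP rate for class 1).

```python
n_class0, n_class1 = 9990, 10
y_true = [0] * n_class0 + [1] * n_class1
y_pred = [0] * len(y_true)  # trivial model: always predict class 0

# Accuracy: fraction of correct predictions over all examples.
correct = sum(t == p for t, p in zip(y_true, y_pred))
acc = correct / len(y_true)

# TP rate for class 1: fraction of class 1 examples that are detected.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tpr_class1 = tp / n_class1

print(f"accuracy = {acc:.3f}, class-1 TP rate = {tpr_class1:.3f}")
```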

Other Measures

Methods for Performance Evaluation
  • How to obtain a reliable estimate of performance?
  • Performance of a model may depend on other factors besides the learning algorithm:
    • Class distribution
    • Cost of misclassification
    • Size of training and test sets

Methods of Estimation
  • Holdout
    • Reserve 2/3 for training and 1/3 for testing
  • Random subsampling
    • Repeated holdout
  • Cross validation
    • Partition data into k disjoint subsets
    • k-fold: train on k-1 partitions, test on the remaining one
    • Leave-one-out: k = n
  • Bootstrap
    • Sampling with replacement
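
The k-fold partitioning described above might be sketched as follows. The helper name `k_fold_indices` and the fixed shuffle seed are assumptions made for this example; setting k = n gives leave-one-out.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Split into k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Train on the other k-1 folds.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Each record appears in exactly one test fold, so every record is used for testing once and for training k-1 times.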

ROC (Receiver Operating Characteristic)

Developed in the 1950s in signal detection theory to analyze noisy signals

Characterizes the trade-off between positive hits and false alarms

ROC curve plots TP rate (y-axis) against FP rate (x-axis)

Performance of each classifier is represented as a point on the ROC curve

Changing the threshold of the algorithm, or the sample distribution, changes the location of the point

Example: a 1-dimensional data set containing 2 classes (positive and negative)

  • Any point located at x > t is classified as positive
  • At threshold t: TPR = 0.5, FPR = 0.12
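
To make the threshold rule concrete, the sketch below classifies every point with x > t as positive and reports the resulting (TPR, FPR) pair. The data values are hypothetical and are not the ones behind the slide's figure.

```python
def rates_at_threshold(xs, labels, t):
    """Classify x > t as positive and return (TPR, FPR)."""
    tp = sum(1 for x, y in zip(xs, labels) if x > t and y == 1)
    fp = sum(1 for x, y in zip(xs, labels) if x > t and y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


# Hypothetical 1-D data: negatives tend to be small, positives large.
xs     = [0.1, 0.4, 0.5, 0.9, 1.2, 0.8, 1.1, 1.5, 1.9, 2.3]
labels = [0,   0,   0,   0,   0,   1,   1,   1,   1,   1]

for t in (0.0, 1.0, 3.0):
    print(t, rates_at_threshold(xs, labels, t))
```

A very low threshold declares everything positive, a very high one declares everything negative, and intermediate thresholds trade hits against false alarms.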

ROC Curve

(FPR, TPR):

  • (0,0): declare everything to be negative class
  • (1,1): declare everything to be positive class
  • (0,1): ideal
  • Diagonal line: random guessing
  • Below diagonal line: prediction is opposite of the true class

Using ROC for Model Comparison
  • Neither model consistently outperforms the other
    • M1 is better for small FPR
    • M2 is better for large FPR
  • Area Under the ROC curve
    • Ideal: Area = 1
    • Random guess: Area = 0.5

How to construct an ROC curve

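  • The classifier outputs a posterior probability for each test instance x
  • For each threshold t, count the number of positive (+) and negative (-) instances with posterior probability >= t
  • Each threshold yields one (FPR, TPR) point; connecting the points gives the ROC curve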
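
One possible way to carry out this construction is sketched below: each distinct score is used as a threshold, the positives and negatives at or above it are counted, and the area under the resulting curve is estimated with the trapezoidal rule. The scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by thresholding at each score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: all negative
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical posterior probabilities P(+|x) and true labels (1 = +).
scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    1,    0,    0,    0]
pts = roc_points(scores, labels)
print(pts[-1], round(auc(pts), 3))
```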

Confidence Interval for Accuracy
  • Prediction can be regarded as a Bernoulli trial
    • A Bernoulli trial has 2 possible outcomes
    • Possible outcomes for prediction: correct or wrong
    • Collection of Bernoulli trials has a Binomial distribution:
    • x ~ Bin(N, p), where x is the number of correct predictions
      • e.g.: Toss a fair coin 50 times, how many heads would turn up? Expected number of heads = N × p = 50 × 0.5 = 25
  • Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

Confidence Interval for Accuracy

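  • For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1-p)/N:

$P\left( Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2} \right) = 1 - \alpha$

  • The area under the standard normal curve between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is $1 - \alpha$
  • Confidence Interval for p:

$p = \frac{2 N \cdot acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2} \sqrt{Z_{\alpha/2}^2 + 4 N \cdot acc - 4 N \cdot acc^2}}{2 (N + Z_{\alpha/2}^2)}$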

Confidence Interval for Accuracy
  • Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
    • N = 100, acc = 0.8
    • Let $1 - \alpha = 0.95$ (95% confidence)
    • From the standard normal probability table, $Z_{\alpha/2} = 1.96$
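
A sketch of the resulting interval, assuming the usual closed form obtained by solving $(acc - p)^2 = Z_{\alpha/2}^2 \, p(1-p)/N$ for p. The function name is hypothetical. For the values above, the interval comes out at roughly 0.71 to 0.87.

```python
import math


def accuracy_confidence_interval(acc, n, z):
    """Interval for the true accuracy p from the normal approximation."""
    center = 2 * n * acc + z ** 2
    # Square root of the quadratic's discriminant, with z factored out.
    spread = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom


low, high = accuracy_confidence_interval(acc=0.8, n=100, z=1.96)
print(f"{low:.3f} <= p <= {high:.3f}")
```

Increasing N with the same observed accuracy narrows the interval, which is why accuracy measured on a small test set deserves less confidence.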

Test of Significance
  • Given two models:
    • Model M1: accuracy = 85%, tested on 30 instances
    • Model M2: accuracy = 75%, tested on 5000 instances
  • Can we say M1 is better than M2?
    • How much confidence can we place on accuracy of M1 and M2?
    • Can the difference in performance measure be explained as a result of random fluctuations in the test set?

Comparing Performance of 2 Models
  • Given two models, say M1 and M2, which is better?
    • M1 is tested on D1 (size=n1), found error rate = e1
    • M2 is tested on D2 (size=n2), found error rate = e2
    • Assume D1 and D2 are independent
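    • If n1 and n2 are sufficiently large, then $e_1 \sim N(\mu_1, \sigma_1)$ and $e_2 \sim N(\mu_2, \sigma_2)$
    • Approximate variance (Binomial distribution): $\hat{\sigma}_i^2 = \frac{e_i (1 - e_i)}{n_i}$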

Comparing Performance of 2 Models
  • To test if performance difference is statistically significant:
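  • d = e1 – e2
    • $d \sim N(d_t, \sigma_t)$, where $d_t$ is the true difference
    • Since D1 and D2 are independent, their variances add up: $\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \simeq \hat{\sigma}_t^2 = \frac{e_1 (1 - e_1)}{n_1} + \frac{e_2 (1 - e_2)}{n_2}$
    • At the $(1 - \alpha)$ confidence level, $d_t = d \pm Z_{\alpha/2} \, \hat{\sigma}_t$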

An Illustrative Example
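  • Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
  • d = |e2 – e1| = 0.1 (2-sided test)
  • Estimated variance: $\hat{\sigma}_t^2 = \frac{0.15 (1 - 0.15)}{30} + \frac{0.25 (1 - 0.25)}{5000} \approx 0.0043$
  • At 95% confidence level, $Z_{\alpha/2} = 1.96$: $d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} \approx 0.100 \pm 0.128$
  • => Interval contains 0 => difference may not be statistically significant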
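
The check in this example can be sketched as below, assuming the binomial variance estimate e(1 - e)/n for each model and summing the two variances because the test sets are independent. The function name is hypothetical.

```python
import math


def difference_interval(e1, n1, e2, n2, z):
    """Confidence interval for the true difference in error rates."""
    d = abs(e1 - e2)
    # Variances of independent estimates add up.
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * math.sqrt(var)
    return d - margin, d + margin


low, high = difference_interval(e1=0.15, n1=30, e2=0.25, n2=5000, z=1.96)
significant = not (low <= 0 <= high)
print(f"[{low:.3f}, {high:.3f}] significant={significant}")
```

The small test set for M1 dominates the variance, so the interval is wide enough to include 0.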

Classification Techniques
  • Decision Tree based Methods
  • Rule-based Methods
  • Learning from Neighbors
  • Bayesian Classification
  • Neural Networks
  • Ensemble Methods
  • Support Vector Machines


Nearest Neighbor Classifiers
  • Basic idea:
    • If it walks like a duck, quacks like a duck, then it’s probably a duck
  • For a test record, compute its distance to the training records and choose k of the “nearest” records

Definition of Nearest Neighbor

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

Nearest Neighbor Classifiers
  • Requires three things
    • The set of stored records
    • Distance Metric to compute distance between records
    • The value of k, the number of nearest neighbors to retrieve
  • To classify an unknown record:
    • Compute its distance to the training records
    • Identify the k nearest neighbors
    • Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)

k Nearest Neighbor Classification

  • Compute distance between unknown record and all training data:
    • Euclidean distance
  • Find k nearest neighbors
  • Determine the class from nearest neighbor list
    • take the majority vote of class labels among the k-nearest neighbors
    • weight the vote according to distance
      • weight factor, w = 1/d^2, w = exp(-d^2/t), etc.
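
These steps can be sketched in a few lines of Python. The function name, the toy training set, and the small constant added to avoid division by zero are assumptions of this example; the weighted variant uses w = 1/d^2.

```python
import math
from collections import defaultdict


def knn_classify(train, query, k, weighted=False):
    """Classify `query` from (point, label) pairs in `train`."""
    # Euclidean distance from the query to every training record.
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = defaultdict(float)
    for d, label in dists[:k]:
        # Distance weighting w = 1/d^2; a tiny constant avoids division by zero.
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical 2-D training records with two classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"),
         ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.8, 3.3), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))
print(knn_classify(train, (2.9, 3.0), k=3, weighted=True))
```

Weighting matters when the k neighbors are at very different distances: a single very close neighbor can then outvote several distant ones.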

1 nearest-neighbor

  • Voronoi diagram (nearest neighbor regions)
  • The segments of the Voronoi diagram are all the points in the plane that are equidistant to the two nearest sites.
  • The Voronoi nodes are the points equidistant to three (or more) sites.

k of k-nn


  • Choosing the value of k:
    • If k is too small, sensitive to noise points
    • If k is too large, neighborhood may include points from other classes

Normalization of attributes
  • Scaling issues
    • Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
    • Example:
      • height of a person may vary from 1.5m to 1.8m
      • weight of a person may vary from 90lb to 300lb
      • income of a person may vary from $10K to $1M
  • Solution: Normalize the vectors to unit length
  • Problem with Euclidean measure:
    • High dimensional data: curse of dimensionality
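
To illustrate the scaling issue, the sketch below rescales each attribute to the range [0, 1]. Min-max scaling per attribute is a common option assumed here for illustration; it is not the unit-length normalization named in the slide. The records are hypothetical values within the ranges listed above.

```python
def min_max_scale(rows):
    """Rescale each attribute (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


# Hypothetical records: (height in m, weight in lb, income in $)
people = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.6, 150, 50_000)]
scaled = min_max_scale(people)
print(scaled)
```

Without rescaling, the income attribute would dominate any Euclidean distance between these records, because its range is several orders of magnitude larger than that of height.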

k Nearest Neighbor Classification

  • k-NN classifiers are lazy learners
    • They do not build models explicitly
    • Unlike eager learners such as decision tree induction and rule-based systems
    • Classifying unknown records is relatively expensive
  • Robust to noisy data by averaging the k nearest neighbors

k-dimensional tree (kd-tree)
  • An efficient way to perform nearest neighbor searches
  • space-partitioning data structure for organizing points in a k-dimensional space.

Example: 2d-tree
  • A recursive space partitioning tree.
    • Partition along x and y axis in an alternating fashion.
    • Each internal node stores the splitting point along x (or y), e.g. the median of the points being put into the subtree, with respect to their coordinates in the axis being used to create the splitting plane.

k-dimensional tree (kd-tree)
  • Searching for a nearest neighbor of p in a kd-tree:
    • Start with the root node
    • Move down the tree recursively
    • On reaching a leaf, take it as the “current nearest”
    • Unwind the recursion:
      • At each parent, check whether the region of its other child could contain a point nearer than the current nearest
        • if no, go up one more level
        • if yes, check that child's subtree; if a point t there is closer to p, t becomes the “current nearest”
    • Repeat until the root is reached
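
A compact sketch of building a kd-tree and searching it is given below. It is a common variant that stores one point at every node (so the node itself is checked on the way down), splits at the median, and uses nested dicts; these representation choices and the sample points are assumptions of this example.

```python
import math


def build_kdtree(points, depth=0):
    """Build a kd-tree as nested dicts, splitting at the median."""
    if not points:
        return None
    axis = depth % len(points[0])  # alternate the splitting axis
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return the stored point closest to `query`."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)       # move down toward the query
    if abs(diff) < math.dist(best, query):  # other side may hold a nearer point
        best = nearest(far, query, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))
```

The test `abs(diff) < math.dist(best, query)` plays the role of the intersection check: the far subtree is visited only if the splitting plane is closer to the query than the current nearest point.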

[Figure: query point p, the current nearest neighbor, its parent node, and the parent's other children]

Complexity

Building a static kd-tree from n points takes O(n log^2 n) time if an O(n log n) sort is used to compute the median at each level; this can be improved to O(n log n).

Inserting a new point into a balanced kd-tree takes O(log n) time.

Removing a point from a balanced kd-tree takes O(log n) time.

Querying an axis-parallel range in a balanced kd-tree takes O(n^(1-1/k) + m) time, where m is the number of reported points and k is the dimension of the kd-tree.

Bayesian Classification


  • A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
  • Foundation: Based on Bayes’ Theorem.
  • Performance: A simple Bayesian classifier, Naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
  • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data
  • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes Classifier

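  • A probabilistic framework for solving classification problems
  • Conditional Probability: $P(C|A) = \frac{P(A, C)}{P(A)}$ and $P(A|C) = \frac{P(A, C)}{P(C)}$
  • Bayes theorem: $P(C|A) = \frac{P(A|C) \, P(C)}{P(A)}$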

Example of Bayes Theorem


  • Given:
    • A doctor knows that meningitis causes stiff neck 50% of the time: P(S|M) = 0.5
    • Prior probability of any patient having meningitis is 1/50,000: P(M) = 1/50,000
    • Prior probability of any patient having stiff neck is 1/2: P(S) = 1/2
  • If a patient has stiff neck, what’s the probability he/she has meningitis?
  • Informally, this can be written as

posterior = likelihood × prior / evidence
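
As a worked instance of the formula, the sketch below plugs in the three probabilities exactly as stated on this slide. The function and variable names are hypothetical.

```python
def bayes_posterior(likelihood, prior, evidence):
    """posterior = likelihood * prior / evidence."""
    return likelihood * prior / evidence


p_s_given_m = 0.5      # P(S|M), as stated on the slide
p_m = 1 / 50_000       # P(M)
p_s = 1 / 2            # P(S)

p_m_given_s = bayes_posterior(p_s_given_m, p_m, p_s)
print(p_m_given_s)
```

With the figures as stated, P(S|M) equals P(S), so the posterior P(M|S) comes out equal to the prior P(M).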

Bayesian Classifiers


  • Consider each attribute and class label as random variables
  • Given a record with attributes (A1, A2,…,An)
    • Goal is to predict class C, where C = c1, or c2, or …
    • Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An )
  • Can we estimate P(C| A1, A2,…,An ) directly from data?

Bayesian Classifiers


  • Approach:
    • compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem
    • Choose value of C that maximizes P(C | A1, A2, …, An)
    • Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C)
  • How to estimate likelihood P(A1, A2, …, An | C )?

Naïve Bayes Classifier

  • Assume independence among attributes Ai when class is given:
    • $P(A_1, A_2, \ldots, A_n | C_j) = P(A_1|C_j) \, P(A_2|C_j) \cdots P(A_n|C_j)$
    • Greatly reduces the computation cost: only counts the class distribution
  • Can estimate P(Ai|Cj) for all Ai and Cj.
  • New point is classified to Cj if $P(C_j) \prod_i P(A_i|C_j)$ is maximal.

How to Estimate Probabilities from Data?

  • For Class: P(C) = Nc/N
    • e.g., P(No) = 7/10, P(Yes) = 3/10
  • For discrete attributes: P(Ai | Ck) = |Aik| / Nck
    • where |Aik| is the number of instances that have attribute value Ai and belong to class Ck
    • Examples: P(Status=Married|No) = 4/7, P(Refund=Yes|Yes) = 0

How to Estimate Probabilities from Data?


For continuous attributes:

  • Probability density estimation:
    • Assume attribute follows a normal distribution
    • Use data to estimate parameters of distribution (e.g., mean μ and standard deviation σ)
    • Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai|Cj)


How to Estimate Probabilities from Data?

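  • Normal distribution:

$P(A_i | C_j) = \frac{1}{\sqrt{2 \pi \sigma_{ij}^2}} \exp\left( -\frac{(A_i - \mu_{ij})^2}{2 \sigma_{ij}^2} \right)$

    • One for each (Ai, Cj) pair
  • e.g., for (Income, Class=No):
    • sample mean = 110
    • sample variance = 2975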
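
A small sketch of the density calculation, assuming the standard normal probability density with the sample mean and variance quoted above and evaluating it at Income = 120 (the value used in the test record example of these slides). The function name is hypothetical.

```python
import math


def gaussian_density(x, mean, variance):
    """Normal probability density at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


# P(Income = 120 | Class = No) with sample mean 110 and sample variance 2975.
p_income_120_given_no = gaussian_density(120, mean=110, variance=2975)
print(round(p_income_120_given_no, 4))
```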

Example of Naïve Bayes Classifier

Given a Test Record X = (Refund=No, Married, Income=120K):

  • P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024
  • P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes) = 1 × 0 × 1.2 × 10^-9 = 0

Since P(X|No) P(No) > P(X|Yes) P(Yes), we have P(No|X) > P(Yes|X) => Class = No
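
The comparison can be reproduced with the sketch below, which multiplies the conditional probabilities quoted in this example by the class priors P(No) = 7/10 and P(Yes) = 3/10 given earlier in the slides. The dictionary layout and names are assumptions of this example.

```python
# Class priors and conditional probabilities quoted in the slides.
prior = {"No": 7 / 10, "Yes": 3 / 10}
likelihood = {
    "No":  [4 / 7, 4 / 7, 0.0072],   # Refund=No, Married, Income=120K
    "Yes": [1.0, 0.0, 1.2e-9],
}


def score(cls):
    """P(X|class) * P(class) under the naive independence assumption."""
    p = prior[cls]
    for factor in likelihood[cls]:
        p *= factor
    return p


scores = {cls: score(cls) for cls in prior}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

The evidence P(X) is the same for both classes, so comparing P(X|C) P(C) is enough to pick the class with the larger posterior.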

Avoiding the 0-Probability Problem
  • If one of the conditional probabilities is zero, then the entire expression becomes zero
  • Probability estimation:
    • Original: $P(A_i | C) = \frac{N_{ic}}{N_c}$
    • Laplace: $P(A_i | C) = \frac{N_{ic} + 1}{N_c + c}$
    • m-estimate: $P(A_i | C) = \frac{N_{ic} + m p}{N_c + m}$
    • c: number of classes, p: prior probability, m: parameter
  • E.g., suppose a dataset with 1000 tuples: income = low (0), income = medium (990), income = high (10)
  • Use Laplacian correction (or Laplacian estimator)
    • Adding 1 to each case, c = 3
    • Prob(income = low) = 1/1003
    • Prob(income = medium) = 991/1003
    • Prob(income = high) = 11/1003
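
The Laplacian correction in the income example can be sketched as follows: 1 is added to each of the three counts, so the denominator grows from 1000 to 1003. Exact fractions are used so the results match the slide's values; the function name is hypothetical.

```python
from fractions import Fraction


def laplace_estimates(counts):
    """Add 1 to every count and renormalize."""
    total = sum(counts.values()) + len(counts)
    return {value: Fraction(n + 1, total) for value, n in counts.items()}


income_counts = {"low": 0, "medium": 990, "high": 10}
probs = laplace_estimates(income_counts)
for value, p in probs.items():
    print(value, p)
```

No estimate is zero after the correction, so a single unseen attribute value no longer wipes out the whole product.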

Naïve Bayesian Classifier: Comments
  • Advantages
    • Easy to implement
    • Good results obtained in most of the cases
    • Robust to isolated noise points
    • Robust to irrelevant attributes
    • Handle missing values by ignoring the instance during probability estimate calculations

Naïve Bayesian Classifier: Comments
  • Disadvantages
    • Independence assumption may not hold for some attributes, which leads to a loss of accuracy
    • In practice, dependencies exist among variables
      • E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
      • Dependencies among these cannot be modeled by a Naïve Bayesian classifier
  • How to deal with these dependencies?
    • Bayesian Belief Networks (BBN)


Bayesian Belief Networks
  • A Bayesian belief network allows a subset of the variables to be conditionally independent
  • A graphical model of causal relationships (directed acyclic graph)
    • Represents dependency among the variables
    • Gives a specification of the joint probability distribution
  • Nodes: random variables
  • Links: dependency
  • Example graph with nodes X, Y, Z, and P:
    • X and Y are the parents of Z, and Y is the parent of P
    • No dependency between Z and P
    • Has no loops or cycles

Bayesian Belief Network: An Example

[Figure: network over the variables FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea]

The conditional probability table (CPT) for variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC    0.8       0.5        0.7        0.1
  ~LC   0.2       0.5        0.3        0.9

CPT shows the conditional probability for each possible combination of its parents

P(LungCancer = YES | FH = YES, S = YES) = 0.8

P(LungCancer = NO | FH = NO, S = NO) = 0.9

Derivation of the probability of a particular combination of values (x1, …, xn) of a test tuple from the CPTs:

$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(x_i))$
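
As a rough sketch of how a joint probability factorizes over a network, the code below multiplies each variable's probability given its parents for the three variables FH, S, and LC. The LungCancer CPT is read from the slide with each column assumed to sum to 1; the priors for FamilyHistory and Smoker are hypothetical values (the slides do not give them), and both are assumed to have no parents.

```python
# CPT for LungCancer as read from the slide: P(LC = yes | FH, S).
p_lc_given = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# Hypothetical priors for the parent nodes (not given in the slides).
p_fh = 0.1
p_s = 0.3


def joint(fh, s, lc):
    """P(FH=fh, S=s, LC=lc) = P(fh) * P(s) * P(lc | fh, s)."""
    p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
    p_lc = p_lc_given[(fh, s)]
    return p * (p_lc if lc else 1 - p_lc)


print(joint(True, True, True))
total = sum(joint(fh, s, lc) for fh in (True, False)
            for s in (True, False) for lc in (True, False))
print(total)
```

Summing the joint over all eight combinations gives 1, which is a quick sanity check that the factorization defines a valid distribution.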

Bayesian Belief Network: An Example

If the CPTs are known, a BBN can be used to

  • Compute the joint probability of a tuple, e.g. P(FH=Y, S=Y, LC=Y, E=N, PXR=Y, D=N)
  • Take a node as an “output”, representing a class label attribute, e.g. PositiveXRay as the class attribute
  • Predict the class of a tuple, e.g. PXR = ? given FH=N, S=Y, LC=N
    • compute P(PXR=Y | FH=N, S=Y, LC=N) = a and P(PXR=N | FH=N, S=Y, LC=N) = b
    • if a > b, then PositiveXRay = Yes

Training Bayesian Networks by training data instances
  • Several scenarios:
    • Given both the network structure and all variables observable: learn only the CPTs
    • Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
    • Network structure unknown, all variables observable: search through the model space to reconstruct network topology
    • Unknown structure, all hidden variables: No good algorithms known for this purpose

References

Kd-tree

-- Kd-Trees: Another Range Searching Trees

http://www.cs.fsu.edu/~lifeifei/cis5930/kdtree.pdf

http://www.cs.uu.nl/docs/vakken/ga/slides5.pdf

http://3glab.cs.nthu.edu.tw/~spoon/courses/CS631100/Lecture06_handout.pdf

-- Animations of KD-tree searches

http://www.cs.cmu.edu/~awm/animations/kdtree/

Bayesian networks

-- BOOK: D. Heckerman, Bayesian networks for data mining

-- A Tutorial on Learning With Bayesian Networks

http://research.microsoft.com/pubs/69588/tr-95-06.pdf

-- Kevin Murphy, 1998: A Brief Introduction to Graphical Models and Bayesian Networks

http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

Questions

How to evaluate your classifier?

What criteria can you use?

How to compare the performance of classifiers?

How can k-nearest neighbors be used for classification?

How to use a Naïve Bayesian classifier?

What are the disadvantages of a Naïve Bayesian classifier?

How does a Bayesian network work?
