## PowerPoint Slideshow about 'AMCS/CS 340: Data Mining' - dunne


Classification: Evaluation

AMCS/CS 340: Data Mining

Xiangliang Zhang

King Abdullah University of Science and Technology

Model Evaluation

- Metrics for Performance Evaluation
- How to evaluate the performance of a model?
- Methods for Performance Evaluation
- How to obtain reliable estimates?
- Methods for Model Comparison
- How to compare the relative performance among competing models?


Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Metrics for Performance Evaluation

- Focus on the predictive capability of a model
- Rather than how fast it takes to classify or build models, scalability, etc.
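- Confusion Matrix: counts of test records by actual class versus predicted class
- TP: True Positive
- FP: False Positive
- TN: True Negative
- FN: False Negative
- Most widely-used metric: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$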


Limitation of Accuracy

- Consider a 2-class problem with unbalanced classes
- Number of Class 0 examples = 9990
- Number of Class 1 examples = 10
- If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
- Accuracy is misleading because the model does not detect any class 1 example
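The unbalanced example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the helper names `confusion_counts` and `accuracy` are made up for illustration, and class 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all test records that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Class sizes from the slide: 9990 class-0 examples and 10 class-1 examples.
y_true = [0] * 9990 + [1] * 10
y_pred = [0] * 10000  # a model that predicts class 0 for everything

tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive=1)
acc = accuracy(tp, fp, tn, fn)
recall = tp / (tp + fn)  # fraction of class-1 examples that were detected
print(acc, recall)
```

Accuracy comes out at 99.9% while the detection rate for class 1 is zero, which is the mismatch the slide warns about.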


Methods for Performance Evaluation

- How to obtain a reliable estimate of performance?
- Performance of a model may depend on other factors besides the learning algorithm:
- Class distribution
- Cost of misclassification
- Size of training and test sets


Methods of Estimation

- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross validation
- Partition data into k disjoint subsets
- k-fold: train on k-1 partitions, test on the remaining one
- Leave-one-out: k=n
- Bootstrap: sampling with replacement
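The estimation methods differ mainly in how record indices are divided between training and testing. The sketch below shows one possible way to generate those index sets with the standard library. The function names are hypothetical, and testing the bootstrap on the records that were never drawn is an assumption that the slide does not state.

```python
import random


def holdout_split(n, train_fraction=2 / 3, seed=0):
    """Holdout: shuffle indices once, reserve a fraction for training."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * train_fraction))
    return idx[:cut], idx[cut:]


def k_fold_splits(n, k, seed=0):
    """k-fold: yield (train, test) pairs; each fold is the test set once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint subsets
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def bootstrap_sample(n, seed=0):
    """Bootstrap: draw n training indices with replacement.

    Assumption: the records that were never drawn serve as the test set.
    """
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test


train, test = holdout_split(9)
splits = list(k_fold_splits(10, k=5))
boot_train, boot_test = bootstrap_sample(10)
```

Random subsampling corresponds to calling `holdout_split` several times with different seeds, and leave-one-out corresponds to `k_fold_splits(n, k=n)`.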


ROC (Receiver Operating Characteristic)

Developed in the 1950s for signal detection theory to analyze noisy signals

Characterizes the trade-off between positive hits and false alarms

ROC curve plots TP rate (y-axis) against FP rate (x-axis)

Performance of each classifier is represented as a point on the ROC curve

Changing the threshold of the algorithm, or the sample distribution, changes the location of the point

Example: a 1-dimensional data set containing 2 classes (positive and negative)

- any point located at x > t is classified as positive

At threshold t: TPR = 0.5, FPR = 0.12

ROC Curve

(FPR, TPR):

(0,0): declare everything to be negative class

(1,1): declare everything to be positive class

(0,1): ideal

Diagonal line: random guessing

Below diagonal line: prediction is opposite of the true class

Using ROC for Model Comparison

- Neither model consistently outperforms the other
- M1 is better for small FPR
- M2 is better for large FPR
- Area Under the ROC Curve (AUC)
- Ideal: Area = 1
- Random guess: Area = 0.5

How to construct an ROC curve

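- Use a classifier that produces a posterior probability P(+|x) for each test instance x
- Apply a threshold t at each distinct value of P(+|x)
- At each threshold, count the # of + with score >= t (true positives) and the # of - with score >= t (false positives)
- ROC curve: plot TPR = TP / (# of +) against FPR = FP / (# of -) over all thresholds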
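A minimal sketch of ROC construction from scored test instances follows. The scores and labels are invented for illustration and the function names are hypothetical. The area is computed with the trapezoidal rule, which is one common choice rather than something the slides specify.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (FPR, TPR) points.

    scores: posterior probability of the positive class per instance.
    labels: 1 for positive, 0 for negative.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: all negative
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical classifier outputs for six test instances.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = auc(points)
print(points, area)
```

Lowering the threshold moves the point from (0,0) toward (1,1). A classifier whose positives all score above its negatives reaches the ideal corner and has area 1.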

Confidence Interval for Accuracy

- Prediction can be regarded as a Bernoulli trial
- A Bernoulli trial has 2 possible outcomes
- Possible outcomes for prediction: correct or wrong
- A collection of Bernoulli trials has a Binomial distribution:
- x ~ Bin(N, p), where x is the number of correct predictions
- e.g., toss a fair coin 50 times, how many heads would turn up? Expected number of heads = Np = 50 × 0.5 = 25
- Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

Confidence Interval for Accuracy

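- For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1-p)/N:

$$P\left(Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2}\right) = 1 - \alpha$$

- The area under the standard normal curve between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is $1 - \alpha$
- Confidence Interval for p, obtained by solving the inequality above for p:

$$p = \frac{2N \cdot acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4N \cdot acc - 4N \cdot acc^2}}{2\left(N + Z_{\alpha/2}^2\right)}$$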

Confidence Interval for Accuracy

- Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
- N=100, acc = 0.8
- Let 1 − α = 0.95 (95% confidence)
- From the probability table, $Z_{\alpha/2} = 1.96$
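The interval for this example can be computed by solving the normal-approximation inequality for p, which gives the closed form known as the Wilson score interval. The sketch below assumes that closed form. The function name is hypothetical and z = 1.96 is the table value quoted above.

```python
import math


def accuracy_confidence_interval(acc, n, z=1.96):
    """Interval for the true accuracy p, given observed acc on n instances.

    Solves |acc - p| / sqrt(p * (1 - p) / n) <= z for p.
    """
    z2 = z * z
    center = 2 * n * acc + z2
    spread = z * math.sqrt(z2 + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z2)
    return (center - spread) / denom, (center + spread) / denom


low, high = accuracy_confidence_interval(acc=0.8, n=100)
print(round(low, 3), round(high, 3))
```

For N = 100 and acc = 0.8 the interval runs from roughly 0.71 to 0.87. Calling the function with a larger `n` and the same accuracy narrows it.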


Test of Significance

- Given two models:
- Model M1: accuracy = 85%, tested on 30 instances
- Model M2: accuracy = 75%, tested on 5000 instances
- Can we say M1 is better than M2?
- How much confidence can we place on accuracy of M1 and M2?
- Can the difference in performance measure be explained as a result of random fluctuations in the test set?


Comparing Performance of 2 Models

- Given two models, say M1 and M2, which is better?
- M1 is tested on D1 (size=n1), found error rate = e1
- M2 is tested on D2 (size=n2), found error rate = e2
- Assume D1 and D2 are independent
- If n1 and n2 are sufficiently large, then $e_1 \sim N(\mu_1, \sigma_1)$ and $e_2 \sim N(\mu_2, \sigma_2)$
- Approximate variance (Binomial distribution): $\hat{\sigma}_i^2 = \frac{e_i(1 - e_i)}{n_i}$

Comparing Performance of 2 Models

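- To test if the performance difference is statistically significant:
- d = e1 – e2
- $d \sim N(d_t, \sigma_t)$, where $d_t$ is the true difference
- Since D1 and D2 are independent, their variances add up: $\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \approx \frac{e_1(1 - e_1)}{n_1} + \frac{e_2(1 - e_2)}{n_2}$
- At (1 − α) confidence level, $d_t = d \pm Z_{\alpha/2}\,\hat{\sigma}_t$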

An Illustrative Example

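- Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
- d = |e2 – e1| = 0.1 (2-sided test)
- Estimated variance: $\hat{\sigma}_d^2 = \frac{0.15(1 - 0.15)}{30} + \frac{0.25(1 - 0.25)}{5000} = 0.0043$
- At 95% confidence level, $Z_{\alpha/2} = 1.96$: $d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} = 0.100 \pm 0.128$ => interval contains 0 => difference may not be statistically significant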
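The same check can be scripted. This sketch assumes independent test sets and the binomial variance approximation, so that the variances of the two error rates add. The function name is hypothetical.

```python
import math


def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e1 - e2)
    # Independent test sets: the estimated variances add up.
    variance = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * math.sqrt(variance)
    return d - margin, d + margin


# Values from the example: M1 (n1 = 30, e1 = 0.15), M2 (n2 = 5000, e2 = 0.25).
low, high = difference_interval(0.15, 30, 0.25, 5000)
significant = not (low <= 0 <= high)
print(low, high, significant)
```

Because the interval straddles zero, `significant` is `False`. Almost all of the variance comes from the 30-instance test set of M1.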


Classification Techniques

- Decision Tree based Methods
- Rule-based Methods
- Learning from Neighbors
- Bayesian Classification
- Neural Networks
- Ensemble Methods
- Support Vector Machines

Nearest Neighbor Classifiers

- Basic idea:
- If it walks like a duck, quacks like a duck, then it’s probably a duck

[Figure: given a test record, compute its distance to the training records and choose k of the “nearest” records]

Definition of Nearest Neighbor

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

Nearest Neighbor Classifiers

- Requires three things
- The set of stored records
- Distance Metric to compute distance between records
- The value of k, the number of nearest neighbors to retrieve
- To classify an unknown record:
- Compute distance to other training records
- Identify k nearest neighbors
- Use class labels of nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

k Nearest Neighbor Classification

- Compute distance between unknown record and all training data:
- Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
- Find k nearest neighbors
- Determine the class from nearest neighbor list
- take the majority vote of class labels among the k-nearest neighbors
- weight the vote according to distance
- weight factor, $w = 1/d^2$, $w = \exp(-d^2/t)$, etc.
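The classification steps can be sketched in a few lines of Python. The training points below are invented for illustration and `knn_classify` is a hypothetical helper. The weighted variant uses w = 1/d², with a small constant added as an assumption to avoid dividing by zero when the query coincides with a training point.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(train, query, k=3, weighted=False):
    """Classify query from its k nearest neighbors in train.

    train: list of (point, label) pairs.
    """
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        if weighted:
            d = euclidean(point, query)
            votes[label] += 1.0 / (d * d + 1e-12)  # w = 1/d^2
        else:
            votes[label] += 1.0  # plain majority vote
    return max(votes, key=votes.get)


# Hypothetical training records with two attributes.
train = [((0, 0), "A"), ((3, 0), "B"), ((0, 3), "B")]
query = (0.5, 0.5)

majority = knn_classify(train, query, k=3)
by_distance = knn_classify(train, query, k=3, weighted=True)
print(majority, by_distance)
```

Here the plain vote returns "B" (two of the three neighbors), while the distance-weighted vote returns "A" because the single "A" point is much closer to the query.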


1 nearest-neighbor

Voronoi Diagram (nearest neighbor regions)

- Voronoi diagram
- The segments of the Voronoi diagram are all the points in the plane that are equidistant to the two nearest sites.
- The Voronoi nodes are the points equidistant to three (or more) sites.

k of k-nn

- Choosing the value of k:
- If k is too small, sensitive to noise points
- If k is too large, neighborhood may include points from other classes

Normalization of attributes

- Scaling issues
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
- Example:
- height of a person may vary from 1.5m to 1.8m
- weight of a person may vary from 90lb to 300lb
- income of a person may vary from $10K to $1M
- Solution: Normalize the vectors to unit length
- Problem with Euclidean measure:
- High-dimensional data: curse of dimensionality
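One common way to keep a single attribute from dominating the distance is to rescale every attribute to the range [0, 1] (min-max scaling). This is offered as an illustrative alternative, not as the unit-length normalization named on the slide. The three records below are hypothetical, chosen to span the attribute ranges in the example.

```python
def min_max_scale(records):
    """Rescale each attribute (column) to [0, 1] across all records."""
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
            for v, lo, hi in zip(rec, lows, highs)
        )
        for rec in records
    ]


# Hypothetical people: (height in m, weight in lb, income in $).
people = [
    (1.5, 90, 10_000),
    (1.8, 300, 1_000_000),
    (1.65, 195, 505_000),
]

scaled = min_max_scale(people)
print(scaled)
```

Before scaling, any Euclidean distance between these records is determined almost entirely by income. After scaling, each attribute contributes on the same 0 to 1 range.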


k Nearest neighbor Classification

- k-NN classifiers are lazy learners
- They do not build models explicitly
- Unlike eager learners such as decision tree induction and rule-based systems
- Classifying unknown records is relatively expensive
- Robust to noisy data by averaging the k nearest neighbors

k-dimensional tree (kd-tree)

- An efficient structure for nearest neighbor searches
- A space-partitioning data structure for organizing points in a k-dimensional space

Example: 2d-tree

- A recursive space partitioning tree.
- Partition along x and y axis in an alternating fashion.
- Each internal node stores the splitting point along x (or y), e.g., the median of the points being put into the subtree, with respect to their coordinates in the axis being used to create the splitting plane.

k-dimensional tree (kd-tree)

- Searching for a nearest neighbor of p in a kd-tree
- Start with the root node
- Move down the tree recursively
- Reach a leaf and take it as the “current nearest”
- Unwind the recursion:
- check the parent’s other children: could their region intersect with a potential nearer neighbor?
- if no, go up a further level
- if yes, check the children
- if a child t is closer to p, t becomes the “current nearest”
- Repeat until the root is reached
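The search procedure can be sketched as a short recursive implementation. This is an illustrative variant, not the slides' exact structure: it assumes every node (not only the leaves) stores one point, splits at the median, and requires Python 3.8+ for `math.dist`. The sample points are invented.

```python
import math


class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right


def build(points, depth=0):
    """Build a kd-tree by splitting at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))


def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or math.dist(node.point, target) < math.dist(best, target):
        best = node.point  # this node becomes the current nearest
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)  # move down the side containing target
    # While unwinding: visit the other side only if the splitting plane is
    # closer than the current nearest, so a nearer point could lie beyond it.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
result = nearest(tree, (9, 2))
print(result)
```

The pruning test in `nearest` is the "is there an intersection with a potential nearer neighbor?" check: subtrees on the far side of a splitting plane are skipped whenever that plane is farther away than the current nearest point.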

[Figure: query point p, the current nearest leaf, its parent, and the parent's other children]

Complexity

Building a static kd-tree from n points takes $O(n \log^2 n)$ time if an $O(n \log n)$ sort is used to compute the median at each level; this can be improved to $O(n \log n)$ with a linear-time median-finding algorithm.

Inserting a new point into a balanced kd-tree takes $O(\log n)$ time.

Removing a point from a balanced kd-tree takes $O(\log n)$ time.

Querying an axis-parallel range in a balanced kd-tree takes $O(n^{1-1/k} + m)$ time, where m is the number of reported points and k is the dimension of the kd-tree.


Bayesian Classification

- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes’ Theorem
- Performance: a simple Bayesian classifier, the Naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes Classifier

- A probabilistic framework for solving classification problems
- Conditional Probability: $P(C \mid A) = \frac{P(A, C)}{P(A)}$ and $P(A \mid C) = \frac{P(A, C)}{P(C)}$
- Bayes theorem: $P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$

Example of Bayes Theorem

- Given:
- A doctor knows that meningitis causes stiff neck 50% of the time: P(S|M)
- Prior probability of any patient having meningitis is 1/50,000: P(M)
- Prior probability of any patient having stiff neck is 1/2: P(S)
- If a patient has stiff neck, what’s the probability he/she has meningitis? $P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)}$
- Informally, this can be written as: posterior = likelihood × prior / evidence
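The posterior can be evaluated directly from the three quantities stated on the slide. The sketch below uses exact fractions to avoid rounding. The function and variable names are hypothetical, and the numbers are taken as given in the text.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes theorem: posterior = likelihood * prior / evidence."""
    return likelihood * prior / evidence


p_s_given_m = Fraction(1, 2)   # P(S|M): meningitis causes stiff neck 50% of the time
p_m = Fraction(1, 50000)       # P(M): prior probability of meningitis
p_s = Fraction(1, 2)           # P(S): prior probability of stiff neck, as stated

p_m_given_s = posterior(p_s_given_m, p_m, p_s)
print(p_m_given_s, float(p_m_given_s))
```

A smaller value of P(S) would raise the posterior proportionally, since the evidence term sits in the denominator.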


Bayesian Classifiers

- Consider each attribute and class label as random variables
- Given a record with attributes (A1, A2,…,An)
- Goal is to predict class C, where C = c1, or c2, or …
- Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An )
- Can we estimate P(C| A1, A2,…,An ) directly from data?

Bayesian Classifiers

- Approach:
- compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem
- Choose value of C that maximizes P(C | A1, A2, …, An)
- Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C)
- How to estimate likelihood P(A1, A2, …, An | C )?

Naïve Bayes Classifier

- Assume independence among attributes Ai when class is given:
- $P(A_1, A_2, \ldots, A_n \mid C_j) = P(A_1 \mid C_j)\,P(A_2 \mid C_j) \cdots P(A_n \mid C_j)$
- Greatly reduces the computation cost: only counts the class distribution
- Can estimate P(Ai|Cj) for all Ai and Cj
- New point is classified to Cj if $P(C_j) \prod_i P(A_i \mid C_j)$ is maximal

How to Estimate Probabilities from Data?

- For Class: P(C) = Nc/N
- e.g., P(No) = 7/10, P(Yes) = 3/10
- For discrete attributes: P(Ai | Ck) = |Aik| / Nck
- where |Aik| is the number of instances that have attribute value Ai and belong to class Ck
- Examples:
- P(Status=Married|No) = 4/7
- P(Refund=Yes|Yes) = 0

How to Estimate Probabilities from Data?

For continuous attributes:

- Probability density estimation:
- Assume the attribute follows a normal distribution
- Use data to estimate the parameters of the distribution (e.g., mean μ and standard deviation σ)
- Once the probability distribution is known, use it to estimate the conditional probability P(Ai|Cj)

How to Estimate Probabilities from Data?

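- Normal distribution: $P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$
- One for each (Ai, cj) pair
- e.g., for (Income, Class=No):
- sample mean = 110
- sample variance = 2975
- so $P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left(-\frac{(120 - 110)^2}{2(2975)}\right) = 0.0072$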
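As a quick numerical check, the normal density can be evaluated with the sample mean and variance given for (Income, Class=No). The function name is hypothetical. The query value 120 (in thousands) is the income used in the worked Naïve Bayes example.

```python
import math


def gaussian_density(x, mean, variance):
    """Normal probability density with the given mean and variance."""
    exponent = -((x - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


# Sample mean 110 and sample variance 2975 for Income given Class=No.
p_income_given_no = gaussian_density(120, mean=110, variance=2975)
print(round(p_income_given_no, 4))
```

The value is a density rather than a probability, but since it is only compared across classes for the same income, it can be used in place of P(Ai|Cj).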


Example of Naïve Bayes Classifier

Given a Test Record: X = (Refund = No, Married, Income = 120K)

- P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024
- P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes) = 1 × 0 × $1.2 \times 10^{-9}$ = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes), we have P(No|X) > P(Yes|X) => Class = No
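The decision rule can be reproduced from the probabilities quoted in this example. The dictionary layout and function name below are illustrative assumptions. The priors P(No) = 7/10 and P(Yes) = 3/10 come from the class-probability estimates given earlier.

```python
# Class priors and per-attribute conditional probabilities for the test
# record (Refund=No, Married, Income=120K), as quoted in the example.
priors = {"No": 7 / 10, "Yes": 3 / 10}
likelihoods = {
    "No": [4 / 7, 4 / 7, 0.0072],
    "Yes": [1.0, 0.0, 1.2e-9],
}


def score(cls):
    """P(C) times the product of P(Ai|C) over the record's attributes."""
    result = priors[cls]
    for p in likelihoods[cls]:
        result *= p
    return result


scores = {cls: score(cls) for cls in priors}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

The score for "Yes" is exactly zero because one conditional probability is zero, regardless of the other factors.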


Avoiding the 0-Probability Problem

- If one of the conditional probabilities is zero, then the entire expression becomes zero
- Probability estimation:
- Original: $P(A_i \mid C) = \frac{N_{ic}}{N_c}$
- Laplace: $P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$
- m-estimate: $P(A_i \mid C) = \frac{N_{ic} + mp}{N_c + m}$
- c: number of classes, p: prior probability, m: parameter
- E.g., suppose a dataset with 1000 tuples: income = low (0), income = medium (990), income = high (10)
- Use Laplacian correction (or Laplacian estimator): adding 1 to each case, c = 3
- Prob(income = low) = 1/1003
- Prob(income = medium) = 991/1003
- Prob(income = high) = 11/1003
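The Laplacian correction for the income example can be written as a short function. The sketch uses exact fractions so the results match the slide's values digit for digit. `laplace_estimates` and `m_estimate` are hypothetical names, and the m-estimate is assumed to take the standard form (count + m·p) / (total + m).

```python
from fractions import Fraction


def laplace_estimates(counts):
    """Add 1 to every count; the denominator grows by the number of cases."""
    total = sum(counts.values()) + len(counts)
    return {value: Fraction(n + 1, total) for value, n in counts.items()}


def m_estimate(count, total, m, p):
    """m-estimate (assumed standard form): (count + m * p) / (total + m)."""
    return (count + m * p) / (total + m)


# Counts from the example: 1000 tuples split over three income values.
counts = {"low": 0, "medium": 990, "high": 10}
probs = laplace_estimates(counts)
print(probs)
```

No estimate is zero any more, and the three corrected probabilities still sum to 1.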


Naïve Bayesian Classifier: Comments

- Advantages
- Easy to implement
- Good results obtained in most of the cases
- Robust to isolated noise points
- Robust to irrelevant attributes
- Handle missing values by ignoring the instance during probability estimate calculations


Naïve Bayesian Classifier: Comments

- Disadvantages
- Independence assumption may not hold for some attributes
- In practice, dependencies exist among variables
- E.g., in hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
- Dependencies among these cannot be modeled by the Naïve Bayesian classifier, resulting in a loss of accuracy
- How to deal with these dependencies?
- Bayesian Belief Networks (BBN)

Bayesian Belief Networks

- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships (directed acyclic graph)
- Represents dependency among the variables
- Gives a specification of the joint probability distribution
- Nodes: random variables
- Links: dependency
- X and Y are the parents of Z, and Y is the parent of P
- No dependency between Z and P
- Has no loops or cycles

[Figure: directed acyclic graph over the nodes X, Y, Z, and P]

Bayesian Belief Network: An Example

[Figure: network with nodes FamilyHistory, Smoker, LungCancer, and Emphysema]

The conditional probability table (CPT) for variable LungCancer:

| | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S) |
|---|---|---|---|---|
| LC | 0.8 | 0.5 | 0.7 | 0.1 |
| ~LC | 0.2 | 0.5 | 0.3 | 0.9 |

The CPT shows the conditional probability for each possible combination of values of a node's parents:

P(LungCancer = YES | FH = YES, S = YES) = 0.8

P(LungCancer = NO | FH = NO, S = NO) = 0.9

Derivation of the probability of a particular combination of values (x1, …, xn) of a test tuple from the CPTs: $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{Parents}(x_i))$
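The factorization into per-node conditional probabilities can be illustrated on the three nodes FamilyHistory, Smoker, and LungCancer. In this sketch the LungCancer CPT entries are read so that the LC and ~LC probabilities sum to 1 for each parent combination. The priors `P_FH` and `P_S` are made-up values, since the slides do not give them, and the remaining nodes are left out.

```python
# P(LungCancer = yes | FamilyHistory, Smoker), keyed by (fh, s).
P_LC = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# Hypothetical priors for the parent nodes (NOT from the slides).
P_FH = 0.1  # P(FamilyHistory = yes)
P_S = 0.3   # P(Smoker = yes)


def bernoulli(p_true, value):
    """Probability of a binary variable taking `value`, given P(True)."""
    return p_true if value else 1.0 - p_true


def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S): each node given its parents."""
    return (bernoulli(P_FH, fh)
            * bernoulli(P_S, s)
            * bernoulli(P_LC[(fh, s)], lc))


p = joint(fh=True, s=True, lc=True)
total = sum(joint(fh, s, lc)
            for fh in (True, False)
            for s in (True, False)
            for lc in (True, False))
print(p, total)
```

Summing `joint` over all eight value combinations gives 1, which is a useful sanity check that the CPT and priors define a valid joint distribution.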


Bayesian Belief Network: An Example

If the CPTs are known, a BBN can be used to:

- Compute the joint probability of a tuple, e.g., P(FH=Y, S=Y, LC=Y, E=N, PXR=Y, D=N)
- Take a node as an “output”, representing a class label attribute, e.g., PositiveXRay as the class attribute
- Predict the class of a tuple, e.g., PXR = ? given FH=N, S=Y, LC=N
- compute P(PXR=Y | FH=N, S=Y, LC=N) = a and P(PXR=N | FH=N, S=Y, LC=N) = b
- if a > b, then PositiveXRay = Yes

[Figure: network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay (class attribute), and Dyspnea]

Training Bayesian Networks by training data instances

- Several scenarios:
- Given both the network structure and all variables observable: learn only the CPTs
- Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
- Network structure unknown, all variables observable: search through the model space to reconstruct network topology
- Unknown structure, all hidden variables: No good algorithms known for this purpose



Questions

How to evaluate your classifier?

What criteria can you use?

How to compare the performance of classifiers?

How can k-nearest neighbors be used for classification?

How to use the Naïve Bayesian classifier?

What are the disadvantages of the Naïve Bayesian classifier?

How does a Bayesian network work?
