
AMCS/CS 340: Data Mining


Presentation Transcript


  1. Classification: Evaluation AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

  2. Model Evaluation • Metrics for Performance Evaluation • How to evaluate the performance of a model? • Methods for Performance Evaluation • How to obtain reliable estimates? • Methods for Model Comparison • How to compare the relative performance among competing models?

  3. Metrics for Performance Evaluation • Focus on the predictive capability of a model • Rather than on how fast it classifies or builds models, scalability, etc. • Confusion Matrix: tabulates the counts of TP (True Positive), FP (False Positive), TN (True Negative), FN (False Negative) • Most widely-used metric: Accuracy = (TP + TN) / (TP + TN + FP + FN)
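For concreteness (this code is not from the slides), a minimal Python sketch that tallies the four counts and the accuracy; the function name and toy labels are invented for the example:

    import numpy as np

    def confusion_counts(y_true, y_pred, positive=1):
        """Tally TP, FP, TN, FN for a binary problem."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_pred == positive) & (y_true == positive))
        fp = np.sum((y_pred == positive) & (y_true != positive))
        tn = np.sum((y_pred != positive) & (y_true != positive))
        fn = np.sum((y_pred != positive) & (y_true == positive))
        return tp, fp, tn, fn

    tp, fp, tn, fn = confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # (TP + TN) / total
    print(tp, fp, tn, fn, accuracy)             # 2 1 1 1 0.6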

  4. Limitation of Accuracy • Consider a 2-class problem • Number of Class 0 examples = 9990 • Number of Class 1 examples = 10 (unbalanced classes) • If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9% • Accuracy is misleading because the model does not detect any class 1 example

  5. Other Measures • Precision p = TP / (TP + FP) • Recall r = TP / (TP + FN) • F-measure F = 2rp / (r + p) = 2TP / (2TP + FP + FN) • These are more informative than accuracy when classes are unbalanced
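Building on the hypothetical confusion_counts sketch above, these measures are one line each; applied to the previous slide's predict-everything-as-class-0 model (TP = 0, FP = 0, FN = 10), all three collapse to zero, which is exactly the failure that accuracy hides:

    def precision_recall_f1(tp, fp, fn):
        """Precision, recall and F-measure from confusion-matrix counts."""
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    print(precision_recall_f1(0, 0, 10))  # (0.0, 0.0, 0.0)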

  6. Model Evaluation • Metrics for Performance Evaluation • How to evaluate the performance of a model? • Methods for Performance Evaluation • How to obtain reliable estimates? • Methods for Model Comparison • How to compare the relative performance among competing models?

  7. Methods for Performance Evaluation • How to obtain a reliable estimate of performance? • Performance of a model may depend on other factors besides the learning algorithm: • Class distribution • Cost of misclassification • Size of training and test sets

  8. Methods of Estimation • Holdout: reserve 2/3 for training and 1/3 for testing • Random subsampling: repeated holdout • Cross validation: • Partition data into k disjoint subsets • k-fold: train on k-1 partitions, test on the remaining one • Leave-one-out: k = n • Bootstrap: sampling with replacement
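As an illustration of k-fold cross validation (the slides do not prescribe a library; this sketch assumes scikit-learn is available, with an arbitrary toy data set and model):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X, y = rng.random((100, 4)), rng.integers(0, 2, 100)  # toy data

    scores = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])        # train on k-1 folds
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))  # test on the held-out fold
    print(np.mean(scores), np.std(scores))  # estimate and its variability over folds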

  9. Model Evaluation • Metrics for Performance Evaluation • How to evaluate the performance of a model? • Methods for Performance Evaluation • How to obtain reliable estimates? • Methods for Model Comparison • How to compare the relative performance among competing models?

  10. ROC (Receiver Operating Characteristic) • Developed in the 1950s in signal detection theory to analyze noisy signals • Characterizes the trade-off between positive hits and false alarms • ROC curve plots TP rate (y-axis) against FP rate (x-axis) • Performance of each classifier is represented as a point on the ROC curve; changing the threshold of the algorithm, or the sample distribution, changes the location of the point • Example: a 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive • At threshold t: TPR = 0.5, FPR = 0.12

  11. ROC Curve • Points in (FPR, TPR) space: • (0,0): declare everything to be negative class • (1,1): declare everything to be positive class • (0,1): ideal • Diagonal line: random guessing • Below diagonal line: prediction is opposite of the true class

  12. Using ROC for Model Comparison • Neither model consistently outperforms the other • M1 is better for small FPR • M2 is better for large FPR • Area Under the ROC Curve (AUC) • Ideal: Area = 1 • Random guess: Area = 0.5

  13. How to construct an ROC curve • Sort test instances by decreasing posterior probability P(+|x) • Sweep a threshold t over the sorted scores; an instance is predicted positive if its probability >= t • At each t, count the numbers of + and - instances with score >= t to get TP and FP, then compute TPR = TP/(TP+FN) and FPR = FP/(FP+TN) • Plot the (FPR, TPR) pairs to obtain the ROC curve (see the sketch below)
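A plain NumPy sketch of this sweep (my own illustration, not the course's code; ties between scores are ignored for simplicity):

    import numpy as np

    def roc_points(scores, labels):
        """Sweep the threshold down the sorted scores; return (FPR, TPR) pairs."""
        order = np.argsort(scores)[::-1]     # sort by decreasing posterior P(+|x)
        labels = np.asarray(labels)[order]
        P, N = labels.sum(), len(labels) - labels.sum()
        fpr, tpr, tp, fp = [0.0], [0.0], 0, 0
        for y in labels:                     # admit one more instance as positive per step
            tp, fp = tp + (y == 1), fp + (y == 0)
            tpr.append(tp / P)
            fpr.append(fp / N)
        return fpr, tpr

    fpr, tpr = roc_points([0.9, 0.8, 0.7, 0.6, 0.4], [1, 1, 0, 1, 0])
    print(list(zip(fpr, tpr)))      # starts at (0,0), ends at (1,1)
    print(np.trapz(tpr, fpr))       # area under the curve: ~0.83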

  14. Confidence Interval for Accuracy • Prediction can be regarded as a Bernoulli trial • A Bernoulli trial has 2 possible outcomes • Possible outcomes for prediction: correct or wrong • A collection of Bernoulli trials has a Binomial distribution: x ~ Bin(N, p), x: number of correct predictions • e.g.: toss a fair coin 50 times, how many heads would turn up? Expected number of heads = Np = 50 × 0.5 = 25 • Given x (# of correct predictions) or, equivalently, acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

  15. Confidence Interval for Accuracy • For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1-p)/N: P( -Zα/2 <= (acc - p) / sqrt(p(1-p)/N) <= Z1-α/2 ) = 1 - α, where 1 - α is the area under the standard normal curve between the critical values Zα/2 and Z1-α/2 • Solving for p gives the confidence interval: p = ( 2·N·acc + Z²α/2 ± Zα/2 · sqrt(Z²α/2 + 4·N·acc - 4·N·acc²) ) / ( 2·(N + Z²α/2) )

  16. Confidence Interval for Accuracy • Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: • N = 100, acc = 0.8 • Let 1-α = 0.95 (95% confidence) • From the standard Normal probability table, Zα/2 = 1.96 • Substituting into the interval formula gives p in approximately (0.711, 0.867)
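The same computation as a small Python helper (my own sketch of the interval formula above; the function name is invented):

    import math

    def accuracy_confidence_interval(acc, n, z=1.96):
        """Interval for the true accuracy p, by inverting the normal approximation."""
        center = 2 * n * acc + z ** 2
        spread = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
        denom = 2 * (n + z ** 2)
        return (center - spread) / denom, (center + spread) / denom

    print(accuracy_confidence_interval(0.8, 100))  # ~(0.711, 0.867)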

  17. Test of Significance • Given two models: • Model M1: accuracy = 85%, tested on 30 instances • Model M2: accuracy = 75%, tested on 5000 instances • Can we say M1 is better than M2? • How much confidence can we place on the accuracy of M1 and M2? • Can the difference in performance be explained as a result of random fluctuations in the test set?

  18. Comparing Performance of 2 Models • Given two models, say M1 and M2, which is better? • M1 is tested on D1 (size = n1), found error rate e1 • M2 is tested on D2 (size = n2), found error rate e2 • Assume D1 and D2 are independent • If n1 and n2 are sufficiently large, then e1 and e2 are approximately normally distributed • Approximate variance (from the Binomial distribution): σi² ≈ ei(1 - ei)/ni, for i = 1, 2

  19. Comparing Performance of 2 Models • To test if the performance difference is statistically significant, consider d = e1 - e2, an estimate of the true difference dt • Since D1 and D2 are independent, their variances add up: σd² ≈ e1(1 - e1)/n1 + e2(1 - e2)/n2 • At (1-α) confidence level: dt = d ± Zα/2 · σd

  20. An Illustrative Example • Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25 • d = |e2 - e1| = 0.1 (2-sided test) • σd² ≈ 0.15(1 - 0.15)/30 + 0.25(1 - 0.25)/5000 = 0.0043 • At 95% confidence level, Zα/2 = 1.96: dt = 0.100 ± 1.96 × sqrt(0.0043) = 0.100 ± 0.128 => Interval contains 0 => difference may not be statistically significant
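The corresponding check in code (a hypothetical helper following the formulas on the previous two slides):

    import math

    def difference_interval(e1, n1, e2, n2, z=1.96):
        """Confidence interval for the true difference in error rates."""
        d = abs(e2 - e1)
        var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2  # independent test sets: variances add
        margin = z * math.sqrt(var_d)
        return d - margin, d + margin

    lo, hi = difference_interval(0.15, 30, 0.25, 5000)
    print(lo, hi)  # ~(-0.028, 0.228): contains 0, so not significant at 95%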

  21. Classification Techniques • Decision Tree based Methods • Rule-based Methods • Learning from Neighbors • Bayesian Classification • Neural Networks • Ensemble Methods • Support Vector Machines

  22. Nearest Neighbor Classifiers • Basic idea: if it walks like a duck, quacks like a duck, then it's probably a duck • [Figure: compute the distance from the test record to the training records, then choose the k of the "nearest" records]

  23. Definition of Nearest Neighbor • The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

  24. Nearest Neighbor Classifiers • Requires three things • The set of stored records • A distance metric to compute the distance between records • The value of k, the number of nearest neighbors to retrieve • To classify an unknown record: • Compute its distance to the training records • Identify the k nearest neighbors • Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

  25. k Nearest Neighbor Classification • Compute the distance between the unknown record and all training data, e.g., Euclidean distance d(x, y) = sqrt( Σi (xi - yi)² ) • Find the k nearest neighbors • Determine the class from the nearest neighbor list • take the majority vote of class labels among the k nearest neighbors • or weight the vote according to distance, e.g., weight factor w = 1/d², w = exp(-d²/t), etc. (see the sketch below)
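A compact NumPy sketch of such a classifier (illustrative only; knn_predict and the toy data are invented):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3, weighted=False):
        """Classify x by majority (or 1/d²-weighted) vote of its k nearest neighbors."""
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances to all records
        nn = np.argsort(d)[:k]                         # indices of the k closest records
        w = 1.0 / (d[nn] ** 2 + 1e-12) if weighted else np.ones(k)
        votes = {}
        for label, weight in zip(y_train[nn], w):
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get)

    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]])
    y = np.array([0, 0, 0, 1, 1])
    print(knn_predict(X, y, np.array([4.5, 5.0]), k=3))  # -> 1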

  26. 1 Nearest Neighbor: Voronoi Diagram (nearest neighbor regions) • The segments of the Voronoi diagram are all the points in the plane that are equidistant to the two nearest sites • The Voronoi nodes are the points equidistant to three (or more) sites

  27. k of k-NN • Choosing the value of k: • If k is too small, the classifier is sensitive to noise points • If k is too large, the neighborhood may include points from other classes

  28. Normalization of Attributes • Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes • Example: • height of a person may vary from 1.5 m to 1.8 m • weight of a person may vary from 90 lb to 300 lb • income of a person may vary from $10K to $1M • Solution: normalize the vectors to unit length (see the sketch below) • Problem with the Euclidean measure: high-dimensional data suffer from the curse of dimensionality
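A short sketch of the slide's unit-length normalization, plus per-attribute min-max scaling as a common alternative (my addition, not from the slides):

    import numpy as np

    X = np.array([[1.6, 150.0, 50_000.0],
                  [1.8, 300.0, 900_000.0]])  # height (m), weight (lb), income ($)

    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)             # rows scaled to unit length
    X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # each attribute into [0, 1]
    print(X_unit, X_minmax, sep="\n")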

  29. k Nearest Neighbor Classification • k-NN classifiers are lazy learners: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems • Robust to noisy data when averaging over the k nearest neighbors • Classifying unknown records is relatively expensive

  30. k-dimensional tree (kd-tree) • An efficient way to perform nearest neighbor searches • A space-partitioning data structure for organizing points in a k-dimensional space

  31. Example: 2d-tree • A recursive space-partitioning tree • Partition along the x and y axes in an alternating fashion • Each internal node stores the splitting value along x (or y), e.g., the median of the points being put into the subtree, with respect to their coordinates in the axis being used to create the splitting plane (see the sketch below)
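A minimal sketch (not from the slides) of building such a 2d-tree with median splits and alternating axes; Node and build_2dtree are invented names:

    class Node:
        def __init__(self, point, axis, left=None, right=None):
            self.point, self.axis = point, axis  # splitting point; axis 0 = x, 1 = y
            self.left, self.right = left, right

    def build_2dtree(points, depth=0):
        """Recursively split on the median, alternating between x and y."""
        if not points:
            return None
        axis = depth % 2                                  # alternate x, y
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2                            # median along the current axis
        return Node(points[mid], axis,
                    build_2dtree(points[:mid], depth + 1),
                    build_2dtree(points[mid + 1:], depth + 1))

    tree = build_2dtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
    print(tree.point, tree.axis)  # (7, 2), splitting on x at the root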

  32. k-dimensional tree (kd-tree) • Searching for a nearest neighbor of p in a kd-tree: • Start at the root node • Move down the tree recursively, as in an ordinary tree search • On reaching a leaf, save its point as the "current nearest" • Unwind the recursion; at each parent, check whether the other child's region could contain a nearer neighbor, i.e., whether the splitting plane is closer to p than the current nearest • if no, go up one further level • if yes, search the other child; if a point t found there is closer to p, t becomes the "current nearest" • Repeat until the root is reached
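A self-contained sketch of that search (again my own illustration, with invented names); the pruning test checks whether the splitting plane is closer to p than the current nearest, i.e., whether the other child could hold a better candidate:

    import math

    class Node:
        def __init__(self, point, axis, left=None, right=None):
            self.point, self.axis, self.left, self.right = point, axis, left, right

    def build(points, depth=0):
        if not points:
            return None
        axis = depth % 2
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return Node(points[mid], axis,
                    build(points[:mid], depth + 1), build(points[mid + 1:], depth + 1))

    def nearest(node, p, best=None):
        """Descend to a leaf, then unwind, visiting the far child only when needed."""
        if node is None:
            return best
        if best is None or math.dist(p, node.point) < math.dist(p, best):
            best = node.point                          # update "current nearest"
        near, far = ((node.left, node.right) if p[node.axis] < node.point[node.axis]
                     else (node.right, node.left))
        best = nearest(near, p, best)                  # search the side containing p first
        if abs(p[node.axis] - node.point[node.axis]) < math.dist(p, best):
            best = nearest(far, p, best)               # plane intersects the search sphere
        return best

    tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
    print(nearest(tree, (9, 2)))  # (8, 1)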
