AMCS/CS 340: Data Mining

Classification: Evaluation AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

Model Evaluation
• Metrics for Performance Evaluation
  • How to evaluate the performance of a model?
• Methods for Performance Evaluation
  • How to obtain reliable estimates?
• Methods for Model Comparison
  • How to compare the relative performance among competing models?

Metrics for Performance Evaluation
• Focus on the predictive capability of a model
  • Rather than how long it takes to classify or build models, scalability, etc.
• Confusion Matrix:
  • TP: True Positive, FP: False Positive, TN: True Negative, FN: False Negative
• Most widely-used metric: Accuracy = (TP + TN) / (TP + TN + FP + FN)

Limitation of Accuracy
• Consider a 2-class problem with unbalanced classes
  • Number of Class 0 examples = 9990
  • Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
• Accuracy is misleading because the model does not detect any class 1 example
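The numbers on this slide can be checked with a short sketch. The helper names `accuracy` and `true_positive_rate` are illustrative assumptions, not part of the lecture; the data simply reproduces the 9990/10 class split with a model that always predicts class 0.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def true_positive_rate(y_true, y_pred, positive=1):
    """TP / (TP + FN): share of positive examples that were detected."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


# 9990 examples of class 0, 10 examples of class 1 (as on the slide)
y_true = [0] * 9990 + [1] * 10
# A model that predicts everything to be class 0
y_pred = [0] * 10000

acc = accuracy(y_true, y_pred)
tpr = true_positive_rate(y_true, y_pred)
print(f"accuracy = {acc:.3f}, TP rate on class 1 = {tpr:.3f}")
```

The accuracy is high while the TP rate on class 1 is zero, which is why accuracy alone is a poor summary for unbalanced classes.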

Other Measures

Methods for Performance Evaluation
• How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
  • Class distribution
  • Cost of misclassification
  • Size of training and test sets

Methods of Estimation
• Holdout
  • Reserve 2/3 for training and 1/3 for testing
• Random subsampling
  • Repeated holdout
• Cross validation
  • Partition data into k disjoint subsets
  • k-fold: train on k-1 partitions, test on the remaining one
  • Leave-one-out: k = n
• Bootstrap
  • Sampling with replacement
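As a minimal sketch of the cross-validation bullet, the generator below partitions record indices into k disjoint subsets and yields one train/test split per fold. The function name `k_fold_splits`, the shuffling step, and the `seed` parameter are assumptions made for illustration.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # shuffle so folds are random
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets
    for i in range(k):
        test = folds[i]
        # Train on the other k-1 partitions
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(n=10, k=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))

# Leave-one-out is the special case k = n: every test fold has one record
loo = list(k_fold_splits(n=10, k=10))
```

Each record appears in exactly one test fold, so every record is used for testing once and for training k-1 times.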

ROC (Receiver Operating Characteristic)
• Developed in the 1950s for signal detection theory to analyze noisy signals
• Characterizes the trade-off between positive hits and false alarms
• ROC curve plots TP rate (y-axis) against FP rate (x-axis)
• Performance of each classifier is represented as a point on the ROC curve
  • Changing the threshold of the algorithm, or the sample distribution, changes the location of the point
• Example: a 1-dimensional data set containing 2 classes (positive and negative)
  • Any point located at x > t is classified as positive
  • At threshold t: TPR = 0.5, FPR = 0.12

ROC Curve
• (FPR, TPR):
  • (0,0): declare everything to be negative class
  • (1,1): declare everything to be positive class
  • (0,1): ideal
• Diagonal line: random guessing
• Below diagonal line: prediction is opposite of the true class

Using ROC for Model Comparison
• No model consistently outperforms the other
  • M1 is better for small FPR
  • M2 is better for large FPR
• Area Under the ROC curve
  • Ideal: Area = 1
  • Random guess: Area = 0.5
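The two reference areas can be reproduced with a small sketch that integrates a piecewise-linear ROC curve using the trapezoid rule. The function name `auc_trapezoid` and the choice of the trapezoid rule are assumptions; points are given as (FPR, TPR) pairs.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (FPR, TPR) points."""
    pts = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between two points
    return area


# Ideal classifier: the curve passes through (FPR, TPR) = (0, 1)
ideal = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Random guessing: the diagonal line
random_guess = [(0.0, 0.0), (1.0, 1.0)]

auc_ideal = auc_trapezoid(ideal)
auc_random = auc_trapezoid(random_guess)
print(auc_ideal, auc_random)
```

A single area summarizes the whole curve, which helps when two curves cross, as with M1 and M2.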

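How to construct an ROC curve
• Use the posterior probability of each test instance x as its score
• For each threshold t, count the # of + instances with score >= t (TP) and the # of - instances with score >= t (FP)
[Figure: table of thresholds with the resulting counts, and the corresponding ROC curve]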
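The construction can be sketched as follows: each distinct score is used as a threshold t, and the positives and negatives scoring at least t are counted. The function name `roc_points` and the toy scores are hypothetical; the sketch assumes labels are 1 (positive) or 0 (negative) and that both classes are present.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct threshold, plus (0, 0)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for t in sorted(set(scores), reverse=True):
        # Instances with score >= t are classified as positive
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical posterior probabilities and true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4]
labels = [1, 1, 0, 1, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so the points move from (0, 0) toward (1, 1).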

Confidence Interval for Accuracy
• Prediction can be regarded as a Bernoulli trial
  • A Bernoulli trial has 2 possible outcomes
  • Possible outcomes for prediction: correct or wrong
• A collection of Bernoulli trials has a Binomial distribution:
  • $x \sim \mathrm{Bin}(N, p)$, where x is the number of correct predictions
  • e.g., toss a fair coin 50 times; how many heads would turn up? Expected number of heads = $Np = 50 \times 0.5 = 25$
• Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

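Confidence Interval for Accuracy
• For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1-p)/N:
  $$P\left( Z_{\alpha/2} < \frac{\mathrm{acc} - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2} \right) = 1 - \alpha$$
• Confidence Interval for p (solving the inequality above for p):
  $$p = \frac{2 N \cdot \mathrm{acc} + Z_{\alpha/2}^2 \pm Z_{\alpha/2} \sqrt{Z_{\alpha/2}^2 + 4 N \cdot \mathrm{acc} - 4 N \cdot \mathrm{acc}^2}}{2 \left( N + Z_{\alpha/2}^2 \right)}$$
[Figure: standard normal density; the area between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is $1 - \alpha$]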

Confidence Interval for Accuracy
• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  • N = 100, acc = 0.8
  • Let $1 - \alpha = 0.95$ (95% confidence)
  • From the probability table of the standard normal distribution, $Z_{\alpha/2} = 1.96$
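A sketch of the computation for this example follows. It assumes the interval is obtained by solving the normal approximation (mean p, variance p(1-p)/N) for p, which gives p = (2·N·acc + Z² ± Z·sqrt(Z² + 4·N·acc − 4·N·acc²)) / (2(N + Z²)). The function name `accuracy_confidence_interval` is illustrative.

```python
import math


def accuracy_confidence_interval(acc, n, z=1.96):
    """Interval for the true accuracy p from observed acc on n test instances.

    Assumes the normal approximation of the binomial, solved for p.
    """
    z2 = z * z
    center = 2 * n * acc + z2
    spread = z * math.sqrt(z2 + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z2)
    return (center - spread) / denom, (center + spread) / denom


# N = 100, acc = 0.8, 95% confidence (Z = 1.96), as on the slide
low, high = accuracy_confidence_interval(acc=0.8, n=100)
print(f"p is in [{low:.3f}, {high:.3f}]")

# The same accuracy on more test instances gives a narrower interval
low_big, high_big = accuracy_confidence_interval(acc=0.8, n=5000)
```

The width of the interval shrinks as N grows, so the size of the test set determines how much an observed accuracy can be trusted.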

Test of Significance
• Given two models:
  • Model M1: accuracy = 85%, tested on 30 instances
  • Model M2: accuracy = 75%, tested on 5000 instances
• Can we say M1 is better than M2?
  • How much confidence can we place on the accuracy of M1 and M2?
  • Can the difference in performance measure be explained as a result of random fluctuations in the test set?

Comparing Performance of 2 Models
• Given two models, say M1 and M2, which is better?
  • M1 is tested on D1 (size = n1), found error rate = e1
  • M2 is tested on D2 (size = n2), found error rate = e2
  • Assume D1 and D2 are independent
  • If n1 and n2 are sufficiently large, then $e_1 \sim N(\mu_1, \sigma_1)$ and $e_2 \sim N(\mu_2, \sigma_2)$
  • Approximate variance (Binomial distribution): $\hat{\sigma}_i^2 = \frac{e_i (1 - e_i)}{n_i}$

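Comparing Performance of 2 Models
• To test if the performance difference is statistically significant:
  • $d = e_1 - e_2$
  • $d \sim N(d_t, \sigma_t)$, where $d_t$ is the true difference
  • Since D1 and D2 are independent, their variances add up: $\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \approx \hat{\sigma}_t^2 = \frac{e_1 (1 - e_1)}{n_1} + \frac{e_2 (1 - e_2)}{n_2}$
  • At the $(1 - \alpha)$ confidence level, $d_t = d \pm Z_{\alpha/2} \hat{\sigma}_t$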

An Illustrative Example
• Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
• d = |e2 - e1| = 0.1 (2-sided test)
• $\hat{\sigma}_t^2 = \frac{0.15 (1 - 0.15)}{30} + \frac{0.25 (1 - 0.25)}{5000} \approx 0.0043$
• At 95% confidence level, $Z_{\alpha/2} = 1.96$: $d_t = 0.100 \pm 1.96 \times \hat{\sigma}_t \approx 0.100 \pm 0.128$
• => Interval contains 0 => difference may not be statistically significant
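The arithmetic of this example can be sketched in a few lines. The code assumes the binomial variance estimate e(1-e)/n for each error rate and adds the two variances because the test sets are independent; the function name `difference_interval` is illustrative.

```python
import math


def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e2 - e1)
    # Independent test sets: the variances of the two estimates add up
    variance = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * math.sqrt(variance)
    return d - margin, d + margin


# M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25; 95% confidence
low, high = difference_interval(0.15, 30, 0.25, 5000)
significant = not (low <= 0.0 <= high)
print(f"d_t is in [{low:.3f}, {high:.3f}], significant: {significant}")
```

Almost all of the variance comes from M1, which was tested on only 30 instances, so the interval is wide enough to contain 0.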

Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Learning from Neighbors
• Bayesian Classification
• Neural Networks
• Ensemble Methods
• Support Vector Machines

Nearest Neighbor Classifiers
• Basic idea:
  • If it walks like a duck, quacks like a duck, then it’s probably a duck
[Figure: compute the distance from the test record to the training records, then choose k of the “nearest” records]

Definition of Nearest Neighbor
• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

Nearest Neighbor Classifiers
• Requires three things
  • The set of stored records
  • Distance metric to compute distance between records
  • The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  • Compute distance to the training records
  • Identify the k nearest neighbors
  • Use class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking majority vote)

k Nearest Neighbor Classification
• Compute distance between the unknown record and all training data:
  • Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
• Find the k nearest neighbors
• Determine the class from the nearest neighbor list
  • Take the majority vote of class labels among the k nearest neighbors
  • Weight the vote according to distance
    • weight factor: $w = 1/d^2$, $w = \exp(-d^2/t)$, etc.
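The steps above can be sketched as a brute-force classifier. The names `euclidean` and `knn_classify`, the toy records, and the small epsilon that guards against division by zero are assumptions for illustration; the weighted variant uses the factor w = 1/d².

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, query, k, weighted=False):
    """Classify query from a list of (point, label) training records."""
    # Compute the distance to every training record and keep the k nearest
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, query)
        # Weighted vote uses w = 1/d^2; epsilon (an assumption) avoids 1/0
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical training records: two of class "A", three of class "B"
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.9, 3.1), "B")]
query = (1.1, 1.0)

label_k3 = knn_classify(train, query, k=3)
label_k5 = knn_classify(train, query, k=5)
label_k5_weighted = knn_classify(train, query, k=5, weighted=True)
print(label_k3, label_k5, label_k5_weighted)
```

With k = 5 the majority vote is taken over all five records, so the distant class "B" wins; weighting the votes by 1/d² lets the two close "A" records decide instead.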

1 nearest-neighbor
• Voronoi diagram (nearest neighbor regions)
  • The segments of the Voronoi diagram are all the points in the plane that are equidistant to the two nearest sites.
  • The Voronoi nodes are the points equidistant to three (or more) sites.

k of k-nn
• Choosing the value of k:
  • If k is too small, the classifier is sensitive to noise points
  • If k is too large, the neighborhood may include points from other classes

Normalization of attributes
• Scaling issues
  • Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  • Example:
    • height of a person may vary from 1.5m to 1.8m
    • weight of a person may vary from 90lb to 300lb
    • income of a person may vary from $10K to $1M
  • Solution: Normalize the vectors to unit length
• Problem with Euclidean measure:
  • High-dimensional data: curse of dimensionality
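To illustrate the scaling issue, the sketch below rescales each attribute to the range [0, 1]. This is per-attribute min-max scaling, a common alternative that is not the unit-length normalization named on the slide; the function name `min_max_scale` and the three example records are assumptions built from the ranges listed above.

```python
def min_max_scale(records):
    """Rescale every attribute (column) to the range [0, 1]."""
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for rec in records:
        scaled.append(tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
            for v, lo, hi in zip(rec, lows, highs)
        ))
    return scaled


# (height in m, weight in lb, income in $): hypothetical records
people = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.65, 195, 505_000)]
scaled = min_max_scale(people)
for rec in scaled:
    print(tuple(round(v, 3) for v in rec))
```

Before scaling, income differences dominate any Euclidean distance between these records; after scaling, all three attributes contribute on the same range.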

k Nearest neighbor Classification
• k-NN classifiers are lazy learners
  • They do not build models explicitly
  • Unlike eager learners such as decision tree induction and rule-based systems
• Robust to noisy data by averaging the k nearest neighbors
• Classifying unknown records is relatively expensive

k-dimensional tree (kd-tree)
• An efficient way to perform nearest neighbor searches
• A space-partitioning data structure for organizing points in a k-dimensional space

Example: 2d-tree
• A recursive space-partitioning tree.
• Partition along the x and y axes in an alternating fashion.
• Each internal node stores the splitting point along x (or y).
  • e.g. the median of the points being put into the subtree, with respect to their coordinates in the axis being used to create the splitting plane.

k-dimensional tree (kd-tree)
• Searching for a nearest neighbor of p in a kd-tree
  • Start with the root node
  • Move down the tree recursively
  • Reach a leaf → “current nearest”
  • Unwind the recursion:
    • check the parent’s other children: is there an intersection with a potential nearer neighbor?
      • if no, go up a further level
      • if yes, check the children
        • if a child t is closer to p, t → “current nearest”
  • Repeat until the root is reached
[Figure: query point p, the current nearest, its parent, and the parent’s other children]
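The search can be sketched with a small kd-tree that splits on the median and alternates axes. This is one common variant and an assumption rather than the lecture's exact procedure: every node stores a point, and each node is compared with p on the way down instead of only at the leaf. The names `Node`, `build`, `sq_dist`, and `nearest` are illustrative.

```python
class Node:
    """One kd-tree node: a splitting point, its axis, and two subtrees."""
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right


def build(points, depth=0):
    """Split on the median, cycling through the axes level by level."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))


def sq_dist(a, b):
    """Squared Euclidean distance (enough for comparisons)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, p, best=None):
    """Return the stored point closest to p."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, p) < sq_dist(best, p):
        best = node.point                      # new "current nearest"
    diff = p[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, p, best)              # move down toward p first
    # While unwinding: visit the other child only if the splitting plane
    # is closer to p than the current nearest point
    if diff * diff < sq_dist(best, p):
        best = nearest(far, p, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
result = nearest(tree, (9, 2))
print(result)
```

The pruning test is what saves work: a subtree is skipped whenever the region on the other side of the split cannot contain a point nearer than the current nearest.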
