Evaluating Classifiers

Evaluating Classifiers Lecture 2 Instructor: Max Welling Read chapter 5

Evaluation • Given: • A hypothesis h(x): X → C in hypothesis space H, mapping attributes x to classes c ∈ {1,2,3,...,C}. • A data sample S(n) of size n. • Questions: • What is the error of “h” on unseen data? • If we have two competing hypotheses, which one is better on unseen data? • How do we compare two learning algorithms in the face of limited data? • How certain are we about our answers?

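Sample and True Error We can define two errors: 1) Error(h|S) is the error on the sample S: $\mathrm{Error}(h|S) = \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))$ 2) Error(h|P) is the true error on unseen data sampled from the distribution P(x): $\mathrm{Error}(h|P) = \sum_{x} P(x)\, \delta(f(x) \neq h(x))$ where f(x) is the true hypothesis and $\delta(\cdot)$ is 1 if its argument is true and 0 otherwise.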

Binomial Distributions • Assume you toss a coin n times. • And it has probability p of coming heads (which we will call success) • What is the probability distribution governing the number of heads in n trials? • Answer: the Binomial distribution.
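The Binomial probabilities can be written out directly. The sketch below is illustrative only; the function name `binomial_pmf` and the choice of n = 4 fair tosses are assumptions made for the example, not part of the lecture.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r heads in n independent tosses with P(heads) = p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# Distribution over the number of heads for n = 4 tosses of a fair coin.
pmf = [binomial_pmf(r, 4, 0.5) for r in range(5)]
print(pmf)
```

The probabilities over r = 0,...,n sum to one, and the distribution peaks near r = np.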

Distribution over Errors • Consider some hypothesis h(x). • Draw n samples X~P(X). • Do this k times. • Compute e1=n*error(h|X1), e2=n*error(h|X2),...,ek=n*error(h|Xk). • {e1,...,ek} are samples from a Binomial distribution! • Why? Imagine a magic coin, where God secretly determines the probability of heads by the following procedure. First He takes some random hypothesis h. Then He draws x~P(x) and observes whether h(x) predicts the label correctly. If it does, He makes sure the coin lands heads up... • You have a single sample S, for which you observe e(S) errors. What do you think would be a reasonable estimate for Error(h|P)?
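The Binomial claim can be checked with a small simulation. This is a sketch under stated assumptions: the hypothesis is replaced by a coin that errs with a fixed true error rate `p_true` = 0.3, and the names `error_counts`, `n` and `k` are chosen for illustration.

```python
import random

def error_counts(p_true, n, k, seed=0):
    """Draw k sample sets of size n and count the errors in each one.

    Each test point is misclassified independently with probability p_true,
    which stands in for drawing x ~ P(x) and checking h(x) against the label.
    """
    rng = random.Random(seed)
    return [sum(rng.random() < p_true for _ in range(n)) for _ in range(k)]

counts = error_counts(p_true=0.3, n=100, k=2000)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
# For a Binomial these should land near n*p = 30 and n*p*(1-p) = 21.
print(mean, var)
```

Dividing a single observed count by n then gives the natural estimate of the true error.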

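Binomial Moments • For r successes in n trials, the Binomial distribution has mean $E[r] = np$ and variance $\mathrm{Var}[r] = np(1-p)$. • If we match the mean, np, with the observed value n*error(h|S) we find: $p \approx \mathrm{error}(h|S)$. • If we match the variance we can obtain an estimate of the width: $\mathrm{Var}[\mathrm{error}(h|S)] = \frac{p(1-p)}{n} \approx \frac{\mathrm{error}(h|S)\,(1-\mathrm{error}(h|S))}{n}$.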

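Confidence Intervals • We would like to state: • With N% confidence we believe that error(h|P) is contained in the interval: $\mathrm{error}(h|S) \pm z_N \sqrt{\frac{\mathrm{error}(h|S)\,(1-\mathrm{error}(h|S))}{n}}$ • [Figure: Normal(0,1) density with the central 80% region marked.] • In principle this interval is hard to compute exactly, but for np(1-p)>5 or n>30 it is safe to approximate a Binomial by a Gaussian, for which we can easily compute “z-values”.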
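A minimal sketch of the Gaussian-approximation interval, assuming the standard form error(h|S) ± z * sqrt(error(h|S)(1-error(h|S))/n). The function name and the example of 12 errors on 40 test points are illustrative assumptions; the z-value comes from Python's standard-library `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def error_confidence_interval(num_errors, n, confidence=0.95):
    """Two-sided confidence interval for the true error (Normal approximation)."""
    e = num_errors / n                               # error(h|S)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided z-value
    half_width = z * sqrt(e * (1 - e) / n)
    return e - half_width, e + half_width

# Hypothetical example: 12 errors on n = 40 test points, so n*p*(1-p) > 5.
low, high = error_confidence_interval(num_errors=12, n=40, confidence=0.95)
print(round(low, 3), round(high, 3))
```

The interval narrows only as 1/sqrt(n), so halving its width takes four times as much test data.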

Bias-Variance • The estimator is unbiased if $E[\mathrm{error}(h|X)] = \mathrm{error}(h|P) = p$. • Imagine again you have infinitely many sample sets X1,X2,... of size n. • Use these to compute estimates E1,E2,... of p, where Ei=error(h|Xi). • If the average of E1,E2,... converges to p, then error(h|X) is an unbiased estimator. • Two unbiased estimators can still differ in their variance (efficiency). Which one do you prefer? • [Figure: distributions of the estimates, with labels p and Eav.]

Flow of Thought • Determine the property you want to know about the future data (e.g. error(h|P)). • Find an unbiased estimator E for this quantity based on observing data X (e.g. error(h|X)). • Determine the distribution P(E) of E under the assumption that you have infinitely many sample sets X1,X2,... of some size n (e.g. P(E)=Binomial(p,n), p=error(h|P)). • Estimate the parameters of P(E) from an actual data sample S (e.g. p=error(h|S)). • Compute the mean and variance of P(E) and pray that P(E) is close to a Normal distribution (sums of random variables converge to Normal distributions, by the central limit theorem). • State your confidence interval as: with confidence N%, error(h|P) is contained in the interval $\mathrm{error}(h|S) \pm z_N \sqrt{\frac{\mathrm{error}(h|S)\,(1-\mathrm{error}(h|S))}{n}}$.

Assumptions • We only consider discrete-valued hypotheses (i.e. classes). • Training data and test data are drawn IID from the same distribution P(x) (IID: independently and identically distributed). • The hypothesis must be chosen independently of the data sample S! Homework: Consider training a classifier h(x) on data S. Argue why the third assumption is violated.

Evaluating Learned h(x) • If we learn h(x) from a sample S of size n, it is a bad idea to evaluate it on the same data (you will be overly confident of yourself). • Instead, use the following procedure: • Split the data into k subsets s1,...,sk of size ni=n/k. • Learn a hypothesis hi on the complement S-si. • Compute error(hi|si) on the left-out subset si. • After you have finished, compute the total error as: $\mathrm{error} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{error}(h_i|s_i)$ • Compute the variance as: $\frac{\mathrm{error}\,(1-\mathrm{error})}{n}$ • This “avoids” violating assumption 3.
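The splitting procedure can be sketched as follows. Everything concrete here is an assumption made for illustration: the toy one-dimensional data, the `train_nearest_mean` learner, and the use of the Binomial estimate error(1-error)/n for the variance.

```python
import random

def train_nearest_mean(train_set):
    """Toy learner (illustrative only): predict the class with the closest mean."""
    means = {}
    for label in {y for _, y in train_set}:
        xs = [x for x, y in train_set if y == label]
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

def cross_validated_error(data, k, train):
    """Average held-out error over k folds, plus a Binomial variance estimate."""
    folds = [data[i::k] for i in range(k)]           # disjoint subsets s_1..s_k
    fold_errors = []
    for i, held_out in enumerate(folds):
        # Train on the complement S - s_i, test on the left-out subset s_i.
        train_set = [pt for j, f in enumerate(folds) if j != i for pt in f]
        h = train(train_set)
        wrong = sum(h(x) != y for x, y in held_out)
        fold_errors.append(wrong / len(held_out))
    total = sum(fold_errors) / k
    variance = total * (1 - total) / len(data)       # assumed Binomial estimate
    return total, variance, fold_errors

# Synthetic data: class y has x ~ Normal(2*y, 1); 200 points, 10 equal folds.
rng = random.Random(0)
data = [(rng.gauss(2.0 * y, 1.0), y) for y in [i % 2 for i in range(200)]]
rng.shuffle(data)
total, variance, fold_errors = cross_validated_error(data, k=10, train=train_nearest_mean)
print(total, variance)
```

With equal-sized folds the average of the fold errors equals the pooled error rate over all n points.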

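Comparing 2 Hypotheses • Assume we would like to compare two hypotheses h1 and h2, which we have tested on two independent samples S1 and S2 of size n1 and n2. • I.e. we are interested in the quantity: $d = \mathrm{error}(h_1|P) - \mathrm{error}(h_2|P)$ • Define an estimator for d: $\hat{d} = \mathrm{error}(h_1|X_1) - \mathrm{error}(h_2|X_2)$ with X1,X2 sample sets of size n1,n2. • Since error(h1|S1) and error(h2|S2) are both approximately Normal, their difference is approximately Normal with mean d and variance: $\sigma^2_{\hat{d}} \approx \frac{\mathrm{error}(h_1|S_1)(1-\mathrm{error}(h_1|S_1))}{n_1} + \frac{\mathrm{error}(h_2|S_2)(1-\mathrm{error}(h_2|S_2))}{n_2}$ • Hence, with N% confidence we believe that d is contained in the interval: $\hat{d} \pm z_N\, \sigma_{\hat{d}}$ Say, mean = -0.1, sqrt(var)=0.5, Z(0.8)=1.28. Do you want to conclude that h1 is significantly better than h2?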
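The closing question on this slide can be worked through numerically. A minimal sketch, assuming the interval has the usual form (observed difference) ± z * (standard deviation); the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def difference_std(e1, n1, e2, n2):
    """Standard deviation of the difference of two independent error estimates."""
    return sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)

def difference_interval(d_hat, std, confidence):
    """Two-sided interval for the true error difference (Normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d_hat - z * std, d_hat + z * std

# Numbers from the slide: mean = -0.1, sqrt(var) = 0.5, 80% confidence (z ~ 1.28).
low, high = difference_interval(d_hat=-0.1, std=0.5, confidence=0.80)
print(round(low, 2), round(high, 2))
```

The resulting interval comfortably contains zero, so these numbers do not support the conclusion that h1 is significantly better than h2.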

Paired Tests • It is more likely that we want to compare two hypotheses on the same data. • E.g. say we have a single data-set S and two hypotheses h1, h2. Split the data again into subsets s1,...,sk: • error(h1|s1)=0.1 error(h2|s1)=0.13 • error(h1|s2)=0.2 error(h2|s2)=0.21 • error(h1|s3)=0.66 error(h2|s3)=0.68 • error(h1|s4)=0.45 error(h2|s4)=0.47 • ... and so on. • We have var(error(h1)) = large, var(error(h2)) = large. • However, h1 is consistently better than h2. • We should look at the differences error(h1|si)-error(h2|si), not at the difference error(h1|S) - error(h2|S). • Problem: we cannot use: $\sigma^2_{\hat{d}} \approx \frac{\mathrm{error}(h_1|S)(1-\mathrm{error}(h_1|S))}{n} + \frac{\mathrm{error}(h_2|S)(1-\mathrm{error}(h_2|S))}{n}$ because that assumes the errors are independent. • Since they are estimates on the same data, they are not independent. • Solution: a “paired t-test” (next slide).

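Paired t-test • Split the data into subsets s1,...,sk with |si|>30. • On each subset compute the errors and their difference: $\delta_i = \mathrm{error}(h_1|s_i) - \mathrm{error}(h_2|s_i)$ • Now compute: $\bar{\delta} = \frac{1}{k} \sum_{i=1}^{k} \delta_i$ and $s_{\bar{\delta}} = \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (\delta_i - \bar{\delta})^2}$ • State: With N% confidence the difference in error between h1 and h2 is: $\bar{\delta} \pm t_{N,k-1}\, s_{\bar{\delta}}$ • “t” is the t-statistic with k-1 degrees of freedom, which is related to the Student-t distribution (Table 5.6).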
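As a worked example, the sketch below applies a paired t interval to the four subset errors listed on the Paired Tests slide. The function name is illustrative, and the t-value 3.182 (two-sided 95%, 3 degrees of freedom) is taken from a standard t table and passed in by hand, since the Python standard library has no Student-t quantile function. Four subsets is far too few in practice; it is used here only because those are the numbers the slides give.

```python
from math import sqrt

def paired_t_interval(errors_1, errors_2, t_value):
    """Interval for the mean per-subset error difference of two hypotheses."""
    deltas = [a - b for a, b in zip(errors_1, errors_2)]
    k = len(deltas)
    mean = sum(deltas) / k
    # Standard deviation of the mean difference, with k - 1 degrees of freedom.
    s = sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))
    return mean - t_value * s, mean + t_value * s

h1_errors = [0.10, 0.20, 0.66, 0.45]
h2_errors = [0.13, 0.21, 0.68, 0.47]
low, high = paired_t_interval(h1_errors, h2_errors, t_value=3.182)
print(round(low, 4), round(high, 4))
```

Although each hypothesis has widely varying error across subsets, the differences are tightly clustered, so the interval lies entirely below zero.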

Comparing Learning Algorithms • Again, we split the data into k subsets: S = {s1,s2,...,sk}. • Train both learning algorithm 1 (L1) and learning algorithm 2 (L2) on the complement of each subset, {S-s1,S-s2,...}, to produce hypotheses {L1(S-si), L2(S-si)} for all i. • Compute for all i: $\delta_i = \mathrm{error}(L_1(S-s_i)|s_i) - \mathrm{error}(L_2(S-s_i)|s_i)$ • As in the last slide, perform a paired t-test on these differences to compute an estimate and a confidence interval for the relative error of the hypotheses produced by L1 and L2. Homework: Are all assumptions for the statistical test respected? If not, find one that is violated. Do you think that this estimate is correct, optimistic or pessimistic?
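A sketch of how the per-subset differences might be collected before running the paired t-test. The two learners (`nearest_mean_learner`, `majority_learner`) and the synthetic data are hypothetical stand-ins for L1 and L2, chosen only to keep the example self-contained.

```python
import random

def nearest_mean_learner(train):
    """Toy L1 (illustrative): predict the class whose mean x is closest."""
    m0 = [x for x, y in train if y == 0]
    m1 = [x for x, y in train if y == 1]
    mean0, mean1 = sum(m0) / len(m0), sum(m1) / len(m1)
    return lambda x: 1 if abs(x - mean1) < abs(x - mean0) else 0

def majority_learner(train):
    """Toy L2 (illustrative): always predict the most frequent training label."""
    labels = [y for _, y in train]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner

def fold_differences(data, k, learner_1, learner_2):
    """Per-fold differences in held-out error between two learning algorithms."""
    folds = [data[i::k] for i in range(k)]
    deltas = []
    for i, held_out in enumerate(folds):
        train = [pt for j, f in enumerate(folds) if j != i for pt in f]  # S - s_i
        h1, h2 = learner_1(train), learner_2(train)
        e1 = sum(h1(x) != y for x, y in held_out) / len(held_out)
        e2 = sum(h2(x) != y for x, y in held_out) / len(held_out)
        deltas.append(e1 - e2)
    return deltas

rng = random.Random(1)
data = [(rng.gauss(2.0 * y, 1.0), y) for y in [i % 2 for i in range(300)]]
rng.shuffle(data)
deltas = fold_differences(data, 10, nearest_mean_learner, majority_learner)
mean_delta = sum(deltas) / len(deltas)
print(mean_delta)
```

Note that the training sets S-si overlap heavily across folds, which is relevant to the homework question about the independence assumptions.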

Evaluation: ROC curves • [Figure: score distributions of class 1 (positives) and class 0 (negatives), separated by a moving threshold.] • TP = true positive rate = # positives classified as positive divided by # positives • FP = false positive rate = # negatives classified as positive divided by # negatives • TN = true negative rate = # negatives classified as negative divided by # negatives • FN = false negative rate = # positives classified as negative divided by # positives • Identify a threshold in your classifier that you can shift. • Plot the ROC curve (TP rate against FP rate) while you shift that threshold.
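A minimal sketch of tracing an ROC curve by shifting a threshold over classifier scores. The scores and labels below are made up for illustration, and the convention that a score at or above the threshold means "positive" is an assumption.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs, one per threshold, for labels in {0, 1}."""
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    points = []
    # Start above every score so that the curve begins at (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    for t in thresholds:
        predicted_pos = [s >= t for s in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted_pos, labels))
        fp = sum(p and y == 0 for p, y in zip(predicted_pos, labels))
        points.append((fp / num_neg, tp / num_pos))
    return points

# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
for fp_rate, tp_rate in curve:
    print(round(fp_rate, 2), round(tp_rate, 2))
```

Lowering the threshold moves along the curve from (0, 0) to (1, 1); a classifier that separates the classes well hugs the top-left corner.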

Conclusion Never (ever) draw error-curves without confidence intervals (The second most important sentence of this course)

Appendix The following slide is optional:

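Hypothesis Tests • Suppose you again want to compare two hypotheses h1 and h2 on sample sets S1 and S2 of size n1 and n2. • Now we would like to answer: will h2 have significantly less error on future data than h1? • We can formulate this as: is the probability P(d>0) larger than 95% (say), where $d = \mathrm{error}(h_1|P) - \mathrm{error}(h_2|P)$? • Since the estimator $\hat{d}$ is approximately Normal with mean d and standard deviation $\sigma_{\hat{d}}$, this means: • What is the total probability that we observe $\hat{d} < d + \hat{d}_{obs}$ when d>0, where $\hat{d}_{obs}$ is the difference observed on the samples? • This can be computed as: $P(d > 0) \approx \int_{-\infty}^{\hat{d}_{obs}/\sigma_{\hat{d}}} \mathrm{Normal}(z|0,1)\, dz$ • (move the Gaussian to the right)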
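A sketch of the one-sided computation, assuming the Normal approximation so that P(d>0) is approximated by the standard Normal CDF evaluated at (observed difference)/(its standard deviation). The function name and both sets of example numbers are illustrative assumptions.

```python
from statistics import NormalDist

def prob_d_positive(d_hat, std):
    """Approximate P(d > 0) given an observed difference and its standard deviation."""
    return NormalDist().cdf(d_hat / std)

# Observed difference -0.1 with standard deviation 0.5: far from conclusive.
p_weak = prob_d_positive(d_hat=-0.1, std=0.5)
# Observed difference 0.10 with standard deviation 0.061: close to 95%.
p_strong = prob_d_positive(d_hat=0.10, std=0.061)
print(round(p_weak, 3), round(p_strong, 3))
```

Unlike the two-sided confidence interval, all of the excluded probability mass sits in a single tail here.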
