1 / 19

Evaluating Classifiers

Evaluating Classifiers. Lecture 2 Instructor: Max Welling. Read chapter 5. Evaluation. Given: Hypothesis h(x): X C, in hypothesis space H, mapping attributes x to classes c=[1,2,3,...C] A data-sample S(n) of size n. Questions: What is the error of “h” on unseen data?

jbiggs
Download Presentation

Evaluating Classifiers

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Evaluating Classifiers Lecture 2 Instructor: Max Welling Read chapter 5

  2. Evaluation • Given: • Hypothesis h(x): XC, in hypothesis space H, • mapping attributes x to classes c=[1,2,3,...C] • A data-sample S(n) of size n. • Questions: • What is the error of “h” on unseen data? • If we have two competing hypotheses, which one is better • on unseen data? • How do we compare two learning algorithms in the face of limited data? • How certain are we about our answers?

  3. Sample and True Error We can define two errors: 1) Error(h|S) is the error on the sample S: 2) Error(h|P) is the true error on the unseen data sampled from the distribution P(x): where f(x) is the true hypothesis.

  4. Binomial Distributions • Assume you toss a coin n times. • And it has probability p of coming heads (which we will call success) • What is the probability distribution governing the number of heads in n trials? • Answer: the Binomial distribution.

  5. Distribution over Errors • Consider some hypothesis h(x) • Draw n samples X~P(X). • Do this k times. • Compute e1=n*error(h|X1), e2=n*error(h|X2),...,ek=n*error(h|Xk). • {e1,...,ek} are samples from a Binomial distribution ! • Why? imagine a magic coin, where God secretly determines the probability • of heads by the following procedure. First He takes some random hypothesis h. • Then, He draws x~P(x) and observes if h(x) correctly predicts the label correctly. • If it does, he makes sure the coin lands heads up... • You have a single sample S, for which you observe • e(S) errors. What would be a reasonable estimate for Error(h|P) you think?

  6. Binomial Moments mean • If we match the mean, np, with the observed value n*error(h|S) we find: • If we match the variance we can obtain an estimate of the width:

  7. Confidence Intervals • We would like to state: • With N% confidence we believe that error(h|P) is contained in the interval: 80% Normal(0,1) • In principle is hard to compute exactly, but for np(1-p)>5 or n>30 it is safe to • approximate a Binomial by a Gaussian for which we can easily compute “z-values”.

  8. Bias-Variance • The estimator is unbiased if • Imagine again you have infinitely many sample sets X1,X2,.. of size n. • Use these to compute estimates E1,E2,... of p where Ei=error(h|Xi) • If the average of E1,E2,.. converges to p, then error(h|X) is an unbiased estimator. • Two unbiased estimators can still differ in their • variance (efficiency). Which one do you prefer? p Eav

  9. Flow of Thought • Determine the property you want to know about the future data (e.g. error(h|P)) • Find an unbiased estimator E for this quantity based on observing data X (e.g. error(h|X)) • Determine the distribution P(E) of E under the assumption you have infinitely • many sample sets X1,X2,...of some size n. (e.g. p(E)=Binomial(p,n), p=error(h|P)) • Estimate the parameters of P(E) from an actual data sample S (e.g. p=error(h|S)) • Compute mean and variance of P(E) and pray P(E) it is close to a Normal distribution. • (sums of random variables converge to normal distributions – central limit theorem) • State you confidence interval as: with confidence N% error(h|P) is contained in the interval

  10. Assumptions • We only consider discrete valued hypotheses (i.e. classes) • Training data and test data are drawn IID from the same distribution P(x). • (IID: independently & identically distributed) • The hypothesis must be chosen independently from the data sample S! Homework: Consider training a classifier h(x) on data S. Argue why the third assumption is violated.

  11. Evaluating Learned h(x) • If we learn h(x) from a sample Sn, it is a bad idea to evaluate it on the same data. • (You will be overly confident of yourself). • Instead, follow the following procedure: • Split the data into k subsets s1,..sK of size ni=N/k. • Learn a hypothesis on complement S-si • Compute error(h|si) on left-out subset si. • After you finished, compute the total error as: • Compute variance as: • This “avoids” violating assumption 3.

  12. Comparing 2 Hypotheses • Assume we like to compare 2 hypothesis h1 and h2, which we have • tested on two independent samples S1 and S2 of size n1 and n2. • I.e. we are interested in the quantity: ? • Define estimator for d: • with X1,X2 sample sets of size n1,n2. • Since error(h1|S1) and error(h2|S2) are both approximately Normal • their difference is approximately Normal with: • Hence, with N% confidence we believe that d is contained in the interval: Say, mean = -0.1, sqrt(var)=0.5. Z(0.8)=1.28. Do you want to conclude that h1 is significantly better than h2?

  13. Paired Tests • It is more likely that we want to compare 2 hypothesis on the same data. • E.g. say we have a single data-set S and two hypothesis h1, h2. Split the data • again into subsets s1,..sk • error(h1|s1)=0.1 error(h2|s1)=0.13 • error(h1|s2)=0.2 error(h2|s2)=0.21 • error(h1|s3)=0.66 error(h2|s3)=0.68 • error(h1|s4)=0.45 error(h2|s4)=0.47 • ... and so on. • We have var(error(h1)) = large, var(error(h2)) = large. • However, h1 is consistently better than h2. • We should look at differences error(h1|si)-error(h2|si), • not at differences error(h1|S) - error(h2|S) • Problem:we cannot use: • because that assumes the errors are independent. • Since they are estimates on the same data, they are not independent. • Solution: a “paired t-test” (next slide)

  14. Paired t-test • Chunk the data up in subsets s1,...,sk with |si|>30 • On each subset compute the error and compute: • Now compute: • State: With N% confidence the difference in error between h1 and h2 is: • “t” is the t-statistic which is related to the student-t distribution (table 5.6).

  15. Comparing Learning Algorithms • Again, we split the data into k subsets: S{s1,s2,...sk}. • Train both learning algorithm 1 (L1) and learning algorithm 2 (L2) on the complement • of each subset: {S-s1,S-s2,...) to produce hypotheses {L1(S-si), L2(S-si)} for all i. • Compute for all i : • As in the last slide, perform a paired t-test on these differences to compute an • estimate and a confidence interval for the relative error of the hypothesis produced • by L1 and L2. Homework: Are all assumptions for the statistical test respected? If not, find one that is violated. Do you think that this estimate is correct, optimistic or pessimistic?

  16. Evaluation: ROC curves moving threshold class 1 (positives) class 0 (negatives) TP = true positive rate = # positives classified as positive divided by # positives FP = false positive rate = # negatives classified as positives divided by # negatives TN = true negative rate = # negatives classified as negatives divided by # negatives FN = false negatives = # positives classified as negative divided by # positives Identify a threshold in your classifier that you can shift. Plot ROC curve while you shift that parameter.

  17. Conclusion Never (ever) draw error-curves without confidence intervals (The second most important sentence of this course)

  18. Appendix The following slide is optional:

  19. Hypothesis Tests • Consider you want to compare again two hypothesis h1 and h2 on • sample sets S1 and S2 of size n1 and n2. • Now we like to answer: will h2 have significantly less error on future data than h1? • We can formulate this as: is the probability P(d>0) larger than 95% (say). • Since this means: • What is the total probability that we observe • when d>0. • This can be computed as: • (move Gauss to the right)

More Related