How to be a Bayesian without believing


Presentation Transcript


  1. How to be a Bayesian without believing Yoav Freund Joint work with Rob Schapire and Yishay Mansour

  2. Motivation • Statistician: “Are you a Bayesian or a Frequentist?” • Yoav: “I don’t know, you tell me…” • I need a better answer….

  3. Toy example [figure: human voice pitch, male vs. female] • Computer receives a telephone call • Measures the pitch of the voice • Decides the gender of the caller

  4. Generative modeling [figure: two pitch distributions with parameters (mean1, var1) and (mean2, var2); x-axis: voice pitch, y-axis: probability]

  5. Discriminative approach [figure: number of mistakes as a function of the voice-pitch threshold]

  6. Discriminative Bayesian approach [figure: conditional probability, prior and posterior, as a function of voice pitch]

  7. Suggested approach [figure: number of mistakes vs. voice pitch, with regions marked “definitely female”, “unsure”, and “definitely male”]

  8. Formal Frameworks For stating theorems regarding the dependence of the generalization error on the size of the training set.

  9. The PAC set-up • Learner chooses a classifier set C, with each c ∈ C a map c: X → {-1,+1}, and requests m training examples • Nature chooses a target classifier c from C and a distribution P over X • Nature generates the training set (x1,y1), (x2,y2), … , (xm,ym) • Learner generates h: X → {-1,+1} Goal: P(h(x) ≠ c(x)) < ε for every c and P

  10. The agnostic set-up (Vapnik’s pattern-recognition problem) • Learner chooses a classifier set C, with each c ∈ C a map c: X → {-1,+1}, and requests m training examples • Nature chooses a distribution D over X × {-1,+1} • Nature generates the training set (x1,y1), (x2,y2), … , (xm,ym) according to D • Learner generates h: X → {-1,+1} Goal: PD(h(x) ≠ y) < PD(c*(x) ≠ y) + ε for every D, where c* = argmin over c ∈ C of PD(c(x) ≠ y)

  11. Self-bounding learning (Freund 97) • Learner selects a concept class C • Nature generates the training set T = (x1,y1), (x2,y2), … , (xm,ym) IID according to a distribution D over X × {-1,+1} • Learner generates h: X → {-1,+1} and a bound εT such that, with high probability over the random choice of the training set T, PD(h(x) ≠ y) < PD(c*(x) ≠ y) + εT Note that the bound depends on the training set!

  12. Learning a region predictor (Vovk 2000) • Learner selects a concept class C • Nature generates the training set (x1,y1), (x2,y2), … , (xm,ym) IID according to a distribution D over X × {-1,+1} • Learner generates h: X → { {-1}, {+1}, {-1,+1}, {} } such that, with high probability, PD(y ∉ h(x)) < PD(c*(x) ≠ y) + ε1 and PD(h(x) = {-1,+1}) < ε2

  13. Intuitions The rough idea

  14. A motivating example [figure: a cloud of training points labeled + and −, with three query points marked “?”]

  15. Distribution of errors [figure: the true error and the empirical error of the concepts in the class, on the interval [0, 1/2], in the worst case and in the typical case] Contenders for best → predict with their majority vote. Non-contenders → ignore!
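
In code, the rough idea on this slide looks roughly as follows. This is a minimal sketch, not the talk's algorithm: `contender_vote` and the tolerance `epsilon` are my names, and the hard cutoff between contenders and non-contenders is only the intuition (the algorithm on the later slides weights concepts smoothly).

```python
import numpy as np

def contender_vote(classifiers, X_train, y_train, x, epsilon=0.05):
    """Majority vote of the near-best ("contender") classifiers on example x.

    classifiers : list of callables mapping an example to -1 or +1
    X_train, y_train : labeled training examples, labels in {-1, +1}
    epsilon : tolerance defining the contenders (hypothetical parameter)
    Returns +1, -1, or 0 ("unsure") when the contender vote is split evenly.
    """
    # Training error of every classifier in the class.
    errors = np.array([np.mean([c(xi) != yi for xi, yi in zip(X_train, y_train)])
                       for c in classifiers])
    # Contenders: classifiers whose training error is within epsilon of the best.
    contenders = [c for c, e in zip(classifiers, errors) if e <= errors.min() + epsilon]
    # Majority vote among contenders; non-contenders are ignored.
    vote = sum(c(x) for c in contenders)
    return int(np.sign(vote))
```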

  16. Main result Finite concept class

  17. Notation • Data distribution: D over X × {-1,+1} • Generalization error: err(c) = PD(c(x) ≠ y) • Training set: T = (x1,y1), (x2,y2), … , (xm,ym) • Training error: êrr(c) = (1/m) · |{ i : c(xi) ≠ yi }|

  18. The algorithm • Parameters • Hypothesis weight • Empirical Log Ratio (ELR) • Prediction rule
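
A minimal sketch of how the pieces on this slide might fit together, assuming exponential hypothesis weights built from the training errors, a log ratio of the weighted +1 and −1 votes, and an “unsure” band; the names `eta` and `delta` and the exact weight form are assumptions, not the slide’s own notation.

```python
import numpy as np

def empirical_log_ratio(classifiers, train_errors, x, m, eta=1.0):
    """Log ratio of the weighted vote for +1 versus -1 on example x.

    Assumed weight form (not the slide's notation): w(c) = exp(-eta * m * err_hat(c)).
    """
    weights = np.exp(-eta * m * np.asarray(train_errors, dtype=float))
    plus = sum(w for c, w in zip(classifiers, weights) if c(x) == +1)
    minus = sum(w for c, w in zip(classifiers, weights) if c(x) == -1)
    # Small floor keeps the log finite when one side gets no weight at all.
    return (1.0 / eta) * np.log((plus + 1e-12) / (minus + 1e-12))

def predict_with_unsure(classifiers, train_errors, x, m, eta=1.0, delta=0.1):
    """Prediction rule: the sign of the ELR, with an 'unsure' band of half-width delta."""
    l = empirical_log_ratio(classifiers, train_errors, x, m, eta)
    if l > delta:
        return +1
    if l < -delta:
        return -1
    return 0  # unsure
```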

  19. Suggested tuning and the bound it yields

  20. Main properties • The ELR is very stable: the probability of a large deviation is independent of the size of the concept class. • The expected value of the ELR is close to the True Log Ratio (TLR), which uses the true hypothesis errors instead of the estimates. • The TLR is a good proxy for the best concept in the class.

  21. McDiarmid’s theorem If X1, … , Xm are independent random variables and f is a function with bounded differences (changing the i-th argument changes f by at most ci), then P( |f(X1, … , Xm) − E[f(X1, … , Xm)]| ≥ ε ) ≤ 2 exp( −2ε² / Σi ci² )
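
As an illustration of how the theorem is used (this worked special case is mine, not the slide’s): applied to the training error of a single concept, where changing one example moves the average by at most 1/m, McDiarmid gives a Hoeffding-style bound.

```latex
\[
\widehat{\mathrm{err}}(c) \;=\; \frac{1}{m}\sum_{i=1}^{m}\mathbf{1}\bigl[c(x_i)\neq y_i\bigr],
\qquad c_i = \tfrac{1}{m}\ \text{(one changed example moves the average by at most } 1/m\text{)}.
\]
\[
\Pr\Bigl(\bigl|\widehat{\mathrm{err}}(c)-\mathbb{E}\,\widehat{\mathrm{err}}(c)\bigr|\ge\epsilon\Bigr)
\;\le\; 2\exp\!\Bigl(-\tfrac{2\epsilon^{2}}{\sum_{i=1}^{m}(1/m)^{2}}\Bigr)
\;=\; 2\exp\bigl(-2m\epsilon^{2}\bigr).
\]
```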

  22. The empirical log ratio is stable Changing one training example changes each training error by at most 1/m, so the ELR has bounded variation.

  23. Bounded variation proof

  24. Infinite concept classes Geometry of the concept class

  25. Infinite concept classes • The stated bounds are vacuous. • How can we approximate an infinite class with a finite class? • Unlabeled examples give useful information.

  26. A metric space of classifiers [figure: classifier space and example space, with classifiers f and g at distance d(f,g)] d(f,g) = P( f(x) ≠ g(x) ) Neighboring models make similar predictions.
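
Because the metric involves only the inputs x and not the labels, it can be estimated from unlabeled examples. A minimal sketch (the function name and the plug-in sample estimate are mine):

```python
import numpy as np

def disagreement_distance(f, g, unlabeled_X):
    """Estimate d(f, g) = P( f(x) != g(x) ) from a sample of unlabeled examples."""
    return float(np.mean([f(x) != g(x) for x in unlabeled_X]))
```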

  27. ε-covers [figure: the classifier class inside classifier space, covered by balls of radius ε; the number of neighbors of each classifier grows as ε shrinks]

  28. Computational issues • How do we compute the ε-cover? • We can use unlabeled examples to generate the cover. • Estimate the prediction while ignoring concepts with high error.
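
One concrete way to build such a cover from unlabeled data, sketched under my own assumptions (a greedy construction over the empirical disagreement distance; the slide does not specify the procedure):

```python
import numpy as np

def greedy_epsilon_cover(classifiers, unlabeled_X, epsilon=0.05):
    """Greedily pick a subset of classifiers so that every classifier in the
    class is within empirical disagreement distance epsilon of some center.

    Distances are estimated on unlabeled examples only.
    """
    # Prediction matrix: one row per classifier, one column per unlabeled example.
    preds = np.array([[c(x) for x in unlabeled_X] for c in classifiers])
    uncovered = set(range(len(classifiers)))
    centers = []
    while uncovered:
        i = next(iter(uncovered))            # pick any uncovered classifier as a center
        centers.append(classifiers[i])
        # Remove everything within epsilon of the new center (including itself).
        dists = np.mean(preds != preds[i], axis=1)
        uncovered -= {j for j in uncovered if dists[j] <= epsilon}
    return centers
```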

  29. Application: comparing perfect features • 45,000 features • Training examples: 10² negative, 2-10 positive, 10⁴ unlabeled • More than one feature has zero training error. • Which feature(s) should we use? • How should we combine them?

  30. A typical perfect feature [figure: number of images vs. feature value for the unlabeled, positive, and negative examples]

  31. Pseudo-Bayes for a single threshold • The set of possible thresholds is uncountably infinite. • Use an ε-cover over the thresholds. • This is equivalent to using the distribution of unlabeled examples as the prior distribution over the set of thresholds.
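
A sketch of what this could look like for one feature, under the same assumed weight and log-ratio forms as before; placing the candidate thresholds at the unlabeled feature values plays the role of the prior, and `eta` and `delta` are hypothetical tuning parameters.

```python
import numpy as np

def threshold_pseudo_bayes(train_vals, train_labels, unlabeled_vals,
                           x, eta=1.0, delta=0.1):
    """Averaged single-threshold classifier for one feature.

    Each unlabeled value v defines the classifier "predict +1 iff feature > v";
    using the unlabeled values as the candidate thresholds acts as the prior.
    """
    train_vals = np.asarray(train_vals, dtype=float)
    train_labels = np.asarray(train_labels)          # labels in {-1, +1}
    thresholds = np.asarray(unlabeled_vals, dtype=float)
    m = len(train_vals)

    # Training error of every candidate threshold.
    preds = np.where(train_vals[None, :] > thresholds[:, None], 1, -1)
    errors = np.mean(preds != train_labels[None, :], axis=1)

    # Exponential weights and weighted vote on the new example x.
    weights = np.exp(-eta * m * errors)
    votes = np.where(x > thresholds, 1, -1)
    plus = weights[votes == 1].sum() + 1e-12
    minus = weights[votes == -1].sum() + 1e-12
    log_ratio = (1.0 / eta) * np.log(plus / minus)

    if log_ratio > delta:
        return +1
    if log_ratio < -delta:
        return -1
    return 0  # unsure
```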

  32. What it will do [figure: the prior weights, the error factor, and the resulting prediction (+1 / 0 / −1) as functions of the feature value, with the negative examples marked]

  33. Relation to large margins [figure: a neighborhood of good classifiers] SVM and AdaBoost search for a linear discriminator with a large margin.

  34. Relation to bagging • Bagging: generate classifiers from random subsets of the training set and predict according to the majority vote among those classifiers. (Another possibility: flip the labels of a small random subset of the training set.) • Bagging can be seen as a randomized estimate of the log ratio.
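
A small sketch of that reading of bagging (my construction, not the slide’s): decision stumps trained on bootstrap resamples, with the log of the vote ratio serving as a randomized stand-in for the log ratio.

```python
import numpy as np

def train_stump(X, y):
    """Brute-force best single-feature threshold stump; labels in {-1, +1}."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > t, sign, -sign)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    _, j, t, sign = best
    return lambda x: sign if x[j] > t else -sign

def bagged_log_ratio(X, y, x, n_bags=50, seed=0):
    """Randomized log-ratio estimate: log of the ratio of +1 to -1 votes over
    stumps trained on bootstrap resamples of the training set."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    m = len(y)
    plus = minus = 0
    for _ in range(n_bags):
        idx = rng.integers(0, m, size=m)      # a bootstrap sample of the training set
        h = train_stump(X[idx], y[idx])
        if h(x) == +1:
            plus += 1
        else:
            minus += 1
    return float(np.log((plus + 1) / (minus + 1)))   # +1 smoothing keeps the log finite
```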

  35. Bias/variance for classification • Bias: the error of predicting with the sign of the True Log Ratio (i.e., with an infinite training set). • Variance: the additional error from predicting with the sign of the Empirical Log Ratio, which is based on a finite training sample.

  36. New directions How a measure of confidence can help in practice

  37. Face Detection • Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).

  38. Using confidence to save time [figure: all boxes pass through Feature 1, then Feature 2, …; boxes that are definitely not a face are discarded early, boxes that might be a face continue] The detector combines 6000 simple features using AdaBoost. In most boxes, only 8-9 features are calculated.
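
A sketch of the early-rejection idea; the stage structure, score functions, and thresholds here are illustrative placeholders, not the Viola–Jones implementation.

```python
def cascade_classify(box, stages):
    """Evaluate a candidate box against a cascade of (score_fn, threshold) stages.

    The box is rejected as soon as one stage is confident it is not a face,
    so most boxes only pay for the first few features.
    """
    for score_fn, threshold in stages:
        if score_fn(box) < threshold:
            return False   # definitely not a face: stop early
    return True            # survived every stage: might be a face
```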

  39. Selective sampling [figure: a loop in which a partially trained classifier is run on unlabeled data, a sample of its unconfident examples is sent for labeling, and the resulting labeled examples are fed back into training]
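
A sketch of that loop; the `fit`, `confidence`, and `label_oracle` callables are placeholders for whatever model and labeling process one actually has.

```python
def selective_sampling(unlabeled, label_oracle, fit, confidence,
                       rounds=10, batch_size=20):
    """Iteratively query labels only for the examples the current model is least sure about."""
    labeled = []                      # list of (x, y) pairs gathered so far
    model = fit(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        # Rank the remaining pool by the model's confidence, least confident first.
        pool.sort(key=lambda x: confidence(model, x))
        queries, pool = pool[:batch_size], pool[batch_size:]
        labeled += [(x, label_oracle(x)) for x in queries]
        model = fit(labeled)          # retrain on the enlarged labeled set
    return model
```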

  40. Co-training [figure: images that might contain faces are split into color info and shape info; a partially trained color-based classifier and a partially trained shape-based classifier each pass their confident predictions to the other]
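
A compact sketch of such a loop; `fit`, `predict`, `confidence`, and the two view-extraction functions are placeholders, and the confidence threshold is a hypothetical parameter.

```python
def co_train(unlabeled, labeled_a, labeled_b, view_a, view_b,
             fit, predict, confidence, rounds=5, threshold=0.9):
    """Two classifiers, one per view (e.g. color and shape), label confident
    examples for each other.  labeled_a / labeled_b are lists of (x, y) pairs."""
    def fit_view(labeled, view):
        return fit([(view(x), y) for x, y in labeled])

    model_a, model_b = fit_view(labeled_a, view_a), fit_view(labeled_b, view_b)
    pool = list(unlabeled)
    for _ in range(rounds):
        remaining = []
        for x in pool:
            if confidence(model_a, view_a(x)) > threshold:
                # Classifier A is confident: its prediction becomes a label for training B.
                labeled_b.append((x, predict(model_a, view_a(x))))
            elif confidence(model_b, view_b(x)) > threshold:
                labeled_a.append((x, predict(model_b, view_b(x))))
            else:
                remaining.append(x)
        pool = remaining
        model_a, model_b = fit_view(labeled_a, view_a), fit_view(labeled_b, view_b)
    return model_a, model_b
```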

  41. Summary • Bayesian averaging is justifiable even without Bayesian assumptions. • Infinite concept classes: use ε-covers. • Efficient implementations (thresholds, SVM, boosting, bagging, …) are still largely open. • Calibration (recent work of Vovk). • A good measure of confidence is very important in practice. • More than two classes (predicting with a subset of the labels).
