
On-line learning and Boosting


Presentation Transcript


  1. On-line learning and Boosting • Overview of “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” by Freund and Schapire (1997) • Tim Miller, University of Minnesota, Department of Computer Science and Engineering

  2. Hedge - Motivation • Generalization of the Weighted Majority Algorithm • Given a set of expert predictions, minimize mistakes over time • Slight emphasis in the motivation on the possibility of treating the weight vector w as a prior.

  3. Hedge Algorithm • Parameters: β ∈ [0, 1], initial weight vector w^1, number of trials T • For t = 1..T • Choose allocation p^t (probability distribution formed by normalizing the weight vector w^t) • Receive loss vector l^t • Suffer loss p^t · l^t • Set new weight vector w^(t+1)_i = w^t_i · β^(l^t_i)
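
Slide 3 gives the Hedge(β) loop procedurally; the following Python sketch is a minimal illustration of that loop, assuming the losses arrive as externally supplied vectors in [0, 1] (the function name run_hedge and the toy loss sequence are illustrative, not from the paper).

  import numpy as np

  def run_hedge(loss_vectors, beta=0.9):
      """Minimal Hedge(beta) loop: keep one weight per strategy, allocate in
      proportion to the weights, and shrink each weight by beta**loss."""
      loss_vectors = np.asarray(loss_vectors, dtype=float)   # shape (T, N)
      n_strategies = loss_vectors.shape[1]
      w = np.ones(n_strategies) / n_strategies   # initial weights w^1 (uniform)
      total_loss = 0.0
      for loss in loss_vectors:          # one trial per loss vector l^t
          p = w / w.sum()                # allocation p^t
          total_loss += float(p @ loss)  # suffer loss p^t · l^t
          w = w * beta ** loss           # w^(t+1)_i = w^t_i · beta^(l^t_i)
      return total_loss, w

  # Toy run: three strategies, four trials of made-up losses in [0, 1]
  losses = [[0.2, 0.9, 0.5],
            [0.1, 0.8, 0.6],
            [0.3, 0.7, 0.4],
            [0.0, 1.0, 0.5]]
  print(run_hedge(losses))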

  4. Hedge Analysis • Does not perform “too much worse” than the best strategy: • L_Hedge(β) ≤ ( −ln(w^1_i) − L_i ln β ) · Z • Z = 1 / (1 − β) • Is it possible to do better?
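
Spelled out, the bound compares Hedge's cumulative loss to that of every individual strategy i; with uniform initial weights w^1_i = 1/N it specializes as on the second line below (a direct rewriting of the slide's inequality, not an additional result):

  \[
  L_{\mathrm{Hedge}(\beta)} \;\le\; \frac{-\ln w^1_i - L_i \ln\beta}{1-\beta}
  \qquad\text{for every strategy } i,
  \]
  \[
  \text{so for } w^1_i = \tfrac{1}{N}:\qquad
  L_{\mathrm{Hedge}(\beta)} \;\le\; \frac{\big(\min_i L_i\big)\ln(1/\beta) + \ln N}{1-\beta}.
  \]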

  5. Boosting • If we have n classifiers, possibly looking at the problem from different perspectives, how can we optimally combine them? • Example: we have a collection of “rules of thumb” for predicting horse races; how should we weight them?

  6. Definitions • Given labeled data < x, c(x) >, where c is the target concept, c: X → {0, 1} • c ∈ C, the concept class • Strong PAC-learning algorithm: for parameters ε, δ, the hypothesis has error less than ε with probability (1 − δ) • Weak algorithm: error ε ≤ (0.5 − γ), γ > 0

  7. AdaBoost Algorithm • Input: • Sequence of N labeled examples • Distribution D over the N examples • Weak learning algorithm (called WeakLearn) • Number of iterations T

  8. AdaBoost contd. • Initialize: w^1 = D • For t = 1..T • Form probability distribution p^t from w^t • Call WeakLearn with distribution p^t; get back hypothesis h_t • Calculate error ε_t = Σ_(i=1..N) p^t_i |h_t(x_i) − y_i| • Set β_t = ε_t / (1 − ε_t) • Multiplicatively adjust weights: w^(t+1)_i = w^t_i · β_t^(1 − |h_t(x_i) − y_i|)

  9. AdaBoost Output • Output 1 if: • Σ_(t=1..T) (log 1/β_t) h_t(x) ≥ ½ Σ_(t=1..T) log 1/β_t • Output 0 otherwise • Computes a weighted average of the weak hypotheses, thresholded at ½
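
Slides 7–9 together specify the binary AdaBoost loop; the sketch below is a minimal Python illustration of those steps. The decision-stump weak learner (weak_learn), the small clamp that keeps β_t inside (0, 1) when a stump is perfect, and the toy dataset are assumptions made for the example, not part of the paper.

  import numpy as np

  def weak_learn(X, y, p):
      """Hypothetical weak learner: the single-feature threshold stump with the
      lowest weighted error under distribution p (labels and outputs in {0, 1})."""
      best = None
      for j in range(X.shape[1]):
          for thresh in np.unique(X[:, j]):
              for sign in (1, -1):
                  pred = (sign * (X[:, j] - thresh) >= 0).astype(int)
                  err = float(np.sum(p * np.abs(pred - y)))
                  if best is None or err < best[0]:
                      best = (err, j, thresh, sign)
      _, j, thresh, sign = best
      return lambda Z: (sign * (Z[:, j] - thresh) >= 0).astype(int)

  def adaboost(X, y, T=10):
      """Paper-style AdaBoost: beta_t = eps_t / (1 - eps_t), correctly classified
      examples are down-weighted, final output is the weighted threshold vote."""
      N = len(y)
      w = np.ones(N) / N                            # w^1 = D (uniform here)
      hyps, betas = [], []
      for _ in range(T):
          p = w / w.sum()                           # distribution p^t
          h = weak_learn(X, y, p)                   # weak hypothesis h_t
          miss = np.abs(h(X) - y)                   # |h_t(x_i) - y_i|
          eps = float(np.sum(p * miss))             # eps_t
          eps = min(max(eps, 1e-10), 0.5 - 1e-10)   # keep beta_t in (0, 1)
          beta = eps / (1 - eps)                    # beta_t
          w = w * beta ** (1 - miss)                # shrink weights of correct examples
          hyps.append(h)
          betas.append(beta)
      def final(Z):
          votes = sum(np.log(1 / b) * h(Z) for h, b in zip(hyps, betas))
          return (votes >= 0.5 * sum(np.log(1 / b) for b in betas)).astype(int)
      return final

  # Toy 1-D example (illustrative only)
  X = np.array([[0.1], [0.2], [0.35], [0.6], [0.8], [0.9]])
  y = np.array([0, 0, 0, 1, 1, 1])
  print(adaboost(X, y, T=5)(X))    # recovers [0 0 0 1 1 1] on this separable set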

  10. AdaBoost Analysis • Note the “dual” relationship with Hedge • Strategies ↔ Examples • Trials ↔ Weak hypotheses • Hedge increases weight for successful strategies, AdaBoost increases weight for difficult examples • AdaBoost has a dynamic β (a new β_t each round)

  11. AdaBoost Bounds • ε ≤ 2^T Π_(t=1..T) sqrt(ε_t(1 − ε_t)) • Previous bounds depended on the maximum error of the weakest hypothesis (weak-link syndrome) • AdaBoost takes advantage of gains from the best hypotheses
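
To see why this improves on a weakest-link bound, it helps to rewrite it in terms of the per-round edges γ_t = ½ − ε_t; the exponential form below follows from 1 − x ≤ e^(−x) and is consistent with the paper's analysis (the rewriting is standard, not an additional claim from the slides):

  \[
  \epsilon \;\le\; 2^{T}\prod_{t=1}^{T}\sqrt{\epsilon_t\,(1-\epsilon_t)}
  \;=\;\prod_{t=1}^{T}\sqrt{1-4\gamma_t^{2}}
  \;\le\;\exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big),
  \qquad \gamma_t=\tfrac{1}{2}-\epsilon_t .
  \]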

  12. Multi-class Setting • k > 2 output labels, i.e. Y = {1, 2, …, k} • Error: Probability of incorrect prediction • Two algorithms: • AdaBoost.M1 – More direct • AdaBoost.M2 – Somewhat complex constraints on weak learners • Could also just divide into “one vs. one” or “one vs. all” categories

  13. AdaBoost.M1 • Requires each classifier to have error less than 50% (a stronger requirement than in the binary case) • Similar to the regular AdaBoost algorithm except: • Error is 1 if h_t(x_i) ≠ y_i • Can’t use algorithms with error > 0.5 • Algorithm outputs a vector of length k with values between 0 and 1

  14. AdaBoost.M1 Analysis • ε ≤ 2^T Π_(t=1..T) sqrt(ε_t(1 − ε_t)) • Same as the bound for regular AdaBoost • The proof converts the multi-class problem to a binary setup • Can we improve this algorithm?

  15. AdaBoost.M2 • More expressive, more complex constraints on weak hypotheses • Defines idea of “Pseudo-Loss” • Pseudo-loss of each weak hypothesis must be better than chance • Benefit: Allows contributions from hypotheses with accuracy < 0.5

  16. Pseudo-loss • Replaces the straightforward loss of AdaBoost.M1 • ploss_q(h, i) = 0.5 ( 1 − h(x_i, y_i) + Σ_(y≠y_i) q(i, y) h(x_i, y) ) • Intuition: for each incorrect label, pit it against the known label in a binary classification (second term), then take a weighted average • Makes use of the information in the entire hypothesis vector, not just the prediction
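
As a quick check of the formula, here is a small Python version; the function name pseudo_loss is illustrative, and it assumes the weak hypothesis is given as a length-k score vector h(x_i, ·) with values in [0, 1].

  import numpy as np

  def pseudo_loss(h_scores, y_true, q_row):
      """Pseudo-loss of one example i (formula on slide 16).
      h_scores: length-k vector of scores h(x_i, y) in [0, 1]
      y_true:   index of the correct label y_i
      q_row:    weights q(i, y) over the incorrect labels (entry at y_true is 0)"""
      k = len(h_scores)
      wrong = [y for y in range(k) if y != y_true]
      weighted_wrong = sum(q_row[y] * h_scores[y] for y in wrong)
      return 0.5 * (1.0 - h_scores[y_true] + weighted_wrong)

  # k = 3 labels, correct label 0, q uniform over the two wrong labels:
  # 0.5 * (1 - 0.7 + 0.5*0.2 + 0.5*0.4) = 0.3
  print(pseudo_loss(np.array([0.7, 0.2, 0.4]), 0, np.array([0.0, 0.5, 0.5])))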

  17. AdaBoost.M2 Details • Extra init: w^1_(i,y) = D(i) / (k − 1) for each y ≠ y_i • For each iteration t = 1 to T • W^t_i = Σ_(y≠y_i) w^t_(i,y) • q_t(i, y) = w^t_(i,y) / W^t_i • D_t(i) = W^t_i / Σ_(i=1..N) W^t_i • WeakLearn gets D_t as well as q_t • Calculate the pseudo-loss ε_t as shown above • β_t = ε_t / (1 − ε_t) • w^(t+1)_(i,y) = w^t_(i,y) · β_t^((0.5)(1 + h_t(x_i, y_i) − h_t(x_i, y)))
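
This bookkeeping translates fairly directly into array operations. The sketch below performs one round; the helper name m2_round, the convention of storing the mislabel weights as an N × k matrix with the correct-label entries held at zero, and the random stand-in score matrix are assumptions made for illustration.

  import numpy as np

  def m2_round(w, y, H):
      """One round of the AdaBoost.M2 updates from slide 17.
      w: (N, k) mislabel weights w^t_(i,y), with w[i, y[i]] == 0
      y: (N,)  correct label indices y_i
      H: (N, k) weak-hypothesis scores h_t(x_i, y) in [0, 1]
      Returns (updated weights, pseudo-loss eps_t, beta_t)."""
      N, k = w.shape
      W = w.sum(axis=1)                                  # W^t_i (sum over wrong labels)
      q = w / W[:, None]                                 # q_t(i, y); zero at the correct label
      D = W / W.sum()                                    # D_t(i)
      correct = H[np.arange(N), y]                       # h_t(x_i, y_i)
      ploss = 0.5 * (1 - correct + (q * H).sum(axis=1))  # pseudo-loss per example
      eps = float((D * ploss).sum())                     # eps_t
      beta = eps / (1 - eps)                             # beta_t
      w_new = w * beta ** (0.5 * (1 + correct[:, None] - H))
      w_new[np.arange(N), y] = 0.0                       # keep correct-label slots at zero
      return w_new, eps, beta

  # Initial weights w^1_(i,y) = D(i) / (k - 1) for y != y_i (uniform D here)
  N, k = 4, 3
  y = np.array([0, 1, 2, 0])
  w = np.full((N, k), (1.0 / N) / (k - 1))
  w[np.arange(N), y] = 0.0
  H = np.random.rand(N, k)                               # stand-in weak-hypothesis scores
  print(m2_round(w, y, H)[1])                            # the round's pseudo-loss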

  18. Error Bounds • ε ≤ (k − 1) 2^T Π_(t=1..T) sqrt(ε_t(1 − ε_t)) • Where ε is the traditional error and the ε_t are pseudo-losses

  19. Regression Setting • Instead of picking from a discrete set of output labels, choose a continuous value • More formally, Y = [0, 1] • Minimize the mean squared error: • E[(h(x) − y)^2] • Reduce to binary classification and use AdaBoost!

  20. How it works (roughly) • For each example in the training set, create a continuum of associated instances x̃(x_i, y) where y ∈ [0, 1] • Label is 1 if y ≥ y_i • Mapping to an infinite training set: need to convert discrete distributions to density functions
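
Since the continuum of instances cannot be materialized directly, the sketch below discretizes y onto a finite grid; the grid size and the helper name expand_example are illustrative choices made for this example, not details fixed by the slide.

  import numpy as np

  def expand_example(x_i, y_i, grid_size=6):
      """Discretized stand-in for the reduction on slide 20: pair the instance
      with candidate values y on a grid over [0, 1] and label each pair 1 if
      y >= y_i, else 0."""
      ys = np.linspace(0.0, 1.0, grid_size)
      instances = [(x_i, y) for y in ys]
      labels = (ys >= y_i).astype(int)
      return instances, labels

  # A training point with target value 0.35: grid [0, .2, .4, .6, .8, 1] -> [0 0 1 1 1 1]
  _, labels = expand_example(x_i=np.array([1.2, -0.7]), y_i=0.35)
  print(labels)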

  21. AdaBoost.R Bounds • ε ≤ 2^T Π_(t=1..T) sqrt(ε_t(1 − ε_t))

  22. Conclusions • Starting from an on-line learning perspective, it is possible to generalize to boosting • Boosting can take weak learners and convert them into strong learners • This paper presented several boosting algorithms, with proofs of error bounds
