
On-line learning and Boosting


Presentation Transcript


  1. On-line learning and Boosting • Overview of “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” by Freund and Schapire (1997) • Tim Miller, University of Minnesota, Department of Computer Science and Engineering

  2. Hedge - Motivation • Generalization of the Weighted Majority Algorithm • Given a set of expert predictions, minimize mistakes over time • Slight emphasis in the motivation on the possibility of treating the weight vector w as a prior.

  3. Hedge Algorithm • Parameters: β ∈ [0, 1], initial weight vector w^1, number of trials T • For t = 1..T • Choose allocation p^t (probability distribution formed by normalizing the weight vector w^t) • Receive loss vector l^t • Suffer loss p^t · l^t • Set new weight vector w^(t+1)_i = w^t_i · β^(l^t_i)
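
Slide 3 gives the Hedge(β) loop procedurally; the following Python sketch is a minimal illustration of that loop, assuming the losses arrive as externally supplied vectors in [0, 1] (the function name run_hedge and the toy loss sequence are illustrative, not from the paper).

  import numpy as np

  def run_hedge(loss_vectors, beta=0.9):
      """Minimal Hedge(beta) loop: keep one weight per strategy, allocate in
      proportion to the weights, and shrink each weight by beta**loss."""
      loss_vectors = np.asarray(loss_vectors, dtype=float)   # shape (T, N)
      n_strategies = loss_vectors.shape[1]
      w = np.ones(n_strategies) / n_strategies   # initial weights w^1 (uniform)
      total_loss = 0.0
      for loss in loss_vectors:          # one trial per loss vector l^t
          p = w / w.sum()                # allocation p^t
          total_loss += float(p @ loss)  # suffer loss p^t · l^t
          w = w * beta ** loss           # w^(t+1)_i = w^t_i · beta^(l^t_i)
      return total_loss, w

  # Toy run: three strategies, four trials of made-up losses in [0, 1]
  losses = [[0.2, 0.9, 0.5],
            [0.1, 0.8, 0.6],
            [0.3, 0.7, 0.4],
            [0.0, 1.0, 0.5]]
  print(run_hedge(losses))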

  4. Hedge Analysis • Does not perform “too much worse” than the best strategy: • L_Hedge(β) ≤ ( −ln(w^1_i) − L_i ln β ) · Z • Z = 1 / (1 − β) • Is it possible to do better?
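
Spelled out, the bound compares Hedge's cumulative loss to that of every individual strategy i; with uniform initial weights w^1_i = 1/N it specializes as on the second line below (a direct rewriting of the slide's inequality, not an additional result):

  \[
  L_{\mathrm{Hedge}(\beta)} \;\le\; \frac{-\ln w^1_i - L_i \ln\beta}{1-\beta}
  \qquad\text{for every strategy } i,
  \]
  \[
  \text{so for } w^1_i = \tfrac{1}{N}:\qquad
  L_{\mathrm{Hedge}(\beta)} \;\le\; \frac{\big(\min_i L_i\big)\ln(1/\beta) + \ln N}{1-\beta}.
  \]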

  5. Boosting • If we have n classifiers, possibly looking at the problem from different perspectives, how can we optimally combine them? • Example: we have a collection of “rules of thumb” for predicting horse races; how should we weight them?

  6. Definitions • Given labeled data < x, c(x) >, where c is the target concept, c: X → {0, 1} • c ∈ C, the concept class • Strong PAC-learning algorithm: for parameters ε, δ, the hypothesis has error less than ε with probability (1 − δ) • Weak algorithm: error ε ≤ (0.5 − γ), γ > 0

  7. AdaBoost Algorithm • Input: • Sequence of N labeled examples • Distribution D over the N examples • Weak learning algorithm (called WeakLearn) • Number of iterations T

  8. AdaBoost contd. • Initialize: w^1 = D • For t = 1..T • Form probability distribution p^t from w^t • Call WeakLearn with distribution p^t; get back hypothesis h_t • Calculate error ε_t = Σ_(i=1..N) p^t_i |h_t(x_i) − y_i| • Set β_t = ε_t / (1 − ε_t) • Multiplicatively adjust weights: w^(t+1)_i = w^t_i · β_t^(1 − |h_t(x_i) − y_i|)

  9. AdaBoost Output • Output 1 if: • Σ_(t=1..T) (log 1/β_t) h_t(x) ≥ ½ Σ_(t=1..T) log 1/β_t • Output 0 otherwise • Computes a weighted average of the weak hypotheses, thresholded at ½
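
Slides 7–9 together specify the binary AdaBoost loop; the sketch below is a minimal Python illustration of those steps. The decision-stump weak learner (weak_learn), the small clamp that keeps β_t inside (0, 1) when a stump is perfect, and the toy dataset are assumptions made for the example, not part of the paper.

  import numpy as np

  def weak_learn(X, y, p):
      """Hypothetical weak learner: the single-feature threshold stump with the
      lowest weighted error under distribution p (labels and outputs in {0, 1})."""
      best = None
      for j in range(X.shape[1]):
          for thresh in np.unique(X[:, j]):
              for sign in (1, -1):
                  pred = (sign * (X[:, j] - thresh) >= 0).astype(int)
                  err = float(np.sum(p * np.abs(pred - y)))
                  if best is None or err < best[0]:
                      best = (err, j, thresh, sign)
      _, j, thresh, sign = best
      return lambda Z: (sign * (Z[:, j] - thresh) >= 0).astype(int)

  def adaboost(X, y, T=10):
      """Paper-style AdaBoost: beta_t = eps_t / (1 - eps_t), correctly classified
      examples are down-weighted, final output is the weighted threshold vote."""
      N = len(y)
      w = np.ones(N) / N                            # w^1 = D (uniform here)
      hyps, betas = [], []
      for _ in range(T):
          p = w / w.sum()                           # distribution p^t
          h = weak_learn(X, y, p)                   # weak hypothesis h_t
          miss = np.abs(h(X) - y)                   # |h_t(x_i) - y_i|
          eps = float(np.sum(p * miss))             # eps_t
          eps = min(max(eps, 1e-10), 0.5 - 1e-10)   # keep beta_t in (0, 1)
          beta = eps / (1 - eps)                    # beta_t
          w = w * beta ** (1 - miss)                # shrink weights of correct examples
          hyps.append(h)
          betas.append(beta)
      def final(Z):
          votes = sum(np.log(1 / b) * h(Z) for h, b in zip(hyps, betas))
          return (votes >= 0.5 * sum(np.log(1 / b) for b in betas)).astype(int)
      return final

  # Toy 1-D example (illustrative only)
  X = np.array([[0.1], [0.2], [0.35], [0.6], [0.8], [0.9]])
  y = np.array([0, 0, 0, 1, 1, 1])
  print(adaboost(X, y, T=5)(X))    # recovers [0 0 0 1 1 1] on this separable set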

  10. AdaBoost Analysis • Note the “dual” relationship with Hedge • Strategies ↔ Examples • Trials ↔ Weak hypotheses • Hedge increases weight for successful strategies, AdaBoost increases weight for difficult examples • AdaBoost has a dynamic β (a new β_t each round)

  11. AdaBoost Bounds • ε ≤ 2^T Π_(t=1..T) sqrt(ε_t(1 − ε_t)) • Previous bounds depended on the maximum error of the weakest hypothesis (weak-link syndrome) • AdaBoost takes advantage of gains from the best hypotheses
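
To see why this improves on a weakest-link bound, it helps to rewrite it in terms of the per-round edges γ_t = ½ − ε_t; the exponential form below follows from 1 − x ≤ e^(−x) and is consistent with the paper's analysis (the rewriting is standard, not an additional claim from the slides):

  \[
  \epsilon \;\le\; 2^{T}\prod_{t=1}^{T}\sqrt{\epsilon_t\,(1-\epsilon_t)}
  \;=\;\prod_{t=1}^{T}\sqrt{1-4\gamma_t^{2}}
  \;\le\;\exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big),
  \qquad \gamma_t=\tfrac{1}{2}-\epsilon_t .
  \]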

  12. Multi-class Setting • k > 2 output labels, i.e. Y = {1, 2, …, k} • Error: Probability of incorrect prediction • Two algorithms: • AdaBoost.M1 – More direct • AdaBoost.M2 – Somewhat complex constraints on weak learners • Could also just divide into “one vs. one” or “one vs. all” categories

  13. AdaBoost.M1 • Requires each classifier to have error less than 50% (a stronger requirement than in the binary case) • Similar to the regular AdaBoost algorithm except: • Error is 1 if h_t(x_i) ≠ y_i • Can’t use algorithms with error > 0.5 • Algorithm outputs a vector of length k with values between 0 and 1

  14. AdaBoost.M1 Analysis • ε ≤ 2^T Π_(t=1..T) sqrt(ε_t(1 − ε_t)) • Same as the bound for regular AdaBoost • The proof converts the multi-class problem to a binary setup • Can we improve this algorithm?

  15. AdaBoost.M2 • More expressive, more complex constraints on weak hypotheses • Defines idea of “Pseudo-Loss” • Pseudo-loss of each weak hypothesis must be better than chance • Benefit: Allows contributions from hypotheses with accuracy < 0.5

  16. Pseudo-loss • Replaces the straightforward loss of AdaBoost.M1 • ploss_q(h, i) = 0.5 ( 1 − h(x_i, y_i) + Σ_(y≠y_i) q(i, y) h(x_i, y) ) • Intuition: for each incorrect label, pit it against the known label in a binary classification (second term), then take a weighted average • Makes use of the information in the entire hypothesis vector, not just the prediction
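
As a quick check of the formula, here is a small Python version; the function name pseudo_loss is illustrative, and it assumes the weak hypothesis is given as a length-k score vector h(x_i, ·) with values in [0, 1].

  import numpy as np

  def pseudo_loss(h_scores, y_true, q_row):
      """Pseudo-loss of one example i (formula on slide 16).
      h_scores: length-k vector of scores h(x_i, y) in [0, 1]
      y_true:   index of the correct label y_i
      q_row:    weights q(i, y) over the incorrect labels (entry at y_true is 0)"""
      k = len(h_scores)
      wrong = [y for y in range(k) if y != y_true]
      weighted_wrong = sum(q_row[y] * h_scores[y] for y in wrong)
      return 0.5 * (1.0 - h_scores[y_true] + weighted_wrong)

  # k = 3 labels, correct label 0, q uniform over the two wrong labels:
  # 0.5 * (1 - 0.7 + 0.5*0.2 + 0.5*0.4) = 0.3
  print(pseudo_loss(np.array([0.7, 0.2, 0.4]), 0, np.array([0.0, 0.5, 0.5])))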

  17. AdaBoost.M2 Details • Extra init: w^1_(i,y) = D(i) / (k − 1) for each y ≠ y_i • For each iteration t = 1 to T • W^t_i = Σ_(y≠y_i) w^t_(i,y) • q_t(i, y) = w^t_(i,y) / W^t_i • D_t(i) = W^t_i / Σ_(i=1..N) W^t_i • WeakLearn gets D_t as well as q_t • Calculate the pseudo-loss ε_t as shown above • β_t = ε_t / (1 − ε_t) • w^(t+1)_(i,y) = w^t_(i,y) · β_t^((0.5)(1 + h_t(x_i, y_i) − h_t(x_i, y)))
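
This bookkeeping translates fairly directly into array operations. The sketch below performs one round; the helper name m2_round, the convention of storing the mislabel weights as an N × k matrix with the correct-label entries held at zero, and the random stand-in score matrix are assumptions made for illustration.

  import numpy as np

  def m2_round(w, y, H):
      """One round of the AdaBoost.M2 updates from slide 17.
      w: (N, k) mislabel weights w^t_(i,y), with w[i, y[i]] == 0
      y: (N,)  correct label indices y_i
      H: (N, k) weak-hypothesis scores h_t(x_i, y) in [0, 1]
      Returns (updated weights, pseudo-loss eps_t, beta_t)."""
      N, k = w.shape
      W = w.sum(axis=1)                                  # W^t_i (sum over wrong labels)
      q = w / W[:, None]                                 # q_t(i, y); zero at the correct label
      D = W / W.sum()                                    # D_t(i)
      correct = H[np.arange(N), y]                       # h_t(x_i, y_i)
      ploss = 0.5 * (1 - correct + (q * H).sum(axis=1))  # pseudo-loss per example
      eps = float((D * ploss).sum())                     # eps_t
      beta = eps / (1 - eps)                             # beta_t
      w_new = w * beta ** (0.5 * (1 + correct[:, None] - H))
      w_new[np.arange(N), y] = 0.0                       # keep correct-label slots at zero
      return w_new, eps, beta

  # Initial weights w^1_(i,y) = D(i) / (k - 1) for y != y_i (uniform D here)
  N, k = 4, 3
  y = np.array([0, 1, 2, 0])
  w = np.full((N, k), (1.0 / N) / (k - 1))
  w[np.arange(N), y] = 0.0
  H = np.random.rand(N, k)                               # stand-in weak-hypothesis scores
  print(m2_round(w, y, H)[1])                            # the round's pseudo-loss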

  18. Error Bounds • ε ≤ (k − 1) 2^T Π_(t=1..T) sqrt(ε_t(1 − ε_t)) • Where ε is the traditional error and the ε_t are pseudo-losses

  19. Regression Setting • Instead of picking from a discrete set of output labels, choose a continuous value • More formally, Y = [0, 1] • Minimize the mean squared error: • E[(h(x) − y)^2] • Reduce to binary classification and use AdaBoost!

  20. How it works (roughly) • For each example in the training set, create a continuum of associated instances x̃(x_i, y) where y ∈ [0, 1] • Label is 1 if y ≥ y_i • Mapping to an infinite training set: need to convert discrete distributions to density functions
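
Since the continuum of instances cannot be materialized directly, the sketch below discretizes y onto a finite grid; the grid size and the helper name expand_example are illustrative choices made for this example, not details fixed by the slide.

  import numpy as np

  def expand_example(x_i, y_i, grid_size=6):
      """Discretized stand-in for the reduction on slide 20: pair the instance
      with candidate values y on a grid over [0, 1] and label each pair 1 if
      y >= y_i, else 0."""
      ys = np.linspace(0.0, 1.0, grid_size)
      instances = [(x_i, y) for y in ys]
      labels = (ys >= y_i).astype(int)
      return instances, labels

  # A training point with target value 0.35: grid [0, .2, .4, .6, .8, 1] -> [0 0 1 1 1 1]
  _, labels = expand_example(x_i=np.array([1.2, -0.7]), y_i=0.35)
  print(labels)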

  21. AdaBoost.R Bounds • ε ≤ 2^T Π_(t=1..T) sqrt(ε_t(1 − ε_t))

  22. Conclusions • Starting from an on-line learning perspective, it is possible to generalize to boosting • Boosting can take weak learners and convert them into strong learners • This paper presented several boosting algorithms, with proofs of error bounds
