
Boosting

LING 572

Fei Xia

02/01/06



Outline

  • Basic concepts

  • Theoretical validity

  • Case study:

    • POS tagging

  • Summary



Basic concepts



Overview of boosting

  • Introduced by Schapire and Freund in the 1990s.

  • “Boosting”: convert a weak learning algorithm into a strong one.

  • Main idea: Combine many weak classifiers to produce a powerful committee.

  • Algorithms:

    • AdaBoost: adaptive boosting

    • Gentle AdaBoost

    • BrownBoost



Bagging

[Diagram: random samples are drawn from the training data with replacement; the learner ML is trained on each sample, producing classifiers f1, f2, …, fT, which are combined into the final classifier f.]
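
A minimal sketch of this procedure, assuming a `base_learner(sample) -> classifier` callable and a list of training examples (both names are illustrative):

```python
import random
from collections import Counter

def bagging(train_data, base_learner, T):
    """Train T classifiers on bootstrap samples and combine them by majority vote.
    `base_learner(sample) -> classifier` and `train_data` are placeholder names."""
    classifiers = []
    for _ in range(T):
        # Random sample with replacement, same size as the original training set
        sample = [random.choice(train_data) for _ in range(len(train_data))]
        classifiers.append(base_learner(sample))

    def combined(x):
        # Unweighted majority vote over f_1, ..., f_T
        votes = Counter(f(x) for f in classifiers)
        return votes.most_common(1)[0][0]

    return combined
```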



Boosting

[Diagram: starting from the training sample, the learner ML is run repeatedly on reweighted samples, producing classifiers f1, f2, …, fT, which are combined into the final classifier f.]



Main ideas

  • Train a set of weak hypotheses: h1, …, hT.

  • The combined hypothesis H is a weighted majority vote of the T weak hypotheses.

    • Each hypothesis ht has a weight αt.

  • During training, focus on the examples that are misclassified.

     ⇒ At round t, example xi has weight Dt(i).



Algorithm highlight

  • Training time: learn the pairs (h1, α1), …, (ht, αt), …

  • Test time: for x,

    • Call each classifier ht and calculate ht(x)

    • Calculate the weighted sum Σt αt ht(x) and take its sign (see the sketch below)
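
A minimal sketch of the test-time computation, assuming the hypotheses and weights come from training (names are illustrative):

```python
def predict(x, hypotheses, alphas):
    """Weighted majority vote: H(x) = sign(sum_t alpha_t * h_t(x)),
    where each h_t maps x to -1 or +1."""
    score = sum(alpha * h(x) for h, alpha in zip(hypotheses, alphas))
    return 1 if score >= 0 else -1
```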



Basic Setting

  • Binary classification problem

  • Training data: {(x1, y1), …, (xm, ym)}, where xi ∈ X and yi ∈ {-1, +1}

  • Dt(i): the weight of xi at round t. D1(i)=1/m.

  • A learner L that finds a weak hypothesis ht: X → Y given the training set and Dt

  • The error of a weak hypothesis ht: εt = the sum of Dt(i) over all i with ht(xi) ≠ yi



The basic AdaBoost algorithm

  • For t = 1, …, T:

  • Train weak learner ht: X → {-1, +1} using the training data and Dt

  • Get the error rate: εt = the sum of Dt(i) over all i with ht(xi) ≠ yi

  • Choose the classifier weight: αt = ½ ln((1 − εt) / εt)

  • Update the instance weights: Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt normalizes Dt+1 to sum to 1
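
A sketch of this loop, assuming a `weak_learner(xs, ys, D)` callable that returns a hypothesis h: x -> {-1, +1}; the names and signatures are illustrative, not from the slides:

```python
import math

def adaboost(xs, ys, weak_learner, T):
    """Basic AdaBoost for binary labels y in {-1, +1}.
    `weak_learner(xs, ys, D) -> h` returns a hypothesis h: x -> {-1, +1};
    the function names and signatures here are illustrative."""
    m = len(xs)
    D = [1.0 / m] * m                       # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(xs, ys, D)         # train on the data weighted by D_t
        # Weighted error: total weight of the examples h misclassifies
        eps = sum(D[i] for i in range(m) if h(xs[i]) != ys[i])
        if eps <= 0 or eps >= 0.5:          # perfect, or no better than random: stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)   # classifier weight alpha_t
        # D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
        D = [D[i] * math.exp(-alpha * ys[i] * h(xs[i])) for i in range(m)]
        Z = sum(D)                          # normalization factor Z_t
        D = [d / Z for d in D]
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas
```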



The new weights

When ht(xi) = yi (correct prediction), the update factor exp(−αt yi ht(xi)) = exp(−αt) < 1, so the weight of xi decreases.

When ht(xi) ≠ yi (misclassification), the factor is exp(αt) > 1, so the weight of xi increases.



An example

[Figure: a small 2D toy dataset with positive (+) and negative (o) examples, used to trace the weight updates.]


Two iterations

Initial weights:

1st iteration:

2nd iteration:



The general AdaBoost algorithm



The basic and general algorithms

  • In the basic algorithm, it can be proven that the choice αt = ½ ln((1 − εt) / εt) minimizes Zt, giving Zt = 2 √(εt (1 − εt))

  • The hypothesis weight αt is decided at round t

  • Dt (the weight distribution over training examples) is updated at every round t.

  • Choice of weak learner:

    • its error should be less than 0.5: εt < 1/2 (better than random guessing)

    • Ex: a decision tree (e.g., C4.5) or a decision stump (sketched below)
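
For illustration, a weighted decision stump over numeric feature vectors; this is only a sketch of one possible weak learner, with illustrative names:

```python
def train_stump(xs, ys, D):
    """A weighted decision stump: pick the (feature, threshold, polarity)
    with the lowest weighted error under D.
    Purely illustrative; any weak learner with error below 0.5 would do."""
    best_err, best = float("inf"), None
    for j in range(len(xs[0])):
        for thresh in sorted({x[j] for x in xs}):
            for polarity in (+1, -1):
                # Predict +polarity on one side of the threshold, -polarity on the other
                preds = [polarity if x[j] >= thresh else -polarity for x in xs]
                err = sum(d for d, p, y in zip(D, preds, ys) if p != y)
                if err < best_err:
                    best_err, best = err, (j, thresh, polarity)
    j, thresh, polarity = best
    return lambda x: polarity if x[j] >= thresh else -polarity
```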



Experiment results (Freund and Schapire, 1996)

Error rate on a set of 27 benchmark problems



Theoretical validity



Training error of H(x)

Final hypothesis: H(x) = sign(f(x)), where f(x) = Σt αt ht(x)

Training error is defined to be (1/m) |{i : H(xi) ≠ yi}|

It can be proved that the training error is at most ∏t Zt
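
A sketch of the standard argument behind this bound, writing f(x) = Σt αt ht(x):

```latex
% Unravel the weight update: D_{T+1}(i) = e^{-y_i f(x_i)} / (m \prod_t Z_t),
% and note that the D_{T+1}(i) sum to 1 while 1[H(x_i) \ne y_i] \le e^{-y_i f(x_i)}:
\[
\frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\bigl[H(x_i)\neq y_i\bigr]
\;\le\; \frac{1}{m}\sum_{i=1}^{m} e^{-y_i f(x_i)}
\;=\; \prod_{t=1}^{T} Z_t
\]
```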



Training error for basic algorithm

Let γt = 1/2 − εt

Training error ≤ ∏t 2 √(εt (1 − εt)) = ∏t √(1 − 4 γt²) ≤ exp(−2 Σt γt²)

⇒ Training error drops exponentially fast (provided each γt is bounded away from zero).



Generalization error (expected test error)

  • Generalization error, with high probability, is at most the training error plus Õ(√(T d / m)), where:

    T: the number of rounds of boosting

    m: the size of the sample

    d: VC-dimension of the base classifier space



Selecting weak hypotheses

  • Training error ≤ ∏t Zt

  • Choose ht that minimizes Zt.

  • See “case study” for details.



Multiclass boosting



Two ways

  • Converting a multiclass problem to binary problem first:

    • One-vs-all (sketched after this list)

    • All-pairs

    • ECOC

  • Extending boosting directly

    • AdaBoost.M1

    • AdaBoost.M2 → Problem 2 in Hw5
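
A minimal sketch of the one-vs-all reduction listed above, assuming a binary trainer `train_binary(xs, binary_ys)` (e.g., AdaBoost) that returns a real-valued scoring function; all names are illustrative:

```python
def one_vs_all(xs, ys, labels, train_binary):
    """Reduce a multiclass problem to one binary problem per label.
    `train_binary(xs, binary_ys) -> scorer` is a placeholder for any binary
    learner (e.g., AdaBoost) whose scorer(x) returns a real-valued confidence."""
    scorers = {}
    for label in labels:
        binary_ys = [1 if y == label else -1 for y in ys]
        scorers[label] = train_binary(xs, binary_ys)

    def classify(x):
        # Pick the label whose binary scorer is most confident about x
        return max(labels, key=lambda label: scorers[label](x))

    return classify
```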



Case study



Overview (Abney, Schapire and Singer, 1999)

  • Boosting applied to Tagging and PP attachment

  • Issues:

    • How to learn weak hypotheses?

    • How to deal with multi-class problems?

    • Local decision vs. globally best sequence



Weak hypotheses

  • In this paper, a weak hypothesis h simply tests a predicate (a.k.a. feature), Φ:

    h(x) = p1 if Φ(x) is true, h(x) = p0 otherwise

     ⇒ h(x) = pΦ(x), i.e., Φ(x) ∈ {0, 1} selects the prediction (see the sketch after this list)

  • Examples:

    • POS tagging: Φ is “PreviousWord=the”

    • PP attachment: Φ is “V=accused, N1=president, P=of”

  • Choosing a list of hypotheses ⇒ choosing a list of features.
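
A minimal sketch of such a predicate-based weak hypothesis; the feature representation and the values of p0 and p1 below are purely illustrative:

```python
def make_weak_hypothesis(phi, p0, p1):
    """h(x) = p1 if the predicate phi(x) holds, else p0, i.e. h(x) = p_{phi(x)}."""
    return lambda x: p1 if phi(x) else p0

# Hypothetical feature representation: the POS-tagging predicate
# "PreviousWord=the" as a boolean test on a dict-valued instance x.
h = make_weak_hypothesis(lambda x: x.get("prev_word") == "the", p0=-0.3, p1=0.7)
```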



Finding weak hypotheses

  • The training error of the combined hypothesis is at most ∏t Zt,

    where Zt = Σi Dt(i) exp(−αt yi ht(xi))

     ⇒ choose ht that minimizes Zt.

  • ht corresponds to a (Φt, p0, p1) tuple.



  • Schapire and Singer (1998) show that given a predicate Φ, Zt is minimized when pj = ½ ln(W(j, +1) / W(j, −1)) for j ∈ {0, 1},

    where W(j, b) is the total weight Σ Dt(i) over the examples with Φ(xi) = j and yi = b; the minimized value is Zt = 2 Σj √(W(j, +1) · W(j, −1)).



Finding weak hypotheses (cont)

  • For each Φ, calculate Zt

    Choose the one with the minimum Zt (see the sketch below).
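
A sketch of this search using the formulas above; the smoothing constant guards against log(0) and is an implementation convenience, not part of the original method, and all names are illustrative:

```python
import math

def best_weak_hypothesis(xs, ys, D, predicates, smooth=1e-8):
    """For each candidate predicate phi, compute the optimal predictions
    (p0, p1) and the resulting Z_t, and return the tuple with the smallest Z_t.
    Assumes labels ys are in {-1, +1} and D sums to 1."""
    best = None
    for phi in predicates:
        # W[j][b]: total weight of examples with phi(x) = j and label y = b
        W = {j: {+1: 0.0, -1: 0.0} for j in (0, 1)}
        for x, y, d in zip(xs, ys, D):
            W[1 if phi(x) else 0][y] += d
        # Optimal prediction for block j: p_j = 1/2 * ln(W(j, +1) / W(j, -1))
        p = {j: 0.5 * math.log((W[j][+1] + smooth) / (W[j][-1] + smooth))
             for j in (0, 1)}
        # Normalizer achieved by these predictions: Z = 2 * sum_j sqrt(W(j,+1) * W(j,-1))
        Z = 2 * sum(math.sqrt(W[j][+1] * W[j][-1]) for j in (0, 1))
        if best is None or Z < best[0]:
            best = (Z, phi, p[0], p[1])
    return best  # (Z_t, phi, p0, p1)
```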



Boosting results on POS tagging?



Sequential model

  • Sequential model: a Viterbi-style optimization to choose a globally best sequence of labels.
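
A generic Viterbi sketch of that idea; the per-position score (e.g., derived from the boosted classifier's predictions) and the transition score are assumed inputs, not details taken from the paper:

```python
def viterbi(words, tags, score, trans):
    """Viterbi-style search for the globally best tag sequence.
    `score(word, tag)` and `trans(prev_tag, tag)` are assumed scoring functions."""
    # best[i][t]: score of the best tag sequence for words[:i+1] that ends in tag t
    best = [{t: score(words[0], t) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + trans(p, t))
            best[i][t] = best[i - 1][prev] + trans(prev, t) + score(words[i], t)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))
```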



Previous results



Summary



Main ideas

  • Boosting combines many weak classifiers to produce a powerful committee.

  • The base learning algorithm only needs to be better than random guessing.

  • The instance weights are updated during training to put more emphasis on hard examples.



Strengths of AdaBoost

  • Theoretical validity: it comes with a set of theoretical guarantees (e.g., bounds on training and test error)

  • It performs well on many tasks.

  • It can identify outliers, i.e., examples that are either mislabeled or inherently ambiguous and hard to categorize.



Weakness of AdaBoost

  • The actual performance of boosting depends on the data and the base learner.

  • Boosting seems to be especially susceptible to noise.

  • When the number of outliers is very large, the emphasis placed on the hard examples can hurt performance.

     ⇒ “Gentle AdaBoost”, “BrownBoost”



Other properties

  • Simplicity (conceptual)

  • Efficiency at training

  • Efficiency at testing time

  • Handling multi-class

  • Interpretability



Bagging vs. Boosting (Freund and Schapire 1996)

  • Bagging always uses resampling rather than reweighting.

  • Bagging does not modify the weight distribution over examples or mislabels, but instead always uses the uniform distribution.

  • In forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses.



Relation to other topics

  • Game theory

  • Linear programming

  • Bregman distances

  • Support-vector machines

  • Brownian motion

  • Logistic regression

  • Maximum-entropy methods such as iterative scaling.



Additional slides



Sources of Bias and Variance

  • Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data

  • Variance arises when the classifier overfits the data

  • There is often a tradeoff between bias and variance



Effect of Bagging

  • If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias.

  • In practice, bagging can reduce both bias and variance

    • For high-bias classifiers, it can reduce bias

    • For high-variance classifiers, it can reduce variance



Effect of Boosting

  • In the early iterations, boosting is primarily a bias-reducing method

  • In later iterations, it appears to be primarily a variance-reducing method



How to choose αt for ht with range [-1,1]?

  • Training error ≤ ∏t Zt, with Zt = Σi Dt(i) exp(−αt yi ht(xi))

  • Choose αt that minimizes Zt (one common choice is sketched below).
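
One common choice, from Schapire and Singer's confidence-rated analysis, minimizes an upper bound on Zt rather than Zt itself; a sketch, with margins[i] holding yi·ht(xi) (names illustrative):

```python
import math

def choose_alpha(D, margins):
    """With r_t = sum_i D_t(i) * y_i * h_t(x_i), set
    alpha_t = 1/2 * ln((1 + r_t) / (1 - r_t)), which minimizes an upper bound
    on Z_t when h_t(x) lies in [-1, 1]. Assumes |r_t| < 1 and D sums to 1."""
    r = sum(d * m for d, m in zip(D, margins))
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    # The Z_t actually achieved by this alpha (it could also be minimized numerically)
    Z = sum(d * math.exp(-alpha * m) for d, m in zip(D, margins))
    return alpha, Z
```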



Issues

  • Given ht, how to choose αt?

  • How to select ht?



How to choose αt when ht has range {-1,1}?

