
Boosting

LING 572

Fei Xia

02/01/06

Outline
  • Basic concepts
  • Theoretical validity
  • Case study:
    • POS tagging
  • Summary
Overview of boosting
  • Introduced by Schapire and Freund in the 1990s.
  • “Boosting”: convert a weak learning algorithm into a strong one.
  • Main idea: Combine many weak classifiers to produce a powerful committee.
  • Algorithms:
    • AdaBoost: adaptive boosting
    • Gentle AdaBoost
    • BrownBoost
Bagging

[Diagram: T random samples are drawn with replacement from the training data; the learner ML is trained on each sample, producing classifiers f1, …, fT, which are combined into the final classifier f.]
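To make the diagram concrete, here is a minimal bagging sketch, assuming scikit-learn's DecisionTreeClassifier as the base learner ML, NumPy arrays X and y with labels in {-1, +1}, and an odd T to avoid vote ties; the function names are illustrative, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed stand-in for the base learner "ML"

def bagging_fit(X, y, T=11, seed=0):
    """Train T classifiers, each on a bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    m = len(X)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)              # random sample with replacement
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # f_1, ..., f_T
    return classifiers

def bagging_predict(classifiers, X):
    """Combine f_1, ..., f_T by an unweighted majority vote over labels in {-1, +1}."""
    votes = np.sum([clf.predict(X) for clf in classifiers], axis=0)
    return np.sign(votes)
```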
Boosting

[Diagram: ML is trained on the original training sample to produce f1; the sample is then reweighted and ML is retrained on each weighted sample to produce f2, …, fT; the weak classifiers are combined into the final classifier f.]
Main ideas
  • Train a set of weak hypotheses: h1, …, hT.
  • The combined hypothesis H is a weighted majority vote of the T weak hypotheses.
    • Each hypothesis ht has a weight αt.
  • During the training, focus on the examples that are misclassified.

→ At round t, example xi has the weight Dt(i).

Algorithm highlight
  • Training time: learn (h1, α1), …, (ht, αt), …
  • Test time: for x,
    • Call each classifier ht and calculate ht(x)
    • Calculate the weighted sum Σt αt * ht(x) and take its sign (see the sketch below)
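A minimal sketch of this test-time step, with illustrative names; it assumes each ht returns -1 or +1 for a single example and that the αt come from training.

```python
def adaboost_predict(weak_hyps, alphas, x):
    """Test time: call each h_t on x, form the weighted sum, and return its sign."""
    score = sum(alpha_t * h_t(x) for h_t, alpha_t in zip(weak_hyps, alphas))
    return 1 if score >= 0 else -1
```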
Basic Setting
  • Binary classification problem
  • Training data: (x1, y1), …, (xm, ym), where each yi ∈ {-1, 1}
  • Dt(i): the weight of xi at round t. D1(i) = 1/m.
  • A learner L that finds a weak hypothesis ht: X → Y given the training set and Dt
  • The error of a weak hypothesis ht: εt = Σi: ht(xi)≠yi Dt(i), i.e., the probability under Dt that ht makes a mistake
The basic AdaBoost algorithm
  • For t = 1, …, T:
    • Train a weak learner ht: X → {-1, 1} using the training data and Dt
    • Get the error rate: εt = Σi: ht(xi)≠yi Dt(i)
    • Choose the classifier weight: αt = ½ ln((1 − εt) / εt)
    • Update the instance weights: Dt+1(i) = Dt(i) · exp(−αt · yi · ht(xi)) / Zt, where Zt normalizes Dt+1 to sum to 1 (a minimal code sketch follows)
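A minimal sketch of the loop above, assuming decision stumps (single-feature thresholds) as the weak learner and NumPy arrays X (m × n) and y with labels in {-1, +1}; all names are illustrative, not from the slides.

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: pick the (feature, threshold, sign) stump with the
    smallest weighted error under the current distribution D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    h = lambda A, j=j, thr=thr, sign=sign: np.where(A[:, j] <= thr, sign, -sign)
    return err, h

def adaboost_train(X, y, T=10):
    """Basic AdaBoost: returns the weak hypotheses h_1..h_T and weights alpha_1..alpha_T."""
    m = len(y)
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
    hyps, alphas = [], []
    for t in range(T):
        eps, h = train_stump(X, y, D)        # weighted error eps_t of h_t
        eps = min(max(eps, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * h(X))    # up-weight mistakes, down-weight correct ones
        D = D / D.sum()                      # normalize: the divisor is Z_t
        hyps.append(h)
        alphas.append(alpha)
    return hyps, alphas
```

Prediction then follows the weighted-vote sketch shown earlier, applying each ht to the examples and taking the sign of Σt αt · ht(x).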
The new weights

When ht(xi) = yi (xi is classified correctly): Dt+1(i) = Dt(i) · e^(−αt) / Zt → the weight decreases.

When ht(xi) ≠ yi (xi is misclassified): Dt+1(i) = Dt(i) · e^(αt) / Zt → the weight increases.

Two iterations

Initial weights: D1(i) = 1/m for every example.

1st iteration: train h1, compute ε1 and α1, and reweight: misclassified examples are scaled by e^(α1), correctly classified ones by e^(−α1), then renormalize to get D2.

2nd iteration: repeat with D2 to get ε2, α2, and D3 (a hypothetical numeric example follows).

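The concrete numbers on this slide did not survive extraction, so here is a hypothetical toy run with m = 4 examples, assuming the round-t hypothesis misclassifies exactly one example (example 0 in round 1, example 1 in round 2).

```python
import numpy as np

m = 4
D = np.full(m, 1 / m)                          # initial weights: D_1(i) = 1/4

for t, wrong in enumerate((0, 1), start=1):    # index misclassified in round t
    miss = np.zeros(m, dtype=bool)
    miss[wrong] = True
    eps = D[miss].sum()                        # weighted error eps_t
    alpha = 0.5 * np.log((1 - eps) / eps)      # classifier weight alpha_t
    D = D * np.exp(np.where(miss, alpha, -alpha))
    D = D / D.sum()                            # divide by Z_t
    print(f"round {t}: eps={eps:.3f} alpha={alpha:.3f} D={np.round(D, 3)}")

# round 1: eps=0.250 alpha=0.549 D=[0.5, 0.167, 0.167, 0.167]
# round 2: eps=0.167 alpha=0.805 D=[0.3, 0.5, 0.1, 0.1]
```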
The basic and general algorithms
  • In the basic algorithm, it can be proven that Zt = 2√(εt(1 − εt))
  • The hypothesis weight αt is decided at round t
  • Dt (the weight distribution over training examples) is updated at every round t
  • Choice of weak learner:
    • its error should be less than 0.5: εt < 0.5
    • Ex: DT (C4.5), decision stump
Experiment results (Freund and Schapire, 1996)

[Figure: error rates on a set of 27 benchmark problems.]

Training error of H(x)

Final hypothesis: H(x) = sign(Σt αt · ht(x))

Training error is defined to be (1/m) · |{i : H(xi) ≠ yi}|

It can be proved that the training error ≤ Πt Zt

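A sketch of the standard argument behind this bound (the slide itself gives only the statement): unravel the weight update and bound the 0/1 loss by the exponential.

```latex
% Unraveling D_{t+1}(i) = D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} / Z_t over all T rounds:
D_{T+1}(i) = \frac{\exp\!\big(-y_i \sum_t \alpha_t h_t(x_i)\big)}{m \prod_t Z_t}
           = \frac{e^{-y_i f(x_i)}}{m \prod_t Z_t},
\qquad f(x) = \sum_t \alpha_t h_t(x).

% Since [[H(x_i) \ne y_i]] \le e^{-y_i f(x_i)} and \sum_i D_{T+1}(i) = 1:
\frac{1}{m}\sum_i [[H(x_i) \ne y_i]]
  \le \frac{1}{m}\sum_i e^{-y_i f(x_i)}
  = \sum_i D_{T+1}(i) \prod_t Z_t
  = \prod_t Z_t .
```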
Training error for basic algorithm

Let γt = ½ − εt

Training error ≤ Πt Zt = Πt 2√(εt(1 − εt)) = Πt √(1 − 4γt²) ≤ exp(−2 Σt γt²)

→ Training error drops exponentially fast.

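For instance, under the hypothetical assumption that every round achieves an edge γt ≥ 0.1, the bound decays as:

```latex
\prod_t Z_t \;\le\; e^{-2\sum_t \gamma_t^2} \;\le\; e^{-0.02\,T},
\qquad \text{e.g. } e^{-0.02 \cdot 200} \approx 0.018 .
```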
Generalization error (expected test error)
  • Generalization error, with high probability, is at most the training error plus Õ(√(T·d / m)), where:

T: the number of rounds of boosting

m: the size of the sample

d: VC-dimension of the base classifier space

Selecting weak hypotheses
  • Training error ≤ Πt Zt
  • Choose the ht that minimizes Zt.
  • See “case study” for details.
Two ways to handle multi-class problems
  • Converting a multiclass problem to a binary problem first:
    • One-vs-all (see the sketch below)
    • All-pairs
    • ECOC
  • Extending boosting directly:
    • AdaBoost.M1
    • AdaBoost.M2 → Prob 2 in Hw5
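A minimal one-vs-all sketch, assuming a binary trainer such as the adaboost_train function sketched earlier (all names are illustrative); AdaBoost.M1/M2 themselves extend boosting directly and are not shown here.

```python
import numpy as np

def one_vs_all_train(X, y, labels, T=10):
    """Train one binary booster per class: class k is +1, every other class is -1."""
    return {k: adaboost_train(X, np.where(y == k, 1, -1), T=T) for k in labels}

def one_vs_all_predict(models, X):
    """Predict the class whose booster has the largest score sum_t alpha_t h_t(x)."""
    labels = list(models)
    scores = np.stack([sum(a * h(X) for h, a in zip(*models[k])) for k in labels])
    return np.array(labels)[np.argmax(scores, axis=0)]
```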
Overview (Abney, Schapire and Singer, 1999)
  • Boosting applied to Tagging and PP attachment
  • Issues:
    • How to learn weak hypotheses?
    • How to deal with multi-class problems?
    • Local decision vs. globally best sequence
Weak hypotheses
  • In this paper, a weak hypothesis h simply tests a predicate (a.k.a. feature) Φ:

h(x) = p1 if Φ(x) is true, h(x) = p0 otherwise

→ h(x) = p_Φ(x)

  • Examples:
    • POS tagging: Φ is “PreviousWord=the”
    • PP attachment: Φ is “V=accused, N1=president, P=of”
  • Choosing a list of hypotheses → choosing a list of features (a small sketch follows).
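A small sketch of such a predicate-based hypothesis, representing an example as a feature dictionary; the predicate and the values p0, p1 below are made up for illustration.

```python
def make_weak_hypothesis(phi, p0, p1):
    """h(x) = p1 if the predicate phi(x) is true, else p0, i.e. h(x) = p_{phi(x)}."""
    return lambda x: p1 if phi(x) else p0

# Hypothetical POS-tagging predicate: "PreviousWord=the"
phi = lambda x: x.get("PreviousWord") == "the"
h = make_weak_hypothesis(phi, p0=-0.2, p1=0.7)      # made-up prediction values
print(h({"PreviousWord": "the", "Word": "book"}))   # -> 0.7
```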
Finding weak hypotheses
  • The training error of the combined hypothesis is at most Πt Zt,

where Zt = Σi Dt(i) · exp(−αt · yi · ht(xi))

→ choose the ht that minimizes Zt.

  • ht corresponds to a (Φt, p0, p1) tuple.
Finding weak hypotheses (cont)
  • For each Φ, calculate Zt.

Choose the one with the minimum Zt (see the sketch below).

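One way to carry this out, following the confidence-rated criterion of Schapire and Singer (1999) that Abney et al. build on: for each predicate Φ, let W_b^j be the total weight of examples with label b falling in block j of the partition; the block predictions are p_j = ½ ln(W_+^j / W_-^j) (with smoothing) and Zt = 2 Σj √(W_+^j · W_-^j). A rough sketch, with illustrative names:

```python
import numpy as np

def best_predicate(predicates, examples, y, D, smooth=1e-8):
    """Return the (phi, p0, p1) tuple with the smallest Z_t.
    examples: list of feature dicts; y: NumPy array of labels in {-1, +1};
    D: current example weights (NumPy array summing to 1)."""
    best = None
    for phi in predicates:
        block = np.array([1 if phi(x) else 0 for x in examples])   # Phi(x) in {0, 1}
        Z, p = 0.0, [0.0, 0.0]
        for j in (0, 1):
            w_pos = D[(block == j) & (y == +1)].sum()              # W_+^j
            w_neg = D[(block == j) & (y == -1)].sum()              # W_-^j
            Z += 2 * np.sqrt(w_pos * w_neg)
            p[j] = 0.5 * np.log((w_pos + smooth) / (w_neg + smooth))
        if best is None or Z < best[0]:
            best = (Z, phi, p[0], p[1])
    _, phi, p0, p1 = best
    return phi, p0, p1
```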
Sequential model
  • Sequential model: a Viterbi-style optimization to choose a globally best sequence of labels.
Main ideas
  • Boosting combines many weak classifiers to produce a powerful committee.
  • The base learning algorithm only needs to be better than random guessing.
  • The instance weights are updated during training to put more emphasis on hard examples.
Strengths of AdaBoost
  • Theoretical validity: it comes with a set of theoretical guarantees (e.g., on training error and test error).
  • It performs well on many tasks.
  • It can identify outliers, i.e., examples that are either mislabeled or inherently ambiguous and hard to categorize.
Weakness of AdaBoost
  • The actual performance of boosting depends on the data and the base learner.
  • Boosting seems to be especially susceptible to noise.
  • When the number of outliers is very large, the emphasis placed on the hard examples can hurt performance.

→ variants such as “Gentle AdaBoost” and “BrownBoost” address this.

Other properties
  • Simplicity (conceptual)
  • Efficiency at training
  • Efficiency at testing time
  • Handling multi-class
  • Interpretability
Bagging vs. Boosting (Freund and Schapire 1996)
  • Bagging always uses resampling rather than reweighting.
  • Bagging does not modify the weight distribution over examples or mislabels; instead it always uses the uniform distribution.
  • In forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses.
Relation to other topics
  • Game theory
  • Linear programming
  • Bregman distances
  • Support-vector machines
  • Brownian motion
  • Logistic regression
  • Maximum-entropy methods such as iterative scaling.
Sources of Bias and Variance
  • Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data
  • Variance arises when the classifier overfits the data
  • There is often a tradeoff between bias and variance
Effect of Bagging
  • If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias.
  • In practice, bagging can reduce both bias and variance
    • For high-bias classifiers, it can reduce bias
    • For high-variance classifiers, it can reduce variance
Effect of Boosting
  • In the early iterations, boosting is primarily a bias-reducing method.
  • In later iterations, it appears to be primarily a variance-reducing method.
How to choose αt for ht with range [-1,1]?
  • Training error ≤ Πt Zt, with Zt = Σi Dt(i) · exp(−αt · yi · ht(xi))
  • Choose the αt that minimizes Zt (one standard choice is sketched below).

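One standard choice (due to Schapire and Singer, not spelled out on the slide): with rt = Σi Dt(i) · yi · ht(xi) and yi · ht(xi) ∈ [-1, 1], convexity gives an upper bound on Zt whose minimizer has a closed form.

```latex
Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
    \;\le\; \frac{1+r_t}{2}\, e^{-\alpha_t} + \frac{1-r_t}{2}\, e^{\alpha_t},
\qquad
\alpha_t = \frac{1}{2}\ln\frac{1+r_t}{1-r_t}
\ \text{ minimizes the right-hand side.}
```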
Issues
  • Given ht, how to choose αt?
  • How to select ht?