Boosting

LING 572

Fei Xia

02/01/06

- Basic concepts
- Theoretical validity
- Case study:
- POS tagging

- Summary

Basic concepts

- Introduced by Schapire and Freund in the 1990s.
- “Boosting”: convert a weak learning algorithm into a strong one.
- Main idea: Combine many weak classifiers to produce a powerful committee.
- Algorithms:
- AdaBoost: adaptive boosting
- Gentle AdaBoost
- BrownBoost
- …

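[Figure: bagging. Random samples are drawn with replacement from the training sample; the learner ML is run on each one to produce f1, f2, …, fT, which are combined into f.]

[Figure: boosting. The learner ML is run on a sequence of weighted samples of the training sample to produce f1, f2, …, fT, which are combined into f.]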

- Train a set of weak hypotheses: h1, …, hT.
- The combined hypothesis H is a weighted majority vote of the T weak hypotheses.
- Each hypothesis ht has a weight αt.

- During training, focus on the examples that are misclassified.
- At round t, example xi has the weight Dt(i).

- Training time: (h1, α1), …, (ht, αt), …
- Test time: for x,
- Call each classifier ht, and calculate ht(x)
- Calculate the sum: $\sum_t \alpha_t h_t(x)$; the predicted label is its sign
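
The test-time step can be sketched in Python as follows. This is a minimal sketch, assuming each weak hypothesis returns a label in {-1, +1}; the names `weighted_vote` and `ensemble`, the threshold classifiers, and the weights are made up for illustration and do not come from the slides.

```python
from typing import Callable, List, Sequence, Tuple

# A weak hypothesis maps an example to a label in {-1, +1}.
WeakHypothesis = Callable[[Sequence[float]], int]


def weighted_vote(hypotheses: List[Tuple[WeakHypothesis, float]],
                  x: Sequence[float]) -> int:
    """Return sign(sum_t alpha_t * h_t(x)) for a list of (h_t, alpha_t) pairs."""
    total = sum(alpha * h(x) for h, alpha in hypotheses)
    # A tie (total == 0) is broken in favour of +1; this is an arbitrary choice.
    return 1 if total >= 0 else -1


# Hypothetical toy ensemble: three threshold tests on a 1-dimensional input,
# each paired with a made-up weight alpha_t.
ensemble = [
    (lambda x: 1 if x[0] > 0.0 else -1, 0.9),
    (lambda x: 1 if x[0] > 2.0 else -1, 0.4),
    (lambda x: 1 if x[0] > 5.0 else -1, 0.3),
]

print(weighted_vote(ensemble, [1.0]))
print(weighted_vote(ensemble, [-1.0]))
```

For the input `[1.0]` the first classifier votes +1 with weight 0.9 and the other two vote -1 with total weight 0.7, so the heavier single vote wins even though it is outnumbered.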

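- Binary classification problem
- Training data: $(x_1, y_1), \dots, (x_m, y_m)$, where $x_i \in X$ and $y_i \in Y = \{-1, +1\}$
- Dt(i): the weight of xi at round t. D1(i) = 1/m.
- A learner L that finds a weak hypothesis ht: X → Y given the training set and Dt
- The error of a weak hypothesis ht: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$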

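- For t=1, …, T
- Train weak learner ht : X → {-1, 1} using training data and Dt
- Get the error rate: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$
- Choose classifier weight: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$
- Update the instance weights: $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ normalizes $D_{t+1}$ to sum to 1

When $h_t(x_i) = y_i$: $D_{t+1}(i) = D_t(i)\, e^{-\alpha_t} / Z_t$ (the weight decreases)

When $h_t(x_i) \neq y_i$: $D_{t+1}(i) = D_t(i)\, e^{\alpha_t} / Z_t$ (the weight increases)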
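
The training loop can be sketched in Python roughly as follows. This is a minimal sketch, not the lecture's reference implementation. It assumes the standard AdaBoost choices: αt = ½ ln((1 − εt)/εt), and weights multiplied by exp(−αt yi ht(xi)) and then renormalized. The exhaustive decision-stump learner and all names (`best_stump`, `adaboost`, `predict`, `X_toy`, `y_toy`) are hypothetical.

```python
import math
from typing import List, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: List[float]) -> int:
    """Predict `polarity` if x[feature] > threshold, otherwise -polarity."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def best_stump(X, y, D) -> Tuple[Stump, float]:
    """Exhaustively pick the stump with the lowest weighted error under D."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        values = sorted({x[feature] for x in X})
        # Candidate thresholds: one below the minimum, then each observed value.
        for threshold in [values[0] - 1.0] + values:
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(d for x, label, d in zip(X, y, D)
                          if stump_predict(stump, x) != label)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(X, y, T):
    """Return a list of (stump, alpha) pairs after at most T rounds."""
    m = len(X)
    D = [1.0 / m] * m                       # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        stump, eps = best_stump(X, y, D)
        if eps >= 0.5:                      # weak learner is no better than random
            break
        eps = max(eps, 1e-10)               # guard against log of infinity when eps == 0
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((stump, alpha))
        # Misclassified examples are scaled up, correct ones are scaled down.
        D = [d * math.exp(-alpha * label * stump_predict(stump, x))
             for x, label, d in zip(X, y, D)]
        Z = sum(D)                          # normalizer Z_t
        D = [d / Z for d in D]
    return ensemble


def predict(ensemble, x) -> int:
    """Weighted majority vote: sign of sum_t alpha_t * h_t(x)."""
    total = sum(alpha * stump_predict(stump, x) for stump, alpha in ensemble)
    return 1 if total >= 0 else -1


# Toy 1-D data that no single stump can classify perfectly.
X_toy = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y_toy = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X_toy, y_toy, T=3)
print([predict(ensemble, x) for x in X_toy])
```

On this toy set every single stump misclassifies at least two of the six points, yet the weighted vote of three stumps fits all six, which is the "weak to strong" conversion in miniature.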

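[Figure: toy example with positive (+) and negative (o) examples, showing the instance weights initially, after the 1st iteration, and after the 2nd iteration.]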

- In the basic algorithm, it can be proven that
- The hypothesis weight αt is decided at round t
- Dt (the weight distribution over training examples) is updated at every round t.
- Choice of weak learner:
- Its error should be less than 0.5: $\epsilon_t < 0.5$
- Examples: decision trees (C4.5), decision stumps

[Figure: error rate on a set of 27 benchmark problems.]

Theoretical validity

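Final hypothesis: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Training error is defined to be $\frac{1}{m} \left|\{ i : H(x_i) \neq y_i \}\right|$

It can be proved that training error $\leq \prod_{t=1}^{T} Z_t$, where $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$. With the basic choice of αt, $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$.

Let $\gamma_t = \frac{1}{2} - \epsilon_t$

Training error $\leq \prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \leq \exp\left(-2 \sum_t \gamma_t^2\right)$

If every γt is at least some γ > 0, training error drops exponentially fast in T.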
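
To get a feel for how fast the bound shrinks, the sketch below evaluates the standard AdaBoost training-error bound, $\prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} \leq \exp(-2\sum_t \gamma_t^2)$ with $\gamma_t = \frac{1}{2} - \epsilon_t$, for a given sequence of weak-learner errors. The function name `training_error_bounds` and the constant error rate 0.4 are assumptions chosen for illustration.

```python
import math
from typing import Sequence, Tuple


def training_error_bounds(errors: Sequence[float]) -> Tuple[float, float]:
    """Return (prod_t 2*sqrt(eps_t*(1-eps_t)), exp(-2*sum_t gamma_t^2))."""
    product = 1.0
    gamma_sq_sum = 0.0
    for eps in errors:
        product *= 2.0 * math.sqrt(eps * (1.0 - eps))
        gamma_sq_sum += (0.5 - eps) ** 2     # gamma_t = 1/2 - eps_t
    return product, math.exp(-2.0 * gamma_sq_sum)


# Hypothetical run: every weak hypothesis has weighted error 0.4 (edge 0.1).
for T in (10, 100, 500):
    tight, loose = training_error_bounds([0.4] * T)
    print(f"T={T:3d}  product bound={tight:.3e}  exponential bound={loose:.3e}")
```

Even with weak hypotheses that are only slightly better than chance, both bounds go to zero as T grows; the product form is always at least as tight as the exponential form.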

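- Generalization error, with high probability, is at most $\hat{\Pr}[H(x) \neq y] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$, where
- $\hat{\Pr}[\cdot]$: the empirical probability on the training sample
- T: the number of rounds of boosting
- m: the size of the sample
- d: VC-dimension of the base classifier space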

- Training error $\leq \prod_t Z_t$
- Choose ht that minimizes Zt.
- See “case study” for details.

Multiclass boosting

- Converting a multiclass problem to binary problems first:
- One-vs-all
- All-pairs
- ECOC

- Extending boosting directly
- AdaBoost.M1
- AdaBoost.M2 (see Prob 2 in Hw5)
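
As a rough illustration of the one-vs-all reduction, the sketch below trains one binary scorer per class and predicts the class whose scorer is most confident. It is a sketch under stated assumptions: `train_one_vs_all`, `predict_one_vs_all`, and the toy data are hypothetical names, and `centroid_trainer` is a stand-in binary learner used only to keep the example self-contained. In a boosting setting the binary trainer would instead return the unthresholded sum $\sum_t \alpha_t h_t(x)$.

```python
from typing import Callable, Dict, List, Sequence

Scorer = Callable[[Sequence[float]], float]
BinaryTrainer = Callable[[List[Sequence[float]], List[int]], Scorer]


def train_one_vs_all(X, labels, train_binary: BinaryTrainer) -> Dict[str, Scorer]:
    """Train one binary scorer per class: that class (+1) against the rest (-1)."""
    scorers = {}
    for cls in sorted(set(labels)):
        binary_labels = [1 if label == cls else -1 for label in labels]
        scorers[cls] = train_binary(X, binary_labels)
    return scorers


def predict_one_vs_all(scorers: Dict[str, Scorer], x) -> str:
    """Predict the class whose binary scorer gives the highest score."""
    return max(scorers, key=lambda cls: scorers[cls](x))


def centroid_trainer(X, y) -> Scorer:
    """Stand-in binary learner (an assumption, not boosting): the score is the
    negative squared distance to the mean of the positive examples."""
    positives = [x for x, label in zip(X, y) if label == 1]
    centroid = [sum(column) / len(positives) for column in zip(*positives)]
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, centroid))


# Hypothetical 2-D toy data with three classes.
X_toy = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
labels_toy = ["N", "N", "V", "V", "D", "D"]
scorers = train_one_vs_all(X_toy, labels_toy, centroid_trainer)
print(predict_one_vs_all(scorers, [9, 0]))
```

The reduction needs real-valued scores rather than hard ±1 outputs, since several binary classifiers (or none) may claim the same example.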

Case study

- Boosting applied to Tagging and PP attachment
- Issues:
- How to learn weak hypotheses?
- How to deal with multi-class problems?
- Local decision vs. globally best sequence

- In this paper, a weak hypothesis h simply tests a predicate (a.k.a. feature) Φ:
h(x) = p1 if Φ(x) is true, and h(x) = p0 otherwise; that is, $h(x) = p_{\Phi(x)}$

- Examples:
- POS tagging: Φ is “PreviousWord=the”
- PP attachment: Φ is “V=accused, N1=president, P=of”
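
A predicate-based weak hypothesis of this kind is only a few lines of Python. The sketch below assumes examples are dictionaries of string-valued attributes; `make_weak_hypothesis`, the attribute name `PreviousWord`, and the values of p0 and p1 are illustrative assumptions, not values from the paper.

```python
from typing import Callable, Dict

Example = Dict[str, str]
Predicate = Callable[[Example], bool]


def make_weak_hypothesis(phi: Predicate, p0: float, p1: float) -> Callable[[Example], float]:
    """Build h(x) = p1 if phi(x) is true, else p0."""
    return lambda x: p1 if phi(x) else p0


def previous_word_is_the(x: Example) -> bool:
    """Hypothetical POS-tagging predicate: the previous word is 'the'."""
    return x.get("PreviousWord") == "the"


# Made-up prediction values: the sign is the predicted label and the
# magnitude acts as a confidence.
h = make_weak_hypothesis(previous_word_is_the, p0=-0.1, p1=0.8)

print(h({"PreviousWord": "the"}))
print(h({"PreviousWord": "a"}))
```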

- Choosing a list of hypotheses amounts to choosing a list of features.

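- The training error of the combined hypothesis is at most $\prod_{t=1}^{T} Z_t$,
where $Z_t = \sum_i D_t(i) \exp(-y_i h_t(x_i))$ (αt is folded into the real-valued ht)

- Choose ht that minimizes Zt.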

- ht corresponds to a (Φt, p0, p1) tuple.

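- Schapire and Singer (1998) show that given a predicate Φ, Zt is minimized when $p_j = \frac{1}{2} \ln \frac{W^j_{+1}}{W^j_{-1}}$ for $j \in \{0, 1\}$,

where $W^j_b = \sum_{i:\, \Phi(x_i) = j,\; y_i = b} D_t(i)$; the resulting minimum is $Z_t = 2 \sum_j \sqrt{W^j_{+1} W^j_{-1}}$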

- For each Φ, calculate Zt, and choose the one with the minimum Zt.
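
The selection step can be sketched as below. It assumes the closed form from Schapire and Singer (1998): with $W^j_b$ the total weight of examples where Φ(x) = j and the label is b, the best values are $p_j = \frac{1}{2}\ln(W^j_{+1}/W^j_{-1})$ and the minimum is $Z = 2\sum_j \sqrt{W^j_{+1} W^j_{-1}}$. The small `smooth` constant that avoids division by zero, the function names, and the toy data are assumptions made for this example.

```python
import math
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]
Predicate = Callable[[Example], bool]


def evaluate_predicate(phi: Predicate, X: List[Example], y: List[int],
                       D: List[float], smooth: float = 1e-8) -> Tuple[float, float, float]:
    """Return (Z, p0, p1) for predicate phi under the weights D."""
    # W[j][b]: total weight of examples with phi(x) == j and label b.
    W = {0: {1: 0.0, -1: 0.0}, 1: {1: 0.0, -1: 0.0}}
    for x, label, d in zip(X, y, D):
        W[1 if phi(x) else 0][label] += d
    Z = 2.0 * sum(math.sqrt(W[j][1] * W[j][-1]) for j in (0, 1))
    # Smoothed log-ratio (smoothing is an assumption to keep p_j finite).
    p = [0.5 * math.log((W[j][1] + smooth) / (W[j][-1] + smooth)) for j in (0, 1)]
    return Z, p[0], p[1]


def select_predicate(predicates: Dict[str, Predicate], X, y, D):
    """Return (name, Z, p0, p1) for the predicate with the smallest Z."""
    best = None
    for name, phi in predicates.items():
        Z, p0, p1 = evaluate_predicate(phi, X, y, D)
        if best is None or Z < best[1]:
            best = (name, Z, p0, p1)
    return best


# Hypothetical toy data: label +1 for the first two examples, -1 for the rest.
X_toy = [
    {"PreviousWord": "the", "Suffix": "ing"},
    {"PreviousWord": "the", "Suffix": "s"},
    {"PreviousWord": "to", "Suffix": "ing"},
    {"PreviousWord": "to", "Suffix": "s"},
]
y_toy = [1, 1, -1, -1]
D_uniform = [0.25] * 4
predicates = {
    "PreviousWord=the": lambda x: x["PreviousWord"] == "the",
    "Suffix=ing": lambda x: x["Suffix"] == "ing",
}
best = select_predicate(predicates, X_toy, y_toy, D_uniform)
print(best[0])
```

A predicate that splits the weighted data into pure blocks drives Z toward 0, while one whose blocks are evenly mixed gives Z = 1 and carries no information for that round.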

- Sequential model: a Viterbi-style optimization to choose a globally best sequence of labels.
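
For readers unfamiliar with Viterbi-style decoding, here is a generic sketch. It is not the paper's exact model: it simply assumes a function `score(i, prev_label, label)` that rates assigning `label` at position i given the previous label, and finds the sequence with the highest total score. The scoring tables, the start symbol `"<s>"`, and the three-word example are invented for illustration.

```python
from typing import Callable, List, Sequence


def viterbi(n_positions: int, labels: Sequence[str],
            score: Callable[[int, str, str], float]) -> List[str]:
    """Return the label sequence maximizing sum_i score(i, prev_label, label).

    Assumes n_positions >= 1; position 0 is scored against the start symbol "<s>".
    """
    best = {label: score(0, "<s>", label) for label in labels}
    backpointers = []
    for i in range(1, n_positions):
        new_best, pointers = {}, {}
        for label in labels:
            # Best previous label for ending in `label` at position i.
            prev = max(labels, key=lambda p: best[p] + score(i, p, label))
            new_best[label] = best[prev] + score(i, prev, label)
            pointers[label] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda label: best[label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for the three words "the can rusts".
local = [
    {"D": 2.0, "N": 0.0, "V": 0.0},
    {"D": 0.0, "N": 1.0, "V": 1.2},
    {"D": 0.0, "N": 0.5, "V": 1.0},
]
transition = {("D", "N"): 1.0, ("N", "V"): 1.0}


def toy_score(i: int, prev: str, label: str) -> float:
    return local[i][label] + transition.get((prev, label), 0.0)


print(viterbi(3, ["D", "N", "V"], toy_score))
```

In this toy setup the locally best label for the second word is V, but the globally best sequence tags it N because of the transition scores, which is the difference between local decisions and a globally best sequence.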

Summary

- Boosting combines many weak classifiers to produce a powerful committee.
- The base learning algorithm only needs to be better than random.
- The instance weights are updated during training to put more emphasis on hard examples.

- Theoretical validity: it comes with a set of theoretical guarantees (e.g., on training error and test error).
- It performs well on many tasks.
- It can identify outliers, i.e., examples that are either mislabeled or inherently ambiguous and hard to categorize.

- The actual performance of boosting depends on the data and the base learner.
- Boosting seems to be especially susceptible to noise.
- When the number of outliers is very large, the emphasis placed on the hard examples can hurt the performance.
- Variants that put less emphasis on outliers: “Gentle AdaBoost”, “BrownBoost”

- Simplicity (conceptual)
- Efficiency at training
- Efficiency at testing time
- Handling multi-class
- Interpretability

- Bagging always uses resampling rather than reweighting.
- Bagging does not modify the weight distribution over examples or mislabels, but instead always uses the uniform distribution
- In forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses
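
For contrast with boosting, the three bagging properties above can be sketched as follows: each hypothesis is trained on a bootstrap replicate drawn uniformly with replacement, and the final vote is unweighted. This is a minimal sketch. `midpoint_learner` is a stand-in base learner for 1-D data that assumes positive examples lie to the right of negative ones, and every name and data point here is hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def bagging(data, train, T, seed=0):
    """Train T hypotheses, each on its own bootstrap replicate of data."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, rng)) for _ in range(T)]


def majority_vote(hypotheses, x):
    """Unweighted vote: every hypothesis counts equally."""
    votes = Counter(h(x) for h in hypotheses)
    return votes.most_common(1)[0][0]


def midpoint_learner(sample):
    """Stand-in base learner: threshold halfway between the class means."""
    pos = [x for x, label in sample if label == 1]
    neg = [x for x, label in sample if label == -1]
    if not pos or not neg:
        # Degenerate replicate: only one class was drawn, so predict it always.
        only = 1 if pos else -1
        return lambda x: only
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: 1 if x > threshold else -1


# Hypothetical 1-D data: (value, label) pairs.
data = [(0.0, -1), (1.0, -1), (2.0, -1), (8.0, 1), (9.0, 1), (10.0, 1)]
hypotheses = bagging(data, midpoint_learner, T=11)
print(majority_vote(hypotheses, 7.0), majority_vote(hypotheses, 3.0))
```

Unlike AdaBoost, nothing here depends on which examples earlier hypotheses got wrong, so the T hypotheses could be trained in parallel.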

- Game theory
- Linear programming
- Bregman distances
- Support-vector machines
- Brownian motion
- Logistic regression
- Maximum-entropy methods such as iterative scaling.

Additional slides

- Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data
- Variance arises when the classifier overfits the data
- There is often a tradeoff between bias and variance

- If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias.
- In practice, bagging can reduce both bias and variance
- For high-bias classifiers, it can reduce bias
- For high-variance classifiers, it can reduce variance

- In the early iterations, boosting is primarily a bias-reducing method
- In later iterations, it appears to be primarily a variance-reducing method

- Training error $\leq \prod_t Z_t$
- Choose αt that minimizes Zt.

- Given ht, how to choose αt?
- How to select ht?