- 243 Views
- Updated On :

Boosting. LING 572 Fei Xia 02/01/06. Outline. Basic concepts Theoretical validity Case study: POS tagging Summary. Basic concepts. Overview of boosting. Introduced by Schapire and Freund in 1990s. “Boosting”: convert a weak learning algorithm into a strong one.

Related searches for Boosting

Download Presentation
## PowerPoint Slideshow about 'Boosting' - terrene

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Outline

- Basic concepts
- Theoretical validity
- Case study:
- POS tagging

- Summary

Overview of boosting

- Introduced by Schapire and Freund in 1990s.
- “Boosting”: convert a weak learning algorithm into a strong one.
- Main idea: Combine many weak classifiers to produce a powerful committee.
- Algorithms:
- AdaBoost: adaptive boosting
- Gentle AdaBoost
- BrownBoost
- …

Main ideas

- Train a set of weak hypotheses: h1, …., hT.
- The combined hypothesis H is a weighted majority vote of the T weak hypotheses.
- Each hypothesis ht has a weight αt.

- During the training, focus on the examples that are misclassified.
At round t, example xi has the weight Dt(i).

Algorithm highlight

- Training time: (h1, 1), …., (ht, t), …
- Test time: for x,
- Call each classifier ht, and calculate ht(x)
- Calculate the sum: tt * ht(x)

Basic Setting

- Binary classification problem
- Training data:
- Dt(i): the weight of xi at round t. D1(i)=1/m.
- A learner L that finds a weak hypothesis ht: X Y given the training set and Dt
- The error of a weak hypothesis ht:

The basic AdaBoost algorithm

- For t=1, …, T
- Train weak learner ht : X {-1, 1}using training data and Dt
- Get the error rate:
- Choose classifier weight:
- Update the instance weights:

The general AdaBoost algorithm

The basic and general algorithms

- In the basic algorithm, it can be proven that
- The hypothesis weight αt is decided at round t
- Di (The weight distribution of training examples) is updated at every round t.
- Choice of weak learner:
- its error should be less than 0.5:
- Ex: DT (C4.5), decision stump

Experiment results(Freund and Schapire, 1996)

Error rate on a set of 27 benchmark problems

Training error of H(x)

Final hypothesis:

Training error is defined to be

It can be proved that training error

Generalization error (expected test error)

- Generalization error, with high probability, is at most
T: the number of rounds of boosting

m: the size of the sample

d: VC-dimension of the base classifier space

Selecting weak hypotheses

- Training error
- Choose ht that minimize Zt.
- See “case study” for details.

Two ways

- Converting a multiclass problem to binary problem first:
- One-vs-all
- All-pairs
- ECOC

- Extending boosting directly
- AdaBoost.M1
- AdaBoost.M2 Prob 2 in Hw5

Overview(Abney, Schapire and Singer, 1999)

- Boosting applied to Tagging and PP attachment
- Issues:
- How to learn weak hypotheses?
- How to deal with multi-class problems?
- Local decision vs. globally best sequence

Weak hypotheses

- In this paper, a weak hypothesis h simply tests a predicate (a.k.a. feature), Φ:
h(x) = p1 if Φ(x) is true, h(x)=p0 o.w.

h(x)=pΦ(x)

- Examples:
- POS tagging: Φ is “PreviousWord=the”
- PP attachment: Φ is “V=accused, N1=president, P=of”

- Choosing a list of hypotheses choosing a list of features.

Finding weak hypotheses

- The training error of the combined hypothesis is at most
where

choose ht that minimizes Zt.

- ht corresponds to a (Φt, p0, p1) tuple.

- Schapire and Singer (1998) show that given a predicate Φ, Zt is minimized when

where

Finding weak hypotheses (cont)

- For each Φ, calculate Zt
Choose the one with min Zt.

Sequential model

- Sequential model: a Viterbi-style optimization to choose a globally best sequence of labels.

Main ideas

- Boosting combines many weak classifiers to produce a powerful committee.
- Base learning algorithms that only need to be better than random.
- The instance weights are updated during training to put more emphasis on hard examples.

Strengths of AdaBoost

- Theoretical validity: it comes with a set of theoretical guarantee (e.g., training error, test error)
- It performs well on many tasks.
- It can identify outliners: i.e. examples that are either mislabeled or that are inherently ambiguous and hard to categorize.

Weakness of AdaBoost

- The actual performance of boosting depends on the data and the base learner.
- Boosting seems to be especially susceptible to noise.
- When the number of outliners is very large, the emphasis placed on the hard examples can hurt the performance.
“Gentle AdaBoost”, “BrownBoost”

Other properties

- Simplicity (conceptual)
- Efficiency at training
- Efficiency at testing time
- Handling multi-class
- Interpretability

Bagging vs. Boosting (Freund and Schapire 1996)

- Bagging always uses resampling rather than reweighting.
- Bagging does not modify the weight distribution over examples or mislabels, but instead always uses the uniform distribution
- In forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses

Relation to other topics

- Game theory
- Linear programming
- Bregman distances
- Support-vector machines
- Brownian motion
- Logistic regression
- Maximum-entropy methods such as iterative scaling.

Sources of Bias and Variance

- Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data
- Variance arises when the classifier overfits the data
- There is often a tradeoff between bias and variance

Effect of Bagging

- If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias.
- In practice, bagging can reduce both bias and variance
- For high-bias classifiers, it can reduce bias
- For high-variance classifiers, it can reduce variance

Effect of Boosting

- In the early iterations, boosting is primary a bias-reducing method
- In later iterations, it appears to be primarily a variance-reducing method

Issues

- Given ht, how to choose αt?
- How to select ht?

How to choose αt when ht has range {-1,1}?

Download Presentation

Connecting to Server..