
# Boosting



## Boosting

LING 572

Fei Xia

02/01/06

### Outline

• Basic concepts

• Theoretical validity

• Case study:

• POS tagging

• Summary

## Basic concepts

### Overview of boosting

• Introduced by Schapire and Freund in 1990s.

• “Boosting”: convert a weak learning algorithm into a strong one.

• Main idea: Combine many weak classifiers to produce a powerful committee.

• Algorithms:

• BrownBoost

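[Figure: bagging. Random samples drawn with replacement from the training data are each passed to the learner (ML), producing f1, f2, …, fT, which are combined into f.]

[Figure: boosting. The training sample, then a sequence of weighted samples, are each passed to the learner (ML), producing f1, f2, …, fT, which are combined into f.]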

### Main ideas

• Train a set of weak hypotheses: h1, …, hT.

• The combined hypothesis H is a weighted majority vote of the T weak hypotheses.

• Each hypothesis ht has a weight αt.

• During training, focus on the examples that are misclassified.

→ At round t, example xi has the weight Dt(i).

### Algorithm highlight

• Training time: (h1, α1), …, (ht, αt), …

• Test time: for x,

• Call each classifier ht, and calculate ht(x)

• Calculate the sum: Σt αt · ht(x)
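The test-time procedure above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `weighted_vote`, the three toy hypotheses, and their weights are made up for the example, and taking the sign of the sum assumes the binary labels {-1, +1} used later in these slides.

```python
from typing import Callable, Sequence


def weighted_vote(x: float,
                  hypotheses: Sequence[Callable[[float], int]],
                  alphas: Sequence[float]) -> int:
    """Return the sign of sum_t alpha_t * h_t(x) as the combined prediction."""
    score = sum(alpha * h(x) for h, alpha in zip(hypotheses, alphas))
    return 1 if score >= 0 else -1


# Hypothetical weak hypotheses, each returning -1 or +1.
toy_hypotheses = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > 2 else -1,
    lambda x: -1 if x > 5 else 1,
]
toy_alphas = [0.4, 0.3, 0.5]  # hypothetical hypothesis weights

label_at_1 = weighted_vote(1, toy_hypotheses, toy_alphas)
label_at_minus_1 = weighted_vote(-1, toy_hypotheses, toy_alphas)
```

At x = 1 the second hypothesis is outvoted by the other two, which is the "weighted majority" behaviour described under Main ideas.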

### Basic Setting

• Binary classification problem

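• Training data: $(x_1, y_1), \dots, (x_m, y_m)$, where $x_i \in X$ and $y_i \in Y = \{-1, +1\}$

• Dt(i): the weight of xi at round t. D1(i) = 1/m.

• A learner L that finds a weak hypothesis ht: X → Y given the training set and Dt

• The error of a weak hypothesis ht: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$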

### The basic AdaBoost algorithm

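• For t = 1, …, T

• Train weak learner ht : X → {-1, 1} using training data and Dt

• Get the error rate: $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$

• Choose classifier weight: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$

• Update the instance weights: $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ normalizes $D_{t+1}$ so that it sums to 1

When $h_t(x_i) = y_i$: $D_{t+1}(i) = D_t(i)\, e^{-\alpha_t} / Z_t$ (the weight decreases)

When $h_t(x_i) \neq y_i$: $D_{t+1}(i) = D_t(i)\, e^{\alpha_t} / Z_t$ (the weight increases)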
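The loop above (train a weak learner, measure its weighted error, set the classifier weight to half the log odds of being correct, reweight the examples) can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the weak learner is a one-dimensional threshold stump, the names `train_stump`, `adaboost`, and `predict` are hypothetical, the toy data set is invented, and the error is clamped away from 0 and 1 only to keep the logarithm finite.

```python
import math


def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best threshold stump.

    A stump predicts `polarity` when x >= threshold and `-polarity` otherwise.
    """
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            preds = [pol if x >= thr else -pol for x in xs]
            err = sum(w for p, y, w in zip(preds, ys, weights) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def adaboost(xs, ys, rounds):
    """Run `rounds` boosting rounds; return a list of (alpha, threshold, polarity)."""
    m = len(xs)
    weights = [1.0 / m] * m                      # uniform initial weights, 1/m each
    ensemble = []
    for _ in range(rounds):
        err, thr, pol = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # assumption: clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # classifier weight
        preds = [pol if x >= thr else -pol for x in xs]
        # Misclassified examples (y * p = -1) are scaled up, correct ones down.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        z = sum(weights)                         # normalization factor
        weights = [w / z for w in weights]
        ensemble.append((alpha, thr, pol))
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted sum of stump predictions."""
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Invented toy data: no single stump separates it.
toy_xs = list(range(10))
toy_ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
toy_ensemble = adaboost(toy_xs, toy_ys, rounds=3)
toy_predictions = [predict(toy_ensemble, x) for x in toy_xs]
```

On this toy set the best single stump misclassifies three of the ten points, while the three-round combination classifies all ten correctly.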

[Figure: toy example with training points labeled o and +, showing the initial weights and the weights after the 1st and 2nd iterations.]

### The basic and general algorithms

• In the basic algorithm, it can be proven that

• The hypothesis weight αt is decided at round t

• Dt (the weight distribution over training examples) is updated at every round t.

• Choice of weak learner:

• its error should be less than 0.5: $\epsilon_t < 0.5$

• Examples: decision trees (C4.5), decision stumps

### Experiment results (Freund and Schapire, 1996)

Error rate on a set of 27 benchmark problems

## Theoretical validity

### Training error of H(x)

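Final hypothesis: $H(x) = \mathrm{sign}(f(x))$, where $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$

Training error is defined to be $\frac{1}{m} \left|\{i : H(x_i) \neq y_i\}\right|$

It can be proved that training error $\leq \prod_{t=1}^{T} Z_t$, where $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$ is the normalization factor at round t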

### Training error for basic algorithm

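Let $\gamma_t = \frac{1}{2} - \epsilon_t$

Training error $\leq \prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \leq \exp\left(-2 \sum_t \gamma_t^2\right)$

→ If every $\gamma_t \geq \gamma > 0$, the training error drops exponentially fast in T.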
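To get a feel for the bound, the two standard expressions for the basic algorithm, the product of 2·sqrt(ε_t(1 − ε_t)) over rounds and the looser exp(−2 Σ γ_t²) with γ_t = 1/2 − ε_t, can be evaluated for a given sequence of weak-learner errors. The sketch below uses a hypothetical helper `training_error_bounds` and a made-up error sequence.

```python
import math


def training_error_bounds(errors):
    """Return (product bound, exponential bound) for per-round errors eps_t."""
    product = 1.0
    gamma_squared_sum = 0.0
    for eps in errors:
        product *= 2.0 * math.sqrt(eps * (1.0 - eps))   # Z_t of the basic algorithm
        gamma_squared_sum += (0.5 - eps) ** 2           # gamma_t = 1/2 - eps_t
    return product, math.exp(-2.0 * gamma_squared_sum)


# Made-up weak-learner errors for three rounds.
tight_bound, loose_bound = training_error_bounds([0.30, 0.21, 0.18])

# With a constant edge gamma = 0.1 (eps_t = 0.4), the bound shrinks with T.
bounds_by_rounds = [training_error_bounds([0.4] * t)[0] for t in (10, 50, 100)]
```

The second computation shows why even a small constant edge over random guessing is enough: each extra round multiplies the bound by the same factor below 1.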

### Generalization error (expected test error)

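• Generalization error, with high probability, is at most $\hat{\Pr}[H(x) \neq y] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$, where $\hat{\Pr}[\cdot]$ is the empirical probability on the training sample and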

T: the number of rounds of boosting

m: the size of the sample

d: VC-dimension of the base classifier space

### Selecting weak hypotheses

• Training error $\leq \prod_t Z_t$, where $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$

• Choose ht that minimizes Zt.

• See “case study” for details.

## Multiclass boosting

### Two ways

• Converting a multiclass problem to binary problem first:

• One-vs-all

• All-pairs

• ECOC

• Extending boosting directly

• AdaBoost.M2 → Prob 2 in Hw5

## Case study

### Overview (Abney, Schapire and Singer, 1999)

• Boosting applied to Tagging and PP attachment

• Issues:

• How to learn weak hypotheses?

• How to deal with multi-class problems?

• Local decision vs. globally best sequence

### Weak hypotheses

• In this paper, a weak hypothesis h simply tests a predicate (a.k.a. feature), Φ:

h(x) = p1 if Φ(x) is true, h(x) = p0 otherwise

→ $h(x) = p_{\Phi(x)}$

• Examples:

• POS tagging: Φ is “PreviousWord=the”

• PP attachment: Φ is “V=accused, N1=president, P=of”

• Choosing a list of hypotheses amounts to choosing a list of features.

### Finding weak hypotheses

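• The training error of the combined hypothesis is at most $\prod_{t=1}^{T} Z_t$

where $Z_t = \sum_i D_t(i) \exp(-y_i h_t(x_i))$ (the weight $\alpha_t$ is folded into the real-valued $h_t$)

→ choose ht that minimizes Zt.

• ht corresponds to a (Φt, p0, p1) tuple.

• Schapire and Singer (1998) show that given a predicate Φ, Zt is minimized when $p_j = \frac{1}{2} \ln \frac{W^j_{+1}}{W^j_{-1}}$ for $j \in \{0, 1\}$

where $W^j_b = \sum_i D_t(i)\, \mathbf{1}[\Phi(x_i) = j \wedge y_i = b]$ is the total weight of the examples with $\Phi(x_i) = j$ and label $b$

### Finding weak hypotheses (cont)

• For each Φ, calculate $Z_t = 2 \sum_{j \in \{0, 1\}} \sqrt{W^j_{+1} W^j_{-1}}$

Choose the one with min Zt.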
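The selection step can be sketched as follows, under the usual confidence-rated formulation: for a predicate Φ, let W[j][b] be the total weight of examples with Φ(x) = j and label b; then p_j = ½ ln(W[j][+1] / W[j][−1]) and Z = 2 Σ_j sqrt(W[j][+1] · W[j][−1]). The names `score_predicate` and `best_predicate`, the toy data, and the small `smoothing` constant (added only so the logarithm stays finite when a weight is zero) are assumptions for this example.

```python
import math


def score_predicate(phi_values, ys, weights, smoothing=1e-8):
    """Return (Z, p0, p1) for one predicate under the current weights."""
    # w[j][b]: total weight of examples with phi(x) = j and label b
    w = {0: {1: 0.0, -1: 0.0}, 1: {1: 0.0, -1: 0.0}}
    for phi, y, d in zip(phi_values, ys, weights):
        w[int(bool(phi))][y] += d
    z = 2.0 * sum(math.sqrt(w[j][1] * w[j][-1]) for j in (0, 1))
    p = {j: 0.5 * math.log((w[j][1] + smoothing) / (w[j][-1] + smoothing))
         for j in (0, 1)}
    return z, p[0], p[1]


def best_predicate(candidates, ys, weights):
    """Pick the predicate name with the smallest Z.

    `candidates` maps a predicate name to its list of truth values on the data.
    """
    return min(candidates,
               key=lambda name: score_predicate(candidates[name], ys, weights)[0])


# Invented toy data: four examples with uniform weights.
toy_labels = [1, 1, -1, -1]
toy_weights = [0.25, 0.25, 0.25, 0.25]
toy_candidates = {
    "informative": [True, True, False, False],    # agrees with the labels
    "uninformative": [True, False, True, False],  # independent of the labels
}
chosen = best_predicate(toy_candidates, toy_labels, toy_weights)
z_uninformative = score_predicate(toy_candidates["uninformative"],
                                  toy_labels, toy_weights)[0]
```

A predicate that splits the weight evenly between the two labels gets Z = 1 (no progress on the bound), while one that separates the labels gets Z close to 0.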

### Sequential model

• Sequential model: a Viterbi-style optimization to choose a globally best sequence of labels.
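As a generic illustration of Viterbi-style decoding (not the specific model of the paper), the sketch below finds the label sequence with the highest total score when each position's score may depend on the previous label. The `viterbi` function, the `score(pos, prev_label, label)` interface, and the toy scores are all hypothetical; the sequence is assumed to have at least one position.

```python
def viterbi(n_positions, labels, score):
    """Return the highest-scoring label sequence.

    `score(pos, prev_label, label)` gives the score of `label` at `pos`
    given the previous label (None at position 0).
    """
    # best[y] = (score of the best sequence ending in y, that sequence)
    best = {y: (score(0, None, y), [y]) for y in labels}
    for pos in range(1, n_positions):
        new_best = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p][0] + score(pos, p, y))
            total = best[prev][0] + score(pos, prev, y)
            new_best[y] = (total, best[prev][1] + [y])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy scores for a three-word sentence such as "the dog barks".
local_scores = [
    {"D": 2.0, "N": 0.0, "V": 0.0},
    {"D": 0.0, "N": 1.0, "V": 1.2},
    {"D": 0.0, "N": 1.0, "V": 1.0},
]
transition_bonus = {("D", "N"): 1.0, ("N", "V"): 1.0}


def toy_score(pos, prev_label, label):
    return local_scores[pos][label] + transition_bonus.get((prev_label, label), 0.0)


best_sequence = viterbi(3, ["D", "N", "V"], toy_score)
```

Choosing each label locally would pick V for the second word (1.2 versus 1.0), whereas the globally best sequence prefers N there because of the transition scores.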

## Summary

### Main ideas

• Boosting combines many weak classifiers to produce a powerful committee.

• The base learning algorithm only needs to be better than random.

• The instance weights are updated during training to put more emphasis on hard examples.

### Strengths of AdaBoost

• Theoretical validity: it comes with a set of theoretical guarantees (e.g., on training error and test error)

• It performs well on many tasks.

• It can identify outliers, i.e., examples that are either mislabeled or inherently ambiguous and hard to categorize.

### Weakness of AdaBoost

• The actual performance of boosting depends on the data and the base learner.

• Boosting seems to be especially susceptible to noise.

• When the number of outliers is very large, the emphasis placed on the hard examples can hurt the performance.

→ “Gentle AdaBoost”, “BrownBoost”

### Other properties

• Simplicity (conceptual)

• Efficiency at training

• Efficiency at testing time

• Handling multi-class

• Interpretability

### Bagging vs. Boosting (Freund and Schapire 1996)

• Bagging always uses resampling rather than reweighting.

• Bagging does not modify the weight distribution over examples or mislabels, but instead always uses the uniform distribution

• In forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses
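For contrast with boosting, the three properties above (resampling, a uniform distribution over examples, and an unweighted vote) can be sketched as follows. The helper names `bootstrap_sample`, `bagging`, `majority_vote`, and `learn_threshold`, as well as the toy data and learner, are assumptions made for this example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def bagging(data, learn, rounds, seed=0):
    """Train one model per bootstrap sample; no example weights are updated."""
    rng = random.Random(seed)
    return [learn(bootstrap_sample(data, rng)) for _ in range(rounds)]


def majority_vote(models, x):
    """Unweighted vote: every model counts equally."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def learn_threshold(sample):
    """Hypothetical base learner: threshold halfway between the class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == -1]
    if not pos or not neg:                 # sample contains a single class
        label = 1 if pos else -1
        return lambda x: label
    pos_mean = sum(pos) / len(pos)
    neg_mean = sum(neg) / len(neg)
    mid = (pos_mean + neg_mean) / 2
    sign = 1 if pos_mean > neg_mean else -1
    return lambda x: sign if x > mid else -sign


toy_data = [(0, -1), (1, -1), (2, -1), (7, 1), (8, 1), (9, 1)]
bagged_models = bagging(toy_data, learn_threshold, rounds=11)
vote_high = majority_vote(bagged_models, 9)
vote_low = majority_vote(bagged_models, 0)
```

The only randomness is in which examples each model sees; unlike boosting, later models are not steered toward the examples that earlier models got wrong.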

### Relation to other topics

• Game theory

• Linear programming

• Bregman distances

• Support-vector machines

• Brownian motion

• Logistic regression

• Maximum-entropy methods such as iterative scaling.

### Sources of Bias and Variance

• Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data

• Variance arises when the classifier overfits the data

• There is often a tradeoff between bias and variance

### Effect of Bagging

• If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias.

• In practice, bagging can reduce both bias and variance

• For high-bias classifiers, it can reduce bias

• For high-variance classifiers, it can reduce variance

### Effect of Boosting

• In the early iterations, boosting is primarily a bias-reducing method

• In later iterations, it appears to be primarily a variance-reducing method

### How to choose αt for ht with range [-1,1]?

• Training error $\leq \prod_t Z_t$, where $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$

• Choose αt that minimizes Zt.
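One way to make this concrete: with margins u_i = y_i·h_t(x_i) in [−1, 1], Z(α) = Σ_i D_t(i)·exp(−α·u_i) is convex in α, so it can be minimized numerically. A commonly cited closed-form choice is α = ½ ln((1 + r)/(1 − r)) with r = Σ_i D_t(i)·u_i, which minimizes an upper bound on Z and coincides with the exact minimizer when every u_i is ±1. The sketch below compares the two; the function names, the bisection interval, and the toy margins are assumptions, and the search assumes both positive and negative margins occur (otherwise Z has no finite minimizer).

```python
import math


def z_value(alpha, margins, weights):
    """Z(alpha) = sum_i D(i) * exp(-alpha * u_i), with u_i = y_i * h(x_i)."""
    return sum(d * math.exp(-alpha * u) for u, d in zip(margins, weights))


def minimize_z(margins, weights, lo=-10.0, hi=10.0, iterations=100):
    """Bisection on dZ/dalpha, which is increasing because Z is convex."""
    def dz(alpha):
        return -sum(d * u * math.exp(-alpha * u) for u, d in zip(margins, weights))
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if dz(mid) < 0:
            lo = mid      # minimum lies to the right
        else:
            hi = mid      # minimum lies to the left
    return (lo + hi) / 2


def alpha_from_correlation(margins, weights):
    """Closed-form alpha = 0.5 * ln((1 + r) / (1 - r)), r = sum_i D(i) * u_i."""
    r = sum(d * u for u, d in zip(margins, weights))
    return 0.5 * math.log((1 + r) / (1 - r))


uniform = [0.25, 0.25, 0.25, 0.25]

# Margins in {-1, +1}: both choices agree with 0.5 * ln((1 - eps) / eps).
binary_margins = [1, 1, 1, -1]
alpha_exact = minimize_z(binary_margins, uniform)
alpha_closed = alpha_from_correlation(binary_margins, uniform)

# Real-valued margins: the closed form is only an approximation.
soft_margins = [0.5, 0.8, -0.3, 1.0]
z_exact = z_value(minimize_z(soft_margins, uniform), soft_margins, uniform)
z_closed = z_value(alpha_from_correlation(soft_margins, uniform), soft_margins, uniform)
```

In the binary case with one error out of four equally weighted examples, both values equal ½ ln 3, matching the classifier weight of the basic algorithm with ε = 0.25.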

### Issues

• Given ht, how to choose αt?

• How to select ht?