
Boosting

LING 572

Fei Xia

02/01/06

Outline
  • Basic concepts
  • Theoretical validity
  • Case study:
    • POS tagging
  • Summary
Overview of boosting
  • Introduced by Schapire and Freund in the 1990s.
  • “Boosting”: convert a weak learning algorithm into a strong one.
  • Main idea: Combine many weak classifiers to produce a powerful committee.
  • Algorithms:
    • AdaBoost: adaptive boosting
    • Gentle AdaBoost
    • BrownBoost
Bagging

[Diagram: T random samples are drawn with replacement from the training data; the learner ML is trained on each sample, producing classifiers f1, …, fT, which are combined into the final classifier f.]
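To make the diagram concrete, here is a minimal bagging sketch, assuming scikit-learn's DecisionTreeClassifier as the base learner ML, NumPy arrays X and y with labels in {-1, +1}, and an odd T to avoid vote ties; the function names are illustrative, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed stand-in for the base learner "ML"

def bagging_fit(X, y, T=11, seed=0):
    """Train T classifiers, each on a bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    m = len(X)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)              # random sample with replacement
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # f_1, ..., f_T
    return classifiers

def bagging_predict(classifiers, X):
    """Combine f_1, ..., f_T by an unweighted majority vote over labels in {-1, +1}."""
    votes = np.sum([clf.predict(X) for clf in classifiers], axis=0)
    return np.sign(votes)
```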
Boosting

[Diagram: ML is trained on the original training sample to produce f1; the sample is then reweighted and ML is retrained on each weighted sample to produce f2, …, fT; the weak classifiers are combined into the final classifier f.]
Main ideas
  • Train a set of weak hypotheses: h1, …, hT.
  • The combined hypothesis H is a weighted majority vote of the T weak hypotheses.
    • Each hypothesis ht has a weight αt.
  • During the training, focus on the examples that are misclassified.

→ At round t, example xi has the weight Dt(i).

Algorithm highlight
  • Training time: learn (h1, α1), …, (ht, αt), …
  • Test time: for x,
    • Call each classifier ht and calculate ht(x)
    • Calculate the weighted sum Σt αt * ht(x) and take its sign (see the sketch below)
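A minimal sketch of this test-time step, with illustrative names; it assumes each ht returns -1 or +1 for a single example and that the αt come from training.

```python
def adaboost_predict(weak_hyps, alphas, x):
    """Test time: call each h_t on x, form the weighted sum, and return its sign."""
    score = sum(alpha_t * h_t(x) for h_t, alpha_t in zip(weak_hyps, alphas))
    return 1 if score >= 0 else -1
```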
Basic Setting
  • Binary classification problem
  • Training data: (x1, y1), …, (xm, ym), where each yi ∈ {-1, 1}
  • Dt(i): the weight of xi at round t. D1(i) = 1/m.
  • A learner L that finds a weak hypothesis ht: X → Y given the training set and Dt
  • The error of a weak hypothesis ht: εt = Σi: ht(xi)≠yi Dt(i), i.e., the probability under Dt that ht makes a mistake
The basic AdaBoost algorithm
  • For t = 1, …, T:
    • Train a weak learner ht: X → {-1, 1} using the training data and Dt
    • Get the error rate: εt = Σi: ht(xi)≠yi Dt(i)
    • Choose the classifier weight: αt = ½ ln((1 − εt) / εt)
    • Update the instance weights: Dt+1(i) = Dt(i) · exp(−αt · yi · ht(xi)) / Zt, where Zt normalizes Dt+1 to sum to 1 (a minimal code sketch follows)
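A minimal sketch of the loop above, assuming decision stumps (single-feature thresholds) as the weak learner and NumPy arrays X (m × n) and y with labels in {-1, +1}; all names are illustrative, not from the slides.

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: pick the (feature, threshold, sign) stump with the
    smallest weighted error under the current distribution D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    h = lambda A, j=j, thr=thr, sign=sign: np.where(A[:, j] <= thr, sign, -sign)
    return err, h

def adaboost_train(X, y, T=10):
    """Basic AdaBoost: returns the weak hypotheses h_1..h_T and weights alpha_1..alpha_T."""
    m = len(y)
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
    hyps, alphas = [], []
    for t in range(T):
        eps, h = train_stump(X, y, D)        # weighted error eps_t of h_t
        eps = min(max(eps, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * h(X))    # up-weight mistakes, down-weight correct ones
        D = D / D.sum()                      # normalize: the divisor is Z_t
        hyps.append(h)
        alphas.append(alpha)
    return hyps, alphas
```

Prediction then follows the weighted-vote sketch shown earlier, applying each ht to the examples and taking the sign of Σt αt · ht(x).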
The new weights

When ht(xi) = yi (xi is classified correctly): Dt+1(i) = Dt(i) · e^(−αt) / Zt → the weight decreases.

When ht(xi) ≠ yi (xi is misclassified): Dt+1(i) = Dt(i) · e^(αt) / Zt → the weight increases.

Two iterations

Initial weights: D1(i) = 1/m for every example.

1st iteration: train h1, compute ε1 and α1, and reweight: misclassified examples are scaled by e^(α1), correctly classified ones by e^(−α1), then renormalize to get D2.

2nd iteration: repeat with D2 to get ε2, α2, and D3 (a hypothetical numeric example follows).

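The concrete numbers on this slide did not survive extraction, so here is a hypothetical toy run with m = 4 examples, assuming the round-t hypothesis misclassifies exactly one example (example 0 in round 1, example 1 in round 2).

```python
import numpy as np

m = 4
D = np.full(m, 1 / m)                          # initial weights: D_1(i) = 1/4

for t, wrong in enumerate((0, 1), start=1):    # index misclassified in round t
    miss = np.zeros(m, dtype=bool)
    miss[wrong] = True
    eps = D[miss].sum()                        # weighted error eps_t
    alpha = 0.5 * np.log((1 - eps) / eps)      # classifier weight alpha_t
    D = D * np.exp(np.where(miss, alpha, -alpha))
    D = D / D.sum()                            # divide by Z_t
    print(f"round {t}: eps={eps:.3f} alpha={alpha:.3f} D={np.round(D, 3)}")

# round 1: eps=0.250 alpha=0.549 D=[0.5, 0.167, 0.167, 0.167]
# round 2: eps=0.167 alpha=0.805 D=[0.3, 0.5, 0.1, 0.1]
```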
The basic and general algorithms
  • In the basic algorithm, it can be proven that Zt = 2√(εt(1 − εt))
  • The hypothesis weight αt is decided at round t
  • Dt (the weight distribution over training examples) is updated at every round t
  • Choice of weak learner:
    • its error should be less than 0.5: εt < 0.5
    • Ex: DT (C4.5), decision stump
Experiment results (Freund and Schapire, 1996)

[Figure: error rates on a set of 27 benchmark problems.]

Training error of H(x)

Final hypothesis: H(x) = sign(Σt αt · ht(x))

Training error is defined to be (1/m) · |{i : H(xi) ≠ yi}|

It can be proved that the training error ≤ Πt Zt

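A sketch of the standard argument behind this bound (the slide itself gives only the statement): unravel the weight update and bound the 0/1 loss by the exponential.

```latex
% Unraveling D_{t+1}(i) = D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} / Z_t over all T rounds:
D_{T+1}(i) = \frac{\exp\!\big(-y_i \sum_t \alpha_t h_t(x_i)\big)}{m \prod_t Z_t}
           = \frac{e^{-y_i f(x_i)}}{m \prod_t Z_t},
\qquad f(x) = \sum_t \alpha_t h_t(x).

% Since [[H(x_i) \ne y_i]] \le e^{-y_i f(x_i)} and \sum_i D_{T+1}(i) = 1:
\frac{1}{m}\sum_i [[H(x_i) \ne y_i]]
  \le \frac{1}{m}\sum_i e^{-y_i f(x_i)}
  = \sum_i D_{T+1}(i) \prod_t Z_t
  = \prod_t Z_t .
```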
Training error for basic algorithm

Let γt = ½ − εt

Training error ≤ Πt Zt = Πt 2√(εt(1 − εt)) = Πt √(1 − 4γt²) ≤ exp(−2 Σt γt²)

→ Training error drops exponentially fast.

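For instance, under the hypothetical assumption that every round achieves an edge γt ≥ 0.1, the bound decays as:

```latex
\prod_t Z_t \;\le\; e^{-2\sum_t \gamma_t^2} \;\le\; e^{-0.02\,T},
\qquad \text{e.g. } e^{-0.02 \cdot 200} \approx 0.018 .
```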
Generalization error (expected test error)
  • Generalization error, with high probability, is at most the training error plus Õ(√(T·d / m)), where:

T: the number of rounds of boosting

m: the size of the sample

d: VC-dimension of the base classifier space

Selecting weak hypotheses
  • Training error ≤ Πt Zt
  • Choose the ht that minimizes Zt.
  • See “case study” for details.
Two ways to handle multi-class problems
  • Converting a multiclass problem to a binary problem first:
    • One-vs-all (see the sketch below)
    • All-pairs
    • ECOC
  • Extending boosting directly:
    • AdaBoost.M1
    • AdaBoost.M2 → Prob 2 in Hw5
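A minimal one-vs-all sketch, assuming a binary trainer such as the adaboost_train function sketched earlier (all names are illustrative); AdaBoost.M1/M2 themselves extend boosting directly and are not shown here.

```python
import numpy as np

def one_vs_all_train(X, y, labels, T=10):
    """Train one binary booster per class: class k is +1, every other class is -1."""
    return {k: adaboost_train(X, np.where(y == k, 1, -1), T=T) for k in labels}

def one_vs_all_predict(models, X):
    """Predict the class whose booster has the largest score sum_t alpha_t h_t(x)."""
    labels = list(models)
    scores = np.stack([sum(a * h(X) for h, a in zip(*models[k])) for k in labels])
    return np.array(labels)[np.argmax(scores, axis=0)]
```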
Overview (Abney, Schapire and Singer, 1999)
  • Boosting applied to Tagging and PP attachment
  • Issues:
    • How to learn weak hypotheses?
    • How to deal with multi-class problems?
    • Local decision vs. globally best sequence
Weak hypotheses
  • In this paper, a weak hypothesis h simply tests a predicate (a.k.a. feature) Φ:

h(x) = p1 if Φ(x) is true, h(x) = p0 otherwise

→ h(x) = p_Φ(x)

  • Examples:
    • POS tagging: Φ is “PreviousWord=the”
    • PP attachment: Φ is “V=accused, N1=president, P=of”
  • Choosing a list of hypotheses → choosing a list of features (a small sketch follows).
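A small sketch of such a predicate-based hypothesis, representing an example as a feature dictionary; the predicate and the values p0, p1 below are made up for illustration.

```python
def make_weak_hypothesis(phi, p0, p1):
    """h(x) = p1 if the predicate phi(x) is true, else p0, i.e. h(x) = p_{phi(x)}."""
    return lambda x: p1 if phi(x) else p0

# Hypothetical POS-tagging predicate: "PreviousWord=the"
phi = lambda x: x.get("PreviousWord") == "the"
h = make_weak_hypothesis(phi, p0=-0.2, p1=0.7)      # made-up prediction values
print(h({"PreviousWord": "the", "Word": "book"}))   # -> 0.7
```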
Finding weak hypotheses
  • The training error of the combined hypothesis is at most Πt Zt,

where Zt = Σi Dt(i) · exp(−αt · yi · ht(xi))

→ choose the ht that minimizes Zt.

  • ht corresponds to a (Φt, p0, p1) tuple.
Finding weak hypotheses (cont)
  • For each Φ, calculate Zt.

Choose the one with the minimum Zt (see the sketch below).

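One way to carry this out, following the confidence-rated criterion of Schapire and Singer (1999) that Abney et al. build on: for each predicate Φ, let W_b^j be the total weight of examples with label b falling in block j of the partition; the block predictions are p_j = ½ ln(W_+^j / W_-^j) (with smoothing) and Zt = 2 Σj √(W_+^j · W_-^j). A rough sketch, with illustrative names:

```python
import numpy as np

def best_predicate(predicates, examples, y, D, smooth=1e-8):
    """Return the (phi, p0, p1) tuple with the smallest Z_t.
    examples: list of feature dicts; y: NumPy array of labels in {-1, +1};
    D: current example weights (NumPy array summing to 1)."""
    best = None
    for phi in predicates:
        block = np.array([1 if phi(x) else 0 for x in examples])   # Phi(x) in {0, 1}
        Z, p = 0.0, [0.0, 0.0]
        for j in (0, 1):
            w_pos = D[(block == j) & (y == +1)].sum()              # W_+^j
            w_neg = D[(block == j) & (y == -1)].sum()              # W_-^j
            Z += 2 * np.sqrt(w_pos * w_neg)
            p[j] = 0.5 * np.log((w_pos + smooth) / (w_neg + smooth))
        if best is None or Z < best[0]:
            best = (Z, phi, p[0], p[1])
    _, phi, p0, p1 = best
    return phi, p0, p1
```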
Sequential model
  • Sequential model: a Viterbi-style optimization to choose a globally best sequence of labels.
Main ideas
  • Boosting combines many weak classifiers to produce a powerful committee.
  • The base learning algorithm only needs to be better than random guessing.
  • The instance weights are updated during training to put more emphasis on hard examples.
Strengths of AdaBoost
  • Theoretical validity: it comes with a set of theoretical guarantees (e.g., on training error and test error).
  • It performs well on many tasks.
  • It can identify outliers, i.e., examples that are either mislabeled or inherently ambiguous and hard to categorize.
Weakness of AdaBoost
  • The actual performance of boosting depends on the data and the base learner.
  • Boosting seems to be especially susceptible to noise.
  • When the number of outliers is very large, the emphasis placed on the hard examples can hurt performance.

→ variants such as “Gentle AdaBoost” and “BrownBoost” address this.

Other properties
  • Simplicity (conceptual)
  • Efficiency at training
  • Efficiency at testing time
  • Handling multi-class
  • Interpretability
Bagging vs. Boosting (Freund and Schapire 1996)
  • Bagging always uses resampling rather than reweighting.
  • Bagging does not modify the weight distribution over examples or mislabels; instead it always uses the uniform distribution.
  • In forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses.
Relation to other topics
  • Game theory
  • Linear programming
  • Bregman distances
  • Support-vector machines
  • Brownian motion
  • Logistic regression
  • Maximum-entropy methods such as iterative scaling.
Sources of Bias and Variance
  • Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data
  • Variance arises when the classifier overfits the data
  • There is often a tradeoff between bias and variance
Effect of Bagging
  • If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias.
  • In practice, bagging can reduce both bias and variance
    • For high-bias classifiers, it can reduce bias
    • For high-variance classifiers, it can reduce variance
Effect of Boosting
  • In the early iterations, boosting is primarily a bias-reducing method.
  • In later iterations, it appears to be primarily a variance-reducing method.
How to choose αt for ht with range [-1,1]?
  • Training error ≤ Πt Zt, with Zt = Σi Dt(i) · exp(−αt · yi · ht(xi))
  • Choose the αt that minimizes Zt (one standard choice is sketched below).

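One standard choice (due to Schapire and Singer, not spelled out on the slide): with rt = Σi Dt(i) · yi · ht(xi) and yi · ht(xi) ∈ [-1, 1], convexity gives an upper bound on Zt whose minimizer has a closed form.

```latex
Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
    \;\le\; \frac{1+r_t}{2}\, e^{-\alpha_t} + \frac{1-r_t}{2}\, e^{\alpha_t},
\qquad
\alpha_t = \frac{1}{2}\ln\frac{1+r_t}{1-r_t}
\ \text{ minimizes the right-hand side.}
```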
Issues
  • Given ht, how to choose αt?
  • How to select ht?