
### Boosting

Machine Learning

CSE 5095: Special Topics Course

Boosting

Nhan Nguyen

Computer Science and Engineering Dept.

Boosting

- Method for converting rules of thumb into a prediction rule.
- Rule of thumb?
- Method?

Binary Classification

- X: set of all possible instances or examples.

- e.g., Possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast

- c: X → {0,1}: the target concept to learn.

- e.g., c: EnjoySport → {0,1}

- H: set of concept hypotheses

- e.g., conjunctions of literals: <?,Cold,High,?,?,?>

- C: concept class, a set of target concepts c.
- D: target distribution, a fixed probability distribution over X. Training and test examples are drawn according to D.

Binary Classification

- S: training sample

<x1,c(x1)>,…,<xm,c(xm)>

- The learning algorithm receives sample S and selects a hypothesis from H approximating c.

- Find a hypothesis h ∈ H such that h(x) = c(x) for all x ∈ S

Errors

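- True error or generalization error of h with respect to the target concept c and distribution D:

$$\mathrm{err}_D[h] = \Pr_{x \sim D}[h(x) \neq c(x)]$$

- Empirical error: average error of h on the training sample S drawn according to distribution D:

$$\widehat{\mathrm{err}}_S[h] = \Pr_{x \in S}[h(x) \neq c(x)] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(x_i) \neq c(x_i)]$$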
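As a minimal sketch of the two error notions, the snippet below uses a made-up target concept and hypothesis over X = {0, …, 99} with D uniform. The functions `c` and `h`, the domain, and the sample size are illustrative assumptions, not part of the lecture.

```python
import random

def c(x):
    """Hypothetical target concept: label 1 for x >= 50."""
    return 1 if x >= 50 else 0

def h(x):
    """Hypothetical hypothesis with a slightly misplaced threshold."""
    return 1 if x >= 55 else 0

def true_error(h, c, domain):
    """Exact true error of h when D is uniform over a finite domain."""
    return sum(h(x) != c(x) for x in domain) / len(domain)

def empirical_error(h, sample):
    """Fraction of labelled examples (x, c(x)) in the sample that h gets wrong."""
    return sum(h(x) != y for x, y in sample) / len(sample)

domain = range(100)
rng = random.Random(0)  # fixed seed so the sample is reproducible
sample = [(x, c(x)) for x in (rng.choice(domain) for _ in range(30))]

print("true error:     ", true_error(h, c, domain))
print("empirical error:", empirical_error(h, sample))
```

The empirical error fluctuates with the sample that is drawn, while the true error is a fixed property of h, c, and D.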

Errors

- Questions:
- Can we bound the true error of a hypothesis given only its training error?
- How many examples are needed for a good approximation?

Approximate Concept Learning

- Requiring a learner to acquire the right concept is too strict
- Instead, we will allow the learner to produce a good approximation to the actual concept

General Assumptions

- Assumption: Examples are generated according to a probability distribution D(x) and labeled according to an unknown function c: y = c(x)
- Learning Algorithm: from a set of m examples, outputs a hypothesis h ∈ H that is consistent with those examples (i.e., correctly classifies all of them).
- Goal: h should have a low error rate on new examples drawn from the same distribution D.

$$\mathrm{err}_D[h] = \Pr_{x \sim D}[h(x) \neq c(x)]$$

PAC learning model

- PAC learning: Probably Approximately Correct learning
- The goal of the learning algorithm: optimize over S to find some hypothesis h: X → {0,1}, h ∈ H, that approximates c with small error $\mathrm{err}_D[h]$
- If $\mathrm{err}_D[h]$ is small, h is “probably approximately correct”.
- Formally, h is PAC if

$$\Pr[\mathrm{err}_D[h] \le \epsilon] \ge 1 - \delta$$

for all c ∈ C, ε > 0, δ > 0, and all distributions D

PAC learning model

Concept class C is PAC-learnable if there is a learner L that outputs a hypothesis h ∈ H such that, for all c ∈ C, ε > 0, δ > 0, and all distributions D:

$$\Pr[\mathrm{err}_D[h] \le \epsilon] \ge 1 - \delta$$

and L uses at most poly(1/ε, 1/δ, size(X), size(c)) examples and running time.

ε: accuracy, 1 − δ: confidence.

Such an L is called a strong learner

PAC learning model

- Learner L is a weak learner if L outputs a hypothesis h ∈ H such that:

$$\Pr[\mathrm{err}_D[h] \le \tfrac{1}{2} - \gamma] \ge 1 - \delta$$

for some fixed γ > 0 and for all c ∈ C, δ > 0, and all distributions D

- A weak learner only outputs a hypothesis that performs slightly better than random guessing
- Hypothesis boosting problem: can we “boost” a weak learner into a strong learner?
- Rule of thumb ~ weak learner
- Method ~ boosting

Boosting a weak learner – Majority Vote

- L learns h1 on the first N training points
- L randomly filters the next batch of training points, extracting N/2 points correctly classified by h1, N/2 incorrectly classified, and produces h2.
- L builds a third training set of N points for which h1 and h2 disagree, and produces h3.
- L outputs h = Majority Vote(h1, h2, h3)
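To see why a vote over three hypotheses helps, a useful reference point is the idealized case where each of h1, h2, h3 errs with probability ε and the errors are independent: the vote is wrong only when at least two of them err, which happens with probability 3ε²(1 − ε) + ε³ = 3ε² − 2ε³. Schapire's analysis of the filtering construction gives the same expression as an upper bound; that result comes from his paper rather than from these slides. The sketch below (function name is illustrative) applies the expression recursively to show how the error shrinks.

```python
def majority_of_three_error(eps):
    """Error 3*eps^2 - 2*eps^3 of a majority vote over three hypotheses,
    each with error eps (independent-errors idealization)."""
    return 3 * eps**2 - 2 * eps**3

eps = 0.4  # a weak learner that is right 60% of the time
for level in range(5):
    print(f"level {level}: error {eps:.4f}")
    eps = majority_of_three_error(eps)
```

For any ε strictly between 0 and ½ the expression is smaller than ε, so applying the construction recursively keeps driving the error down.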

Boosting [Schapire ’89]

Idea: given a weak learner, run it multiple times on (reweighted) training data, then let learned classifiers vote.

A formal description of boosting:

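- Given training set $(x_1, y_1), \dots, (x_m, y_m)$
- $y_i \in \{-1, +1\}$: correct label of $x_i \in X$
- for $t = 1, \dots, T$:
- construct distribution $D_t$ on $\{1, \dots, m\}$
- find weak hypothesis $h_t: X \to \{-1, +1\}$ with small error $\epsilon_t$ on $D_t$
- output final hypothesis $H_{\text{final}}$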

Boosting

[Diagram: the training sample yields $h_1(x)$; each successive weighted sample yields $h_2(x), \dots, h_T(x)$; these are combined into the final hypothesis $H(x) = \mathrm{sign}\left[\sum_t \alpha_t h_t(x)\right]$]

Boosting algorithms

- AdaBoost (Adaptive Boosting)
- LPBoost (Linear Programming Boosting)
- BrownBoost
- MadaBoost (modifying the weighting system of AdaBoost)
- LogitBoost

Lecture

- Motivating example
- Adaboost
- Training Error
- Overfitting
- Generalization Error
- Examples of Adaboost
- Multiclass for weak learner

Theorem: 2 Class Error Bounds

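- Assume $\epsilon_t = \tfrac{1}{2} - \gamma_t$
- $\epsilon_t$ = error rate on round $t$ of boosting
- $\gamma_t$ = how much better than random guessing
- small, positive number
- Training error is bounded by

$$\mathrm{err}_{\text{train}}(H_{\text{final}}) \le \prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right)$$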

Implications?

- $T$ = number of rounds of boosting
- $\gamma$ and $T$ do not need to be known in advance
- As long as $\gamma_t \ge \gamma > 0$ for all $t$, the training error will decrease exponentially as a function of $T$

Part III: Continued

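Plug in the definition $\epsilon_t = \tfrac{1}{2} - \gamma_t$:

$$\prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_t 2\sqrt{\left(\tfrac{1}{2} - \gamma_t\right)\left(\tfrac{1}{2} + \gamma_t\right)} = \prod_t \sqrt{1 - 4\gamma_t^2}$$

Exponential Bound

- Use property $1 + x \le e^x$
- Take x to be $-4\gamma_t^2$
- $\sqrt{1 - 4\gamma_t^2} \le \sqrt{e^{-4\gamma_t^2}} = e^{-2\gamma_t^2}$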

Putting it together

- We’ve proven that the training error drops exponentially fast as a function of the number of base classifiers combined
- Bound is pretty loose

Example:

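- Suppose that all $\gamma_t$ are at least 10%, so that no $h_t$ has an error rate above 40%
- What upper bound does the theorem place on the training error?
- Answer: $\mathrm{err}_{\text{train}}(H_{\text{final}}) \le \left(\sqrt{1 - 4(0.1)^2}\right)^T = \left(\sqrt{0.96}\right)^T \approx 0.98^T$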
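The bound in this example can be evaluated numerically. The sketch below assumes the standard AdaBoost training-error bound, the product of sqrt(1 − 4γ_t²) over all rounds, together with its looser exponential form exp(−2 Σ γ_t²); the function name is illustrative.

```python
import math

def training_error_bounds(gammas):
    """Return (product bound, exponential bound) for a list of edges gamma_t."""
    product = math.prod(math.sqrt(1 - 4 * g * g) for g in gammas)
    exponential = math.exp(-2 * sum(g * g for g in gammas))
    return product, exponential

# Every weak hypothesis has error 40%, i.e. gamma_t = 0.1
for T in (10, 100, 500):
    product, exponential = training_error_bounds([0.1] * T)
    print(f"T={T:4d}  product bound={product:.3e}  exponential bound={exponential:.3e}")
```

Both bounds shrink geometrically with T, and the product form is always at least as tight as the exponential one.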

Overfitting?

- Does the proof say anything about overfitting?
- While this is good theory, can we get anything useful out of this proof as far as dealing with unseen data?

Ayman Alharbi

Example (Spam emails)

- Problem: filter out spam (junk email)

- Gather large collection of examples of spam and non-spam

From: Jinbo@engr.uconn.edu

“can you review a paper” ...

non-spam

From: XYZ@F.U

“Win 10000$ easily !!” ...

spam

Example (Spam emails)

If ‘buy now’ occurs in message, then predict ‘spam’

Main Observation

- Easy to find “rules of thumb” that are “often” correct

- Hard to find single rule that is very highly accurate

Goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.)

- “Yes I’d like to place a collect call long distance please” (Collect)
- “operator I need to make a call but I need to bill it to my office” (ThirdNumber)
- “I just called the wrong and I would like to have that taken off of my bill” (BillingCredit)

Main Observation

- Easy to find “rules of thumb” that are “often” correct
- e.g., if ‘bill’ occurs in utterance, then predict ‘BillingCredit’
- Hard to find single rule that is very highly accurate

The Boosting Approach

- Devise computer program for deriving rough rules of thumb
- Apply procedure to subset of emails
- Obtain rule of thumb
- Apply to 2nd subset of emails
- Obtain 2nd rule of thumb
- Repeat T times

Details

- How to choose examples on each round?

- Concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)

- How to combine rules of thumb into single prediction rule?

- Take (weighted) majority vote of rules of thumb

- Can prove: if we can always find weak rules of thumb slightly better than random guessing, then we can learn almost perfectly using boosting

Idea

- At each iteration t:
- Weight each training example by how incorrectly it was classified
- Learn a hypothesis ht
- Choose a strength for this hypothesis αt
- Final classifier: weighted combination of weak learners

Idea: given a set of weak learners, run them multiple times on (reweighted) training data, then let learned classifiers vote

Boosting Overview

- Goal: Form one strong classifier from multiple weak classifiers.
- Proceeds in rounds iteratively producing classifiers from weak learner.
- Increases weight given to incorrectly classified examples.
- Gives each classifier an importance that decreases as its weighted error increases.
- Each classifier gets a vote based on its importance.

Initialize

- Initialize with evenly weighted distribution: $D_1(i) = 1/m$ for each of the m training examples
- Begin generating classifiers

Error

- Quality of classifier based on weighted error: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$
- Probability that ht will misclassify an example selected according to distribution Dt
- Or summation of the weights of all misclassified examples
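The weighted error described above is simply the total weight sitting on the misclassified examples. A minimal sketch, with an illustrative function name and toy data:

```python
def weighted_error(weights, predictions, labels):
    """Sum of the weights D_t(i) over examples where the prediction differs from the label."""
    return sum(w for w, p, y in zip(weights, predictions, labels) if p != y)

labels      = [+1, +1, -1, -1, +1]
predictions = [+1, -1, -1, -1, +1]       # a single mistake, at index 1
uniform     = [0.2, 0.2, 0.2, 0.2, 0.2]
skewed      = [0.1, 0.6, 0.1, 0.1, 0.1]  # same mistake, but on a heavily weighted example

print(weighted_error(uniform, predictions, labels))
print(weighted_error(skewed, predictions, labels))
```

The same classifier looks much worse under the skewed distribution, which is how reweighting forces later weak learners to focus on hard examples.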

Classifier Importance

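- αt measures the importance given to classifier ht: $\alpha_t = \tfrac{1}{2} \ln\frac{1 - \epsilon_t}{\epsilon_t}$
- αt > 0 if εt < 1/2 (εt assumed to always be < 1/2)
- αt decreases as εt grows, so more accurate classifiers receive more importance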
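A small sketch of how the importance varies with the weighted error, assuming the standard AdaBoost choice α_t = ½ ln((1 − ε_t)/ε_t); the function name is illustrative.

```python
import math

def classifier_weight(eps):
    """AdaBoost importance alpha_t = 0.5 * ln((1 - eps_t) / eps_t), for 0 < eps_t < 1."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - eps) / eps)

for eps in (0.05, 0.25, 0.45, 0.5):
    print(f"eps={eps:.2f}  alpha={classifier_weight(eps):.3f}")
```

The weight falls to zero as ε_t approaches ½, so a classifier that is no better than random guessing contributes nothing to the vote.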

Update Distribution

- Increase weight of misclassified examples
- Decrease weight of correctly classified examples
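One reweighting step can be sketched as follows, assuming the standard AdaBoost update D_{t+1}(i) = D_t(i) · exp(−α_t y_i h_t(x_i)) / Z_t with labels and predictions in {−1, +1}. The function name and toy data are illustrative.

```python
import math

def update_distribution(weights, predictions, labels, alpha):
    """One AdaBoost reweighting step; returns the normalized distribution D_{t+1}."""
    unnormalized = [w * math.exp(-alpha * y * p)
                    for w, p, y in zip(weights, predictions, labels)]
    z = sum(unnormalized)  # normalization factor Z_t
    return [u / z for u in unnormalized]

labels      = [+1, +1, -1, -1, +1]
predictions = [+1, -1, -1, -1, +1]   # index 1 is misclassified
weights     = [0.2] * 5
eps   = 0.2                          # weighted error of these predictions
alpha = 0.5 * math.log((1 - eps) / eps)

new_weights = update_distribution(weights, predictions, labels, alpha)
print([round(w, 3) for w in new_weights])
```

With this choice of α_t the misclassified examples end up holding exactly half of the total weight, so the classifier just trained looks no better than random on the next distribution.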

Combine Classifiers

- When classifying a new instance x, all of the weak classifiers get a vote weighted by their α: $H(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$

Machine Learning

CSE 5095: Special Topics Course

Instructor: Jinbo Bi

Computer Science and Engineering Dept.

Presenter: Brian McClanahan

Topic: Boosting Generalization Error

Generalization Error

- Generalization error is the true error of a classifier
- Most supervised machine learning algorithms operate under the premise that lowering the training error will lead to a lower generalization error
- For some machine learning algorithms a relationship between the training and generalization error can be defined through a bound

Generalization Error First Bound

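$$\mathrm{err}(H) \le \widehat{\mathrm{err}}(H) + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$$

- $\widehat{\mathrm{err}}(H)$: empirical risk (training error)
- $T$: boosting rounds
- $d$: VC dimension of base classifiers
- $m$: number of training examples
- $\mathrm{err}(H)$: generalization error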

Intuition of Bound: Hoeffding’s inequality

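- Define $H$ to be a finite set of hypotheses which map examples to 0 or 1
- Hoeffding’s inequality: Let $X_1, \dots, X_m$ be independent random variables such that $X_i \in [0, 1]$. Denote their average value $A_m = \frac{1}{m}\sum_{i=1}^{m} X_i$. Then for any $\epsilon > 0$ we have: $\Pr[A_m \le \mathbb{E}[A_m] - \epsilon] \le e^{-2m\epsilon^2}$ (and likewise for the upper tail)
- In the context of machine learning, think of the $X_i$ as errors given by a hypothesis $h$: $X_i = \mathbb{1}[h(x_i) \neq y_i]$, where $y_i$ is the true label for $x_i$. Then $A_m$ is the empirical risk $\widehat{\mathrm{err}}(h)$ and $\mathbb{E}[A_m]$ is the generalization error $\mathrm{err}(h)$.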

Intuition of Bound

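If we want to bound the generalization error $\mathrm{err}(h)$ of a single hypothesis $h$ with probability $1 - \delta$, where $\delta = e^{-2m\epsilon^2}$, then we can solve for $\epsilon$ using $\delta$:

$$\epsilon = \sqrt{\frac{\ln(1/\delta)}{2m}}$$

So

$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + \sqrt{\frac{\ln(1/\delta)}{2m}}$$

will hold with probability at least $1 - \delta$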
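To get a feel for the numbers, the deviation term for a single hypothesis can be computed by solving exp(−2mε²) = δ for ε. A minimal sketch with an illustrative function name:

```python
import math

def hoeffding_epsilon(m, delta):
    """Deviation eps solving exp(-2 * m * eps^2) = delta, i.e. sqrt(ln(1/delta) / (2m))."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))

for m in (100, 1000, 10000):
    print(f"m={m:6d}  eps={hoeffding_epsilon(m, delta=0.05):.4f}")
```

The gap between empirical and true error shrinks like 1/√m, so a hundredfold increase in data buys a tenfold tighter bound.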

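Intuition of Bound: Bounding all hypotheses in set H

- So we have bounded the difference between the generalization error and the empirical risk for a single hypothesis
- How do we bound the difference for all hypotheses in $H$?

Theorem: Let $H$ be a finite space of hypotheses and assume that a random training set of size $m$ is chosen. Then for any $\epsilon > 0$,

$$\Pr[\exists h \in H : \mathrm{err}(h) \ge \widehat{\mathrm{err}}(h) + \epsilon] \le |H| e^{-2m\epsilon^2}$$

Again, by setting $\delta = |H| e^{-2m\epsilon^2}$ and solving for $\epsilon$, we have

$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}} \quad \text{for all } h \in H$$

which will hold with probability at least $1 - \delta$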
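The finite-class bound also answers the earlier question of how many examples are needed. The sketch below assumes the union-bound form |H|·exp(−2mε²) ≤ δ and a hypothetical class of 1000 hypotheses; the function names are illustrative.

```python
import math

def finite_class_epsilon(m, num_hypotheses, delta):
    """Uniform deviation sqrt((ln|H| + ln(1/delta)) / (2m)) over a finite class."""
    return math.sqrt((math.log(num_hypotheses) + math.log(1.0 / delta)) / (2.0 * m))

def samples_needed(eps, num_hypotheses, delta):
    """Smallest m for which the deviation above is at most eps."""
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / (2.0 * eps ** 2))

m = samples_needed(eps=0.05, num_hypotheses=1000, delta=0.05)
print("examples needed:", m)
print("deviation at that m:", round(finite_class_epsilon(m, 1000, 0.05), 4))
```

The sample size grows only logarithmically in |H|, which is why even very large finite classes remain learnable.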

Intuition of Bound: Bounding Hypotheses in Infinite Set

- What about cases when $H$ is infinite?
- Even if $H$ is infinite, given a set of $m$ examples the hypotheses in $H$ may only be capable of labeling the examples in a number of ways $< 2^m$
- This implies that though $H$ is infinite, the hypotheses are divided into classes which produce the same labelings. So the effective number of hypotheses is equal to the number of classes
- By using arguments similar to those above, $|H|$ in the bound derived from Hoeffding’s inequality can be replaced with the number of effective hypotheses (or dichotomies) along with some additional constants

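Intuition of Bound: Bounding Hypotheses in Infinite Set

- More formally, let $S = \langle x_1, \dots, x_m \rangle$ be a finite set of examples and define the set of dichotomies $\Pi_H(S)$ to be all possible labelings of $S$ by hypotheses of $H$: $\Pi_H(S) = \{\langle h(x_1), \dots, h(x_m) \rangle : h \in H\}$
- Also define the growth function $\Pi_H(m)$ to be the function which measures the maximum number of dichotomies for any sample of size $m$: $\Pi_H(m) = \max_{S : |S| = m} |\Pi_H(S)|$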

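Intuition of Bound: Bounding Hypotheses in Infinite Set

Theorem: Let $H$ be any space of hypotheses and assume that a random training set of size $m$ is chosen. Then for any $\epsilon > 0$,

$$\Pr[\exists h \in H : \mathrm{err}(h) \ge \widehat{\mathrm{err}}(h) + \epsilon] \le 8\, \Pi_H(m)\, e^{-m\epsilon^2/32}$$

and with probability at least $1 - \delta$,

$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + \sqrt{\frac{32\left(\ln \Pi_H(m) + \ln(8/\delta)\right)}{m}}$$

for all $h \in H$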

Intuition of Bound: Bounding Hypotheses in Infinite Set

- It turns out that the growth function is either polynomial in $m$ or $2^m$
- In the cases where the growth function is polynomial in $m$, the degree of the polynomial is the VC dimension of $H$
- In the case when the growth function is $2^m$, the VC dimension is infinite
- VC dimension: the maximum number of points which can be shattered by $H$. $m$ points are said to be shattered if the hypotheses in $H$ can realize all $2^m$ possible labelings of the points

Intuition of Bound: Bounding Hypotheses in Infinite Set

- The VC dimension turns out to be a very natural measure for the complexity of $H$ and can be used to bound the growth function given $m$ examples

Sauer’s lemma: If $H$ is a hypothesis class of VC dimension $d < \infty$, then for all $m$

$$\Pi_H(m) \le \sum_{i=0}^{d} \binom{m}{i}$$

and $\Pi_H(m) \le \left(\frac{em}{d}\right)^d$ when $m \ge d \ge 1$
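The gap between the 2^m labelings that are possible in principle and the polynomial number allowed by a finite VC dimension can be seen numerically. The sketch below evaluates the two forms of Sauer's lemma, Σ_{i≤d} C(m, i) and (em/d)^d, for a hypothetical class of VC dimension 3; the function names are illustrative.

```python
import math

def sauer_bound(m, d):
    """Sum of C(m, i) for i = 0..d: upper bound on the growth function for VC dimension d."""
    return sum(math.comb(m, i) for i in range(d + 1))

def polynomial_bound(m, d):
    """(e * m / d)^d, valid when m >= d >= 1."""
    return (math.e * m / d) ** d

d = 3
for m in (3, 10, 100):
    print(f"m={m:3d}  2^m={2 ** m}  sauer={sauer_bound(m, d)}  (em/d)^d={polynomial_bound(m, d):.1f}")
```

Up to m = d the bound equals 2^m, and beyond that it grows only like m^d.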

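Intuition of Bound: Bounding Hypotheses in Infinite Set

Theorem: Let $H$ be a hypothesis space of VC dimension $d < \infty$ and assume that a random training set of size $m \ge d \ge 1$ is chosen. Then for any $\epsilon > 0$,

$$\Pr[\exists h \in H : \mathrm{err}(h) \ge \widehat{\mathrm{err}}(h) + \epsilon] \le 8 \left(\frac{em}{d}\right)^d e^{-m\epsilon^2/32}$$

Thus with probability at least $1 - \delta$,

$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + O\left(\sqrt{\frac{d \ln(m/d) + \ln(1/\delta)}{m}}\right) \quad \text{for all } h \in H$$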

Intuition of Bound: AdaBoost Generalization Error

- The first bound for the AdaBoost generalization error follows from Sauer’s lemma in a similar way
- The growth function for AdaBoost is bounded using the VC dimension of the base classifiers and the VC dimension for the set of all possible base classifier weight combinations
- Again these bounds can be used in conjunction with Hoeffding’s inequality to bound the generalization error

Generalization Error First Bound

- Think of T and d as variables which come together to define the complexity of the ensemble of hypotheses. The log factors and constants are ignored.
- This bound implies that AdaBoost is likely to overfit if run for too many boosting rounds.
- Contradicting the spirit of this bound, empirical results suggest that AdaBoost tends to decrease generalization error even after training error reaches zero.

Example

- The graph shows boosting used on C4.5 to identify images of handwritten characters. This experiment was carried out by Robert Schapire et al.
- Even as the training error goes to zero and AdaBoost has been run for 1000 rounds the test error continues to decrease

[Figure: training error and test error plotted against the number of boosting rounds, with the C4.5 test error shown for reference]

Margin

- AdaBoost’s resistance to overfitting is attributed to the confidence with which predictions are made
- This notion of confidence is quantified by the margin
- The margin takes values between −1 and +1; it is positive when the example is classified correctly and negative when it is misclassified
- The magnitude of the margin can be viewed as a measure of confidence
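A common way to make this concrete is the normalized margin y · Σ α_t h_t(x) / Σ α_t, which is the definition assumed in the sketch below. The classifier weights are hypothetical values chosen for illustration.

```python
def margin(alphas, predictions, label):
    """Normalized margin y * sum(alpha_t * h_t(x)) / sum(alpha_t), assuming alpha_t >= 0."""
    vote = sum(a * p for a, p in zip(alphas, predictions))
    return label * vote / sum(alphas)

alphas = [0.42, 0.65, 0.92]  # hypothetical classifier weights

print(margin(alphas, [+1, +1, +1], +1))  # unanimous and correct
print(margin(alphas, [-1, +1, +1], +1))  # correct, with one dissenting vote
print(margin(alphas, [+1, -1, -1], +1))  # the weighted majority is wrong
```

A unanimous correct vote gives the largest possible margin, dissent lowers it, and a wrong weighted majority makes it negative.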

Generalization Error

- In response to empirical findings Schapire et al. derived a new bound.
- This new bound is defined in terms of the margins, VC dimension and sample size but not the number of rounds or training error
- This bound suggests that higher margins are preferable for lower generalization error

Relation to Support Vector Machines

- The boosting margins theory turns out to have a strong connection with support vector machines
- Imagine that we have already found all weak classifiers and we wish to find the optimal set of weights with which to combine them
- The optimal weights would be the weights that maximize the minimum margin

Relation to Support Vector Machines

- Both support vector machines and boosting can be seen as trying to optimize the same objective function
- Both attempt to maximize the minimum margin
- The difference is the norms used by Boosting and SVM

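Boosting norms: $\|\alpha\|_1$ for the weight vector and $\|\mathbf{h}(x)\|_\infty$ for the vector of weak classifier predictions

SVM norms: $\|\alpha\|_2$ for the weight vector and $\|\mathbf{h}(x)\|_2$ for the instance vector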

Relation to Support Vector Machines

- Effects of different norms
- Different norms can lead to very different results, especially in high dimensional spaces
- Different computation requirements
- SVM corresponds to quadratic programming while AdaBoost corresponds to linear programming
- Difference in finding linear classifiers for high dimensional spaces
- SVMs use kernels to perform low dimensional calculations which are equivalent to inner products in high dimensions
- Boosting employs greedy search, using weight redistributions and weak classifiers to find coordinates highly correlated with sample labels

AdaBoost Examples and Results

CSE 5095: Special Topics Course

Yousra Almathami

Computer Science and Engineering Dept.

The Rules for Boosting

- Set all weights of training examples equal
- Train a weak learner on the weighted examples
- Check how well the weak learner performs on data and give it a weight based on how well it did
- Re-weight training examples and repeat
- When done, predict by weighted majority vote
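The five rules above can be put together in a short, self-contained sketch. It assumes the standard AdaBoost formulas, uses one-dimensional threshold "stumps" as the weak learner, and runs on a made-up toy dataset; all function names are illustrative and this is not the code behind the lecture's demos.

```python
import math

def stump_predict(x, threshold, sign):
    """Decision stump on a 1-D input: predicts `sign` if x > threshold, else -sign."""
    return sign if x > threshold else -sign

def best_stump(xs, ys, weights):
    """Exhaustively pick the (threshold, sign) with the lowest weighted error."""
    points = sorted(set(xs))
    thresholds = [points[0] - 1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for threshold in thresholds:
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def adaboost(xs, ys, rounds):
    m = len(xs)
    weights = [1.0 / m] * m                        # rule 1: equal weights
    ensemble = []                                  # list of (alpha, threshold, sign)
    for _ in range(rounds):
        err, threshold, sign = best_stump(xs, ys, weights)   # rule 2: train weak learner
        if err >= 0.5:                             # no better than random guessing: stop
            break
        err = max(err, 1e-10)                      # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)    # rule 3: weight the weak learner
        ensemble.append((alpha, threshold, sign))
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]      # rule 4: re-weight examples
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Rule 5: sign of the alpha-weighted vote."""
    vote = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if vote >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [+1, +1, +1, -1, -1, -1, +1, +1, +1, +1]      # no single stump fits this pattern
model = adaboost(xs, ys, rounds=3)
train_errors = sum(predict(model, x) != y for x, y in zip(xs, ys))
print("rounds used:", len(model), "training mistakes:", train_errors)
```

On this toy data no single stump does better than 30% error, yet the weighted vote of three stumps classifies every training point correctly.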

Overview of Adaboost

Taken from Bishop

Toy Example

5 Positive examples

5 Negative examples

2-Dimensional plane

Weak hyps: linear separators

3 iterations

All given equal weights

Taken from Schapire

Final Classifier learned by Boosting

Final Classifier: integrate the three “weak” classifiers and obtain a final strong classifier.

Taken from Schapire

Boosting Demo

Online Demo taken from www.Mathworks.com by Richard Stapenhurst

The Idea

- Everything covered so far has been the binary (two-class) classification problem. What happens when dealing with more than two classes?
- What changes in the problem?
- y ∈ {-1,+1} → y ∈ {1, 2, …, k}
- Random guess value changes from ½ to 1/k
- Weak learning classifiers need to be updated
- Can we update the weak learning classifiers to just have an accuracy > 1/k + γ?
- There are cases where this condition is satisfied but there is no way to drive training error to 0, making boosting impossible
- THIS IS TOO WEAK!

AdaBoost.M1

- Almost the same algorithm as regular AdaBoost
- Advantage:
- Works similarly to binary AdaBoost but on multiclass problems
- Disadvantage:
- Boosting is only possible if each weak hypothesis has error slightly better than ½
- For k = 2, slightly better than ½ represents a better than random guess. What about k > 2?
- TOO STRONG! (unless weak learner is strong)

An Alternative Approach

- Can we create multiple binary problems out of a multiclass problem?
- For example: for xi, is the correct label yi or y′?
- k − 1 binary problems for each example
- h(x,y) = 1 if h believes y is the label for x, 0 otherwise
- h(xi, yi)=0, h(xi, y′)=1 → h says y′ is correct (wrong)
- h(xi, yi)=1, h(xi, y′)=0 → h says yi is correct (right)
- h(xi, yi)=h(xi, y′) → uninformative
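The reduction to k − 1 binary comparisons per example can be sketched as below. The weak hypothesis `h`, the label set, and the function name are hypothetical and only illustrate the three cases (right, wrong, uninformative) listed above.

```python
def pairwise_outcomes(h, x, true_label, labels):
    """Compare the true label against each of the k-1 other labels using h(x, y) in {0, 1}."""
    outcomes = {}
    for other in labels:
        if other == true_label:
            continue
        on_true, on_other = h(x, true_label), h(x, other)
        if on_true == on_other:
            outcomes[other] = "uninformative"
        elif on_true > on_other:
            outcomes[other] = "right"
        else:
            outcomes[other] = "wrong"
    return outcomes

def h(x, y):
    """Hypothetical weak hypothesis: accepts the labels 'cat' and 'dog' for every input."""
    return 1 if y in ("cat", "dog") else 0

print(pairwise_outcomes(h, "some image", "cat", ["cat", "dog", "bird", "fish"]))
```

Here the hypothesis cannot separate "cat" from "dog" but correctly prefers "cat" over the other two labels, so it is still useful on two of the three binary problems.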

AdaBoost.MR

- Generalized to allow multiple labels per example
- Different initial distribution
- ht: X × Y → ℝ
- ht used to rank labels for a given example
- Now have ranking loss instead of error rate

Additional Algorithms

- AdaBoost.MH
- One-against-all
- Requires strong weak learning conditions
- AdaBoost.MO
- Runs MH as part of algorithm and uses strong classifier to generate alternative strong classifiers which can perform an extra voting step
- Still requires a strong weak learning condition
- SAMME
- Allows weak learners slightly better than random, Cost matrix instead of weights and a different equation for weak classifier combination
- Conditions can be too weak for strong margins

Take Home

- Yes, it is possible
- There are many multiclass boosting algorithms available
- No, there is no 'one size fits all' multiclass algorithm
