# Machine Learning - PowerPoint PPT Presentation

Presentation Transcript
Machine Learning

CSE 5095: Special Topics Course

Boosting

Nhan Nguyen

Computer Science and Engineering Dept.

Boosting
• Method for converting rules of thumb into a prediction rule.
• Rule of thumb?
• Method?
Binary Classification
• X: set of all possible instances or examples.

- e.g., Possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast

• c: X → {0,1}: the target concept to learn.

- e.g., c: EnjoySport → {0,1}

• H: set of concept hypotheses

- e.g., conjunctions of literals: <?,Cold,High,?,?,?>

• C: concept class, a set of target concepts c.
• D: target distribution, a fixed probability distribution over X. Training and test examples are drawn according to D.
Binary Classification
• S: training sample

<x1,c(x1)>,…,<xm,c(xm)>

• The learning algorithm receives sample S and selects a hypothesis from H approximating c.

- Find a hypothesis h ∈ H such that h(x) = c(x) ∀x ∈ S

Errors
• True error or generalization error of h with respect to the target concept c and distribution D:

R[h] = Pr_{x∼D}[h(x) ≠ c(x)]

• Empirical error: average error of h on the training sample S drawn according to distribution D:

R̂[h] = Pr_{x∼S}[h(x) ≠ c(x)] = (1/m) Σi 1[h(xi) ≠ c(xi)]
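A minimal sketch of the empirical-error computation (the threshold hypothesis and labeled sample below are invented for illustration):

```python
# Empirical error: fraction of training examples a hypothesis h misclassifies.
def empirical_error(h, sample):
    """sample is a list of (x, c_x) pairs; returns (1/m) * sum of 1[h(x) != c(x)]."""
    return sum(1 for x, c_x in sample if h(x) != c_x) / len(sample)

# Hypothetical hypothesis and labeled sample (labels in {0, 1}).
h = lambda x: 1 if x > 0.5 else 0
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.3, 1)]

print(empirical_error(h, S))  # 1 of the 5 examples is misclassified -> 0.2
```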

Errors
• Questions:
• Can we bound the true error of a hypothesis given only its training error?
• How many examples are needed for a good approximation?
Approximate Concept Learning
• Requiring a learner to acquire the right concept is too strict
• Instead, we will allow the learner to produce a good approximation to the actual concept
General Assumptions
• Assumption: Examples are generated according to a probability distribution D(x) and labeled according to an unknown function c: y = c(x)
• Learning Algorithm: from a set of m examples, outputs a hypothesis h ∈ H that is consistent with those examples (i.e., correctly classifies all of them).
• Goal: h should have a low error rate on new examples drawn from the same distribution D.

R[h] = Pr_{x∼D}[h(x) ≠ c(x)]

PAC learning model
• PAC learning: Probably Approximately Correct learning
• The goal of the learning algorithm: to do optimization over S to find some hypothesis h: X → {0,1} in H that approximates c, and we want h to have small true error R[h]
• If R[h] is small, h is “probably approximately correct”.
• Formally, h is PAC if

Pr[R[h] ≤ ε] ≥ 1 − δ

for all c ∈ C, ε > 0, δ > 0, and all distributions D

PAC learning model

Concept class C is PAC-learnable if there is a learner L that outputs a hypothesis h ∈ H, for all c ∈ C, ε > 0, δ > 0, and all distributions D, such that:

Pr[R[h] ≤ ε] ≥ 1 − δ

using at most poly(1/ε, 1/δ, size(X), size(c)) examples and running time.

ε: accuracy, 1 − δ: confidence.

Such an L is called a strong learner

PAC learning model
• Learner L is a weak learner if L outputs a hypothesis h ∈ H such that:

Pr[R[h] ≤ ½ − γ] ≥ 1 − δ

for some fixed γ > 0, for all c ∈ C, δ > 0, and all distributions D

• A weak learner only outputs a hypothesis that performs slightly better than random guessing
• Hypothesis boosting problem: can we “boost” a weak learner into a strong learner?
• Rule of thumb ~ weak learner
• Method ~ boosting
Boosting a weak learner – Majority Vote
• L learns hypothesis h1 on the first N training points
• L randomly filters the next batch of training points, extracting N/2 points correctly classified by h1 and N/2 incorrectly classified, and produces h2.
• L builds a third training set of N points on which h1 and h2 disagree, and produces h3.
• L outputs h = Majority Vote(h1, h2, h3)
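The final step above is a plain majority vote, which can be sketched as follows (the three weak hypotheses are stand-ins):

```python
# Majority vote of three binary classifiers (labels in {0, 1}).
def majority_vote(h1, h2, h3):
    return lambda x: 1 if h1(x) + h2(x) + h3(x) >= 2 else 0

# Hypothetical weak hypotheses on integer inputs.
h1 = lambda x: 1 if x > 2 else 0
h2 = lambda x: 1 if x > 4 else 0
h3 = lambda x: 1 if x % 2 == 0 else 0

h = majority_vote(h1, h2, h3)
print([h(x) for x in range(1, 7)])  # [0, 0, 0, 1, 1, 1]
```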
Boosting [Schapire ’89]

Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.

A formal description of boosting:

• Given training set (x1, y1), …, (xm, ym)
• yi ∈ {−1, +1}: correct label of xi ∈ X
• for t = 1, …, T:
• construct distribution Dt on {1, …, m}
• find weak hypothesis ht: X → {−1, +1} with small error εt on Dt
• output final hypothesis Hfinal
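The loop above can be sketched concretely. Below is a minimal AdaBoost with one-dimensional threshold stumps as the weak learner; the dataset and the stump family are made up for illustration:

```python
import math

def adaboost(X, y, T):
    """X: list of floats; y: labels in {-1, +1}. Returns a list of (alpha_t, h_t)."""
    m = len(X)
    D = [1.0 / m] * m                                  # D_1: uniform over examples
    ensemble = []
    for _ in range(T):
        # Weak learner: exhaustively pick the threshold stump with the smallest
        # weighted error under the current distribution D_t.
        best_err, best_h = None, None
        for thr in X:
            for sign in (+1, -1):
                h = lambda x, t=thr, s=sign: s if x > t else -s
                err = sum(D[i] for i in range(m) if h(X[i]) != y[i])
                if best_err is None or err < best_err:
                    best_err, best_h = err, h
        eps = min(max(best_err, 1e-10), 1 - 1e-10)     # keep the log finite
        alpha = 0.5 * math.log((1 - eps) / eps)        # classifier importance
        ensemble.append((alpha, best_h))
        # Reweight: up-weight mistakes, down-weight correct answers, normalize.
        D = [D[i] * math.exp(-alpha * y[i] * best_h(X[i])) for i in range(m)]
        Z = sum(D)
        D = [w / Z for w in D]
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

X = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
y = [-1, -1, +1, -1, +1, +1]         # no single stump classifies this perfectly
ens = adaboost(X, y, T=6)
print([predict(ens, x) for x in X])  # recovers y: [-1, -1, 1, -1, 1, 1]
```

No single stump separates this data, but after a few rounds the weighted vote of stumps fits it exactly, which is the point of the boosting loop.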
Boosting

[Schematic: the training sample yields h1(x); each reweighted sample yields the next hypothesis h2(x), …, hT(x); the final hypothesis is H(x) = sign[Σt αt ht(x)].]

Boosting algorithms
• LPBoost (Linear Programming Boosting)
• BrownBoost
• MadaBoost (modifying the weighting system of AdaBoost)
• LogitBoost
Lecture
• Motivating example
• Training Error
• Overfitting
• Generalization Error
• Examples of Adaboost
• Multiclass for weak learner
Machine Learning

Proof of Bound on Adaboost Training Error

Aaron Palmer

Theorem: 2 Class Error Bounds
• Assume εt = ½ − γt
• εt = error rate on round t of boosting
• γt = how much better than random guessing
• γt: a small, positive number
• Training error is bounded by

err(Hfinal) ≤ Πt 2√(εt(1 − εt)) = Πt √(1 − 4γt²) ≤ exp(−2 Σt γt²)

Implications?
• T = number of rounds of boosting
• εt and γt do not need to be known in advance
• As long as γt > 0, the training error will decrease exponentially as a function of T
Proof Part II: training error

Training error(Hfinal) = (1/m) Σi 1[Hfinal(xi) ≠ yi]

≤ (1/m) Σi exp(−yi f(xi)), where f(x) = Σt αt ht(x)

= Σi DT+1(i) Πt Zt

= Πt Zt

Proof Part III:

Zt = Σi Dt(i) exp(−αt yi ht(xi))

Set dZt/dαt equal to zero and solve for αt:

αt = ½ ln((1 − εt)/εt)

Plug back into Zt:

Zt = 2√(εt(1 − εt))

Part III: Continued

Plug in the definition εt = ½ − γt:

Zt = √(1 − 4γt²)

Exponential Bound
• Use property 1 + x ≤ e^x
• Take x to be −4γt²
• Πt √(1 − 4γt²) ≤ Πt exp(−2γt²) = exp(−2 Σt γt²)
Putting it together
• We’ve proven that the training error drops exponentially fast as a function of the number of base classifiers combined
• Bound is pretty loose
Example:
• Suppose that all γt are at least 10%, so that no ht has an error rate above 40%
• What upper bound does the theorem place on the training error?
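Plugging in: with every γt ≥ 0.1, the theorem bounds the training error by exp(−2 Σt γt²) ≤ exp(−0.02·T). A quick check of this worked example:

```python
import math

# Bound from the theorem: training error <= exp(-2 * sum of gamma_t^2).
# With every gamma_t >= 0.1, this is at most exp(-0.02 * T).
def error_bound(T, gamma=0.1):
    return math.exp(-2 * T * gamma ** 2)

for T in (10, 100, 500):
    print(T, error_bound(T))
# T = 100 gives exp(-2) ≈ 0.135; T = 500 gives exp(-10) ≈ 4.5e-5
```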
Overfitting?
• Does the proof say anything about overfitting?
• While this is good theory, can we get anything useful out of this proof as far as dealing with unseen data?

### Boosting

Ayman Alharbi

Example (Spam emails)

* problem: filter out spam (junk email)

- Gather a large collection of examples of spam and non-spam

From: Jinbo@engr.uconn.edu

“can you review a paper” ...

non-spam

From: XYZ@F.U

“Win 10000$ easily !!” ...

spam

Example (Spam emails)

If ‘buy now’ occurs in message, then predict ‘spam’

Main Observation

- Easy to find “rules of thumb” that are “often” correct

- Hard to find single rule that is very highly accurate

Example (Phone Cards)

Goal: automatically categorize type of call requested by phone customer

(Collect, CallingCard, PersonToPerson, etc.)

- Yes I’d like to place a collect call long distance please

Main Observation

(Collect)

Easy to find

“rules of thumb”

that are “often” correct

- operator I need to make a call but I need to bill it to my

office

(ThirdNumber)

If ‘bill’ occurs in

utterance, then predict

‘BillingCredit’

- I just called the wrong number and I would like to have that taken off of my bill

(BillingCredit)

Hard to find single rule that is very highly accurate

The Boosting Approach
• Devise computer program for deriving rough rules of thumb
• Apply procedure to subset of emails
• Obtain rule of thumb
• Apply to 2nd subset of emails
• Obtain 2nd rule of thumb
• Repeat T times
Details
• How to choose examples on each round?

- Concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)

• How to combine rules of thumb into a single prediction rule?

- Take (weighted) majority vote of rules of thumb !!

• Can prove: if we can always find weak rules of thumb slightly better than random guessing, then we can learn almost perfectly using boosting
Idea

• At each iteration t :

– Weight each training example by how incorrectly

it was classified

– Learn a hypothesis – ht

– Choose a strength for this hypothesis – αt

Final classifier: weighted combination of

weak learners

Idea : given a set of weak learners, run them multiple times on (reweighted) training data, then let learned classifiers vote

Boosting

Presenter: Karl Severin

Computer Science and Engineering Dept.

Boosting Overview
• Goal: Form one strong classifier from multiple weak classifiers.
• Proceeds in rounds iteratively producing classifiers from weak learner.
• Increases weight given to incorrectly classified examples.
• Gives importance to classifier that is inversely proportional to its weighted error.
• Each classifier gets a vote based on its importance.
Initialize
• Initialize with evenly weighted distribution
• Begin generating classifiers
Error
• Quality of classifier based on its weighted error εt
• The probability that ht will misclassify an example selected according to distribution Dt
• Or the summation of the weights of all misclassified examples
Classifier Importance
• αt measures the importance given to classifier ht: αt = ½ ln((1 − εt)/εt)
• αt > 0 if εt < ½ (εt is assumed to always be < ½)
• αt is inversely related to εt: the smaller the error, the larger the vote
Update Distribution
• Increase weight of misclassified examples
• Decrease weight of correctly classified examples
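The classifier-importance and reweighting steps above can be sketched as follows (labels and predictions in {−1, +1}; the numbers are invented for illustration):

```python
import math

def alpha_of(eps):
    """Importance of a classifier with weighted error eps < 1/2: 0.5*ln((1-eps)/eps)."""
    return 0.5 * math.log((1 - eps) / eps)

def update_distribution(D, y, preds, alpha):
    """D_{t+1}(i) ∝ D_t(i) * exp(-alpha * y_i * h_t(x_i)): up-weights mistakes."""
    new = [d * math.exp(-alpha * yi * pi) for d, yi, pi in zip(D, y, preds)]
    Z = sum(new)                      # normalization constant Z_t
    return [w / Z for w in new]

D = [0.25, 0.25, 0.25, 0.25]
y =     [+1, +1, -1, -1]
preds = [+1, +1, +1, -1]              # third example misclassified
eps = sum(d for d, yi, pi in zip(D, y, preds) if yi != pi)   # weighted error 0.25
a = alpha_of(eps)                                            # 0.5 * ln(3)
print(update_distribution(D, y, preds, a))
# the one misclassified example's weight rises to 0.5
```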
Combine Classifiers
• When classifying a new instance x all of the weak classifiers get a vote weighted by their α
Machine Learning

CSE 5095: Special Topics Course

Instructor: Jinbo Bi

Computer Science and Engineering Dept.

Presenter: Brian McClanahan

Topic: Boosting Generalization Error

Generalization Error
• Generalization error is the true error of a classifier
• Most supervised machine learning algorithms operate under the premise that lowering the training error will lead to a lower generalization error
• For some machine learning algorithms a relationship between the training and generalization error can be defined through a bound
Generalization Error First Bound
• R̂[h]: empirical risk (training error)
• T: number of boosting rounds
• d: VC dimension of the base classifiers
• m: number of training examples
• R[h]: generalization error
• The bound: R[h] ≤ R̂[h] + Õ(√(Td/m))
Intuition of Bound: Hoeffding’s inequality
• Define H to be a finite set of hypotheses which map examples to 0 or 1
• Hoeffding’s inequality: let X1, …, Xm be independent random variables such that Xi ∈ [0, 1]. Denote their average value X̄ = (1/m) Σi Xi. Then for any ε > 0 we have:

Pr[X̄ ≥ E[X̄] + ε] ≤ exp(−2mε²)

• In the context of machine learning, think of the Xi as errors given by a hypothesis h: Xi = 1[h(xi) ≠ yi], where yi is the true label for xi
Intuition of Bound: Hoeffding’s inequality

So X̄ = R̂[h] and E[X̄] = R[h],

and by Hoeffding’s inequality:

Pr[R[h] ≥ R̂[h] + ε] ≤ exp(−2mε²)

Intuition of Bound

If we want to bound the generalization error of a single hypothesis with probability 1 − δ, where δ = exp(−2mε²), then we can solve for ε using δ.

So

R[h] ≤ R̂[h] + √(ln(1/δ) / (2m))

will hold with probability 1 − δ

Intuition of Bound: Bounding all hypotheses in set
• So we have bounded the difference between the generalization error and the empirical risk for a single hypothesis
• How do we bound the difference for all hypotheses in H?

Theorem: Let H be a finite space of hypotheses and assume that a random training set of size m is chosen. Then for any ε > 0:

Pr[∃h ∈ H: R[h] ≥ R̂[h] + ε] ≤ |H| exp(−2mε²)

Again by setting δ = |H| exp(−2mε²) and solving for ε we have

R[h] ≤ R̂[h] + √((ln|H| + ln(1/δ)) / (2m))

which will hold with probability 1 − δ

Intuition of Bound: Bounding hypotheses in infinite set
• What about cases when H is infinite?
• Even if H is infinite, given a set of m examples the hypotheses in H may only be capable of labeling the examples in a number of ways < 2^m
• This implies that though H is infinite, the hypotheses are divided into classes which produce the same labelings. So the effective number of hypotheses is equal to the number of classes
• By using arguments similar to those above, |H| in Hoeffding’s inequality can be replaced with the number of effective hypotheses (or dichotomies), along with some additional constants
Intuition of Bound: Bounding hypotheses in infinite set
• More formally, let S = (x1, …, xm) be a finite set of examples and define the set of dichotomies ΠH(S) to be all possible labelings of S by hypotheses of H
• Also define the growth function ΠH(m) to be the function which measures the maximum number of dichotomies for any sample of size m
Intuition of Bound: Bounding hypotheses in infinite set

Theorem: Let H be any space of hypotheses and assume that a random training set of size m is chosen. Then for any ε > 0 (up to constants):

Pr[∃h ∈ H: R[h] ≥ R̂[h] + ε] ≤ O(ΠH(m) exp(−mε²))

and with probability at least 1 − δ, for all h ∈ H:

R[h] ≤ R̂[h] + O(√((ln ΠH(m) + ln(1/δ)) / m))

Intuition of Bound: Bounding hypotheses in infinite set
• It turns out that the growth function is either polynomial in m or 2^m
• In the cases where the growth function is polynomial in m, the degree of the polynomial is the VC dimension of H
• In the case when the growth function is 2^m, the VC dimension is infinite
• VC dimension: the maximum number of points which can be shattered by H. Points are said to be shattered if the hypotheses in H can realize all possible labelings of the points
Intuition of Bound: Bounding hypotheses in infinite set
• The VC dimension turns out to be a very natural measure for the complexity of H and can be used to bound the growth function given m examples

Sauer’s lemma: If H is a hypothesis class of VC dimension d, then for all m ≥ d:

ΠH(m) ≤ Σ_{i=0}^{d} (m choose i) ≤ (em/d)^d

Intuition of Bound: Bounding hypotheses in infinite set

Theorem: Let H be a hypothesis space of VC dimension d and assume that a random training set of size m is chosen. Then with probability at least 1 − δ, for all h ∈ H:

R[h] ≤ R̂[h] + O(√((d ln(m/d) + ln(1/δ)) / m))

Intuition of Bound: Adaboost Generalization Error
• The first bound for the AdaBoost generalization error follows from Sauer’s lemma in a similar way
• The growth function for AdaBoost is bounded using the VC dimension of the base classifiers and the VC dimension of the set of all possible base-classifier weight combinations
• Again these bounds can be used in conjunction with Hoeffding’s inequality to bound the generalization error
Generalization Error First Bound
• Think of T and d as variables which come together to define the complexity of the ensemble of hypotheses. The log factors and constants are ignored.
• This bound implies that AdaBoost is likely to overfit if run for too many boosting rounds.
• Contradicting the spirit of this bound, empirical results suggest that AdaBoost tends to decrease generalization error even after the training error reaches zero.
Example
• The graph shows boosting used on C4.5 to identify images of handwritten characters. This experiment was carried out by Robert Schapire et al.
• Even after the training error goes to zero and AdaBoost has been run for 1000 rounds, the test error continues to decrease

[Plot: training error and test error versus number of boosting rounds, with the C4.5 test error shown as a horizontal reference line. The test curve keeps falling after the training curve reaches zero.]

Margin
• AdaBoost’s resistance to overfitting is attributed to the confidence with which predictions are made
• This notion of confidence is quantified by the margin
• The margin takes values between 1 and -1
• The magnitude of the margin can be viewed as a measure of confidence
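As a sketch, the margin of an example under an ensemble of weighted classifiers (the classifiers and α values here are stand-ins) can be computed as:

```python
# Normalized margin of an example (x, y): y * sum(alpha_t * h_t(x)) / sum(alpha_t).
# It lies in [-1, +1]; a large positive value means a confident, correct vote.
def margin(x, y, ensemble):
    total = sum(a for a, _ in ensemble)
    return y * sum(a * h(x) for a, h in ensemble) / total

# Three hypothetical weak classifiers with importances alpha.
ensemble = [
    (0.9, lambda x: 1 if x > 0 else -1),
    (0.6, lambda x: 1 if x > 2 else -1),
    (0.5, lambda x: 1 if x % 2 == 0 else -1),
]

print(margin(3, +1, ensemble))   # (0.9 + 0.6 - 0.5) / 2.0 = 0.5
print(margin(1, -1, ensemble))   # -(0.9 - 0.6 - 0.5) / 2.0 ≈ 0.1
```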
Generalization Error
• In response to empirical findings Schapire et al. derived a new bound.
• This new bound is defined in terms of the margins, VC dimension and sample size but not the number of rounds or training error
• This bound suggests that higher margins are preferable for lower generalization error
Relation to Support Vector Machines
• The boosting margins theory turns out to have a strong connection with support vector machines
• Imagine that we have already found all weak classifiers and we wish to find the optimal set of weights with which to combine them
• The optimal weights would be the weights that maximize the minimum margin
Relation to Support Vector Machines
• Both support vector machines and boosting can be seen as trying to optimize the same objective function
• Both attempt to maximize the minimum margin
• The difference is the norms used by boosting and SVM

Boosting norms: the weight vector is measured in the ℓ1 norm and the vector of base-classifier outputs in the ℓ∞ norm

SVM norms: both the weight vector and the feature vector are measured in the ℓ2 (Euclidean) norm

Relation to Support Vector Machines
• Effects of different norms
• Different norms can lead to very different results, especially in high dimensional spaces
• Different computation requirements
• SVM corresponds to quadratic programming while AdaBoost corresponds to linear programming
• Difference in finding linear classifiers for high dimensional spaces
• SVMs use kernels to perform low-dimensional calculations which are equivalent to inner products in high dimensions
• Boosting employs greedy search, using weight redistributions and weak classifiers to find coordinates highly correlated with sample labels
AdaBoost Examples and Results

CSE 5095: Special Topics Course

Yousra Almathami

Computer Science and Engineering Dept.

The Rules for Boosting
• Set all weights of training examples equal
• Train a weak learner on the weighted examples
• Check how well the weak learner performs on data and give it a weight based on how well it did
• Re-weight training examples and repeat
• When done, predict by voting by majority

Taken from Bishop

Toy Example

5 Positive examples

5 Negative examples

2-Dimensional plane

Weak hyps: linear separators

3 iterations

All given equal weights

Taken from Schapire

First classifier

Misclassified examples are circled, given more weight

Taken from Schapire

First 2 classifiers

Misclassified examples are circled, given more weight

Taken from Schapire

First 3 classifiers

Misclassified examples are circled, given more weight

Taken from Schapire

Final Classifier learned by Boosting

Final Classifier: integrate the three “weak” classifiers and obtain a final strong classifier.

Taken from Schapire

Boosting Demo

Online Demo taken from www.Mathworks.com by Richard Stapenhurst

Multiclass Classification

for Boosting

Presented By: Chris Kuhn

Computer Science and Engineering Dept.

Machine Learning
The Idea
• Everything covered so far has been the binary two-class classification problem; what happens when dealing with more than two classes?
• What changes in the problem?
• y = {-1,+1} → y = {1, 2, …, k}
• Random guess value changes from ½ to 1/k
• Weak learning classifiers need to be updated
• Can we update the weak learning classifiers to just have an accuracy > 1/k + γ?
• There are cases where this condition is satisfied but there is no way to drive training error to 0 making boosting impossible
• THIS IS TOO WEAK!
• Almost the same algorithm as regular AdaBoost
• Works similar to binary AdaBoost but on multiclass problems
• If each weak hypothesis has error slightly less than ½, then boosting is possible
• For k = 2, error slightly less than ½ is just slightly better than random guessing; what about k > 2?
• TOO STRONG! (unless weak learner is strong)
An Alternative Approach
• Can we create multiple binary problems out of a multiclass problem?
• For example xi is the correct label yi or y`
• K – 1 binary problems for each example
• h(x,y) = 1 if y is the label for x, 0 otherwise
• h(xi, yi)=0, h(xi, y`)=1 → y` is correct (wrong)
• h(xi, yi)=1, h(xi, y`)=0 → yi is correct (right)
• h(xi, yi)=h(xi, y`) → uninformative
• Generalized to allow multiple labels per example
• Different initial distribution
• ht : X × Y → real number
• ht used to rank labels for a given example
• Now have ranking loss instead of error rate