Presentation Transcript
machine learning
Machine Learning

CSE 5095: Special Topics Course

Boosting

Nhan Nguyen

Computer Science and Engineering Dept.

boosting
Boosting
  • Method for converting rules of thumb into a prediction rule.
  • Rule of thumb?
  • Method?
binary classification
Binary Classification
  • X: set of all possible instances or examples.

- e.g., Possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast

  • c: X → {0,1}: the target concept to learn.

- e.g., c: EnjoySport → {0,1}

  • H: set of concept hypotheses

- e.g., conjunctions of literals: <?,Cold,High,?,?,?>

  • C: concept class, a set of target concepts c.
  • D: target distribution, a fixed probability distribution over X. Training and test examples are drawn according to D.
binary classification1
Binary Classification
  • S: training sample

<x1,c(x1)>,…,<xm,c(xm)>

  • The learning algorithm receives sample S and selects a hypothesis from H approximating c.

- Find a hypothesis h ∈ H such that h(x) = c(x) for all x ∈ S
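To make the hypothesis-space idea concrete, here is a minimal sketch of checking whether a conjunction-of-literals hypothesis such as <?,Cold,High,?,?,?> is consistent with a training sample. The attribute encoding, the toy sample, and the helper names are illustrative assumptions, not taken from the slides:

```python
# Hypothetical encoding: an instance is a tuple of attribute values
# (Sky, AirTemp, Humidity, Wind, Water, Forecast); a hypothesis is a tuple of
# required values where '?' means "any value is acceptable".
def h(hypothesis, x):
    """Conjunction of literals: returns 1 iff x satisfies every literal."""
    return int(all(lit == '?' or lit == val for lit, val in zip(hypothesis, x)))

def consistent(hypothesis, sample):
    """True iff h(x) = c(x) for every labeled example (x, c(x)) in S."""
    return all(h(hypothesis, x) == label for x, label in sample)

# Toy training sample S = <x1, c(x1)>, ..., <xm, c(xm)>
S = [
    (('Sunny', 'Cold', 'High', 'Strong', 'Warm', 'Same'), 1),
    (('Rainy', 'Warm', 'High', 'Strong', 'Cool', 'Change'), 0),
]
print(consistent(('?', 'Cold', 'High', '?', '?', '?'), S))  # True
```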

errors
Errors
  • True error or generalization error of h with respect to the target concept c and distribution D:

$\mathrm{err}_D[h] = \Pr_{x \sim D}[h(x) \neq c(x)]$

  • Empirical error: average error of h on the training sample S drawn according to distribution D,

$\widehat{\mathrm{err}}_S[h] = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}[h(x_i) \neq c(x_i)]$
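As a quick illustration of the two quantities, the empirical error is just the average disagreement on S, while the true error can only be approximated, e.g. by Monte Carlo sampling from D. The target concept, hypothesis, and distribution below are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
c = lambda x: (x > 0.5).astype(int)          # unknown target concept (toy stand-in)
h = lambda x: (x > 0.6).astype(int)          # learned hypothesis (toy stand-in)
draw_D = lambda n: rng.uniform(0, 1, n)      # target distribution D over X

S = draw_D(50)                               # training sample drawn from D
empirical_error = np.mean(h(S) != c(S))      # (1/m) * number of disagreements on S

big_sample = draw_D(1_000_000)               # Monte Carlo estimate of err_D[h]
true_error_estimate = np.mean(h(big_sample) != c(big_sample))
print(empirical_error, true_error_estimate)  # true error is 0.1 for this toy pair
```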

errors1
Errors
  • Questions:
    • Can we bound the true error of a hypothesis given only its training error?
    • How many examples are needed for a good approximation?
approximate concept learning
Approximate Concept Learning
  • Requiring a learner to acquire the right concept is too strict
  • Instead, we will allow the learner to produce a good approximation to the actual concept
general assumptions
General Assumptions
  • Assumption: Examples are generated according to a probability distribution D(x) and labeled according to an unknown function c: y = c(x)
  • Learning Algorithm: from a set of m examples, outputs a hypothesis h ∈ H that is consistent with those examples (i.e., correctly classifies all of them).
  • Goal: h should have a low error rate on new examples drawn from the same distribution D.

$\mathrm{err}_D[h] = \Pr_{x \sim D}[h(x) \neq c(x)]$

pac learning model
PAC learning model
  • PAC learning: Probably Approximately Correct learning
  • The goal of the learning algorithm: optimize over S to find some hypothesis h: X → {0,1} in H that approximates c, and we want h to have small true error $\mathrm{err}_D[h]$
  • If $\mathrm{err}_D[h]$ is small, h is "probably approximately correct".
  • Formally, h is PAC if

$\Pr[\mathrm{err}_D[h] \leq \epsilon] \geq 1 - \delta$

for all c ∈ C, ε > 0, δ > 0, and all distributions D

pac learning model1
PAC learning model

Concept class C is PAC-learnable if there is a learner L that outputs a hypothesis h ∈ H for all c ∈ C, ε > 0, δ > 0, and all distributions D such that:

$\Pr[\mathrm{err}_D[h] \leq \epsilon] \geq 1 - \delta$

using at most poly(1/ε, 1/δ, size(X), size(c)) examples and running time.

ε: accuracy, 1 − δ: confidence.

Such an L is called a strong learner.
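As a rough illustration of why "poly(1/ε, 1/δ, …)" examples can suffice, a standard bound for a consistent learner over a finite hypothesis class (not stated on the slides, so take it as an assumed aside) is m ≥ (1/ε)(ln|H| + ln(1/δ)). The hypothesis-space size below is a hypothetical value:

```python
import math

def sample_complexity(eps, delta, hypothesis_space_size):
    """Examples sufficient for a consistent learner over a finite H to be
    probably (1 - delta) approximately (error <= eps) correct."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1 / delta)) / eps)

# e.g. a small conjunctive hypothesis space over 6 attributes (hypothetical |H|)
print(sample_complexity(eps=0.05, delta=0.01, hypothesis_space_size=3**6))
```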

pac learning model2
PAC learning model
  • Learner L is a weak learner if L outputs a hypothesis h ∈ H such that:

$\Pr[\mathrm{err}_D[h] \leq \tfrac{1}{2} - \gamma] \geq 1 - \delta$

for all c ∈ C, some γ > 0, δ > 0, and all distributions D

  • A weak learner only outputs a hypothesis that performs slightly better than random guessing
  • Hypothesis boosting problem: can we "boost" a weak learner into a strong learner?
  • Rule of thumb ~ weak learner
  • Method ~ a boosting algorithm
boosting a weak learner majority vote
Boosting a weak learner – Majority Vote
  • L learns hypothesis h1 on the first N training points
  • L randomly filters the next batch of training points, extracting N/2 points correctly classified by h1 and N/2 points incorrectly classified, and produces h2.
  • L builds a third training set of N points on which h1 and h2 disagree, and produces h3.
  • L outputs h = Majority Vote(h1, h2, h3), as sketched below.
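A minimal sketch of this filtering scheme, assuming a decision stump as the weak learner and a synthetic data pool; both are placeholder choices, and the original construction draws examples from an oracle rather than a fixed array:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
N = 500

def weak_learner(Xs, ys):
    # depth-1 decision tree ("stump") as a stand-in weak learner
    return DecisionTreeClassifier(max_depth=1, random_state=0).fit(Xs, ys)

# 1) h1: learned on the first N points
h1 = weak_learner(X[:N], y[:N])

# 2) h2: learned on a filtered batch, half classified correctly by h1, half not
pool_X, pool_y = X[N:2000], y[N:2000]
correct = h1.predict(pool_X) == pool_y
k = min(N // 2, correct.sum(), (~correct).sum())   # guard for small pools
idx2 = np.r_[np.flatnonzero(correct)[:k], np.flatnonzero(~correct)[:k]]
h2 = weak_learner(pool_X[idx2], pool_y[idx2])

# 3) h3: learned on points where h1 and h2 disagree (assumes some disagreement exists)
rest_X, rest_y = X[2000:], y[2000:]
disagree = h1.predict(rest_X) != h2.predict(rest_X)
h3 = weak_learner(rest_X[disagree], rest_y[disagree])

# 4) h = Majority Vote(h1, h2, h3)
def h(Xq):
    votes = np.stack([clf.predict(Xq) for clf in (h1, h2, h3)])
    return (votes.sum(axis=0) >= 2).astype(int)

print("majority-vote training accuracy:", (h(X) == y).mean())
```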
boosting schapire 89
Boosting [Schapire ’89]

Idea: given a weak learner, run it multiple times on (reweighted) training data, then let learned classifiers vote.

A formal description of boosting:

  • Given training set $(x_1, y_1), \ldots, (x_m, y_m)$
  • $y_i \in \{-1, +1\}$: correct label of $x_i \in X$
  • for t = 1, …, T:
    • construct distribution $D_t$ on {1, …, m}
    • find weak hypothesis $h_t: X \to \{-1, +1\}$ with small error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$ on $D_t$
  • output final hypothesis $H_{\mathrm{final}}$
boosting1
Boosting

[Diagram] Training sample → weak hypothesis h₁(x); weighted sample → h₂(x); … ; weighted sample → h_T(x); final hypothesis H(x) = sign(Σt αt ht(x))

boosting algorithms
Boosting algorithms
  • AdaBoost (Adaptive Boosting)
  • LPBoost (Linear Programming Boosting)
  • BrownBoost
  • MadaBoost (modifying the weighting system of AdaBoost)
  • LogitBoost
lecture
Lecture
  • Motivating example
  • Adaboost
  • Training Error
  • Overfitting
  • Generalization Error
  • Examples of Adaboost
  • Multiclass classification for boosting
machine learning1
Machine Learning

Proof of Bound on Adaboost Training Error

Aaron Palmer

theorem 2 class error bounds
Theorem: 2 Class Error Bounds
  • Assume $\epsilon_t = \tfrac{1}{2} - \gamma_t$
    • $\epsilon_t$ = error rate on round t of boosting
    • $\gamma_t$ = how much better than random guessing
      • a small, positive number
  • Training error is bounded by

$\mathrm{err}_S(H_{\mathrm{final}}) \leq \prod_{t=1}^{T} \Big[\, 2\sqrt{\epsilon_t(1-\epsilon_t)} \,\Big] = \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \leq \exp\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big)$

implications
Implications?
    • T = number of rounds of boosting
    • εt and γt do not need to be known in advance
  • As long as γt > 0, the training error will decrease exponentially as a function of T
proof part ii training error
Proof Part II: training error

Training error $(H_{\mathrm{final}}) = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\big[H_{\mathrm{final}}(x_i) \neq y_i\big]$

$\leq \frac{1}{m}\sum_{i=1}^{m} \exp\big(-y_i f(x_i)\big)$   (since $\mathbf{1}[y \neq \mathrm{sign}(f)] \leq e^{-yf}$, with $f(x) = \sum_t \alpha_t h_t(x)$)

$= \frac{1}{m}\sum_{i=1}^{m} m\Big(\prod_t Z_t\Big) D_{T+1}(i)$   (unravelling the update $D_{t+1}(i) = D_t(i)\,e^{-\alpha_t y_i h_t(x_i)}/Z_t$)

$= \prod_t Z_t$

proof part iii
Proof Part III:

$Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t}$

Set $\frac{dZ_t}{d\alpha_t}$ equal to zero and solve for $\alpha_t$:

$\alpha_t = \frac{1}{2}\ln\Big(\frac{1-\epsilon_t}{\epsilon_t}\Big)$

Plug back into $Z_t$:

$Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$

part iii continued
Part III: Continued

Plug in the definition $\epsilon_t = \tfrac{1}{2} - \gamma_t$:

$Z_t = 2\sqrt{\big(\tfrac{1}{2}-\gamma_t\big)\big(\tfrac{1}{2}+\gamma_t\big)} = \sqrt{1 - 4\gamma_t^2}$

exponential bound
Exponential Bound
  • Use the property $1 + x \leq e^x$
  • Take x to be $-4\gamma_t^2$
  • $\sqrt{1 - 4\gamma_t^2} \leq \sqrt{e^{-4\gamma_t^2}} = e^{-2\gamma_t^2}$, so $\prod_t Z_t \leq \exp\big(-2\textstyle\sum_t \gamma_t^2\big)$
putting it together
Putting it together
  • We’ve proven that the training error drops exponentially fast as a function of the number of base classifiers combined
  • Bound is pretty loose
example
Example:
  • Suppose that all γt are at least 10%, so that no ht has an error rate above 40%
  • What upper bound does the theorem place on the training error?
  • Answer: $\mathrm{err}_S(H_{\mathrm{final}}) \leq \prod_t \sqrt{1 - 4(0.1)^2} \leq e^{-2T(0.1)^2} = e^{-0.02T}$
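A quick numeric check of this bound, using nothing beyond the formula above:

```python
import math

gamma = 0.10                      # every round beats random guessing by at least 10%
for T in (10, 50, 100, 200, 500):
    exact = math.sqrt(1 - 4 * gamma**2) ** T   # product of sqrt(1 - 4*gamma^2) terms
    loose = math.exp(-2 * T * gamma**2)        # exponential upper bound e^{-0.02 T}
    print(f"T={T:3d}  product bound={exact:.4f}  exp bound={loose:.4f}")
```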
overfitting
Overfitting?
  • Does the proof say anything about overfitting?
  • While this is good theory, can we get anything useful out of this proof as far as dealing with unseen data?
boosting2

Boosting

Ayman Alharbi

example spam emails
Example (Spam emails)

* Problem: filter out spam (junk email)

- Gather large collection of examples of spam and non-spam

From: Jinbo@engr.uconn.edu

“can you review a paper” ...

non-spam

From: XYZ@F.U

“Win 10000$ easily !!” ...

spam

example spam emails1
Example (Spam emails)

If ‘buy now’ occurs in message, then predict ‘spam’

Main Observation

- Easy to find "rules of thumb" that are "often" correct

- Hard to find single rule that is very highly accurate
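As a toy illustration of such a rule of thumb, mirroring the 'buy now' rule mentioned above; the messages and labels below are made up:

```python
def rule_of_thumb(message: str) -> str:
    # Weak rule: if 'buy now' occurs in the message, predict 'spam'
    return "spam" if "buy now" in message.lower() else "non-spam"

examples = [
    ("can you review a paper", "non-spam"),
    ("Win 10000$ easily !! buy now", "spam"),
    ("limited offer, buy now", "spam"),
    ("meeting moved to 3pm", "non-spam"),
    ("please don't buy now, wait for the sale", "non-spam"),  # rule gets this wrong
]
accuracy = sum(rule_of_thumb(m) == label for m, label in examples) / len(examples)
print(f"rule-of-thumb accuracy on this toy set: {accuracy:.0%}")  # often, not always, correct
```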

slide32

Example (Phone Cards)

Goal: automatically categorize the type of call requested by a phone customer

(Collect, CallingCard, PersonToPerson, etc.)

- "Yes I'd like to place a collect call long distance please" (Collect)

- "operator I need to make a call but I need to bill it to my office" (ThirdNumber)

- "I just called the wrong number and I would like to have that taken off of my bill" (BillingCredit)

Rule of thumb: if 'bill' occurs in the utterance, then predict 'BillingCredit'

Main Observation

- Easy to find "rules of thumb" that are "often" correct

- Hard to find a single rule that is very highly accurate

the boosting approach
The Boosting Approach
  • Devise computer program for deriving rough rules of thumb
  • Apply procedure to subset of emails
  • Obtain rule of thumb
  • Apply to 2nd subset of emails
  • Obtain 2nd rule of thumb
  • Repeat T times
details
Details
  • How to choose examples on each round?

- Concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)

  • How to combine rules of thumb into a single prediction rule?

- Take (weighted) majority vote of rules of thumb !!

  • Can prove: if we can always find weak rules of thumb slightly better than random guessing, then we can learn almost perfectly using boosting
slide35
Idea

• At each iteration t:

– Weight each training example by how incorrectly it was classified

– Learn a hypothesis ht

– Choose a strength for this hypothesis, αt

Final classifier: weighted combination of weak learners

Idea: given a set of weak learners, run them multiple times on (reweighted) training data, then let learned classifiers vote

boosting3
Boosting

AdaBoost Algorithm

Presenter: Karl Severin

Computer Science and Engineering Dept.

boosting overview
Boosting Overview
  • Goal: Form one strong classifier from multiple weak classifiers.
  • Proceeds in rounds, iteratively producing classifiers from the weak learner.
  • Increases the weight given to incorrectly classified examples.
  • Gives each classifier an importance that decreases as its weighted error grows.
  • Each classifier gets a vote based on its importance.
initialize
Initialize
  • Initialize with an evenly weighted distribution: $D_1(i) = 1/m$ for i = 1, …, m
  • Begin generating classifiers
error
Error
  • Quality of a classifier is based on its weighted error: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$
  • i.e., the probability that ht will misclassify an example selected according to distribution Dt
  • Or, equivalently, the summation of the weights of all misclassified examples
classifier importance
Classifier Importance
  • αt measures the importance given to classifier ht: $\alpha_t = \frac{1}{2}\ln\Big(\frac{1-\epsilon_t}{\epsilon_t}\Big)$
  • αt > 0 if εt < ½ (εt is assumed to always be < ½)
  • αt decreases as εt grows: more accurate classifiers get more importance
update distribution
Update Distribution
  • Update the distribution: $D_{t+1}(i) = \dfrac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is a normalization factor
  • Increase the weight of misclassified examples
  • Decrease the weight of correctly classified examples
combine classifiers
Combine Classifiers
  • When classifying a new instance x, all of the weak classifiers get a vote weighted by their α: $H(x) = \mathrm{sign}\Big(\sum_{t=1}^{T} \alpha_t h_t(x)\Big)$
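Putting the steps above together, here is a compact sketch of AdaBoost with decision stumps; the synthetic data set and the stump weak learner are illustrative choices, not part of the slides:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)                 # labels in {-1, +1}
m = len(y)

T = 50
D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
stumps, alphas = [], []

for t in range(T):
    # weak hypothesis h_t with small weighted error on D_t
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = h.predict(X)
    eps = float(np.clip(D[pred != y].sum(), 1e-12, None))   # epsilon_t (clipped to avoid log 0)
    if eps >= 0.5:                            # weak-learning assumption violated; stop
        break
    alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = (1/2) ln((1 - eps_t)/eps_t)

    # D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
    D *= np.exp(-alpha * y * pred)
    D /= D.sum()

    stumps.append(h)
    alphas.append(alpha)

# H(x) = sign( sum_t alpha_t * h_t(x) )
F = sum(a * clf.predict(X) for a, clf in zip(alphas, stumps))
print("training error:", np.mean(np.sign(F) != y))
```

Each loop iteration mirrors the Error, Classifier Importance, and Update Distribution slides; the final two lines implement the weighted vote H(x).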
machine learning2
Machine Learning

CSE 5095: Special Topics Course

Instructor: Jinbo Bi

Computer Science and Engineering Dept.

Presenter: Brian McClanahan

Topic: Boosting Generalization Error

generalization error
Generalization Error
  • Generalization error is the true error of a classifier
  • Most supervised machine learning algorithms operate under the premise that lowering the training error will lead to a lower generalization error
  • For some machine learning algorithms a relationship between the training and generalization error can be defined through a bound
generalization error first bound
Generalization Error First Bound
  • With high probability, $\mathrm{err}_D(H_{\mathrm{final}}) \leq \widehat{\mathrm{err}}_S(H_{\mathrm{final}}) + \tilde{O}\Big(\sqrt{\tfrac{Td}{m}}\Big)$, where:
  • $\widehat{\mathrm{err}}_S$ – empirical risk (training error)
  • $T$ – boosting rounds
  • $d$ – VC dimension of base classifiers
  • $m$ – number of training examples
  • $\mathrm{err}_D$ – generalization error
intuition of bound hoeffding s inequality
Intuition of Bound: Hoeffding’s inequality
  • Define H to be a finite set of hypotheses which map examples to 0 or 1
  • Hoeffding's inequality: Let $X_1, \ldots, X_m$ be independent random variables such that $X_i \in [0, 1]$. Denote their average value $\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i$. Then for any $\epsilon > 0$ we have: $\Pr\big[\mathbb{E}[\bar{X}] - \bar{X} \geq \epsilon\big] \leq e^{-2m\epsilon^2}$
  • In the context of machine learning, think of the $X_i$ as the errors made by a hypothesis h: $X_i = \mathbf{1}[h(x_i) \neq c(x_i)]$, where $c(x_i)$ is the true label for $x_i$.
intuition of bound hoeffding s inequality1
Intuition of Bound: Hoeffding’s inequality

So $\bar{X} = \widehat{\mathrm{err}}_S(h)$ and $\mathbb{E}[\bar{X}] = \mathrm{err}_D(h)$,

and by Hoeffding's inequality:

$\Pr\big[\mathrm{err}_D(h) - \widehat{\mathrm{err}}_S(h) \geq \epsilon\big] \leq e^{-2m\epsilon^2}$

intuition of bound
Intuition of Bound

If we want to bound the generalization error of a single hypothesis with probability $1 - \delta$, where $\delta = e^{-2m\epsilon^2}$, then we can solve for $\epsilon$ using $\delta$.

So

$\mathrm{err}_D(h) \leq \widehat{\mathrm{err}}_S(h) + \sqrt{\frac{\ln(1/\delta)}{2m}}$

will hold with probability $1 - \delta$
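A small simulation of this single-hypothesis bound; the Bernoulli error rate, sample size, and number of trials below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_err, m, delta, trials = 0.3, 200, 0.05, 20_000

# Each trial: draw m Bernoulli "mistake indicators" and compare the true error
# to the empirical error plus the Hoeffding slack sqrt(ln(1/delta) / (2m)).
eps = np.sqrt(np.log(1 / delta) / (2 * m))
emp_err = rng.binomial(m, true_err, size=trials) / m
violations = np.mean(true_err > emp_err + eps)

print(f"slack eps = {eps:.3f}")
print(f"bound violated in {violations:.4%} of trials (should be <= {delta:.0%})")
```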

intuition of bound bounding all hypotheses in set
Intuition of Bound:Bounding all hypotheses in set
  • So we have bounded the difference between the generalization error and the empirical risk for a single hypothesis
  • How do we bound the difference for all hypotheses in H?

Theorem: Let H be a finite space of hypotheses and assume that a random training set of size m is chosen. Then for any $\epsilon > 0$:

$\Pr\big[\exists h \in H: \mathrm{err}_D(h) - \widehat{\mathrm{err}}_S(h) \geq \epsilon\big] \leq |H|\, e^{-2m\epsilon^2}$

Again, by setting $\delta = |H|\, e^{-2m\epsilon^2}$ and solving for $\epsilon$ we have

$\mathrm{err}_D(h) \leq \widehat{\mathrm{err}}_S(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}$

which will hold with probability $1 - \delta$ for all $h \in H$ simultaneously

intuition of bound bounding hypotheses in i nfinite s et
Intuition of Bound:Bounding Hypotheses in Infinite Set
  • What about cases when H is infinite?
  • Even if H is infinite, on a given set of m examples the hypotheses in H may only be capable of labeling the examples in a limited number of ways (at most $2^m$)
  • This implies that even though H is infinite, the hypotheses are divided into classes which produce the same labelings, so the effective number of hypotheses is equal to the number of such classes
  • By an argument similar to the one above, |H| in Hoeffding's inequality can be replaced with the number of effective hypotheses (or dichotomies), along with some additional constants
intuition of bound bounding hypotheses in infinite set
Intuition of Bound:Bounding Hypotheses in Infinite Set
  • More formally, let $S = \{x_1, \ldots, x_m\}$ be a finite set of examples and define the set of dichotomies $\Pi_H(S)$ to be all possible labelings of S by hypotheses of H
  • Also define the growth function $\Pi_H(m)$ to be the function which measures the maximum number of dichotomies for any sample of size m
intuition of bound bounding hypotheses in infinite set1
Intuition of Bound:Bounding Hypotheses in Infinite Set

Theorem: Let H be any space of hypotheses and assume that a random training set of size m is chosen. Then for any $\epsilon > 0$,

$\Pr\big[\exists h \in H: \mathrm{err}_D(h) - \widehat{\mathrm{err}}_S(h) \geq \epsilon\big] \leq O\big(\Pi_H(2m)\, e^{-c\, m\epsilon^2}\big)$   (for an absolute constant c)

and with probability at least $1 - \delta$,

$\mathrm{err}_D(h) \leq \widehat{\mathrm{err}}_S(h) + O\left(\sqrt{\frac{\ln \Pi_H(2m) + \ln(1/\delta)}{m}}\right)$

for all $h \in H$

intuition of bound bounding hypotheses in infinite set2
Intuition of Bound:Bounding Hypotheses in Infinite Set
  • It turns out that the growth function is either polynomial in m or equal to $2^m$
  • In the case where the growth function is polynomial in m, the degree of the polynomial is the VC dimension of H
  • In the case where the growth function is $2^m$, the VC dimension is infinite
  • VC dimension – the maximum number of points which can be shattered by H; m points are said to be shattered if the hypotheses in H can realize all possible labelings of the points
intuition of bound bounding hypotheses in infinite set3
Intuition of Bound:Bounding Hypotheses in Infinite Set
  • The VC dimension turns out to be a very natural measure of the complexity of H and can be used to bound the growth function for m examples

Sauer's lemma: If H is a hypothesis class of VC dimension d, then for all m,

$\Pi_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}$

and when $m \geq d$,

$\Pi_H(m) \leq \left(\frac{em}{d}\right)^{d}$

intuition of bound bounding hypotheses in infinite set4
Intuition of Bound:Bounding Hypotheses in Infinite Set

Theorem: Let H be a hypothesis space of VC dimension d and assume that a random training set of size m is chosen. Then for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$:

$\mathrm{err}_D(h) \leq \widehat{\mathrm{err}}_S(h) + O\left(\sqrt{\frac{d\ln(m/d) + \ln(1/\delta)}{m}}\right)$

intuition of bound adaboost generalization error
Intuition of Bound: Adaboost Generalization Error
  • The first bound for the AdaBoost generalization error follows from Sauer's lemma in a similar way
  • The growth function for Adaboost is bounded using the VC dimension of the base classifiers and the VC dimension for the set of all possible base classifier weight combinations
  • Again these bounds can be used in conjunction with Hoeffding’s inequality to bound the generalization error
generalization error first bound1
Generalization Error First Bound
  • Think of T and d as variables which come together to define the complexity of the ensemble of hypotheses; the log factors and constants hidden in the $\tilde{O}(\sqrt{Td/m})$ term are ignored.
  • This bound implies that AdaBoost is likely to overfit if run for too many boosting rounds.
  • Contradicting the spirit of this bound, empirical results suggest that AdaBoost tends to decrease generalization error even after training error reaches zero.
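To see why the bound suggests overfitting, note that its complexity term grows with the number of rounds T (constants and log factors ignored, as on the slide). A quick calculation with made-up values of d and m:

```python
import math

d, m = 10, 10_000          # hypothetical VC dimension of the base classifiers and sample size
for T in (10, 100, 1000, 10_000):
    print(f"T={T:6d}  sqrt(T*d/m) = {math.sqrt(T * d / m):.2f}")
# The penalty term keeps growing with T, yet in practice AdaBoost's test error
# often keeps falling -- the motivation for the margin-based bound that follows.
```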
example1
Example
  • The graph shows boosting used on C4.5 to identify images of hand written characters. This experiment was carried out by Robert Schapire et al.
  • Even as the training error goes to zero and AdaBoost has been run for 1000 rounds the test error continues to decrease

[Figure: training error and AdaBoost test error vs. number of boosting rounds, with the C4.5 test error shown as a horizontal reference line]

margin
Margin
  • AdaBoost’s resistance to overfitting is attributed to the confidence with which predictions are made
  • This notion of confidence is quantified by the margin
  • The margin takes values between 1 and -1
  • The magnitude of the margin can be viewed as a measure of confidence
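For a single example, the margin is typically defined as the weighted vote for the correct label minus the vote against it, normalized by the total weight: $\mathrm{margin}(x, y) = y \sum_t \alpha_t h_t(x) / \sum_t \alpha_t$. This formula is standard in AdaBoost's margin theory, though it is not spelled out on the slide; the round weights and predictions below are made up for illustration:

```python
import numpy as np

alphas = np.array([0.8, 0.5, 0.3, 0.2])   # hypothetical round weights alpha_t
h_x    = np.array([+1, +1, -1, +1])       # hypothetical predictions h_t(x) in {-1, +1}
y      = +1                               # true label of x

margin = y * np.dot(alphas, h_x) / np.sum(np.abs(alphas))
print(margin)   # in [-1, 1]; positive iff the weighted majority vote is correct,
                # with magnitude measuring the confidence of that vote
```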
generalization error1
Generalization Error
  • In response to empirical findings Schapire et al. derived a new bound.
  • This new bound is defined in terms of the margins, VC dimension and sample size but not the number of rounds or training error
  • This bound suggests that higher margins are preferable for lower generalization error
relation to support vector machines
Relation to Support Vector Machines
  • The boosting margins theory turns out to have a strong connection with support vector machines
  • Imagine that we have already found all weak classifiers and we wish to find the optimal set of weights with which to combine them
  • The optimal weights would be the weights that maximize the minimum margin
relation to support vector machines1
Relation to Support Vector Machines
  • Both support vector machines and boosting can be seen as trying to optimize the same objective function
  • Both attempt to maximize the minimum margin
  • The difference is in the norms used by boosting and SVMs:

Boosting norms: margin = $\dfrac{y\,(\boldsymbol{\alpha} \cdot \mathbf{h}(x))}{\|\boldsymbol{\alpha}\|_1 \, \|\mathbf{h}(x)\|_\infty}$   ($\ell_1$ on the weights, $\ell_\infty$ on the weak-classifier predictions)

SVM norms: margin = $\dfrac{y\,(\mathbf{w} \cdot \mathbf{x})}{\|\mathbf{w}\|_2 \, \|\mathbf{x}\|_2}$   ($\ell_2$ on both)

relation to support vector machines2
Relation to Support Vector Machines
  • Effects of different norms
    • Different norms can lead to very different results, especially in high dimensional spaces
    • Different computation requirements
      • SVM corresponds to quadratic programming while AdaBoost corresponds to linear programming
    • Difference in finding linear classifiers for high dimensional spaces
      • SVMs use kernels to perform low-dimensional calculations which are equivalent to inner products in high dimensions
      • Boosting employs a greedy search, using weight redistributions and weak classifiers to find coordinates highly correlated with the sample labels
adaboost examples and results
AdaBoost Examples and Results

CSE 5095: Special Topics Course

Yousra Almathami

Computer Science and Engineering Dept.

the rules for boosting
The Rules for Boosting
  • Set all weights of training examples equal
  • Train a weak learner on the weighted examples
  • Check how well the weak learner performs on data and give it a weight based on how well it did
  • Re-weight training examples and repeat
  • When done, predict by voting by majority
overview of adaboost
Overview of Adaboost

Taken from Bishop

toy example
Toy Example

5 Positive examples

5 Negative examples

2-Dimensional plane

Weak hyps: linear separators

3 iterations

All given equal weights

Taken from Schapire

first classifier
First classifier

Misclassified examples are circled, given more weight

Taken from Schapire

first 2 classifiers
First 2 classifiers

Misclassified examples are circled, given more weight

Taken from Schapire

first 3 classifiers
First 3 classifiers

Misclassified examples are circled, given more weight

Taken from Schapire

final classifier learned by boosting
Final Classifier learned by Boosting

Final Classifier: integrate the three “weak” classifiers and obtain a final strong classifier.

Taken from Schapire

boosting demo
Boosting Demo

Online Demo taken from www.Mathworks.com by Richard Stapenhurst
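The original MATLAB demo is not reproduced here; as a rough stand-in, scikit-learn's AdaBoostClassifier (assumed to be available in your environment) gives a similar feel on a synthetic two-dimensional problem:

```python
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Two concentric Gaussian "rings": not separable by any single stump
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_rounds in (1, 5, 50, 200):
    clf = AdaBoostClassifier(n_estimators=n_rounds, random_state=0).fit(X_train, y_train)
    print(f"rounds={n_rounds:3d}  train acc={clf.score(X_train, y_train):.3f}  "
          f"test acc={clf.score(X_test, y_test):.3f}")
```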

machine learning3

Machine Learning

Multiclass Classification for Boosting

Presented By: Chris Kuhn

Computer Science and Engineering Dept.
the idea
The Idea
  • Everything covered so far has been the binary (two-class) classification problem; what happens when dealing with more than two classes?
  • What changes in the problem?
    • y = {-1,+1} → y = {1, 2, …, k}
    • Random guess value changes from ½ to 1/k
  • Weak learning classifiers need to be updated
    • Can we update the weak learning classifiers to just have an accuracy > 1/k + γ?
    • There are cases where this condition is satisfied but there is no way to drive the training error to 0, making boosting impossible
    • THIS IS TOO WEAK!
adaboost m11
AdaBoost.M1
  • Almost the same algorithm as regular AdaBoost
  • Advantage:
    • Works similar to binary AdaBoost but on multiclass problems
  • Disadvantage:
    • If each weak hypothesis has error less than ½, then boosting is possible
    • For k = 2, error less than ½ just means better than random guessing; what about k > 2?
  • TOO STRONG! (unless weak learner is strong)
an alternative approach
An Alternative Approach
  • Can we create multiple binary problems out of a multiclass problem?
  • For each example xi: is the correct label yi or some other label y`?
    • k − 1 binary problems for each example
  • h(x, y) = 1 if y is the label for x, 0 otherwise
    • h(xi, yi) = 0, h(xi, y`) = 1 → predicts y` is correct (wrong)
    • h(xi, yi) = 1, h(xi, y`) = 0 → predicts yi is correct (right)
    • h(xi, yi) = h(xi, y`) → uninformative
adaboost mr
AdaBoost.MR
  • Generalized to allow multiple labels per example
  • Different initial distribution
  • ht : X × Y → ℝ (a real number)
  • ht used to rank labels for a given example
  • Now have ranking loss instead of error rate
additional algorithms
Additional Algorithms
  • AdaBoost.MH
    • One-against-all
    • Requires strong weak learning conditions
  • AdaBoost.MO
    • Runs MH as part of the algorithm and uses the strong classifier to generate alternative strong classifiers, which can perform an extra voting step
    • Still requires a strong weak learning condition
  • SAMME
    • Allows weak learners only slightly better than random guessing; uses a cost matrix instead of weights and a different equation for combining weak classifiers
    • Conditions can be too weak for strong margins
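For completeness, a minimal multiclass run using scikit-learn's AdaBoostClassifier on the Iris data; depending on the scikit-learn version, the SAMME or SAMME.R variant mentioned above is used under the hood, but the call itself is the same:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # 3 classes, so random guessing = 1/3
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("5-fold accuracy:", scores.mean().round(3))
```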
take home
Take Home
  • Yes, it is possible
  • There are many multiclass boosting algorithms available
  • No, there is no 'one size fits all' multiclass algorithm