Presentation Transcript
machine learning
Machine Learning

CSE 5095: Special Topics Course

Boosting

Nhan Nguyen

Computer Science and Engineering Dept.

boosting
Boosting
  • Method for converting rules of thumb into a prediction rule.
  • Rule of thumb?
  • Method?
binary classification
Binary Classification
  • X: set of all possible instances or examples.

- e.g., Possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast

  • c: X → {0,1}: the target concept to learn.

- e.g., c: EnjoySport → {0,1}

  • H: set of concept hypotheses

- e.g., conjunctions of literals: <?,Cold,High,?,?,?>

  • C: concept class, a set of target concepts c.
  • D: target distribution, a fixed probability distribution over X. Training and test examples are drawn according to D.
binary classification1
Binary Classification
  • S: training sample

<x1,c(x1)>,…,<xm,c(xm)>

  • The learning algorithm receives sample S and selects a hypothesis from H approximating c.

- Find a hypothesis h ∈ H such that h(x) = c(x) for all x ∈ S
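To make the hypothesis-space idea concrete, here is a minimal sketch of checking whether a conjunction-of-literals hypothesis such as <?,Cold,High,?,?,?> is consistent with a training sample. The attribute encoding, the toy sample, and the helper names are illustrative assumptions, not taken from the slides:

```python
# Hypothetical encoding: an instance is a tuple of attribute values
# (Sky, AirTemp, Humidity, Wind, Water, Forecast); a hypothesis is a tuple of
# required values where '?' means "any value is acceptable".
def h(hypothesis, x):
    """Conjunction of literals: returns 1 iff x satisfies every literal."""
    return int(all(lit == '?' or lit == val for lit, val in zip(hypothesis, x)))

def consistent(hypothesis, sample):
    """True iff h(x) = c(x) for every labeled example (x, c(x)) in S."""
    return all(h(hypothesis, x) == label for x, label in sample)

# Toy training sample S = <x1, c(x1)>, ..., <xm, c(xm)>
S = [
    (('Sunny', 'Cold', 'High', 'Strong', 'Warm', 'Same'), 1),
    (('Rainy', 'Warm', 'High', 'Strong', 'Cool', 'Change'), 0),
]
print(consistent(('?', 'Cold', 'High', '?', '?', '?'), S))  # True
```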

errors
Errors
  • True error or generalization error of h with respect to the target concept c and distribution D:

$\mathrm{err}_D[h] = \Pr_{x \sim D}[h(x) \neq c(x)]$

  • Empirical error: average error of h on the training sample S drawn according to distribution D,

$\widehat{\mathrm{err}}_S[h] = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}[h(x_i) \neq c(x_i)]$
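As a quick illustration of the two quantities, the empirical error is just the average disagreement on S, while the true error can only be approximated, e.g. by Monte Carlo sampling from D. The target concept, hypothesis, and distribution below are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
c = lambda x: (x > 0.5).astype(int)          # unknown target concept (toy stand-in)
h = lambda x: (x > 0.6).astype(int)          # learned hypothesis (toy stand-in)
draw_D = lambda n: rng.uniform(0, 1, n)      # target distribution D over X

S = draw_D(50)                               # training sample drawn from D
empirical_error = np.mean(h(S) != c(S))      # (1/m) * number of disagreements on S

big_sample = draw_D(1_000_000)               # Monte Carlo estimate of err_D[h]
true_error_estimate = np.mean(h(big_sample) != c(big_sample))
print(empirical_error, true_error_estimate)  # true error is 0.1 for this toy pair
```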

errors1
Errors
  • Questions:
    • Can we bound the true error of a hypothesis given only its training error?
    • How many examples are needed for a good approximation?
approximate concept learning
Approximate Concept Learning
  • Requiring a learner to acquire the right concept is too strict
  • Instead, we will allow the learner to produce a good approximation to the actual concept
general assumptions
General Assumptions
  • Assumption: Examples are generated according to a probability distribution D(x) and labeled according to an unknown function c: y = c(x)
  • Learning Algorithm: from a set of m examples, outputs a hypothesis h ∈ H that is consistent with those examples (i.e., correctly classifies all of them).
  • Goal: h should have a low error rate on new examples drawn from the same distribution D.

$\mathrm{err}_D[h] = \Pr_{x \sim D}[h(x) \neq c(x)]$

pac learning model
PAC learning model
  • PAC learning: Probably Approximately Correct learning
  • The goal of the learning algorithm: optimize over S to find some hypothesis h: X → {0,1} in H that approximates c, and we want h to have small true error $\mathrm{err}_D[h]$
  • If $\mathrm{err}_D[h]$ is small, h is "probably approximately correct".
  • Formally, h is PAC if

$\Pr[\mathrm{err}_D[h] \leq \epsilon] \geq 1 - \delta$

for all c ∈ C, ε > 0, δ > 0, and all distributions D

pac learning model1
PAC learning model

Concept class C is PAC-learnable if there is a learner L that outputs a hypothesis h ∈ H for all c ∈ C, ε > 0, δ > 0, and all distributions D such that:

$\Pr[\mathrm{err}_D[h] \leq \epsilon] \geq 1 - \delta$

using at most poly(1/ε, 1/δ, size(X), size(c)) examples and running time.

ε: accuracy, 1 − δ: confidence.

Such an L is called a strong learner.
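As a rough illustration of why "poly(1/ε, 1/δ, …)" examples can suffice, a standard bound for a consistent learner over a finite hypothesis class (not stated on the slides, so take it as an assumed aside) is m ≥ (1/ε)(ln|H| + ln(1/δ)). The hypothesis-space size below is a hypothetical value:

```python
import math

def sample_complexity(eps, delta, hypothesis_space_size):
    """Examples sufficient for a consistent learner over a finite H to be
    probably (1 - delta) approximately (error <= eps) correct."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1 / delta)) / eps)

# e.g. a small conjunctive hypothesis space over 6 attributes (hypothetical |H|)
print(sample_complexity(eps=0.05, delta=0.01, hypothesis_space_size=3**6))
```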

pac learning model2
PAC learning model
  • Learner L is a weak learner if L outputs a hypothesis h ∈ H such that:

$\Pr[\mathrm{err}_D[h] \leq \tfrac{1}{2} - \gamma] \geq 1 - \delta$

for all c ∈ C, some γ > 0, δ > 0, and all distributions D

  • A weak learner only outputs a hypothesis that performs slightly better than random guessing
  • Hypothesis boosting problem: can we "boost" a weak learner into a strong learner?
  • Rule of thumb ~ weak learner
  • Method ~ a boosting algorithm
boosting a weak learner majority vote
Boosting a weak learner – Majority Vote
  • L learns hypothesis h1 on the first N training points
  • L randomly filters the next batch of training points, extracting N/2 points correctly classified by h1 and N/2 points incorrectly classified, and produces h2.
  • L builds a third training set of N points on which h1 and h2 disagree, and produces h3.
  • L outputs h = Majority Vote(h1, h2, h3), as sketched below.
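A minimal sketch of this filtering scheme, assuming a decision stump as the weak learner and a synthetic data pool; both are placeholder choices, and the original construction draws examples from an oracle rather than a fixed array:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
N = 500

def weak_learner(Xs, ys):
    # depth-1 decision tree ("stump") as a stand-in weak learner
    return DecisionTreeClassifier(max_depth=1, random_state=0).fit(Xs, ys)

# 1) h1: learned on the first N points
h1 = weak_learner(X[:N], y[:N])

# 2) h2: learned on a filtered batch, half classified correctly by h1, half not
pool_X, pool_y = X[N:2000], y[N:2000]
correct = h1.predict(pool_X) == pool_y
k = min(N // 2, correct.sum(), (~correct).sum())   # guard for small pools
idx2 = np.r_[np.flatnonzero(correct)[:k], np.flatnonzero(~correct)[:k]]
h2 = weak_learner(pool_X[idx2], pool_y[idx2])

# 3) h3: learned on points where h1 and h2 disagree (assumes some disagreement exists)
rest_X, rest_y = X[2000:], y[2000:]
disagree = h1.predict(rest_X) != h2.predict(rest_X)
h3 = weak_learner(rest_X[disagree], rest_y[disagree])

# 4) h = Majority Vote(h1, h2, h3)
def h(Xq):
    votes = np.stack([clf.predict(Xq) for clf in (h1, h2, h3)])
    return (votes.sum(axis=0) >= 2).astype(int)

print("majority-vote training accuracy:", (h(X) == y).mean())
```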
boosting schapire 89
Boosting [Schapire ’89]

Idea: given a weak learner, run it multiple times on (reweighted) training data, then let learned classifiers vote.

A formal description of boosting:

  • Given training set $(x_1, y_1), \ldots, (x_m, y_m)$
  • $y_i \in \{-1, +1\}$: correct label of $x_i \in X$
  • for t = 1, …, T:
    • construct distribution $D_t$ on {1, …, m}
    • find weak hypothesis $h_t: X \to \{-1, +1\}$ with small error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$ on $D_t$
  • output final hypothesis $H_{\mathrm{final}}$
boosting1
Boosting

[Diagram] Training sample → weak hypothesis h₁(x); weighted sample → h₂(x); … ; weighted sample → h_T(x); final hypothesis H(x) = sign(Σt αt ht(x))

boosting algorithms
Boosting algorithms
  • AdaBoost (Adaptive Boosting)
  • LPBoost (Linear Programming Boosting)
  • BrownBoost
  • MadaBoost (modifying the weighting system of AdaBoost)
  • LogitBoost
lecture
Lecture
  • Motivating example
  • Adaboost
  • Training Error
  • Overfitting
  • Generalization Error
  • Examples of Adaboost
  • Multiclass classification for boosting
machine learning1
Machine Learning

Proof of Bound on Adaboost Training Error

Aaron Palmer

theorem 2 class error bounds
Theorem: 2 Class Error Bounds
  • Assume $\epsilon_t = \tfrac{1}{2} - \gamma_t$
    • $\epsilon_t$ = error rate on round t of boosting
    • $\gamma_t$ = how much better than random guessing
      • a small, positive number
  • Training error is bounded by

$\mathrm{err}_S(H_{\mathrm{final}}) \leq \prod_{t=1}^{T} \Big[\, 2\sqrt{\epsilon_t(1-\epsilon_t)} \,\Big] = \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \leq \exp\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big)$

implications
Implications?
    • T = number of rounds of boosting
    • εt and γt do not need to be known in advance
  • As long as γt > 0, the training error will decrease exponentially as a function of T
proof part ii training error
Proof Part II: training error

Training error $(H_{\mathrm{final}}) = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\big[H_{\mathrm{final}}(x_i) \neq y_i\big]$

$\leq \frac{1}{m}\sum_{i=1}^{m} \exp\big(-y_i f(x_i)\big)$   (since $\mathbf{1}[y \neq \mathrm{sign}(f)] \leq e^{-yf}$, with $f(x) = \sum_t \alpha_t h_t(x)$)

$= \frac{1}{m}\sum_{i=1}^{m} m\Big(\prod_t Z_t\Big) D_{T+1}(i)$   (unravelling the update $D_{t+1}(i) = D_t(i)\,e^{-\alpha_t y_i h_t(x_i)}/Z_t$)

$= \prod_t Z_t$

proof part iii
Proof Part III:

$Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t}$

Set $\frac{dZ_t}{d\alpha_t}$ equal to zero and solve for $\alpha_t$:

$\alpha_t = \frac{1}{2}\ln\Big(\frac{1-\epsilon_t}{\epsilon_t}\Big)$

Plug back into $Z_t$:

$Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$

part iii continued
Part III: Continued

Plug in the definition $\epsilon_t = \tfrac{1}{2} - \gamma_t$:

$Z_t = 2\sqrt{\big(\tfrac{1}{2}-\gamma_t\big)\big(\tfrac{1}{2}+\gamma_t\big)} = \sqrt{1 - 4\gamma_t^2}$

exponential bound
Exponential Bound
  • Use the property $1 + x \leq e^x$
  • Take x to be $-4\gamma_t^2$
  • $\sqrt{1 - 4\gamma_t^2} \leq \sqrt{e^{-4\gamma_t^2}} = e^{-2\gamma_t^2}$, so $\prod_t Z_t \leq \exp\big(-2\textstyle\sum_t \gamma_t^2\big)$
putting it together
Putting it together
  • We’ve proven that the training error drops exponentially fast as a function of the number of base classifiers combined
  • Bound is pretty loose
example
Example:
  • Suppose that all γt are at least 10%, so that no ht has an error rate above 40%
  • What upper bound does the theorem place on the training error?
  • Answer: $\mathrm{err}_S(H_{\mathrm{final}}) \leq \prod_t \sqrt{1 - 4(0.1)^2} \leq e^{-2T(0.1)^2} = e^{-0.02T}$
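A quick numeric check of this bound, using nothing beyond the formula above:

```python
import math

gamma = 0.10                      # every round beats random guessing by at least 10%
for T in (10, 50, 100, 200, 500):
    exact = math.sqrt(1 - 4 * gamma**2) ** T   # product of sqrt(1 - 4*gamma^2) terms
    loose = math.exp(-2 * T * gamma**2)        # exponential upper bound e^{-0.02 T}
    print(f"T={T:3d}  product bound={exact:.4f}  exp bound={loose:.4f}")
```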
overfitting
Overfitting?
  • Does the proof say anything about overfitting?
  • While this is good theory, can we get anything useful out of this proof as far as dealing with unseen data?
boosting2

Boosting

Ayman Alharbi

example spam emails
Example (Spam emails)

* Problem: filter out spam (junk email)

- Gather large collection of examples of spam and non-spam

From: Jinbo@engr.uconn.edu

“can you review a paper” ...

non-spam

From: XYZ@F.U

“Win 10000$ easily !!” ...

spam

example spam emails1
Example (Spam emails)

If ‘buy now’ occurs in message, then predict ‘spam’

Main Observation

- Easy to find "rules of thumb" that are "often" correct

- Hard to find single rule that is very highly accurate
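As a toy illustration of such a rule of thumb, mirroring the 'buy now' rule mentioned above; the messages and labels below are made up:

```python
def rule_of_thumb(message: str) -> str:
    # Weak rule: if 'buy now' occurs in the message, predict 'spam'
    return "spam" if "buy now" in message.lower() else "non-spam"

examples = [
    ("can you review a paper", "non-spam"),
    ("Win 10000$ easily !! buy now", "spam"),
    ("limited offer, buy now", "spam"),
    ("meeting moved to 3pm", "non-spam"),
    ("please don't buy now, wait for the sale", "non-spam"),  # rule gets this wrong
]
accuracy = sum(rule_of_thumb(m) == label for m, label in examples) / len(examples)
print(f"rule-of-thumb accuracy on this toy set: {accuracy:.0%}")  # often, not always, correct
```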

slide32

Example (Phone Cards)

Goal: automatically categorize the type of call requested by a phone customer

(Collect, CallingCard, PersonToPerson, etc.)

- "Yes I'd like to place a collect call long distance please" (Collect)

- "operator I need to make a call but I need to bill it to my office" (ThirdNumber)

- "I just called the wrong number and I would like to have that taken off of my bill" (BillingCredit)

Rule of thumb: if 'bill' occurs in the utterance, then predict 'BillingCredit'

Main Observation

- Easy to find "rules of thumb" that are "often" correct

- Hard to find a single rule that is very highly accurate

the boosting approach
The Boosting Approach
  • Devise computer program for deriving rough rules of thumb
  • Apply procedure to subset of emails
  • Obtain rule of thumb
  • Apply to 2nd subset of emails
  • Obtain 2nd rule of thumb
  • Repeat T times
details
Details
  • How to choose examples on each round?

- Concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)

  • How to combine rules of thumb into a single prediction rule?

- Take (weighted) majority vote of rules of thumb !!

  • Can prove: if we can always find weak rules of thumb slightly better than random guessing, then we can learn almost perfectly using boosting
slide35
Idea

• At each iteration t:

– Weight each training example by how incorrectly it was classified

– Learn a hypothesis ht

– Choose a strength for this hypothesis, αt

Final classifier: weighted combination of weak learners

Idea: given a set of weak learners, run them multiple times on (reweighted) training data, then let learned classifiers vote

boosting3
Boosting

AdaBoost Algorithm

Presenter: Karl Severin

Computer Science and Engineering Dept.

boosting overview
Boosting Overview
  • Goal: Form one strong classifier from multiple weak classifiers.
  • Proceeds in rounds, iteratively producing classifiers from the weak learner.
  • Increases the weight given to incorrectly classified examples.
  • Gives each classifier an importance that decreases as its weighted error grows.
  • Each classifier gets a vote based on its importance.
initialize
Initialize
  • Initialize with an evenly weighted distribution: $D_1(i) = 1/m$ for i = 1, …, m
  • Begin generating classifiers
error
Error
  • Quality of a classifier is based on its weighted error: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$
  • i.e., the probability that ht will misclassify an example selected according to distribution Dt
  • Or, equivalently, the summation of the weights of all misclassified examples
classifier importance
Classifier Importance
  • αt measures the importance given to classifier ht: $\alpha_t = \frac{1}{2}\ln\Big(\frac{1-\epsilon_t}{\epsilon_t}\Big)$
  • αt > 0 if εt < ½ (εt is assumed to always be < ½)
  • αt decreases as εt grows: more accurate classifiers get more importance
update distribution
Update Distribution
  • Update the distribution: $D_{t+1}(i) = \dfrac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is a normalization factor
  • Increase the weight of misclassified examples
  • Decrease the weight of correctly classified examples
combine classifiers
Combine Classifiers
  • When classifying a new instance x, all of the weak classifiers get a vote weighted by their α: $H(x) = \mathrm{sign}\Big(\sum_{t=1}^{T} \alpha_t h_t(x)\Big)$
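Putting the steps above together, here is a compact sketch of AdaBoost with decision stumps; the synthetic data set and the stump weak learner are illustrative choices, not part of the slides:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)                 # labels in {-1, +1}
m = len(y)

T = 50
D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
stumps, alphas = [], []

for t in range(T):
    # weak hypothesis h_t with small weighted error on D_t
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = h.predict(X)
    eps = float(np.clip(D[pred != y].sum(), 1e-12, None))   # epsilon_t (clipped to avoid log 0)
    if eps >= 0.5:                            # weak-learning assumption violated; stop
        break
    alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = (1/2) ln((1 - eps_t)/eps_t)

    # D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
    D *= np.exp(-alpha * y * pred)
    D /= D.sum()

    stumps.append(h)
    alphas.append(alpha)

# H(x) = sign( sum_t alpha_t * h_t(x) )
F = sum(a * clf.predict(X) for a, clf in zip(alphas, stumps))
print("training error:", np.mean(np.sign(F) != y))
```

Each loop iteration mirrors the Error, Classifier Importance, and Update Distribution slides; the final two lines implement the weighted vote H(x).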
machine learning2
Machine Learning

CSE 5095: Special Topics Course

Instructor: Jinbo Bi

Computer Science and Engineering Dept.

Presenter: Brian McClanahan

Topic: Boosting Generalization Error

generalization error
Generalization Error
  • Generalization error is the true error of a classifier
  • Most supervised machine learning algorithms operate under the premise that lowering the training error will lead to a lower generalization error
  • For some machine learning algorithms a relationship between the training and generalization error can be defined through a bound
generalization error first bound
Generalization Error First Bound
  • With high probability, $\mathrm{err}_D(H_{\mathrm{final}}) \leq \widehat{\mathrm{err}}_S(H_{\mathrm{final}}) + \tilde{O}\Big(\sqrt{\tfrac{Td}{m}}\Big)$, where:
  • $\widehat{\mathrm{err}}_S$ – empirical risk (training error)
  • $T$ – boosting rounds
  • $d$ – VC dimension of base classifiers
  • $m$ – number of training examples
  • $\mathrm{err}_D$ – generalization error
intuition of bound hoeffding s inequality
Intuition of Bound: Hoeffding’s inequality
  • Define H to be a finite set of hypotheses which map examples to 0 or 1
  • Hoeffding's inequality: Let $X_1, \ldots, X_m$ be independent random variables such that $X_i \in [0, 1]$. Denote their average value $\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i$. Then for any $\epsilon > 0$ we have: $\Pr\big[\mathbb{E}[\bar{X}] - \bar{X} \geq \epsilon\big] \leq e^{-2m\epsilon^2}$
  • In the context of machine learning, think of the $X_i$ as the errors made by a hypothesis h: $X_i = \mathbf{1}[h(x_i) \neq c(x_i)]$, where $c(x_i)$ is the true label for $x_i$.
intuition of bound hoeffding s inequality1
Intuition of Bound: Hoeffding’s inequality

So $\bar{X} = \widehat{\mathrm{err}}_S(h)$ and $\mathbb{E}[\bar{X}] = \mathrm{err}_D(h)$,

and by Hoeffding's inequality:

$\Pr\big[\mathrm{err}_D(h) - \widehat{\mathrm{err}}_S(h) \geq \epsilon\big] \leq e^{-2m\epsilon^2}$

intuition of bound
Intuition of Bound

If we want to bound the generalization error of a single hypothesis with probability $1 - \delta$, where $\delta = e^{-2m\epsilon^2}$, then we can solve for $\epsilon$ using $\delta$.

So

$\mathrm{err}_D(h) \leq \widehat{\mathrm{err}}_S(h) + \sqrt{\frac{\ln(1/\delta)}{2m}}$

will hold with probability $1 - \delta$
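A small simulation of this single-hypothesis bound; the Bernoulli error rate, sample size, and number of trials below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_err, m, delta, trials = 0.3, 200, 0.05, 20_000

# Each trial: draw m Bernoulli "mistake indicators" and compare the true error
# to the empirical error plus the Hoeffding slack sqrt(ln(1/delta) / (2m)).
eps = np.sqrt(np.log(1 / delta) / (2 * m))
emp_err = rng.binomial(m, true_err, size=trials) / m
violations = np.mean(true_err > emp_err + eps)

print(f"slack eps = {eps:.3f}")
print(f"bound violated in {violations:.4%} of trials (should be <= {delta:.0%})")
```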

intuition of bound bounding all hypotheses in set
Intuition of Bound:Bounding all hypotheses in set
  • So we have bounded the difference between the generalization error and the empirical risk for a single hypothesis
  • How do we bound the difference for all hypotheses in H?

Theorem: Let H be a finite space of hypotheses and assume that a random training set of size m is chosen. Then for any $\epsilon > 0$:

$\Pr\big[\exists h \in H: \mathrm{err}_D(h) - \widehat{\mathrm{err}}_S(h) \geq \epsilon\big] \leq |H|\, e^{-2m\epsilon^2}$

Again, by setting $\delta = |H|\, e^{-2m\epsilon^2}$ and solving for $\epsilon$ we have

$\mathrm{err}_D(h) \leq \widehat{\mathrm{err}}_S(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}$

which will hold with probability $1 - \delta$ for all $h \in H$ simultaneously

intuition of bound bounding hypotheses in i nfinite s et
Intuition of Bound:Bounding Hypotheses in Infinite Set
  • What about cases when H is infinite?
  • Even if H is infinite, on a given set of m examples the hypotheses in H may only be capable of labeling the examples in a limited number of ways (at most $2^m$)
  • This implies that even though H is infinite, the hypotheses are divided into classes which produce the same labelings, so the effective number of hypotheses is equal to the number of such classes
  • By an argument similar to the one above, |H| in Hoeffding's inequality can be replaced with the number of effective hypotheses (or dichotomies), along with some additional constants
intuition of bound bounding hypotheses in infinite set
Intuition of Bound:Bounding Hypotheses in Infinite Set
  • More formally, let $S = \{x_1, \ldots, x_m\}$ be a finite set of examples and define the set of dichotomies $\Pi_H(S)$ to be all possible labelings of S by hypotheses of H
  • Also define the growth function $\Pi_H(m)$ to be the function which measures the maximum number of dichotomies for any sample of size m
intuition of bound bounding hypotheses in infinite set1
Intuition of Bound:Bounding Hypotheses in Infinite Set

Theorem: Let H be any space of hypotheses and assume that a random training set of size m is chosen. Then for any $\epsilon > 0$,

$\Pr\big[\exists h \in H: \mathrm{err}_D(h) - \widehat{\mathrm{err}}_S(h) \geq \epsilon\big] \leq O\big(\Pi_H(2m)\, e^{-c\, m\epsilon^2}\big)$   (for an absolute constant c)

and with probability at least $1 - \delta$,

$\mathrm{err}_D(h) \leq \widehat{\mathrm{err}}_S(h) + O\left(\sqrt{\frac{\ln \Pi_H(2m) + \ln(1/\delta)}{m}}\right)$

for all $h \in H$

intuition of bound bounding hypotheses in infinite set2
Intuition of Bound:Bounding Hypotheses in Infinite Set
  • It turns out that the growth function is either polynomial in m or equal to $2^m$
  • In the case where the growth function is polynomial in m, the degree of the polynomial is the VC dimension of H
  • In the case where the growth function is $2^m$, the VC dimension is infinite
  • VC dimension – the maximum number of points which can be shattered by H; m points are said to be shattered if the hypotheses in H can realize all possible labelings of the points
intuition of bound bounding hypotheses in infinite set3
Intuition of Bound:Bounding Hypotheses in Infinite Set
  • The VC dimension turns out to be a very natural measure of the complexity of H and can be used to bound the growth function for m examples

Sauer's lemma: If H is a hypothesis class of VC dimension d, then for all m,

$\Pi_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}$

and when $m \geq d$,

$\Pi_H(m) \leq \left(\frac{em}{d}\right)^{d}$

intuition of bound bounding hypotheses in infinite set4
Intuition of Bound:Bounding Hypotheses in Infinite Set

Theorem: Let H be a hypothesis space of VC dimension d and assume that a random training set of size m is chosen. Then for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$:

$\mathrm{err}_D(h) \leq \widehat{\mathrm{err}}_S(h) + O\left(\sqrt{\frac{d\ln(m/d) + \ln(1/\delta)}{m}}\right)$

intuition of bound adaboost generalization error
Intuition of Bound: Adaboost Generalization Error
  • The first bound for the AdaBoost generalization error follows from Sauer's lemma in a similar way
  • The growth function for Adaboost is bounded using the VC dimension of the base classifiers and the VC dimension for the set of all possible base classifier weight combinations
  • Again these bounds can be used in conjunction with Hoeffding’s inequality to bound the generalization error
generalization error first bound1
Generalization Error First Bound
  • Think of T and d as variables which come together to define the complexity of the ensemble of hypotheses; the log factors and constants hidden in the $\tilde{O}(\sqrt{Td/m})$ term are ignored.
  • This bound implies that AdaBoost is likely to overfit if run for too many boosting rounds.
  • Contradicting the spirit of this bound, empirical results suggest that AdaBoost tends to decrease generalization error even after training error reaches zero.
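To see why the bound suggests overfitting, note that its complexity term grows with the number of rounds T (constants and log factors ignored, as on the slide). A quick calculation with made-up values of d and m:

```python
import math

d, m = 10, 10_000          # hypothetical VC dimension of the base classifiers and sample size
for T in (10, 100, 1000, 10_000):
    print(f"T={T:6d}  sqrt(T*d/m) = {math.sqrt(T * d / m):.2f}")
# The penalty term keeps growing with T, yet in practice AdaBoost's test error
# often keeps falling -- the motivation for the margin-based bound that follows.
```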
example1
Example
  • The graph shows boosting used on C4.5 to identify images of hand written characters. This experiment was carried out by Robert Schapire et al.
  • Even as the training error goes to zero and AdaBoost has been run for 1000 rounds the test error continues to decrease

[Figure: training error and AdaBoost test error vs. number of boosting rounds, with the C4.5 test error shown as a horizontal reference line]

margin
Margin
  • AdaBoost’s resistance to overfitting is attributed to the confidence with which predictions are made
  • This notion of confidence is quantified by the margin
  • The margin takes values between 1 and -1
  • The magnitude of the margin can be viewed as a measure of confidence
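For a single example, the margin is typically defined as the weighted vote for the correct label minus the vote against it, normalized by the total weight: $\mathrm{margin}(x, y) = y \sum_t \alpha_t h_t(x) / \sum_t \alpha_t$. This formula is standard in AdaBoost's margin theory, though it is not spelled out on the slide; the round weights and predictions below are made up for illustration:

```python
import numpy as np

alphas = np.array([0.8, 0.5, 0.3, 0.2])   # hypothetical round weights alpha_t
h_x    = np.array([+1, +1, -1, +1])       # hypothetical predictions h_t(x) in {-1, +1}
y      = +1                               # true label of x

margin = y * np.dot(alphas, h_x) / np.sum(np.abs(alphas))
print(margin)   # in [-1, 1]; positive iff the weighted majority vote is correct,
                # with magnitude measuring the confidence of that vote
```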
generalization error1
Generalization Error
  • In response to empirical findings Schapire et al. derived a new bound.
  • This new bound is defined in terms of the margins, VC dimension and sample size but not the number of rounds or training error
  • This bound suggests that higher margins are preferable for lower generalization error
relation to support vector machines
Relation to Support Vector Machines
  • The boosting margins theory turns out to have a strong connection with support vector machines
  • Imagine that we have already found all weak classifiers and we wish to find the optimal set of weights with which to combine them
  • The optimal weights would be the weights that maximize the minimum margin
relation to support vector machines1
Relation to Support Vector Machines
  • Both support vector machines and boosting can be seen as trying to optimize the same objective function
  • Both attempt to maximize the minimum margin
  • The difference is in the norms used by boosting and SVMs:

Boosting norms: margin = $\dfrac{y\,(\boldsymbol{\alpha} \cdot \mathbf{h}(x))}{\|\boldsymbol{\alpha}\|_1 \, \|\mathbf{h}(x)\|_\infty}$   ($\ell_1$ on the weights, $\ell_\infty$ on the weak-classifier predictions)

SVM norms: margin = $\dfrac{y\,(\mathbf{w} \cdot \mathbf{x})}{\|\mathbf{w}\|_2 \, \|\mathbf{x}\|_2}$   ($\ell_2$ on both)

relation to support vector machines2
Relation to Support Vector Machines
  • Effects of different norms
    • Different norms can lead to very different results, especially in high dimensional spaces
    • Different computation requirements
      • SVM corresponds to quadratic programming while AdaBoost corresponds to linear programming
    • Difference in finding linear classifiers for high dimensional spaces
      • SVMs use kernels to perform low-dimensional calculations which are equivalent to inner products in high dimensions
      • Boosting employs a greedy search, using weight redistributions and weak classifiers to find coordinates highly correlated with the sample labels
adaboost examples and results
AdaBoost Examples and Results

CSE 5095: Special Topics Course

Yousra Almathami

Computer Science and Engineering Dept.

the rules for boosting
The Rules for Boosting
  • Set all weights of training examples equal
  • Train a weak learner on the weighted examples
  • Check how well the weak learner performs on data and give it a weight based on how well it did
  • Re-weight training examples and repeat
  • When done, predict by voting by majority
overview of adaboost
Overview of Adaboost

Taken from Bishop

toy example
Toy Example

5 Positive examples

5 Negative examples

2-Dimensional plane

Weak hyps: linear separators

3 iterations

All given equal weights

Taken from Schapire

first classifier
First classifier

Misclassified examples are circled, given more weight

Taken from Schapire

first 2 classifiers
First 2 classifiers

Misclassified examples are circled, given more weight

Taken from Schapire

first 3 classifiers
First 3 classifiers

Misclassified examples are circled, given more weight

Taken from Schapire

final classifier learned by boosting
Final Classifier learned by Boosting

Final Classifier: integrate the three “weak” classifiers and obtain a final strong classifier.

Taken from Schapire

boosting demo
Boosting Demo

Online Demo taken from www.Mathworks.com by Richard Stapenhurst
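The original MATLAB demo is not reproduced here; as a rough stand-in, scikit-learn's AdaBoostClassifier (assumed to be available in your environment) gives a similar feel on a synthetic two-dimensional problem:

```python
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Two concentric Gaussian "rings": not separable by any single stump
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_rounds in (1, 5, 50, 200):
    clf = AdaBoostClassifier(n_estimators=n_rounds, random_state=0).fit(X_train, y_train)
    print(f"rounds={n_rounds:3d}  train acc={clf.score(X_train, y_train):.3f}  "
          f"test acc={clf.score(X_test, y_test):.3f}")
```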

machine learning3

Machine Learning

Multiclass Classification for Boosting

Presented By: Chris Kuhn

Computer Science and Engineering Dept.
the idea
The Idea
  • Everything covered so far has been the binary (two-class) classification problem; what happens when dealing with more than two classes?
  • What changes in the problem?
    • y = {-1,+1} → y = {1, 2, …, k}
    • Random guess value changes from ½ to 1/k
  • Weak learning classifiers need to be updated
    • Can we update the weak learning classifiers to just have an accuracy > 1/k + γ?
    • There are cases where this condition is satisfied but there is no way to drive the training error to 0, making boosting impossible
    • THIS IS TOO WEAK!
adaboost m11
AdaBoost.M1
  • Almost the same algorithm as regular AdaBoost
  • Advantage:
    • Works similar to binary AdaBoost but on multiclass problems
  • Disadvantage:
    • If each weak hypothesis has error less than ½, then boosting is possible
    • For k = 2, error less than ½ just means better than random guessing; what about k > 2?
  • TOO STRONG! (unless weak learner is strong)
an alternative approach
An Alternative Approach
  • Can we create multiple binary problems out of a multiclass problem?
  • For each example xi: is the correct label yi or some other label y`?
    • k − 1 binary problems for each example
  • h(x, y) = 1 if y is the label for x, 0 otherwise
    • h(xi, yi) = 0, h(xi, y`) = 1 → predicts y` is correct (wrong)
    • h(xi, yi) = 1, h(xi, y`) = 0 → predicts yi is correct (right)
    • h(xi, yi) = h(xi, y`) → uninformative
adaboost mr
AdaBoost.MR
  • Generalized to allow multiple labels per example
  • Different initial distribution
  • ht : X × Y → ℝ (a real number)
  • ht used to rank labels for a given example
  • Now have ranking loss instead of error rate
additional algorithms
Additional Algorithms
  • AdaBoost.MH
    • One-against-all
    • Requires strong weak learning conditions
  • AdaBoost.MO
    • Runs MH as part of the algorithm and uses the strong classifier to generate alternative strong classifiers, which can perform an extra voting step
    • Still requires a strong weak learning condition
  • SAMME
    • Allows weak learners only slightly better than random guessing; uses a cost matrix instead of weights and a different equation for combining weak classifiers
    • Conditions can be too weak for strong margins
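For completeness, a minimal multiclass run using scikit-learn's AdaBoostClassifier on the Iris data; depending on the scikit-learn version, the SAMME or SAMME.R variant mentioned above is used under the hood, but the call itself is the same:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # 3 classes, so random guessing = 1/3
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("5-fold accuracy:", scores.mean().round(3))
```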
take home
Take Home
  • Yes, it is possible
  • There are many multiclass boosting algorithms available
  • No, there is no 'one size fits all' multiclass algorithm