Machine Learning

Presentation Transcript


  1. Machine Learning CSE 5095: Special Topics Course Boosting Nhan Nguyen Computer Science and Engineering Dept.

  2. Boosting • Method for converting rules of thumb into a prediction rule. • Rule of thumb? • Method?

  3. Binary Classification • X: set of all possible instances or examples. - e.g., possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast • c: X → {0,1}: the target concept to learn. - e.g., c: EnjoySport → {0,1} • H: set of concept hypotheses - e.g., conjunctions of literals: <?,Cold,High,?,?,?> • C: concept class, a set of target concepts c. • D: target distribution, a fixed probability distribution over X. Training and test examples are drawn according to D.
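As a minimal, hypothetical sketch of this setup (the attribute names and the example hypothesis come from the slide; the helper function and example day are illustrative only):

```python
# A hypothesis as a conjunction of literals over the EnjoySport attributes.
# '?' means "any value is acceptable" for that attribute.
ATTRIBUTES = ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]

def h(example, constraints):
    """Return 1 if the example satisfies every literal, else 0."""
    return int(all(c == "?" or example[a] == c
                   for a, c in zip(ATTRIBUTES, constraints)))

day = {"Sky": "Sunny", "AirTemp": "Cold", "Humidity": "High",
       "Wind": "Strong", "Water": "Warm", "Forecast": "Same"}
print(h(day, ["?", "Cold", "High", "?", "?", "?"]))  # 1: every literal matches
```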

  4. Binary Classification • S: training sample <x1,c(x1)>, …, <xm,c(xm)> • The learning algorithm receives sample S and selects a hypothesis from H approximating c. - Find a hypothesis h ∈ H such that h(x) = c(x) for all x ∈ S

  5. Errors • True error or generalization error of h with respect to the target concept c and distribution D: err_D(h) = Pr_{x~D}[h(x) ≠ c(x)] • Empirical error: average error of h on the training sample S drawn according to distribution D: err_S(h) = (1/m) Σ_{i=1..m} 1[h(xi) ≠ c(xi)]
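A minimal sketch (not from the slides) of the empirical error: it is simply the fraction of the sample that h misclassifies, whereas the true error would require access to the distribution D itself. The toy concept and hypothesis below are made up for illustration:

```python
def empirical_error(h, sample):
    """Average error of h on a training sample S = [(x1, c(x1)), ..., (xm, c(xm))]."""
    return sum(h(x) != c for x, c in sample) / len(sample)

# Toy illustration: target concept c(x) = 1 iff x >= 5, and an imperfect hypothesis h
S = [(x, int(x >= 5)) for x in [1, 3, 4, 5, 6, 9]]
h = lambda x: int(x >= 6)
print(empirical_error(h, S))   # 1/6 -- only x = 5 is misclassified
```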

  6. Errors • Questions: • Can we bound the true error of a hypothesis given only its training error? • How many examples are needed for a good approximation?

  7. Approximate Concept Learning • Requiring a learner to acquire the right concept is too strict • Instead, we will allow the learner to produce a good approximation to the actual concept

  8. General Assumptions • Assumption: Examples are generated according to a probability distribution D(x) and labeled according to an unknown function c: y = c(x) • Learning Algorithm: from a set of m examples, outputs a hypothesis h ∈ H that is consistent with those examples (i.e., correctly classifies all of them). • Goal: h should have a low error rate on new examples drawn from the same distribution D: err_D(h) = Pr_{x~D}[h(x) ≠ c(x)]

  9. PAC learning model • PAC learning: Probably Approximately Correct learning • The goal of the learning algorithm: to do optimization over S to find some hypothesis h: X → {0,1} in H that approximates c, and we want the error err_D(h) to be small • If err_D(h) is small, h is “probably approximately correct”. • Formally, h is PAC if Pr[err_D(h) ≤ ε] ≥ 1 − δ for all c ∈ C, ε > 0, δ > 0, and all distributions D

  10. PAC learning model Concept class C is PAC-learnable if there is a learner L that outputs a hypothesis h ∈ H for all c ∈ C, ε > 0, δ > 0, and all distributions D such that: Pr[err_D(h) ≤ ε] ≥ 1 − δ, using at most poly(1/ε, 1/δ, size(X), size(c)) examples and running time. ε: accuracy, 1 − δ: confidence. Such an L is called a strong learner.

  11. PAC learning model • Learner L is a weak learner if L outputs a hypothesis h ∈ H such that: Pr[err_D(h) ≤ 1/2 − γ] ≥ 1 − δ for some γ > 0, for all c ∈ C, δ > 0, and all distributions D • A weak learner only outputs a hypothesis that performs slightly better than random guessing • Hypothesis boosting problem: can we “boost” a weak learner into a strong learner? • Rule of thumb ~ weak learner • Method ~ boosting

  12. Boosting a weak learner – Majority Vote • L learns h1 on the first N training points • L randomly filters the next batch of training points, extracting N/2 points correctly classified by h1 and N/2 incorrectly classified, and produces h2. • L builds a third training set of N points on which h1 and h2 disagree, and produces h3. • L outputs h = Majority Vote(h1, h2, h3)
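A rough sketch of this filtering construction (hypothetical: it assumes numpy-array inputs and a weak_learner(X, y) helper that returns a callable 0/1 classifier):

```python
def boost_by_filtering(weak_learner, X, y, N):
    """Sketch of the three-hypothesis majority-vote construction above.
    Assumes X, y are numpy arrays and weak_learner(X, y) returns a
    callable classifier h(x) with labels in {0, 1}."""
    h1 = weak_learner(X[:N], y[:N])

    # Filter the remaining points: keep N/2 that h1 gets right, N/2 that it gets wrong.
    rest_X, rest_y = X[N:], y[N:]
    correct = [i for i in range(len(rest_X)) if h1(rest_X[i]) == rest_y[i]]
    wrong = [i for i in range(len(rest_X)) if h1(rest_X[i]) != rest_y[i]]
    idx2 = correct[:N // 2] + wrong[:N // 2]
    h2 = weak_learner(rest_X[idx2], rest_y[idx2])

    # Third training set: points on which h1 and h2 disagree.
    disagree = [i for i in range(len(rest_X)) if h1(rest_X[i]) != h2(rest_X[i])]
    h3 = weak_learner(rest_X[disagree[:N]], rest_y[disagree[:N]])

    def H(x):  # majority vote of the three hypotheses
        return int(h1(x) + h2(x) + h3(x) >= 2)
    return H
```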

  13. Boosting [Schapire ’89] Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. A formal description of boosting: • Given training set (x1, y1), …, (xm, ym) • yi ∈ {-1, +1}: correct label of xi ∈ X • for t = 1, …, T: • construct distribution Dt on {1, …, m} • find weak hypothesis ht: X → {-1, +1} with small error εt on Dt • output final hypothesis Hfinal
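A minimal skeleton of this loop (a sketch; weak_learner and reweight are hypothetical placeholders, and AdaBoost's concrete choices for Dt and the vote weights appear later in the transcript):

```python
import numpy as np

def boost(weak_learner, reweight, X, y, T):
    """Generic boosting loop: labels y are in {-1, +1}.
    weak_learner(X, y, D) and reweight(D, h, X, y) stand in for the
    scheme-specific steps."""
    m = len(X)
    D = np.full(m, 1.0 / m)              # distribution D_1 over {1, ..., m}
    hypotheses = []
    for t in range(T):
        h_t = weak_learner(X, y, D)      # weak hypothesis with small error on D_t
        hypotheses.append(h_t)
        D = reweight(D, h_t, X, y)       # construct the next distribution D_{t+1}
        D = D / D.sum()                  # keep it a probability distribution

    def H(x):                            # final hypothesis: a vote of the weak ones
        return int(np.sign(sum(h(x) for h in hypotheses)))
    return H
```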

  14. Boosting [Diagram: the training sample is repeatedly reweighted; each weighted sample produces a weak hypothesis ht(x), and the final hypothesis is H(x) = sign[Σt αt ht(x)]]

  15. Boosting algorithms • AdaBoost (Adaptive Boosting) • LPBoost (Linear Programming Boosting) • BrownBoost • MadaBoost (modifying the weighting system of AdaBoost) • LogitBoost

  16. Lecture • Motivating example • AdaBoost • Training Error • Overfitting • Generalization Error • Examples of AdaBoost • Multiclass for weak learner

  17. Thank you!

  18. Machine Learning: Proof of Bound on AdaBoost Training Error. Aaron Palmer

  19. Theorem: 2-Class Error Bounds • Assume γt = 1/2 − εt • εt = error rate on round t of boosting • γt = how much better than random guessing; a small, positive number • Training error of Hfinal is bounded by: err(Hfinal) ≤ Πt 2√(εt(1−εt)) = Πt √(1−4γt²) ≤ exp(−2 Σt γt²)

  20. Implications? • T = number of rounds of boosting • γt and T do not need to be known in advance • As long as γt ≥ γ > 0, the training error will decrease exponentially as a function of T

  21. Proof Part I: Unwrap Distribution Let f(x) = Σt αt ht(xi). Unwrapping the recursive update of the distribution gives: D_{T+1}(i) = D_1(i) Πt [exp(−αt yi ht(xi)) / Zt] = (1/m) exp(−yi f(xi)) / Πt Zt

  22. Proof Part II: Training Error Training error(Hfinal) = (1/m) Σi 1[yi ≠ Hfinal(xi)] ≤ (1/m) Σi exp(−yi f(xi)) = Σi D_{T+1}(i) Πt Zt = Πt Zt (the inequality uses 1[yi ≠ Hfinal(xi)] ≤ exp(−yi f(xi)), which holds because yi f(xi) ≤ 0 whenever Hfinal misclassifies xi)

  23. Proof Part III: Zt = Σi Dt(i) exp(−αt yi ht(xi)) = (1 − εt) e^{−αt} + εt e^{αt}. Set dZt/dαt equal to zero and solve for αt: αt = (1/2) ln((1 − εt)/εt). Plug back into Zt.

  24. Part III: Continued Plugging αt back in gives Zt = 2√(εt(1 − εt)). Plug in the definition εt = 1/2 − γt: Zt = √(1 − 4γt²)

  25. Exponential Bound • Use the property 1 + x ≤ e^x • Take x to be −4γt²: √(1 − 4γt²) ≤ exp(−2γt²) • Therefore Πt Zt ≤ exp(−2 Σt γt²)
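In display form, the chain of inequalities established by Parts I–III reads (a reconstruction using the notation above):

```latex
\mathrm{err}(H_{\mathrm{final}})
  \;\le\; \prod_{t=1}^{T} Z_t
  \;=\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
  \;=\; \prod_{t=1}^{T} \sqrt{1-4\gamma_t^2}
  \;\le\; \exp\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big)
```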

  26. Putting it together • We’ve proven that the training error drops exponentially fast as a function of the number of base classifiers combined • Bound is pretty loose

  27. Example: • Suppose that all γt are at least 10%, so that no ht has an error rate above 40% • What upper bound does the theorem place on the training error? • Answer: err(Hfinal) ≤ Πt √(1 − 4(0.1)²) = (0.96)^{T/2} ≤ e^{−0.02T}
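A quick numeric check of this answer (an illustrative sketch; the round counts below are arbitrary):

```python
import math

gamma = 0.10                                  # every gamma_t is at least 10%
for T in (10, 50, 100, 200):
    exact = (1 - 4 * gamma**2) ** (T / 2)     # (0.96)^(T/2)
    loose = math.exp(-2 * gamma**2 * T)       # e^(-0.02 T)
    print(T, round(exact, 4), round(loose, 4))
# e.g. T = 100: 0.96^50 is about 0.130 and e^{-2} is about 0.135
```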

  28. Overfitting? • Does the proof say anything about overfitting? • While this is good theory, can we get anything useful out of this proof as far as dealing with unseen data?

  29. Boosting Ayman Alharbi

  30. Example (Spam Emails) • Problem: filter out spam (junk email) • Gather a large collection of examples of spam and non-spam: From: Jinbo@engr.uconn.edu “can you review a paper” ... non-spam From: XYZ@F.U “Win 10000$ easily !!” ... spam

  31. Example (Spam Emails) • If ‘buy now’ occurs in message, then predict ‘spam’ • Main Observation: - Easy to find “rules of thumb” that are “often” correct - Hard to find a single rule that is very highly accurate
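Such a rule of thumb is trivial to implement; a hypothetical sketch (the keyword is the one from the slide):

```python
def rule_of_thumb(message):
    """Predict 'spam' if the phrase 'buy now' occurs in the message."""
    return "spam" if "buy now" in message.lower() else "non-spam"

print(rule_of_thumb("Buy now and win 10000$ easily!!"))  # spam
print(rule_of_thumb("can you review a paper"))           # non-spam
```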

  32. Example (Phone Cards) Goal: automatically categorize the type of call requested by a phone customer (Collect, CallingCard, PersonToPerson, etc.) - “Yes I’d like to place a collect call long distance please” (Collect) - “operator I need to make a call but I need to bill it to my office” (ThirdNumber) - “I just called the wrong number and I would like to have that taken off of my bill” (BillingCredit) • If ‘bill’ occurs in utterance, then predict ‘BillingCredit’ • Main Observation: Easy to find “rules of thumb” that are “often” correct; hard to find a single rule that is very highly accurate

  33. The Boosting Approach • Devise computer program for deriving rough rules of thumb • Apply procedure to subset of emails • Obtain rule of thumb • Apply to 2nd subset of emails • Obtain 2nd rule of thumb • Repeat T times

  34. Details • How to choose examples on each round? - Concentrate on “hardest” examples (those most often misclassified by previous rules of thumb) • How to combine rules of thumb into a single prediction rule? - Take (weighted) majority vote of rules of thumb • Can prove: if we can always find weak rules of thumb slightly better than random guessing, then we can learn almost perfectly using boosting

  35. Idea • At each iteration t: – Weight each training example by how incorrectly it was classified – Learn a hypothesis ht – Choose a strength for this hypothesis, αt • Final classifier: weighted combination of weak learners • Idea: given a set of weak learners, run them multiple times on (reweighted) training data, then let the learned classifiers vote

  36. Boosting: AdaBoost Algorithm Presenter: Karl Severin Computer Science and Engineering Dept.

  37. Boosting Overview • Goal: form one strong classifier from multiple weak classifiers. • Proceeds in rounds, iteratively producing classifiers from the weak learner. • Increases the weight given to incorrectly classified examples. • Gives each classifier an importance that decreases as its weighted error increases. • Each classifier gets a vote based on its importance.

  38. Initialize • Initialize with an evenly weighted distribution: D1(i) = 1/m for i = 1, …, m • Begin generating classifiers

  39. Error • Quality of a classifier is based on its weighted error: εt = Pr_{i~Dt}[ht(xi) ≠ yi] = Σ_{i: ht(xi) ≠ yi} Dt(i) • Probability that ht will misclassify an example selected according to distribution Dt • Or, equivalently, the summation of the weights of all misclassified examples

  40. Classifier Importance • αt = (1/2) ln((1 − εt)/εt) measures the importance given to classifier ht • αt > 0 if εt < 1/2 (εt assumed to always be < 1/2) • αt decreases as εt increases

  41. Update Distribution • Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt is a normalization factor • Increase the weight of misclassified examples • Decrease the weight of correctly classified examples

  42. Combine Classifiers • When classifying a new instance x, all of the weak classifiers get a vote weighted by their α: H(x) = sign(Σt αt ht(x))
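Combining slides 38–42, here is a compact, hypothetical AdaBoost sketch (assumptions: labels in {-1, +1}, a single real-valued feature, and a simple decision-stump weak learner; it is an illustration, not the presenters' code):

```python
import numpy as np

def stump(X, y, D):
    """Weak learner: best threshold/sign decision stump on 1-D data under weights D."""
    best = None
    for thr in np.unique(X):
        for sign in (+1, -1):
            pred = np.where(X >= thr, sign, -sign)
            err = np.sum(D[pred != y])
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x >= thr else -sign

def adaboost(X, y, T):
    m = len(X)
    D = np.full(m, 1.0 / m)                       # slide 38: even initial weights
    hs, alphas = [], []
    for t in range(T):
        h = stump(X, y, D)
        pred = np.array([h(x) for x in X])
        eps = np.sum(D[pred != y])                # slide 39: weighted error
        if eps >= 0.5:                            # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)     # slide 40: classifier importance
        D = D * np.exp(-alpha * y * pred)         # slide 41: reweight examples
        D /= D.sum()
        hs.append(h); alphas.append(alpha)
    # slide 42: weighted vote of the weak classifiers
    return lambda x: int(np.sign(sum(a * h(x) for a, h in zip(alphas, hs))))

# Toy usage: 1-D points with one noisy label, so no single stump is perfect
X = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([-1, 1, -1, 1, 1, 1])
H = adaboost(X, y, T=5)
print([H(x) for x in X])
```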

  43. Review

  44. Questions?

  45. Machine Learning CSE 5095: Special Topics Course Instructor: Jinbo Bi Computer Science and Engineering Dept. Presenter: Brian McClanahan Topic: Boosting Generalization Error

  46. Generalization Error • Generalization error is the true error of a classifier • Most supervised machine learning algorithms operate under the premise that lowering the training error will lead to a lower generalization error • For some machine learning algorithms a relationship between the training and generalization error can be defined through a bound

  47. Generalization Error: First Bound • With high probability, the generalization error is bounded by the empirical risk plus a complexity term: err(Hfinal) ≤ êrr(Hfinal) + Õ(√(T·d/m)) • êrr = empirical risk (training error) • T = number of boosting rounds • d = VC dimension of the base classifiers • m = number of training examples • err = generalization error

  48. Intuition of Bound: Hoeffding’s Inequality • Define H to be a finite set of hypotheses which map examples to 0 or 1 • Hoeffding’s inequality: Let X1, …, Xm be independent random variables such that Xi ∈ [0, 1]. Denote their average value A = (1/m) Σi Xi. Then for any ε > 0 we have: Pr[A ≥ E[A] + ε] ≤ exp(−2mε²) • In the context of machine learning, think of the Xi as errors given by a hypothesis h: Xi = 1[h(xi) ≠ yi], where yi is the true label for xi.
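In display form (a standard statement, matching the notation above), the one-sided Hoeffding bound used here is:

```latex
\Pr\!\left[\frac{1}{m}\sum_{i=1}^{m} X_i \;\ge\; \mathbb{E}\!\left[\frac{1}{m}\sum_{i=1}^{m} X_i\right] + \epsilon\right] \;\le\; e^{-2m\epsilon^2},
\qquad X_i \in [0,1] \text{ independent.}
```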

  49. Intuition of Bound: Hoeffding’s Inequality So E[A] = err_D(h) (the generalization error) and A = err_S(h) (the training error), and by Hoeffding’s inequality: Pr[err_D(h) ≥ err_S(h) + ε] ≤ exp(−2mε²)

  50. Intuition of Bound If we want to bound the generalization error of a single hypothesis with probability 1 − δ, where δ = exp(−2mε²), then we can solve for ε using δ: ε = √(ln(1/δ)/(2m)). So err_D(h) ≤ err_S(h) + √(ln(1/δ)/(2m)) will hold with probability at least 1 − δ
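A small sketch evaluating this slack term for a few made-up sample sizes:

```python
import math

def hoeffding_slack(m, delta):
    """epsilon such that err_D(h) <= err_S(h) + epsilon holds with prob. >= 1 - delta."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))

for m in (100, 1000, 10000):
    print(m, round(hoeffding_slack(m, delta=0.05), 4))
# m = 1000 with delta = 0.05 gives a slack of about 0.0387
```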
