
Smooth Boosting By Using An Information-Based Criterion



Presentation Transcript


  1. Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

  2. Organization of this talk • Introduction • Preliminaries • Our booster • Experiments • Summary

  3. Boosting
  • Methodology to combine prediction rules into a more accurate one.
  • E.g., learning a rule to classify web pages on “Drew Barrymore”. Labeled training data (web pages): “The Barrymore family” of Hollywood, e.g. John Drew Barrymore (her father), Jaid Barrymore (her mother), John Barrymore (her grandpa), Lionel Barrymore (her granduncle), Diana Barrymore (her aunt).
  • Set of prediction rules = words (“Drew?”, “Barrymore?”, “Charlie’s Angels?”): accuracy 51%!
  • A combination of prediction rules (say, a majority vote): accuracy 80%.
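  To make the combination concrete, here is a minimal Python sketch, assuming each base prediction rule simply tests whether one word occurs in the page; the words, weights, and pages below are hypothetical illustrations, not the data behind the 51%/80% figures.

# Sketch: combining word-based prediction rules by a weighted majority vote.
def word_rule(word):
    """A base prediction rule: +1 if the word occurs in the page, else -1."""
    return lambda page: 1 if word in page else -1

rules   = [word_rule("Barrymore"), word_rule("Drew"), word_rule("Hollywood")]
weights = [0.8, 0.5, 0.3]          # coefficients assigned by a booster

def combined_rule(page):
    score = sum(w * h(page) for w, h in zip(weights, rules))
    return 1 if score > 0 else -1  # weighted majority vote

print(combined_rule("The Barrymore family of Hollywood"))  # +1
print(combined_rule("An unrelated page about football"))   # -1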

  4. Boosting by filtering [Schapire 90], [Freund 95]
  • Boosting scheme that uses random sampling from (huge) data: the booster samples examples randomly and accepts or rejects them.
  • Advantage 1: can determine the sample size adaptively.
  • Advantage 2: smaller space complexity (for the sample): batch learning needs O(1/ε), while boosting by filtering needs polylog(1/ε) (ε: desired error).

  5. Some known results
  Boosting algorithms by filtering
  • Schapire’s first boosting alg. [Schapire 90], Boost-by-Majority [Freund 95], MadaBoost [Domingo&Watanabe 00], AdaFlat [Gavinsky 03].
  • Criterion for choosing prediction rules: accuracy.
  Are there any better criteria? A candidate: information-based criteria
  • Real AdaBoost [Schapire&Singer 99], InfoBoost [Aslam 00] (a simple version of Real AdaBoost).
  • Criterion for choosing prediction rules: mutual information.
  • Sometimes faster than boosters using an accuracy-based criterion (experimental: [Schapire&Singer 99]; theoretical: [Hatano&Warmuth 03], [Hatano&Watanabe 04]).
  • However, no boosting-by-filtering algorithm using such a criterion is known.

  6. Our work
  • Boosting by filtering → lower space complexity.
  • Information-based criterion → faster convergence.
  • Our work: an efficient boosting-by-filtering algorithm that uses an information-based criterion.

  7. Introduction • Preliminaries • Our booster • Experiments • Summary

  8. Illustration of general boosting
  1. Choose a prediction rule h1 maximizing some criterion w.r.t. D1.
  2. Assign a coefficient to h1 based on its quality (e.g., 0.25).
  3. Update the distribution: weights of correctly classified examples become lower, weights of misclassified examples become higher.

  9. Illustration of general boosting (2)
  1. Choose a prediction rule h2 maximizing some criterion w.r.t. D2.
  2. Assign a coefficient to h2 based on its weighted error (e.g., 0.28).
  3. Update the distribution again.
  Repeat this procedure for T rounds.

  10. Illustration of general boosting (3)
  Final prediction rule = weighted majority vote of the chosen prediction rules, e.g. H(x) = 0.25 h1(x) + 0.28 h2(x) + 0.05 h3(x) for an instance x; predict +1 if H(x) > 0, and -1 otherwise. (A code sketch of this loop follows below.)
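  A minimal Python sketch of this generic loop; the criterion, coefficient, and update are left as parameters, since they are exactly the ingredients that distinguish AdaBoost, MadaBoost, and GiniBoost. The function names are placeholders, not the paper's notation.

import numpy as np

def boost(X, y, rules, criterion, coefficient, update, T):
    """Generic boosting: choose a rule, assign a coefficient, reweight; repeat T times."""
    n = len(y)
    D = np.ones(n) / n                       # D1: initial (e.g. uniform) distribution
    chosen, alphas = [], []
    for t in range(T):
        # 1. choose the prediction rule maximizing the criterion w.r.t. Dt
        h = max(rules, key=lambda r: criterion(r, X, y, D))
        # 2. assign it a coefficient based on its quality under Dt
        a = coefficient(h, X, y, D)
        chosen.append(h)
        alphas.append(a)
        # 3. update the distribution (correct examples down, wrong examples up)
        D = update(D, h, a, X, y)
    # final rule: weighted majority vote H(x) = sign(sum_t a_t * h_t(x))
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, chosen)) > 0 else -1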

  11. Example: AdaBoost [Freund&Schapire 97]
  • Criterion for choosing prediction rules: the edge (the weighted correlation of h with the labels under Dt).
  • Coefficient and update: each example's weight is multiplied by a factor exponential in -yiHt(xi), so correct examples get lower weight and wrong examples get higher weight.
  • Problem: difficult examples (possibly noisy) may get too much weight.
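  For reference, a sketch of the standard AdaBoost ingredients that plug into the skeleton above, assuming the usual formulation from [Freund&Schapire 97]: the edge as the selection criterion, αt = (1/2) ln((1+edge)/(1-edge)) as the coefficient, and a multiplicative update exponential in -yiHt(xi).

import numpy as np

def edge(h, X, y, D):
    """Criterion: weighted correlation of h with the labels under D."""
    return sum(D[i] * y[i] * h(X[i]) for i in range(len(y)))

def ada_coefficient(h, X, y, D):
    """alpha_t = 1/2 * ln((1 + edge) / (1 - edge))."""
    g = edge(h, X, y, D)
    return 0.5 * np.log((1 + g) / (1 - g))

def ada_update(D, h, a, X, y):
    """D_{t+1}(i) proportional to D_t(i) * exp(-a * y_i * h(x_i))."""
    w = np.array([D[i] * np.exp(-a * y[i] * h(X[i])) for i in range(len(y))])
    return w / w.sum()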

  12. Smooth boosting
  • Keeping the distribution “smooth”: sup_x Dt(x)/D1(x) is poly-bounded, i.e. Dt ≤ poly · D1 (Dt: distribution constructed by the booster; D1: original distribution, e.g. uniform).
  • Smoothness makes boosting algorithms noise-tolerant: (statistical query model) MadaBoost [Domingo&Watanabe 00]; (malicious noise model) SmoothBoost [Servedio 01]; (agnostic boosting model) AdaFlat [Gavinsky 03].
  • Sampling from Dt can be simulated efficiently via sampling from D1 (e.g., by rejection sampling), so smooth boosters are applicable in the boosting-by-filtering framework (see the sketch below).
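  A minimal sketch of the rejection-sampling simulation, assuming the booster can evaluate the weight ratio Dt(x)/D1(x) and knows the smoothness bound B on it; draw_from_D1 and weight are hypothetical helper names, not the paper's interface.

import random

def draw_from_Dt(draw_from_D1, weight, B):
    """Draw one example from Dt using only an oracle for D1,
    assuming weight(x, y) = Dt(x)/D1(x) <= B (smoothness)."""
    while True:
        x, y = draw_from_D1()                  # sample (x, y) from the original D1
        if random.random() <= weight(x, y) / B:
            return x, y                        # accept with probability Dt / (B * D1)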

  13. Example: MadaBoost [Domingo & Watanabe 00]
  • Criterion for choosing prediction rules: the edge.
  • Coefficient and update: each example is weighted by a loss function l(-yiHt(xi)) of its margin.
  • Dt is 1/ε-bounded (ε: error of Ht).
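  A hedged sketch of MadaBoost-style smooth weighting, assuming the usual presentation in which AdaBoost's exponential weight is capped at 1 relative to D1; the exact formulas on the slide may differ in details.

import numpy as np

def mada_update(D1, alphas, hs, X, y):
    """Weight of example i is D1(i) * min(1, exp(-y_i * H_t(x_i))), renormalized."""
    margins = np.array([y[i] * sum(a * h(X[i]) for a, h in zip(alphas, hs))
                        for i in range(len(y))])
    w = D1 * np.minimum(1.0, np.exp(-margins))   # cap prevents any example from dominating
    return w / w.sum()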

  14. Examples of other smooth boosters
  • LogitBoost [Friedman et al. 00]: based on the logistic function.
  • AdaFlat [Gavinsky 03]: based on a stepwise linear function.

  15. Introduction • Preliminaries • Our booster • Experiments • Summary

  16. Our new booster
  • Criterion for choosing prediction rules: the pseudo gain.
  • Coefficient and update: each example is weighted by a loss function l(-yiHt(xi)) of its margin.
  • Still, Dt is 1/ε-bounded (ε: error of Ht).
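  A hedged sketch of a final hypothesis with prediction-dependent coefficients αt[+1], αt[-1], the form suggested by slide 25 and used by Real AdaBoost / InfoBoost; how GiniBoost actually sets these coefficients is specified in the paper, so they are simply taken as given here.

def final_hypothesis(hs, alphas):
    """hs: base rules h_t(x) in {-1,+1}; alphas: one dict {+1: a_plus, -1: a_minus} per round."""
    def H(x):
        score = sum(alpha[h(x)] * h(x) for h, alpha in zip(hs, alphas))
        return 1 if score > 0 else -1
    return H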

  17. Pseudo gain
  • Relation to edge. Property: (pseudo gain) ≥ (edge)² (by convexity of the square function).
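  A short derivation of this property, assuming the pseudo gain has the Gini-style form of a prediction-weighted sum of squared conditional edges; under that assumption the bound is just Jensen's inequality applied to the convex function z ↦ z².

\[
  \tilde{\gamma}(h) \;=\; \sum_{j\in\{-1,+1\}} d_j\,\gamma_j^{2}
  \;\ge\; \Bigl(\sum_{j\in\{-1,+1\}} d_j\,\gamma_j\Bigr)^{2}
  \;=\; \gamma(h)^{2},
  \qquad
  d_j=\Pr_{D_t}[h(x)=j],\quad
  \gamma_j=\mathbb{E}_{D_t}[\,j\,y \mid h(x)=j\,],
\]

  since the dj sum to 1 and γ(h) = Σj dj γj is the usual edge.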

  18. Interpretation of pseudo gain
  • Maximizing the pseudo gain ⇔ minimizing over h the conditional entropy of the labels given h ⇔ maximizing over h the mutual information between h and the labels.
  • But the entropy function here is NOT defined with Shannon’s entropy; it is defined with the Gini index.

  19. Information-based criteria [Kearns & Mansour 98]
  • Our booster chooses a prediction rule maximizing the mutual information defined by the Gini index (GiniBoost).
  • Cf. Real AdaBoost and InfoBoost choose a prediction rule that maximizes the mutual information defined with the KM entropy.
  • Good news: the Gini index can be estimated efficiently via sampling!
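  A minimal sketch of estimating a Gini-index-based gain of a candidate rule on a weighted sample; this is the generic Gini-impurity-reduction criterion familiar from decision-tree learning, and the paper's exact pseudo gain may differ in scaling.

def gini(p):
    """Gini impurity of a label distribution with positive fraction p (max 0.5 at p = 0.5)."""
    return 2.0 * p * (1.0 - p)

def gini_gain(h, X, y, D):
    """Impurity reduction from splitting the weighted sample (D sums to 1) on h's prediction."""
    p_all = sum(D[i] for i in range(len(y)) if y[i] == 1)
    gain = gini(p_all)
    for j in (-1, +1):
        idx = [i for i in range(len(y)) if h(X[i]) == j]
        d_j = sum(D[i] for i in idx)
        if d_j > 0:
            p_j = sum(D[i] for i in idx if y[i] == 1) / d_j
            gain -= d_j * gini(p_j)
    return gain                                # larger gain = more informative rule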

  20. Convergence of training error (GiniBoost)
  Thm. Suppose that (training error of Ht) > ε for t = 1,…,T. Then the cumulative pseudo gain over the T rounds is bounded by O(1/ε).
  Coro. Further, if the pseudo gain of ht is at least γ for each t, then train.err(HT) ≤ ε within T = O(1/(εγ)) steps.

  21. Comparison on convergence speed
  (Table comparing iteration bounds; legend: minimum pseudo gain vs. minimum edge.)

  22. Boosting-by-filtering version of GiniBoost (outline)
  • Multiplicative bounds for the pseudo gain (and more practical bounds using the central limit approximation).
  • Adaptive prediction rule selector (sketched below).
  • A boosting algorithm in the PAC learning sense.
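  A hedged sketch of an adaptive prediction rule selector in the filtering setting: keep drawing filtered examples and doubling the sample until the empirically best rule's estimated gain clears a Hoeffding-style confidence margin. The stopping rule and constants are generic illustrations, not the paper's multiplicative bounds.

import math

def adaptive_select(draw_filtered_example, rules, gain_on_sample, delta=0.05):
    """Select a rule whose estimated gain is reliably positive, growing the sample adaptively."""
    sample, m = [], 64
    while True:
        while len(sample) < m:
            sample.append(draw_filtered_example())       # (x, y) drawn via filtering
        eps = math.sqrt(math.log(2 * len(rules) / delta) / (2 * m))
        scores = {h: gain_on_sample(h, sample) for h in rules}
        best = max(rules, key=scores.get)
        if scores[best] - eps > 0:                        # estimate is reliably positive
            return best
        m *= 2                                            # otherwise, draw more examples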

  23. Introduction • Preliminaries • Our booster • Experiments • Summary

  24. Experiments
  • Topic classification of Reuters news (Reuters-21578).
  • Binary classification for each of 5 topics (results are averaged).
  • 10,000 examples.
  • 30,000 words used as base prediction rules.
  • Algorithms are run until they sample 1,000,000 examples in total.
  • 10-fold CV.

  25. Test error over Reuters
  Note: GiniBoost2 doubles the coefficients αt[+1], αt[-1] used in GiniBoost.

  26. Execution time
  About 4 times faster! (Cf. a similar result without sampling: Real AdaBoost [Schapire & Singer 99].)

  27. Introduction • Preliminaries • Our booster • Experiments • Summary

  28. Summary / Open problem
  Summary: GiniBoost
  • uses the pseudo gain (Gini index) to choose base prediction rules.
  • shows faster convergence in the filtering scheme.
  Open problem
  • Theoretical analysis of noise tolerance.

  29. Comparison on sample size
  Observation: fewer accepted examples → faster selection of prediction rules.
