This talk by Kohei Hatano from Kyushu University presents advances in boosting algorithms, focusing on an information-based criterion for selecting prediction rules. The methodology combines weak prediction rules into a more accurate one, illustrated with the task of classifying web pages. Experiments show how the new approach reduces space complexity and improves convergence speed compared to traditional methods such as AdaBoost. Key highlights include adaptive sample-size determination, the smoothness property associated with noise tolerance, and new filtering techniques that achieve higher predictive performance.
Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN
Organization of this talk • Introduction • Preliminaries • Our booster • Experiments • Summary
Boosting
• Methodology to combine prediction rules into a more accurate one.
• E.g., learning a rule to classify web pages on “Drew Barrymore”.
• Set of pred. rules = words (e.g., “Drew?”, “Barrymore?”, “Charlie’s Angels?”), each with accuracy about 51%.
• A combination of prediction rules (say, a majority vote) reaches accuracy about 80%.
(Figure: labeled training data — web pages on “The Barrymore family” of Hollywood: John Barrymore (her grandpa), Lionel Barrymore (her granduncle), Diana Barrymore (her aunt), John Drew Barrymore (her father), Jaid Barrymore (her mother) — with the yes/no answers of each word rule.)
Boosting by filtering [Schapire 90], [Freund 95]
• Boosting scheme that uses random sampling from (huge) data: examples are sampled randomly and either accepted or rejected by the booster.
• Advantage 1: can determine the sample size adaptively.
• Advantage 2: smaller space complexity (for the sample):
  batch learning: O(1/ε);  boosting by filtering: polylog(1/ε)   (ε: desired error)
Some known results
Boosting algorithms by filtering
• Schapire’s first boosting alg. [Schapire 90], Boost-by-Majority [Freund 95], MadaBoost [Domingo&Watanabe 00], AdaFlat [Gavinsky 03].
• Criterion for choosing prediction rules: accuracy.
Are there any better criteria? A candidate: an information-based criterion
• Real AdaBoost [Schapire&Singer 99], InfoBoost [Aslam 00] (a simple version of Real AdaBoost).
• Criterion for choosing prediction rules: mutual information.
• Sometimes faster than those using an accuracy-based criterion (experimental: [Schapire&Singer 99]; theoretical: [Hatano&Warmuth 03], [Hatano&Watanabe 04]).
• However, no boosting algorithm by filtering with such a criterion is known.
Our work
• Boosting by filtering → lower space complexity.
• Information-based criterion → faster convergence.
• Our work: efficient boosting by filtering using an information-based criterion.
Introduction • Preliminaries • Our booster • Experiments • Summary
Illustration of general boosting
1. Choose a pred. rule h1 maximizing some criterion w.r.t. D1.
2. Assign a coefficient to h1 based on its quality (here, 0.25).
3. Update the distribution: correctly classified examples get lower weight, wrongly classified examples get higher weight.
(Figure: h1 splits the examples into +1 and −1 regions, with correct and wrong predictions marked.)
Illustration of general boosting (2)
1. Choose a pred. rule h2 maximizing some criterion w.r.t. D2.
2. Assign a coefficient to h2 based on its weighted error (here, 0.28).
3. Update the distribution as before.
Repeat this procedure for T times.
(Figure: h2 splits the examples into +1 and −1 regions, with correct and wrong predictions marked.)
Illustration of general boosting (3)
Final pred. rule = weighted majority vote of the chosen pred. rules:
H(x) = 0.25·h1(x) + 0.28·h2(x) + 0.05·h3(x);  predict +1 if H(x) > 0, predict −1 otherwise.
(A generic sketch of this loop follows below.)
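The three-step loop sketched on these slides can be written down generically. Below is a minimal Python sketch of that skeleton; the helper names (choose_rule, coefficient, reweight) are placeholders I introduce for whatever criterion, coefficient rule, and distribution update a concrete booster (AdaBoost, MadaBoost, GiniBoost, …) plugs in.

```python
import numpy as np

def generic_boost(X, y, rules, T, choose_rule, coefficient, reweight):
    """Generic boosting skeleton over labels y in {-1, +1} (numpy array).

    rules: candidate pred. rules, each a function x -> {-1, +1}.
    choose_rule / coefficient / reweight are supplied by the concrete booster.
    """
    n = len(y)
    D = np.full(n, 1.0 / n)        # D1: start from the uniform distribution
    H = np.zeros(n)                # running combined prediction H_t(x_i)
    chosen, alphas = [], []
    for _ in range(T):
        h = choose_rule(rules, X, y, D)    # 1. maximize the criterion w.r.t. D_t
        alpha = coefficient(h, X, y, D)    # 2. coefficient from the rule's quality
        H += alpha * np.array([h(x) for x in X])
        D = reweight(y, H)                 # 3. update the distribution ...
        D = D / D.sum()                    #    ... and renormalize
        chosen.append(h)
        alphas.append(alpha)
    # final rule: weighted majority vote of the chosen pred. rules
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, chosen)) > 0 else -1
```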
Example: AdaBoost [Freund&Schapire 97]
• Criterion for choosing pred. rules: the edge.
• Coefficient and update: each example i is reweighted according to −yiHt(xi), so weights shrink exponentially for correctly classified examples and grow exponentially for wrongly classified ones.
• Consequence: difficult examples (possibly noisy) may get too much weight.
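As a concrete instance of the skeleton above, here is a sketch of AdaBoost’s three ingredients with the standard formulas from [Freund&Schapire 97]; the function names are mine, and y, H, D are assumed to be numpy arrays as in the skeleton.

```python
import numpy as np

def edge(h, X, y, D):
    """AdaBoost's criterion: edge = sum_i D(i) * y_i * h(x_i)."""
    return float(np.sum(D * y * np.array([h(x) for x in X])))

def adaboost_coefficient(h, X, y, D):
    """alpha_t = (1/2) * ln((1 + edge) / (1 - edge))."""
    g = edge(h, X, y, D)
    return 0.5 * np.log((1 + g) / (1 - g))

def adaboost_reweight(y, H):
    """Unnormalized weights exp(-y_i * H_t(x_i)): they shrink for correctly
    classified examples and grow without bound for wrongly classified ones,
    which is why difficult (possibly noisy) examples can dominate."""
    return np.exp(-y * H)
```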
Smooth boosting
• Keeping the distribution “smooth”: supx Dt(x)/D1(x) is poly-bounded, where Dt is the distribution constructed by the booster and D1 is the original distribution (e.g., uniform).
• Smoothness makes boosting algorithms noise-tolerant:
  statistical query model — MadaBoost [Domingo&Watanabe 00];
  malicious noise model — SmoothBoost [Servedio 01];
  agnostic boosting model — AdaFlat [Gavinsky 03].
• Sampling from Dt can be simulated efficiently via sampling from D1 (e.g., by rejection sampling; see the sketch below), so smooth boosters are applicable in the boosting-by-filtering framework.
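Here is a minimal sketch of the rejection-sampling simulation mentioned above, assuming the smoothness bound B ≥ supx Dt(x)/D1(x) is known; sample_from_D1 and density_ratio are hypothetical callables standing in for access to the original data source and to the weight Dt(x)/D1(x).

```python
import random

def sample_from_Dt(sample_from_D1, density_ratio, B):
    """Rejection sampling: draw x ~ D1 and accept it with probability
    Dt(x) / (B * D1(x)); the accepted examples are distributed according to Dt.

    density_ratio(x) = Dt(x) / D1(x), which is at most B by smoothness.
    """
    while True:
        x = sample_from_D1()
        if random.random() < density_ratio(x) / B:   # accept
            return x
        # otherwise reject and draw again
```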
Example: MadaBoost [Domingo & Watanabe 00]
• Criterion for choosing pred. rules: the edge.
• Coefficient and update: each example i is weighted by l(−yiHt(xi)), where the weighting function l truncates AdaBoost’s exponential weights at 1.
• Dt is 1/ε-bounded (ε: error of Ht).
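A sketch of the MadaBoost-style weighting function, under the usual description that it truncates AdaBoost’s exponential weights at 1 (the exact constants in [Domingo & Watanabe 00] may differ):

```python
import numpy as np

def madaboost_weight(z):
    """l(z) with z = -y_i * H_t(x_i): exponential (as in AdaBoost) for z <= 0,
    but capped at 1 for z > 0, which keeps the weights, and hence Dt, smooth."""
    return np.minimum(1.0, np.exp(z))

def madaboost_reweight(y, H):
    """Unnormalized example weights l(-y_i * H_t(x_i))."""
    return madaboost_weight(-y * H)
```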
Examples of other smooth boosters
• LogitBoost [Friedman et al. 00]: logistic weighting function.
• AdaFlat [Gavinsky 03]: stepwise linear weighting function.
Introduction • Preliminaries • Our booster • Experiments • Summary
Our new booster
• Criterion for choosing pred. rules: the pseudo gain.
• Coefficient and update: as in MadaBoost, each example i is weighted by l(−yiHt(xi)).
• Still, Dt is 1/ε-bounded (ε: error of Ht).
Pseudo gain
• Relation to the edge. Property: (pseudo gain) ≥ (edge)² (by convexity of the square function).
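The formulas on this slide were shown as images; the LaTeX below is my reconstruction of the pseudo gain and of the stated relation to the edge, using notation (p_b, e_b) that I introduce. It should be read as an assumed definition, not a verbatim copy of the slide.

```latex
% For b in {+1,-1}, split the distribution by the rule's prediction:
%   p_b = \sum_{i:\,h(x_i)=b} D_t(i), \qquad e_b = \sum_{i:\,h(x_i)=b} D_t(i)\,y_i .
\tilde{\gamma}_t(h) = \sum_{b\in\{+1,-1\}} \frac{e_b^{\,2}}{p_b}
  \quad\text{(pseudo gain)}, \qquad
\gamma_t(h) = \sum_i D_t(i)\,y_i\,h(x_i) = \sum_b b\,e_b
  \quad\text{(edge)} .
% Relation to the edge, by convexity of z^2 (Jensen with weights p_b, \sum_b p_b = 1):
\gamma_t(h)^2
  = \Bigl(\sum_b p_b \cdot \tfrac{b\,e_b}{p_b}\Bigr)^{2}
  \le \sum_b p_b \Bigl(\tfrac{e_b}{p_b}\Bigr)^{2}
  = \sum_b \frac{e_b^{2}}{p_b}
  = \tilde{\gamma}_t(h).
```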
Interpretation of pseudo gain
• Maximizing the pseudo gain over h corresponds to minimizing over h the conditional entropy of the labels given h, i.e., maximizing over h the mutual information between h and the labels.
• But: the entropy function here is NOT Shannon’s entropy; it is defined with the Gini index.
Information-based criteria [Kearns & Mansour 98]
• Our booster chooses a pred. rule maximizing the mutual information defined with the Gini index (GiniBoost).
• Cf. Real AdaBoost and InfoBoost choose a pred. rule that maximizes the mutual information defined with the KM entropy.
• Good news: the Gini index can be estimated efficiently via sampling! (See the sketch below.)
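A minimal Python sketch, assuming the pseudo-gain definition reconstructed above, of how the Gini-index-based criterion can be computed on a weighted (sub)sample and used to pick a prediction rule; the function names and the plug-in estimation over an accepted sample are my illustration, not the paper’s exact procedure.

```python
import numpy as np

def pseudo_gain(h, X, y, D):
    """Pseudo gain of rule h w.r.t. distribution D (Gini-index-based criterion)."""
    preds = np.array([h(x) for x in X])
    gain = 0.0
    for b in (+1, -1):
        mask = (preds == b)
        p_b = D[mask].sum()                     # mass of examples predicted as b
        if p_b > 0:
            e_b = (D[mask] * y[mask]).sum()     # signed correlation with labels on that side
            gain += e_b ** 2 / p_b
    return gain

def choose_rule_giniboost(rules, X, y, D):
    """Pick the prediction rule maximizing the pseudo gain.  In the filtering
    setting, X, y would be an accepted random sample and D uniform over it,
    giving a plug-in estimate of the pseudo gain."""
    return max(rules, key=lambda h: pseudo_gain(h, X, y, D))
```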
Convergence of train. error (GiniBoost)
Thm. Suppose that (train. error of Ht) > ε for t = 1,…,T. Then …
Coro. Further, if the pseudo gain γ̃t(ht) ≥ γ for all t, then train.err(HT) ≤ ε within T = O(1/(γε)) steps.
Comparison on convergence speed
(γ̃: minimum pseudo gain; γ: minimum edge. Since γ̃ ≥ γ², a bound in terms of 1/γ̃ is never worse than the corresponding bound in terms of 1/γ².)
Boosting-by-filtering version of GiniBoost (outline) • Multiplicative bounds for the pseudo gain (and more practical bounds using the central limit approximation). • Adaptive pred. rule selector. • Boosting alg. in the PAC learning sense.
Introduction • Preliminaries • Our booster • Experiments • Summary
Experiments • Topic classification of Reuters news (Reuters-21578). • Binary classification for each of 5 topics (results are averaged). • 10,000 examples. • 30,000 words used as base pred. rules. • Run algorithms until they sample 1,000,000 examples in total. • 10-fold CV.
Test error over Reuters. Note: GiniBoost2 doubles the coefficients αt[+1], αt[−1] used in GiniBoost.
Execution time: faster by about 4 times! (Cf. a similar result without sampling: Real AdaBoost [Schapire & Singer 99].)
Introduction • Preliminaries • Our booster • Experiments • Summary
Summary / Open problem
Summary — GiniBoost:
• uses the pseudo gain (Gini index) to choose base prediction rules;
• shows faster convergence in the filtering scheme.
Open problem:
• Theoretical analysis of noise tolerance.
Comparison on sample size. Observation: fewer accepted examples → faster selection of pred. rules.