
Smooth Boosting By Using An Information-Based Criterion



Presentation Transcript


  1. Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

  2. Organization of this talk • Introduction • Preliminaries • Our booster • Experiments • Summary

  3. Boosting
  • Methodology to combine prediction rules into a more accurate one.
  • E.g., learning a rule to classify web pages on “Drew Barrymore”. Labeled training data (web pages): “The Barrymore family” of Hollywood, e.g. John Drew Barrymore (her father), Jaid Barrymore (her mother), John Barrymore (her grandpa), Lionel Barrymore (her granduncle), Diana Barrymore (her aunt).
  • Set of prediction rules = words (“Drew?”, “Barrymore?”, “Charlie’s Angels?”): accuracy 51%!
  • A combination of prediction rules (say, a majority vote): accuracy 80%.
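  To make the combination concrete, here is a minimal Python sketch, assuming each base prediction rule simply tests whether one word occurs in the page; the words, weights, and pages below are hypothetical illustrations, not the data behind the 51%/80% figures.

# Sketch: combining word-based prediction rules by a weighted majority vote.
def word_rule(word):
    """A base prediction rule: +1 if the word occurs in the page, else -1."""
    return lambda page: 1 if word in page else -1

rules   = [word_rule("Barrymore"), word_rule("Drew"), word_rule("Hollywood")]
weights = [0.8, 0.5, 0.3]          # coefficients assigned by a booster

def combined_rule(page):
    score = sum(w * h(page) for w, h in zip(weights, rules))
    return 1 if score > 0 else -1  # weighted majority vote

print(combined_rule("The Barrymore family of Hollywood"))  # +1
print(combined_rule("An unrelated page about football"))   # -1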

  4. Boosting by filtering [Schapire 90], [Freund 95]
  • Boosting scheme that uses random sampling from (huge) data: the booster samples examples randomly and accepts or rejects them.
  • Advantage 1: can determine the sample size adaptively.
  • Advantage 2: smaller space complexity (for the sample): batch learning needs O(1/ε), while boosting by filtering needs polylog(1/ε) (ε: desired error).

  5. Some known results
  Boosting algorithms by filtering
  • Schapire’s first boosting alg. [Schapire 90], Boost-by-Majority [Freund 95], MadaBoost [Domingo&Watanabe 00], AdaFlat [Gavinsky 03].
  • Criterion for choosing prediction rules: accuracy.
  Are there any better criteria? A candidate: information-based criteria
  • Real AdaBoost [Schapire&Singer 99], InfoBoost [Aslam 00] (a simple version of Real AdaBoost).
  • Criterion for choosing prediction rules: mutual information.
  • Sometimes faster than boosters using an accuracy-based criterion (experimental: [Schapire&Singer 99]; theoretical: [Hatano&Warmuth 03], [Hatano&Watanabe 04]).
  • However, no boosting-by-filtering algorithm using such a criterion is known.

  6. Our work
  • Boosting by filtering → lower space complexity.
  • Information-based criterion → faster convergence.
  • Our work: an efficient boosting-by-filtering algorithm that uses an information-based criterion.

  7. Introduction • Preliminaries • Our booster • Experiments • Summary

  8. Illustration of general boosting
  1. Choose a prediction rule h1 maximizing some criterion w.r.t. D1.
  2. Assign a coefficient to h1 based on its quality (e.g., 0.25).
  3. Update the distribution: weights of correctly classified examples become lower, weights of misclassified examples become higher.

  9. Illustration of general boosting (2)
  1. Choose a prediction rule h2 maximizing some criterion w.r.t. D2.
  2. Assign a coefficient to h2 based on its weighted error (e.g., 0.28).
  3. Update the distribution again.
  Repeat this procedure for T rounds.

  10. Illustration of general boosting (3)
  Final prediction rule = weighted majority vote of the chosen prediction rules, e.g. H(x) = 0.25 h1(x) + 0.28 h2(x) + 0.05 h3(x) for an instance x; predict +1 if H(x) > 0, and -1 otherwise. (A code sketch of this loop follows below.)
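  A minimal Python sketch of this generic loop; the criterion, coefficient, and update are left as parameters, since they are exactly the ingredients that distinguish AdaBoost, MadaBoost, and GiniBoost. The function names are placeholders, not the paper's notation.

import numpy as np

def boost(X, y, rules, criterion, coefficient, update, T):
    """Generic boosting: choose a rule, assign a coefficient, reweight; repeat T times."""
    n = len(y)
    D = np.ones(n) / n                       # D1: initial (e.g. uniform) distribution
    chosen, alphas = [], []
    for t in range(T):
        # 1. choose the prediction rule maximizing the criterion w.r.t. Dt
        h = max(rules, key=lambda r: criterion(r, X, y, D))
        # 2. assign it a coefficient based on its quality under Dt
        a = coefficient(h, X, y, D)
        chosen.append(h)
        alphas.append(a)
        # 3. update the distribution (correct examples down, wrong examples up)
        D = update(D, h, a, X, y)
    # final rule: weighted majority vote H(x) = sign(sum_t a_t * h_t(x))
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, chosen)) > 0 else -1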

  11. Example: AdaBoost [Freund&Schapire 97]
  • Criterion for choosing prediction rules: the edge (the weighted correlation of h with the labels under Dt).
  • Coefficient and update: each example's weight is multiplied by a factor exponential in -yiHt(xi), so correct examples get lower weight and wrong examples get higher weight.
  • Problem: difficult examples (possibly noisy) may get too much weight.
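  For reference, a sketch of the standard AdaBoost ingredients that plug into the skeleton above, assuming the usual formulation from [Freund&Schapire 97]: the edge as the selection criterion, αt = (1/2) ln((1+edge)/(1-edge)) as the coefficient, and a multiplicative update exponential in -yiHt(xi).

import numpy as np

def edge(h, X, y, D):
    """Criterion: weighted correlation of h with the labels under D."""
    return sum(D[i] * y[i] * h(X[i]) for i in range(len(y)))

def ada_coefficient(h, X, y, D):
    """alpha_t = 1/2 * ln((1 + edge) / (1 - edge))."""
    g = edge(h, X, y, D)
    return 0.5 * np.log((1 + g) / (1 - g))

def ada_update(D, h, a, X, y):
    """D_{t+1}(i) proportional to D_t(i) * exp(-a * y_i * h(x_i))."""
    w = np.array([D[i] * np.exp(-a * y[i] * h(X[i])) for i in range(len(y))])
    return w / w.sum()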

  12. Smooth boosting
  • Keeping the distribution “smooth”: sup_x Dt(x)/D1(x) is poly-bounded, i.e. Dt ≤ poly · D1 (Dt: distribution constructed by the booster; D1: original distribution, e.g. uniform).
  • Smoothness makes boosting algorithms noise-tolerant: (statistical query model) MadaBoost [Domingo&Watanabe 00]; (malicious noise model) SmoothBoost [Servedio 01]; (agnostic boosting model) AdaFlat [Gavinsky 03].
  • Sampling from Dt can be simulated efficiently via sampling from D1 (e.g., by rejection sampling), so smooth boosters are applicable in the boosting-by-filtering framework (see the sketch below).
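  A minimal sketch of the rejection-sampling simulation, assuming the booster can evaluate the weight ratio Dt(x)/D1(x) and knows the smoothness bound B on it; draw_from_D1 and weight are hypothetical helper names, not the paper's interface.

import random

def draw_from_Dt(draw_from_D1, weight, B):
    """Draw one example from Dt using only an oracle for D1,
    assuming weight(x, y) = Dt(x)/D1(x) <= B (smoothness)."""
    while True:
        x, y = draw_from_D1()                  # sample (x, y) from the original D1
        if random.random() <= weight(x, y) / B:
            return x, y                        # accept with probability Dt / (B * D1)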

  13. Example: MadaBoost [Domingo & Watanabe 00]
  • Criterion for choosing prediction rules: the edge.
  • Coefficient and update: each example is weighted by a loss function l(-yiHt(xi)) of its margin.
  • Dt is 1/ε-bounded (ε: error of Ht).
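  A hedged sketch of MadaBoost-style smooth weighting, assuming the usual presentation in which AdaBoost's exponential weight is capped at 1 relative to D1; the exact formulas on the slide may differ in details.

import numpy as np

def mada_update(D1, alphas, hs, X, y):
    """Weight of example i is D1(i) * min(1, exp(-y_i * H_t(x_i))), renormalized."""
    margins = np.array([y[i] * sum(a * h(X[i]) for a, h in zip(alphas, hs))
                        for i in range(len(y))])
    w = D1 * np.minimum(1.0, np.exp(-margins))   # cap prevents any example from dominating
    return w / w.sum()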

  14. Examples of other smooth boosters
  • LogitBoost [Friedman et al. 00]: based on the logistic function.
  • AdaFlat [Gavinsky 03]: based on a stepwise linear function.

  15. Introduction • Preliminaries • Our booster • Experiments • Summary

  16. Our new booster
  • Criterion for choosing prediction rules: the pseudo gain.
  • Coefficient and update: each example is weighted by a loss function l(-yiHt(xi)) of its margin.
  • Still, Dt is 1/ε-bounded (ε: error of Ht).
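  A hedged sketch of a final hypothesis with prediction-dependent coefficients αt[+1], αt[-1], the form suggested by slide 25 and used by Real AdaBoost / InfoBoost; how GiniBoost actually sets these coefficients is specified in the paper, so they are simply taken as given here.

def final_hypothesis(hs, alphas):
    """hs: base rules h_t(x) in {-1,+1}; alphas: one dict {+1: a_plus, -1: a_minus} per round."""
    def H(x):
        score = sum(alpha[h(x)] * h(x) for h, alpha in zip(hs, alphas))
        return 1 if score > 0 else -1
    return H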

  17. Pseudo gain
  • Relation to edge. Property: (pseudo gain) ≥ (edge)² (by convexity of the square function).
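  A short derivation of this property, assuming the pseudo gain has the Gini-style form of a prediction-weighted sum of squared conditional edges; under that assumption the bound is just Jensen's inequality applied to the convex function z ↦ z².

\[
  \tilde{\gamma}(h) \;=\; \sum_{j\in\{-1,+1\}} d_j\,\gamma_j^{2}
  \;\ge\; \Bigl(\sum_{j\in\{-1,+1\}} d_j\,\gamma_j\Bigr)^{2}
  \;=\; \gamma(h)^{2},
  \qquad
  d_j=\Pr_{D_t}[h(x)=j],\quad
  \gamma_j=\mathbb{E}_{D_t}[\,j\,y \mid h(x)=j\,],
\]

  since the dj sum to 1 and γ(h) = Σj dj γj is the usual edge.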

  18. Interpretation of pseudo gain
  • Maximizing the pseudo gain ⇔ minimizing over h the conditional entropy of the labels given h ⇔ maximizing over h the mutual information between h and the labels.
  • But the entropy function here is NOT defined with Shannon’s entropy; it is defined with the Gini index.

  19. Information-based criteria [Kearns & Mansour 98]
  • Our booster chooses a prediction rule maximizing the mutual information defined by the Gini index (GiniBoost).
  • Cf. Real AdaBoost and InfoBoost choose a prediction rule that maximizes the mutual information defined with the KM entropy.
  • Good news: the Gini index can be estimated efficiently via sampling!
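  A minimal sketch of estimating a Gini-index-based gain of a candidate rule on a weighted sample; this is the generic Gini-impurity-reduction criterion familiar from decision-tree learning, and the paper's exact pseudo gain may differ in scaling.

def gini(p):
    """Gini impurity of a label distribution with positive fraction p (max 0.5 at p = 0.5)."""
    return 2.0 * p * (1.0 - p)

def gini_gain(h, X, y, D):
    """Impurity reduction from splitting the weighted sample (D sums to 1) on h's prediction."""
    p_all = sum(D[i] for i in range(len(y)) if y[i] == 1)
    gain = gini(p_all)
    for j in (-1, +1):
        idx = [i for i in range(len(y)) if h(X[i]) == j]
        d_j = sum(D[i] for i in idx)
        if d_j > 0:
            p_j = sum(D[i] for i in idx if y[i] == 1) / d_j
            gain -= d_j * gini(p_j)
    return gain                                # larger gain = more informative rule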

  20. Convergence of training error (GiniBoost)
  Thm. Suppose that (training error of Ht) > ε for t = 1,…,T. Then the cumulative pseudo gain over the T rounds is bounded by O(1/ε).
  Coro. Further, if the pseudo gain of ht is at least γ for each t, then train.err(HT) ≤ ε within T = O(1/(εγ)) steps.

  21. Comparison on convergence speed
  (Table comparing iteration bounds; legend: minimum pseudo gain vs. minimum edge.)

  22. Boosting-by-filtering version of GiniBoost (outline)
  • Multiplicative bounds for the pseudo gain (and more practical bounds using the central limit approximation).
  • Adaptive prediction rule selector (sketched below).
  • A boosting algorithm in the PAC learning sense.
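  A hedged sketch of an adaptive prediction rule selector in the filtering setting: keep drawing filtered examples and doubling the sample until the empirically best rule's estimated gain clears a Hoeffding-style confidence margin. The stopping rule and constants are generic illustrations, not the paper's multiplicative bounds.

import math

def adaptive_select(draw_filtered_example, rules, gain_on_sample, delta=0.05):
    """Select a rule whose estimated gain is reliably positive, growing the sample adaptively."""
    sample, m = [], 64
    while True:
        while len(sample) < m:
            sample.append(draw_filtered_example())       # (x, y) drawn via filtering
        eps = math.sqrt(math.log(2 * len(rules) / delta) / (2 * m))
        scores = {h: gain_on_sample(h, sample) for h in rules}
        best = max(rules, key=scores.get)
        if scores[best] - eps > 0:                        # estimate is reliably positive
            return best
        m *= 2                                            # otherwise, draw more examples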

  23. Introduction • Preliminaries • Our booster • Experiments • Summary

  24. Experiments
  • Topic classification of Reuters news (Reuters-21578).
  • Binary classification for each of 5 topics (results are averaged).
  • 10,000 examples.
  • 30,000 words used as base prediction rules.
  • Algorithms are run until they sample 1,000,000 examples in total.
  • 10-fold CV.

  25. Test error over Reuters
  Note: GiniBoost2 doubles the coefficients αt[+1], αt[-1] used in GiniBoost.

  26. Execution time
  About 4 times faster! (Cf. a similar result without sampling: Real AdaBoost [Schapire & Singer 99].)

  27. Introduction • Preliminaries • Our booster • Experiments • Summary

  28. Summary / Open problem
  Summary: GiniBoost
  • uses the pseudo gain (Gini index) to choose base prediction rules.
  • shows faster convergence in the filtering scheme.
  Open problem
  • Theoretical analysis of noise tolerance.

  29. Comparison on sample size
  Observation: fewer accepted examples → faster selection of prediction rules.
