
An Introduction to Boosting

Yoav Freund

Banter Inc.

Plan of talk
• Generative vs. non-generative modeling
• Boosting
• Alternating decision trees
• Boosting and over-fitting
• Applications

Toy Example
• Computer receives a telephone call
• Measures the pitch of the voice
• Decides the gender of the caller

[Figure: a human voice is classified as male or female]

Generative modeling

[Figure: probability density vs. voice pitch; two Gaussians, one per class, with means mean1, mean2 and variances var1, var2]
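Why this is called generative: each class gets its own density model, and a new call is labeled by comparing likelihoods. A minimal sketch of that idea under the slide's two-Gaussian assumption (illustrative, not the speaker's code; numpy and scipy assumed):

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian(pitches):
    """Fit one Gaussian per class: just the sample mean and std."""
    return np.mean(pitches), np.std(pitches)

def classify(pitch, male_params, female_params):
    pm = norm.pdf(pitch, *male_params)     # likelihood under the male model
    pf = norm.pdf(pitch, *female_params)   # likelihood under the female model
    return "male" if pm > pf else "female"
```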

Ill-behaved data

[Figure: probability and number of mistakes vs. voice pitch for data that the two-Gaussian model fits poorly; the fitted means mean1, mean2 are shown]

Machine Learning

[Diagram relating the fields: Statistics takes Data to an estimated world state; Decision Theory takes the estimated world state to Actions; Machine Learning takes Data directly to Predictions]

A weighted training set

(x1,y1,w1), (x2,y2,w2), …, (xn,yn,wn)
• instances (feature vectors): x1, x2, x3, …, xn
• binary labels: y1, y2, y3, …, yn
• non-negative weights that sum to 1: w1, w2, …, wn

A weak learner

A weak learner takes a weighted training set (x1,y1,w1), …, (xn,yn,wn) and outputs a weak rule h mapping instances to {-1,+1}.

The weak requirement: the rule must do slightly better than random guessing on the weighted set, i.e. its advantage γ = Σi wi yi h(xi) must be positive.
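As a quick check of that requirement, here is a minimal sketch (my notation, matching the slide) that computes the advantage of a candidate rule on a weighted sample:

```python
import numpy as np

def advantage(h, X, y, w):
    """gamma = sum_i w_i * y_i * h(x_i); the weak requirement is gamma > 0."""
    preds = np.array([h(x) for x in X])   # each h(x) is -1 or +1
    return np.sum(w * y * preds)
```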

The boosting process

(x1,y1,1/n), …, (xn,yn,1/n) → weak learner → h1
(x1,y1,w1), …, (xn,yn,wn) → weak learner → h2
…
reweighting and rerunning the weak learner yields h3, h4, …, hT

Final rule: sign[ α1 h1(x) + α2 h2(x) + … + αT hT(x) ]

Adaboost
• Binary labels y = -1,+1
• margin(x,y) = y · [Σt αt ht(x)]
• P(x,y) = (1/Z) exp(-margin(x,y))
• Given ht, we choose αt to minimize Σ(x,y) exp(-margin(x,y))
• If the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the in-sample error of the final rule (w.r.t. the initial weights) is at most exp(-2 Σt γt²)
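To make the update and combination rules concrete, here is a minimal Adaboost sketch with decision stumps as the weak rules. It is an illustration of the algorithm on the slide, not the speaker's code; numpy is assumed:

```python
import numpy as np

def train_stump(X, y, w):
    """Exhaustive search over features, thresholds, and signs for the
    stump with the smallest weighted training error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] > thr, s, -s)
                err = w[pred != y].sum()
                if best is None or err < best[3]:
                    best = (j, thr, s, err)
    return best

def adaboost(X, y, T=20):
    """y in {-1,+1}. Returns a list of (alpha, stump) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # start with uniform weights
    ensemble = []
    for _ in range(T):
        j, thr, s, err = train_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # minimizes the exponential loss
        pred = np.where(X[:, j] > thr, s, -s)
        w = w * np.exp(-alpha * y * pred)       # mistakes get heavier weights
        w = w / w.sum()                         # keep the weights summing to 1
        ensemble.append((alpha, (j, thr, s)))
    return ensemble

def predict(ensemble, X):
    """The final rule: sign of the weighted vote of the weak rules."""
    score = np.zeros(len(X))
    for alpha, (j, thr, s) in ensemble:
        score += alpha * np.where(X[:, j] > thr, s, -s)
    return np.sign(score)
```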

A demo
• Discriminator class: a linear discriminator in the space of “weak hypotheses”
• Original goal: find the hyperplane with the smallest number of mistakes
• Known to be NP-hard (no algorithm is known that runs in time polynomial in d, the dimension of the space)
• Computational method: use the exponential loss as a surrogate and perform gradient descent.

Margins view
• Prediction = sign(Σt αt ht(x))
• Margin = y · Σt αt ht(x)

[Figure: cumulative number of examples as a function of margin; examples with negative margin are mistakes, examples with positive margin are correct]
[Figure: loss as a function of margin for the 0-1 loss and the surrogates used by Adaboost, Logitboost, and Brownboost; the mistake region (margin < 0) and correct region (margin > 0) are marked]
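For reference, the curves in that figure can be written down directly. A short sketch of the standard formulas (my rendering; Brownboost's time-dependent loss is omitted because it has no fixed closed form here):

```python
import numpy as np

# Losses as functions of the margin m = y * f(x), per the figure above.
def zero_one_loss(m):          # 1 on a mistake, 0 when correct
    return (np.asarray(m) <= 0).astype(float)

def exp_loss(m):               # Adaboost's exponential surrogate
    return np.exp(-np.asarray(m))

def logistic_loss(m):          # Logitboost's logistic surrogate (up to scaling)
    return np.log(1.0 + np.exp(-np.asarray(m)))
```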

One coordinate at a time
• Adds one coordinate (“weak learner”) at each iteration.
• Weak learning in binary classification = slightly better than random guessing.
• Weak learning in regression: much less clear-cut.
• Uses example weights to communicate the gradient direction to the weak learner (see the sketch after this list).
• Solves a computational problem.
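The weights-as-gradient point deserves a line of code: for the exponential loss, the weight of an example is exactly the magnitude of the loss gradient at that example, so handing the weak learner the weights hands it the direction of steepest descent. A minimal sketch in my notation:

```python
import numpy as np

def example_weights(scores, y):
    """For L = sum_i exp(-y_i f(x_i)), dL/df(x_i) = -y_i exp(-margin_i),
    so the normalized weight of example i is exp(-margin_i) / Z."""
    w = np.exp(-y * scores)
    return w / w.sum()
```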
What is a good weak learner?
• The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
• Small enough to allow exhaustive search for the minimal weighted training error.
• Small enough to avoid over-fitting.
• Should be able to calculate predicted label very efficiently.
• Rules can be “specialists” – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).

Alternating Trees

Joint work with Llew Mason

Decision Trees

[Figure: a decision tree over two features X and Y. Root test X>3: if no, predict -1; if yes, test Y>5: if no, predict -1, if yes, predict +1. Shown next to the corresponding axis-parallel partition of the plane at X=3 and Y=5.]

Decision tree as a sum

[Figure: the same tree rewritten as a sum of real-valued prediction nodes (values such as +0.2, -0.1, +0.1, -0.3 appear at the nodes); an example's score is the sum of the values along its path, and the output is the sign of that score.]

An alternating decision tree

[Figure: an alternating decision tree for the same data. A root prediction node (+0.7) is followed by splitter nodes X>3 and Y<1, each with its own prediction nodes (values such as +0.1, -0.1, +0.2, -0.3, 0.0); an example adds up every prediction node it reaches, and the output is the sign of the total.]
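To make the "sum of prediction nodes" semantics concrete, here is a sketch of evaluating a small alternating decision tree. The node values are patterned loosely on the figure and are assumptions, not the slide's exact tree:

```python
def adt_score(x):
    """x is a dict of feature values, e.g. {'X': 4.0, 'Y': 0.5}.
    Every prediction node the example reaches contributes to the score."""
    score = 0.7                          # root prediction node (assumed value)
    if x['X'] > 3:                       # splitter node 1
        score += 0.1
        score += 0.2 if x['Y'] > 5 else -0.3
    else:
        score += -0.1
    if x['Y'] < 1:                       # splitter node 2: ADT branches run in parallel
        score += 0.1
    else:
        score += -0.1
    return score

def adt_predict(x):
    """The output is the sign of the accumulated score."""
    return 1 if adt_score(x) >= 0 else -1
```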

Example: Medical Diagnostics
• Cleve dataset from the UC Irvine repository.
• Heart-disease diagnostics (+1 = healthy, -1 = sick)
• 13 features from tests (real-valued and discrete).
• 303 instances.
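As a point of reference, an experiment of this shape is only a few lines today. A sketch assuming scikit-learn, with synthetic stand-in data in place of the actual Cleve table:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in shaped like the Cleve data: 303 instances, 13 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(303, 13))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=303) > 0, 1, -1)

clf = AdaBoostClassifier(n_estimators=100)       # decision stumps by default
print(cross_val_score(clf, X, y, cv=5).mean())   # 5-fold cross-validated accuracy
```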
Curious phenomenon

Boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters

Theorem (Schapire, Freund, Bartlett & Lee; Annals of Statistics '98)

For any convex combination of weak rules and any threshold θ > 0:

Probability of mistake ≤ (fraction of training examples with margin ≤ θ) + Õ( √( d / (m θ²) ) )

where m is the size of the training sample and d is the VC dimension of the weak rules. There is no dependence on the number of weak rules that are combined!
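In standard notation (my rendering of the bound, matching the labels on the slide):

```latex
\[
\Pr_{D}\!\left[\, y\,f(x) \le 0 \,\right]
\;\le\;
\Pr_{S}\!\left[\, y\,f(x) \le \theta \,\right]
\;+\;
\tilde{O}\!\left( \sqrt{ \frac{d}{m\,\theta^{2}} } \right)
\]
% D: true distribution, S: training sample, f: convex combination of weak rules,
% d: VC dimension of the weak rules, m: size of the training sample.
```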

Applications of Boosting
• Applied research
• Commercial deployment

Applied research (Schapire, Singer & Gorin '98)
• Classify voice requests
• Voice → text → category
• Fourteen categories:
  Area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time

[Figure: % test error rates]

Examples
• “Yes I’d like to place a collect call long distance please” → collect
• “Operator I need to make a call but I need to bill it to my office” → third party
• “Yes I’d like to place a call on my master card please” → calling card
• “I just called a number in Sioux City and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill” → billing credit

Weak rules generated by “boostexter”

[Table: for each category (calling card, collect call, third party) a weak rule tests whether a particular word occurs in the utterance; the rule outputs one prediction if the word occurs and another if it does not]
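The shape of such a rule is easy to sketch in code. This is illustrative, not the actual boostexter implementation, and the word and scores below are hypothetical (the learned values are not shown in the transcript):

```python
def make_word_rule(word, score_if_present, score_if_absent):
    """A boostexter-style weak rule keyed on a single word."""
    def rule(utterance):
        tokens = utterance.lower().split()
        return score_if_present if word in tokens else score_if_absent
    return rule

# Hypothetical rule for the "collect" category.
collect_rule = make_word_rule("collect", +0.9, -0.2)
print(collect_rule("Yes I'd like to place a collect call long distance please"))
```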

Results
• 7844 training examples (hand transcribed)
• 1000 test examples (hand / machine transcribed)
• Accuracy with 20% rejected:
  • Machine transcribed: 75%
  • Hand transcribed: 90%

Freund, Mason, Rogers, Pregibon, Cortes 2000

Commercial deployment
• Uses statistics from call-detail records
• Alternating decision trees
• Similar to boosting decision trees, but more flexible
• Combines very simple rules
• Can over-fit; cross-validation is used to decide when to stop
Massive datasets
• 260M calls / day
• 230M telephone numbers
• Label unknown for ~30%
• Hancock: software for computing statistical signatures
• 100K randomly selected training examples (~10K is enough)
• Training takes about 2 hours
• Generated classifier has to be both accurate and efficient