
### Why does averaging work?

Yoav Freund, AT&T Labs

Plan of talk
• Generative vs. non-generative modeling
• Boosting
• Boosting and over-fitting
• Bagging and over-fitting
• Applications

Toy Example
• Computer receives a telephone call
• Measures the pitch of the voice
• Decides the gender of the caller

[Diagram: a human voice reaches the computer, which outputs Male or Female.]

Generative modeling

[Figure: probability vs. voice pitch, with two Gaussian class densities parameterized by (mean1, var1) and (mean2, var2).]

Ill-behaved data

[Figure: probability vs. voice pitch for ill-behaved data, with fitted means mean1 and mean2 and the resulting number of mistakes.]

[Diagram: the traditional route goes Data → Statistics → Estimated world state → Decision Theory → Actions; Machine Learning maps Data directly to Predictions and Actions.]
Another example
• Two-dimensional data (example: pitch, volume)
• Generative model: logistic regression
• Discriminative model: separating line ("perceptron")

Logistic regression: model P(y = +1 | x) = 1 / (1 + e^(-w·x)); find w to maximize the likelihood of the training labels.

Perceptron: model ŷ = sign(w·x); find w to maximize the number of correctly classified training examples.
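To make the contrast concrete, here is a minimal sketch of both fits on synthetic two-dimensional (pitch, volume) data, assuming scikit-learn is available; the data generation and class parameters are illustrative choices, not from the talk.

```python
# Sketch: likelihood-maximizing vs. mistake-driven fit on 2-D (pitch, volume) data.
# Assumes scikit-learn; the synthetic Gaussian classes are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression, Perceptron

rng = np.random.default_rng(0)

# Two Gaussian classes in (pitch, volume) space.
X_male = rng.normal(loc=[110.0, 60.0], scale=[20.0, 10.0], size=(200, 2))
X_female = rng.normal(loc=[210.0, 55.0], scale=[25.0, 10.0], size=(200, 2))
X = np.vstack([X_male, X_female])
y = np.array([0] * 200 + [1] * 200)

# Logistic regression: fit w to maximize the likelihood of the labels.
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# Perceptron: directly search for a separating line w.
perc = Perceptron(max_iter=1000).fit(X, y)

print("logistic regression training accuracy:", logreg.score(X, y))
print("perceptron training accuracy:        ", perc.score(X, y))
```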

• Discriminator class: a linear discriminator in the space of "weak hypotheses"
• Original goal: find the hyperplane with the smallest number of mistakes
• Known to be NP-hard (no polynomial-time algorithm in d, the dimension of the space, is expected unless P = NP)
• Computational method: use the exponential loss as a surrogate and perform gradient descent
• Unforeseen benefit: out-of-sample error improves even after in-sample error reaches zero
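The "unforeseen benefit" is easy to observe. Below is a small experiment, assuming scikit-learn's AdaBoostClassifier as the exponential-loss learner; the dataset and round count are illustrative, and with the default stump weak learners the training error may take many rounds to reach zero.

```python
# Sketch: out-of-sample error can keep improving after in-sample error
# stops improving. Assumes scikit-learn; dataset and sizes are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = AdaBoostClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# staged_score reports accuracy after each boosting round.
staged = zip(model.staged_score(X_tr, y_tr), model.staged_score(X_te, y_te))
for t, (acc_tr, acc_te) in enumerate(staged, start=1):
    if t % 50 == 0:
        print(f"round {t:3d}: train error {1 - acc_tr:.3f}, "
              f"test error {1 - acc_te:.3f}")
```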

Margins view

Prediction(x) = sign( Σ_h α_h h(x) )

Margin(x, y) = y · Σ_h α_h h(x) / Σ_h |α_h|, a number in [-1, +1]: negative on mistakes, positive on correct predictions.

[Figure: left, a weight vector w separating + and - examples, with the mistakes on the wrong side of the line; right, the cumulative number of examples as a function of margin, split into a "mistakes" region (margin < 0) and a "correct" region (margin > 0).]
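A short sketch of the empirical margin distribution, assuming scikit-learn's convention that AdaBoost's decision_function is normalized by the total weak-rule weight (so y · decision_function(x) lies in [-1, +1]); the dataset and sizes are illustrative.

```python
# Sketch: empirical margin distribution of a boosted classifier.
# Assumes scikit-learn's AdaBoost decision_function is normalized by the
# total weak-rule weight, so y * decision_function(x) lies in [-1, +1].
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, random_state=1)
model = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X, y)

# Map labels {0, 1} to {-1, +1}; a negative margin means a mistake.
margins = (2 * y - 1) * model.decision_function(X)
for theta in (0.0, 0.1, 0.25):
    print(f"fraction of training examples with margin <= {theta}: "
          f"{np.mean(margins <= theta):.3f}")
```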

[Figure: loss as a function of margin, comparing the 0-1 loss (1 on mistakes, margin < 0; 0 on correct predictions, margin > 0) with the smooth surrogate losses used by Logitboost and Brownboost.]

One coordinate at a time
• Adds one coordinate ("weak learner") at each iteration
• Weak learning in binary classification = slightly better than random guessing
• Weak learning in regression remains unclear
• Uses example weights to communicate the gradient direction to the weak learner (see the sketch below)
• Solves a computational problem
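Here is the sketch referenced above: a bare-bones AdaBoost loop with threshold "stumps" as weak learners, showing how the example weights are updated each round. This is a standard AdaBoost skeleton written for illustration, not code from the talk.

```python
# Sketch: AdaBoost's reweighting loop with threshold stumps as weak learners.
# Standard AdaBoost skeleton (not from the talk); labels are in {-1, +1}.
import numpy as np

def best_stump(X, y, w):
    """Weak learner: pick the (feature, threshold, sign) with least weighted error."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, (j, thr, sign))
    return best

def adaboost(X, y, rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)              # example weights: the "gradient" signal
    ensemble = []
    for _ in range(rounds):
        err, (j, thr, sign) = best_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)   # up-weight examples this rule got wrong
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)

# Tiny demo on separable synthetic data.
X = np.random.default_rng(0).normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
ens = adaboost(X, y)
print("training accuracy:", np.mean(predict(ens, X) == y))
```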
Curious phenomenon

Boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters

Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics, 1998)

For any convex combination of weak rules and any threshold θ > 0:

Probability of mistake ≤ (fraction of training examples with margin ≤ θ) + Õ( √( d / (m θ²) ) )

where m is the size of the training sample and d is the VC dimension of the weak rules.

No dependence on the number of weak rules that are combined!

A metric space of classifiers

Define the distance between classifiers as d(f, g) = P( f(x) ≠ g(x) ).

[Figure: classifier space on one side, example space on the other; neighboring models make similar predictions.]
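A tiny illustration of this metric, estimated by sampling; the two threshold classifiers below are arbitrary stand-ins chosen to be neighbors.

```python
# Sketch: estimate the distance d(f, g) = P(f(x) != g(x)) by sampling.
# f and g are arbitrary stand-in classifiers chosen to be "neighbors".
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

f = lambda v: np.sign(v)          # threshold at 0.0
g = lambda v: np.sign(v - 0.1)    # a neighboring threshold at 0.1

print("estimated d(f, g):", np.mean(f(x) != g(x)))  # small: similar predictions
```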

Bagged variants

[Figure: number of mistakes vs. voice pitch for the bagged variants and their majority vote; confidence zones mark the pitch ranges where the vote is "certainly" one class or the other.]
Bagging
• Averaging over all good models increases stability (decreases dependence on the particular sample)
• If there is a clear majority, the prediction is very reliable (unlike Florida)
• Bagging is a randomized algorithm for sampling the neighborhood of the best model (see the sketch below)
• Ignoring computation, averaging of the best models can be done deterministically and yields provable bounds
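Here is the sketch referenced above: bagging by bootstrap resampling, with a majority vote that abstains when the vote is close, in the spirit of the confidence zones. The 25 replicates and the 75% vote threshold are my illustrative choices, not from the talk.

```python
# Sketch: bagging with a majority vote that abstains when the vote is close.
# The 25 replicates and 75% vote threshold are illustrative, not from the talk.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(25):                        # 25 bootstrap replicates
    idx = rng.integers(0, len(X), len(X))  # sample with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.mean([m.predict(X) for m in models], axis=0)  # fraction voting "1"
pred = np.where(votes >= 0.75, 1, np.where(votes <= 0.25, 0, -1))  # -1 = abstain
conf = pred != -1
print("abstention rate:", np.mean(~conf))
print("accuracy on confident predictions:", np.mean(pred[conf] == y[conf]))
```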

A pseudo-Bayesian method

Freund, Mansour, Schapire 2000
• Define a prior distribution p(h) over rules
• Get m training examples (x_1, y_1), …, (x_m, y_m)
• For each h: err(h) = (no. of mistakes h makes) / m
• Weights: w(h) ∝ p(h) · e^(-η · m · err(h)), for a scale parameter η
• Log odds: l(x) = log( Σ_{h: h(x)=+1} w(h) / Σ_{h: h(x)=-1} w(h) )
• Prediction(x) = +1 if l(x) > +Δ; -1 if l(x) < -Δ; else 0 ("insufficient data")
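A sketch of the method on a finite class of one-dimensional threshold rules. The exponential-weights form follows the reconstruction above; the scale η and the threshold Δ are illustrative choices, and the real method is specified in Freund, Mansour, Schapire 2000.

```python
# Sketch: pseudo-Bayesian averaging over a finite class of threshold rules.
# The exponential-weights form, scale eta, and threshold delta are assumptions
# made for illustration; see Freund, Mansour, Schapire 2000 for the method.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.where(x > 0.1, 1, -1)            # true rule: threshold at 0.1
y[rng.random(200) < 0.1] *= -1          # 10% label noise

thresholds = np.linspace(-1, 1, 101)    # finite rule class: h_t(x) = sign(x - t)
prior = np.full(len(thresholds), 1.0 / len(thresholds))

m = len(x)
errs = np.array([np.mean(np.sign(x - t) != y) for t in thresholds])
eta = 1.0                               # illustrative scale
w = prior * np.exp(-eta * m * errs)
w /= w.sum()

def predict(x_new, delta=1.0):
    votes_pos = w[np.sign(x_new - thresholds) > 0].sum()
    votes_neg = w[np.sign(x_new - thresholds) < 0].sum()
    l = np.log(votes_pos / votes_neg)   # log odds of the weighted vote
    if l > delta:
        return +1
    if l < -delta:
        return -1
    return 0                            # insufficient data: abstain

print([predict(v) for v in (-0.5, 0.05, 0.5)])
```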

Applications of Boosting
• Applied research
• Commercial deployment

Applied research

Schapire, Singer, Gorin 98
• Classify voice requests
• Voice → text → category
• Fourteen categories: area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time

[Figure: % test error rates.]

Examples
• "Yes I'd like to place a collect call long distance please" → collect
• "Operator I need to make a call but I need to bill it to my office" → third party
• "Yes I'd like to place a call on my master card please" → calling card
• "I just called a number in Sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill" → billing credit

Weak rules generated by "boostexter"

[Table: one weak rule per category (collect call, calling card, third party); each rule outputs one prediction weight if its word occurs in the utterance and another if the word does not occur.]
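A sketch of what one such weak rule can look like: a single-word test with confidence-rated outputs for the "occurs" and "does not occur" branches, in the style of BoosTexter's rules. The word, toy data, and smoothing constant are my illustrative choices.

```python
# Sketch: a BoosTexter-style weak rule. One word is tested; the rule outputs
# one confidence-rated score if the word occurs and another if it does not.
# The word, toy data, and smoothing constant are illustrative.
import numpy as np

docs = ["yes i d like to place a collect call",
        "i need to bill it to my office",
        "place a call on my master card",
        "collect call long distance please"]
y = np.array([+1, -1, -1, +1])            # +1 = "collect", -1 = other
w = np.full(len(docs), 1.0 / len(docs))   # AdaBoost example weights

def word_rule(word, docs, y, w, eps=1e-4):
    """Confidence c_b = 0.5*ln(W_b^+ / W_b^-) for each branch b (occurs / absent)."""
    occurs = np.array([word in d.split() for d in docs])
    c = {}
    for b, mask in (("occurs", occurs), ("absent", ~occurs)):
        w_pos = w[mask & (y > 0)].sum() + eps   # smoothed weighted counts
        w_neg = w[mask & (y < 0)].sum() + eps
        c[b] = 0.5 * np.log(w_pos / w_neg)
    return lambda d: c["occurs"] if word in d.split() else c["absent"]

rule = word_rule("collect", docs, y, w)
print(rule("i want to make a collect call"))   # positive score
print(rule("bill this to my office"))          # negative score
```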

Results
• 7844 training examples (hand transcribed)
• 1000 test examples (hand / machine transcribed)
• Accuracy with 20% rejected:
  • Machine transcribed: 75%
  • Hand transcribed: 90%

Freund, Mason, Rogers, Pregibon, Cortes 2000

Commercial deployment
• Using statistics from call-detail records
• Alternating decision trees
• Similar to boosted decision trees, but more flexible
• Combines very simple rules
• Can over-fit; cross-validation is used to stop
Massive datasets
• 260M calls / day
• 230M telephone numbers
• Label unknown for ~30%
• Hancock: software for computing statistical signatures
• 100K randomly selected training examples; ~10K is enough
• Training takes about 2 hours
• Generated classifier has to be both accurate and efficient
• Increased coverage from 44% to 56%
• Accuracy ~94%
• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities
Summary
• Boosting is a computational method for learning accurate classifiers
• Resistance to over-fitting is explained by margins
• Underlying explanation: large "neighborhoods" of good classifiers
• Boosting has been applied successfully to a variety of classification problems
• Questions welcome!
• Data, even better!!
• Potential collaborations, yes!!!
• Yoav@research.att.com
• www.predictiontools.com