
Learning and Vision: Discriminative Models

Chris Bishop and Paul Viola



Part II: Algorithms and Applications

  • Part I: Fundamentals

  • Part II: Algorithms and Applications

  • Support Vector Machines

    • Face and pedestrian detection

  • AdaBoost

    • Faces

  • Building Fast Classifiers

    • Trading off speed for accuracy…

    • Face and object detection

  • Memory Based Learning

    • Simard

    • Moghaddam



History Lesson

  • 1950’s Perceptrons are cool

    • Very simple learning rule, can learn “complex” concepts

    • Generalized perceptrons are better -- too many weights

  • 1960’s Perceptrons stink (Minsky & Papert)

    • Some simple concepts require an exponential # of features

      • Can’t possibly learn that, right?

  • 1980’s MLPs are cool (Rumelhart & McClelland / PDP)

    • Sort of simple learning rule, can learn anything (?)

    • Create just the features you need

  • 1990 MLPs stink

    • Hard to train: slow / local minima

  • 1996 Perceptrons are cool



Why did we need multi-layer perceptrons?

  • Problems like this seem to require very complex non-linearities.

  • Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.



Why an exponential number of features?

[Figure: a problem whose decision boundary appears to need a 14th-order polynomial: 120 features.]

N=21, k=5 --> 65,000 features
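As a sanity check on that count (assuming, on my part, that it refers to all monomials of degree at most k in N inputs, i.e. C(N+k, k) terms), a two-line computation reproduces the figure:

```python
# Counting all monomials of degree <= k in N variables: C(N + k, k).
from math import comb

print(comb(21 + 5, 5))   # 65780 -- the "65,000 features" quoted above, rounded
```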



MLP’s vs. Perceptron

  • MLP’s are hard to train…

    • Takes a long time (unpredictably long)

    • Can converge to poor minima

  • MLPs are hard to understand

    • What are they really doing?

  • Perceptrons are easy to train…

    • Type of linear programming. Polynomial time.

    • One minimum which is global.

  • Generalized perceptrons are easier to understand.

    • Polynomial functions.



Perceptron Training is Linear Programming

Polynomial time in the number of variables and in the number of constraints.

What about linearly inseparable data?
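To make the claim concrete, here is a minimal sketch (toy data and variable names of my own choosing, not from the slides) that poses perceptron training as a linear feasibility problem and hands it to an off-the-shelf LP solver:

```python
# Separating a linearly separable 2-D toy set by linear programming.
# Variables are [w1, w2, b]; the constraints are y_i * (w . x_i + b) >= 1.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# linprog solves min c.x subject to A_ub @ x <= b_ub, so flip the signs:
# -y_i * (w . x_i + b) <= -1
A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
b_ub = -np.ones(len(X))
c = np.zeros(3)                  # pure feasibility: any feasible point will do

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3, method="highs")
w, b = res.x[:2], res.x[2]
print("separating hyperplane:", w, b)
```

With a zero objective the solver only has to find some feasible (w, b); the slack variables and weight prior discussed on a later slide turn this into a better-behaved program.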



Rebirth of Perceptrons: Support Vector Machines

  • How to train effectively

    • Linear Programming (… later quadratic programming)

    • Though on-line works great too.

  • How to get so many features inexpensively?!?

    • Kernel Trick

  • How to generalize with so many features?

    • VC dimension. (Or is it regularization?)



Lemma 1: Weight vectors are simple

  • The weight vector lives in a sub-space spanned by the examples…

    • Dimensionality is determined by the number of examples, not the complexity of the space.



Lemma 2: Only need to compare examples
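Both lemmas fit in a few lines of code. A minimal sketch (my own illustration, not the authors' implementation) of a kernel perceptron: the weight vector is never formed explicitly, only one dual coefficient per training example is kept (Lemma 1), and the data are touched only through kernel evaluations between pairs of examples (Lemma 2):

```python
# Kernel perceptron: w lives in the span of the examples, so we store alpha_i
# instead of w and compare examples only through a kernel function.
import numpy as np

def poly_kernel(a, b, degree=3):
    return (a @ b + 1.0) ** degree        # implicit high-order feature space

def train_kernel_perceptron(X, y, epochs=20, degree=3):
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            f = sum(alpha[j] * y[j] * poly_kernel(X[j], X[i], degree)
                    for j in range(n))
            if y[i] * f <= 0:             # mistake-driven update
                alpha[i] += 1.0
    return alpha

def kernel_perceptron_predict(alpha, X, y, x_new, degree=3):
    f = sum(alpha[j] * y[j] * poly_kernel(X[j], x_new, degree)
            for j in range(len(X)))
    return np.sign(f)
```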



Simple Kernels yield Complex Features



But Kernel Perceptrons Can Generalize Poorly



Perceptron Rebirth: Generalization

  • Too many features … Occam is unhappy

    • Perhaps we should encourage smoothness?

[Figure: the smoother of two separating boundaries]



Linear Program is not unique

The linear program can return any positive multiple of a correct weight vector...

Slack variables & weight prior: force the solution toward zero.



Definition of the Margin

  • Geometric Margin: Gap between negatives and positives measured perpendicular to a hyperplane

  • Classifier Margin



Require non-zero margin

One formulation allows solutions with zero margin; the other enforces a non-zero margin between the examples and the decision boundary.



Constrained Optimization

  • Find the smoothest function that separates data

    • Quadratic Programming (similar to Linear Programming)

      • A single, global minimum

      • Polynomial Time algorithm



Constrained Optimization 2



SVM: examples



SVM: Key Ideas

  • Augment inputs with a very large feature set

    • Polynomials, etc.

  • Use Kernel Trick(TM) to do this efficiently

  • Enforce/Encourage Smoothness with weight penalty

  • Introduce Margin

  • Find best solution using Quadratic Programming
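A hedged sketch of these ideas with an off-the-shelf QP-based SVM (scikit-learn's SVC on a toy dataset of my own choosing, not the experiments from the slides): the polynomial kernel supplies the large implicit feature set, and C trades margin against slack:

```python
# Soft-margin SVM with a polynomial kernel on a toy 2-D problem.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="poly", degree=4, C=1.0)   # kernel trick + margin penalty C
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
print("support vectors per class:", clf.n_support_)  # only these define the solution
```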



SVM: Zip Code recognition

  • Data dimension: 256

  • Feature Space: 4th order

    • roughly 100,000,000 dims



The Classical Face Detection Process

[Figure: a detector window is scanned over the image at every location and scale, from the smallest scale up to larger scales: roughly 50,000 locations/scales per image.]



Classifier is Learned from Labeled Data

  • Training Data

    • 5000 faces

      • All frontal

    • 10^8 non-faces

    • Faces are normalized

      • Scale, translation

  • Many variations

    • Across individuals

    • Illumination

    • Pose (rotation both in plane and out)



Key Properties of Face Detection

  • Each image contains 10 - 50 thousand locs/scales

  • Faces are rare 0 - 50 per image

    • 1000 times as many non-faces as faces

  • Extremely small # of false positives: 10^-6



Sung and Poggio



Rowley, Baluja & Kanade

First fast system: low resolution to high resolution



Osuna, Freund, and Girosi



Support Vectors



P, O, & G: First Pedestrian Work



On to AdaBoost

  • Given a set of weak classifiers

    • None much better than random

  • Iteratively combine classifiers

    • Form a linear combination

    • Training error converges to 0 quickly

    • Test error is related to training margin
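A minimal sketch of that procedure (my own illustration, with decision stumps standing in for the weak classifiers; not the authors' code):

```python
# AdaBoost with decision stumps: reweight examples, pick the lowest-weighted-error
# stump each round, and combine the stumps linearly.
import numpy as np

def stump_predict(X, j, thr, s):
    return s * np.where(X[:, j] > thr, 1.0, -1.0)

def train_adaboost(X, y, n_rounds=50):
    """X: (n, d) features; y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                      # example weights
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):                       # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for s in (1.0, -1.0):
                    err = w[stump_predict(X, j, thr, s) != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * stump_predict(X, j, thr, s))   # boost the mistakes
        w /= w.sum()
        ensemble.append((alpha, j, thr, s))
    return ensemble

def adaboost_predict(ensemble, X):
    f = sum(alpha * stump_predict(X, j, thr, s) for alpha, j, thr, s in ensemble)
    return np.sign(f)
```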



AdaBoost (Freund & Schapire)

[Diagram: weak classifier 1 is trained; the weights of misclassified examples are increased; weak classifier 2 is trained on the reweighted data, then weak classifier 3, and so on. The final classifier is a linear combination of the weak classifiers.]



AdaBoost Properties



AdaBoost: Super Efficient Feature Selector

  • Features = Weak Classifiers

  • Each round selects the optimal feature given:

    • Previously selected features

    • Exponential Loss



Boosted Face Detection: Image Features

“Rectangle filters”, similar to Haar wavelets (Papageorgiou et al.): thresholding them gives unique binary features.
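The slides do not show how these filters are computed here; one standard trick (the integral image used by Viola & Jones) makes any rectangle sum cost four array look-ups. A small sketch with toy coordinates of my own choosing:

```python
# Rectangle features via an integral image (summed-area table).
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; the extra zero row/column simplifies indexing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of the h-by-w rectangle whose top-left pixel is (y, x)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, y, x, h, w):
    """Horizontal two-rectangle filter: left half minus right half."""
    return rect_sum(ii, y, x, h, w // 2) - rect_sum(ii, y, x + w // 2, h, w // 2)

window = np.random.rand(24, 24)            # a toy detection sub-window
ii = integral_image(window)
print(two_rect_feature(ii, 4, 4, 8, 12))   # each rectangle sum: 4 look-ups
```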



Feature Selection

  • For each round of boosting:

    • Evaluate each rectangle filter on each example

    • Sort examples by filter values

    • Select best threshold for each filter (min Z)

    • Select best filter/threshold (= Feature)

    • Reweight examples

  • M filters, T thresholds, N examples, L learning time

    • O( MT L(MTN) ) Naïve Wrapper Method

    • O( MN ) Adaboost feature selector
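A sketch of the single-feature threshold search implied by the sorting step above (my reconstruction, picking the threshold that minimizes weighted error in one pass over the sorted values; the slides minimize the boosting normalizer Z, but the single-pass bookkeeping is the same):

```python
# Fast threshold selection for one rectangle filter given AdaBoost weights.
import numpy as np

def best_threshold(values, labels, weights):
    """values: filter responses; labels in {0, 1} (1 = face); weights: example weights."""
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]

    total_pos = (w * y).sum()
    total_neg = (w * (1 - y)).sum()
    pos_below = np.cumsum(w * y)             # positive weight at or below each value
    neg_below = np.cumsum(w * (1 - y))       # negative weight at or below each value

    # weighted error if "face" is predicted below the threshold, or above it
    err_below = neg_below + (total_pos - pos_below)
    err_above = pos_below + (total_neg - neg_below)
    err = np.minimum(err_below, err_above)

    i = int(np.argmin(err))
    polarity = +1 if err_below[i] <= err_above[i] else -1
    return v[i], polarity, err[i]
```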



Example Classifier for Face Detection

A classifier with 200 rectangle features was learned using AdaBoost.

95% correct detection on the test set with 1 in 14,084 false positives. Not quite competitive...

[Figure: ROC curve for the 200-feature classifier]



Building Fast Classifiers

  • Given a nested set of classifier hypothesis classes

  • Computational Risk Minimization

[Diagram: each IMAGE SUB-WINDOW passes through Classifier 1, Classifier 2, and Classifier 3 in turn (T: pass to the next stage; F: reject as NON-FACE); only windows that pass every stage are labeled FACE. An accompanying ROC sketch plots % Detection (50-100) against % False Pos (0-50), the operating point being determined by the classifier threshold.]



Other Fast Classification Work

  • Simard

  • Rowley (Faces)

  • Fleuret & Geman (Faces)



Cascaded Classifier

  • A 1 feature classifier achieves 100% detection rate and about 50% false positive rate.

  • A 5 feature classifier achieves 100% detection rate and 40% false positive rate (20% cumulative)

    • using data from previous stage.

  • A 20 feature classifier achieves 100% detection rate with a 10% false positive rate (2% cumulative)

[Diagram: each IMAGE SUB-WINDOW passes through the 1-feature, 5-feature, and 20-feature classifiers in turn; each stage rejects (F) windows as NON-FACE, and the cumulative false positive rate falls from 50% to 20% to 2%. Windows that pass every stage are labeled FACE.]
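Evaluating such a cascade takes only a few lines; this sketch (my own, with hypothetical stage classifiers and thresholds) shows why rejected sub-windows are so cheap:

```python
# Cascade evaluation: bail out at the first stage that rejects the sub-window.
def evaluate_cascade(stages, window):
    """stages: list of (score_fn, threshold) pairs, cheapest first."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:   # early stages reject most non-faces
            return False                   # NON-FACE
    return True                            # survived every stage: FACE
```

Each stage's threshold is set so that (nearly) all true faces pass, which is how the per-stage rates quoted above are obtained.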



Comparison to Other Systems

Detection rate (%) versus number of false detections:

False detections:   10    31    50    65    78    95    110   167   422
Viola-Jones         78.3  85.2  88.8  90.0  90.1  90.8  91.1  91.8  93.7

Other detectors (reported at a subset of these operating points):

  • Rowley-Baluja-Kanade: 83.2, 86.0, 89.2, 90.1, 89.9

  • Schneiderman-Kanade: 94.4

  • Roth-Yang-Ahuja: (94.8)



Output of Face Detector on Test Images



Solving other “Face” Tasks

Profile Detection

Facial Feature Localization

Demographic Analysis



Feature Localization

  • Surprising properties of our framework

    • The cost of detection is not a function of image size

      • Just the number of features

    • Learning automatically focuses attention on key regions

  • Conclusion: the “feature” detector can include a large contextual region around the feature



Feature Localization Features

  • Learned features reflect the task



Profile Detection



More Results



Profile Features



Thanks to

Andrew Moore

One-Nearest Neighbor

One-nearest-neighbor fitting is described shortly…

Similar to Join The Dots with two Pros and one Con.

  • PRO: It is easy to implement with multivariate inputs.

  • CON: It no longer interpolates locally.

  • PRO: An excellent introduction to instance-based learning…



Thanks to

Andrew Moore

1-Nearest Neighbor is an example of… instance-based learning

Four things make a memory based learner:

  • A distance metric

  • How many nearby neighbors to look at?

  • A weighting function (optional)

  • How to fit with the local points?

[Table: the stored training pairs (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)]

A function approximator that has been around since about 1910.

To make a prediction, search database for similar datapoints, and fit with the local points.



Thanks to

Andrew Moore

Nearest Neighbor

Four things make a memory based learner:

  • A distance metric: Euclidean

  • How many nearby neighbors to look at? One

  • A weighting function (optional): Unused

  • How to fit with the local points? Just predict the same output as the nearest neighbor.
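Those four choices translate almost directly into code. A minimal sketch (function and variable names are mine; k = 1 reproduces one-nearest-neighbor, and the later k-nearest-neighbor slide simply raises k):

```python
# Memory-based prediction: store the data, answer queries by looking up neighbors.
import numpy as np

def knn_predict(X, y, x_query, k=1):
    """X: (n, d) stored inputs; y: stored outputs; Euclidean distance, no weighting."""
    dists = np.sqrt(np.sum((X - x_query) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    return y[nearest].mean()               # k = 1: the nearest neighbor's own output
```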



Thanks to

Andrew Moore

Multivariate Distance Metrics

Suppose the input vectors x1, x2, …xn are two dimensional:

x1 = ( x11 , x12 ) , x2 = ( x21 , x22) , …xN = ( xN1 , xN2 ).

One can draw the nearest-neighbor regions in input space.

The relative scalings in the distance metric affect region shapes.



Thanks to

Andrew Moore

Euclidean Distance Metric

Other Metrics…

  • Mahalanobis, Rank-based, Correlation-based (Stanfill & Waltz, Maes’ Ringo system…)

[Equations: the Euclidean distance metric written coordinate-wise and, equivalently, in matrix form.]



Thanks to

Andrew Moore

Notable Distance Metrics



Simard: Tangent Distance



Simard: Tangent Distance



Thanks to

Baback Moghaddam

FERET Photobook Moghaddam & Pentland (1995)



Normalized Eigenfaces

Thanks to

Baback Moghaddam

Eigenfaces Moghaddam & Pentland (1995)



Thanks to

Baback Moghaddam

Euclidean (Standard) “Eigenfaces” Turk & Pentland (1992) Moghaddam & Pentland (1995)

Projects all the training faces onto a universal eigenspace to “encode” variations (“modes”) via principal components (PCA).

Uses inverse distance as a similarity measure for matching & recognition.



Thanks to

Baback Moghaddam

Euclidean Similarity Measures

  • Metric (distance-based) Similarity Measures

    • template-matching, normalized correlation, etc

  • Disadvantages

    • Assumes isotropic variation (that all variations are equi-probable)

    • Cannot distinguish incidental changes from the critical ones

    • Particularly bad for Face Recognition in which so many are incidental!

      • for example: lighting and expression



PCA-Based Density Estimation (Moghaddam & Pentland, ICCV’95)

Perform PCA and factorize into (orthogonal) Gaussian subspaces.

Solve for the minimal KL-divergence residual for the orthogonal subspace.

Thanks to

Baback Moghaddam

See Tipping & Bishop (97) for an ML derivation within a more general factor analysis framework (PPCA)



Thanks to Baback Moghaddam

Bayesian Face Recognition (Moghaddam et al., ICPR’96, FG’98, NIPS’99, ICCV’99)

  • Intrapersonal vs. Extrapersonal: dual subspaces for dyads (image pairs)

  • Equate “similarity” with the posterior probability of the intrapersonal class, using PCA-based density estimation (Moghaddam, ICCV’95)



Thanks to Baback Moghaddam

Intra-Extra (Dual) Subspaces

[Figure: example image pairs labeled smile, light, specs, and mouth, shown for the Intra, Extra, and Standard PCA subspaces.]



Intra-Extra Subspace Geometry

Thanks to

Baback Moghaddam

Two “pancake” subspaces with different orientations intersecting near the origin. If each is in fact Gaussian, then the optimal discriminant is hyperquadratic.



Thanks to

Baback Moghaddam

Bayesian Similarity Measure

  • Bayesian (MAP) Similarity

    • priors can be adjusted to reflect operational settings or used for Bayesian fusion (evidential “belief” from another level of inference)

  • Likelihood (ML) Similarity

Intra-only (ML) recognition is only slightly inferior to MAP (by a few %). Therefore, if you had to pick only one subspace to work in, you should pick Intra, not standard eigenfaces!



FERET Identification: Pre-Test

Thanks to

Baback Moghaddam

[Chart: identification rates for Bayesian (Intra-Extra) vs. Standard (Eigenfaces)]



Official 1996 FERET Test

Thanks to

Baback Moghaddam

[Chart: identification rates for Bayesian (Intra-Extra) vs. Standard (Eigenfaces)]



..let’s leave distance metrics for now, and go back to….

Thanks to

Andrew Moore

One-Nearest Neighbor

Objection:

That noise-fitting is really objectionable.

What’s the most obvious way of dealing with it?



Thanks to

Andrew Moore

k-Nearest Neighbor

Four things make a memory based learner:

  • A distance metric: Euclidean

  • How many nearby neighbors to look at? k

  • A weighting function (optional): Unused

  • How to fit with the local points? Just predict the average output among the k nearest neighbors.



Thanks to

Andrew Moore

k-Nearest Neighbor (here k=9)

K-nearest neighbor for function fitting smoothes away noise, but there are clear deficiencies.

What can we do about all the discontinuities that k-NN gives us?



Thanks to

Andrew Moore

Kernel Regression

Four things make a memory based learner:

  • A distance metric: Scaled Euclidean

  • How many nearby neighbors to look at? All of them

  • A weighting function (optional): wi = exp(-D(xi, query)^2 / Kw^2)

    Nearby points to the query are weighted strongly, far points weakly. The Kw parameter is the Kernel Width. Very important.

  • How to fit with the local points? Predict the weighted average of the outputs:

    predict = Σ wi yi / Σ wi
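The same four choices again, as a short sketch (my own naming; Gaussian weighting exactly as in the bullet above):

```python
# Kernel regression: a weighted average of all stored outputs.
import numpy as np

def kernel_regression(X, y, x_query, kw=50.0):
    d2 = np.sum((X - x_query) ** 2, axis=1)    # squared distances to the query
    w = np.exp(-d2 / kw ** 2)                  # w_i = exp(-D(x_i, query)^2 / Kw^2)
    return np.sum(w * y) / np.sum(w)           # predict = sum(w_i y_i) / sum(w_i)
```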



Thanks to

Andrew Moore

Kernel Regression in Pictures

Take this dataset…

..and do a kernel prediction with xq (query) = 310, Kw = 50.



Thanks to

Andrew Moore

Varying the Query

xq= 150

xq= 395



Thanks to

Andrew Moore

Varying the kernel width

Increasing the kernel width Kw means further away points get an opportunity to influence you.

As Kwinfinity, the prediction tends to the global average.



Thanks to

Andrew Moore

Kernel Regression Predictions

Increasing the kernel width Kw means further away points get an opportunity to influence you.

As Kwinfinity, the prediction tends to the global average.



Thanks to

Andrew Moore

Kernel Regression on our test cases

Choosing a good Kw is important. Not just for Kernel Regression, but for all the locally weighted learners we’re about to see.



Thanks to

Andrew Moore

Weighting functions

Let d = D(xi, xquery) / Kw

Then here are some commonly used weighting functions…

(we use a Gaussian)



Thanks to

Andrew Moore

Kernel Regression can look bad

Time to try something more powerful…



Thanks to

Andrew Moore

Locally Weighted Regression

Kernel Regression:

Take a very very conservative function approximator called AVERAGING. Locally weight it.

Locally Weighted Regression:

Take a conservative function approximator called LINEAR REGRESSION. Locally weight it.

Let’s Review Linear Regression….



Thanks to

Andrew Moore

Unweighted Linear Regression

You’re lying asleep in bed. Then Nature wakes you.

YOU: “Oh. Hello, Nature!”

NATURE: “I have a coefficient β in mind. I took a bunch of real numbers called x1, x2 ..xN thus: x1=3.1,x2=2, …xN=4.5.

For each of them (k=1,2,..N), I generated yk= βxk+εk

where εk is a Gaussian (i.e. Normal) random variable with mean 0 and standard deviation σ. The εk’s were generated independently of each other.

Here are the resulting yi’s: y1=5.1 , y2=4.2 , …yN=10.2”

You: “Uh-huh.”

Nature: “So what do you reckon β is then, eh?”

WHAT IS YOUR RESPONSE?
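The answer does not appear in this transcript, but under the stated model (Gaussian noise, no intercept) the maximum-likelihood response is the least-squares estimate; using just the three example values quoted above:

```python
# beta_hat = sum(x_k * y_k) / sum(x_k^2), the least-squares / maximum-likelihood fit.
import numpy as np

x = np.array([3.1, 2.0, 4.5])
y = np.array([5.1, 4.2, 10.2])
print((x @ y) / (x @ x))    # about 2.07
```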



Thanks to

Andrew Moore

Locally Weighted Regression

Four things make a memory-based learner:

  • A distance metric: Scaled Euclidean

  • How many nearby neighbors to look at? All of them

  • A weighting function (optional): wk = exp(-D(xk, xquery)^2 / Kw^2). Nearby points to the query are weighted strongly, far points weakly. The Kw parameter is the Kernel Width.

  • How to fit with the local points?

    • First form a local linear model. Find the β that minimizes the locally weighted sum of squared residuals, Σk wk (yk - βT xk)^2.

Then predict ypredict = βT xquery
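A sketch of that recipe (my own illustration): solve the locally weighted least-squares problem around the query and evaluate the local linear model there:

```python
# Locally weighted regression: refit a weighted linear model for every query.
import numpy as np

def lwr_predict(X, y, x_query, kw=0.3):
    """X: (n, d) inputs (append a column of ones for an intercept if desired)."""
    d2 = np.sum((X - x_query) ** 2, axis=1)        # squared distances to the query
    w = np.exp(-d2 / kw ** 2)                      # nearby points count more
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted least squares
    return beta @ x_query
```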



Thanks to

Andrew Moore

How LWR works

Query

Linear regression is not flexible, but it trains like lightning.

Locally weighted regression is very flexible and fast to train.



Thanks to

Andrew Moore

LWR on our test cases



Features, Features, Features

  • In almost every case:

    Good Features beat Good Learning

    Learning beats No Learning

  • Critical classifier ratio:

    • AdaBoost >> SVM

