
# Dense Object Recognition - PowerPoint PPT Presentation

Presentation Transcript

## 2. Template Matching

We will investigate face detection using a scanning window technique:

Think this task sounds easy?

Training data:

- Faces: 800 face images, 60x60, taken from an online dating website
- Non-faces: 800 random non-face regions, 60x60, taken from the same data as the faces

Concatenate the face pixels into a vector, x = (x1, x2, x3, ..., xN).
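As a minimal sketch of this vectorization step (assuming numpy and 60x60 RGB crops, so each vector has 60 x 60 x 3 = 10,800 entries):

```python
import numpy as np

# Stand-in for one cropped 60x60 RGB face image (values in [0, 1]).
image = np.random.rand(60, 60, 3)

# Concatenate the face pixels into a single vector x.
x = image.reshape(-1)

print(x.shape)  # (10800,)
```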

GENERATIVE APPROACH

- Calculate models for the data likelihood given each class
- Compare likelihoods – in this case we will just calculate the likelihood ratio
- Threshold the likelihood ratio to decide face / non-face

All that remains is to specify the form of the likelihood terms.
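The generative approach above can be sketched as follows. This is a toy illustration, not the lecture's actual pipeline: the data here are small random stand-ins for the vectorized face and non-face images, and diagonal Gaussians are assumed for each class likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the vectorized training data (D kept small).
D = 20
faces = rng.normal(0.6, 0.1, size=(800, D))
non_faces = rng.normal(0.4, 0.2, size=(800, D))

# Fit a diagonal Gaussian to each class by maximum likelihood.
def fit_diag(data):
    return data.mean(axis=0), data.var(axis=0)

m_face, v_face = fit_diag(faces)
m_non, v_non = fit_diag(non_faces)

def log_likelihood_ratio(x):
    lf = multivariate_normal.logpdf(x, mean=m_face, cov=np.diag(v_face))
    ln = multivariate_normal.logpdf(x, mean=m_non, cov=np.diag(v_non))
    return lf - ln

# Threshold the log-likelihood ratio to decide face / non-face.
threshold = 0.0
test_face = rng.normal(0.6, 0.1, size=D)
print(log_likelihood_ratio(test_face) > threshold)
```

Sweeping the threshold traces out the ROC curves shown later in the lecture.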

Norm_x[m, S] denotes an n-dimensional Gaussian or Normal distribution in the variable x with mean m and symmetric positive-definite covariance matrix S, which comes in three flavours: full, diagonal, and uniform (spherical) covariance.

Fit the model using the maximum likelihood criterion.

[Figure: fitted means m_face and m_non-face visualized in pixel space (Pixel 1 vs. Pixel 2); the face mean serves as a face 'template'.]

Results based on 200 cropped faces and 200 non-faces from the same database.

How does this work with a real image?

[Figures: ROC curve, Pr(Hit) vs. Pr(False Alarm); maxima in the log-likelihood ratio at scales 1-3, before and after thresholding; original image, superimposed log-likelihood ratio, and detected faces at the positions of the maxima.]

Fit the model using the maximum likelihood criterion. Results are again based on 200 cropped faces and 200 non-faces from the same database.

The more sophisticated model unsurprisingly classifies new faces and non-faces better.

[Figure: ROC curves, Pr(Hit) vs. Pr(False Alarm), comparing the diagonal and uniform covariance models.]

Fit the model using the maximum likelihood criterion.

PROBLEM: we cannot fit this model. We don't have enough data to estimate the full covariance matrix.

- N = 800 training images
- D = 10,800 dimensions
- Total number of measured numbers: ND = 800 x 10,800 = 8,640,000
- Total number of parameters in the covariance matrix: D(D+1)/2 = 10,800 x 10,801 / 2 = 58,325,400
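The counting above can be checked directly:

```python
N = 800             # training images
D = 10_800          # dimensions per image (60 x 60 x 3)

measured = N * D                     # total measured numbers
full_cov_params = (D + 1) * D // 2   # parameters in a full covariance matrix

print(measured)         # 8,640,000
print(full_cov_params)  # 58,325,400
```

There are nearly seven times more covariance parameters than measured numbers, so the full model cannot be fit.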


We could induce some covariance by using a mixture of Gaussians model in which each component is uniform or diagonal. For a small number of mixture components, the number of parameters is not too bad.

For diagonal Gaussians, there are 2D+1 unknowns per component (D parameters for the mean, D for the diagonal covariance, and 1 for the weight of the Gaussian), i.e. K(2D+1) for K components.

## 3. Mixtures of Templates

Key idea: represent the probability as a weighted sum (mixture) of Gaussian distributions. The weights must sum to 1, or the result is not a valid pdf.

Try to think about the same problem in a different way: introduce a hidden variable h and marginalize over it, Pr(x) = sum_h Pr(x|h) Pr(h).

ASSUMPTIONS:

- for each training datum x_i there is a hidden variable h_i
- h_i represents which Gaussian x_i came from
- hence h_i takes discrete values

OUR GOAL: to estimate the parameters q for each of the K components:

- means m
- variances s2
- weights w

THING TO NOTICE #1: If we knew the hidden variables h_i for the training data, it would be very easy to estimate the parameters q – just estimate the individual Gaussians separately.

THING TO NOTICE #2: If we knew the parameters q, it would be very easy to estimate the posterior distribution over each hidden variable h_i using Bayes' rule:

Pr(h|x) = Pr(x|h) Pr(h) / sum_h Pr(x|h) Pr(h)

[Figure: component likelihoods Pr(x|h=1), Pr(x|h=2), Pr(x|h=3) and the resulting posterior Pr(h|x).]

Chicken and egg problem:

- could find h_1...N if we knew q
- could find q if we knew h_1...N

Solution: the Expectation Maximization (EM) algorithm (Dempster, Laird and Rubin 1977). Alternate between:

1. Expectation Step (E-Step): for fixed q, find the posterior distribution over h_1...N
2. Maximization Step (M-Step): given these distributions, maximize a lower bound on the likelihood w.r.t. q
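The two alternating steps can be sketched for a 1-D mixture of Gaussians. This is a toy example (synthetic bimodal data, a simple min/max initialization), not the lecture's face model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data drawn from two Gaussians (a stand-in for the image vectors).
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(2.0, 0.5, 300)])
N, K = len(data), 2

# Initialize the parameters q = {means, variances, weights}.
means = np.array([data.min(), data.max()])
variances = np.full(K, data.var())
weights = np.full(K, 1.0 / K)

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: for fixed q, find the posterior over each h_i (Bayes' rule).
    resp = weights * gauss(data[:, None], means, variances)   # (N, K)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: maximize the lower bound on the likelihood w.r.t. q.
    Nk = resp.sum(axis=0)
    means = (resp * data[:, None]).sum(axis=0) / Nk
    variances = (resp * (data[:, None] - means) ** 2).sum(axis=0) / Nk
    weights = Nk / N

print(np.sort(means))  # close to [-2, 2]
```

Each iteration provably does not decrease the data likelihood, which is why the alternation converges.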

[Figure: face model parameters for a two-component mixture – mean and standard deviation images for each component, with priors 0.4999 and 0.5001.]

The face model and non-face model have each divided the data into two clusters. In each case, these clusters have roughly equal weights.

The primary thing that these seem to have captured is the photometric (luminance) variation.

Note that the standard deviations have become smaller than for the single Gaussian model, as any given data point is likely to be close to one mean or the other.

[Figure: non-face model parameters – mean and standard deviation images for each component, with priors 0.5325 and 0.4675.]

Performance improves relative to a single Gaussian model, although the improvement is not dramatic. We have a better description of the data likelihood.

[Figure: ROC curves, Pr(Hit) vs. Pr(False Alarm), comparing the two-component mixture (MOG 2) with the diagonal and uniform models.]

[Figure: ten-component face and non-face model parameters – mean and standard deviation images for each component, with their mixture priors.]

Performance improves slightly more, particularly at low false alarm rates.

What if we move to an infinite number of Gaussians?

[Figure: ROC curves, Pr(Hit) vs. Pr(False Alarm), for the MOG 10, MOG 2, diagonal, and uniform models.]

## 4. Subspace models: factor analysis

Consider putting the means of the Gaussian mixture components all on a line and forcing their diagonal covariances to be identical. What happens if we keep adding more and more Gaussians along this line? In the limit, the hidden variable becomes continuous.

[Figures: discrete mixtures with components at h = -1, 0, 1, then h = -2 ... 2, and finally a continuous hidden variable; in each case the mixture is marginalized over h to give a distribution in pixel space.]

Now consider weighting the constituent Gaussians...


If the weights decrease with distance from a central point, we can get something like an oriented Gaussian.

[Figures: Gaussians placed along the line m + hf for h = -1, 0, 1, then h = -2 ... 2, and in the limit a continuous hidden variable; marginalizing over h gives an oriented distribution in pixel space.]

Weight the components by another Gaussian distribution with mean 0 and variance 1.

- This integral does actually evaluate to a new Gaussian whose principal axis is oriented along the line given by m + hf.
- This is not obvious!
- The line along which the Gaussians are placed is termed a subspace.
- Since h was just a number and there was one column in f, this was a one-dimensional subspace.
- This is not necessarily the case, though d_h < d_x always holds.

For a general subspace of d_h dimensions in a larger space of size d_x:

- F has dim(h) columns, each of length d_x – these are termed factors.
- They are basis vectors that span the subspace.
- h now weights these basis vectors to define a position in the subspace.

Concrete example: a 2D subspace in a 3D space.

- F will contain two 3D vectors in its columns, spanning a planar subspace.
- h determines the weighting of these vectors, and hence the position on the plane.

We have considered factor analysis as an infinite mixture of Gaussians, but there are other ways to think about it. Consider a rule for creating new data points x_i from some smaller underlying random variables h_i. To generate:

- multiply by the factors, F
- add a random noise component e_i with diagonal covariance S

x_i = m + F h_i + e_i

[Figure: hidden dimensions mapped through F into the observed dimensions, with added noise e.]

Equivalent description:

Pr(h) = Norm_h[0, I]
Pr(x|h) = Norm_x[m + Fh, S]

Joint distribution (marginalize to get Pr(x)):

Pr(x) = Norm_x[m, F F^T + S]

For a general subspace of d_h dimensions in a larger space of size d_x, the factor analysis covariance has:

- d_h d_x parameters in the factor matrix, F
- d_x parameters in the covariance, S

This gives a total of d_x(d_h + 1) parameters. If d_h is reasonably small and d_x is large, then this is much less than the full covariance, which has d_x(d_x + 1)/2.

It is a reasonable assumption that an ensemble of images (like faces) genuinely lies largely within a subspace of the very high-dimensional image space, so this is not a bad model.
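The generation rule x = m + Fh + e can be simulated directly, and the sample covariance should approach F F^T + S. A small sketch (toy dimensions, random F and S chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

d_x, d_h = 5, 2
m = rng.normal(size=d_x)                  # mean
F = rng.normal(size=(d_x, d_h))           # factors (columns span the subspace)
S = np.diag(rng.uniform(0.1, 0.3, d_x))   # diagonal noise covariance

# Generate: draw h ~ Norm[0, I], multiply by the factors, add diagonal noise.
n = 200_000
h = rng.standard_normal((n, d_h))
e = rng.standard_normal((n, d_x)) * np.sqrt(np.diag(S))
x = m + h @ F.T + e

# The marginal covariance of x should approach F F^T + S.
emp = np.cov(x, rowvar=False)
print(np.max(np.abs(emp - (F @ F.T + S))))  # small
```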

But given some data, how do we estimate F, S, and m? Unfortunately, to do this, we will need some more maths!

## Interlude: Gaussian and Matrix Identities

The multivariate generalization of the 1D Gaussian or Normal distribution depends on a mean vector m and a (symmetric, positive-definite) covariance matrix S. The multivariate normal distribution has pdf

Pr(x) = (2 pi)^{-n/2} |S|^{-1/2} exp( -(1/2) (x - m)^T S^{-1} (x - m) )

where n is the dimensionality of the space under consideration.

Gaussian Identity #1: Multiplication of Gaussians

Property: when we multiply two Gaussian distributions in the same variable (common when applying Bayes' rule), the resulting distribution is also Gaussian. In particular:

Norm_x[a, A] . Norm_x[b, B] = k . Norm_x[c, C]

where:

C = (A^{-1} + B^{-1})^{-1}
c = C (A^{-1} a + B^{-1} b)

The normalization constant k is also Gaussian in either a or b. Intuitively you can see that the product must be a Gaussian, as each of the original Gaussians has an exponent that is quadratic in x; when we multiply the two Gaussians, we add the exponents, giving another quadratic.

Proof sketch: expand the exponents and remove the terms that do not depend on x, placing them in the constant k. It can be seen from the quadratic term that this looks like a Gaussian with covariance C = (A^{-1} + B^{-1})^{-1}; re-arranging the linear term gives the mean c, as required.

Gaussian Identity #2: Change of Variables

Consider a Gaussian in x with a mean that is a linear function H of y. We can re-arrange to express this in terms of a Gaussian in y:

Norm_x[Hy, S] = k . Norm_y[ (H^T S^{-1} H)^{-1} H^T S^{-1} x, (H^T S^{-1} H)^{-1} ]

Proof: looking at the quadratic term in y, it resembles the quadratic term of a Gaussian in y with covariance (H^T S^{-1} H)^{-1}; re-arranging the linear term gives the mean, as required.

Matrix Identity #1

Consider the d x d matrix P, the k x k matrix R, and the k x d matrix H, where P and R are symmetric, positive-definite covariance matrices. The following equality holds:

(P^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1} = P H^T (H P H^T + R)^{-1}

Proof: start from the identity H^T R^{-1} (H P H^T + R) = (P^{-1} + H^T R^{-1} H) P H^T, then multiply by the appropriate inverses on both sides.

Matrix Identity #2 (the Matrix Inversion Lemma)

With the same matrices, the following equality also holds:

(P^{-1} + H^T R^{-1} H)^{-1} = P - P H^T (H P H^T + R)^{-1} H P

Proof: take the inverse of both sides and verify that the product with (P^{-1} + H^T R^{-1} H) is the identity, as required.
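Both identities are easy to verify numerically for random positive-definite matrices; a quick check:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 4, 3

# Symmetric positive-definite P (d x d) and R (k x k), arbitrary H (k x d).
A = rng.normal(size=(d, d)); P = A @ A.T + d * np.eye(d)
B = rng.normal(size=(k, k)); R = B @ B.T + k * np.eye(k)
H = rng.normal(size=(k, d))

inv = np.linalg.inv

# Identity #1: (P^-1 + H^T R^-1 H)^-1 H^T R^-1 = P H^T (H P H^T + R)^-1
lhs1 = inv(inv(P) + H.T @ inv(R) @ H) @ H.T @ inv(R)
rhs1 = P @ H.T @ inv(H @ P @ H.T + R)

# Identity #2 (matrix inversion lemma):
# (P^-1 + H^T R^-1 H)^-1 = P - P H^T (H P H^T + R)^-1 H P
lhs2 = inv(inv(P) + H.T @ inv(R) @ H)
rhs2 = P - P @ H.T @ inv(H @ P @ H.T + R) @ H @ P

print(np.allclose(lhs1, rhs1), np.allclose(lhs2, rhs2))  # True True
```

The payoff, used below, is that the left-hand sides invert d x d matrices while the right-hand sides invert k x k ones (or vice versa), so we can always work in the smaller dimension.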

[Slide: a numbered summary of the Gaussian and matrix identities above, stated for the d x d matrix P, the k x k matrix R, and the k x d matrix H; the equations did not survive transcription.]

## 4. Subspace models: factor analysis (continued)

GOAL: Given a data set x_1...N, estimate the factor analysis model parameters q = {m, F, S}.

Let's make life somewhat easier: it is fairly obvious that the maximum-likelihood estimate of the mean is the data mean, m = (1/N) sum_i x_i. We'll use this estimate and subtract the mean from each of the training vectors, giving a slightly simpler generative model, which we can express as x_i = F h_i + e_i.

Goal: learn the parameters defining the model, q = {F, S}.

Problem: it is hard to estimate the parameters q since we don't know the latent identity vectors, h.

Method: the Expectation Maximization (EM) algorithm. Alternately perform the E-Step and M-Step until convergence:

- E-STEP: calculate the posterior distribution over the latent identity variable, Pr(h|x, q)
- M-STEP: maximize the likelihood of the parameters q using the expected values of h

[Figure: generative model – hidden dimensions h mapped through F into the observed dimensions x.]

In the E-Step, we use Bayes' rule to find the distribution over the identity vector h given the observed data x:

Pr(h|x) = Pr(x|h) Pr(h) / integral of Pr(x|h) Pr(h) dh

In this simple subspace model, both of the terms in the denominator are Gaussian, so this posterior probability for h can be calculated in closed form.

[Figure: probabilistic inversion via Bayes' rule – inferring the hidden h from the observed x.]

Let's consider just the numerator of this expression, since the denominator is just a scaling constant:

Pr(h|x) is proportional to Pr(x|h) Pr(h) = Norm_x[Fh, S] . Norm_h[0, I]

Now apply Gaussian Relation #2 to the first term to give a Gaussian in h. Notice that we then have a Gaussian times a Gaussian in the same variable – this must make a Gaussian result. To find its mean and covariance, we use Gaussian Relation #1. The resulting posterior has moments

E[h] = F^T (F F^T + S)^{-1} x
Cov[h] = I - F^T (F F^T + S)^{-1} F

We can reformulate these terms using our two matrix relations to give:

E[h] = (I + F^T S^{-1} F)^{-1} F^T S^{-1} x
Cov[h] = (I + F^T S^{-1} F)^{-1}

Why should we bother to do this? Well, the matrices in brackets at the top are d_x x d_x, whereas the matrices at the bottom are d_h x d_h.


In the M-Step, the objective function is the joint log likelihood of the latent variables and the data. Using the expected values of h, take derivatives of the log likelihood, set them to zero, and solve for the parameters q, substituting in the expected values of h.
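The alternating updates can be sketched end to end. This is a toy reconstruction under the slides' assumptions (mean-subtracted data, h ~ Norm[0, I], diagonal S), with synthetic data in small dimensions rather than the face images:

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_h, N = 8, 2, 5000

# Synthesize mean-subtracted data from a known factor analysis model.
F_true = rng.normal(size=(d_x, d_h))
S_true = rng.uniform(0.1, 0.3, d_x)                 # diagonal of S
X = rng.standard_normal((N, d_h)) @ F_true.T \
    + rng.standard_normal((N, d_x)) * np.sqrt(S_true)

# Initialize q = {F, S}.
F = rng.normal(size=(d_x, d_h))
S = X.var(axis=0)

for _ in range(300):
    # E-step: the posterior over each h_i is Gaussian with covariance G
    # and mean G F^T S^-1 x_i (the small d_h x d_h form).
    SinvF = F / S[:, None]                          # S^-1 F (S is diagonal)
    G = np.linalg.inv(np.eye(d_h) + F.T @ SinvF)
    Eh = X @ SinvF @ G                              # (N, d_h) posterior means
    sumEhh = N * G + Eh.T @ Eh                      # sum_i E[h_i h_i^T]

    # M-step: set derivatives of the expected log likelihood to zero.
    F = (X.T @ Eh) @ np.linalg.inv(sumEhh)
    S = (np.sum(X ** 2, axis=0) - np.sum((Eh @ F.T) * X, axis=0)) / N

# The learned covariance F F^T + S should match the true model's.
learned = F @ F.T + np.diag(S)
truth = F_true @ F_true.T + np.diag(S_true)
print(np.max(np.abs(learned - truth)))  # small
```

Note that the learned F itself need not match F_true, only F F^T + S, because of the rotation ambiguity discussed below.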

[Figure: learned two-factor model – mean m, noise S, factors F1 and F2, and the images m + 2F1 and m + 2F2.]

We can calculate factor analysis performance for face detection in terms of a receiver operating characteristic curve.

[Figure: learned five-factor model – mean m, noise S, factors F1-F5, and the images m + 2F1 ... m + 2F5, together with the resulting ROC curve.]

Recall the generation rule: multiply by the factors F and add a random noise component e_i with diagonal covariance S. The factors are ambiguous up to a rotation: replacing F with FR for any orthogonal matrix R leaves F F^T, and hence the model, unchanged. There is an infinite set of equivalent models, each of which has the same probability.

Mixture of factor analyzers (MOFA)

- Two levels of the EM algorithm: one to learn each factor analyzer, one to learn the mixture model
- Learning is subject to local minima
- Can describe quite complex manifold structures in high dimensions with only a limited number of parameters

Gaussian Process Latent Variable Models

- A non-linear version of factor analysis
- There is still a latent space, but the function mapping the latent space to the observed space is now nonlinear
- Learning is subject to local minima

## 7. Relationship to non-probabilistic methods

- Factor analysis is very closely related to another common technique in computer vision: principal component analysis (PCA).
- The motivation of PCA is quite different from that for factor analysis:
  - it is not probabilistic
  - it is primarily concerned with dimensionality reduction

Dimensionality reduction: consider the hidden space as a smaller set of numbers that can approximately describe the image,

x' = m + h1 f1 + h2 f2 + h3 f3 + ...


- A face is approximately represented by a weighted sum of the factors.
- h (low dimensional) can be used as a proxy for x (high dimensional).

KEY IDEAS:

- Describe the data as a multivariate Gaussian.
- Project the data onto the axes of this Gaussian with the largest variance.
- Discard all but the largest few dimensions.
- This finds a small set of numbers that describes as much of the variance in the dataset as possible (dimensionality reduction).
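The key ideas above can be sketched in a few lines. A toy 2-D example is assumed (correlated synthetic data standing in for vectorized images), keeping only the single largest-variance axis:

```python
import numpy as np

rng = np.random.default_rng(6)

# Correlated 2-D toy data (a stand-in for vectorized images).
X = rng.multivariate_normal([0, 0], [[3.0, 1.4], [1.4, 1.0]], size=10_000)

# Describe the data as a multivariate Gaussian: mean and covariance.
mean = X.mean(axis=0)
cov = np.cov(X - mean, rowvar=False)

# The axes of this Gaussian are the eigenvectors of the covariance,
# sorted so the largest-variance direction comes first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep only the first principal component (dimensionality reduction).
h = (X - mean) @ eigvecs[:, :1]

# h is a low-dimensional proxy for X; reconstruct an approximation.
X_approx = mean + h @ eigvecs[:, :1].T
print(np.mean((X - X_approx) ** 2))  # only the discarded-axis variance remains
```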

[Figure: a bivariate axis-aligned Gaussian (axes x1, x2 with standard deviations s1, s2), and a bivariate non-axis-aligned distribution with rotated axes x'1, x'2.]

The mean and covariance matrix of the data define a Gaussian model. As before, we break this covariance matrix down into the product of three other matrices,

S = U L U^T

where U is a rotation matrix that transforms the principal axes of the fitted Gaussian back to the original co-ordinate system and L is a diagonal matrix of variances.

- If S is an m x m covariance matrix, there exist m linearly independent eigenvectors, and all the corresponding eigenvalues are non-negative.
- We can decompose S as above by computing the eigenvectors of the covariance:
  - each eigenvector gives a main direction
  - each eigenvalue gives the variance along its eigenvector

- Image co-ordinates are often correlated: nearby points move together, so the data lies in a subspace of reduced dimension.
- However, for some p, each element of the data can be approximately written as a weighted sum of the first p eigenvectors.
- Differences from PCA:
  - factor analysis gives a probability
  - factor analysis has a separate noise parameter for each dimension
  - factors have arbitrary length, but principal components have unit length