
# Dense Object Recognition



### Dense Object Recognition

2. Template Matching

Face Detection

We will investigate face detection using a scanning window technique:

Think this task sounds easy?

### Training Data

Non-Faces

Faces

800 random non-face regions, 60x60, taken from the same data as the faces

800 face images, 60x60, taken from an online dating website

### Vectorizing Images

Concatenate the face pixels into a "vector":

x = [x1, x2, x3, …, xN]ᵀ
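The flattening step can be sketched in a couple of numpy lines; the 60x60 window here is a synthetic stand-in for a real face crop (a 60x60x3 colour window would flatten the same way, giving the D = 10,800 used later):

```python
import numpy as np

# Hypothetical 60x60 grayscale window standing in for a real face crop.
window = np.arange(3600, dtype=np.float64).reshape(60, 60)

# Concatenate the pixels into a single vector x (row-major flattening).
x = window.reshape(-1)   # shape (3600,)
```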

### Overview of Approach

• GENERATIVE APPROACH

• Calculate models for data likelihood given each class

• Compare likelihoods – in this case we will just calculate the likelihood ratio:

• Threshold likelihood ratio to decide if face / non-face

All that remains is to specify form of likelihood terms
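The approach above can be sketched end to end on toy data. The 5-dimensional "images", the spherical-Gaussian class likelihoods, and the threshold of 0 are all illustrative assumptions, not the lecture's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for vectorized training images (5-dimensional, not real faces).
faces = rng.normal(loc=0.6, scale=0.1, size=(800, 5))
nonfaces = rng.normal(loc=0.4, scale=0.2, size=(800, 5))

def fit_spherical_gaussian(X):
    """Maximum-likelihood mean plus a single shared (uniform) variance."""
    mu = X.mean(axis=0)
    var = ((X - mu) ** 2).mean()
    return mu, var

def log_likelihood(x, mu, var):
    d = x.size
    return -0.5 * (d * np.log(2 * np.pi * var) + ((x - mu) ** 2).sum() / var)

mu_f, var_f = fit_spherical_gaussian(faces)
mu_n, var_n = fit_spherical_gaussian(nonfaces)

def is_face(x, threshold=0.0):
    # Threshold the log likelihood ratio to decide face / non-face.
    ratio = log_likelihood(x, mu_f, var_f) - log_likelihood(x, mu_n, var_n)
    return bool(ratio > threshold)
```

With these toy means, a point near the face mean is accepted and a point far from both means is rejected.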

### The Multivariate Gaussian

denotes an n-dimensional Gaussian or Normal distribution in the variable x with mean m and symmetric positive-definite covariance matrix S, which comes in three flavours: spherical (uniform), diagonal, and full.

### Model # 1: Gaussian, uniform covariance

Fit model using maximum likelihood criterion

[Figure: fitted class models in (Pixel 1, Pixel 2) space, showing the means m face and m non-face (reported values: face 59.1, non-face 69.1). The mean face image acts as a face 'template'.]

### Model 1 Results

Results based on 200 cropped faces and 200 non-faces from the same database.

How does this work with a real image?

[Figure: ROC curve, Pr(Hit) against Pr(False Alarm).]

### Scales 1-3

[Figure: maxima in the log-likelihood ratio at three scales, shown before and after thresholding the maxima.]

### Results

[Figure: original image; superimposed log-likelihood ratio; positions of maxima; detected faces.]

### Model # 2: Gaussian, diagonal covariance

Fit model using maximum likelihood criterion

[Figure: fitted diagonal-covariance models in (Pixel 1, Pixel 2) space, showing the means m face and m non-face.]

### Model 2 Results

Results based on 200 cropped faces and 200 non-faces from the same database.

The more sophisticated model unsurprisingly classifies new faces and non-faces better.

[Figure: ROC curves (Pr(Hit) against Pr(False Alarm)) for the uniform and diagonal models.]

### Model #3: Gaussian, full covariance

Fit model using maximum likelihood criterion

PROBLEM: we cannot fit this model. We don’t have enough data to estimate the full covariance matrix.

N=800 training images

D=10800 dimensions

Total number of measured numbers: ND = 800 x 10,800 = 8,640,000

Total number of parameters in the covariance matrix: (D+1)D/2 = 10,801 x 10,800 / 2 = 58,325,400

### Possible Solution

We could induce some covariance by using a mixture of Gaussians model in which each component is uniform or diagonal. For a small number of mixture components, the number of parameters is not too bad.


For diagonal Gaussians, there are 2D + 1 unknowns per component (D parameters for the mean, D for the diagonal covariance, and 1 for the weight of the Gaussian), i.e. K(2D + 1) parameters for K components.
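The counting above is easy to check numerically (D = 10,800 as in the earlier slide; K = 5 is an arbitrary illustrative choice):

```python
# Parameter bookkeeping for D-dimensional data.
D = 10_800
N = 800

measured = N * D                 # numbers available in the training set: ND
full_cov = D * (D + 1) // 2      # parameters in a full covariance matrix

def mog_diagonal_params(K, D):
    # K(2D + 1): per component, D means + D diagonal variances + 1 weight.
    return K * (2 * D + 1)

counts = (measured, full_cov, mog_diagonal_params(5, D))
```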

### Dense Object Recognition

3. Mixtures of Templates

### Mixture of Gaussians

Key idea: represent the probability as a weighted sum (mixture) of Gaussian distributions. The weights must sum to 1, otherwise the result is not a pdf.

[Figure: component Gaussians and their weighted sum Pr(x).]
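A minimal sketch of evaluating such a mixture density in 1-D; the weights, means, and standard deviations are arbitrary illustrative values:

```python
import numpy as np

def mog_pdf(x, weights, means, sigmas):
    """Weighted sum (mixture) of 1-D Gaussian pdfs."""
    weights = np.asarray(weights, dtype=float)
    # The weights must sum to 1 or the result is not a pdf.
    assert np.isclose(weights.sum(), 1.0)
    comps = np.exp(-0.5 * ((x - np.asarray(means)) / np.asarray(sigmas)) ** 2)
    comps /= np.sqrt(2 * np.pi) * np.asarray(sigmas)
    return float(np.sum(weights * comps))

p = mog_pdf(0.0, weights=[0.3, 0.7], means=[-1.0, 2.0], sigmas=[1.0, 0.5])
```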

### Hidden Variable Interpretation

Try to think about the same problem in a different way: introduce a hidden variable h and marginalize over h:

Pr(x) = Σk Pr(x|h = k) Pr(h = k)

### Hidden Variable Interpretation

• ASSUMPTIONS

• for each training datum xi there is a hidden variable hi.

• hi represents which Gaussian xi came from

• hence hi takes discrete values

• OUR GOAL:

• To estimate the parameters q:

• means m,

• variances s2

• weights w

• for each of the K components.

THING TO NOTICE #1:

If we knew the hidden variables hi for the training data, it would be very easy to estimate the parameters q – just estimate the individual Gaussians separately.

### Hidden Variable Interpretation

THING TO NOTICE #2:

If we knew the parameters q, it would be very easy to estimate the posterior distribution over each hidden variable hi using Bayes' rule:

Pr(hi = k|xi) = Pr(xi|hi = k) Pr(hi = k) / Σj Pr(xi|hi = j) Pr(hi = j)

[Figure: component likelihoods Pr(x|h = 1), Pr(x|h = 2), Pr(x|h = 3) and the resulting posterior Pr(h|x).]

### Expectation Maximization

• Chicken and egg problem:

• could find h1...N if we knew q

• could find q if we knew h1...N

Solution: Expectation Maximization (EM) algorithm (Dempster, Laird and Rubin 1977)

• Alternate between:

• 1. Expectation Step (E-Step)

• For fixed q find posterior distribution over h1...N

• 2. Maximization Step (M-Step)

• Given these distributions, maximize a lower bound on the likelihood w.r.t. q
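The alternation can be sketched for a 1-D mixture on synthetic data; the two-cluster setup, K = 2, and all numeric choices here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1-D data drawn from two clusters.
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

K = 2
# Initial parameters q: weights w, means m, variances s2.
w = np.full(K, 1.0 / K)
m = np.array([-1.0, 1.0])
s2 = np.ones(K)

def log_gauss(x, m, s2):
    return -0.5 * (np.log(2 * np.pi * s2) + (x - m) ** 2 / s2)

for _ in range(100):
    # E-Step: for fixed q, posterior over the hidden variable h for each datum.
    log_r = np.log(w) + log_gauss(data[:, None], m, s2)   # shape (N, K)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)                     # responsibilities Pr(h|x)

    # M-Step: re-estimate q from the responsibility-weighted data.
    Nk = r.sum(axis=0)
    w = Nk / len(data)
    m = (r * data[:, None]).sum(axis=0) / Nk
    s2 = (r * (data[:, None] - m) ** 2).sum(axis=0) / Nk
```

The estimated means approach the true cluster centres, which is the "estimate individual Gaussians separately" step made soft by the responsibilities.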

### MOG 2 Components

Face model component priors: 0.4999 and 0.5001.

The face model and non-face model have divided the data into two clusters. In each case, these clusters have roughly equal weights.

The primary thing that these seem to have captured is the photometric (luminance) variation.

Note that the standard deviations have become smaller than for the single Gaussian model as any given data point is likely to be close to one mean or the other.

[Figure: face and non-face model parameters – prior, mean image, and standard-deviation image per component. Non-face model component priors: 0.5325 and 0.4675.]

### Results for MOG 2 Model

Performance improves relative to a single Gaussian model, although it is not dramatic.

We have a better description of the data likelihood.

[Figure: ROC curves for the uniform, diagonal, and MOG 2 models.]

### MOG 5 Components

Face model component priors: 0.0988, 0.1925, 0.2062, 0.2275, 0.1575.

Non-face model component priors: 0.1737, 0.2250, 0.1950, 0.2200, 0.1863.

[Figure: mean and standard-deviation images per component for both models.]

### MOG 10 Components

Face model component priors: 0.0075, 0.1425, 0.1437, 0.0988, 0.1038, 0.1187, 0.1638, 0.1175, 0.1038, 0.0000.

Non-face model component priors: 0.1137, 0.0688, 0.0763, 0.0800, 0.1338, 0.1063, 0.1063, 0.1263, 0.0900, 0.0988.

[Figure: mean and standard-deviation images per component for both models.]

### Results for MOG 10 Model

Performance improves slightly more, particularly at low false alarm rates.

What if we move to an infinite number of Gaussians?

[Figure: ROC curves for the uniform, diagonal, MOG 2, and MOG 10 models.]

### Dense Object Recognition

4. Subspace models: factor analysis

### Factor Analysis: Intuitions

Consider putting the means of the Gaussian mixture components all on a line and forcing their diagonal covariances to be identical.

What happens if we keep adding more and more Gaussians along this line? In the limit, the hidden variable h becomes continuous.

[Figure: Gaussians with means spaced along a line in (Pixel 1, Pixel 2) space, indexed by the hidden variable h = …, -2, -1, 0, 1, 2, …; marginalizing over h gives the combined density.]

Now consider weighting the constituent Gaussians. If the weights decrease with distance from a central point m, we can get something like an oriented Gaussian.

[Figure: Gaussian components placed along the direction f through the point m; weighting the components and marginalizing over the continuous hidden variable h yields a Gaussian elongated along f.]

### Factor Analysis Maths

Weight the components by another Gaussian distribution over the continuous hidden variable h, with mean 0 and variance 1:

Pr(h) = Norm_h[0, 1]
Pr(x|h) = Norm_x[m + fh, S]
Pr(x) = ∫ Pr(x|h) Pr(h) dh = Norm_x[m, ffᵀ + S]

• This integral does actually evaluate to a new Gaussian, whose principal axis is oriented along the line given by m + kf.
• This is not obvious!
• The line along which the Gaussians are placed is termed a subspace.
• Since h was just a number and there was one column in f, it was a one-dimensional subspace.
• This is not necessarily the case, though dh < dx always holds.

### Factor Analysis Maths

For a general subspace of dh dimensions in a larger space of size dx:

• F has dh columns, each of length dx – these are termed factors.
• They are basis vectors that span the subspace.
• h now weights these basis vectors to define a position in the subspace.
• Concrete example: a 2D subspace in a 3D space.
• F will contain two 3D vectors in its columns, spanning a planar subspace.
• h determines the weighting of these vectors, and hence the position on the plane.

### A Generative View

• We have considered factor analysis as an infinite mixture of Gaussians, but there are other ways to think about it.

• Consider a rule for creating new data points xi

• Created from some smaller underlying random variables hi

• To generate:
• Multiply the hidden variable by the factors, F
• Add a random noise component ei with diagonal covariance S

[Graphical model: hidden variable h generates observed variable x.]

### A Generative View

• Multiply by the factors, F
• Add a random noise component ei with diagonal covariance S

xi = m + F hi + ei

[Figure: points hi in the two-dimensional hidden space map through the factors, plus noise, to points xi in the three-dimensional observed space.]
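Sampling from this generative model is direct to sketch; the sizes (dx = 3, dh = 2), the mean, the factors, and the noise levels are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

d_x, d_h = 3, 2                   # observed and hidden dimensionality
m = np.array([1.0, 2.0, 3.0])     # mean
F = rng.normal(size=(d_x, d_h))   # factors: one length-d_x column per hidden dim
S = np.diag([0.1, 0.2, 0.3])      # diagonal noise covariance

def generate(n):
    h = rng.normal(size=(n, d_h))                        # h ~ Norm[0, I]
    e = rng.normal(size=(n, d_x)) * np.sqrt(np.diag(S))  # e ~ Norm[0, S]
    return m + h @ F.T + e                               # x = m + F h + e

X = generate(100_000)
```

The sample covariance of X approaches FFᵀ + S, the marginal covariance of the model.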

### A Generative View

• Multiply by the factors, F
• Add a random noise component ei with diagonal covariance S

Equivalent description:

Pr(h) = Norm_h[0, I]
Pr(x|h) = Norm_x[m + Fh, S]

Joint distribution (marginalize over h to get Pr(x)):

Pr(x, h) = Pr(x|h) Pr(h), so Pr(x) = ∫ Pr(x|h) Pr(h) dh = Norm_x[m, FFᵀ + S]

### Factor Analysis Parameter Count

For a general subspace of dh dimensions in a larger space of size dx.

• Factor analysis covariance has:

• dh·dx parameters in the factor matrix, F
• dx parameters in the covariance, S

This gives a total of dx(dh + 1) parameters.

If dh is reasonably small and dx is large, then this is much less than the full covariance, which has dx(dx + 1)/2 parameters.

It is a reasonable assumption that an ensemble of images (like faces) genuinely lies largely within a subspace of the very high-dimensional image space, so this is not a bad model.

• But given some data, how do we estimate F, S, and m?
• Unfortunately, to do this, we will need some more maths!
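The saving is easy to make concrete (dx = 10,800 as before; dh = 10 is an illustrative choice):

```python
# Parameter count for factor analysis vs a full covariance matrix.
d_x, d_h = 10_800, 10

fa_params = d_x * d_h + d_x       # factors F plus diagonal S: d_x(d_h + 1)
full_cov = d_x * (d_x + 1) // 2   # full covariance: d_x(d_x + 1)/2
```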

### Dense Object Recognition

Interlude: Gaussian and Matrix Identities

### Multivariate Normal Distribution

Multivariate generalization of the 1D Gaussian or Normal distribution. Depends on a mean vector m and a (symmetric, positive-definite) covariance matrix S. The multivariate normal distribution has pdf:

Pr(x) = (2π)^(-n/2) |S|^(-1/2) exp[-0.5 (x - m)ᵀ S⁻¹ (x - m)]

where n is the dimensionality of the space under consideration.
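A direct sketch of evaluating this density (numpy only; the test point and parameters are arbitrary):

```python
import numpy as np

def mvn_pdf(x, m, S):
    """Multivariate normal density with mean m and covariance S."""
    n = x.size
    diff = x - m
    norm_const = (2 * np.pi) ** (-n / 2) * np.linalg.det(S) ** -0.5
    return float(norm_const * np.exp(-0.5 * diff @ np.linalg.solve(S, diff)))

# At the mean of a standard 2-D Gaussian the density is 1 / (2 pi).
p = mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2))
```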

### Gaussian Identity #1:Multiplication of Gaussians

Property: when we multiply two Gaussian distributions in the same variable (common when applying Bayes' rule), the resulting distribution is also Gaussian. In particular:

Norm_x[a, A] · Norm_x[b, B] = κ · Norm_x[c, C]

where:

C = (A⁻¹ + B⁻¹)⁻¹
c = C(A⁻¹a + B⁻¹b)

The normalization constant κ is also Gaussian in either a or b. Intuitively you can see that the product must be a Gaussian, as each of the original Gaussians has an exponent that is quadratic in x; when we multiply the two Gaussians, we add the exponents, giving another quadratic.

Proof sketch: multiply out the two exponents and remove the terms that do not depend on x, placing them in a constant k. The quadratic term shows that this is a Gaussian with covariance C = (A⁻¹ + B⁻¹)⁻¹; completing the square and re-arranging gives the mean c, as required.
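The identity is easy to verify numerically in 1-D, where C = (1/A + 1/B)⁻¹ and c = C(a/A + b/B); the particular numbers are arbitrary:

```python
import numpy as np

a, A = 1.0, 2.0    # first Gaussian: mean a, variance A
b, B = -0.5, 0.5   # second Gaussian: mean b, variance B

C = 1.0 / (1.0 / A + 1.0 / B)
c = C * (a / A + b / B)

def norm(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

xs = np.linspace(-10.0, 10.0, 100_001)
product = norm(xs, a, A) * norm(xs, b, B)
product /= np.trapz(product, xs)   # renormalize away the constant kappa
```

After dividing out the constant κ, the product matches the predicted Gaussian Norm[c, C] pointwise.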

### Gaussian Identity #2

Consider a Gaussian in x with a mean that is a linear function, Hy, of y. We can re-arrange to express this in terms of a Gaussian in y:

Norm_x[Hy, S] = κ · Norm_y[(HᵀS⁻¹H)⁻¹HᵀS⁻¹x, (HᵀS⁻¹H)⁻¹]

Proof sketch: looking at the quadratic term in y, it resembles the quadratic term of a Gaussian in y with covariance (HᵀS⁻¹H)⁻¹; completing the square and re-arranging gives the result, as required.

### Matrix Identity #1

Consider the d x d matrix P, the k x k matrix R, and the k x d matrix H, where P and R are symmetric positive-definite covariance matrices. The following equality holds:

(P⁻¹ + HᵀR⁻¹H)⁻¹HᵀR⁻¹ = PHᵀ(HPHᵀ + R)⁻¹

Proof: note that HᵀR⁻¹(HPHᵀ + R) = HᵀR⁻¹HPHᵀ + Hᵀ = (P⁻¹ + HᵀR⁻¹H)PHᵀ. Pre-multiplying both sides by (P⁻¹ + HᵀR⁻¹H)⁻¹ and post-multiplying by (HPHᵀ + R)⁻¹ gives the result.

### Matrix Identity 2: The Matrix Inversion Lemma

Consider the d x d matrix P, the k x k matrix R, and the k x d matrix H, where P and R are symmetric positive-definite covariance matrices. The following equality holds:

(P⁻¹ + HᵀR⁻¹H)⁻¹ = P − PHᵀ(HPHᵀ + R)⁻¹HP

This is known as the Matrix Inversion Lemma.

Proof: multiply the right-hand side by (P⁻¹ + HᵀR⁻¹H):

(P − PHᵀ(HPHᵀ + R)⁻¹HP)(P⁻¹ + HᵀR⁻¹H)
= I + PHᵀR⁻¹H − PHᵀ(HPHᵀ + R)⁻¹(R + HPHᵀ)R⁻¹H
= I + PHᵀR⁻¹H − PHᵀR⁻¹H = I

as required.
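Both identities can be checked numerically on random symmetric positive-definite matrices (the sizes d = 4, k = 2 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

d, k = 4, 2
A0 = rng.normal(size=(d, d)); P = A0 @ A0.T + d * np.eye(d)   # SPD, d x d
B0 = rng.normal(size=(k, k)); R = B0 @ B0.T + k * np.eye(k)   # SPD, k x k
H = rng.normal(size=(k, d))

inv = np.linalg.inv

# Matrix Identity #1.
lhs1 = inv(inv(P) + H.T @ inv(R) @ H) @ H.T @ inv(R)
rhs1 = P @ H.T @ inv(H @ P @ H.T + R)

# Matrix Inversion Lemma.
lhs2 = inv(inv(P) + H.T @ inv(R) @ H)
rhs2 = P - P @ H.T @ inv(H @ P @ H.T + R) @ H @ P
```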

### Maths Review

1.–4. (The multivariate normal pdf and Gaussian identities #1 and #2 from the preceding slides.)

Consider the d x d matrix P, the k x k matrix R, and the k x d matrix H, where P and R are symmetric positive-definite covariance matrices. Then:

5.–6. (Matrix identities #1 and #2 from the preceding slides.)

### Dense Object Recognition

(returning to)

4. Subspace models: factor analysis

### Learning Factor Analysis Models

GOAL: Given a data set x1...N, estimate factor analysis model parameters q = {m,F,S}.

Let's make life somewhat easier: it is fairly obvious that the maximum likelihood estimate of the mean is just the mean of the training data, m = (1/N) Σi xi.

We'll use this estimate, and subtract the mean from each of the training vectors, to make a slightly simpler zero-mean generative model.

### Learning Factor Analysis Models

Goal: Learn the parameters defining the model, q = {F, S}.

Problem: Hard to estimate parameters q since we don’t know the latent identity vectors, h.

Method: Expectation Maximization (EM) algorithm. Alternately perform E-Step and M-Step until convergence:

• E-STEP: Calculate the posterior distribution over the latent identity variable, Pr(h|x,q)

• M-STEP: Maximize the likelihood of the parameters q using expected values of h

### Learning: E-Step

After mean subtraction, the generative model can be expressed as:

Pr(h) = Norm_h[0, I]
Pr(x|h) = Norm_x[Fh, S]

[Graphical model: hidden dimensions h generate observed dimensions x.]

### Learning: E-Step

In the E-Step, we use Bayes' rule to find the distribution for the identity vector h given the observed data, x:

Pr(h|x) = Pr(x|h) Pr(h) / ∫ Pr(x|h) Pr(h) dh

In this simple subspace model, both of the terms in the denominator are Gaussian, so this posterior probability for h can be calculated in closed form.

[Graphical model: probabilistic inversion via Bayes' rule – infer hidden h from observed x.]

### Learning: E-Step

Let's consider just the numerator of this expression, since the denominator is just a scaling constant:

Pr(h|x) ∝ Pr(x|h) Pr(h) = Norm_x[Fh, S] · Norm_h[0, I]

Now apply Gaussian Relation #2 to the first term,

Norm_x[Fh, S] = κ · Norm_h[(FᵀS⁻¹F)⁻¹FᵀS⁻¹x, (FᵀS⁻¹F)⁻¹]

to give:

Pr(h|x) ∝ Norm_h[(FᵀS⁻¹F)⁻¹FᵀS⁻¹x, (FᵀS⁻¹F)⁻¹] · Norm_h[0, I]

### Learning: E-Step

Notice that we have a Gaussian times a Gaussian in the same variable here – this must make a Gaussian result. To find the mean and covariance of this, we use Gaussian Relation #1:

Pr(h|x) = Norm_h[(FᵀS⁻¹F + I)⁻¹FᵀS⁻¹x, (FᵀS⁻¹F + I)⁻¹]

### Learning: E-Step

This distribution has first and second moments given by:

E[hi] = (FᵀS⁻¹F + I)⁻¹FᵀS⁻¹xi
E[hi hiᵀ] = (FᵀS⁻¹F + I)⁻¹ + E[hi] E[hi]ᵀ

We can reformulate these terms using our two matrix relations.

### Learning: E-Step

We can reformulate these terms using our two matrix relations:

Fᵀ(FFᵀ + S)⁻¹ = (FᵀS⁻¹F + I)⁻¹FᵀS⁻¹   (Matrix Identity #1, with P = I)
I − Fᵀ(FFᵀ + S)⁻¹F = (FᵀS⁻¹F + I)⁻¹   (Matrix Inversion Lemma, with P = I)

Why should we bother to do this? Well, the bracketed matrices on the left are dx x dx, whereas the matrices on the right are only dh x dh – much cheaper to invert, since dh << dx and S is diagonal.


### Learning: M-Step

The objective function is the expected joint log likelihood of the latent variables and the data, Σi E[log Pr(xi, hi | q)].

Take derivatives of this log likelihood, set them to zero, and solve for the parameters q, substituting in the expected values of h:

F = (Σi xi E[hi]ᵀ)(Σi E[hi hiᵀ])⁻¹
S = (1/N) diag[Σi (xi xiᵀ − F E[hi] xiᵀ)]

[Figure: learned face and non-face model parameters – mean m, noise covariance S, and factors F1, F2, visualized as the images m + 2F1 and m + 2F2.]
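The full E-Step/M-Step loop can be sketched on synthetic zero-mean data. The toy sizes and ground-truth parameters are illustrative; the updates are the standard factor analysis EM equations:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic zero-mean data from a ground-truth factor analysis model
# (toy sizes; the real face model would have d_x = 10,800).
d_x, d_h, N = 6, 2, 5_000
F_true = rng.normal(size=(d_x, d_h))
S_true = np.diag(rng.uniform(0.1, 0.3, size=d_x))
X = (rng.normal(size=(N, d_h)) @ F_true.T
     + rng.normal(size=(N, d_x)) * np.sqrt(np.diag(S_true)))

def log_likelihood(X, F, S):
    C = F @ F.T + S   # marginal covariance of x is F F^T + S
    _, logdet = np.linalg.slogdet(C)
    quad = np.einsum('ni,ij,nj->', X, np.linalg.inv(C), X)
    return -0.5 * (len(X) * (X.shape[1] * np.log(2 * np.pi) + logdet) + quad)

F = rng.normal(size=(d_x, d_h))   # initialize q = {F, S}
S = np.diag(X.var(axis=0))
lls = []
for _ in range(50):
    # E-Step: posterior moments of h for every datum (only d_h x d_h inverses).
    Sinv = np.linalg.inv(S)
    A = np.linalg.inv(F.T @ Sinv @ F + np.eye(d_h))   # posterior covariance
    Eh = X @ (Sinv @ F @ A)                           # E[h_i], one per row of X
    Ehh = N * A + Eh.T @ Eh                           # sum_i E[h_i h_i^T]

    # M-Step: maximize the expected log likelihood w.r.t. F and S.
    F = (X.T @ Eh) @ np.linalg.inv(Ehh)
    S = np.diag(np.diag(X.T @ X - F @ (Eh.T @ X)) / N)

    lls.append(log_likelihood(X, F, S))
```

The log likelihood never decreases across iterations, which is a useful sanity check on any EM implementation.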

### Factor Analysis Performance

Can calculate factor analysis performance for face detection in terms of a receiver operating characteristic (ROC) curve.

[Figure: learned parameters m, S, and factors F1–F5, visualized as the images m + 2F1 … m + 2F5, for the face and non-face models.]

[Figure: ROC curve for the factor analysis model; samples from the learned 10-parameter model, generated by multiplying by the factors F and adding a random noise component with diagonal covariance S.]

### Rotational Ambiguity

Factors are ambiguous up to a rotation: for any orthogonal matrix R (RRᵀ = I), the rotated factors F' = FR give the same marginal covariance, since F'F'ᵀ + S = FRRᵀFᵀ + S = FFᵀ + S.

There is an infinite set of equivalent models, each of which has the same probability.

### Non-Linear Extensions 1

Mixture of factor analyzers (MOFA)

• Two levels of the EM algorithm

• One to learn each factor analyzer

• One to learn the mixture model

• Learning subject to local minima

• Can describe quite complex manifold structures in high dimensions with only a limited number of parameters


### Non-linear Extensions 2

Gaussian Process Latent Variable Models

• Non-linear version of factor analysis

• Still a latent space, but now function mapping latent to observed space is nonlinear

• Learning subject to local minima


### Dense Object Recognition

7. Relationship to non-probabilistic methods

### Factor Analysis and PCA

• Factor analysis is very closely related to another common technique in computer vision: principal component analysis (PCA).

• Motivation of PCA is quite different from that for factor analysis.

• It is not probabilistic

• It is primarily concerned with dimensionality reduction

• Dimensionality Reduction

• Consider the hidden space as a smaller set of numbers that can approximately describe the image.

### Dimensionality Reduction

A face x' is approximated as the mean plus a weighted sum of the factors:

x' = m + h1f1 + h2f2 + h3f3 + …

[Figure: hidden variables (h1, h2, h3) in the low-dimensional hidden space map to approximate reconstructions in the observed space.]

• The face is approximately represented by the weighted sum of the factors.
• h (low dimensional) can be used as a proxy for x (high dimensional).

### Principal Components Analysis

• KEY IDEAS:

• Describe data as multivariate Gaussian

• Project data onto axes of this Gaussian with largest variance

• Discard all but the largest few dimensions

• Finds a small set of numbers that describes as much of the variance in the dataset as possible (dimensionality reduction).

[Figure: a Gaussian fitted to 2D data (x1, x2); the principal axes x'1, x'2 have standard deviations s1 and s2.]

### Fitting a Gaussian

• The mean and covariance matrix of the data define a Gaussian model:
• Mean: m = (1/N) Σi xi
• Covariance: S = (1/N) Σi (xi − m)(xi − m)ᵀ

### Eigen-Decomposition

As before, we break down this covariance matrix into the product of three other matrices:

S = U L Uᵀ

where U is a rotation matrix that transforms the principal axes of the fitted Gaussian back to the original co-ordinate system, and L is a diagonal matrix of variances along those axes.

### Eigenvector Decomposition

• If S is an m x m covariance matrix, there exist m linearly independent eigenvectors, and all the corresponding eigenvalues are non-negative.

• We can decompose S as S = U L Uᵀ, where the columns of U are the eigenvectors and L is diagonal, holding the corresponding eigenvalues.

### Principal Component Analysis

• Compute the eigenvectors of the covariance, S

• Eigenvectors: the main directions

• Eigenvalues: the variance along each eigenvector
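These steps can be sketched on synthetic correlated 2-D data (the data-generating choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Correlated 2-D data: the two coordinates move together.
z = rng.normal(size=1_000)
X = np.stack([z + 0.1 * rng.normal(size=1_000),
              z + 0.1 * rng.normal(size=1_000)], axis=1)

# Fit the Gaussian: mean and covariance of the data.
m = X.mean(axis=0)
S = np.cov(X.T)

# Eigen-decomposition S = U L U^T; eigh returns ascending eigenvalues.
eigvals, U = np.linalg.eigh(S)
eigvals, U = eigvals[::-1], U[:, ::-1]   # largest-variance direction first

# Project onto the principal axes and keep only the largest dimension.
h = (X - m) @ U[:, :1]
```

Almost all of the variance lies along the first eigenvector, so the 1-D coordinates h reconstruct the 2-D data well.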

### Dimensionality Reduction

• Co-ordinates are often correlated

• Nearby points move together

### Dimensionality Reduction

• Data lies in a subspace of reduced dimension.

• However, for some p, the eigenvalues beyond the p-th are negligible, so only the first p directions carry significant variance.

### Approximation

• Each element of the data can then be written as a weighted sum of the first p eigenvectors: x ≈ m + Σj=1..p hj uj, with hj = ujᵀ(x − m).

### Comparison of PCA and Factor Analysis

• Factor analysis gives a probability model; PCA does not

• Factor analysis has a separate noise parameter for each dimension

• Factors are of arbitrary length, but principal components have length 1