# Fisher kernels for image representation & generative classification models

## Fisher kernels for image representation & generative classification models

Jakob Verbeek

December 11, 2009

### Plan for this course

• Introduction to machine learning

• Clustering techniques

• k-means, Gaussian mixture density

• Gaussian mixture density continued

• Parameter estimation with EM

• Classification techniques 1

• Introduction, generative methods, semi-supervised

• Fisher kernels

• Classification techniques 2

• Discriminative methods, kernels

• Decomposition of images

• Topic models, …

### Classification

Training data consists of “inputs”, denoted x, and corresponding output “class labels”, denoted as y.

Goal is to correctly predict for a test data input the corresponding class label.

Learn a “classifier” f(x) from the training data that outputs the class label or a probability over the class labels.

Example:

Input: image

Output: category label, e.g. “cat” vs. “no cat”

Classification can be binary (two classes), or over a larger number of classes (multi-class).

In binary classification we often refer to one class as “positive”, and the other as “negative”

A binary classifier creates boundaries in the input space between the areas assigned to each class

### Example of classification

Given: training images and their categories

What are the categories of these test images?

### Discriminative vs generative methods

Generative probabilistic methods

Model the density of inputs x from each class p(x|y)

Estimate class prior probability p(y)

Use Bayes’ rule to infer distribution over class given input

Discriminative (probabilistic) methods

Directly estimate class probability given input: p(y|x)

Some methods do not have a probabilistic interpretation, e.g. they fit a function f(x), and assign to class 1 if f(x) > 0 and to class 2 if f(x) < 0

### Generative classification methods

Generative probabilistic methods

Model the density of inputs x from each class p(x|y)

Estimate class prior probability p(y)

Use Bayes’ rule to infer distribution over class given input

Modeling class-conditional densities over the inputs x

Selection of model class:

Parametric models: such as Gaussian (for continuous), Bernoulli (for binary), …

Semi-parametric models: mixtures of Gaussian, Bernoulli, …

Non-parametric models: Histograms over one-dimensional, or multi-dimensional data, nearest-neighbor method, kernel density estimator

Given class conditional model, classification is trivial: just apply Bayes’ rule

Adding new classes can be done by adding a new class conditional model

Existing class conditional models stay as they are
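As a minimal sketch of the "just apply Bayes' rule" step, the function below turns class-conditional densities evaluated at one input and class priors into a posterior, p(y|x) = p(x|y) p(y) / Σ_y' p(x|y') p(y'). The function name and the numbers are hypothetical and only serve as illustration.

```python
def class_posterior(likelihoods, priors):
    """Return p(y|x) for every class via Bayes' rule.

    likelihoods: dict class -> p(x|y) evaluated at the test input x
    priors:      dict class -> p(y)
    """
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # p(x) = sum_y p(x|y) p(y)
    return {c: joint[c] / evidence for c in joint}


# Hypothetical values: the rarer class has the higher density at x
posterior = class_posterior({"cat": 0.02, "no cat": 0.005},
                            {"cat": 0.2, "no cat": 0.8})
```

Adding a class only adds one entry to each dictionary; the entries of the existing classes are untouched.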

### Histogram methods

Suppose we

have N data points

use a histogram with C cells

How to set the density level in each cell?

Maximum (log-)likelihood estimator: density in a cell is $p = n / (N V)$

Proportional to the number of points n in the cell

Inversely proportional to the volume V of the cell

Problems with histogram method:

Number of cells scales exponentially with the dimension of the data

Discontinuous density estimate

How to choose cell size?
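The histogram estimator can be sketched in a few lines for one-dimensional data. This is an illustrative version assuming equal-sized cells and data that lies inside the chosen range; the function name and arguments are not from the slides.

```python
def histogram_density(data, low, high, num_cells):
    """Maximum-likelihood histogram density on [low, high) in one dimension.

    Returns the density level of each cell: n / (N * V).
    Assumes all data points fall inside [low, high).
    """
    width = (high - low) / num_cells  # volume V of each equal-sized cell
    counts = [0] * num_cells
    for x in data:
        if low <= x < high:
            # guard against rounding pushing the index past the last cell
            index = min(int((x - low) / width), num_cells - 1)
            counts[index] += 1
    n_total = len(data)
    return [n / (n_total * width) for n in counts]


levels = histogram_density([0.1, 0.2, 0.25, 0.7], low=0.0, high=1.0, num_cells=2)
```

The levels times the cell width sum to one, and the estimate jumps at the cell boundary, which is the discontinuity listed among the problems above.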

### The ‘curse of dimensionality’

Number of bins increases exponentially with the dimensionality of the data.

Fine division of each dimension: many empty bins

Rough division of each dimension: poor density model

Probability distribution of D discrete variables takes at least 2^D values

At least 2 values for each variable

The number of cells may be reduced by assuming independence between the components of x: the naïve Bayes model

Model is “naïve” since it assumes that all variables are independent…

Unrealistic for high dimensional data, where variables tend to be dependent

Poor density estimator

Classification performance can still be good using derived p(y|x)
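The difference in model size can be made concrete with a small count. The sketch below compares a joint histogram with the naïve Bayes factorisation into one-dimensional histograms; the function names and the choice of 10 bins in 10 dimensions are arbitrary illustrations.

```python
def num_cells_joint(bins_per_dim, dims):
    """Cells of a joint histogram: exponential in the dimension."""
    return bins_per_dim ** dims


def num_cells_naive_bayes(bins_per_dim, dims):
    """Cells of D independent one-dimensional histograms: linear in D."""
    return bins_per_dim * dims


joint = num_cells_joint(10, 10)        # 10^10 cells
naive = num_cells_naive_bayes(10, 10)  # 100 cells
```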

### Example of generative classification

Hand-written digit classification

Input: binary 28x28 scanned digit images, collected in a vector of length 784

Desired output: class label of image

Generative model

Independent Bernoulli model for each class

Probability per pixel per class

Maximum likelihood estimator is the average value per pixel per class

Classify using Bayes’ rule:
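A minimal sketch of this classifier, using tiny 4-pixel vectors in place of the 784-pixel digits. The clipping constant `eps` is an added assumption to avoid taking the logarithm of zero; it is not part of the plain maximum likelihood estimator described above.

```python
import math


def fit_bernoulli_nb(images, labels, eps=1e-3):
    """Estimate per-class priors and per-pixel 'on' probabilities.

    images: list of equal-length 0/1 lists; labels: list of class labels.
    eps clips probabilities away from 0 and 1 (assumption, avoids log(0)).
    """
    model = {}
    for c in set(labels):
        rows = [img for img, y in zip(images, labels) if y == c]
        # ML estimate: average pixel value per pixel per class
        theta = [sum(col) / len(rows) for col in zip(*rows)]
        theta = [min(max(t, eps), 1 - eps) for t in theta]
        model[c] = (len(rows) / len(images), theta)
    return model


def classify(model, image):
    """Return the class maximizing log p(y) + sum_d log p(x_d | y)."""
    def log_joint(c):
        prior, theta = model[c]
        return math.log(prior) + sum(
            math.log(t if x else 1 - t) for x, t in zip(image, theta))
    return max(model, key=log_joint)


# Toy 4-pixel "images" standing in for 784-pixel digits
train = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
model = fit_bernoulli_nb(train, ["a", "a", "b", "b"])
predicted = classify(model, [1, 1, 0, 0])
```

Maximizing the log of the joint p(x|y) p(y) gives the same decision as maximizing the posterior, since the normalizer p(x) does not depend on the class.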

### k-nearest-neighbor estimation method

Idea: fix number of samples in the cell, find the right cell size.

Probability to find a point in a sphere A centered on x with volume v is $P = \int_A p(x')\,dx'$

Smooth density approximately constant in small region, and thus $P \approx p(x)\,v$

Alternatively: estimate P from the fraction of training data in a sphere on x: $P \approx k / N$

Combine the above to obtain estimate $p(x) \approx k / (N v)$

### k-nearest-neighbor estimation method

Method in practice:

Choose k

For given x, compute the volume v which contains k samples.

Estimate density with $p(x) \approx k / (N v)$

Volume of a sphere with radius r in d dimensions is $v(r, d) = \pi^{d/2} r^d / \Gamma(d/2 + 1)$

What effect does k have?

Data sampled from mixture of Gaussians plotted in green

Larger k, larger region, smoother estimate

Selection of k

Leave-one-out cross validation

Select k that maximizes data log-likelihood
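The practical recipe can be sketched as follows, assuming Euclidean distance and the standard formula for the volume of a d-dimensional sphere. Function names are illustrative; the estimate is undefined when the k-th neighbor lies at distance zero.

```python
import math


def sphere_volume(radius, dims):
    """Volume of a d-dimensional sphere: pi^(d/2) r^d / Gamma(d/2 + 1)."""
    return math.pi ** (dims / 2) * radius ** dims / math.gamma(dims / 2 + 1)


def knn_density(x, data, k):
    """k-nn density estimate p(x) ~ k / (N v).

    v is the volume of the smallest sphere centered on x holding k samples.
    """
    dists = sorted(math.dist(x, p) for p in data)
    radius = dists[k - 1]  # distance to the k-th nearest sample
    return k / (len(data) * sphere_volume(radius, len(x)))


data = [(0.0,), (1.0,), (2.0,), (4.0,)]
estimate = knn_density((1.0,), data, k=3)
```

For leave-one-out selection of k, one would evaluate `knn_density` at each training point with that point removed from `data`, sum the logarithms, and keep the k with the highest total.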

### k-nearest-neighbor classification rule

Use k-nearest neighbor density estimation to find p(x|category)

Apply Bayes rule for classification: k-nearest neighbor classification

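Find sphere volume v to capture k data points for estimate $p(x) \approx k / (N v)$

Use the same sphere for each class for estimates $p(x|y=c) \approx k_c / (N_c v)$, with $k_c$ of the k points and $N_c$ of the N training points in class c

Estimate global class priors $p(y=c) = N_c / N$

Calculate class posterior distribution $p(y=c|x) = p(x|y=c)\,p(y=c) / p(x) = k_c / k$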
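Because the sphere volume and the class counts cancel in Bayes' rule, the posterior reduces to the fraction of the k nearest neighbors that belong to each class. A minimal sketch, assuming Euclidean distance and hypothetical toy data:

```python
import math
from collections import Counter


def knn_posterior(x, data, labels, k):
    """Class posterior p(y=c | x) = k_c / k from the k nearest neighbors.

    With p(x|c) = k_c / (N_c v) and p(c) = N_c / N, the shared volume v
    and the counts N_c and N cancel in Bayes' rule.
    """
    order = sorted(range(len(data)), key=lambda i: math.dist(x, data[i]))
    votes = Counter(labels[i] for i in order[:k])
    return {c: votes[c] / k for c in set(labels)}


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
labels = ["neg", "neg", "pos", "pos", "pos"]
posterior = knn_posterior((0.2, 0.2), data, labels, k=3)
```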

### k-nearest-neighbor classification rule

Effect of k on classification boundary

Larger number of neighbors

Larger regions

Smoother class boundaries

### Kernel density estimation methods

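Consider a simple estimator of the cumulative distribution function: $F(x) = \frac{1}{N} \sum_{n=1}^{N} [x_n \le x]$

Derivative gives an estimator of the density function, but this is just a set of delta peaks.

Derivative is defined as $p(x) = \lim_{h \to 0} \frac{F(x+h) - F(x-h)}{2h}$

Consider a non-limiting value of h: $p(x) \approx \frac{1}{2hN} \sum_{n=1}^{N} [\,|x - x_n| \le h\,]$

Each data point adds 1/(2hN) in the region within distance h of it, sum of “blocks” gives estimate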

### Kernel density estimation methods

Can use functions other than the “block” function to obtain a smooth estimator.

Widely used kernel function is the (multivariate) Gaussian

Contribution decreases smoothly as a function of the distance to data point.

Choice of smoothing parameter

Larger size of “kernel” function gives smoother density estimator

Use the average distance between samples.

Use cross-validation.

Method can be used for multivariate data

Or in naïve Bayes model
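Both estimators can be sketched for one-dimensional data. The Gaussian version below uses the standard deviation `sigma` as the smoothing parameter; names and data are illustrative assumptions.

```python
import math


def block_kde(x, data, h):
    """'Block' estimate: each point adds 1/(2hN) where |x - x_n| <= h."""
    return sum(1 for xn in data if abs(x - xn) <= h) / (2 * h * len(data))


def gaussian_kde(x, data, sigma):
    """Gaussian-kernel estimate: average of N(x; x_n, sigma^2) over the data."""
    norm = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return sum(norm * math.exp(-0.5 * ((x - xn) / sigma) ** 2)
               for xn in data) / len(data)


data = [0.0, 0.5, 2.0]
block = block_kde(0.25, data, h=0.5)
smooth = gaussian_kde(0.25, data, sigma=0.5)
```

Every training point contributes to `gaussian_kde` at every query, which is why the full training set has to be stored and visited at test time.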

### Summary generative classification methods

(Semi-) Parametric models (e.g. p(data | category) = Gaussian or mixture)

No need to store data, but possibly too strong assumptions on data density

Can lead to poor fit on data, and poor classification result

Non-parametric models

Histograms:

Only practical in low dimensional space (<5 or so)

High dimensional space will lead to many cells, many of which will be empty

Naïve Bayes modeling in higher dimensional cases

K-nearest neighbor & kernel density estimation:

Need to store all training data

Need to find nearest neighbors or points with non-zero kernel evaluation (costly)

[Figure panels: histogram, k-nn, k.d.e.]

### Discriminative vs generative methods

Generative probabilistic methods

Model the density of inputs x from each class p(x|y)

Estimate class prior probability p(y)

Use Bayes’ rule to infer distribution over class given input

Discriminative (probabilistic) methods (next week)

Directly estimate class probability given input: p(y|x)

Some methods do not have a probabilistic interpretation, e.g. they fit a function f(x), and assign to class 1 if f(x) > 0 and to class 2 if f(x) < 0

Hybrid generative-discriminative models

Fit density model to data

Use properties of this model as input for classifier

Example: Fisher vectors for image representation

### Clustering for visual vocabulary construction

Clustering of local image descriptors

using k-means or mixture of Gaussians

Recap of the image representation pipeline

Extract image regions at various locations and scales

Compute descriptor for each region (e.g. SIFT)

(Soft) assignment of each descriptor to clusters

Make histogram for complete image

Summing of vector representations of each descriptor

Input to image classification method

[Figure labels: cluster indexes, image regions]

### Fisher Vector motivation

Feature vector quantization is computationally expensive in practice

Run-time linear in

N: nr. of feature vectors ~ 10^3 per image

D: nr. of dimensions ~ 10^2 (SIFT)

K: nr. of clusters ~ 10^3 for recognition

So in total in the order of 10^8 multiplications per image to assign SIFT descriptors to visual words

We use histogram of visual word counts

Can we do this more efficiently?
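The cost estimate can be checked with a sketch of hard assignment that counts one multiplication per dimension per descriptor-center pair. This is an illustrative brute-force version, not an optimized quantizer; names and data are hypothetical.

```python
def assign_to_visual_words(descriptors, centers):
    """Hard-assign each descriptor to its nearest cluster center.

    Returns the visual-word histogram and the number of multiplications,
    which is N * K * D for N descriptors, K centers, D dimensions.
    """
    histogram = [0] * len(centers)
    multiplications = 0
    for x in descriptors:
        best, best_dist = 0, float("inf")
        for k, c in enumerate(centers):
            dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            multiplications += len(x)  # one squaring per dimension
            if dist < best_dist:
                best, best_dist = k, dist
        histogram[best] += 1
    return histogram, multiplications


descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9)]
centers = [(0.0, 0.0), (1.0, 1.0)]
histogram, cost = assign_to_visual_words(descriptors, centers)
```

With N ~ 10^3, D ~ 10^2 and K ~ 10^3 the same count gives 10^3 · 10^2 · 10^3 = 10^8 multiplications per image.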

Reading material: F. Perronnin and C. Dance (Xerox Research Centre Europe, Meylan), “Fisher Kernels on Visual Vocabularies for Image Categorization”, in CVPR'07

### Fisher vector image representation

MoG / k-means stores nr of points per cell

Need many clusters to represent distribution of descriptors in image

But increases computational cost

Fisher vector adds 1st & 2nd order moments

More precise description of the regions assigned to each cluster

Fewer clusters needed for same accuracy

Representation (2D+1) times larger, at same computational cost

Terms already calculated when computing soft-assignment

q_nk: soft-assignment of image region n to cluster k (Gaussian mixture component)

### Image representation using Fisher kernels

General idea of Fisher vector representation

Fit probabilistic model to data

Use derivative of data log-likelihood as data representation, e.g. for classification

[Jaakkola & Haussler. “Exploiting generative models in discriminative classifiers”, in Advances in Neural Information Processing Systems 11, 1999.]

Here, we use Mixture of Gaussians to cluster the region descriptors

Concatenate derivatives to obtain data representation

### Image representation using Fisher kernels

Extended representation of image descriptors using MoG

Displacement of descriptor from center

Squares of displacement from center

From 1 number per descriptor per cluster, to 1+D+D^2 (D = data dimension)

Simplified version obtained when

Using this representation for a linear classifier

Diagonal covariance matrices, variance in dimensions given by vector v_k

For a single image region descriptor

Summed over all descriptors this gives us

1: Soft count of regions assigned to cluster

D: Weighted average of assigned descriptors

D: Weighted variance of descriptors in all dimensions
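The 1 + D + D numbers per cluster can be sketched as below for a mixture of Gaussians with diagonal covariances. This is a simplified illustration under stated assumptions: it accumulates the soft count, the soft-weighted displacement from the center and the soft-weighted squared displacement, and it omits the scaling and normalization terms of the exact log-likelihood gradients. All names and parameter values are hypothetical.

```python
import math


def soft_assign(x, weights, means, variances):
    """Posterior q_nk of each diagonal-covariance Gaussian for descriptor x."""
    log_p = []
    for w, m, v in zip(weights, means, variances):
        ll = math.log(w)
        for xd, md, vd in zip(x, m, v):
            ll += -0.5 * math.log(2 * math.pi * vd) - 0.5 * (xd - md) ** 2 / vd
        log_p.append(ll)
    top = max(log_p)  # subtract the maximum for numerical stability
    unnorm = [math.exp(l - top) for l in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def fisher_statistics(descriptors, weights, means, variances):
    """Per cluster: soft count, weighted displacement, weighted squared
    displacement, summed over descriptors and concatenated (K * (2D + 1))."""
    num_clusters, dims = len(means), len(means[0])
    counts = [0.0] * num_clusters
    first = [[0.0] * dims for _ in range(num_clusters)]
    second = [[0.0] * dims for _ in range(num_clusters)]
    for x in descriptors:
        q = soft_assign(x, weights, means, variances)
        for k in range(num_clusters):
            counts[k] += q[k]
            for d in range(dims):
                diff = x[d] - means[k][d]
                first[k][d] += q[k] * diff
                second[k][d] += q[k] * diff * diff
    vector = []
    for k in range(num_clusters):
        vector += [counts[k]] + first[k] + second[k]
    return vector


descriptors = [(0.0, 0.1), (0.2, -0.1), (3.0, 3.1)]
vector = fisher_statistics(descriptors, weights=[0.5, 0.5],
                           means=[(0.0, 0.0), (3.0, 3.0)],
                           variances=[(1.0, 1.0), (1.0, 1.0)])
```

The differences between descriptors and centers are already needed for the soft assignment, so the extra statistics reuse work that has been done.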

### Fisher vector image representation

MoG / k-means stores nr of points per cell

Need many clusters to represent distribution of descriptors in image

Fisher vector adds 1st & 2nd order moments

More precise description of the regions assigned to each cluster

Fewer clusters needed for same accuracy

Representation (2D+1) times larger, at same computational cost

Terms already calculated when computing soft-assignment

Computational cost is O(NKD): the difference between every cluster center and every descriptor is needed

### Images from categorization task PASCAL VOC

Yearly “competition” for image classification (also object localization, segmentation, and body-part localization)

### Fisher Vector: results

BOV-supervised learns a separate mixture model for each image class, so that some of the visual words are class-specific

MAP: assign image to class for which the corresponding MoG assigns maximum likelihood to the region descriptors

Other results: based on linear classifier of the image descriptions

Similar performance, using 16x fewer Gaussians

Unsupervised/universal representation performs well
