# EE 4780 - PowerPoint PPT Presentation



### EE 4780

Pattern Classification

Classification Example

Goal: Automatically classify incoming fish according to species, and send to respective packing plants.

Features: Length, width, color, brightness, etc.

Model: Sea bass have some typical length, and it is greater than that for salmon.

Classifier: If the fish is longer than a value, l*, classify it as sea bass.

Training Samples: To choose l*, make length measurements from training samples and inspect the results.
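One way to "inspect the results" is to pick the threshold that makes the fewest mistakes on the training samples. The sketch below assumes exactly that criterion; the length values are made up purely for illustration, and the names `training_errors` and `choose_threshold` are hypothetical.

```python
# Hypothetical training lengths (arbitrary units), made up for illustration only.
salmon_lengths = [4.0, 5.5, 6.0, 6.5, 7.0, 8.5]
sea_bass_lengths = [6.8, 8.0, 9.0, 9.5, 10.0, 11.5]


def training_errors(threshold, salmon, sea_bass):
    """Count training fish misclassified by 'longer than threshold => sea bass'."""
    salmon_wrong = sum(1 for length in salmon if length > threshold)
    sea_bass_wrong = sum(1 for length in sea_bass if length <= threshold)
    return salmon_wrong + sea_bass_wrong


def choose_threshold(salmon, sea_bass):
    """Try each observed length as a candidate l* and keep the one with fewest errors."""
    candidates = sorted(salmon + sea_bass)
    return min(candidates, key=lambda t: training_errors(t, salmon, sea_bass))


l_star = choose_threshold(salmon_lengths, sea_bass_lengths)
print("l* =", l_star, "training errors =",
      training_errors(l_star, salmon_lengths, sea_bass_lengths))
```

Because the two length distributions overlap, no threshold separates the toy samples perfectly, which is the motivation for adding more features.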

Classification Example

Decision boundary

Now we have two features to classify the fish: the lightness x1 and the width x2.

Feature vector: x = [x1 x2]’.

The feature extractor reduces the image of a fish to a feature vector x in a 2D feature space.

Feature Extraction
• The goal of the feature extractor is to characterize an object to be recognized by measurements whose values are very similar for objects in the same category, and very different for objects in different categories.
• The features should be invariant to the irrelevant transformation of the input. For example, the location of a fish on the belt is irrelevant, and thus the representation should be insensitive to the location of the fish.
Classification
• The task of the classifier is to use feature vectors (provided by the feature extractor) to assign the object to a category.
• Perfect classification is often impossible; a more general task is to determine the probability for each of the possible categories.
• The process of using data to determine the classifier is referred to as training the classifier.

[Figure: block diagram. Raw data → feature extractor → feature values x1, x2, …, xd → classifier → class (1 or 2 or … or c).]

Classical Model
• We measure a fixed set of d features for an object that we want to classify.
• For example,
• x1 = height
• x2 = perimeter
• ...
• xd = average pixel intensity

Feature Vectors
• We can think of our feature set as a feature vector x, where x is the d-dimensional column vector

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}$$

• We can think of x as being a point in a d-dimensional feature space.
• By this process of feature measurement, we can represent an object as a point in feature space.

[Figure: a feature vector x drawn as a point in feature spaces with axes x1, x2 and x1, x2, x3.]

• Template matching
• Minimum-distance classifiers
• Metrics
• Inner products
• Linear discriminants
• Bayesian approach
Template Matching
• To classify one of the noisy characters, simply compare it to the two ‘templates’ on the left
• Comparison can be done in many ways - here are two:
• Count the number of places where the template and pattern agree. Pick the class that has the maximum number of agreements.
• Count the number of places where the template and pattern disagree. Pick the class that has the smallest number of disagreements.
• This may not work well when there is rotation, scaling, warping, occlusion, etc.
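The first comparison rule (count agreements, pick the maximum) can be sketched as follows. The 3×3 binary bitmaps and the labels "T" and "L" are hypothetical stand-ins for the character templates on the slide.

```python
def count_agreements(pattern, template):
    """Number of pixel positions where pattern and template have the same value."""
    return sum(
        1
        for p_row, t_row in zip(pattern, template)
        for p, t in zip(p_row, t_row)
        if p == t
    )


def classify_by_template(pattern, templates):
    """Return the label of the template with the most agreements."""
    return max(templates, key=lambda label: count_agreements(pattern, templates[label]))


# Hypothetical noise-free templates.
templates = {
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
}

# A "T" with one pixel flipped by noise.
noisy_t = [[1, 1, 1],
           [0, 1, 0],
           [0, 1, 1]]

print(classify_by_template(noisy_t, templates))
```

Counting disagreements and picking the minimum gives the same decision, since agreements plus disagreements always equals the number of pixels.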

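[Figure: template-matching comparison of two patterns f and g; one of the comparison measures is labeled "most popular".]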

Template Matching

Question: How can we achieve rotation invariance?

Minimum Distance Classifiers
• Template matching can be expressed mathematically through a notion of distance.
• Let x be the feature vector for the unknown input, and let m1, m2, ..., mc be templates (i.e., perfect, noise-free feature vectors) for the c classes.
• The error in matching x against mk is given by || x - mk ||.
• Choose the class for which the error is a minimum.
• Since || x - mk || is the distance from x to mk, the technique is called minimum distance classification.
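A minimal sketch of this rule, assuming the Euclidean norm for || x - mk ||. The template values and the function names are hypothetical.

```python
import math


def euclidean_distance(x, m):
    """|| x - m || for two feature vectors of equal length."""
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, m)))


def minimum_distance_classify(x, templates):
    """Return the index k of the template m_k with the smallest matching error."""
    distances = [euclidean_distance(x, m) for m in templates]
    return distances.index(min(distances))


# Hypothetical templates for c = 3 classes in a 2D feature space.
templates = [[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
x = [3.0, 1.0]

print(minimum_distance_classify(x, templates))  # index into `templates`
```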

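[Figure: block diagram. The input x feeds c distance computations, one per template m1, m2, …, mc; a minimum selector outputs the class with the smallest distance.]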

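Minimum Distance Classifiers

[Figure: templates m1, m2, m3 and an input x in feature space, with the difference vector a = x - m1.]

Two common ways to measure the length of a:

• Euclidean distance: $\| a \| = \sqrt{\sum_{k=1}^{d} a_k^2}$
• “Sum of absolute values”: $\| a \| = \sum_{k=1}^{d} | a_k |$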

Euclidean Distance
• x is a column vector of d features, x1, x2, ... , xd:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}$$

• By using the transpose operator ' we can convert the column vector x to the row vector x':

$$x' = [x_1, x_2, \ldots, x_d]$$

• The inner product of two column vectors x and y is defined by

$$x' y = x_1 y_1 + x_2 y_2 + \cdots + x_d y_d = \sum_{k=1}^{d} x_k y_k$$

• Thus the norm of x (using the Euclidean metric) is given by

|| x || = sqrt( x' x )

Inner Products
• Important additional properties of inner products:
• x' y = y' x = || x || || y || cos( angle between x and y )
• x' ( y + z ) = x' y + x' z .
• The inner product of x and y is maximum when the angle between them is zero, i.e., when one is just a positive multiple of the other.
• Sometimes we say
• that x' y is the correlation between x and y, and
• that the correlation is maximum when x and y point in the same direction.
• If x' y = 0, the vectors x and y are said to be orthogonal or uncorrelated.
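These properties are easy to check numerically. The helper names and the example vectors below are hypothetical and chosen so that the arithmetic is exact.

```python
import math


def inner(x, y):
    """Inner product x'y of two vectors of equal length."""
    return sum(xk * yk for xk, yk in zip(x, y))


def norm(x):
    """Euclidean norm || x || = sqrt(x'x)."""
    return math.sqrt(inner(x, x))


def cos_angle(x, y):
    """cos(angle between x and y), from x'y = ||x|| ||y|| cos(angle)."""
    return inner(x, y) / (norm(x) * norm(y))


x = [3.0, 4.0]
y = [6.0, 8.0]    # positive multiple of x: same direction
z = [-4.0, 3.0]   # perpendicular to x

print(cos_angle(x, y))  # maximum correlation: the angle is zero
print(inner(x, z))      # orthogonal (uncorrelated) vectors
```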

Minimum Distance Classifiers

Example: Let m1=[4.3 1.3]’ and m2=[1.5 0.3]’. Find the decision boundary.
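One way to work this example: the boundary is the set of points equidistant from m1 and m2. Expanding ||x - m1||² = ||x - m2||² and cancelling x'x gives the line (m1 - m2)' x = 0.5 (||m1||² - ||m2||²), which for these templates is approximately 2.8 x1 + 1.0 x2 = 8.92. The sketch below computes the same coefficients; the function name is hypothetical.

```python
def decision_boundary(m1, m2):
    """Return (w, b) so that the boundary between m1 and m2 is the line w'x = b."""
    w = [a - c for a, c in zip(m1, m2)]
    b = 0.5 * (sum(a * a for a in m1) - sum(c * c for c in m2))
    return w, b


m1 = [4.3, 1.3]
m2 = [1.5, 0.3]
w, b = decision_boundary(m1, m2)
print(w, b)

# Sanity check: the midpoint between the two templates lies on the boundary.
midpoint = [(a + c) / 2 for a, c in zip(m1, m2)]
print(sum(wi * xi for wi, xi in zip(w, midpoint)))
```

Points with w'x > b are closer to m1, and points with w'x < b are closer to m2.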

Linear Discriminants
• For the minimum distance classifier, we chose the nearest class.
• Use the inner product to express the Euclidean distance from x to mk:

$$\| x - m_k \|^2 = (x - m_k)'(x - m_k) = -2 \left[ m_k' x - 0.5 \, m_k' m_k \right] + x' x$$

• The factor -2 and the term x' x are constant, i.e., the same for every class k. To find the template mk which minimizes ||x-mk||, it is therefore sufficient to find the mk which maximizes the bracketed term above.
• Define the linear discriminant function gk(x) as

$$g_k(x) = m_k' x - 0.5 \, \| m_k \|^2$$
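A short sketch showing that picking the largest discriminant m_k'x - 0.5 ||m_k||² gives the same answer as picking the nearest template. It reuses m1 and m2 from the example above; the test point and the function names are hypothetical.

```python
def discriminant(x, m):
    """Linear discriminant g(x) = m'x - 0.5 ||m||^2 for template m."""
    return sum(mi * xi for mi, xi in zip(m, x)) - 0.5 * sum(mi * mi for mi in m)


def classify_by_discriminant(x, templates):
    """Index of the template with the largest discriminant value."""
    scores = [discriminant(x, m) for m in templates]
    return scores.index(max(scores))


def classify_by_distance(x, templates):
    """Index of the template with the smallest squared Euclidean distance."""
    dists = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in templates]
    return dists.index(min(dists))


templates = [[4.3, 1.3], [1.5, 0.3]]  # m1 and m2 from the example
x = [3.0, 1.0]                        # hypothetical input

print(classify_by_discriminant(x, templates), classify_by_distance(x, templates))
```

The discriminant form needs only one inner product per class, since the 0.5 ||m_k||² terms can be precomputed.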

Min Euclidean distance Classifier
• A minimum-Euclidean-distance classifier classifies an input feature vector x by computing c linear discriminant functions

g1(x), g2(x), ... , gc(x)

and assigning x to the class corresponding to the maximum discriminant function.

[Figure: block diagram. The input x is combined with the templates m1, m2, …, mc to produce g1(x), g2(x), …, gc(x); a maximum selector outputs the class.]

Feature Scaling
• The numerical value for a feature x depends on the units used, i.e., on the scale.
• If x is multiplied by a positive scale factor a, both the mean and the standard deviation are multiplied by a.
• The variance is multiplied by a².
• Sometimes it is desirable to scale the data so that the resulting standard deviation is unity.
• To do so, divide x by the standard deviation s.
• Similarly, in measuring the distance from x to m, it often makes sense to measure it relative to the standard deviation.

Feature Scaling
• This suggests an important generalization of a minimum-Euclidean-distance classifier.
• Let x(i) be the value for Feature i,
• let m(i,j) be the mean value of Feature i for Class j, and
• let s(i,j) be the standard deviation of Feature i for Class j.
• In measuring the distance between the feature vector x and the mean vector mj for Class j, use the standardized distance

$$r(x, m_j)^2 = \left( \frac{x(1) - m(1,j)}{s(1,j)} \right)^2 + \left( \frac{x(2) - m(2,j)}{s(2,j)} \right)^2 + \cdots + \left( \frac{x(d) - m(d,j)}{s(d,j)} \right)^2$$
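A minimal sketch of the standardized distance for one class. The feature values, class means, and standard deviations are hypothetical; the second feature is given a much larger scale to show why the normalization matters.

```python
import math


def standardized_distance(x, mean, std):
    """r(x, m_j): each feature deviation is divided by that feature's standard deviation."""
    return math.sqrt(sum(((xi - mi) / si) ** 2 for xi, mi, si in zip(x, mean, std)))


def euclidean_distance(x, mean):
    """Plain Euclidean distance, for comparison."""
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, mean)))


x = [2.0, 30.0]
class_mean = [1.0, 10.0]
class_std = [0.5, 10.0]

print(euclidean_distance(x, class_mean))                # dominated by the second feature
print(standardized_distance(x, class_mean, class_std))  # both features contribute equally here
```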
Covariance
• The covariance of two features measures their tendency to vary together, i.e., to co-vary.
• The variance is the average of the squared deviation of a feature from its mean; the covariance is the average of the products of the deviations of two features from their respective means.
• Consider Feature i and Feature j.
• Let { x(1,i), x(2,i), ... , x(n,i) } be a set of n examples of Feature i
• Let { x(1,j), x(2,j), ... , x(n,j) } be a corresponding set of n examples of Feature j

Variance
• Let m(i) be the mean of Feature i.
• Then the variance of Feature i is

$$s(i)^2 = \frac{[x(1,i) - m(i)][x(1,i) - m(i)] + \cdots + [x(n,i) - m(i)][x(n,i) - m(i)]}{n-1}$$

• s(i) is the standard deviation of Feature i.

Covariance
• Let m(i) be the mean of Feature i, and m(j) be the mean of Feature j.
• Then the covariance of Feature i and Feature j is defined by

$$c(i,j) = \frac{[x(1,i) - m(i)][x(1,j) - m(j)] + \cdots + [x(n,i) - m(i)][x(n,j) - m(j)]}{n-1}$$

• The covariance has several important properties:
• If Feature i and Feature j tend to increase together, then c(i,j) > 0
• If Feature i tends to decrease when Feature j increases, then c(i,j) < 0
• If Feature i and Feature j are independent, then c(i,j) = 0
• | c(i,j) | <= s(i) s(j), where s(i) is the standard deviation of Feature i
• c(i,i) = s(i)², the variance of Feature i

Covariance Matrix
• All of the covariances c(i,j) can be collected together into a covariance matrix C:

$$C = \begin{bmatrix} c(1,1) & c(1,2) & \cdots & c(1,d) \\ c(2,1) & c(2,2) & \cdots & c(2,d) \\ \vdots & \vdots & & \vdots \\ c(d,1) & c(d,2) & \cdots & c(d,d) \end{bmatrix}$$
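The definitions of variance and covariance (with the n-1 divisor used above) can be turned into a small covariance-matrix routine. The sample values and function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """c(i,j): average product of deviations, using the n-1 divisor."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def covariance_matrix(samples):
    """samples is a list of n feature vectors, each of length d; returns the d x d matrix C."""
    d = len(samples[0])
    columns = [[s[i] for s in samples] for i in range(d)]  # one list per feature
    return [[covariance(columns[i], columns[j]) for j in range(d)] for i in range(d)]


# Hypothetical samples in which Feature 2 is exactly twice Feature 1.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
C = covariance_matrix(samples)
print(C)
```

The diagonal entries are the variances, and the matrix is symmetric because c(i,j) = c(j,i).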

Covariance Matrix
• We need to normalize the distance.
• Recall what we did earlier to get a standardized distance for a single feature:

$$r^2 = \left( \frac{x - m}{s} \right)^2 = (x - m) \frac{1}{s^2} (x - m)$$

• What is the matrix generalization of the scalar equation?

$$r^2 = (x - m_x)^T C_x^{-1} (x - m_x)$$
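A sketch of the matrix form r² = (x - m)' C⁻¹ (x - m), restricted to two features so that the 2×2 inverse can be written out by hand. The function name and the numeric values are hypothetical.

```python
def squared_distance_2d(x, m, C):
    """r^2 = (x - m)' C^{-1} (x - m) for a 2x2 covariance matrix C."""
    (a, b), (c, d) = C
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - m[0], x[1] - m[1]]
    return (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))


m = [0.0, 0.0]

# Diagonal C: reduces to the per-feature standardized distance (2/2)^2 + (1/1)^2.
print(squared_distance_2d([2.0, 1.0], m, [[4.0, 0.0], [0.0, 1.0]]))

# Correlated features: the off-diagonal terms change the distance.
print(squared_distance_2d([1.0, 1.0], m, [[2.0, 1.0], [1.0, 2.0]]))
```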
Bayesian Decision Theory
• Return to fish example. There are two categories. Denote these categories as w1 for sea bass and w2 for salmon.
• Assume that there is some prior probability (or simply prior) P(w1) that the next fish is sea bass, and some prior probability P(w2) that it is salmon.
• Suppose that we make a decision without making a measurement. The logical decision rule is

Decide w1 if P(w1) > P(w2); otherwise decide w2

Bayesian Decision Theory
• Suppose that we have a feature vector x; now the decision rule is

Decide w1 if P(w1 | x) > P(w2 | x); otherwise decide w2

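• Using the Bayes formula

$$P(w_i \mid x) = \frac{p(x \mid w_i) \, P(w_i)}{p(x)}$$

where

$$p(x) = \sum_{j} p(x \mid w_j) \, P(w_j)$$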

Bayesian Decision Theory
• Define a set of discriminant functions gi(x), i=1,…,c

OR

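Gaussian Density

Univariate (mean m, standard deviation s):

$$p(x) = \frac{1}{\sqrt{2\pi} \, s} \exp\left( -\frac{(x - m)^2}{2 s^2} \right)$$

Multivariate (d-dimensional x, mean vector m, covariance matrix C):

$$p(x) = \frac{1}{(2\pi)^{d/2} \, |C|^{1/2}} \exp\left( -\frac{1}{2} (x - m)' C^{-1} (x - m) \right)$$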

Example
• Suppose there are two classes: w1 and w2; and the classification decision is made based on a feature measurement, x.
• The conditional densities are Gaussian distributions: N(mean,variance)
• p(x|w1) ~ N(1,1)
• p(x|w2) ~ N(5,4)
• The prior probabilities are P(w1) = 0.2 and P(w2) = 0.8
• What is the class of an object if its feature x = 2 ?
• Find the decision boundary when P(w1) = P(w2) = 0.5.
• Find the decision boundary when P(w1) = 0.2 and P(w2) = 0.8.
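The three questions can be checked numerically. The sketch below assumes the decision rule "pick the class with the larger p(x|wi) P(wi)" and finds the boundaries by setting the two log-discriminants equal, which gives a quadratic in x because the variances differ. The function names are hypothetical.

```python
import math

# (mean, variance) of p(x|w1) and p(x|w2), taken from the example.
CLASSES = {1: (1.0, 1.0), 2: (5.0, 4.0)}


def gaussian_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def posteriors(x, priors):
    """P(wi|x) from the Bayes formula."""
    joint = {k: gaussian_pdf(x, *CLASSES[k]) * priors[k] for k in CLASSES}
    evidence = sum(joint.values())
    return {k: v / evidence for k, v in joint.items()}


def classify(x, priors):
    post = posteriors(x, priors)
    return max(post, key=post.get)


def decision_boundaries(priors):
    """Roots of ln[p(x|w1)P(w1)] = ln[p(x|w2)P(w2)]; assumes unequal variances."""
    (m1, v1), (m2, v2) = CLASSES[1], CLASSES[2]
    a = -1 / (2 * v1) + 1 / (2 * v2)
    b = m1 / v1 - m2 / v2
    c = (-m1 ** 2 / (2 * v1) + m2 ** 2 / (2 * v2)
         + math.log(priors[1] / priors[2]) - 0.5 * math.log(v1 / v2))
    root = math.sqrt(b * b - 4 * a * c)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


print(classify(2.0, {1: 0.2, 2: 0.8}))
print(decision_boundaries({1: 0.5, 2: 0.5}))
print(decision_boundaries({1: 0.2, 2: 0.8}))
```

With these densities, w1 is chosen between the two roots and w2 outside them. Lowering P(w1) shrinks the w1 interval, which is why x = 2 falls on the w2 side for priors 0.2 and 0.8 but on the w1 side for equal priors.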
Gaussian Density

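The center of the cluster is determined by the mean vector, and the shape of the cluster is determined by the covariance matrix.

The quadratic form r² = (x - m)' C⁻¹ (x - m) in the exponent of the multivariate Gaussian density is the squared “Mahalanobis distance” from x to the mean.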

Discriminant Functions for Gaussian
• Let us examine the discriminant function for
Discriminant Functions for Gaussian
• Case I:

As the priors change, the decision boundaries shift.

Discriminant Functions for Gaussian
• Examples: Find the decision boundaries for 1D and 2D Gaussian data.

Solve for x from

Parameter Estimation
• We learned how we could design an optimal classifier if we knew the prior probabilities P(wi) and the class-conditional densities p(x|wi).
• In a typical application, we rarely have complete knowledge. We typically have some general knowledge and a number of design samples (or training data).
• We use the samples to estimate the unknown probabilities and probability densities, and then use these estimates as if they were true values.
• If the densities can be parameterized, the problem is simplified significantly. (For example, for a Gaussian distribution, the mean and the covariance matrix are the only parameters we need to estimate.)
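A minimal sketch of this plug-in approach for a single feature with Gaussian class-conditional densities. It assumes the priors are estimated by class frequencies and the variance uses the n-1 divisor, as in the variance definition earlier; the design samples and the function name are hypothetical.

```python
import statistics

# Hypothetical labeled design samples: one feature value per object.
design_samples = {
    "w1": [2.0, 4.0, 4.0, 6.0],
    "w2": [7.0, 9.0],
}


def estimate_parameters(samples_by_class):
    """Estimate the prior, mean, and variance of each class from its samples."""
    total = sum(len(values) for values in samples_by_class.values())
    estimates = {}
    for label, values in samples_by_class.items():
        estimates[label] = {
            "prior": len(values) / total,               # class frequency (assumption)
            "mean": statistics.mean(values),            # sample mean
            "variance": statistics.variance(values),    # sample variance, n-1 divisor
        }
    return estimates


estimates = estimate_parameters(design_samples)
for label, params in estimates.items():
    print(label, params)
```

The estimates are then used in the Bayes decision rule as if they were the true values.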
Parameter Estimation

Gaussian case:

Dimensionality
• The accuracy degrades when the dimensionality is large.
• The dimensionality can be reduced by combining features.
• Linear combinations are attractive because they are simple to compute and analytically tractable.
• Dimensionality reduction techniques include
• Principal Component Analysis
• Fisher’s Discriminant Analysis
Principal Component Analysis (PCA)
• Find a lower dimensional space that best represents the data in a least-squares sense.

[Figure: data in the full N-dimensional space (here N = 2) projected onto a d-dimensional subspace (here d = 1). Credit: U. of Delaware.]

Principal Component Analysis (PCA)
• We begin by considering the problem of representing N-dimensional vectors x1, x2, …, xn by a single vector x0.
• To be more specific, suppose that we want to find a vector x0 such that the sum of squared differences between x0 and xk is as small as possible.
• Define the cost function to be minimized:

$$J_0(x_0) = \sum_{k=1}^{n} \| x_0 - x_k \|^2$$

• The solution is the sample mean:

$$x_0 = m = \frac{1}{n} \sum_{k=1}^{n} x_k$$
Principal Component Analysis (PCA)
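• The sample mean does not reveal any of the variability in the data. Let’s now consider a solution of the form

$$x_k \approx m + a_k e$$

where m is the sample mean, ak is a scalar, and e is a unit vector.

• Define the cost function to be minimized:

$$J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n} \| (m + a_k e) - x_k \|^2$$

• The solution is

$$a_k = e' (x_k - m)$$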
Principal Component Analysis (PCA)
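• What is the best direction e for the line?

Using

$$a_k = e' (x_k - m)$$

We get

$$J_1(e) = - e' S e + \sum_{k=1}^{n} \| x_k - m \|^2$$

where

$$S = \sum_{k=1}^{n} (x_k - m)(x_k - m)'$$

Find e that maximizes

$$e' S e \quad \text{subject to} \quad \| e \| = 1$$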

Principal Component Analysis (PCA)
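• The solution is

$$S e = \lambda e$$

where λ is an eigenvalue of S and e is the corresponding unit eigenvector.

Since

$$e' S e = \lambda e' e = \lambda$$

we select the eigenvector corresponding to the largest eigenvalue.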

Principal Component Analysis (PCA)
• Generalize it to d dimensions (d<=n):

Find the eigenvectors e1, e2, …, ed corresponding to the d largest eigenvalues of S.
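A sketch of the d = 1 case in plain Python. Instead of a full eigen-decomposition it uses power iteration, a swapped-in technique that converges to the eigenvector of S with the largest eigenvalue when the start vector is not orthogonal to it. The data points, the iteration count, and the function names are hypothetical.

```python
import math


def sample_mean(vectors):
    n, d = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(d)]


def scatter_matrix(vectors):
    """S = sum over k of (x_k - m)(x_k - m)'."""
    m = sample_mean(vectors)
    d = len(m)
    S = [[0.0] * d for _ in range(d)]
    for v in vectors:
        diff = [v[i] - m[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                S[i][j] += diff[i] * diff[j]
    return S


def principal_direction(vectors, iterations=200):
    """Power iteration: repeatedly apply S and renormalize to unit length."""
    S = scatter_matrix(vectors)
    d = len(S)
    e = [1.0 / math.sqrt(d)] * d  # start vector (assumed not orthogonal to the answer)
    for _ in range(iterations):
        Se = [sum(S[i][j] * e[j] for j in range(d)) for i in range(d)]
        length = math.sqrt(sum(c * c for c in Se))
        if length == 0.0:
            break
        e = [c / length for c in Se]
    return e


def project(vectors, e):
    """Coefficients a_k = e'(x_k - m) of each sample along direction e."""
    m = sample_mean(vectors)
    return [sum(e[i] * (v[i] - m[i]) for i in range(len(e))) for v in vectors]


# Hypothetical 2D data lying close to a line.
data = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2], [5.0, 4.8]]
e = principal_direction(data)
print(e)
print(project(data, e))
```

For more than one component, a library eigen-solver applied to S is the more practical choice; the power iteration here only recovers the first direction.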

Eigenface Approach
• Reduce the dimensionality by applying PCA:
• Apply PCA to a training dataset to find the first d principal components.
• Find the weights for all images.
• Classify the probe using norm distance.

[Figure: eigenface example (d = 8).]