Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber, P. Perona
What is Classification? • We want to assign objects to classes based on a selection of attributes (features). • Examples: • (age, income) → {credit worthy, not credit worthy} • (blood cell count, body temp) → {flu, hepatitis B, hepatitis C} • (pixel vector) → {Bill Clinton, coffee cup} • Feature vectors can be continuous, discrete, or mixed.
What is Classification? • Want to find a function from measurements to class labels, i.e. a decision boundary in the space of feature vectors. • Statistical methods use the pdf p(C,x). • Assume p(C,x) is known for now. [Figure: Signal 1, Signal 2, and Noise clusters in the (x1, x2) feature space]
Some Terminology • p(C) is called a prior or a priori probability • p(x|C) is called a class-conditional density or likelihood of C with respect to x • p(C|x) is called a posterior or a posteriori probability
Examples • One measurement, symmetric cost, equal priors. [Figure labelled "bad": strongly overlapping class-conditional densities p(x|C1) and p(x|C2) along x]
Examples • One measurement, symmetric cost, equal priors. [Figure labelled "good": well-separated class-conditional densities p(x|C1) and p(x|C2) along x]
How to Make the Best Decision? (Bayes Decision Theory) • Define a cost function for mistakes, e.g. a zero-one loss that counts every error equally. • Minimize the expected loss (risk) over the entire p(C,x). • It is sufficient to assure the optimal decision for each individual x. • Result: decide according to the maximum posterior probability, i.e. choose the class Ci that maximizes p(Ci|x).
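A minimal sketch of the maximum-posterior rule, assuming the class-conditional densities and priors are known; the 1-D Gaussian parameters here are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D class-conditional densities and priors (illustrative values).
priors = np.array([0.5, 0.5])                 # p(C1), p(C2)
likelihoods = [norm(loc=0.0, scale=1.0),      # p(x|C1)
               norm(loc=2.0, scale=1.0)]      # p(x|C2)

def bayes_decision(x):
    """Return the index of the class with maximum posterior p(C|x)."""
    joint = np.array([p.pdf(x) for p in likelihoods]) * priors  # p(x|Ci) p(Ci)
    return int(np.argmax(joint))  # dividing by p(x) does not change the argmax

print(bayes_decision(0.3))  # -> 0 (class C1)
print(bayes_decision(1.7))  # -> 1 (class C2)
```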
Two Classes, C1, C2 • It is helpful to consider the likelihood ratio g(x) = p(x|C1) / p(x|C2): decide C1 if g(x) exceeds a threshold. • Use the known priors p(Ci), or ignore them (equal priors). • The same form holds for a more elaborate loss function, only the threshold changes (proof is easy). • g(x) is called a discriminant function.
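A standard way to write the two-class rule this slide alludes to (zero-one loss; the general-loss version only changes the threshold):

```latex
% Likelihood-ratio test for two classes under zero-one loss
g(x) \;=\; \frac{p(x \mid C_1)}{p(x \mid C_2)}
\;\;\underset{C_2}{\overset{C_1}{\gtrless}}\;\;
\frac{p(C_2)}{p(C_1)}
% With losses \lambda_{ij} (cost of deciding C_i when C_j is true),
% the threshold becomes
% \frac{(\lambda_{12}-\lambda_{22})\,p(C_2)}{(\lambda_{21}-\lambda_{11})\,p(C_1)}.
```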
Discriminant Functions for Multivariate Gaussian Class-Conditional Densities • Two multivariate Gaussians in d dimensions. • Since log is monotonic, we can look at log g(x) instead of g(x): the quadratic terms are squared Mahalanobis distances, and the class-independent constants are superfluous.
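Written out from the Gaussian densities (means μi, covariances Σi, as in the slides):

```latex
% Log-likelihood ratio for two multivariate Gaussians
\ln g(x) = -\tfrac{1}{2}(x-\mu_1)^{\!\top}\Sigma_1^{-1}(x-\mu_1)
           +\tfrac{1}{2}(x-\mu_2)^{\!\top}\Sigma_2^{-1}(x-\mu_2)
           -\tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|}
% The quadratic forms (x-\mu_i)^{\top}\Sigma_i^{-1}(x-\mu_i) are squared
% Mahalanobis distances; the (2\pi)^{-d/2} normalization factors cancel.
```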
Mahalanobis Distance • Iso-distance lines are iso-probability lines. [Figure: two Gaussians with means μ1, μ2 in the (x1, x2) plane; the decision surface lies between them]
Case 1: Σi = σ²I • The discriminant functions… • …simplify to linear functions of x.
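For equal, isotropic covariances the log-discriminant reduces to a linear function of x (standard result, consistent with the matched-filter slide that follows):

```latex
% Case \Sigma_i = \sigma^2 I: the quadratic terms in x cancel
\ln g(x) = \frac{(\mu_1-\mu_2)^{\!\top} x}{\sigma^{2}}
           - \frac{\|\mu_1\|^{2}-\|\mu_2\|^{2}}{2\sigma^{2}}
% Decide C_1 when \ln g(x) > \ln\frac{p(C_2)}{p(C_1)}.
```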
Decision Boundary • If μ2 = 0 (signal vs. pure noise), the linear discriminant becomes a correlation of x with μ1: the matched filter! With an expression for the threshold.
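A sketch of that special case, assuming the μ2 = 0 reading of this slide and the Case 1 covariance σ²I:

```latex
% Matched filter: \mu_2 = 0, \Sigma_i = \sigma^2 I
\text{decide } C_1 \quad\text{iff}\quad
\mu_1^{\!\top} x \;>\; \frac{\|\mu_1\|^{2}}{2} + \sigma^{2}\ln\frac{p(C_2)}{p(C_1)}
```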
Two Signals and Additive White Gaussian Noise [Figure: vectors μ1 (Signal 1), μ2 (Signal 2), μ1 - μ2, x, and x - μ2 in the (x1, x2) plane]
Case 2: Σi = Σ • Two classes, 2-D measurements, p(x|C) are multivariate Gaussians with equal covariance matrices. • The derivation is similar to Case 1. • The quadratic term vanishes since it is independent of the class. • We obtain a linear decision surface. • Matlab demo (a Python sketch of the same idea appears below).
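A minimal sketch of the equal-covariance case, not the lecture's Matlab demo; the means, shared covariance, priors, and test points are made up:

```python
import numpy as np

# Assumed parameters for two classes with a shared covariance (illustrative).
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
prior1, prior2 = 0.5, 0.5

# With equal covariances the discriminant is linear: w^T x + b.
Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu2 @ Sigma_inv @ mu2) + np.log(prior1 / prior2)

def classify(x):
    """Decide class 1 if the linear discriminant is positive, else class 2."""
    return 1 if w @ x + b > 0 else 2

print(classify(np.array([0.2, -0.1])))  # near mu1 -> 1
print(classify(np.array([2.1, 1.2])))   # near mu2 -> 2
```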
Case 3: General Covariance Matrix • See transparency (the quadratic terms no longer cancel, so the decision surfaces become quadratic).
Isn't this too simple? • Not at all… • It is true that images form complicated manifolds (from a pixel point of view, translation, rotation, and scaling are all highly non-linear operations). • The high dimensionality helps.
Assume Unknown Class Densities • In real life, we do not know the class-conditional densities. • But we do have example data. • This puts us in the typical machine learning scenario: we want to learn a function, c(x), from examples. • Why not just estimate the class densities from examples and apply the previous ideas? • Learning even a Gaussian (a simple density) in N dimensions needs at least on the order of N² samples. • 10x10 pixels → N = 100 → some 10,000 examples! • Avoid estimating densities whenever you can! (too general) • The posterior is generally simpler than the class-conditional density (see transparency).
Remember PCA? • Principal components are eigenvectors of the covariance matrix. • Use the reconstruction error for recognition (e.g. Eigenfaces). • Good: reduces dimensionality. • Bad: no model within the subspace; linearity may be inappropriate; the covariance is not the right quantity to optimize for discrimination. [Figure: data cloud in the (x1, x2) plane with first principal direction u1]
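A minimal PCA sketch, assuming a data matrix X with one sample per row; the toy data and the single retained component are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [0.8, 0.5]])  # correlated toy data

# Principal components = eigenvectors of the sample covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
u1 = eigvecs[:, -1]                             # first principal direction

# Project onto u1 and measure reconstruction error (as in Eigenface-style recognition).
proj = X_centered @ u1
reconstruction = np.outer(proj, u1)
error = np.linalg.norm(X_centered - reconstruction, axis=1)
print(u1, error.mean())
```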
Fisher's Linear Discriminant • Goal: reduce dimensionality before training classifiers etc. (feature selection). • Similar goal as PCA, but Fisher has classification in mind: find projection directions such that separation of the classes is easiest. • Eigenfaces vs. Fisherfaces. [Figure: two class clusters in the (x1, x2) plane and a candidate projection direction]
Fisher's Linear Discriminant • Assume we have n d-dimensional samples x1, …, xn, with n1 from set (class) X1 and n2 from set X2. • We form the linear combinations y = wᵀx and obtain y1, …, yn. • Only the direction of w is important, not its length.
Objective for Fisher • Measure the separation as the distance between the class means after projecting (k = 1, 2). • Measure the scatter of each class after projecting. • The objective is to maximize the ratio of separation to scatter, written out below.
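In the standard notation (which the slide's missing formulas presumably used):

```latex
% Projected means and scatters, k = 1, 2
\tilde m_k = \frac{1}{n_k}\sum_{x\in X_k} w^{\top}x, \qquad
\tilde s_k^{\,2} = \sum_{x\in X_k}\bigl(w^{\top}x - \tilde m_k\bigr)^{2}
% Fisher criterion to maximize over w:
J(w) = \frac{(\tilde m_1 - \tilde m_2)^{2}}{\tilde s_1^{\,2} + \tilde s_2^{\,2}}
```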
We need to make the dependence on w explicit: • defining the within-class scatter matrix SW = S1 + S2, the denominator becomes wᵀ SW w. • Similarly, the separation (between-class scatter matrix SB) gives the numerator wᵀ SB w. • Finally, we can write J(w) as a ratio of these two quadratic forms.
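Spelled out, again in the usual notation with class means mk:

```latex
% Scatter matrices
S_k = \sum_{x\in X_k} (x - m_k)(x - m_k)^{\top}, \qquad S_W = S_1 + S_2, \qquad
S_B = (m_1 - m_2)(m_1 - m_2)^{\top}
% so that
\tilde s_1^{\,2} + \tilde s_2^{\,2} = w^{\top} S_W\, w, \qquad
(\tilde m_1 - \tilde m_2)^{2} = w^{\top} S_B\, w, \qquad
J(w) = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w}
```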
Fisher's Solution • J(w) is called a generalized Rayleigh quotient. Any w that maximizes J must satisfy the generalized eigenvalue problem SB w = λ SW w. • Since SB is singular (rank 1) and SB w is always in the direction of (m1 - m2), we are done: w ∝ SW⁻¹(m1 - m2).
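A compact sketch of that closed-form solution on made-up two-class data:

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # class 1 samples
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(100, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)          # within-class scatter of class 1
S2 = (X2 - m2).T @ (X2 - m2)          # within-class scatter of class 2
SW = S1 + S2

w = np.linalg.solve(SW, m1 - m2)      # Fisher direction, w proportional to SW^{-1}(m1 - m2)
w /= np.linalg.norm(w)                # only the direction matters

y1, y2 = X1 @ w, X2 @ w               # 1-D projections of the two classes
print(w, y1.mean(), y2.mean())
```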
Comments on FLD • We did not follow Bayes Decision Theory • FLD is useful for many types of densities • Fisher can be extended (see demo): • more than one projection direction • more than two clusters • Let’s try it out: Matlab Demo
Fisher vs. Bayes • Assume we do have identical Gaussian class densities with shared covariance Σ; then Bayes gives a linear rule with w = Σ⁻¹(μ1 - μ2), • while Fisher gives w ∝ SW⁻¹(m1 - m2). • Since SW is proportional to the (sample estimate of the) covariance matrix, w points in the same direction in both cases. • Comforting...
What have we achieved? • Found out that the maximum-posterior strategy is optimal. Always. • Looked at different cases of Gaussian class densities, where we could derive simple decision rules. • Gaussian classifiers do a reasonable job! • Learned about FLD, which is useful and often preferable to PCA.
Just for Fun: Support Vector Machine • Very fashionable… state of the art? • Does not model densities; fits the decision surface directly. • Maximizes the margin → reduces "complexity". • The decision surface only depends on nearby samples (the support vectors). • Matlab Demo (a Python sketch of the same idea appears below). [Figure: two classes in the (x1, x2) plane separated by a maximum-margin boundary]
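Not the lecture's Matlab demo; a minimal sketch using scikit-learn's SVC (assumed available) on made-up data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=[0.0, 0.0], size=(50, 2)),
               rng.normal(loc=[3.0, 3.0], size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0)   # linear maximum-margin classifier
clf.fit(X, y)

# The decision surface depends only on the support vectors (samples near the margin).
print("number of support vectors:", clf.support_vectors_.shape[0])
print("prediction for [1.5, 1.5]:", clf.predict([[1.5, 1.5]]))
```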
Learning Algorithms [Diagram: examples (xi, yi) drawn from p(x,y) are fed to a learning algorithm, which selects a learned function y = f(x) from a set of candidate functions]
Assume Unknown Class Densities • SVM examples. • Densities are hard to estimate -> avoid estimating them. • Example from Ripley. • Gives intuitions on overfitting. • We need to learn from data: a standard machine learning problem. • Training/test sets.