Linear Discriminant Functions Wen-Hung Liao, 11/25/2008
Introduction: LDF • Assume we know the proper form of the discriminant functions, rather than the underlying probability densities. • Use samples to estimate the parameters of the classifier (statistical or non-statistical). • We will be concerned with discriminant functions that are either linear in the components of x, or linear in some given set of functions of x.
Why LDF? • Trade some accuracy for simplicity • Attractive candidates for initial, trial classifiers • Related to neural networks
Approach • Find the LDF by minimizing a criterion function. • Use a gradient descent procedure for the minimization. • Convergence properties • Computational complexity • Example of a criterion function: sample risk, or training error. (Not appropriate. Why? Because a small training error does not guarantee a small test error.)
LDF and Decision Surfaces • A linear discriminant function: g(x) = w^t x + w0, where w is the weight vector and w0 is the bias or threshold weight.
Two-Category Case • Decision rule: decide w1 if g(x) > 0, decide w2 if g(x) < 0. • In other words, x is assigned to w1 if the inner product w^t x exceeds the threshold -w0.
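The two-category rule above can be sketched in a few lines of Python; the weight vector and bias here are illustrative values, not from the slides.

```python
# Minimal sketch of the two-category linear decision rule.

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def classify(x, w, w0):
    """Decide class 1 if g(x) > 0, class 2 if g(x) < 0."""
    return 1 if g(x, w, w0) > 0 else 2

w, w0 = [1.0, -2.0], 0.5            # illustrative weights
print(classify([3.0, 1.0], w, w0))  # g = 3 - 2 + 0.5 = 1.5 > 0 -> 1
print(classify([0.0, 1.0], w, w0))  # g = 0 - 2 + 0.5 = -1.5 < 0 -> 2
```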
Decision Boundary • A hyperplane H defined by g(x) = 0. • If x1 and x2 are both on the decision surface, then w^t x1 + w0 = w^t x2 + w0, so w^t (x1 - x2) = 0. • Hence w is normal to any vector lying in the hyperplane.
Distance Measure • For any x, x = xp + r w/||w||, where xp is the normal projection of x onto H, and r = g(x)/||w|| is the algebraic distance (positive on the positive side of H, negative on the negative side).
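The algebraic distance r = g(x)/||w|| can be computed directly; the hyperplane below is a hand-picked example, not from the slides.

```python
import math

def distance_to_hyperplane(x, w, w0):
    """Algebraic distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    g = sum(wi * xi for wi, xi in zip(w, x)) + w0
    norm = math.sqrt(sum(wi * wi for wi in w))
    return g / norm

# Hyperplane x1 - 2 = 0 (the vertical line x1 = 2 in the plane).
r = distance_to_hyperplane([5.0, 3.0], [1.0, 0.0], -2.0)
print(r)  # 3.0: the point lies 3 units on the positive side of H
```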
Multi-category Case • General case: define c linear discriminant functions gi(x) = wi^t x + wi0, and assign x to wi if gi(x) > gj(x) for all j ≠ i. • Alternatives: reduce the problem to c-1 two-class problems (wi vs. not-wi), or use c(c-1)/2 linear discriminants, one per pair of classes; both can leave ambiguous regions.
Distance Measure • The boundary Hij between regions Ri and Rj satisfies gi(x) = gj(x). • wi - wj is normal to Hij. • The distance from x to Hij is given by (gi(x) - gj(x)) / ||wi - wj||.
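The multi-category rule (assign x to the class with the largest discriminant) can be sketched as follows; the per-class weights are illustrative assumptions.

```python
# Sketch of the multi-category linear machine: x goes to the class i
# maximizing g_i(x) = w_i^t x + w_i0.

def classify_multi(x, weights, biases):
    """Return the index i maximizing g_i(x) = w_i^t x + w_i0."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)

weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one w_i per class (illustrative)
biases = [0.0, 0.0, 0.0]
print(classify_multi([2.0, 1.0], weights, biases))  # scores 2, 1, -3 -> class 0
print(classify_multi([0.0, 3.0], weights, biases))  # scores 0, 3, -3 -> class 1
```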
Quadratic DF • Add terms involving products of pairs of components of x to obtain the quadratic discriminant function: g(x) = w0 + sum_i wi xi + sum_i sum_j wij xi xj. • The separating surface defined by g(x) = 0 is a hyperquadric surface.
Hyperquadric Surfaces • If W = [wij] is not singular, then the linear terms in g(x) can be eliminated by translating the axes. • Define a scaled matrix: W̄ = W / ((1/4) w^t W^-1 w - w0). The type of surface depends on the eigenvalues of W̄: • All equal: hypersphere • Same sign but unequal: hyperellipsoid • Mixed signs: hyperhyperboloid
Generalized LDF • Polynomial discriminant functions. • Generalized LDF: g(x) = sum_i ai yi(x) = a^t y, where the functions yi(x) map x into a new feature space.
Augmented Vectors • Augmented feature vector: y = (1, x1, ..., xd)^t. • Augmented weight vector: a = (w0, w1, ..., wd)^t. • This maps the d-dimensional x-space to a (d+1)-dimensional y-space, so that g(x) = a^t y.
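The augmentation trick can be verified numerically: after prepending a 1 to x and the bias to w, the single inner product a^t y reproduces w^t x + w0. The sample values are illustrative.

```python
def augment_feature(x):
    """y = (1, x1, ..., xd): prepend a constant 1 to the feature vector."""
    return [1.0] + list(x)

def augment_weight(w, w0):
    """a = (w0, w1, ..., wd): absorb the bias into the weight vector."""
    return [w0] + list(w)

x, w, w0 = [2.0, 3.0], [1.0, -1.0], 0.5   # illustrative values
y, a = augment_feature(x), augment_weight(w, w0)
g_original = sum(wi * xi for wi, xi in zip(w, x)) + w0
g_augmented = sum(ai * yi for ai, yi in zip(a, y))
print(g_original, g_augmented)  # both -0.5: a^t y equals w^t x + w0
```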
2-Category Separable Case • Look for a weight vector that classifies all of the samples correctly. If such a weight vector exists, the samples are said to be linearly separable.
Gradient Descent Procedure • Define a criterion function J(a) that is minimized when a is a solution vector. • Step 1: Randomly pick a(1), and compute the gradient vector ∇J(a(1)). • Step 2: Obtain a(2) by moving some distance from a(1) in the direction of steepest descent: a(k+1) = a(k) - η(k) ∇J(a(k)).
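The update a(k+1) = a(k) - η ∇J(a(k)) can be sketched on a toy criterion function; both the quadratic J and the fixed learning rate below are illustrative assumptions, not from the slides.

```python
def gradient_descent(grad, a, eta=0.1, steps=100):
    """Iterate the basic update a(k+1) = a(k) - eta * grad J(a(k))."""
    for _ in range(steps):
        a = [ai - eta * gi for ai, gi in zip(a, grad(a))]
    return a

# Toy criterion J(a) = (a1 - 1)^2 + (a2 + 2)^2, minimized at (1, -2);
# its gradient is (2(a1 - 1), 2(a2 + 2)).
grad = lambda a: [2 * (a[0] - 1), 2 * (a[1] + 2)]
a = gradient_descent(grad, [0.0, 0.0])
print(a)  # approaches the minimizer (1, -2)
```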
Setting the Learning Rate • Second-order expansion of J(a): J(a) ≈ J(a(k)) + ∇J^t (a - a(k)) + (1/2)(a - a(k))^t H (a - a(k)), where H is the Hessian matrix. • Substituting a = a(k+1) = a(k) - η(k) ∇J gives J(a(k+1)) ≈ J(a(k)) - η(k) ||∇J||^2 + (1/2) η(k)^2 ∇J^t H ∇J. • This is minimized when η(k) = ||∇J||^2 / (∇J^t H ∇J).
Newton Descent • For nonsingular H, use the update a(k+1) = a(k) - H^-1 ∇J. • Converges in fewer steps, but each step is more expensive to compute (it requires inverting the Hessian).
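For a quadratic criterion, a single Newton step a(k+1) = a(k) - H^-1 ∇J lands exactly on the minimum, which illustrates the faster convergence. The criterion, its gradient, and the inverse Hessian below are hand-built illustrative assumptions.

```python
def newton_step(a, grad, hess_inv):
    """One Newton update a - H^-1 * grad J(a), with H^-1 given explicitly."""
    g = grad(a)
    step = [sum(hess_inv[i][j] * g[j] for j in range(len(g)))
            for i in range(len(g))]
    return [ai - si for ai, si in zip(a, step)]

# Toy quadratic J(a) = 2*a1^2 + a2^2: grad = (4*a1, 2*a2),
# H = diag(4, 2), so H^-1 = diag(0.25, 0.5).
grad = lambda a: [4 * a[0], 2 * a[1]]
H_inv = [[0.25, 0.0], [0.0, 0.5]]
print(newton_step([3.0, -5.0], grad, H_inv))  # [0.0, 0.0]: one step reaches the minimum
```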
Perceptron Criterion Function • Jp(a) = sum over y in Y(a) of (-a^t y), where Y(a) is the set of samples misclassified by a. • Since ∇Jp = sum over y in Y(a) of (-y), • the update rule is: a(k+1) = a(k) + η(k) sum over y in Y(a(k)) of y.
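The batch perceptron update above can be sketched on a toy linearly separable set. Here samples are in augmented form and class-2 samples are negated ("normalized"), so a solution vector must satisfy a^t y > 0 for every y; the data and learning rate are illustrative assumptions.

```python
def perceptron_train(ys, eta=1.0, max_epochs=1000):
    """Batch perceptron: ys are augmented, normalized samples (class-2
    negated). Repeat a += eta * sum of misclassified y until none remain."""
    a = [0.0] * len(ys[0])
    for _ in range(max_epochs):
        misclassified = [y for y in ys
                         if sum(ai * yi for ai, yi in zip(a, y)) <= 0]
        if not misclassified:
            break
        for y in misclassified:
            a = [ai + eta * yi for ai, yi in zip(a, y)]
    return a

# Toy separable samples, augmented (first component is the constant 1).
class1 = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0]]
class2 = [[1.0, -1.0, -1.0], [1.0, -2.0, 0.0]]
ys = class1 + [[-c for c in y] for y in class2]   # negate class-2 samples
a = perceptron_train(ys)
print(all(sum(ai * yi for ai, yi in zip(a, y)) > 0 for y in ys))  # True
```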
Convergence Proof • Refer to pages 229 to 232 of the textbook.