Linear Discriminant Functions

Linear Discriminant Functions Wen-Hung Liao, 11/25/2008

Introduction: LDF • Assume we know the proper form of the discriminant functions, instead of the underlying probability densities. • Use samples to estimate the parameters of the classifier (by statistical or non-statistical methods). • We will be concerned with discriminant functions that are either linear in the components of x, or linear in some given set of functions of x.

Why LDF? • Simplicity vs. accuracy • Attractive candidates for initial, trial classifiers • Related to neural networks

Approach • Find the LDF by minimizing a criterion function. • Use a gradient descent procedure for the minimization. • Convergence properties • Computational complexity • Example of a criterion function: sample risk, or training error. (Not appropriate. Why?) Because a small training error does not guarantee a small test error.

LDF and Decision Surfaces • A linear discriminant function: $g(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + w_0$, where $\mathbf{w}$ is the weight vector and $w_0$ is the bias or threshold weight.

Two-Category Case • Decision rule: decide $\omega_1$ if $g(\mathbf{x}) > 0$, decide $\omega_2$ if $g(\mathbf{x}) < 0$. • In other words, x is assigned to $\omega_1$ if the inner product $\mathbf{w}^t\mathbf{x}$ exceeds the threshold $-w_0$.
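The two-category rule can be sketched in a few lines of Python. This is only an illustration, assuming the usual form g(x) = w^t x + w0; the function names and the 2-D weights below are hypothetical and not part of the slides.

```python
def g(w, w0, x):
    """Linear discriminant g(x) = w^t x + w0 (w and x are plain lists)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def decide(w, w0, x):
    """Return 1 for omega_1, 2 for omega_2, or None if x lies on the boundary."""
    value = g(w, w0, x)
    if value > 0:
        return 1
    if value < 0:
        return 2
    return None


# Hypothetical example: the decision boundary is the line x1 + x2 = 1.
w, w0 = [1.0, 1.0], -1.0
label_a = decide(w, w0, [2.0, 2.0])  # g = 3 > 0, so omega_1
label_b = decide(w, w0, [0.0, 0.0])  # g = -1 < 0, so omega_2
```

Points with g(x) = 0 are left undecided here; in practice a tie-breaking convention would be chosen.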

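Decision Boundary • A hyperplane H defined by $g(\mathbf{x}) = 0$. • If $\mathbf{x}_1$ and $\mathbf{x}_2$ are both on the decision surface, then $\mathbf{w}^t\mathbf{x}_1 + w_0 = \mathbf{w}^t\mathbf{x}_2 + w_0$, so $\mathbf{w}^t(\mathbf{x}_1 - \mathbf{x}_2) = 0$. • Hence w is normal to any vector lying in the hyperplane.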

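Distance Measure • For any x, $\mathbf{x} = \mathbf{x}_p + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|}$, where $\mathbf{x}_p$ is the normal projection of x onto H, and r is the algebraic (signed) distance from x to H. • Since $g(\mathbf{x}_p) = 0$, it follows that $g(\mathbf{x}) = r\,\|\mathbf{w}\|$, i.e. $r = g(\mathbf{x}) / \|\mathbf{w}\|$.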
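As a quick numerical check of the distance measure, the sketch below computes the signed distance r = g(x)/||w|| and the projection x_p = x - r w/||w||, assuming g(x) = w^t x + w0. The helper name and the numbers are made up for illustration.

```python
import math


def signed_distance_and_projection(w, w0, x):
    """Return (r, x_p): signed distance from x to H and projection of x onto H."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    gx = sum(wi * xi for wi, xi in zip(w, x)) + w0
    r = gx / norm
    # Move x back along the unit normal w/||w|| by the distance r.
    x_p = [xi - r * wi / norm for xi, wi in zip(x, w)]
    return r, x_p


# Hypothetical example: H is the line 3*x1 + 4*x2 - 5 = 0.
w, w0 = [3.0, 4.0], -5.0
r, x_p = signed_distance_and_projection(w, w0, [3.0, 4.0])
```

The sign of r tells which side of H the point lies on: positive on the side w points to, negative on the other.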

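Multi-category Case • In the general case of c classes, the problem can be reduced to c-1 two-class problems (separating $\omega_i$ from everything else), or handled with c(c-1)/2 linear discriminants, one for every pair of classes.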

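Use c linear discriminants • Define $g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$, $i = 1, \dots, c$, and assign x to $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$. • The boundary $H_{ij}$ between two adjacent decision regions is the hyperplane $g_i(\mathbf{x}) = g_j(\mathbf{x})$.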

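Distance Measure • $\mathbf{w}_i - \mathbf{w}_j$ is normal to $H_{ij}$. • The signed distance from x to $H_{ij}$ is given by $\frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}$, where $g_i$ and $g_j$ are the discriminants of classes i and j.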
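A classifier built from c linear discriminants can be sketched as below, assuming each class has a discriminant g_i(x) = w_i^t x + w_i0 and x is assigned to the class with the largest value. The weights are hypothetical, and class indices start at 0 in the code.

```python
import math


def discriminants(weights, biases, x):
    """Evaluate g_i(x) = w_i^t x + w_i0 for every class i."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def classify(weights, biases, x):
    """Index of the class with the largest discriminant value."""
    scores = discriminants(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


def distance_to_boundary(weights, biases, x, i, j):
    """Signed distance from x to H_ij: (g_i - g_j) / ||w_i - w_j||."""
    scores = discriminants(weights, biases, x)
    diff = [a - b for a, b in zip(weights[i], weights[j])]
    return (scores[i] - scores[j]) / math.sqrt(sum(d * d for d in diff))


# Hypothetical 3-class problem in two dimensions.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
x = [2.0, 1.0]
winner = classify(weights, biases, x)
dist_01 = distance_to_boundary(weights, biases, x, 0, 1)
```

Ties between the largest discriminants are resolved here by taking the first index, which is an arbitrary choice.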

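Quadratic DF • Add terms involving products of pairs of components of x to obtain the quadratic discriminant function: $g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j$. • The separating surface defined by $g(\mathbf{x}) = 0$ is a hyperquadric surface.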

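Hyperquadric Surfaces • If $\mathbf{W} = [w_{ij}]$ is nonsingular, the linear terms in g(x) can be eliminated by translating the axes. • Define a scaled matrix: $\bar{\mathbf{W}} = \mathbf{W} / (\mathbf{w}^t\mathbf{W}^{-1}\mathbf{w} - 4w_0)$. • Hypersphere: $\bar{\mathbf{W}}$ is a positive multiple of the identity matrix. • Hyperellipsoid: $\bar{\mathbf{W}}$ is positive definite. • Hyperhyperboloid: some eigenvalues of $\bar{\mathbf{W}}$ are positive and others are negative.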

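Generalized LDF • Polynomial discriminant functions • Generalized LDF: $g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i\, y_i(\mathbf{x}) = \mathbf{a}^t\mathbf{y}$, where a is a $\hat{d}$-dimensional weight vector and the $y_i(\mathbf{x})$ are arbitrary functions of x.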
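To illustrate the generalized form, the sketch below uses a hypothetical mapping y(x) = (1, x, x^2) of a scalar feature, so that g(x) = a^t y is quadratic in x but still linear in y. The mapping and the coefficients are chosen only for illustration.

```python
def phi(x):
    """Hypothetical mapping y(x) = (1, x, x^2) for a scalar feature x."""
    return [1.0, x, x * x]


def g(a, x):
    """Generalized linear discriminant g(x) = a^t y(x)."""
    return sum(ai * yi for ai, yi in zip(a, phi(x)))


# With a = (-1, 1, 2), g(x) = -1 + x + 2x^2.
a = [-1.0, 1.0, 2.0]
values = [g(a, x) for x in (-2.0, 0.0, 1.0)]
```

With these coefficients g(x) is positive for x < -1 or x > 0.5, so the positive decision region in x-space is not connected even though the boundary in y-space is a single hyperplane.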

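Augmented Vectors • Augmented feature vector: $\mathbf{y} = [1, x_1, \dots, x_d]^t$ • Augmented weight vector: $\mathbf{a} = [w_0, w_1, \dots, w_d]^t$ • This maps the d-dimensional x-space to a (d+1)-dimensional y-space, in which $g(\mathbf{x}) = \mathbf{a}^t\mathbf{y}$.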
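A small check that augmentation does not change the discriminant, assuming y = (1, x) and a = (w0, w) so that a^t y = w^t x + w0. The helper names and numbers are illustrative only.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def augment(x):
    """Prepend a constant 1 to the feature vector."""
    return [1.0] + list(x)


# Hypothetical weights and sample.
w, w0 = [2.0, -1.0], 0.5
x = [3.0, 4.0]

a = [w0] + w              # augmented weight vector
y = augment(x)            # augmented feature vector
g_original = dot(w, x) + w0
g_augmented = dot(a, y)
```

The bias is absorbed into the weight vector, so the search for (w, w0) becomes a search for the single vector a.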

2-Category Separable Case • Look for a weight vector that classifies all of the samples correctly. If such a weight vector exists, the samples are said to be linearly separable.

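Gradient Descent Procedure • Define a criterion function J(a) that is minimized if a is a solution vector. • Step 1: Randomly pick a(1), and compute the gradient vector $\nabla J(\mathbf{a}(1))$. • Step 2: a(2) is obtained by moving some distance from a(1) in the direction of steepest descent, i.e. along the negative gradient. • In general: $\mathbf{a}(k+1) = \mathbf{a}(k) - \eta(k)\,\nabla J(\mathbf{a}(k))$, where $\eta(k)$ is the learning rate.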
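The procedure can be sketched as a short loop, assuming the standard update a(k+1) = a(k) - eta * grad J(a(k)) with a fixed learning rate. The stopping threshold theta, the iteration cap, and the toy criterion J(a) = ||a - target||^2 are assumptions made for this example.

```python
import math


def gradient_descent(grad, a, eta=0.1, theta=1e-8, max_iter=10_000):
    """Repeat a <- a - eta * grad(a) until the step is shorter than theta."""
    for _ in range(max_iter):
        step = [eta * gi for gi in grad(a)]
        a = [ai - si for ai, si in zip(a, step)]
        if math.sqrt(sum(s * s for s in step)) < theta:
            break
    return a


# Toy criterion J(a) = ||a - target||^2, whose gradient is 2 (a - target).
target = [1.0, -2.0]


def grad_J(a):
    return [2.0 * (ai - ci) for ai, ci in zip(a, target)]


a_final = gradient_descent(grad_J, [0.0, 0.0])
```

With eta = 0.1 the error shrinks by a factor of 0.8 per iteration on this criterion; a much larger eta would overshoot and could diverge.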

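Setting the Learning Rate • Second-order expansion of J(a) around a(k): $J(\mathbf{a}) \approx J(\mathbf{a}(k)) + \nabla J^t(\mathbf{a} - \mathbf{a}(k)) + \frac{1}{2}(\mathbf{a} - \mathbf{a}(k))^t\,\mathbf{H}\,(\mathbf{a} - \mathbf{a}(k))$, where H is the Hessian matrix of second partial derivatives evaluated at a(k). • Substituting $\mathbf{a}(k+1) = \mathbf{a}(k) - \eta(k)\nabla J$: $J(\mathbf{a}(k+1)) \approx J(\mathbf{a}(k)) - \eta(k)\|\nabla J\|^2 + \frac{1}{2}\eta^2(k)\,\nabla J^t\mathbf{H}\nabla J$. • Minimized when $\eta(k) = \frac{\|\nabla J\|^2}{\nabla J^t\mathbf{H}\nabla J}$.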
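For a quadratic criterion the second-order expansion is exact, so the learning rate eta = ||grad J||^2 / (grad J^t H grad J) can be checked directly. The sketch below assumes a hypothetical criterion J(a) = 0.5 a^t H a - b^t a with a positive definite H and a nonzero gradient; all names and numbers are illustrative.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(H, v):
    return [dot(row, v) for row in H]


def J(H, b, a):
    """Quadratic criterion J(a) = 0.5 a^t H a - b^t a."""
    return 0.5 * dot(a, matvec(H, a)) - dot(b, a)


def optimal_step(H, b, a):
    """One descent step with eta = ||g||^2 / (g^t H g), where g = H a - b."""
    g = [hi - bi for hi, bi in zip(matvec(H, a), b)]
    eta = dot(g, g) / dot(g, matvec(H, g))  # assumes g != 0 and g^t H g > 0
    return eta, [ai - eta * gi for ai, gi in zip(a, g)]


H = [[2.0, 0.0], [0.0, 8.0]]
b = [2.0, 8.0]
a0 = [0.0, 0.0]
eta, a1 = optimal_step(H, b, a0)
```

Here the gradient at a0 is (-2, -8), which gives eta = 68/520. Any other step length along the same direction yields a larger J.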

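Newton Descent • For a nonsingular Hessian H: $\mathbf{a}(k+1) = \mathbf{a}(k) - \mathbf{H}^{-1}\nabla J$ • Usually converges in fewer steps than simple gradient descent, but each step is more expensive to compute because H must be inverted.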
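A minimal sketch of one Newton step, a(k+1) = a(k) - H^{-1} grad J, restricted to two dimensions so that the linear system can be solved by Cramer's rule without any library. The quadratic criterion J(a) = 0.5 a^t H a - b^t a and the numbers are hypothetical.

```python
def newton_step(H, grad, a):
    """Return a - H^{-1} grad for a 2x2 matrix H, using Cramer's rule."""
    (h11, h12), (h21, h22) = H
    det = h11 * h22 - h12 * h21
    if det == 0:
        raise ValueError("H is singular; the Newton step is undefined")
    g1, g2 = grad
    s1 = (g1 * h22 - h12 * g2) / det  # first component of H^{-1} grad
    s2 = (h11 * g2 - h21 * g1) / det  # second component of H^{-1} grad
    return [a[0] - s1, a[1] - s2]


# Quadratic criterion with gradient H a - b; its minimum solves H a = b.
H = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 4.0]
a0 = [5.0, -7.0]
grad0 = [H[0][0] * a0[0] + H[0][1] * a0[1] - b[0],
         H[1][0] * a0[0] + H[1][1] * a0[1] - b[1]]
a1 = newton_step(H, grad0, a0)
```

Because the criterion is exactly quadratic, a single step lands on the minimizer (1, 1). For a general J the step is repeated with H and the gradient re-evaluated at each iterate.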

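Perceptron Criterion Function • $J_p(\mathbf{a}) = \sum_{\mathbf{y} \in Y(\mathbf{a})} (-\mathbf{a}^t\mathbf{y})$, where Y(a) is the set of samples misclassified by a. • Since $\nabla J_p = \sum_{\mathbf{y} \in Y(\mathbf{a})} (-\mathbf{y})$, • Update rule: $\mathbf{a}(k+1) = \mathbf{a}(k) + \eta(k) \sum_{\mathbf{y} \in Y_k} \mathbf{y}$, where $Y_k$ is the set of samples misclassified by a(k).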
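The batch update can be sketched as follows, assuming augmented samples in which those from the second class have been multiplied by -1, so that a sample y counts as misclassified when a^t y <= 0. The fixed learning rate, the iteration cap, and the tiny data set are assumptions made for this example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def batch_perceptron(samples, eta=1.0, max_iter=1000):
    """Add eta times the sum of misclassified samples until none remain."""
    a = [0.0] * len(samples[0])
    for _ in range(max_iter):
        misclassified = [y for y in samples if dot(a, y) <= 0]
        if not misclassified:
            break
        for i in range(len(a)):
            a[i] += eta * sum(y[i] for y in misclassified)
    return a


# Hypothetical linearly separable data in two dimensions.
class1 = [[2.0, 2.0], [3.0, 1.0]]
class2 = [[0.0, 0.0], [-1.0, 1.0]]

# Augment with a leading 1, then negate the class-2 samples.
samples = ([[1.0] + x for x in class1]
           + [[-1.0] + [-v for v in x] for x in class2])
a = batch_perceptron(samples)
```

On this data the loop stops after two updates with a = (-1, 6, 2). If the samples are not linearly separable the loop only ends at max_iter, and the returned vector is then not a solution.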

Convergence Proof • Refer to pages 229 to 232 of the textbook.
