CS479

Presentation Transcript


    1. CS479/679 Pattern Recognition, Spring 2006 – Prof. Bebis. Linear Discriminant Functions, Chapter 5 (Duda et al.)

    2. Statistical vs Discriminant Approach Parametric/non-parametric density estimation techniques find the decision boundaries by first estimating the probability distribution of the patterns belonging to each class. In the discriminant-based approach, the decision boundary is constructed explicitly. Knowledge of the form of the probability distribution is not required.

    3. Discriminant Approach Classification is viewed as “learning good decision boundaries” that separate the examples belonging to different classes in a data set.

    4. Discriminant function estimation Specify a parametric form of the decision boundary (e.g., linear or quadratic). Find the “best” decision boundary of the specified form using a set of training examples. This is done by minimizing a criterion function, e.g., the “training error” (or “sample risk”).

    5. Linear Discriminant Functions A linear discriminant function is a linear combination of its components: g(x) = w^t x + w0, where w is the weight vector and w0 is the bias (or threshold weight).

    6. Linear Discriminant Functions: two category case Decide ω1 if g(x) > 0 and ω2 if g(x) < 0. If g(x) = 0, then x is on the decision boundary and can be assigned to either class.

    7. Linear Discriminant Functions: two category case (cont’d) If g(x) is linear, the decision boundary is a hyperplane. The orientation of the hyperplane is determined by w and its location by w0. w is the normal to the hyperplane. If w0=0, the hyperplane passes through the origin.

    8. Interpretation of g(x) g(x) provides an algebraic measure of the distance of x from the hyperplane.

    9. Interpretation of g(x) (cont’d) Write x = x_p + r (w/||w||), where x_p is the projection of x onto the hyperplane and r is the signed distance of x from it. Substitute this expression in g(x): since g(x_p) = 0, this gives g(x) = r ||w||, so the distance of x from the hyperplane is r = g(x)/||w||. Similarly, w0 determines the distance of the hyperplane from the origin: that (signed) distance is w0/||w||.
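A minimal sketch of these relations in Python/NumPy; the weight vector, bias, and sample values are made up for illustration, and the distance formulas are the ones reconstructed above.

```python
import numpy as np

# Hypothetical 2-D weight vector, bias, and sample.
w = np.array([2.0, 1.0])
w0 = -3.0
x = np.array([1.5, 2.0])

g = w @ x + w0                        # g(x) = w^t x + w0
label = "omega_1" if g > 0 else ("omega_2" if g < 0 else "either (on the boundary)")

r = g / np.linalg.norm(w)             # signed distance of x from the hyperplane
d0 = w0 / np.linalg.norm(w)           # signed distance of the hyperplane from the origin

print(g, label, r, d0)
```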

    10. Linear Discriminant Functions: multi-category case There are several ways to devise multi-category classifiers using linear discriminant functions: One against the rest (i.e., c-1 two-class problems)

    11. Linear Discriminant Functions: multi-category case (cont’d) One against another (i.e., c(c-1)/2 pairs of classes)

    12. Linear Discriminant Functions: multi-category case (cont’d) To avoid the problem of ambiguous regions: Define c linear discriminant functions gi(x). Assign x to ωi if gi(x) > gj(x) for all j ≠ i. The resulting classifier is called a linear machine.
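A small sketch of a linear machine, assuming the c weight vectors are stacked as the rows of a matrix W with a vector of biases w0; the 3-class data and all values are made up for illustration.

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class i that maximizes g_i(x) = w_i^t x + w_i0."""
    g = W @ x + w0                 # vector of the c discriminant values
    return int(np.argmax(g))

# Hypothetical 3-class problem in 2-D.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, -0.5, 0.2])
print(linear_machine(np.array([2.0, 0.3]), W, w0))   # index of the winning class
```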

    13. Linear Discriminant Functions: multi-category case (cont’d)

    14. Linear Discriminant Functions: multi-category case (cont’d) The boundary between two contiguous regions Ri and Rj is a portion of the hyperplane given by: gi(x) = gj(x), i.e., (wi − wj)^t x + (wi0 − wj0) = 0. The decision regions for a linear machine are convex.

    15. Higher order discriminant functions Can produce more complicated decision boundaries than linear discriminant functions.

    16. Higher order discriminant functions (cont’d) Generalized discriminant: g(x) = a^t y = Σ a_i y_i(x), the sum running over i = 1, ..., d̂, where a is a d̂-dimensional weight vector and the functions y_i(x) are called φ functions. The functions y_i(x) map points from the d-dimensional x-space to the d̂-dimensional y-space (usually d̂ >> d).

    17. Generalized discriminant functions The resulting discriminant function is not linear in x but it is linear in y. The generalized discriminant separates points in the transformed space by a hyperplane passing through the origin.

    18. Generalized discriminant functions (cont’d) Example: g(x) = a1 + a2 x + a3 x^2, with y = (1, x, x^2)^t. This maps a line in x-space to a parabola in y-space. The plane a^t y = 0 divides the y-space into two decision regions. The corresponding decision regions R1, R2 in the x-space are not simply connected!
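A sketch of this 1-D example: the φ-function mapping y(x) = (1, x, x^2)^t and the generalized discriminant g(x) = a^t y(x). The coefficient values are made up; with them, g(x) = x^2 − 1, so the positive region in x-space is the union of two disjoint intervals, illustrating the "not simply connected" remark.

```python
import numpy as np

def phi(x):
    # Map from 1-D x-space to 3-D y-space: y = (1, x, x^2)^t
    return np.array([1.0, x, x * x])

# Hypothetical weights: g(x) = -1 + 0*x + 1*x^2 = x^2 - 1
a = np.array([-1.0, 0.0, 1.0])

def g(x):
    # Linear in y, but non-linear (quadratic) in x
    return a @ phi(x)

# Positive region in x-space: {x : g(x) > 0} = (-inf, -1) U (1, inf)
for x in (-2.0, 0.0, 2.0):
    print(x, g(x) > 0)      # True, False, True -> two disconnected positive intervals
```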

    19. Generalized discriminant functions (cont’d)

    20. Generalized discriminant functions (cont’d) Practical issues: computationally intensive, and lots of training examples are required to determine a if d̂ is very large (i.e., curse of dimensionality).

    21. Notation: Augmented feature/weight vectors y = (1, x1, ..., xd)^t and a = (w0, w1, ..., wd)^t, so that g(x) = w^t x + w0 = a^t y.
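A minimal sketch of this augmentation in NumPy; the function name augment and the sample values are made up for illustration.

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to each sample: x -> y = (1, x1, ..., xd)."""
    X = np.atleast_2d(X)
    return np.hstack([np.ones((X.shape[0], 1)), X])

# With a = (w0, w1, ..., wd), g(x) = a^t y absorbs the bias into the weight vector.
X = np.array([[1.0, 2.0], [3.0, -1.0]])   # two 2-D samples (made up)
Y = augment(X)                            # shape (2, 3)
a = np.array([-3.0, 2.0, 1.0])            # (w0, w1, w2)
print(Y @ a)                              # g(x) for each sample
```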

    22. Two-Category, Linearly Separable Case Given a linear discriminant function g(x) = a^t y, the goal is to learn the weights using a set of n labeled samples (i.e., examples and their associated classes). Classification rule: if a^t yi > 0 assign yi to ω1, else if a^t yi < 0 assign yi to ω2.

    23. Two-Category, Linearly Separable Case (cont’d) Every training sample yi places a constraint on the weight vector a. Given n examples, the solution must lie on the intersection of n half-spaces.

    24. Two-Category, Linearly Separable Case (cont’d)

    25. Two-Category, Linearly Separable Case (cont’d)

    26. Iterative Optimization Define a criterion function J(a) that is minimized if a is a solution vector. Minimize J(a) iteratively ...

    27. Gradient Descent Gradient descent rule: a(k+1) = a(k) − η(k) ∇J(a(k)), where η(k) is the learning rate at step k.

    28. Gradient Descent (cont’d)

    29. Gradient Descent (cont’d)

    30. Gradient Descent (cont’d) How to choose the learning rate η(k)? From a second-order expansion of J, one choice is η(k) = ||∇J||^2 / (∇J^t H ∇J), where H is the Hessian of J. Note: if J(a) is quadratic, the learning rate is constant!
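A generic gradient-descent sketch for minimizing a criterion J(a), assuming a user-supplied gradient function and a fixed learning rate; the names grad_J, eta, and the quadratic test criterion are illustrative, not from the slides.

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, max_iter=1000, tol=1e-6):
    """a(k+1) = a(k) - eta * grad J(a(k)); stop when the step becomes small."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(a)
        a = a - step
        if np.linalg.norm(step) < tol:
            break
    return a

# Illustrative quadratic criterion J(a) = 1/2 ||a - c||^2, with gradient a - c.
c = np.array([1.0, -2.0])
print(gradient_descent(lambda a: a - c, np.zeros(2)))   # converges to c
```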

    31. Newton’s method Newton’s rule: a(k+1) = a(k) − H^{-1} ∇J(a(k)), where H is the Hessian matrix of J evaluated at a(k).
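A sketch of a single Newton step under the update reconstructed above; grad_J and hess_J are assumed callables, and the quadratic test case is made up (for a quadratic criterion, Newton's method reaches the minimum in one step).

```python
import numpy as np

def newton_step(a, grad_J, hess_J):
    """One Newton update: a <- a - H^{-1} grad J(a)."""
    return a - np.linalg.solve(hess_J(a), grad_J(a))

# Illustrative quadratic J(a) = 1/2 a^t Q a - b^t a, Q symmetric positive definite.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
a = newton_step(np.zeros(2), lambda a: Q @ a - b, lambda a: Q)
print(a, Q @ a - b)   # gradient ~ 0: minimum reached in a single step
```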

    32. Newton’s method (cont’d)

    33. Comparison: Gradient descent vs Newton’s method

    34. Perceptron rule Perceptron criterion: Jp(a) = Σ_{y in Y(a)} (−a^t y), where Y(a) is the set of samples misclassified by a. If Y(a) is empty, Jp(a) = 0; otherwise, Jp(a) > 0.

    35. Perceptron rule (cont’d) The gradient of Jp(a) is: ∇Jp = Σ_{y in Y(a)} (−y). The perceptron update rule is obtained using gradient descent: a(k+1) = a(k) + η(k) Σ_{y in Y(a(k))} y.
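A sketch of the batch perceptron update above, assuming the usual "normalization" convention that samples from ω2 are replaced by their negatives, so a solution must satisfy a^t y > 0 for every augmented sample; the data and the choice eta = 1 are made up.

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Y: (n, d+1) augmented samples, omega_2 rows already negated.
    Update: a(k+1) = a(k) + eta * sum of the currently misclassified samples."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]        # Y(a): samples with a^t y <= 0
        if len(misclassified) == 0:          # J_p(a) = 0: solution vector found
            return a
        a = a + eta * misclassified.sum(axis=0)
    return a

# Tiny linearly separable example (augmented; omega_2 rows negated).
Y = np.array([[ 1.0,  2.0,  1.0],    # omega_1 sample (1, x)
              [ 1.0,  1.5,  2.0],    # omega_1 sample
              [-1.0, -0.5, -0.3],    # omega_2 sample, negated
              [-1.0, -1.0, -0.8]])   # omega_2 sample, negated
a = batch_perceptron(Y)
print(a, (Y @ a > 0).all())          # True: all samples on the positive side
```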

    36. Perceptron rule (cont’d)

    37. Perceptron rule (cont’d) Move the hyperplane so that training samples are on its positive side.

    38. Perceptron rule (cont’d)

    39. Perceptron rule (cont’d)

    40. Perceptron rule (cont’d)

    41. Perceptron rule (cont’d) Some direct generalizations: variable increment η(k) and a margin b, i.e., the single-sample update a(k+1) = a(k) + η(k) y^k is applied whenever a(k)^t y^k ≤ b.
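A sketch of this single-sample rule with a margin, reusing the normalized augmented samples from the previous sketch; the constant increment eta and the margin value b are illustrative.

```python
import numpy as np

def perceptron_with_margin(Y, b=0.5, eta=1.0, max_epochs=1000):
    """Single-sample rule: a <- a + eta * y whenever a^t y <= b (margin b > 0)."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        updated = False
        for y in Y:
            if a @ y <= b:
                a = a + eta * y
                updated = True
        if not updated:              # every sample satisfies a^t y > b
            return a
    return a

Y = np.array([[ 1.0,  2.0,  1.0], [ 1.0,  1.5,  2.0],
              [-1.0, -0.5, -0.3], [-1.0, -1.0, -0.8]])
a = perceptron_with_margin(Y)
print((Y @ a > 0.5).all())           # all samples exceed the margin
```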

    42. Perceptron rule (cont’d)

    43. Perceptron rule (cont’d)

    44. Relaxation Procedures Note that different criterion functions exist. One possible choice is: Jq(a) = Σ_{y in Y} (a^t y)^2, where Y is again the set of training samples misclassified by a. However, there are two problems with this criterion: the function is too smooth and can converge to a = 0, and Jq is dominated by the training samples with the largest magnitude.

    45. Relaxation Procedures (cont’d) A modified version that avoids the above two problems is Jr(a) = (1/2) Σ_{y in Y} (a^t y − b)^2 / ||y||^2. Here Y is the set of samples for which a^t y ≤ b. Its gradient is given by ∇Jr = Σ_{y in Y} ((a^t y − b)/||y||^2) y.
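A sketch of a single-sample relaxation update derived from this gradient: whenever a^t y ≤ b, take a <- a + eta * ((b − a^t y)/||y||^2) y, with 0 < eta < 2. The data (reused from the perceptron sketches), margin b, and eta are illustrative.

```python
import numpy as np

def relaxation_single_sample(Y, b=1.0, eta=1.8, epochs=100):
    """Single-sample relaxation with margin; 0 < eta < 2."""
    a = np.zeros(Y.shape[1])
    for _ in range(epochs):
        for y in Y:
            if a @ y <= b:
                a = a + eta * ((b - a @ y) / (y @ y)) * y
    return a

Y = np.array([[ 1.0,  2.0,  1.0], [ 1.0,  1.5,  2.0],
              [-1.0, -0.5, -0.3], [-1.0, -1.0, -0.8]])
a = relaxation_single_sample(Y)
print(Y @ a)   # for separable data these values approach (or exceed) the margin b
```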

    46. Relaxation Procedures (cont’d)

    47. Relaxation Procedures (cont’d)

    48. Relaxation Procedures (cont’d)

    49. Relaxation Procedures (cont’d)

    50. Minimum Squared Error Procedures Minimum squared error and pseudoinverse: the problem is to find a weight vector a satisfying Ya = b. If we have more equations than unknowns, the system is over-determined; we then choose the a that minimizes the sum-of-squared-error criterion function Js(a) = ||Ya − b||^2.

    51. Minimum Squared Error Procedures (cont’d) Pseudoinverse: setting ∇Js = 2 Y^t (Ya − b) = 0 gives Y^t Y a = Y^t b, so a = (Y^t Y)^{-1} Y^t b = Y† b, where Y† = (Y^t Y)^{-1} Y^t is the pseudoinverse of Y.
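A sketch of the MSE solution via the pseudoinverse, a = Y† b; np.linalg.pinv computes Y†, and np.linalg.lstsq gives an equivalent least-squares solution. The samples (reused from the earlier sketches) and the choice b = 1 for every sample are illustrative.

```python
import numpy as np

# Normalized augmented samples (omega_2 rows negated), as in the earlier sketches.
Y = np.array([[ 1.0,  2.0,  1.0], [ 1.0,  1.5,  2.0],
              [-1.0, -0.5, -0.3], [-1.0, -1.0, -0.8]])
b = np.ones(len(Y))                  # target margins (all set to 1 here)

a = np.linalg.pinv(Y) @ b            # a = Y† b minimizes ||Ya - b||^2
# Equivalent: a, *_ = np.linalg.lstsq(Y, b, rcond=None)
print(a, Y @ a)                      # Ya should be close to b
```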

    52. Minimum Squared Error Procedures (cont’d)
