CS479

Presentation Transcript


    1. CS479/679 Pattern Recognition, Spring 2006 – Prof. Bebis. Linear Discriminant Functions, Chapter 5 (Duda et al.)

    2. Statistical vs Discriminant Approach Parametric/non-parametric density estimation techniques find the decision boundaries by first estimating the probability distribution of the patterns belonging to each class. In the discriminant-based approach, the decision boundary is constructed explicitly. Knowledge of the form of the probability distribution is not required.

    3. Discriminant Approach Classification is viewed as “learning good decision boundaries” that separate the examples belonging to different classes in a data set.

    4. Discriminant function estimation Specify a parametric form of the decision boundary (e.g., linear or quadratic). Find the “best” decision boundary of the specified form using a set of training examples. This is done by minimizing a criterion function, e.g., the “training error” (or “sample risk”).

    5. Linear Discriminant Functions A linear discriminant function is a linear combination of its components: g(x) = w^t x + w0, where w is the weight vector and w0 is the bias (or threshold weight).

    6. Linear Discriminant Functions: two category case Decide ω1 if g(x) > 0 and ω2 if g(x) < 0. If g(x) = 0, then x is on the decision boundary and can be assigned to either class.

    7. Linear Discriminant Functions: two category case (cont’d) If g(x) is linear, the decision boundary is a hyperplane. The orientation of the hyperplane is determined by w and its location by w0. w is the normal to the hyperplane. If w0=0, the hyperplane passes through the origin.

    8. Interpretation of g(x) g(x) provides an algebraic measure of the distance of x from the hyperplane.

    9. Interpretation of g(x) (cont’d) Write x = x_p + r (w/||w||), where x_p is the projection of x onto the hyperplane and r is the signed distance of x from it. Substitute this expression in g(x): since g(x_p) = 0, this gives g(x) = r ||w||, so the distance of x from the hyperplane is r = g(x)/||w||. Similarly, w0 determines the distance of the hyperplane from the origin: that (signed) distance is w0/||w||.
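A minimal sketch of these relations in Python/NumPy; the weight vector, bias, and sample values are made up for illustration, and the distance formulas are the ones reconstructed above.

```python
import numpy as np

# Hypothetical 2-D weight vector, bias, and sample.
w = np.array([2.0, 1.0])
w0 = -3.0
x = np.array([1.5, 2.0])

g = w @ x + w0                        # g(x) = w^t x + w0
label = "omega_1" if g > 0 else ("omega_2" if g < 0 else "either (on the boundary)")

r = g / np.linalg.norm(w)             # signed distance of x from the hyperplane
d0 = w0 / np.linalg.norm(w)           # signed distance of the hyperplane from the origin

print(g, label, r, d0)
```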

    10. Linear Discriminant Functions: multi-category case There are several ways to devise multi-category classifiers using linear discriminant functions: One against the rest (i.e., c-1 two-class problems)

    11. Linear Discriminant Functions: multi-category case (cont’d) One against another (i.e., c(c-1)/2 pairs of classes)

    12. Linear Discriminant Functions: multi-category case (cont’d) To avoid the problem of ambiguous regions: Define c linear discriminant functions gi(x). Assign x to ωi if gi(x) > gj(x) for all j ≠ i. The resulting classifier is called a linear machine.
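A small sketch of a linear machine, assuming the c weight vectors are stacked as the rows of a matrix W with a vector of biases w0; the 3-class data and all values are made up for illustration.

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class i that maximizes g_i(x) = w_i^t x + w_i0."""
    g = W @ x + w0                 # vector of the c discriminant values
    return int(np.argmax(g))

# Hypothetical 3-class problem in 2-D.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, -0.5, 0.2])
print(linear_machine(np.array([2.0, 0.3]), W, w0))   # index of the winning class
```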

    13. Linear Discriminant Functions: multi-category case (cont’d)

    14. Linear Discriminant Functions: multi-category case (cont’d) The boundary between two contiguous regions Ri and Rj is a portion of the hyperplane given by: gi(x) = gj(x), i.e., (wi − wj)^t x + (wi0 − wj0) = 0. The decision regions for a linear machine are convex.

    15. Higher order discriminant functions Can produce more complicated decision boundaries than linear discriminant functions.

    16. Higher order discriminant functions (cont’d) Generalized discriminant: g(x) = a^t y = Σ a_i y_i(x), the sum running over i = 1, ..., d̂, where a is a d̂-dimensional weight vector and the functions y_i(x) are called φ functions. The functions y_i(x) map points from the d-dimensional x-space to the d̂-dimensional y-space (usually d̂ >> d).

    17. Generalized discriminant functions The resulting discriminant function is not linear in x but it is linear in y. The generalized discriminant separates points in the transformed space by a hyperplane passing through the origin.

    18. Generalized discriminant functions (cont’d) Example: g(x) = a1 + a2 x + a3 x^2, with y = (1, x, x^2)^t. This maps a line in x-space to a parabola in y-space. The plane a^t y = 0 divides the y-space into two decision regions. The corresponding decision regions R1, R2 in the x-space are not simply connected!
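A sketch of this 1-D example: the φ-function mapping y(x) = (1, x, x^2)^t and the generalized discriminant g(x) = a^t y(x). The coefficient values are made up; with them, g(x) = x^2 − 1, so the positive region in x-space is the union of two disjoint intervals, illustrating the "not simply connected" remark.

```python
import numpy as np

def phi(x):
    # Map from 1-D x-space to 3-D y-space: y = (1, x, x^2)^t
    return np.array([1.0, x, x * x])

# Hypothetical weights: g(x) = -1 + 0*x + 1*x^2 = x^2 - 1
a = np.array([-1.0, 0.0, 1.0])

def g(x):
    # Linear in y, but non-linear (quadratic) in x
    return a @ phi(x)

# Positive region in x-space: {x : g(x) > 0} = (-inf, -1) U (1, inf)
for x in (-2.0, 0.0, 2.0):
    print(x, g(x) > 0)      # True, False, True -> two disconnected positive intervals
```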

    19. Generalized discriminant functions (cont’d)

    20. Generalized discriminant functions (cont’d) Practical issues: computationally intensive, and lots of training examples are required to determine a if d̂ is very large (i.e., curse of dimensionality).

    21. Notation: Augmented feature/weight vectors y = (1, x1, ..., xd)^t and a = (w0, w1, ..., wd)^t, so that g(x) = w^t x + w0 = a^t y.
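A minimal sketch of this augmentation in NumPy; the function name augment and the sample values are made up for illustration.

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to each sample: x -> y = (1, x1, ..., xd)."""
    X = np.atleast_2d(X)
    return np.hstack([np.ones((X.shape[0], 1)), X])

# With a = (w0, w1, ..., wd), g(x) = a^t y absorbs the bias into the weight vector.
X = np.array([[1.0, 2.0], [3.0, -1.0]])   # two 2-D samples (made up)
Y = augment(X)                            # shape (2, 3)
a = np.array([-3.0, 2.0, 1.0])            # (w0, w1, w2)
print(Y @ a)                              # g(x) for each sample
```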

    22. Two-Category, Linearly Separable Case Given a linear discriminant function g(x) = a^t y, the goal is to learn the weights using a set of n labeled samples (i.e., examples and their associated classes). Classification rule: if a^t yi > 0 assign yi to ω1, else if a^t yi < 0 assign yi to ω2.

    23. Two-Category, Linearly Separable Case (cont’d) Every training sample yi places a constraint on the weight vector a. Given n examples, the solution must lie on the intersection of n half-spaces.

    24. Two-Category, Linearly Separable Case (cont’d)

    25. Two-Category, Linearly Separable Case (cont’d)

    26. Iterative Optimization Define a criterion function J(a) that is minimized if a is a solution vector. Minimize J(a) iteratively ...

    27. Gradient Descent Gradient descent rule: a(k+1) = a(k) − η(k) ∇J(a(k)), where η(k) is the learning rate at step k.

    28. Gradient Descent (cont’d)

    29. Gradient Descent (cont’d)

    30. Gradient Descent (cont’d) How to choose the learning rate η(k)? From a second-order expansion of J, one choice is η(k) = ||∇J||^2 / (∇J^t H ∇J), where H is the Hessian of J. Note: if J(a) is quadratic, the learning rate is constant!
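A generic gradient-descent sketch for minimizing a criterion J(a), assuming a user-supplied gradient function and a fixed learning rate; the names grad_J, eta, and the quadratic test criterion are illustrative, not from the slides.

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, max_iter=1000, tol=1e-6):
    """a(k+1) = a(k) - eta * grad J(a(k)); stop when the step becomes small."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(a)
        a = a - step
        if np.linalg.norm(step) < tol:
            break
    return a

# Illustrative quadratic criterion J(a) = 1/2 ||a - c||^2, with gradient a - c.
c = np.array([1.0, -2.0])
print(gradient_descent(lambda a: a - c, np.zeros(2)))   # converges to c
```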

    31. Newton’s method Newton’s rule: a(k+1) = a(k) − H^{-1} ∇J(a(k)), where H is the Hessian matrix of J evaluated at a(k).
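A sketch of a single Newton step under the update reconstructed above; grad_J and hess_J are assumed callables, and the quadratic test case is made up (for a quadratic criterion, Newton's method reaches the minimum in one step).

```python
import numpy as np

def newton_step(a, grad_J, hess_J):
    """One Newton update: a <- a - H^{-1} grad J(a)."""
    return a - np.linalg.solve(hess_J(a), grad_J(a))

# Illustrative quadratic J(a) = 1/2 a^t Q a - b^t a, Q symmetric positive definite.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
a = newton_step(np.zeros(2), lambda a: Q @ a - b, lambda a: Q)
print(a, Q @ a - b)   # gradient ~ 0: minimum reached in a single step
```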

    32. Newton’s method (cont’d)

    33. Comparison: Gradient descent vs Newton’s method

    34. Perceptron rule Perceptron criterion: Jp(a) = Σ_{y in Y(a)} (−a^t y), where Y(a) is the set of samples misclassified by a. If Y(a) is empty, Jp(a) = 0; otherwise, Jp(a) > 0.

    35. Perceptron rule (cont’d) The gradient of Jp(a) is: ∇Jp = Σ_{y in Y(a)} (−y). The perceptron update rule is obtained using gradient descent: a(k+1) = a(k) + η(k) Σ_{y in Y(a(k))} y.
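A sketch of the batch perceptron update above, assuming the usual "normalization" convention that samples from ω2 are replaced by their negatives, so a solution must satisfy a^t y > 0 for every augmented sample; the data and the choice eta = 1 are made up.

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Y: (n, d+1) augmented samples, omega_2 rows already negated.
    Update: a(k+1) = a(k) + eta * sum of the currently misclassified samples."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]        # Y(a): samples with a^t y <= 0
        if len(misclassified) == 0:          # J_p(a) = 0: solution vector found
            return a
        a = a + eta * misclassified.sum(axis=0)
    return a

# Tiny linearly separable example (augmented; omega_2 rows negated).
Y = np.array([[ 1.0,  2.0,  1.0],    # omega_1 sample (1, x)
              [ 1.0,  1.5,  2.0],    # omega_1 sample
              [-1.0, -0.5, -0.3],    # omega_2 sample, negated
              [-1.0, -1.0, -0.8]])   # omega_2 sample, negated
a = batch_perceptron(Y)
print(a, (Y @ a > 0).all())          # True: all samples on the positive side
```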

    36. Perceptron rule (cont’d)

    37. Perceptron rule (cont’d) Move the hyperplane so that training samples are on its positive side.

    38. Perceptron rule (cont’d)

    39. Perceptron rule (cont’d)

    40. Perceptron rule (cont’d)

    41. Perceptron rule (cont’d) Some direct generalizations: variable increment η(k) and a margin b, i.e., the single-sample update a(k+1) = a(k) + η(k) y^k is applied whenever a(k)^t y^k ≤ b.
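A sketch of this single-sample rule with a margin, reusing the normalized augmented samples from the previous sketch; the constant increment eta and the margin value b are illustrative.

```python
import numpy as np

def perceptron_with_margin(Y, b=0.5, eta=1.0, max_epochs=1000):
    """Single-sample rule: a <- a + eta * y whenever a^t y <= b (margin b > 0)."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        updated = False
        for y in Y:
            if a @ y <= b:
                a = a + eta * y
                updated = True
        if not updated:              # every sample satisfies a^t y > b
            return a
    return a

Y = np.array([[ 1.0,  2.0,  1.0], [ 1.0,  1.5,  2.0],
              [-1.0, -0.5, -0.3], [-1.0, -1.0, -0.8]])
a = perceptron_with_margin(Y)
print((Y @ a > 0.5).all())           # all samples exceed the margin
```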

    42. Perceptron rule (cont’d)

    43. Perceptron rule (cont’d)

    44. Relaxation Procedures Note that different criterion functions exist. One possible choice is: Jq(a) = Σ_{y in Y} (a^t y)^2, where Y is again the set of training samples misclassified by a. However, there are two problems with this criterion: the function is too smooth and can converge to a = 0, and Jq is dominated by the training samples with the largest magnitude.

    45. Relaxation Procedures (cont’d) A modified version that avoids the above two problems is Jr(a) = (1/2) Σ_{y in Y} (a^t y − b)^2 / ||y||^2. Here Y is the set of samples for which a^t y ≤ b. Its gradient is given by ∇Jr = Σ_{y in Y} ((a^t y − b)/||y||^2) y.
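A sketch of a single-sample relaxation update derived from this gradient: whenever a^t y ≤ b, take a <- a + eta * ((b − a^t y)/||y||^2) y, with 0 < eta < 2. The data (reused from the perceptron sketches), margin b, and eta are illustrative.

```python
import numpy as np

def relaxation_single_sample(Y, b=1.0, eta=1.8, epochs=100):
    """Single-sample relaxation with margin; 0 < eta < 2."""
    a = np.zeros(Y.shape[1])
    for _ in range(epochs):
        for y in Y:
            if a @ y <= b:
                a = a + eta * ((b - a @ y) / (y @ y)) * y
    return a

Y = np.array([[ 1.0,  2.0,  1.0], [ 1.0,  1.5,  2.0],
              [-1.0, -0.5, -0.3], [-1.0, -1.0, -0.8]])
a = relaxation_single_sample(Y)
print(Y @ a)   # for separable data these values approach (or exceed) the margin b
```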

    46. Relaxation Procedures (cont’d)

    47. Relaxation Procedures (cont’d)

    48. Relaxation Procedures (cont’d)

    49. Relaxation Procedures (cont’d)

    50. Minimum Squared Error Procedures Minimum squared error and pseudoinverse: the problem is to find a weight vector a satisfying Ya = b. If we have more equations than unknowns, the system is over-determined; we then choose the a that minimizes the sum-of-squared-error criterion function Js(a) = ||Ya − b||^2.

    51. Minimum Squared Error Procedures (cont’d) Pseudoinverse: setting ∇Js = 2 Y^t (Ya − b) = 0 gives Y^t Y a = Y^t b, so a = (Y^t Y)^{-1} Y^t b = Y† b, where Y† = (Y^t Y)^{-1} Y^t is the pseudoinverse of Y.
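A sketch of the MSE solution via the pseudoinverse, a = Y† b; np.linalg.pinv computes Y†, and np.linalg.lstsq gives an equivalent least-squares solution. The samples (reused from the earlier sketches) and the choice b = 1 for every sample are illustrative.

```python
import numpy as np

# Normalized augmented samples (omega_2 rows negated), as in the earlier sketches.
Y = np.array([[ 1.0,  2.0,  1.0], [ 1.0,  1.5,  2.0],
              [-1.0, -0.5, -0.3], [-1.0, -1.0, -0.8]])
b = np.ones(len(Y))                  # target margins (all set to 1 here)

a = np.linalg.pinv(Y) @ b            # a = Y† b minimizes ||Ya - b||^2
# Equivalent: a, *_ = np.linalg.lstsq(Y, b, rcond=None)
print(a, Y @ a)                      # Ya should be close to b
```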

    52. Minimum Squared Error Procedures (cont’d)
