
Linear Discriminant Functions


Presentation Transcript


  1. Linear Discriminant Functions. Presenter: 虞台文

  2. Contents • Introduction • Linear Discriminant Functions and Decision Surface • Linear Separability • Learning • Gradient Descent Algorithm • Newton's Method

  3. Linear Discriminant Functions: Introduction

  4. Decision-Making Approaches • Probabilistic Approaches • Based on the underlying probability densities of the training sets. • For example, the Bayesian decision rule assumes that the underlying probability densities are available. • Discriminating Approaches • Assume that the proper forms of the discriminant functions are known. • Use the samples to estimate the values of the parameters of the classifier.

  5. Linear Discriminating Functions • Easy to compute, analyze, and learn. • Linear classifiers are attractive candidates as initial, trial classifiers. • Learning is done by minimizing a criterion function, e.g., the training error. • Difficulty: a small training error does not guarantee a small test error.

  6. Linear Discriminant Functions: Linear Discriminant Functions and Decision Surface

  7. Linear Discriminant Functions. The two-category classification: g(x) = wᵀx + w0, where w is the weight vector and w0 is the bias or threshold. Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
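
As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the two-category rule above; the weight values are arbitrary placeholders.

```python
import numpy as np

# Hypothetical weights for a 2-D feature space (illustrative values only).
w = np.array([2.0, -1.0])   # weight vector
w0 = 0.5                    # bias / threshold

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return w @ x + w0

def classify(x):
    """Decide class 1 if g(x) > 0, class 2 if g(x) < 0."""
    return 1 if g(x) > 0 else 2

print(classify(np.array([1.0, 0.0])))   # g = 2.5 > 0  -> class 1
print(classify(np.array([0.0, 2.0])))   # g = -1.5 < 0 -> class 2
```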

  8. 1 w0 1 w2 wn w1 x1 x2 xd Implementation

  9. Implementation. [figure: the same unit redrawn with an augmented input x0 = 1, so the bias w0 becomes an ordinary weight]

  10. 1 w0 1 x1 x2 xd Decision Surface w2 wn w1 x0=1

  11. Decision Surface. [figure: the hyperplane g(x) = 0 in feature space, with normal vector w and distance w0/||w|| from the origin to the surface]

  12. Decision Surface. [figure: the hyperplane g(x) = 0 with normal vector w at distance w0/||w|| from the origin] 1. A linear discriminant function divides the feature space by a hyperplane. 2. The orientation of the surface is determined by the normal vector w. 3. The location of the surface is determined by the bias w0.
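
A small numerical check of this geometry (not from the slides); the weights are the same hypothetical values used above, and signed_distance is an illustrative helper.

```python
import numpy as np

# Same hypothetical weights as in the earlier sketch.
w = np.array([2.0, -1.0])
w0 = 0.5

def signed_distance(x):
    """Signed distance of x from the hyperplane g(x) = 0 (positive on the w side)."""
    return (w @ x + w0) / np.linalg.norm(w)

print(signed_distance(np.zeros(2)))      # distance of the origin from the surface
print(abs(w0) / np.linalg.norm(w))       # matches the slide's w0 / ||w||
```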

  13. Augmented Space: from the d-dimensional feature space to the (d+1)-dimensional augmented space. Let g(x) = w1x1 + w2x2 + w0 = 0 be the decision surface. [figure: the decision surface in the (x1, x2) feature plane with normal vector w]

  14. Augmented Space. Augmented feature vector: y = (1, x1, ..., xd)ᵀ. Augmented weight vector: a = (w0, w1, ..., wd)ᵀ. The decision surface g(x) = w1x1 + w2x2 + w0 = 0 in the d-dimensional feature space becomes aᵀy = 0 in the (d+1)-dimensional augmented space, which passes through the origin of the augmented space.

  15. Augmented Space. [figure: the decision surface g(x) = w1x1 + w2x2 + w0 = 0 in the (x1, x2) feature plane, with normal w, corresponds to the plane aᵀy = 0 through the origin of the augmented (y0, y1, y2) space, with normal a; the feature plane is embedded at y0 = 1]

  16. Augmented Space • Decision surface in feature space: g(x) = wᵀx + w0 = 0, which passes through the origin only when w0 = 0. • Decision surface in augmented space: aᵀy = 0, which always passes through the origin. • By using this mapping, the problem of finding the weight vector w and the threshold w0 is reduced to finding a single vector a.
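
A minimal sketch (not part of the slides) of the augmentation mapping, assuming the definitions y = (1, x) and a = (w0, w) given above; the helper name and numeric values are illustrative.

```python
import numpy as np

def augment(x):
    """Map a d-dimensional feature vector x to the (d+1)-dimensional y = (1, x1, ..., xd)."""
    return np.concatenate(([1.0], x))

# Hypothetical weights (same illustrative values as before).
w, w0 = np.array([2.0, -1.0]), 0.5
a = np.concatenate(([w0], w))                   # augmented weight vector a = (w0, w1, ..., wd)

x = np.array([1.0, 0.0])
print(np.isclose(w @ x + w0, a @ augment(x)))   # g(x) = a^T y  -> True
```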

  17. Linear Discriminant Functions Linear Separability

  18. 2 2 1 1 The Two-Category Case Not Linearly Separable Linearly Separable

  19. The Two-Category Case. Given a set of samples y1, y2, ..., yn, some labeled ω1 and some labeled ω2, if there exists a vector a such that aᵀyi > 0 if yi is labeled ω1 and aᵀyi < 0 if yi is labeled ω2, then the samples are said to be linearly separable. How to find a?

  20. Normalization. Given a set of samples y1, y2, ..., yn, some labeled ω1 and some labeled ω2, withdraw all labels and replace the samples labeled ω2 by their negatives. It is then equivalent to finding a vector a such that aᵀyi > 0 for every sample. How to find a?
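
A short sketch (not from the slides) of this normalization step; normalize_samples and the toy samples are hypothetical.

```python
import numpy as np

def normalize_samples(ys, labels):
    """Replace samples labeled 2 by their negatives, so that linear separability
    becomes: find a with a^T y_i > 0 for every (normalized) sample."""
    return np.array([y if lab == 1 else -y for y, lab in zip(ys, labels)])

# Toy augmented samples (hypothetical values) and their class labels.
ys = np.array([[1.0, 2.0, 1.0], [1.0, -1.0, -2.0]])
labels = [1, 2]
print(normalize_samples(ys, labels))    # second sample is negated
```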

  21. Solution Region in Feature Space. [figure: normalized case; separating planes and the corresponding solution region shown in feature space]

  22. Solution Region in Weight Space. [figure: the separating-plane constraints in weight space; introducing a margin shrinks the solution region]

  23. Linear Discriminant Functions: Learning

  24. Criterion Function • To facilitate learning, we usually define a scalar criterion function. • It usually represents the penalty or cost of a solution. • Our goal is to minimize its value. • This is equivalent to function optimization.

  25. Learning Algorithms • To design a learning algorithm, we face the following problems: • Whether to stop? (Is the criterion satisfactory?) • In what direction to proceed? (e.g., steepest descent.) • How long a step to take? (η: the learning rate.)

  26. Criterion Functions: The Two-Category Case. J(a) = number of misclassified patterns. [figure: J(a) is piecewise constant; away from the solution state it gives no information on where to go] Is this criterion appropriate?

  27. Criterion Functions: The Two-Category Case. Perceptron criterion function: Jp(a) = Σ over y in Y of (-aᵀy), where Y is the set of misclassified patterns. [figure: Jp(a) is piecewise linear and zero on the solution region] Is this criterion better? What problem does it have?
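
The following sketch (not part of the slides) evaluates the perceptron criterion and takes one steepest-descent step on it, i.e., the batch perceptron update; the function names, toy samples, and the rate eta are illustrative assumptions.

```python
import numpy as np

def perceptron_criterion(a, ys):
    """Jp(a) = sum over misclassified y of (-a^T y); ys are normalized augmented samples."""
    scores = ys @ a
    mis = scores <= 0                                   # misclassified samples
    return -scores[mis].sum(), -ys[mis].sum(axis=0)     # Jp(a) and its gradient

def perceptron_step(a, ys, eta=1.0):
    """One steepest-descent step on Jp (batch perceptron); eta is a hypothetical rate."""
    _, grad = perceptron_criterion(a, ys)
    return a - eta * grad                               # a <- a + eta * (sum of misclassified y)

# Toy normalized samples (hypothetical) and an initial weight vector.
ys = np.array([[1.0, 2.0, 1.0], [-1.0, 1.0, 2.0]])
a = np.zeros(3)
for _ in range(10):
    a = perceptron_step(a, ys)
print(a, ys @ a)    # all scores positive once a separates the samples
```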

  28. Criterion Functions: The Two-Category Case. A relative of the perceptron criterion function: Jq(a) = Σ over y in Y of (aᵀy)², where Y is the set of misclassified patterns. [figure: Jq(a) is continuous and smooth] Is this criterion much better? What problem does it have?

  29. Criterion Functions: The Two-Category Case. [equation from the slide not captured in the transcript; Y is the set of misclassified patterns] What is the difference from the previous one? Is this criterion good enough? Are there others?

  30. Learning: Gradient Descent Algorithm

  31. Gradient Descent Algorithm. Our goal is to go downhill. [figure: the surface J(a) over (a1, a2) and its contour map]

  32. Gradient Descent Algorithm. Our goal is to go downhill. Define the gradient ∇aJ = (∂J/∂a1, ∂J/∂a2, ...)ᵀ; ∇aJ is a vector.

  33. Gradient Descent Algorithm. Our goal is to go downhill. For a small change Δa, ΔJ ≈ (∇aJ)ᵀΔa, so moving in the direction -∇aJ gives the steepest local decrease of J.

  34. Gradient Descent Algorithm. Our goal is to go downhill. [figure: at the point (a1, a2), the negative gradient -∇aJ points the way to go] How long a step shall we take?

  35. Gradient Descent Algorithm. Initial setting: a, θ, k = 0. do: k ← k + 1; a ← a - η(k) ∇aJ(a); until |∇aJ(a)| < θ. Here η(k) is the learning rate.
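
A minimal Python rendering (not from the slides) of this pseudocode, assuming a constant learning rate for simplicity; gradient_descent and the example criterion are illustrative.

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, theta=1e-6, max_iter=10_000):
    """Steepest descent as in the pseudocode above: repeat a <- a - eta * grad J(a)
    until |grad J(a)| < theta.  A constant eta is used instead of a schedule eta(k)."""
    a, k = np.asarray(a0, dtype=float), 0
    while True:
        k += 1
        a = a - eta * grad_J(a)
        if np.linalg.norm(grad_J(a)) < theta or k >= max_iter:
            break
    return a, k

# Example criterion J(a) = ||a - 1||^2, whose gradient is 2(a - 1).
a_min, iters = gradient_descent(lambda a: 2.0 * (a - 1.0), np.zeros(2))
print(a_min, iters)   # converges to (1, 1)
```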

  36. Trajectory of Steepest Descent. If an improper learning rate η(k) is used, the convergence rate may be poor. • Too small: slow convergence. • Too large: overshooting. Furthermore, the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent. [figure: zig-zag trajectory from a0 toward a* inside the basin of attraction, with contours J(a0), J(a1)]

  37. Learning Rate. Paraboloid: f(x) = ½ xᵀQx - bᵀx + c, where Q is symmetric and positive definite.

  38. Learning Rate. Paraboloid: all smooth functions can be approximated by paraboloids in a sufficiently small neighborhood of any point.

  39. Learning Rate. Paraboloid: we will discuss the convergence properties using paraboloids. Global minimum x*: set ∇f(x) = Qx - b = 0, which gives x* = Q⁻¹b.
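
A quick numerical check (not part of the slides), assuming the paraboloid form f(x) = ½ xᵀQx - bᵀx + c used above; Q and b are arbitrary illustrative values.

```python
import numpy as np

# Hypothetical paraboloid f(x) = 0.5 x^T Q x - b^T x + c with Q symmetric positive definite.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

x_star = np.linalg.solve(Q, b)            # global minimum: solve Q x* = b
print(x_star)
print(np.allclose(Q @ x_star - b, 0.0))   # the gradient Qx - b vanishes at x*
```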

  40. Learning Rate. Paraboloid: define the error E(x) = f(x) - f(x*) = ½ (x - x*)ᵀQ(x - x*).

  41. Learning Rate. Paraboloid: define y = x - x*, so the error becomes E(y) = ½ yᵀQy. We want to minimize E(y), and clearly E(y) ≥ 0, with the minimum E = 0 attained at y = 0 (i.e., x = x*).

  42. Learning Rate. Suppose that we are at yk. Let gk = ∇E(yk) = Qyk. The steepest descent direction will be pk = -gk. Let the learning rate be ηk. That is, yk+1 = yk + ηk pk.

  43. Learning Rate. We'll choose the most 'favorable' ηk. That is, setting dE(yk + ηk pk)/dηk = 0, we get ηk = (gkᵀgk)/(gkᵀQgk).

  44. Learning Rate. If Q = I, then ηk = 1 and yk+1 = yk - gk = yk - yk = 0: the minimum is reached in a single step.
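
A small sketch (not from the slides) of steepest descent on E(y) = ½ yᵀQy with the optimal rate derived above; it confirms one-step convergence when Q = I. The function name and starting point are illustrative.

```python
import numpy as np

def steepest_descent_quadratic(Q, y0, n_steps=10):
    """Steepest descent on E(y) = 0.5 y^T Q y with the optimal learning rate
    eta_k = (g_k^T g_k) / (g_k^T Q g_k)."""
    y = np.asarray(y0, dtype=float)
    for _ in range(n_steps):
        g = Q @ y                         # gradient of E at y
        if np.allclose(g, 0.0):
            break
        eta = (g @ g) / (g @ Q @ g)       # optimal step length
        y = y - eta * g
    return y

# With Q = I the optimal rate is eta = 1, so the minimum y = 0 is reached in one step.
print(steepest_descent_quadratic(np.eye(2), [3.0, -4.0], n_steps=1))    # -> [0. 0.]
```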

  45. Convergence Rate. With the optimal ηk, E(yk+1) = [1 - (gkᵀgk)² / ((gkᵀQgk)(gkᵀQ⁻¹gk))] E(yk).

  46. Convergence Rate. [Kantorovich Inequality] Let Q be a positive definite, symmetric, n×n matrix. For any vector y there holds (yᵀy)² / ((yᵀQy)(yᵀQ⁻¹y)) ≥ 4λ1λn / (λ1 + λn)², where λ1 ≤ λ2 ≤ ... ≤ λn are the eigenvalues of Q. The smaller the resulting error ratio E(yk+1)/E(yk), the better.

  47. Convergence Rate. Combining the two results, E(yk+1) ≤ ((λn - λ1)/(λn + λ1))² E(yk). The condition number κ = λn/λ1 governs the rate: κ close to 1 gives faster convergence; a large κ gives slower convergence.
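
A numerical illustration (not part of the slides) of this bound for a hypothetical ill-conditioned Q; the starting point is chosen so that the worst-case ratio is attained.

```python
import numpy as np

def error(Q, y):
    """E(y) = 0.5 y^T Q y."""
    return 0.5 * y @ Q @ y

# Hypothetical diagonal Q with eigenvalues 1 and 100 (condition number 100).
Q = np.diag([1.0, 100.0])
lam1, lamn = 1.0, 100.0
bound = ((lamn - lam1) / (lamn + lam1)) ** 2    # worst-case ratio E(y_{k+1}) / E(y_k)

# One steepest-descent step with the optimal rate from this worst-case starting point.
y = np.array([1.0 / lam1, 1.0 / lamn])
g = Q @ y
eta = (g @ g) / (g @ Q @ g)
y_next = y - eta * g
print(error(Q, y_next) / error(Q, y), bound)    # both about 0.96: slow convergence
```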

  48. Convergence Rate. Paraboloid: suppose we are now at xk. Updating rule: xk+1 = xk - ηk gk, with gk = ∇f(xk) = Qxk - b and ηk = (gkᵀgk)/(gkᵀQgk).

  49. Trajectory of Steepest Descent. In this case, the condition number of Q is moderately large. One can then see that the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.

  50. Learning: Newton's Method
