
Support Vector Machines




Presentation Transcript


  1. Support Vector Machines 1. Introduction to SVMs 2. Linear SVMs 3. Non-linear SVMs • References: • 1. S.Y. Kung, M.W. Mak, and S.H. Lin. Biometric Authentication: A Machine Learning Approach, Prentice Hall, to appear. • 2. S.R. Gunn, 1998. Support Vector Machines for Classification and Regression. (http://www.isis.ecs.soton.ac.uk/resources/svminfo/) • 3. Bernhard Schölkopf. Statistical learning and kernel methods. MSR-TR 2000-23, Microsoft Research, 2000. (ftp://ftp.research.microsoft.com/pub/tr/tr-2000-23.pdf) • 4. For more resources on support vector machines, see http://www.kernel-machines.org/

  2. Introduction • SVMs were developed by Vapnik and co-workers in the mid-1990s and have become popular because of their attractive features and promising performance. • Conventional neural networks are based on empirical risk minimization, where the network weights are determined by minimizing the mean squared error between the actual outputs and the desired outputs. • SVMs are based on the structural risk minimization principle, where the parameters are optimized by minimizing an upper bound on the generalization error rather than the training error alone. • SVMs have been shown to possess better generalization capability than conventional neural networks.

  3. Introduction (Cont.) • Given N labeled empirical data points $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N) \in \mathcal{X} \times \{-1, +1\}$, (1) where $\mathcal{X} \subseteq \mathbb{R}^d$ is the domain containing the input data and $y_i \in \{-1, +1\}$ are the class labels. [Figure: two labeled classes of points in the domain $\mathcal{X}$]

  4. Introduction (Cont.) • We construct a simple classifier by computing the means of the two classes: $\mathbf{c}_1 = \frac{1}{N_1}\sum_{i:\, y_i=+1} \mathbf{x}_i, \quad \mathbf{c}_2 = \frac{1}{N_2}\sum_{i:\, y_i=-1} \mathbf{x}_i,$ (2) where $N_1$ and $N_2$ are the numbers of data points in the class with positive and negative labels, respectively. • We assign a new point $\mathbf{x}$ to the class whose mean is closer to it. • To achieve this, we compute $y = \mathrm{sgn}\big(\langle \mathbf{x} - \mathbf{c}, \mathbf{w} \rangle\big)$ with $\mathbf{c} = (\mathbf{c}_1 + \mathbf{c}_2)/2$ and $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$.

  5. Introduction (Cont.) • Then, we determine the class of $\mathbf{x}$ by checking whether the vector connecting $\mathbf{x}$ and $\mathbf{c}$ encloses an angle smaller than $\pi/2$ with the vector $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$. [Figure: the point $\mathbf{x}$, the class means, and the vector $\mathbf{w}$ in the domain $\mathcal{X}$]

  6. Introduction (Cont.) • In the special case where $b = 0$, we have $y = \mathrm{sgn}\Big(\frac{1}{N_1}\sum_{i:\, y_i=+1} \langle \mathbf{x}, \mathbf{x}_i \rangle - \frac{1}{N_2}\sum_{i:\, y_i=-1} \langle \mathbf{x}, \mathbf{x}_i \rangle\Big).$ (3) • This means that we use ALL data points $\mathbf{x}_i$, each being weighted equally by $1/N_1$ or $1/N_2$, to define the decision plane.
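
As an illustration (not part of the original slides), here is a minimal Python sketch of this mean-based classifier; the 2-D toy data and the variable names (c1, c2, w, b) are assumptions chosen to mirror Eqs. (2) and (3).

import numpy as np

# Hypothetical 2-D toy data: 20 points per class.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))    # class +1
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))  # class -1

c1 = X_pos.mean(axis=0)   # mean of the positive class (Eq. 2)
c2 = X_neg.mean(axis=0)   # mean of the negative class (Eq. 2)
w = c1 - c2               # vector joining the two means
b = 0.5 * (np.dot(c2, c2) - np.dot(c1, c1))   # b = 0 when ||c1|| = ||c2||

def classify(x):
    # Equivalent to checking whether the angle between (x - c) and w,
    # with c = (c1 + c2)/2, is smaller than pi/2.
    return np.sign(np.dot(x, w) + b)

print(classify(np.array([1.5, 1.0])))    # expected: 1.0
print(classify(np.array([-1.0, -2.0])))  # expected: -1.0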

  7. Introduction (Cont.) [Figure: the decision plane defined by the class means, with a test point $\mathbf{x}$ in the domain $\mathcal{X}$]

  8. Introduction (Cont.) • However, we might want to remove the influence of patterns that are far away from the decision boundary, because their influence is usually small. • We may also select only a few important data points (called support vectors) and weight them differently. • Then, we have a support vector machine.

  9. Introduction (Cont.) [Figure: decision plane with its margin and support vectors in the domain $\mathcal{X}$] • We aim to find a decision plane that maximizes the margin.

  10. Linear SVMs • Assume that all training data satisfy the constraints $\langle \mathbf{w}, \mathbf{x}_i \rangle + b \geq +1 \ \text{for } y_i = +1$ and $\langle \mathbf{w}, \mathbf{x}_i \rangle + b \leq -1 \ \text{for } y_i = -1,$ (4) which means $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 \geq 0 \quad \forall i.$ (5) • Training data points for which the above equality holds lie on hyperplanes parallel to the decision plane.

  11. Linear SVMs (Cont.) • The two hyperplanes $\langle \mathbf{w}, \mathbf{x} \rangle + b = \pm 1$ are separated by the margin $d = 2/\|\mathbf{w}\|$. • Therefore, maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|^2$.

  12. Linear SVMs (Lagrangian) • We minimize $\|\mathbf{w}\|^2$ subject to the constraint that $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1 \quad \forall i.$ (6) • This can be achieved by introducing Lagrange multipliers $\alpha_i \geq 0$ and a Lagrangian $L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \big[ y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 \big].$ (7) • The Lagrangian has to be minimized with respect to $\mathbf{w}$ and $b$ and maximized with respect to the multipliers $\alpha_i \geq 0$.

  13. Linear SVMs (Lagrangian) • Setting $\partial L/\partial b = 0$ and $\partial L/\partial \mathbf{w} = \mathbf{0}$, we obtain $\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i.$ (8) • Patterns for which $\alpha_i > 0$ are called support vectors. These vectors lie on the margin and satisfy $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) = 1, \ i \in S,$ where $S$ contains the indexes of the support vectors. • Patterns for which $\alpha_i = 0$ are considered to be irrelevant to the classification.

  14. Linear SVMs (Wolfe Dual) • Substituting (8) into (7), we obtain the Wolfe dual: maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ subject to $\alpha_i \geq 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$. (9) • The decision hyperplane is thus $\sum_{i \in S} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle + b = 0$, giving the decision function $f(\mathbf{x}) = \mathrm{sgn}\big(\sum_{i \in S} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle + b\big)$.
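
A sketch of this result using scikit-learn (a toolkit choice that is an assumption, not part of the slides): fit a nearly hard-margin linear SVM on hypothetical separable toy data (a very large C approximates the hard-margin case), recover w from Eq. (8), and verify that the support vectors lie on the margin.

import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 0.4, (20, 2)),
               rng.normal([-2, -2], 0.4, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

svm_hard = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

# dual_coef_ stores alpha_i * y_i for the support vectors only.
alpha_y = svm_hard.dual_coef_.ravel()
w = alpha_y @ svm_hard.support_vectors_            # Eq. (8): w = sum_i alpha_i y_i x_i
b = svm_hard.intercept_[0]

# Support vectors satisfy y_i (<w, x_i> + b) = 1 (up to numerical tolerance).
print(y[svm_hard.support_] * (svm_hard.support_vectors_ @ w + b))
print(np.allclose(w, svm_hard.coef_.ravel()))      # matches sklearn's own w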

  15. Linear SVMs (Example) • Analytical example (3-point problem): • Objective function:

  16. Linear SVMs (Example) • We introduce another Lagrange multiplier λ (for the equality constraint $\sum_i \alpha_i y_i = 0$) to obtain the Lagrangian $F(\boldsymbol{\alpha}, \lambda)$. • Differentiating $F(\boldsymbol{\alpha}, \lambda)$ with respect to λ and $\alpha_i$ and setting the results to zero, we obtain the multipliers.

  17. Linear SVMs (Example) • Substituting the Lagrange multipliers into Eq. 8 gives the weight vector $\mathbf{w}$ and hence the decision plane.

  18. Linear SVMs (Example) • 4-point linearly separable problem. [Figures: two solutions, one with 4 support vectors and one with 3 support vectors]

  19. Linear SVMs (Non-linearly separable) • Non-linearly separable: patterns that cannot be separated by a linear decision boundary without incurring classification error. [Figure: data points that cause classification errors in a linear SVM]

  20. Linear SVMs (Non-linearly separable) • We introduce a set of slack variables $\xi_i \geq 0$, $i = 1, \ldots, N$. • The slack variables allow some data to violate the constraints defined for the linearly separable case (Eq. 6): $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1 - \xi_i.$ • Therefore, for some $\mathbf{x}_i$ we have $\xi_i > 0$, i.e. these points lie inside the margin or are misclassified.

  21. Linear SVMs (Non-linearly separable) • E.g. $\xi_{10} > 0$ and $\xi_{19} > 0$ because $\mathbf{x}_{10}$ and $\mathbf{x}_{19}$ are inside the margins, i.e. they violate the constraint (Eq. 6); for instance, if $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) = 0.4$, then $\xi_i = 0.6$.

  22. Linear SVMs (Non-linearly separable) • For non-separable cases, we minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$ subject to $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, where C is a user-defined penalty parameter that penalizes any violation of the margins. • The Lagrangian becomes $L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \big[ y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 + \xi_i \big] - \sum_{i=1}^{N} \beta_i \xi_i,$ where $\alpha_i \geq 0$ and $\beta_i \geq 0$ are Lagrange multipliers.

  23. Linear SVMs (Non-linearly separable) • Wolfe dual optimization: maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ subject to $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$. • The output weight vector and bias term are $\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{x}_i$ and $b = y_k - \langle \mathbf{w}, \mathbf{x}_k \rangle$ for any $k$ with $0 < \alpha_k < C$.
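
A sketch of the soft-margin case with scikit-learn on hypothetical overlapping toy data (both are assumptions for illustration), checking the dual constraints 0 ≤ αi ≤ C and Σ αi yi = 0 and reading off w and b.

import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping (non-linearly separable) toy data.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),
               rng.normal([-1, -1], 1.0, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

C = 1.0
svm_soft = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(svm_soft.dual_coef_.ravel())             # alpha_i (dual_coef_ = alpha_i * y_i)
print((alpha > 0).all() and (alpha <= C + 1e-9).all())  # 0 < alpha_i <= C for the SVs
print(np.isclose(svm_soft.dual_coef_.sum(), 0.0))       # sum_i alpha_i y_i = 0

w = svm_soft.coef_.ravel()
b = svm_soft.intercept_[0]
print(w, b)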

  24. 2. Linear SVMs (Types of SVs) • Three types of support vectors: 1. On the margin: $0 < \alpha_i < C$ and $\xi_i = 0$. 2. Inside the margin: $\alpha_i = C$ and $0 < \xi_i \leq 1$ (correctly classified). 3. Outside the margin: $\alpha_i = C$ and $\xi_i > 1$ (misclassified).
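
The three types can be identified from αi and the slacks ξi = max(0, 1 − yi f(xi)) of a fitted model; the sketch below reuses the hypothetical toy data and scikit-learn usage from the previous sketch (both assumptions).

import numpy as np
from sklearn.svm import SVC

# Same hypothetical overlapping toy data as in the previous sketch.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),
               rng.normal([-1, -1], 1.0, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

C = 1.0
svm_soft = SVC(kernel="linear", C=C).fit(X, y)

sv = svm_soft.support_                                  # indexes of the support vectors
alpha = np.abs(svm_soft.dual_coef_.ravel())             # alpha_i
xi = np.maximum(0.0, 1.0 - y[sv] * svm_soft.decision_function(X[sv]))  # slack xi_i

on_margin = alpha < C - 1e-6                            # 0 < alpha_i < C, xi_i = 0
inside    = (alpha >= C - 1e-6) & (xi <= 1.0)           # alpha_i = C, 0 < xi_i <= 1
outside   = (alpha >= C - 1e-6) & (xi > 1.0)            # alpha_i = C, xi_i > 1 (misclassified)
print(on_margin.sum(), inside.sum(), outside.sum())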

  25. 2. Linear SVMs (Types of SVs) [Figure: support vectors on, inside, and outside the margin]

  26. 2. Linear SVMs (Types of SVs) [Figure: the same problem with Class 1 and Class 2 swapped]

  27. 2. Linear SVMs (Types of SVs) • Effect of varying C: [Figures: decision boundaries and support vectors for C = 0.1 and C = 100]
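
The same effect can be reproduced numerically: a small C tolerates more margin violations and typically keeps more support vectors than a large C. The sketch below uses scikit-learn and the hypothetical overlapping toy data from the earlier sketches (both assumptions).

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),
               rng.normal([-1, -1], 1.0, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

for C in (0.1, 100):
    model = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:>5}: {model.n_support_.sum()} support vectors")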

  28. 3. Non-linear SVMs • In case the training data X are not linearly separable, we may map them from the input space to a higher-dimensional feature space via a non-linear function $\phi(\cdot)$ so that they become linearly separable there; a kernel function $K(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ computes the required dot products in the feature space without performing the mapping explicitly. [Figure: a non-linear decision boundary in the input space (domain X) becomes a linear decision boundary in the feature space]

  29. 3. Non-linear SVMs (Cont.) • The decision function becomes $f(\mathbf{x}) = \mathrm{sgn}\Big( \sum_{i \in S} \alpha_i y_i \langle \phi(\mathbf{x}), \phi(\mathbf{x}_i) \rangle + b \Big) = \mathrm{sgn}\Big( \sum_{i \in S} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b \Big).$ (a)
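
A small sketch of why this works: a kernel evaluates the feature-space dot product without performing the mapping. The homogeneous degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{z}) = \langle \mathbf{x}, \mathbf{z} \rangle^2$, used here only as an illustration (it is not the kernel prescribed by the slides), corresponds to the explicit map $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$.

import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel.
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))   # dot product computed in the feature space
print(np.dot(x, z) ** 2)        # kernel computed in the input space -> same value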

  30. 3. Non-linear SVMs (Cont.) [Figure only]

  31. 3. Non-linear SVMs (Cont.) • The decision function becomes $f(\mathbf{x}) = \mathrm{sgn}\big( \sum_{i \in S} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b \big)$. • For RBF kernels: $K(\mathbf{x}, \mathbf{x}_i) = \exp\big( -\|\mathbf{x} - \mathbf{x}_i\|^2 / (2\sigma^2) \big)$. • For polynomial kernels: $K(\mathbf{x}, \mathbf{x}_i) = \big( \langle \mathbf{x}, \mathbf{x}_i \rangle + 1 \big)^p$.
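
The two kernels can be written directly in a few lines; the parameter names sigma and p are assumptions (scikit-learn, for instance, parameterizes the RBF kernel by gamma = 1/(2σ²) and the polynomial kernel by degree and coef0).

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, z, p=2):
    # K(x, z) = (<x, z> + 1)^p
    return (np.dot(x, z) + 1.0) ** p

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
print(rbf_kernel(x, z), poly_kernel(x, z))   # approx 0.368 and 1.0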

  32. 3. Non-linear SVMs (Cont.) • The optimization problem becomes: maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$ subject to $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$. (10) • The decision function becomes $f(\mathbf{x}) = \mathrm{sgn}\big( \sum_{i \in S} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b \big)$.
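
A sketch of this kernel decision function using scikit-learn on the same hypothetical toy data as before (both assumptions): the sum over support vectors, built from the stored αi yi and K(x, xi), reproduces the library's own decision values.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),
               rng.normal([-1, -1], 1.0, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

svm_rbf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

# f(x) = sum_{i in S} alpha_i y_i K(x, x_i) + b, with dual_coef_ = alpha_i * y_i.
K = rbf_kernel(X, svm_rbf.support_vectors_, gamma=0.5)
f_manual = K @ svm_rbf.dual_coef_.ravel() + svm_rbf.intercept_[0]

print(np.allclose(f_manual, svm_rbf.decision_function(X)))  # True
print(np.sign(f_manual)[:5])                                 # predicted labels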

  33. 3. Non-linear SVMs (Cont.) • The effect of varying C on RBF-SVMs: [Figures: decision boundaries for C = 1000 and C = 10]

  34. 3. Non-linear SVMs (Cont.) • The effect of varying C on polynomial SVMs: [Figures: decision boundaries for C = 1000 and C = 10]
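
The trend shown in these figures can be checked numerically with a short sketch (scikit-learn and the same hypothetical toy data as above, both assumptions): for both kernels, a larger C typically leaves fewer support vectors because fewer margin violations are tolerated.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),
               rng.normal([-1, -1], 1.0, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

for kernel in ("rbf", "poly"):
    for C in (10, 1000):
        m = SVC(kernel=kernel, C=C).fit(X, y)
        print(f"{kernel:>4} kernel, C = {C:>4}: {m.n_support_.sum()} support vectors")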
