Support Vector Machines (SVMs) Chapter 5 (Duda et al.)

Presentation Transcript


  1. Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition, Dr. George Bebis

  2. Learning through empirical risk minimization • Estimate g(x) from a finite set of observations by minimizing an error function, for example, the training error or empirical risk, computed over the training samples and their class labels:
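
A common form of the empirical risk referred to above (a reconstruction in standard notation; the slide's exact symbols are an assumption), with class labels z_k ∈ {−1, +1}:

\[ R_{\mathrm{emp}}(g) \;=\; \frac{1}{2n}\sum_{k=1}^{n} \bigl| z_k - g(\mathbf{x}_k) \bigr| \]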

  3. Learning through empirical risk minimization (cont’d) • Conventional empirical risk minimization over the training data does not imply good generalization to novel test data. • There could be several different functions g(x) which all approximate the training data set well. • Difficult to determine which function would have the best generalization performance.

  4. Learning through empirical risk minimization (cont’d) [Figure: two candidate decision boundaries, labeled Solution 1 and Solution 2] Which solution is better?

  5. Statistical Learning: Capacity and VC dimension • To guarantee good generalization performance, the capacity (i.e., complexity) of the learned functions must be controlled. • Functions with high capacity are more complicated (i.e., have many degrees of freedom). [Figure: a high-capacity decision function vs. a low-capacity decision function]

  6. Statistical Learning: Capacity and VC dimension (cont’d) • How do we measure capacity? • In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of capacity. • The VC dimension can predict a probabilistic upper bound on the test error (i.e., generalization error) of a classification model.

  7. Statistical Learning: Capacity and VC dimension (cont’d) • A function that (1) minimizes the empirical risk and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space, with probability (1−δ), where n is the number of training examples (Vapnik, 1995, “Structural Risk Minimization Principle”). [Figure: structural risk minimization, illustrating the bound on the generalization error]
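
The bound alluded to on this slide is usually stated as follows (a standard form, given here as a reconstruction; h denotes the VC dimension, n the number of training examples): with probability 1 − δ,

\[ R(g) \;\le\; R_{\mathrm{emp}}(g) \;+\; \sqrt{\frac{h\bigl(\ln(2n/h) + 1\bigr) - \ln(\delta/4)}{n}} \]

Structural risk minimization chooses the function class that minimizes the right-hand side, trading off empirical risk against capacity.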

  8. VC dimension and margin of separation • Vapnik has shown that maximizing the margin of separation between classes is equivalent to minimizing the VC dimension. • The optimal hyperplane is the one giving the largest margin of separation between the classes.

  9. Margin of separation and support vectors • How is the margin defined? • The margin (i.e., the empty space around the decision boundary) is defined by the distance to the nearest training patterns, which we refer to as support vectors. • Intuitively speaking, these are the most difficult patterns to classify.

  10. Margin of separation and support vectors (cont’d) [Figure: different separating hyperplanes and their corresponding margins]

  11. SVM Overview • SVMs perform structural risk minimization to achieve good generalization performance. • The optimization criterion is the margin of separation between classes. • Training is equivalent to solving a quadratic programming problem with linear constraints. • Primarily two-class classifiers but can be extended to multiple classes.
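
A minimal sketch (not part of the original slides) of what this overview looks like in practice, using scikit-learn's SVC; the toy data points and parameter values are illustrative assumptions:

```python
# Illustrative sketch: train a linear two-class SVM and inspect its support
# vectors.  The data and the value of C below are made up for demonstration.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],         # class +1
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1.0)  # C is the margin/error trade-off constant
clf.fit(X, y)                      # solves the quadratic programming problem

print("support vectors:\n", clf.support_vectors_)  # only these define g(x)
print("w  =", clf.coef_[0])                         # weight vector
print("w0 =", clf.intercept_[0])                    # bias term
print("prediction for [0.2, 0.3]:", clf.predict([[0.2, 0.3]]))
```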

  12. Linear SVM: separable case • Linear discriminant • Class labels • Normalized version • Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
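
The three equations missing from this slide, reconstructed in standard notation (the slide's exact symbols are an assumption):

\[ g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0, \qquad z_k = \begin{cases} +1 & \text{if } \mathbf{x}_k \in \omega_1 \\ -1 & \text{if } \mathbf{x}_k \in \omega_2 \end{cases}, \qquad z_k\, g(\mathbf{x}_k) > 0 \;\; \text{for all } k \]

The last (normalized) condition simply says that every training sample is classified correctly.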

  13. Linear SVM: separable case (cont’d) • The distance of a point xk from the separating hyperplane should satisfy the constraint: • To ensure uniqueness, we impose: • The above constraint becomes: (b is the margin)
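
The constraints referred to above, in standard form (a reconstruction consistent with the surrounding slides):

\[ \frac{z_k\, g(\mathbf{x}_k)}{\|\mathbf{w}\|} \;\ge\; b, \qquad \text{impose } b\,\|\mathbf{w}\| = 1 \;\;\Rightarrow\;\; z_k\, g(\mathbf{x}_k) \ge 1, \quad k = 1, \dots, n \]

so the margin is \( b = 1/\|\mathbf{w}\| \).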

  14. Linear SVM: separable case (cont’d) • Maximizing the margin leads to a quadratic programming problem:
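
In standard form (a reconstruction of the missing equations): maximizing the margin \( b = 1/\|\mathbf{w}\| \) is equivalent to

\[ \min_{\mathbf{w},\, w_0} \;\; \tfrac{1}{2}\|\mathbf{w}\|^2 \qquad \text{subject to} \qquad z_k(\mathbf{w}^T\mathbf{x}_k + w_0) \ge 1, \quad k = 1, \dots, n \]

i.e., a quadratic objective with linear inequality constraints.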

  15. Linear SVM: separable case (cont’d) • Using Lagrange optimization, we seek to minimize: • It is easier to solve the “dual” problem (Kuhn-Tucker construction):
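
The two optimization problems referred to above, reconstructed in standard notation: the primal Lagrangian

\[ L(\mathbf{w}, w_0, \boldsymbol{\lambda}) \;=\; \tfrac{1}{2}\|\mathbf{w}\|^2 \;-\; \sum_{k=1}^{n} \lambda_k \bigl[ z_k(\mathbf{w}^T\mathbf{x}_k + w_0) - 1 \bigr], \qquad \lambda_k \ge 0 \]

and its dual

\[ \max_{\boldsymbol{\lambda}} \;\; \sum_{k=1}^{n}\lambda_k \;-\; \tfrac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n} \lambda_j\lambda_k z_j z_k\, \mathbf{x}_j^T\mathbf{x}_k \qquad \text{subject to} \qquad \sum_{k=1}^{n}\lambda_k z_k = 0, \;\; \lambda_k \ge 0. \]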

  16. Linear SVM: separable case (cont’d) • The solution is given by: • It can be shown that if xk is not a support vector, then the corresponding λk = 0. • The solution depends on the data only through dot products: only the support vectors affect it!
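
The missing solution equations, in standard notation (a reconstruction):

\[ \mathbf{w} \;=\; \sum_{k=1}^{n} \lambda_k z_k \mathbf{x}_k, \qquad g(\mathbf{x}) \;=\; \sum_{k \in SV} \lambda_k z_k\, (\mathbf{x}_k^T \mathbf{x}) + w_0 \]

where SV is the set of support vectors and w_0 can be obtained from any support vector using z_k g(x_k) = 1.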

  17. Linear SVM: non-separable case • Allow misclassifications (i.e., a soft margin classifier) by introducing positive error (slack) variables ψk:
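
The soft-margin formulation referred to above, in standard form (a reconstruction using the slide's c and ψk):

\[ \min_{\mathbf{w},\, w_0,\, \boldsymbol{\psi}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + c \sum_{k=1}^{n} \psi_k \qquad \text{subject to} \qquad z_k(\mathbf{w}^T\mathbf{x}_k + w_0) \ge 1 - \psi_k, \;\; \psi_k \ge 0. \]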

  18. Linear SVM: non-separable case (cont’d) • The constant c controls the trade-off between the margin and misclassification errors. • Aims to prevent outliers from affecting the optimal hyperplane.

  19. Linear SVM: non-separable case (cont’d) • Easier to solve the “dual” problem (Kuhn-Tucker construction):
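
The dual for the non-separable case has the same objective as before; the only change (a standard result, stated here as a reconstruction) is that the multipliers are bounded above by c:

\[ \max_{\boldsymbol{\lambda}} \;\; \sum_{k=1}^{n}\lambda_k \;-\; \tfrac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n} \lambda_j\lambda_k z_j z_k\, \mathbf{x}_j^T\mathbf{x}_k \qquad \text{subject to} \qquad \sum_{k=1}^{n}\lambda_k z_k = 0, \;\; 0 \le \lambda_k \le c. \]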

  20. Nonlinear SVM • Extending these concepts to the non-linear case involves mapping the data to a high-dimensional space of dimension h: • Mapping the data to a sufficiently high-dimensional space is likely to make the data linearly separable in that space.
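
A reconstruction of the mapping referred to above (standard notation; the slide's exact symbols are an assumption):

\[ \mathbf{x} \in \mathbb{R}^d \;\longmapsto\; \Phi(\mathbf{x}) \in \mathbb{R}^h, \qquad \text{typically } h \gg d \]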

  21. Nonlinear SVM (cont’d) Example:

  22. Nonlinear SVM (cont’d) linear SVM: non-linear SVM:
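
The two discriminants being compared on this slide, reconstructed in standard notation:

linear SVM: \( g(\mathbf{x}) = \sum_{k} \lambda_k z_k\, (\mathbf{x}_k^T \mathbf{x}) + w_0 \)

non-linear SVM: \( g(\mathbf{x}) = \sum_{k} \lambda_k z_k\, \bigl(\Phi(\mathbf{x}_k)^T \Phi(\mathbf{x})\bigr) + w_0 \)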

  23. Nonlinear SVM (cont’d) • The disadvantage of this approach is that the mapping Φ might be very expensive to compute! • Is there an efficient way to compute the dot products Φ(xk)·Φ(x) needed by the non-linear SVM?

  24. The kernel trick • Compute dot products using a kernel function
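
In other words (a standard statement of the trick, reconstructed here): choose a kernel function K such that

\[ K(\mathbf{x}, \mathbf{y}) \;=\; \Phi(\mathbf{x})^T \Phi(\mathbf{y}), \qquad \text{so that} \qquad g(\mathbf{x}) \;=\; \sum_{k} \lambda_k z_k\, K(\mathbf{x}_k, \mathbf{x}) + w_0 \]

and Φ never has to be evaluated explicitly.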

  25. The kernel trick (cont’d) • Comments • Kernel functions which can be expressed as a dot product in some space satisfy Mercer’s condition (see Burges’ paper) • Mercer’s condition does not tell us how to construct Φ() or even what the high-dimensional space is. • Advantages of the kernel trick • No need to know Φ() • Computations remain feasible even if the feature space has high dimensionality.

  26. Polynomial Kernel K(x,y) = (x · y)^d
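
A small numerical check (illustrative, not from the slides) that the degree-2 polynomial kernel really is a dot product in an explicit feature space; the feature map used below is one standard choice:

```python
# For x, y in R^2, K(x, y) = (x . y)^2 equals Phi(x) . Phi(y) with
# Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), a 3-dimensional feature space.
import numpy as np

def poly_kernel(x, y, d=2):
    return np.dot(x, y) ** d

def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(poly_kernel(x, y))   # (1*3 + 2*(-1))^2 = 1.0
print(phi(x) @ phi(y))     # same value, computed via the explicit mapping
```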

  27. Choice of Φ is not unique
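
For example (an illustration consistent with the previous slide, not taken from it): for the degree-2 polynomial kernel on R^2, both

\[ \Phi_1(\mathbf{x}) = (x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2) \qquad \text{and} \qquad \Phi_2(\mathbf{x}) = (x_1^2,\; x_1 x_2,\; x_2 x_1,\; x_2^2) \]

satisfy \( \Phi(\mathbf{x})^T\Phi(\mathbf{y}) = (\mathbf{x}^T\mathbf{y})^2 \), so the same kernel corresponds to different mappings and even to feature spaces of different dimensionality.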

  28. Kernel functions
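
Commonly used kernels (a representative list; the table on the original slide may differ):

\[ \text{polynomial: } K(\mathbf{x},\mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^d, \qquad \text{Gaussian (RBF): } K(\mathbf{x},\mathbf{y}) = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right), \qquad \text{sigmoid: } K(\mathbf{x},\mathbf{y}) = \tanh(\kappa\,\mathbf{x}^T\mathbf{y} - \delta) \]

(the sigmoid kernel satisfies Mercer's condition only for certain values of κ and δ).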

  29. Example

  30. Example (cont’d) h=6

  31. Example (cont’d)

  32. Example (cont’d) (Problem 4)

  33. Example (cont’d) w0=0

  34. Example (cont’d) w =

  35. Example (cont’d) w = The discriminant

  36. Comments on SVMs • SVM training is based on exact optimization, not on approximate methods (i.e., it is a global optimization with no local optima). • SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well from small training sets. • Performance depends on the choice of the kernel and its parameters. • Complexity depends on the number of support vectors, not on the dimensionality of the transformed space.
