
Tutorial: Interior Point Optimization Methods in Support Vector Machines Training



  1. Tutorial: Interior Point Optimization Methods in Support Vector Machines Training. Part 1: Fundamentals of SVMs. Theodore Trafalis, email: trafalis@ecn.ou.edu. ANNIE'99, St. Louis, Missouri, U.S.A., Nov. 7, 1999

  2. Outline • Statistical Learning Theory • Empirical Risk Minimization • Structural Risk Minimization • Linear SVM and the Linearly Separable Case • Primal Optimization Problem • Dual Optimization Problem • Non-Linear Case • Support Vector Regression • Dual Problem for Regression • Kernel Functions in SVMs • Open Problem

  3. Statistical Learning Theory (Vapnik 1995, 1998) Empirical Risk Minimization • Given a set of decision functions $\{f(x,\alpha):\ \alpha\in\Lambda\}$, $f(\cdot,\alpha):\ \mathbb{R}^n\to\{-1,1\}$, where $\Lambda$ is a set of abstract parameters. • Suppose $(x_1,y_1), (x_2,y_2), \dots, (x_l,y_l)$, with $x_i\in\mathbb{R}^n$ and $y_i\in\{1,-1\}$, are drawn from an unknown distribution $P(x,y)$. We want to find the $f^*$ that minimizes the expected risk functional $$R(\alpha)=\int \tfrac{1}{2}\,|y-f(x,\alpha)|\,dP(x,y),$$ where $f(x,\alpha)$ and $\{f(x,\alpha):\ \alpha\in\Lambda\}$ are called the hypothesis and the hypothesis space, respectively.

  4. Empirical Risk Minimization • The problem is that the distribution function $P(x,y)$ is unknown, so we cannot compute the expected risk. Instead we compute the empirical risk $$R_{emp}(\alpha)=\frac{1}{l}\sum_{i=1}^{l}\tfrac{1}{2}\,|y_i-f(x_i,\alpha)|.$$ • The idea behind minimizing the empirical risk is that if $R_{emp}$ converges to the expected risk, then the minimum of $R_{emp}$ may converge to the minimum of the expected risk. • A typical uniform VC bound, which holds with probability $1-\eta$, has the following form: $$R(\alpha)\le R_{emp}(\alpha)+\sqrt{\frac{h\left(\ln\frac{2l}{h}+1\right)-\ln\frac{\eta}{4}}{l}},$$ where $h$ is the VC dimension of the hypothesis space.
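As a concrete illustration, the following minimal sketch evaluates $R_{emp}$ for one fixed linear hypothesis $f(x)=\mathrm{sgn}(\langle w,x\rangle+b)$ on a toy sample; the data and the choice of $(w,b)$ are illustrative, not taken from the tutorial.

```python
# Minimal sketch: empirical risk of a fixed linear hypothesis on toy data.
import numpy as np

def empirical_risk(X, y, w, b):
    # R_emp = (1/l) * sum_i 0.5 * |y_i - sign(<w, x_i> + b)|,
    # i.e. the fraction of misclassified training points.
    predictions = np.sign(X @ w + b)
    return np.mean(0.5 * np.abs(y - predictions))

X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(empirical_risk(X, y, w=np.array([1.0, 0.0]), b=-1.5))  # 0.0 on this toy set
```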

  5. Structural Risk Minimization • A small value of the empirical risk does not necessarily imply a small value of the expected risk. • Structural Risk Minimization principle (SRM) (Vapnik 1982, 1995): the VC dimension and the empirical risk should be minimized at the same time. • Need a nested structure of hypothesis spaces • $H_1\subset H_2\subset H_3\subset\dots\subset H_n\subset\dots$ • with the property that $h(n)\le h(n+1)$, where $h(n)$ is the VC dimension of $H_n$. • Need to solve the problem given below.
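Following the VC bound on the previous slide, the SRM principle amounts, roughly, to choosing the element of the structure (and the function within it) that minimizes the bound on the expected risk:
$$\min_{n}\ \left[\ \min_{f\in H_n} R_{emp}(f)\ +\ \sqrt{\frac{h(n)\left(\ln\frac{2l}{h(n)}+1\right)-\ln\frac{\eta}{4}}{l}}\ \right].$$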

  6. Linear SVM and the Linearly Separable Case • Assume that we are given a set $S$ of points $x_i\in\mathbb{R}^n$, where each $x_i$ belongs to either of two classes defined by $y_i\in\{1,-1\}$. The objective is to find a hyperplane that divides $S$, leaving all the points of the same class on the same side, while maximizing the minimum distance between either of the two classes and the hyperplane [Vapnik 1995]. • Definition 1. The set $S$ is linearly separable if there exist $w\in\mathbb{R}^n$ and $b\in\mathbb{R}$ such that $$y_i(\langle w,x_i\rangle+b)\ge 1,\qquad i=1,\dots,l.$$ • In order to make each decision surface correspond to one unique pair $(w,b)$, the following canonical-form constraint is imposed: $$\min_{i=1,\dots,l}\ |\langle w,x_i\rangle+b|=1.$$

  7. Relationship between the VC dimension and the canonical hyperplane • Suppose all points $x_1, x_2, \dots, x_l$ lie in the unit $n$-dimensional sphere. The set of canonical hyperplanes with $\|w\|\le A$ has a VC dimension $h$ that satisfies the bound $$h\le\min\{A^2,\ n\}+1.$$ • Maximizing the margin minimizes the function complexity.

  8. continued • The distance from a point $x$ to the hyperplane associated with the pair $(w,b)$ is $$d(x;w,b)=\frac{|\langle w,x\rangle+b|}{\|w\|}.$$ • The distance between the canonical hyperplane and the closest point is $1/\|w\|$. • The goal of the SVM is to find, among all the hyperplanes that correctly classify the data, the one with minimum norm $\|w\|^2$. Minimizing $\|w\|^2$ is equivalent to finding the separating hyperplane for which the distance between the two classes, $2/\|w\|$, is maximized. This distance is called the margin.

  9. Separating hyperplane and optimal separating hyperplane.

  10. Primal Optimization Problem • Primal problem
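In standard form (Vapnik 1995), the primal problem for the linearly separable case is:
$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2\qquad\text{s.t.}\quad y_i(\langle w,x_i\rangle+b)\ge 1,\quad i=1,\dots,l.$$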

  11. Computing Saddle Points • The Lagrangian of the primal problem and the saddle-point optimality conditions are given below.
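In standard form, with multipliers $\alpha_i\ge 0$:
$$L(w,b,\alpha)=\tfrac{1}{2}\|w\|^2-\sum_{i=1}^{l}\alpha_i\left[y_i(\langle w,x_i\rangle+b)-1\right],$$
$$\frac{\partial L}{\partial b}=0\ \Rightarrow\ \sum_{i=1}^{l}\alpha_i y_i=0,\qquad \frac{\partial L}{\partial w}=0\ \Rightarrow\ w=\sum_{i=1}^{l}\alpha_i y_i x_i.$$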

  12. Optimal point: $w^*=\sum_{i=1}^{l}\alpha_i y_i x_i$, with decision function $f(x)=\mathrm{sgn}(\langle w^*,x\rangle+b^*)$. • Support vector: a training vector $x_i$ for which the corresponding multiplier satisfies $\alpha_i>0$ (equivalently, $y_i(\langle w^*,x_i\rangle+b^*)=1$).

  13. Dual Optimization Problem
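Substituting the optimality conditions back into the Lagrangian gives the standard dual problem:
$$\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i-\tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\langle x_i,x_j\rangle\qquad\text{s.t.}\quad \sum_{i=1}^{l}\alpha_i y_i=0,\quad \alpha_i\ge 0.$$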

  14. KKT Conditions
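The Karush-Kuhn-Tucker conditions for this problem take the standard form:
$$\alpha_i\ge 0,\qquad y_i(\langle w,x_i\rangle+b)-1\ge 0,\qquad \alpha_i\left[y_i(\langle w,x_i\rangle+b)-1\right]=0,\quad i=1,\dots,l.$$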

  15. Linearly Non-separable Case (Soft Margin Optimal Hyperplane)
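In standard form, the soft margin primal problem introduces slack variables $\xi_i$ and a penalty parameter $C>0$ (with exponent $k\ge 1$, specialized to $k=1$ on slide 19):
$$\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{l}\xi_i^k\qquad\text{s.t.}\quad y_i(\langle w,x_i\rangle+b)\ge 1-\xi_i,\quad \xi_i\ge 0.$$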

  16. Lagrangian
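For $k=1$, the Lagrangian of the soft margin problem can be written (with multipliers $\alpha_i,\mu_i\ge 0$) as:
$$L(w,b,\xi,\alpha,\mu)=\tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{l}\xi_i-\sum_{i=1}^{l}\alpha_i\left[y_i(\langle w,x_i\rangle+b)-1+\xi_i\right]-\sum_{i=1}^{l}\mu_i\xi_i.$$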

  17. Saddle Point Optimality Conditions
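Setting the derivatives of this Lagrangian to zero gives the saddle-point conditions:
$$w=\sum_{i=1}^{l}\alpha_i y_i x_i,\qquad \sum_{i=1}^{l}\alpha_i y_i=0,\qquad C-\alpha_i-\mu_i=0,\quad i=1,\dots,l.$$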

  18. Dual Problem

  19. Dual Problem for k=1
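For $k=1$ the dual has the same form as in the separable case, except that the multipliers acquire the box constraint $0\le\alpha_i\le C$:
$$\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i-\tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\langle x_i,x_j\rangle\qquad\text{s.t.}\quad \sum_{i=1}^{l}\alpha_i y_i=0,\quad 0\le\alpha_i\le C.$$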

  20. The Idea of SVM: input space → feature space.

  21. Non-Linear Case • If the data are not linearly separable, we map the input vector $x$ into a higher-dimensional feature space. • If we map the input space into the feature space, then we can obtain a hyperplane that separates the data into two groups in the feature space. • Kernel function: $K(x_i,x_j)=\langle\Phi(x_i),\Phi(x_j)\rangle$, where $\Phi$ is the map into the feature space.

  22. Dual problem in the nonlinear case • Replace the dot product of the inputs with the kernel function in the linearly non-separable (soft margin) dual.
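The resulting dual is $\max_\alpha\ \sum_i\alpha_i-\tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(x_i,x_j)$ subject to $\sum_i\alpha_i y_i=0$ and $0\le\alpha_i\le C$. As an illustration, the sketch below solves this dual with a generic quadratic programming solver (cvxopt, assumed to be installed); the function names, the toy data, and the choices of $C$ and the RBF width gamma are illustrative, not values from the tutorial.

```python
# Sketch: kernelized soft-margin SVM dual solved as a generic QP with cvxopt.
import numpy as np
from cvxopt import matrix, solvers

def rbf_kernel(X, gamma=0.5):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def train_dual_svm(X, y, C=10.0, gamma=0.5):
    l = len(y)
    K = rbf_kernel(X, gamma)
    Q = np.outer(y, y) * K                           # Q_ij = y_i y_j K(x_i, x_j)
    P, q = matrix(Q), matrix(-np.ones(l))            # minimize 0.5 a'Qa - 1'a
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))   # encodes 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)     # equality: sum_i alpha_i y_i = 0
    solvers.options["show_progress"] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    # b* from a support vector with 0 < alpha_i < C (KKT conditions)
    i = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    b_star = y[i] - np.sum(alpha * y * K[:, i])
    return alpha, b_star

def predict(X_train, y_train, alpha, b_star, X_new, gamma=0.5):
    # f(x) = sign( sum_i alpha_i y_i K(x_i, x) + b* )
    d2 = np.sum((X_train[:, None, :] - X_new[None, :, :]) ** 2, axis=2)
    return np.sign((alpha * y_train) @ np.exp(-gamma * d2) + b_star)

# toy usage on an XOR-like set that is not linearly separable in input space
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b_star = train_dual_svm(X, y)
print(predict(X, y, alpha, b_star, X))   # recovers the labels [ 1.  1. -1. -1.]
```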

  23. Support Vector Regression • The ε-insensitive support vector regression: find a function $f(x)$ that has at most $\varepsilon$ deviation from the actually obtained targets $y_i$ for all the training data and, at the same time, is as flat as possible. • Primal Regression Problem (given below)
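In standard form, with $f(x)=\langle w,x\rangle+b$, the (hard) primal regression problem is:
$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2\qquad\text{s.t.}\quad y_i-\langle w,x_i\rangle-b\le\varepsilon,\quad \langle w,x_i\rangle+b-y_i\le\varepsilon,\quad i=1,\dots,l.$$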

  24. Soft Margin Formulation • Soft margin formulation (given below) • $C$ determines the trade-off between the flatness of $f(x)$ and the amount up to which deviations larger than $\varepsilon$ are tolerated. • The $\varepsilon$-insensitive loss function $|\xi|_\varepsilon$ (Vapnik 1995) is defined as $|\xi|_\varepsilon=\max\{0,\ |\xi|-\varepsilon\}$, i.e., it is zero if $|\xi|\le\varepsilon$ and $|\xi|-\varepsilon$ otherwise.
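With slack variables $\xi_i,\xi_i^*$, the soft margin regression primal takes the standard form:
$$\min_{w,b,\xi,\xi^*}\ \tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{l}(\xi_i+\xi_i^*)$$
$$\text{s.t.}\quad y_i-\langle w,x_i\rangle-b\le\varepsilon+\xi_i,\qquad \langle w,x_i\rangle+b-y_i\le\varepsilon+\xi_i^*,\qquad \xi_i,\ \xi_i^*\ge 0.$$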

  25. The ε-insensitive case

  26. Saddle Point Optimality Conditions • The Lagrangian function will help us formulate the dual problem. • Optimality conditions (given below)
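In standard form, with multipliers $\alpha_i,\alpha_i^*\ge 0$ for the two sets of $\varepsilon$-constraints and $\eta_i,\eta_i^*\ge 0$ for the slack constraints, the saddle-point conditions of the Lagrangian are:
$$\sum_{i=1}^{l}(\alpha_i-\alpha_i^*)=0,\qquad w=\sum_{i=1}^{l}(\alpha_i-\alpha_i^*)x_i,\qquad C-\alpha_i-\eta_i=0,\qquad C-\alpha_i^*-\eta_i^*=0.$$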

  27. Dual Problem for Regression • Dual problem (given below) • Solving the optimality conditions for $w$ gives the support vector expansion of $f$.
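In standard form (Smola 1998), the dual problem for regression and the resulting expansion are:
$$\max_{\alpha,\alpha^*}\ -\tfrac{1}{2}\sum_{i,j=1}^{l}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\langle x_i,x_j\rangle-\varepsilon\sum_{i=1}^{l}(\alpha_i+\alpha_i^*)+\sum_{i=1}^{l}y_i(\alpha_i-\alpha_i^*)$$
$$\text{s.t.}\quad \sum_{i=1}^{l}(\alpha_i-\alpha_i^*)=0,\qquad \alpha_i,\ \alpha_i^*\in[0,C],$$
$$w=\sum_{i=1}^{l}(\alpha_i-\alpha_i^*)x_i,\qquad f(x)=\sum_{i=1}^{l}(\alpha_i-\alpha_i^*)\langle x_i,x\rangle+b.$$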

  28. KKT Optimality Conditions and b* • KKT optimality conditions (given below) • Only samples $(x_i,y_i)$ with corresponding $\alpha_i=C$ (or $\alpha_i^*=C$) lie outside the ε-insensitive tube around $f$. If $\alpha_i$ is nonzero, then $\alpha_i^*$ is zero and vice versa. Finally, if $\alpha_i\in(0,C)$, then the corresponding $\xi_i$ is zero. • $b^*$ can then be computed as $b^*=y_i-\langle w,x_i\rangle-\varepsilon$ for some $i$ with $\alpha_i\in(0,C)$ (or $b^*=y_i-\langle w,x_i\rangle+\varepsilon$ for $\alpha_i^*\in(0,C)$).
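In standard form, the KKT complementarity conditions for the soft margin regression problem are:
$$\alpha_i\left(\varepsilon+\xi_i-y_i+\langle w,x_i\rangle+b\right)=0,\qquad \alpha_i^*\left(\varepsilon+\xi_i^*+y_i-\langle w,x_i\rangle-b\right)=0,$$
$$(C-\alpha_i)\,\xi_i=0,\qquad (C-\alpha_i^*)\,\xi_i^*=0.$$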

  29. QP SV Regression Problem in Feature Space • Mapping in the feature space we obtain the following quadratic SV regression problem • At the optimal solution, we obtain
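In kernel form, with $K(x_i,x_j)=\langle\Phi(x_i),\Phi(x_j)\rangle$, the quadratic SV regression problem and its solution take the standard form:
$$\max_{\alpha,\alpha^*}\ -\tfrac{1}{2}\sum_{i,j=1}^{l}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)K(x_i,x_j)-\varepsilon\sum_{i=1}^{l}(\alpha_i+\alpha_i^*)+\sum_{i=1}^{l}y_i(\alpha_i-\alpha_i^*)$$
$$\text{s.t.}\quad \sum_{i=1}^{l}(\alpha_i-\alpha_i^*)=0,\qquad \alpha_i,\ \alpha_i^*\in[0,C],\qquad\text{with}\qquad f(x)=\sum_{i=1}^{l}(\alpha_i-\alpha_i^*)K(x_i,x)+b^*.$$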

  30. Kernel Functions in SVMs • An inner product in feature space has an equivalent kernel in input space: $K(x,z)=\langle\Phi(x),\Phi(z)\rangle$. • Any symmetric positive semi-definite function (Smola 1998) that satisfies Mercer's conditions can be used as a kernel function in the SVM context. Mercer's conditions can be written as $$\iint K(x,z)\,g(x)\,g(z)\,dx\,dz\ge 0\qquad\text{for all } g \text{ with } \int g(x)^2\,dx<\infty.$$

  31. Some kernel functions (common forms are sketched below) • Polynomial type • Gaussian Radial Basis Function (GRBF) • Exponential Radial Basis Function • Multi-Layer Perceptron • Fourier Series
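The sketch below lists commonly used forms of these kernels. The parameter names (degree, sigma, kappa, delta, N) and their defaults are illustrative, not values from the tutorial, and the Fourier series kernel is written here as the one-dimensional Dirichlet kernel.

```python
# Commonly used kernel forms (sketch); parameters are illustrative defaults.
import numpy as np

def polynomial(x, z, degree=2):
    return (np.dot(x, z) + 1.0) ** degree

def gaussian_rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def exponential_rbf(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / (2.0 * sigma ** 2))

def mlp(x, z, kappa=1.0, delta=-1.0):
    # Mercer's conditions hold only for some choices of (kappa, delta)
    return np.tanh(kappa * np.dot(x, z) + delta)

def fourier_series(x, z, N=10):
    # one-dimensional Dirichlet kernel; the limit at sin(d/2) = 0 equals 2N + 1
    d = x - z
    denom = np.sin(0.5 * d)
    return 2.0 * N + 1.0 if np.isclose(denom, 0.0) else np.sin((N + 0.5) * d) / denom

x, z = np.array([1.0, 2.0]), np.array([0.5, 1.5])
print(polynomial(x, z), gaussian_rbf(x, z), mlp(x, z))
```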

  32. Open Problem • We have more than one kernel for mapping the input space into the feature space. • Question: which kernel functions provide good generalization for a particular problem? • Some validation techniques, such as bootstrapping and cross-validation, can be used to determine a good kernel (see the sketch below). • Even when we decide on a kernel function, we have to choose the parameters of the kernel (e.g., the RBF kernel has a parameter σ, and one has to decide its value before the experiment). • There is no theory yet for the selection of optimal kernels (Smola 1998, Amari 1999). • For a more extensive literature and software on SVMs, check the web page http://svm.first.gmd.de/
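One possible way to address the kernel and parameter selection question by cross-validation is sketched below using scikit-learn (assumed available); the synthetic data set and the parameter grids are illustrative choices, not part of the tutorial.

```python
# Sketch: choose a kernel and its parameters by grid search with cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = [
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1.0], "C": [0.1, 1.0, 10.0]},
    {"kernel": ["poly"], "degree": [2, 3, 4], "C": [0.1, 1.0, 10.0]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```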
