# Tutorial: Interior Point Optimization Methods in Support Vector Machines Training


### Tutorial: Interior Point Optimization Methods in Support Vector Machines Training

Part 1: Fundamentals of SVMs

Theodore Trafalis

email: trafalis@ecn.ou.edu

ANNIE’99, St. Louis, Missouri, U.S.A., Nov. 7, 1999

Outline
• Statistical Learning Theory
• Empirical Risk Minimization
• Structural Risk Minimization
• Linear SVM and Linearly Separable Case
• Primal Optimization Problem
• Dual Optimization Problem
• Non-Linear Case
• Support Vector Regression
• Dual Problem for Regression
• Kernel Functions in SVMs
• Open Problem
Statistical Learning Theory (Vapnik 1995, 1998)

Empirical Risk Minimization

• Given a set of decision functions $\{f_\lambda(x) : \lambda \in \Lambda\}$, $f_\lambda : \mathbb{R}^n \to [-1, 1]$, where $\Lambda$ is a set of abstract parameters.
• Suppose the examples $(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)$, with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$, are drawn from an unknown distribution $P(x, y)$.
• We want to find the $f_{\lambda^*}$ that minimizes the expected risk functional
  $$R(\lambda) = \int \tfrac{1}{2}\, |f_\lambda(x) - y| \; dP(x, y),$$
  where $f_\lambda(x)$ and $\{f_\lambda(x) : \lambda \in \Lambda\}$ are called the hypothesis and the hypothesis space, respectively.

Empirical Risk Minimization
• The problem is that the distribution function $P(x, y)$ is unknown, so the expected risk cannot be computed. Instead we compute the empirical risk
  $$R_{\mathrm{emp}}(\lambda) = \frac{1}{2l} \sum_{i=1}^{l} |f_\lambda(x_i) - y_i|.$$
• The idea behind minimizing the empirical risk is that if $R_{\mathrm{emp}}$ converges to the expected risk, then the minimum of $R_{\mathrm{emp}}$ may converge to the minimum of the expected risk.
• A typical uniform VC bound, which holds with probability $1 - \eta$, has the following form:
  $$R(\lambda) \le R_{\mathrm{emp}}(\lambda) + \sqrt{\frac{h \left( \ln \frac{2l}{h} + 1 \right) - \ln \frac{\eta}{4}}{l}},$$
  where $h$ is the VC dimension of the hypothesis space.
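To make these quantities concrete, here is a minimal Python sketch (the data, the candidate decision function, and the choice of η = 0.05 are illustrative assumptions, not from the slides) that evaluates the empirical risk of a fixed hypothesis and adds the VC confidence term of the bound above.

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp = (1 / 2l) * sum_i |f(x_i) - y_i| for labels in {-1, +1}."""
    predictions = np.array([f(x) for x in X])
    return np.mean(np.abs(predictions - y)) / 2.0

def vc_confidence(h, l, eta=0.05):
    """Confidence term of the typical uniform VC bound (holds with probability 1 - eta)."""
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

# Illustrative data and a fixed linear decision rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])     # labels in {-1, +1}, for illustration only
f = lambda x: np.sign(x[0])              # one candidate hypothesis f_lambda

l, h = len(X), 3                         # VC dimension of hyperplanes in R^2 is n + 1 = 3
print("R_emp =", empirical_risk(f, X, y))
print("bound on R =", empirical_risk(f, X, y) + vc_confidence(h, l))
```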
Structural Risk Minimization
• A small value of the empirical risk does not necessarily imply a small value of the expected risk.
• Structural Risk Minimization (SRM) principle (Vapnik 1982, 1995): the VC dimension and the empirical risk should be minimized at the same time.
• Need a nested structure of hypothesis spaces
  $$H_1 \subset H_2 \subset H_3 \subset \ldots \subset H_n \subset \ldots$$
  with the property that $h(n) \le h(n+1)$, where $h(n)$ is the VC dimension of $H_n$.
• Need to solve the following problem: over the choice of $H_n$ and of the function within it, minimize the bound on the expected risk, i.e., the empirical risk plus the VC confidence term.
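A rough numerical illustration of the SRM principle, under the assumption that the empirical risks and VC dimensions of the nested spaces H_n are known (all numbers below are hypothetical): choose the H_n that minimizes the guaranteed risk, i.e., the empirical risk plus the VC confidence term.

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    # Same confidence term as in the VC bound above.
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

l = 500                                          # number of training samples (hypothetical)
vc_dims   = [2,    5,    10,   25,   60]         # h(n) for H_1 c H_2 c ... (hypothetical)
emp_risks = [0.30, 0.18, 0.12, 0.10, 0.09]       # R_emp attained in each H_n (hypothetical)

guaranteed = [r + vc_confidence(h, l) for r, h in zip(emp_risks, vc_dims)]
best = int(np.argmin(guaranteed))
print("SRM picks H_%d with guaranteed risk %.3f" % (best + 1, guaranteed[best]))
```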
Linear SVM and Linearly Separable Case
• Assume that we are given a set $S$ of points $x_i \in \mathbb{R}^n$, where each $x_i$ belongs to one of two classes defined by $y_i \in \{-1, 1\}$. The objective is to find a hyperplane that divides $S$, leaving all the points of the same class on the same side, while maximizing the minimum distance between either of the two classes and the hyperplane [Vapnik 1995].
• Definition 1. The set $S$ is linearly separable if there exist $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ such that
  $$y_i (w \cdot x_i + b) \ge 1, \qquad i = 1, \ldots, l.$$
• In order to make each decision surface correspond to one unique pair $(w, b)$, the following canonical constraint is imposed:
  $$\min_{i = 1, \ldots, l} |w \cdot x_i + b| = 1.$$
• Suppose all the points $x_1, x_2, \ldots, x_l$ lie in the $n$-dimensional unit sphere. The set of canonical hyperplanes with $\|w\| \le A$ has a VC dimension $h$ that satisfies the bound
  $$h \le \min\{A^2, n\} + 1.$$
• Maximizing the margin therefore minimizes the complexity of the function class.
continued
• The distance from a point $x$ to the hyperplane associated with the pair $(w, b)$ is
  $$d(x; w, b) = \frac{|w \cdot x + b|}{\|w\|}.$$
• The distance between the canonical hyperplane and the closest point is $1 / \|w\|$.
• The goal of the SVM is to find, among all the hyperplanes that correctly classify the data, the one with minimum norm, i.e., minimum $\|w\|^2$. Minimizing $\|w\|^2$ is equivalent to finding the separating hyperplane for which the distance between the two classes is maximized. This distance, equal to $2 / \|w\|$, is called the margin.
• The Lagrangian is
  $$L(w, b, \lambda) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \lambda_i \left[ y_i (w \cdot x_i + b) - 1 \right], \qquad \lambda_i \ge 0.$$
• Optimality conditions:
  $$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \lambda_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \lambda_i y_i = 0.$$

Optimal point
• At the optimum, $w^* = \sum_{i=1}^{l} \lambda_i^* y_i x_i$ and the decision function is $f(x) = \mathrm{sign}(w^* \cdot x + b^*)$.
• Support vector: a training vector $x_i$ for which $y_i (w^* \cdot x_i + b^*) = 1$.
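A minimal sketch of the separable case in Python (assuming scikit-learn is available; the toy data and the very large C, used to approximate the hard-margin problem, are illustrative choices): train a linear SVM, recover w and b, compute the margin 2/||w||, and list the support vectors, for which y_i (w·x_i + b) ≈ 1.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters (illustrative toy data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=+3.0, size=(20, 2)),
               rng.normal(loc=-3.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# A very large C approximates the hard-margin formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                      # w = sum_i lambda_i y_i x_i
b = clf.intercept_[0]
print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))

# Support vectors lie on the canonical hyperplanes: y_i (w.x_i + b) ~ 1.
for i in clf.support_:
    print(X[i], " y_i (w.x_i + b) =", y[i] * (X[i] @ w + b))
```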
The Idea of SVM

(Figure: mapping Φ from the input space to the feature space.)

Non-Linear Case
• If the data are not linearly separable, we map the input vector $x$ into a higher-dimensional feature space.
• By mapping the input space to the feature space, we obtain a hyperplane that separates the data into two groups in the feature space.
• Kernel function: $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, i.e., the inner product in the feature space, evaluated directly in the input space.
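A small numeric check of this idea (illustrative, with a hypothetical degree-2 feature map Φ): the inner product of the mapped points equals the kernel K(x, z) = (x·z)² evaluated directly in the input space, so the mapping never has to be formed explicitly.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-dimensional inputs."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """The corresponding kernel, computed directly in input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # inner product in feature space
print(k(x, z))           # same value, without ever forming phi
```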
Dual problem in nonlinear case
• Replace the dot product $x_i \cdot x_j$ of the inputs with the kernel function $K(x_i, x_j)$ in the dual problem for the linearly non-separable case:
  $$\max_{\lambda} \; \sum_{i=1}^{l} \lambda_i - \tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \lambda_i \lambda_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{l} \lambda_i y_i = 0, \;\; 0 \le \lambda_i \le C.$$
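A hedged sketch of this replacement in practice (assuming scikit-learn; the toy data, γ, and C are illustrative): the dual solver can be given the kernel by name, or handed the precomputed Gram matrix K(x_i, x_j), so it never sees the raw dot products x_i · x_j.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)   # not linearly separable

# Option 1: name the kernel and let the library evaluate K(x_i, x_j).
clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

# Option 2: hand the dual solver the Gram matrix directly.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-1.0 * sq_dists)                             # same Gaussian kernel, gamma = 1
clf_pre = SVC(kernel="precomputed", C=10.0).fit(K, y)

print("training accuracy (named kernel):      ", clf.score(X, y))
print("training accuracy (precomputed kernel):", clf_pre.score(K, y))
```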
Support Vector Regression
• The ε-insensitive support vector regression: find a function $f(x) = w \cdot x + b$ that has at most $\varepsilon$ deviation from the actually obtained targets $y_i$ for all the training data and, at the same time, is as flat as possible.
• Primal regression problem:
  $$\min_{w, b} \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i - w \cdot x_i - b \le \varepsilon, \quad w \cdot x_i + b - y_i \le \varepsilon, \qquad i = 1, \ldots, l.$$
Soft Margin Formulation
• Soft margin formulation:
  $$\min_{w, b, \xi, \xi^*} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) \quad \text{s.t.} \quad y_i - w \cdot x_i - b \le \varepsilon + \xi_i, \quad w \cdot x_i + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0.$$
• $C$ determines the trade-off between the flatness of $f(x)$ and the amount up to which deviations larger than $\varepsilon$ are tolerated.
• The ε-insensitive loss function $|\xi|_\varepsilon$ (Vapnik 1995) is defined as
  $$|\xi|_\varepsilon = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon, \\ |\xi| - \varepsilon & \text{otherwise.} \end{cases}$$
• The Lagrangian function will help us formulate the dual problem.
• Optimality conditions: setting the derivatives of the Lagrangian with respect to $w$, $b$, $\xi_i$, and $\xi_i^*$ to zero gives
  $$w = \sum_{i=1}^{l} (\lambda_i - \lambda_i^*)\, x_i, \qquad \sum_{i=1}^{l} (\lambda_i - \lambda_i^*) = 0, \qquad \lambda_i, \lambda_i^* \in [0, C].$$
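A minimal sketch of the ε-insensitive loss and of the penalized objective above (ε, C, the data, and the candidate (w, b) are illustrative assumptions): deviations inside the tube cost nothing, deviations outside it are charged linearly and traded off against ||w||² through C.

```python
import numpy as np

def eps_insensitive(residual, eps=0.1):
    """|xi|_eps = 0 if |xi| <= eps, otherwise |xi| - eps."""
    return np.maximum(np.abs(residual) - eps, 0.0)

def svr_objective(w, b, X, y, C=1.0, eps=0.1):
    """0.5 ||w||^2 + C * sum of epsilon-insensitive deviations."""
    residuals = y - (X @ w + b)
    return 0.5 * (w @ w) + C * eps_insensitive(residuals, eps).sum()

# Illustrative 1-D regression data and two candidate linear functions.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(50, 1))
y = 0.8 * X[:, 0] + 0.05 * rng.normal(size=50)

print(svr_objective(np.array([0.8]), 0.0, X, y))   # fits inside the tube almost everywhere
print(svr_objective(np.array([0.0]), 0.0, X, y))   # flatter, but pays for the deviations
```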
Dual Problem for Regression
• Dual problem:
  $$\max_{\lambda, \lambda^*} \; -\tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (\lambda_i - \lambda_i^*)(\lambda_j - \lambda_j^*)\, x_i \cdot x_j \;-\; \varepsilon \sum_{i=1}^{l} (\lambda_i + \lambda_i^*) \;+\; \sum_{i=1}^{l} y_i (\lambda_i - \lambda_i^*)$$
  subject to $\sum_{i=1}^{l} (\lambda_i - \lambda_i^*) = 0$ and $\lambda_i, \lambda_i^* \in [0, C]$.
• Solving for $w$ we obtain
  $$w = \sum_{i=1}^{l} (\lambda_i - \lambda_i^*)\, x_i, \qquad \text{so that} \qquad f(x) = \sum_{i=1}^{l} (\lambda_i - \lambda_i^*)\, (x_i \cdot x) + b.$$
KKT Optimality Conditions and b*
• KKT optimality conditions:
  $$\lambda_i (\varepsilon + \xi_i - y_i + w \cdot x_i + b) = 0, \qquad \lambda_i^* (\varepsilon + \xi_i^* + y_i - w \cdot x_i - b) = 0,$$
  $$(C - \lambda_i)\, \xi_i = 0, \qquad (C - \lambda_i^*)\, \xi_i^* = 0.$$
• Only samples $(x_i, y_i)$ with $\lambda_i = C$ or $\lambda_i^* = C$ lie outside the ε-insensitive tube around $f$. If $\lambda_i$ is nonzero, then $\lambda_i^*$ is zero and vice versa. Finally, if $\lambda_i \in (0, C)$, then the corresponding $\xi_i$ is zero.
• $b$ can be computed as follows:
  $$b^* = y_i - w \cdot x_i - \varepsilon \;\; \text{for } \lambda_i \in (0, C), \qquad b^* = y_i - w \cdot x_i + \varepsilon \;\; \text{for } \lambda_i^* \in (0, C).$$
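A hedged end-to-end sketch (assuming scikit-learn; the data, C, and ε are illustrative): fit an ε-SVR and read off the dual coefficients (λ_i − λ_i*), the support vectors, and the intercept b* that the library derives from the KKT conditions.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

print("number of support vectors:", len(model.support_))
print("dual coefficients (lambda_i - lambda_i*):", model.dual_coef_[0][:5], "...")
print("intercept b*:", model.intercept_[0])

# Samples lying strictly inside the epsilon-tube have lambda_i = lambda_i* = 0
# and therefore do not appear among the support vectors.
inside = np.abs(y - model.predict(X)) < 0.1
print("fraction of samples strictly inside the tube:", inside.mean())
```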
QP SV Regression Problem in Feature Space
• Mapping into the feature space, we obtain the same quadratic SV regression problem with the dot products $x_i \cdot x_j$ replaced by the kernel values $K(x_i, x_j)$.
• At the optimal solution, we obtain
  $$f(x) = \sum_{i=1}^{l} (\lambda_i - \lambda_i^*)\, K(x_i, x) + b^*.$$
Kernel Functions in SVMs
• An inner product in the feature space has an equivalent kernel in the input space: $K(x, z) = \Phi(x) \cdot \Phi(z)$.
• Any symmetric positive semi-definite function (Smola 1998) that satisfies Mercer's conditions can be used as a kernel function in the SVM context. Mercer's conditions can be written as
  $$\iint K(x, z)\, g(x)\, g(z)\; dx\, dz \ge 0 \quad \text{for every } g \text{ with } \int g(x)^2\, dx < \infty.$$
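A finite-sample analogue of Mercer's condition that is easy to check numerically (a sketch with illustrative data and kernels): for any finite set of points, the Gram matrix K_ij = K(x_i, x_j) of a valid kernel must be symmetric positive semi-definite, i.e., have no negative eigenvalues.

```python
import numpy as np

def gram_matrix(kernel, X):
    """K_ij = kernel(x_i, x_j) for every pair of rows of X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def is_psd(K, tol=1e-10):
    """Positive semi-definiteness check via the eigenvalues of the symmetrized matrix."""
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)
    return bool(eigvals.min() >= -tol)

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))    # a valid (Mercer) kernel
bad = lambda x, z: -np.sum((x - z) ** 2)            # symmetric, but not a Mercer kernel

print(is_psd(gram_matrix(rbf, X)))   # True
print(is_psd(gram_matrix(bad, X)))   # False: negative eigenvalues appear
```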
Some kernel functions
• Polynomial type: $K(x, z) = (x \cdot z + 1)^d$.
• Gaussian Radial Basis Function (GRBF): $K(x, z) = \exp\!\left( -\dfrac{\|x - z\|^2}{2\sigma^2} \right)$.
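A short sketch of these two kernels in code (the degree d, the offset, and the width σ are illustrative defaults rather than values from the slides):

```python
import numpy as np

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """K(x, z) = (x . z + coef0)^degree"""
    return (np.dot(x, z) + coef0) ** degree

def gaussian_rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 0.5])
z = np.array([0.2, -0.3])
print(polynomial_kernel(x, z))
print(gaussian_rbf_kernel(x, z))
```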