
CS 9633 Machine Learning Support Vector Machines


References:

Cristianini, N. and B. Scholkopf, Support Vector Machines and Kernel Methods: A New Generation of Learning Machines, AI Magazine, Fall 2002.

Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd Edition, 1999, Prentice-Hall.

Muller, K.R., S. Mika, G. Ratsch, K. Tsuda, B. Scholkopf, “An introduction to kernel-based learning algorithms,” IEEE Transactions on Neural Networks, 12(2), March 2001, pp. 181-201.

Burges, C. J. C. “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, 2(2), 121-167, 1998.

- Are explicitly based on a theoretical model of learning
- Come with theoretical guarantees about their performance
- Have a modular design that allows one to separately implement and design their components
- Are not affected by local minima
- Do not suffer from the curse of dimensionality

- Pre 1980:
- Almost all learning methods learned linear decision surfaces.
- Linear learning methods have nice theoretical properties

- 1980’s
- Decision trees and NNs allowed efficient learning of non-linear decision surfaces
- Little theoretical basis and all suffer from local minima

- 1990’s
- Efficient learning algorithms for non-linear functions based on computational learning theory developed
- Nice theoretical properties.

- Two independent developments within last decade
- Computational learning theory
- New efficient representations of non-linear functions that use “kernel functions”

- The resulting learning algorithm is an optimization algorithm rather than a greedy search.

- A learning system can be described mathematically as a system that
- Receives data (observations) as input and
- Outputs a function that can be used to predict some features of future data.

- Statistical learning theory models this as a function estimation problem
- Generalization Performance (accuracy in labeling test data) is measured

- Basic idea of support vector machines
- Optimal hyperplane for linearly separable patterns
- Extend to patterns that are not linearly separable

- SVM algorithm for pattern recognition

- Set of n training examples $(\mathbf{x}_i, d_i)$, where $\mathbf{x}_i$ is the feature vector and $d_i$ is the target output. Let $d_i = +1$ for positive examples and $d_i = -1$ for negative examples.
- Assume that the patterns are linearly separable.
- Patterns can be separated by a hyperplane


- The decision surface separating the classes is a hyperplane of the form:
$\mathbf{w}^T \mathbf{x} + b = 0$

- w is a weight vector
- x is input vector
- b is bias

- Allows us to write
$\mathbf{w}^T \mathbf{x} + b \ge 0$ for $d_i = +1$

$\mathbf{w}^T \mathbf{x} + b < 0$ for $d_i = -1$
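The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the example hyperplane are hypothetical and do not come from the slides.

```python
def decision_value(w, x, b):
    """Return w^T x + b for weight vector w, input vector x, and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Return +1 if w^T x + b >= 0, otherwise -1."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 3 = 0 in two dimensions.
w = [1.0, 1.0]
b = -3.0

print(classify(w, [3.0, 2.0], b))  # 3 + 2 - 3 = 2, so the positive side
print(classify(w, [0.5, 1.0], b))  # 0.5 + 1 - 3 = -1.5, so the negative side
```

Points that lie exactly on the hyperplane have a decision value of zero and are assigned to the positive class here, matching the inequality for $d_i = +1$.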

- Margin of Separation ($\rho$): the separation between the hyperplane and the closest data point for a given weight vector w and bias b.
- Optimal Hyperplane (maximal margin): the particular hyperplane for which the margin of separation is maximized.
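As a rough illustration of these two definitions, the sketch below computes the margin of separation as the smallest distance $|\mathbf{w}^T \mathbf{x} + b| / \|\mathbf{w}\|$ over the data and compares two candidate hyperplanes. The data points, the candidates, and the function name are made up for the example; it compares only the listed candidates and does not search for the optimal hyperplane.

```python
import math


def margin_of_separation(w, b, points):
    """Smallest Euclidean distance from any point to the hyperplane w^T x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(
        abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w for x in points
    )


# Hypothetical data: the first two points are positive, the last two negative.
points = [[3.0, 3.0], [4.0, 2.0], [1.0, 1.0], [0.0, 2.0]]

# Two hypothetical hyperplanes that both separate the classes.
candidates = {
    "x1 + x2 = 4": ([1.0, 1.0], -4.0),
    "x1 = 2": ([1.0, 0.0], -2.0),
}

margins = {
    name: margin_of_separation(w, b, points) for name, (w, b) in candidates.items()
}
best = max(margins, key=margins.get)
print(margins)
print("larger margin:", best)
```

Both candidates classify every point correctly, so the training error alone cannot tell them apart; the margin is what distinguishes them.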

Equation of Hyperplane

$\mathbf{w}_0^T \mathbf{x} + b_0 = 0$

[Figure: the optimal hyperplane and its margin of separation $\rho_0$]

Support Vectors: Input vectors for which

$\mathbf{w}_0^T \mathbf{x} + b_0 = 1$ or $\mathbf{w}_0^T \mathbf{x} + b_0 = -1$

- Support vectors are the data points that lie closest to the decision surface
- They are the most difficult to classify
- They have direct bearing on the optimum location of the decision surface
- We can show that the optimal hyperplane stems from the function class with the lowest capacity (VC dimension).
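A minimal sketch of how support vectors could be picked out once a hyperplane is known, assuming the hyperplane is already scaled so that the closest points satisfy $d_i(\mathbf{w}_0^T \mathbf{x}_i + b_0) = 1$. The data, the hyperplane, and the tolerance are hypothetical values chosen for illustration.

```python
def support_vectors(w0, b0, xs, ds, tol=1e-9):
    """Return the inputs x_i for which d_i * (w0^T x_i + b0) equals 1 within tol."""
    found = []
    for x, d in zip(xs, ds):
        g = sum(wi * xi for wi, xi in zip(w0, x)) + b0
        if abs(d * g - 1.0) <= tol:
            found.append(x)
    return found


# Hypothetical training set; the last point lies far from the decision surface.
xs = [[3.0, 3.0], [4.0, 2.0], [1.0, 1.0], [0.0, 2.0], [5.0, 5.0]]
ds = [1, 1, -1, -1, 1]

# Hypothetical hyperplane 0.5*x1 + 0.5*x2 - 2 = 0, assumed to be in canonical form.
w0 = [0.5, 0.5]
b0 = -2.0

print(support_vectors(w0, b0, xs, ds))
```

The point `[5.0, 5.0]` is classified correctly but is not returned, because it does not lie on either of the two margin hyperplanes.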

- Map data into a dot product space (the feature space) using a non-linear mapping function $\Phi$
- Perform the maximal margin algorithm in that space

[Figure: the mapping $\Phi$ takes the points x and o in input space X to $\Phi(x)$ and $\Phi(o)$ in feature space F]
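To make the idea of mapping before separating concrete, here is a small sketch with one-dimensional data that no single threshold can separate. The mapping `phi`, the data, and the hyperplane in feature space are all hypothetical and hand-picked; no margin maximization is performed.

```python
def phi(x):
    """Hypothetical non-linear map from the real line to a 2-D feature space."""
    return [x, x * x]


def label_changes_1d(xs, ds):
    """Count label changes along the real line; more than one means no threshold separates the data."""
    ordered = [d for _, d in sorted(zip(xs, ds))]
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)


# Hypothetical data: the outer points are positive, the inner points negative.
xs = [-2.0, -1.0, 1.0, 2.0]
ds = [1, -1, -1, 1]

# Hand-picked hyperplane in feature space: x^2 - 2.5 = 0.
w, b = [0.0, 1.0], -2.5

predictions = [
    1 if sum(wi * fi for wi, fi in zip(w, phi(x))) + b >= 0 else -1 for x in xs
]
print("label changes in input space:", label_changes_1d(xs, ds))
print("predictions in feature space:", predictions)
```

In the input space the labels change twice along the line, so no linear (threshold) classifier works, while in the feature space a single hyperplane reproduces every label.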

- The VC dimension is a purely combinatorial concept (not related to dimension)
- Number of examples needed to learn a class of interest reliably is proportional to the VC dimension of the class
- A larger VC dimension implies that a more complex machine is required to reliably learn an accurate function.

- Let $\alpha$ be the set of parameters of a learning machine (for example, in a neural network, the set of weights and biases).
- Let h be the VC dimension (capacity) of the learning machine.
- Consider an ensemble of pattern classifiers $\{F(\mathbf{x}, \alpha)\}$ with respect to input space X.
- For a number of training examples N > h, and simultaneously for all classification functions $F(\mathbf{x}, \alpha)$, the generalization error on the test data is lower than a “guaranteed” risk with probability $1 - \eta$.
- We will use the term “risk bound” instead of “guaranteed risk”.

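- The “empirical risk” $R_{emp}(\alpha)$ is just the measured error rate on the training data.
- The “loss” is the term $\frac{1}{2} |d_i - F(\mathbf{x}_i, \alpha)|$, which the empirical risk averages over the N training examples.
- One commonly used definition of the “risk bound” is:
$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h \left( \ln(2N/h) + 1 \right) - \ln(\eta/4)}{N}}$
where

N is the number of examples

h is the VC dimension

$1 - \eta$ is the probability with which the bound holds

and the second term on the right-hand side is called the VC confidence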
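As a numerical illustration, the sketch below evaluates the VC confidence term, assuming the commonly cited form $\sqrt{(h(\ln(2N/h) + 1) - \ln(\eta/4))/N}$ given in the Burges tutorial listed in the references. The function names and the sample values of h, N, and $\eta$ are hypothetical.

```python
import math


def vc_confidence(h, n, eta):
    """VC confidence term for VC dimension h, n examples, and confidence level 1 - eta."""
    return math.sqrt((h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0)) / n)


def risk_bound(empirical_risk, h, n, eta):
    """Risk bound: empirical risk plus VC confidence (assumed form, see lead-in)."""
    return empirical_risk + vc_confidence(h, n, eta)


# Hypothetical setting: 10,000 examples, bound holding with probability 0.95.
n, eta = 10000, 0.05
for h in (10, 100, 1000):
    print(h, round(vc_confidence(h, n, eta), 4))
```

For a fixed number of examples, the term grows as h grows, which is why a low training error obtained with a very high-capacity machine does not by itself give a small bound.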

- Properties of Bound
- Independent of probability distribution of data (assumed training and test data from same distribution).
- Not usually possible to compute the actual risk $R(\alpha)$
- If we know h, we can easily compute the right hand side.

- Implies that if we have several different learning machines (families of functions), we want to select the machine that minimizes the right-hand side of the bound

[Figure: error plotted against VC dimension h, with curves for the training error, the VC confidence, and the risk bound]

- Training error for each pattern classifier is minimized
- The pattern classifier with the smallest risk bound is identified. This classifier provides the best compromise between the training error and the complexity of the approximating function

- Structural risk minimization (SRM) finds the subset of functions that minimizes the bound on the actual risk

[Figure: nested subsets of functions with VC dimensions $h_1 < h_2 < h_3 < h_4$]

- Train a series of machines, one for each subset where for each given subset the goal of training is to minimize the empirical risk
- Select that trained machine in the series whose sum of empirical risk and VC confidence is minimal
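The selection step can be sketched as follows, assuming the same form of the VC confidence as in the Burges tutorial from the references. The four machines, their VC dimensions, and their empirical risks are invented numbers used only to show the mechanics; training each machine is outside the scope of the sketch.

```python
import math


def vc_confidence(h, n, eta):
    """VC confidence term (assumed form) for VC dimension h and n examples."""
    return math.sqrt((h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0)) / n)


def select_by_srm(machines, n, eta):
    """Pick the (name, h, empirical_risk) triple with the smallest risk bound."""
    return min(machines, key=lambda m: m[2] + vc_confidence(m[1], n, eta))


# Hypothetical trained machines, one per nested subset: (name, h, empirical risk).
machines = [
    ("S1", 5, 0.30),
    ("S2", 20, 0.12),
    ("S3", 80, 0.08),
    ("S4", 320, 0.05),
]

n, eta = 1000, 0.05
for name, h, emp in machines:
    print(name, round(emp + vc_confidence(h, n, eta), 3))
print("selected:", select_by_srm(machines, n, eta)[0])
```

With these made-up numbers the machine with the lowest training error is not selected, because its VC confidence outweighs the gain in empirical risk.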

- Support vectors are the elements of the training set that would change the position of the dividing hyper plane if removed.
- Support vectors are the critical elements of the training set
- The problem of finding the optimal hyper plane is an optimization problem and can be solved by optimization techniques (use Lagrange multipliers to get into a form that can be solved analytically).
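For reference, the optimization problem is usually written as follows for the linearly separable case. The symbol $\lambda_i$ for the Lagrange multipliers is a notational choice made here and is not taken from the slides.

```latex
% Primal problem: maximize the margin by minimizing the norm of w
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\, \mathbf{w}^T \mathbf{w}
\quad \text{subject to} \quad
d_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \ge 1, \qquad i = 1, \dots, n

% Dual problem obtained with Lagrange multipliers \lambda_i
\max_{\lambda} \ \sum_{i=1}^{n} \lambda_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \lambda_i \lambda_j d_i d_j\, \mathbf{x}_i^T \mathbf{x}_j
\quad \text{subject to} \quad
\sum_{i=1}^{n} \lambda_i d_i = 0, \qquad \lambda_i \ge 0

% Optimal weight vector recovered from the multipliers
\mathbf{w}_0 = \sum_{i=1}^{n} \lambda_i d_i\, \mathbf{x}_i
```

The dual is a quadratic program in the multipliers, and only the support vectors end up with non-zero $\lambda_i$, so they alone determine $\mathbf{w}_0$.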


- How can we generalize the previous result to the case where the decision function is not a linear function of the data? Answer: kernel functions
- The only way in which the data appears in the training problem is in the form of dot products $\mathbf{x}_i \cdot \mathbf{x}_j$
- First map the data to some other (possibly infinite-dimensional) space H using a mapping $\Phi$.
- The training algorithm now depends on the data only through dot products in H: $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$
- If there is a kernel function K such that
$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$

we would only need to use K in the training algorithm and would never need to know $\Phi$ explicitly. The conditions under which such kernel functions exist can be shown.
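A small check of this identity, using a standard textbook pair chosen for illustration: the degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^2$ on two-dimensional inputs and the explicit map $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. The slides do not name a specific kernel, so this choice and the sample vectors are assumptions.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def kernel(x, y):
    """Degree-2 polynomial kernel, computed without ever forming phi."""
    return dot(x, y) ** 2


# Hypothetical input vectors.
xi = [1.0, 2.0]
xj = [3.0, -1.0]

print(kernel(xi, xj))         # computed in the 2-D input space
print(dot(phi(xi), phi(xj)))  # computed in the 3-D feature space
```

Both expressions give the same value up to floating-point rounding, which is the sense in which the training algorithm can use K and never evaluate $\Phi$.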

- Two key ideas
- Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both the output and the input
- Construction of an optimal hyperplane for separating the features discovered in step 1