### CS 9633 Machine Learning Support Vector Machines

References:

Cristianini, N. and B. Schölkopf, “Support Vector Machines and Kernel Methods: A New Generation of Learning Machines,” AI Magazine, Fall 2002.

Haykin, S., Neural Networks: A Comprehensive Foundation, 2nd Edition, Prentice-Hall, 1999.

Müller, K.-R., S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, “An introduction to kernel-based learning algorithms,” IEEE Transactions on Neural Networks, 12(2), March 2001, pp. 181-201.

Burges, C. J. C., “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, 2(2), 1998, pp. 121-167.

Unique Features of SVMs and Kernel Methods

- Are explicitly based on a theoretical model of learning
- Come with theoretical guarantees about their performance
- Have a modular design that allows one to separately implement and design their components
- Are not affected by local minima
- Do not suffer from the curse of dimensionality

SVMs: A New Generation of Learning Algorithms

- Pre-1980:
  - Almost all learning methods learned linear decision surfaces.
  - Linear learning methods have nice theoretical properties.

- 1980s:
  - Decision trees and NNs allowed efficient learning of non-linear decision surfaces.
  - Little theoretical basis, and all suffer from local minima.

- 1990s:
  - Efficient learning algorithms for non-linear functions, based on computational learning theory, were developed.
  - Nice theoretical properties.

Key Ideas

- Two independent developments within the last decade:
  - Computational learning theory
  - New efficient representations of non-linear functions that use “kernel functions”

- The resulting learning algorithm is an optimization algorithm rather than a greedy search.

Statistical Learning Theory

- A learning system can be described mathematically as a system that
  - receives data (observations) as input, and
  - outputs a function that can be used to predict some features of future data.

- Statistical learning theory models this as a function estimation problem.
  - Generalization performance (accuracy in labeling test data) is the quantity that is measured.

Organization

- Basic idea of support vector machines
  - Optimal hyperplane for linearly separable patterns
  - Extend to patterns that are not linearly separable

- SVM algorithm for pattern recognition

Optimal Hyperplane for Linearly Separable Patterns

- Set of n training examples $(x_i, d_i)$, where $x_i$ is the feature vector and $d_i$ is the target output. Let $d_i = +1$ for positive examples and $d_i = -1$ for negative examples.
- Assume that the patterns are linearly separable.
- The patterns can then be separated by a hyperplane.

Defining the Hyperplane

- The decision surface separating the classes is a hyperplane of the form:

  $$w^T x + b = 0$$

  - $w$ is the weight vector
  - $x$ is the input vector
  - $b$ is the bias

- This allows us to write:

  $$w^T x_i + b \ge 0 \quad \text{for } d_i = +1$$

  $$w^T x_i + b < 0 \quad \text{for } d_i = -1$$
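The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the example hyperplane $x_1 + x_2 - 1 = 0$ are assumptions made here, not part of the original slides.

```python
def decision_value(w, x, b):
    """Return w^T x + b for weight vector w, input vector x, and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Assign +1 when w^T x + b >= 0 and -1 otherwise."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 1 = 0 in two dimensions (assumed values).
w, b = [1.0, 1.0], -1.0

# One point on each side of the hyperplane.
labels = [classify(w, x, b) for x in ([2.0, 2.0], [0.0, 0.0])]
print(labels)
```

Points that lie exactly on the hyperplane are assigned to the positive class here, matching the non-strict inequality used for $d_i = +1$.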

Some definitions

- Margin of separation ($\rho$): the separation between the hyperplane and the closest data point for a given weight vector $w$ and bias $b$.
- Optimal hyperplane (maximal margin): the particular hyperplane for which the margin of separation is maximized.
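As a rough illustration of these two definitions, the sketch below measures the margin of separation for two candidate hyperplanes on a toy data set and compares them. The data points, both candidate hyperplanes, and the helper name `margin_of_separation` are hypothetical; the distance from a point $x$ to the hyperplane is taken as $|w^T x + b| / \lVert w \rVert$.

```python
import math


def margin_of_separation(w, b, points):
    """Distance from the hyperplane w^T x + b = 0 to the closest point."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(
        abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
        for x in points
    )


# Toy data (assumed): the first two points are one class, the last two the other.
points = [[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]

# Two hypothetical hyperplanes that both separate the classes.
margin_a = margin_of_separation([1.0, 0.0], -1.0, points)   # x1 = 1
margin_b = margin_of_separation([1.0, 1.0], -2.5, points)   # x1 + x2 = 2.5

# The hyperplane with the larger margin is the better candidate of the two.
print(margin_a, margin_b, margin_b > margin_a)
```

Comparing two hand-picked candidates is not the same as finding the optimal hyperplane; that requires maximizing the margin over all $w$ and $b$.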

Support Vectors

- Support vectors are the data points that lie closest to the decision surface
- They are the most difficult to classify
- They have direct bearing on the optimum location of the decision surface
- We can show that the optimal hyperplane stems from the function class with the lowest capacity (VC dimension).

SVM Approach

- Map the data into a dot product space (the feature space) using a non-linear mapping function $\Phi$
- Run the maximal margin algorithm in that space
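A small sketch of why the mapping step helps, under assumed toy values: the one-dimensional data below cannot be split by a single threshold, but after the hypothetical map $x \mapsto (x, x^2)$ a hyperplane separates the classes. The map, the data, and the hyperplane are illustrative choices, not taken from the slides.

```python
def feature_map(x):
    """Hypothetical non-linear map from the real line to the plane: x -> (x, x^2)."""
    return (x, x * x)


# Toy 1-D data (assumed): the positive class lies on both sides of the negative
# class, so no single threshold on x separates the two classes.
data = [(-2.0, 1), (-0.5, -1), (0.5, -1), (2.0, 1)]

# In feature space, the hyperplane z2 - 1 = 0 (w = (0, 1), b = -1) separates them.
w, b = (0.0, 1.0), -1.0


def decision_value(x):
    """Evaluate w^T Phi(x) + b in the feature space."""
    z = feature_map(x)
    return sum(wi * zi for wi, zi in zip(w, z)) + b


# Every example lies on the side of the hyperplane that matches its label.
separated = all(d * decision_value(x) > 0 for x, d in data)
print(separated)
```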

[Figure: a non-linear mapping carries the points x and o in the input space X to their images in the feature space F.]

Importance of VC dimension

- The VC dimension is a purely combinatorial concept (not related to dimension)
- Number of examples needed to learn a class of interest reliably is proportional to the VC dimension of the class
- A larger VC dimension implies that it requires a more complex machine to reliably learn an accurate function.

Structural Risk Minimization

- Let $\alpha$ be the set of parameters of a learning machine (for example, in a neural network, it would be the set of weights and biases).
- Let $h$ be the VC dimension (capacity) of a learning machine.
- Consider an ensemble of pattern classifiers $\{F(x, \alpha)\}$ with respect to input space $X$.
- For a number of training examples $N > h$, and simultaneously for all classification functions $F(x, \alpha)$, the generalization error on the test data is lower than a “guaranteed” risk with probability $1 - \eta$.
- We will use the term “risk bound” instead of “guaranteed risk”.

Risk Bound

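- The “empirical risk” $R_{emp}(\alpha)$ is just the measured error rate on the training data.
- The “loss” is the term:

  $$\frac{1}{2} \left| d_i - F(x_i, \alpha) \right|$$

- One commonly used definition of the “risk bound” is:

  $$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h \left( \ln(2N/h) + 1 \right) - \ln(\eta/4)}{N}}$$

  where
  - $N$ is the number of examples
  - $h$ is the VC dimension
  - $1 - \eta$ is the probability with which the bound holds
  - the second term on the right-hand side is called the VC confidence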
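To get a feel for how the bound behaves, the sketch below evaluates the VC confidence numerically. It assumes the form of the bound given in the Burges tutorial listed in the references, $\sqrt{(h(\ln(2N/h) + 1) - \ln(\eta/4))/N}$, where the bound holds with probability $1 - \eta$. The function names and the sample values of $N$, $h$, and $\eta$ are hypothetical.

```python
import math


def vc_confidence(n_examples, vc_dim, eta):
    """VC confidence term, assuming the Burges-tutorial form of the bound."""
    return math.sqrt(
        (vc_dim * (math.log(2.0 * n_examples / vc_dim) + 1.0) - math.log(eta / 4.0))
        / n_examples
    )


def risk_bound(empirical_risk, n_examples, vc_dim, eta):
    """Risk bound = empirical risk + VC confidence."""
    return empirical_risk + vc_confidence(n_examples, vc_dim, eta)


# Assumed sample values: 10,000 examples, bound holding with probability 0.95.
low_capacity = vc_confidence(10_000, 10, 0.05)
high_capacity = vc_confidence(10_000, 100, 0.05)
more_data = vc_confidence(100_000, 100, 0.05)

# In these examples the term grows with the VC dimension and shrinks with more data.
print(low_capacity, high_capacity, more_data)
```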

Implications of Bound

- Properties of the bound
  - It is independent of the probability distribution of the data (training and test data are assumed to come from the same distribution).
  - It is not usually possible to compute the actual risk $R(\alpha)$.
  - If we know $h$, we can easily compute the right-hand side.

- This implies that if we have several different learning machines (families of functions), we want to select the machine that minimizes the right-hand side.

Method of Structural Risk Minimization

- Training error for each pattern classifier is minimized
- The pattern classifier with the smallest risk bound is identified. This classifier provides the best compromise between the training error and the complexity of the approximating function

Structural Risk Minimization

- SRM finds the subset of functions that minimizes the bound on the actual risk

[Figure: nested subsets of functions with VC dimensions $h_1 < h_2 < h_3 < h_4$.]

Steps in SRM

- Train a series of machines, one for each subset where for each given subset the goal of training is to minimize the empirical risk
- Select that trained machine in the series whose sum of empirical risk and VC confidence is minimal
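The two steps can be sketched as a simple selection loop. Everything specific here is an assumption for illustration: the four machines, their VC dimensions, and their post-training empirical risks are made-up numbers, and the VC confidence uses the form given in the Burges tutorial from the references.

```python
import math


def vc_confidence(n_examples, vc_dim, eta):
    """VC confidence term, assuming the Burges-tutorial form of the bound."""
    return math.sqrt(
        (vc_dim * (math.log(2.0 * n_examples / vc_dim) + 1.0) - math.log(eta / 4.0))
        / n_examples
    )


# Hypothetical nested machines: (name, VC dimension h, empirical risk after training).
# Richer machines fit the training data better but have a larger VC dimension.
machines = [("S1", 5, 0.30), ("S2", 20, 0.12), ("S3", 80, 0.08), ("S4", 320, 0.07)]
N, ETA = 5000, 0.05  # assumed number of examples and confidence parameter


def risk_bound(machine):
    """Sum of empirical risk and VC confidence for one trained machine."""
    _, h, empirical_risk = machine
    return empirical_risk + vc_confidence(N, h, ETA)


# Step 2: select the machine whose risk bound is minimal.
best = min(machines, key=risk_bound)
print(best[0], round(risk_bound(best), 3))
```

With these made-up numbers the selected machine is neither the simplest nor the one with the lowest training error, which is the compromise SRM is meant to capture.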

Support Vectors again for linearly separable case

- Support vectors are the elements of the training set that would change the position of the dividing hyper plane if removed.
- Support vectors are the critical elements of the training set
- The problem of finding the optimal hyper plane is an optimization problem and can be solved by optimization techniques (use Lagrange multipliers to get into a form that can be solved analytically).
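For reference, the optimization problem is usually written as follows. This is the standard textbook formulation rather than something stated on the slides; the Lagrange multipliers are written $\lambda_i$ here, which is a notation choice made for this sketch, and the constraint assumes $w$ and $b$ are scaled so that the closest points satisfy $d_i (w^T x_i + b) = 1$.

```latex
% Primal problem: maximize the margin by minimizing the norm of w
\min_{w,\, b} \; \tfrac{1}{2}\, w^T w
\quad \text{subject to} \quad d_i \left( w^T x_i + b \right) \ge 1, \qquad i = 1, \dots, n

% Lagrangian with multipliers \lambda_i \ge 0
J(w, b, \lambda) = \tfrac{1}{2}\, w^T w
  - \sum_{i=1}^{n} \lambda_i \left[ d_i \left( w^T x_i + b \right) - 1 \right]

% Setting the derivatives with respect to w and b to zero
w = \sum_{i=1}^{n} \lambda_i d_i x_i, \qquad \sum_{i=1}^{n} \lambda_i d_i = 0

% Dual problem: the data enter only through dot products x_i^T x_j
\max_{\lambda} \; \sum_{i=1}^{n} \lambda_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j d_i d_j \, x_i^T x_j
\quad \text{subject to} \quad \sum_{i=1}^{n} \lambda_i d_i = 0, \quad \lambda_i \ge 0
```

The training points with $\lambda_i > 0$ at the solution are the support vectors; all other points drop out of the expression for $w$.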

Nonlinear Support Vector Machines

- How can we generalize the previous result to the case where the decision function is not a linear function of the data? Answer: kernel functions.
- The only way in which the data appears in the training problem is in the form of dot products $x_i \cdot x_j$.
- First map the data to some other (possibly infinite-dimensional) space $H$ using a mapping $\Phi$.
- The training algorithm now depends on the data only through dot products in $H$: $\Phi(x_i) \cdot \Phi(x_j)$.
- If there is a kernel function $K$ such that

  $$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$$

  we would only need to use $K$ in the training algorithm and would never need to know $\Phi$ explicitly. The conditions under which such kernel functions exist are given by Mercer's condition.
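A quick numerical check of this identity for one well-known case: for two-dimensional inputs, the kernel $K(x, y) = (x \cdot y)^2$ corresponds to the explicit map $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. The choice of kernel and the test vectors are assumptions made for this sketch.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def kernel(x, y):
    """Homogeneous degree-2 polynomial kernel K(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def feature_map(x):
    """Explicit map for 2-D inputs: (x1^2, sqrt(2) * x1 * x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


# Arbitrary test vectors (assumed values).
x, y = (1.0, 2.0), (3.0, -1.0)

via_kernel = kernel(x, y)                            # computed in input space
via_mapping = dot(feature_map(x), feature_map(y))    # computed in feature space

# The two agree up to floating-point rounding.
print(via_kernel, via_mapping)
```

The kernel evaluation never constructs the three-dimensional feature vectors, which is the point of the kernel trick when the feature space is very large or infinite-dimensional.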

Support Vector Machine for Pattern Recognition

- Two key ideas
  - Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both the input and the output
  - Construction of an optimal hyperplane for separating the features discovered in step 1
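Both ideas can be seen together in a small sketch of a kernel classifier on the XOR problem, which is not linearly separable in the input space. The decision function is taken to be $f(x) = \sum_i \lambda_i d_i K(x_i, x) + b$ with the polynomial kernel $K(x, y) = (1 + x \cdot y)^2$. The multiplier values $\lambda_i = 1/8$ and the zero bias are supplied by hand as assumptions of this sketch rather than computed by a training routine.

```python
def poly_kernel(x, y):
    """Inhomogeneous degree-2 polynomial kernel K(x, y) = (1 + x . y)^2."""
    return (1.0 + sum(xi * yi for xi, yi in zip(x, y))) ** 2


# XOR with +/-1 coding: each entry is (input vector, target d).
training_set = [
    ((-1.0, -1.0), -1),
    ((-1.0, 1.0), 1),
    ((1.0, -1.0), 1),
    ((1.0, 1.0), -1),
]

# Assumed values: every point is treated as a support vector with multiplier 1/8.
multipliers = [0.125, 0.125, 0.125, 0.125]
bias = 0.0


def decision_value(x):
    """Kernel expansion over the support vectors; the feature space stays hidden."""
    return sum(
        lam * d * poly_kernel(xi, x)
        for lam, (xi, d) in zip(multipliers, training_set)
    ) + bias


def classify(x):
    return 1 if decision_value(x) >= 0 else -1


predictions = [classify(x) for x, _ in training_set]
print(predictions)
```

With these values the expansion simplifies algebraically to $f(x) = -x_1 x_2$, so the classifier reproduces the XOR labels even though no feature vector is ever formed explicitly.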
