
CS 9633 Machine Learning: Support Vector Machines

References:

Cristianini, N. and B. Scholkopf, Support Vector Machines and Kernel Methods: A New Generation of Learning Machines, AI Magazine, Fall 2002.

Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd Edition, 1999, Prentice-Hall.

Muller, K.R., S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf, “An introduction to kernel-based learning algorithms,” IEEE Transactions on Neural Networks, 12(2), March 2001, pp. 181-201.

Burges, C. J. C., “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, 2(2), pp. 121-167, 1998.



Unique Features of SVMs and Kernel Methods

  • Are explicitly based on a theoretical model of learning

  • Come with theoretical guarantees about their performance

  • Have a modular design that allows their components to be designed and implemented separately

  • Are not affected by local minima

  • Do not suffer from the curse of dimensionality



SVMs: A New Generation of Learning Algorithms

  • Pre-1980:

    • Almost all learning methods learned linear decision surfaces.

    • Linear learning methods have nice theoretical properties

  • 1980s:

    • Decision trees and NNs allowed efficient learning of non-linear decision surfaces

    • Little theoretical basis and all suffer from local minima

  • 1990s:

    • Efficient learning algorithms for non-linear functions, based on computational learning theory, were developed

    • Nice theoretical properties.



Key Ideas

  • Two independent developments within the last decade

    • Computational learning theory

    • New efficient representations of non-linear functions that use “kernel functions”

  • The resulting learning algorithm is an optimization algorithm rather than a greedy search.



Statistical Learning Theory

  • A learning system can be described mathematically as a system that

    • Receives data (observations) as input and

    • Outputs a function that can be used to predict some features of future data.

  • Statistical learning theory models this as a function estimation problem

  • Generalization Performance (accuracy in labeling test data) is measured



Organization

  • Basic idea of support vector machines

    • Optimal hyperplane for linearly separable patterns

    • Extend to patterns that are not linearly separable

  • SVM algorithm for pattern recognition



Optimal Hyperplane for Linearly Separable Patterns

  • Set of n training examples (xi,di) where xi is the feature vector and di is the target output. Let di = +1 for positive examples and di = -1 for negative examples.

  • Assume that the patterns are linearly separable.

  • Patterns can be separated by a hyperplane



2-Dimensional Example

[Figure: a two-dimensional set of linearly separable training points.]



Defining the Hyperplane

  • The equation defining the decision surface separating the classes is a hyperplane of the form:

    wᵀx + b = 0

    • w is the weight vector

    • x is the input vector

    • b is the bias

  • This allows us to write

    wᵀx + b ≥ 0 for di = +1

    wᵀx + b < 0 for di = -1
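To make the decision rule concrete, here is a minimal sketch (not from the original slides; the weight vector, bias, and points are made-up values) that classifies inputs by the sign of wᵀx + b:

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen only for illustration.
w = np.array([2.0, -1.0])   # weight vector
b = -0.5                    # bias

# A few 2-dimensional input vectors.
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.5, 1.0]])

# Decision rule: predict di = +1 when w^T x + b >= 0, and di = -1 otherwise.
scores = X @ w + b
labels = np.where(scores >= 0, 1, -1)
print(scores)   # values of w^T x + b for each input
print(labels)   # predicted class labels
```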



Some Definitions

  • Margin of separation (ρ): the separation between the hyperplane and the closest data point for a given weight vector w and bias b.

  • Optimal hyperplane (maximal margin): the particular hyperplane for which the margin of separation ρ is maximized.
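A small sketch of how these quantities can be computed for a given (w, b); the numbers are made up and not part of the slides. The distance from a point x to the hyperplane is |wᵀx + b| / ||w||, and the margin of separation is the smallest such distance over the training set:

```python
import numpy as np

w = np.array([1.0, 1.0])    # hypothetical weight vector
b = -3.0                    # hypothetical bias

# Hypothetical training points.
X = np.array([[1.0, 1.0],
              [4.0, 4.0],
              [1.0, 3.0]])

# Distance of each point from the hyperplane w^T x + b = 0.
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# Margin of separation rho for this particular (w, b): the closest point's distance.
rho = distances.min()
print(distances)
print(rho)
```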



Equation of Hyperplane

w0ᵀx + b0 = 0

[Figure: linearly separable training points with the optimal hyperplane w0ᵀx + b0 = 0.]



Support Vectors: Input vectors for which

w0ᵀx + b0 = +1 or w0ᵀx + b0 = -1

[Figure: the support vectors lie on the two hyperplanes w0ᵀx + b0 = ±1, one on each side of the optimal hyperplane w0ᵀx + b0 = 0.]



Support Vectors

  • Support vectors are the data points that lie closest to the decision surface

  • They are the most difficult to classify

  • They have direct bearing on the optimum location of the decision surface

  • We can show that the optimal hyperplane stems from the function class with the lowest capacity (VC dimension).
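As a concrete illustration (a sketch with assumed values, not taken from the slides): when the optimal hyperplane (w0, b0) is written in canonical form, the support vectors are exactly the training points with |w0ᵀx + b0| = 1, so they can be picked out directly:

```python
import numpy as np

# Assumed optimal hyperplane in canonical form (illustrative values only).
w0 = np.array([1.0, 0.0])
b0 = -2.0

X = np.array([[1.0, 0.5],   # w0.x + b0 = -1 -> lies on the margin
              [3.0, 1.0],   # w0.x + b0 = +1 -> lies on the margin
              [5.0, 2.0]])  # w0.x + b0 = +3 -> well inside its class region

values = X @ w0 + b0
support_mask = np.isclose(np.abs(values), 1.0)
print(X[support_mask])      # the candidate support vectors
```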



SVM Approach

  • Map the data into a dot product space using a non-linear mapping function Φ

  • Perform the maximal margin algorithm in the new space (see the sketch after the figure)

[Figure: the non-linear map Φ takes input points (x and o) from the input space X to Φ(x) and Φ(o) in the feature space F, where the two classes become linearly separable.]
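The two steps can be sketched as follows (an illustration under assumptions, not code from the presentation): a hand-written quadratic feature map Φ plays the role of the non-linear mapping, scikit-learn's SVC with a linear kernel stands in for the maximal margin algorithm, and the data set is synthetic:

```python
import numpy as np
from sklearn.svm import SVC

def phi(X):
    """Explicit non-linear map into a dot-product (feature) space F."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# Synthetic data: class +1 inside a circle, class -1 outside.
# The classes are not linearly separable in the original input space X.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
d = np.where((X ** 2).sum(axis=1) < 0.5, 1, -1)

# Step 1: map the data into F.  Step 2: run a maximal-margin (linear) SVM in F.
clf = SVC(kernel="linear", C=1e6).fit(phi(X), d)
print(clf.score(phi(X), d))   # training accuracy in the feature space
```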



Importance of VC dimension

  • The VC dimension is a purely combinatorial concept (it is not related to the dimensionality of the input space)

  • The number of examples needed to learn a class of interest reliably is proportional to the VC dimension of the class

  • A larger VC dimension implies that a more complex machine is required to reliably learn an accurate function.
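A brief sketch (not part of the slides; scipy is assumed to be available) that makes the combinatorial nature of the VC dimension concrete for straight lines in the plane: a linear-programming feasibility check shows that all 8 labelings of 3 points in general position are separable, while the XOR labeling of 4 points is not, consistent with hyperplanes in the plane having VC dimension 3.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, d):
    """Is there a (w, b) with d_i * (w . x_i + b) >= 1 for every point?"""
    n, m = X.shape
    # Unknowns are (w_1, ..., w_m, b); each constraint reads -d_i*(x_i.w + b) <= -1.
    A_ub = -d[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(m + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (m + 1))
    return res.success

# Three points in general position: every one of the 2^3 labelings is separable.
X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(all(separable(X3, np.array(d))
          for d in itertools.product([-1, 1], repeat=3)))   # True: the points are shattered

# Four points with the XOR labeling cannot be separated by any line.
X4 = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
print(separable(X4, np.array([1, 1, -1, -1])))              # False
```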



Structural Risk Minimization

  • Let α be the set of adjustable parameters of a learning machine (in a neural network, for example, α would be the weights and biases).

  • Let h be the VC dimension (capacity) of a learning machine.

  • Consider an ensemble of pattern classifiers {F(x, α)} with respect to the input space X.

  • For a number of training examples N > h, and simultaneously for all classification functions F(x, α), the generalization error on the test data is lower than a “guaranteed” risk with probability 1 - η.

  • We will use the term “risk bound” instead of “guaranteed risk”.



Risk Bound

  • The “empirical risk” is just the measured error rate on the training data.

  • The “loss” for a single example is the term:

    (1/2)|di - F(xi, α)|

    which is 0 for a correct classification and 1 for an error, so the empirical risk is Remp(α) = (1/N) Σi (1/2)|di - F(xi, α)|.

  • One commonly used definition of the “risk bound” (the bound given in the Burges tutorial) is:

    R(α) ≤ Remp(α) + √( (h(ln(2N/h) + 1) - ln(η/4)) / N )

    where

    N is the number of examples

    h is the VC dimension

    η is the confidence parameter (the bound holds with probability 1 - η)

    and the second term on the rhs is called the VC confidence
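A short numerical sketch of the bound (the values of N, h, η, and the empirical risk are arbitrary and only for illustration): it evaluates the VC confidence term and the resulting risk bound.

```python
import numpy as np

def vc_confidence(N, h, eta):
    """VC confidence term: sqrt((h*(ln(2N/h) + 1) - ln(eta/4)) / N)."""
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

def risk_bound(emp_risk, N, h, eta):
    """Bound on the actual risk that holds with probability 1 - eta."""
    return emp_risk + vc_confidence(N, h, eta)

# Example: 10,000 training examples, VC dimension 100, bound holding with probability 0.95.
print(vc_confidence(N=10_000, h=100, eta=0.05))
print(risk_bound(emp_risk=0.02, N=10_000, h=100, eta=0.05))
```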



Implications of Bound

  • Properties of Bound

    • Independent of the probability distribution of the data (it is only assumed that the training and test data are drawn from the same distribution).

    • Not usually possible to compute the actual risk R(α)

    • If we know h, we can easily compute the right hand side.

  • Implies that if we have several different learning machines (families of functions), we want to select the machine that minimizes the rhs



[Figure: error plotted against VC dimension h. The training error decreases as h grows, the VC confidence increases with h, and the risk bound (their sum) passes through a minimum.]



Method of Structural Risk Minimization

  • Training error for each pattern classifier is minimized

  • The pattern classifier with the smallest risk bound is identified. This classifier provides the best compromise between the training error and the complexity of the approximating function



Structural Risk Minimization

  • SRM finds the subset of functions that minimizes the bound on the actual risk

[Figure: a nested sequence of function subsets with VC dimensions h1 < h2 < h3 < h4.]



Steps in SRM

  • Train a series of machines, one for each subset, where the goal of training each machine is to minimize the empirical risk

  • Select that trained machine in the series whose sum of empirical risk and VC confidence is minimal
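A compact sketch of these two steps with made-up numbers (the VC dimensions and empirical risks are invented for illustration): each machine in the series reports its empirical risk, and we pick the one whose empirical risk plus VC confidence is smallest.

```python
import numpy as np

def vc_confidence(N, h, eta):
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

# Hypothetical series of trained machines: (VC dimension, empirical risk after training).
# Empirical risk falls as capacity h grows, but the VC confidence rises.
machines = [(10, 0.30), (50, 0.10), (200, 0.03), (1000, 0.01)]
N, eta = 5_000, 0.05

bounds = [(h, emp + vc_confidence(N, h, eta)) for h, emp in machines]
best_h, best_bound = min(bounds, key=lambda t: t[1])
print(bounds)
print("selected machine: h =", best_h, "risk bound =", round(best_bound, 3))
```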



Support Vectors Again for the Linearly Separable Case

  • Support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed.

  • Support vectors are the critical elements of the training set

  • The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques (Lagrange multipliers are used to put it into a form that can be solved analytically).
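For reference, the standard dual problem obtained by introducing one Lagrange multiplier αi per training example (this is the usual derivation given in the Burges tutorial cited above, not material reproduced from the slides):

```latex
\[
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{n} \alpha_i
  \;-\; \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
        \alpha_i \alpha_j \, d_i d_j \, \mathbf{x}_i^{\top}\mathbf{x}_j \\
\text{subject to}\quad & \alpha_i \ge 0 \;\; (i = 1,\dots,n),
  \qquad \sum_{i=1}^{n} \alpha_i d_i = 0,
\end{aligned}
\qquad\text{with}\qquad
\mathbf{w}_0 = \sum_{i=1}^{n} \alpha_i d_i \, \mathbf{x}_i .
\]
```

The training points with αi > 0 are exactly the support vectors; every other point can be removed without changing the solution.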



Equation of Hyperplane

w0ᵀx + b0 = 0

[Figure: the optimal hyperplane w0ᵀx + b0 = 0 shown again with the linearly separable training points.]



Optimization Problem

  • Given the n training examples (xi, di), find the weight vector w and bias b that

    minimize (1/2)||w||²

    subject to di(wᵀxi + b) ≥ 1 for i = 1, …, n

  • Maximizing the margin of separation is equivalent to minimizing ||w|| under these constraints; the solution (w0, b0) is the optimal hyperplane.
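A minimal sketch of solving this problem numerically with an off-the-shelf convex optimizer (assuming the cvxpy package is available; the four training points are made up and the example is not part of the original slides):

```python
import cvxpy as cp
import numpy as np

# Hypothetical linearly separable training set.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [0.5, -1.0]])
d = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  d_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(d, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)              # the optimal hyperplane (w0, b0)
print(2 / np.linalg.norm(w.value))   # resulting separation between the two classes
```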



Nonlinear Support Vector Machines

  • How can we generalize the previous result to the case where the decision function is not a linear function of the data? Answer: kernel functions

    • The only way in which the data appears in the training problem is in the form of dot products xi · xj

    • First map the data to some other (possibly infinite dimensional) space H using a mapping Φ.

    • The training algorithm now depends on the data only through dot products in H: Φ(xi) · Φ(xj)

    • If there is a kernel function K such that

      K(xi, xj) = Φ(xi) · Φ(xj)

      we would only need to use K in the training algorithm and would never need to know Φ explicitly. The conditions under which such kernel functions exist can be characterized (Mercer's conditions).
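To make the kernel idea concrete, a small sketch (made-up points, not from the slides) verifying that the quadratic kernel K(x, z) = (x · z)² equals the dot product Φ(x) · Φ(z) for the explicit map Φ(x) = (x1², √2·x1x2, x2²), so a training algorithm can use K without ever forming Φ:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel in two dimensions."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    """Quadratic kernel evaluated directly in the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))   # dot product computed in the feature space H
print(K(x, z))                  # same value, computed without ever using phi
```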



Inner Product Kernels

  • Commonly used inner product kernels include:

    • Polynomial kernel: K(x, xi) = (xᵀxi + 1)^p

    • Radial-basis function (RBF) kernel: K(x, xi) = exp(-||x - xi||² / (2σ²))

    • Two-layer perceptron (sigmoid) kernel: K(x, xi) = tanh(β0 xᵀxi + β1)
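A sketch of these kernels written out as plain functions (the parameter values p, σ, β0, β1 are arbitrary), together with the Gram matrix of pairwise kernel values K(xi, xj) that a training algorithm would actually consume:

```python
import numpy as np

def polynomial(x, z, p=3):
    return (np.dot(x, z) + 1) ** p

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid(x, z, beta0=0.5, beta1=-1.0):
    return np.tanh(beta0 * np.dot(x, z) + beta1)

# Small made-up training set.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])

# Gram matrix: all pairwise kernel values over the training set.
gram = np.array([[rbf(xi, xj) for xj in X] for xi in X])
print(gram)
```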



Support Vector Machine for Pattern Recognition

  • Two key ideas

    • Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both the output and the input

    • Construction of an optimal hyperplane for separating the features discovered in step 1
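Putting the two ideas together, a minimal end-to-end sketch using scikit-learn's SVC (assumed available here purely for illustration; it is not part of the original presentation): the RBF kernel supplies the implicit non-linear mapping into the feature space, and the fitted model exposes the support vectors it found.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class problem: points inside a circle vs. points outside it.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))
d = np.where((X ** 2).sum(axis=1) < 0.4, 1, -1)

# Kernel SVM: the RBF kernel plays the role of the inner-product kernel K.
clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, d)

print(clf.score(X, d))                        # training accuracy
print(clf.support_vectors_.shape)             # support vectors found during training
print(clf.predict([[0.0, 0.0], [1.0, 1.0]]))  # predictions for one inside and one outside point
```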

