
SVM: Support Vector Machines



Based on the statistical learning theory of Vapnik, Chervonenkis, Burges, Schölkopf, Smola, Bartlett, Mendelson, and Cristianini

Presented By: Tamer Salman

- SVM can deal with three kinds of problems:
- Pattern Recognition / Classification.
- Regression Estimation.
- Density Estimation.

- Given:
- A set of M labeled patterns: $\{(x_i, y_i)\}_{i=1}^{M}$, with patterns $x_i$ and labels $y_i \in \{-1, +1\}$.
- The patterns are drawn i.i.d. from an unknown distribution P(X,Y).
- A set of functions F.

- Goal: choose a function f in F such that an unseen pattern x will be correctly classified with high probability.
- Binary classification: Two classes, +1 and -1.

- What is the probability of error (the risk) of a function f?
$$R(f) = \int c(f(x), y) \, dP(x, y)$$
where c is some cost function on errors.

- The risk is not computable because the distribution P(x,y) is unknown.
- A suitable estimate must be found instead.

Linear Neural Network

Linear SVM

- Linear SVM produces the maximal-margin hyperplane, which is as far as possible from the closest training points.

- Given the training set, we seek w and b such that: $y_i (w \cdot x_i + b) \ge 1$ for all i.
- In addition, we seek the maximal-margin hyperplane.
- What is the margin?
- How do we maximize it?

- The margin is the sum of the distances from the hyperplane to the closest point on each side.
- The distance of the hyperplane (w,b) from the origin is |b|/||w||.
- When the closest points satisfy |w·x + b| = 1, each lies at distance 1/||w|| from the hyperplane, so the margin is 2/||w||.
- Maximizing the margin is equivalent to minimizing ½||w||².
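The margin formula can be checked numerically. The following is a minimal sketch, assuming a two-point toy data set and a hyperplane already scaled so that the closest points satisfy |w·x + b| = 1; the data and the helper names are illustrative and not part of the original slides.

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def margin(w):
    """Margin of a canonical hyperplane, i.e. one scaled so that the
    closest point on each side satisfies |w.x + b| = 1."""
    return 2.0 / np.linalg.norm(w)

# Toy data (assumed for illustration): one point per class.
x_pos = np.array([1.0, 1.0])    # label +1
x_neg = np.array([-1.0, -1.0])  # label -1

# Canonical hyperplane for this data: w.x_pos + b = +1, w.x_neg + b = -1.
w = np.array([0.5, 0.5])
b = 0.0

d_pos = distance_to_hyperplane(x_pos, w, b)
d_neg = distance_to_hyperplane(x_neg, w, b)

# The sum of the two distances equals 2/||w||.
print(d_pos + d_neg, margin(w))
```

Here the margin equals the distance between the two points, which is what one would expect when each class contributes a single point.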

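- The Lagrangian is:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$$

- Requiring the derivatives with respect to w,b to vanish yields:
$$w = \sum_{i} \alpha_i y_i x_i, \qquad \sum_{i} \alpha_i y_i = 0$$
- KKT conditions yield:
$$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0 \quad \text{for all } i$$
- Where:
$$\alpha_i \ge 0$$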

- The resulting separating function is:
$$f(x) = \mathrm{sgn}\left( \sum_{i} \alpha_i y_i (x_i \cdot x) + b \right)$$
- Notes:
- The points with α=0 do not affect the solution.
- The points with α≠0 are called support vectors.
- The constraints hold with equality, y_i(w·x_i + b) = 1, only for the support vectors.
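To make the role of the multipliers concrete, the sketch below recovers w and b from a given dual solution and classifies a new point. The three-point data set and the values of `alpha` are assumptions chosen for illustration; for this particular data they satisfy the KKT conditions, with the third point receiving α=0.

```python
import numpy as np

# Toy training set (assumed): two support vectors and one "easy" point.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]])
y = np.array([1.0, -1.0, 1.0])

# Dual solution for this toy set (assumed given, not computed here).
alpha = np.array([0.25, 0.25, 0.0])

# w is a weighted sum of the training points: w = sum_i alpha_i y_i x_i.
w = (alpha * y) @ X

# Support vectors are the points with non-zero alpha.
is_sv = alpha > 1e-12

# For a support vector, y_i (w.x_i + b) = 1, hence b = y_i - w.x_i.
# Averaging over all support vectors is a common numerical safeguard.
b = np.mean(y[is_sv] - X[is_sv] @ w)

def classify(x):
    """Separating function f(x) = sgn(w.x + b)."""
    return np.sign(np.dot(w, x) + b)

# Functional margins y_i (w.x_i + b): equal to 1 only for the support vectors.
functional_margins = y * (X @ w + b)
print(w, b, functional_margins, classify(np.array([0.5, 2.0])))
```

Dropping the third point would leave w and b unchanged, which is the sense in which points with α=0 do not affect the solution.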

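- We introduce slack variables ξi ≥ 0 and allow mistakes.
- We demand: $y_i (w \cdot x_i + b) \ge 1 - \xi_i$
- And minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i} \xi_i$, where C > 0 sets the cost of mistakes.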

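- The modifications yield the following (dual) problem:
$$\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i} \alpha_i y_i = 0$$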
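The effect of the slack variables can be sketched as follows. For a fixed (w, b), the smallest feasible slack for each point is max(0, 1 − y_i(w·x_i + b)). The trade-off constant `C`, the toy data, and the function names are assumptions made for this illustration.

```python
import numpy as np

def slacks(X, y, w, b):
    """Smallest slack xi_i >= 0 satisfying y_i (w.x_i + b) >= 1 - xi_i."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(X, y, w, b, C):
    """Soft-margin primal objective: 0.5 * ||w||^2 + C * sum_i xi_i."""
    return 0.5 * np.dot(w, w) + C * slacks(X, y, w, b).sum()

# Toy data (assumed): the third point carries label +1 but lies on the -1 side.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [-0.5, -0.5]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.5, 0.5])
b = 0.0

xi = slacks(X, y, w, b)
objective = soft_margin_objective(X, y, w, b, C=1.0)
print(xi, objective)
```

A slack larger than 1 marks a misclassified point; a slack between 0 and 1 marks a point that is correctly classified but falls inside the margin.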

- Note that the training data appears in the solution only in inner products.
- If we pre-map the data into a higher-dimensional, sparser space, the data can become more separable and we obtain a richer family of separating functions.
- The pre-mapping might make the problem computationally infeasible.
- We want to avoid the explicit pre-mapping and still have the same separation ability.
- Suppose we have a simple function that operates on two training points and computes the inner product of their pre-mappings. Then we achieve the better separation at no added cost.
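As an illustration of such a function, the sketch below compares an explicit degree-2 pre-mapping of 2-D inputs with the kernel (x·z)², which gives the same inner product without ever forming the mapped vectors. The specific map `phi` is one standard choice, assumed here for the example.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (one possible choice)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def k(x, z):
    """Homogeneous polynomial kernel of degree 2, computed in input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # inner product after pre-mapping
implicit = k(x, z)                 # same value, no pre-mapping needed
print(explicit, implicit)
```

For degree d and n input dimensions the explicit map grows combinatorially in size, while the kernel evaluation stays a single n-dimensional inner product followed by a power.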

- A Mercer kernel is a function:
$$k: X \times X \to \mathbb{R}$$
for which there exists a function:
$$\Phi: X \to H$$
such that:
$$k(x, z) = \Phi(x) \cdot \Phi(z)$$

- A function k(.,.) is a Mercer kernel if
for any function g(.), such that:
$$\int g(x)^2 \, dx < \infty$$
the following holds true:
$$\iint k(x, z) \, g(x) \, g(z) \, dx \, dz \ge 0$$
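On a finite sample, a necessary consequence of the Mercer condition is that the kernel (Gram) matrix is positive semidefinite. The sketch below checks this numerically for a Gaussian kernel and for a deliberately invalid function; both functions, the random data, and the tolerance are assumptions chosen for illustration. Passing the check on one sample does not prove that a function is a Mercer kernel, but failing it shows that it is not.

```python
import numpy as np

def gram(kernel, X):
    """Kernel matrix K[i, j] = kernel(X[i], X[j])."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

def is_psd(K, tol=1e-10):
    """True if all eigenvalues of the symmetric matrix K are >= -tol."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

def gaussian(x, z):
    """Gaussian (RBF) kernel with unit width."""
    return np.exp(-np.sum((x - z) ** 2) / 2.0)

def neg_sq_dist(x, z):
    """Negative squared distance: symmetric, but not a Mercer kernel."""
    return -np.sum((x - z) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

gaussian_ok = is_psd(gram(gaussian, X))
neg_sq_dist_ok = is_psd(gram(neg_sq_dist, X))
print(gaussian_ok, neg_sq_dist_ok)
```

The second matrix has a zero diagonal and non-zero off-diagonal entries, so its eigenvalues sum to zero and at least one of them must be negative.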

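- Homogeneous Polynomial Kernels: $k(x, z) = (x \cdot z)^d$
- Non-homogeneous Polynomial Kernels: $k(x, z) = (x \cdot z + 1)^d$
- Radial Basis Function (RBF) Kernels: $k(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$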
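The three kernel families can be sketched as vectorized functions that return a full kernel matrix. The parameter names `d`, `c`, and `sigma`, their defaults, and the RBF parameterization through 2σ² are assumptions; other conventions (for example a single γ factor) are also in use.

```python
import numpy as np

def poly_homogeneous(X, Z, d=2):
    """K[i, j] = (x_i . z_j)^d"""
    return (X @ Z.T) ** d

def poly_nonhomogeneous(X, Z, d=2, c=1.0):
    """K[i, j] = (x_i . z_j + c)^d"""
    return (X @ Z.T + c) ** d

def rbf(X, Z, sigma=1.0):
    """K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))"""
    sq_dist = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K_h = poly_homogeneous(X, X)
K_n = poly_nonhomogeneous(X, X)
K_r = rbf(X, X)
print(K_h, K_n, K_r, sep="\n")
```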

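- The problem:
$$\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i} \alpha_i y_i = 0$$
- The separating function:
$$f(x) = \mathrm{sgn}\left( \sum_{i} \alpha_i y_i k(x_i, x) + b \right)$$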
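A classic small case is XOR, which no linear function separates in input space. The sketch below evaluates a kernel separating function on it, assuming the degree-2 non-homogeneous polynomial kernel and the multipliers α_i = 1/8 with b = 0; these values are taken as given for this data set rather than computed by a solver, and the code checks that they place every training point exactly on the margin.

```python
import numpy as np

def poly_kernel(X, Z):
    """Non-homogeneous polynomial kernel of degree 2: (x . z + 1)^2."""
    return (X @ Z.T + 1.0) ** 2

# XOR-style training set: the label is -1 when both coordinates agree.
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])

# Dual solution assumed for this data set (all four points are support vectors).
alpha = np.full(4, 0.125)
b = 0.0

def decision_values(X_new):
    """sum_i alpha_i y_i k(x_i, x) + b for each row x of X_new."""
    return poly_kernel(X_new, X) @ (alpha * y) + b

def classify(X_new):
    """Separating function: the sign of the decision value."""
    return np.sign(decision_values(X_new))

train_values = decision_values(X)
print(train_values, classify(np.array([[0.5, -2.0]])))
```

Note that w is never formed: only kernel values between the new point and the support vectors are needed.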

- The solution of a non-linear SVM is linear in the feature space H.
- In a non-linear SVM, w exists in H.
- The complexity of computing the kernel values is not higher than the complexity of the solution, and the values can be computed a priori and stored in a kernel matrix.
- SVM is suitable for large-scale problems due to its chunking ability.

- Due to the fact that the actual risk is not computable, we seek to estimate the error rate of a machine given a finite set of m patterns.
- Empirical Risk.
- Training and Testing.
- k-fold cross validation.
- Leave-one-out (LOO).
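The k-fold and leave-one-out estimates can be sketched as below. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM; that stand-in, the synthetic data, and all function names are assumptions made for illustration. Any pair of fit and predict functions with the same shape could be substituted.

```python
import numpy as np

def kfold_indices(m, k, seed=0):
    """Split the indices 0..m-1 into k disjoint, shuffled folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(m), k)

def fit_centroids(X, y):
    """Stand-in learner: store the mean of each class."""
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)

def predict_centroids(model, X):
    """Assign each point to the class with the closer centroid."""
    c_pos, c_neg = model
    d_pos = np.linalg.norm(X - c_pos, axis=1)
    d_neg = np.linalg.norm(X - c_neg, axis=1)
    return np.where(d_pos <= d_neg, 1, -1)

def cross_val_error(X, y, k, fit, predict, seed=0):
    """Fraction of patterns misclassified when each fold is held out once."""
    errors = 0
    for test_idx in kfold_indices(len(y), k, seed):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        model = fit(X[train_mask], y[train_mask])
        errors += np.sum(predict(model, X[test_idx]) != y[test_idx])
    return errors / len(y)

# Synthetic two-class data (assumed): two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, size=(30, 2)),
               rng.normal(-2.0, 1.0, size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)

err_5fold = cross_val_error(X, y, 5, fit_centroids, predict_centroids)
err_loo = cross_val_error(X, y, len(y), fit_centroids, predict_centroids)  # k = m
print(err_5fold, err_loo)
```

Leave-one-out is the special case k = m, which requires m training runs; that cost is one motivation for seeking bounds that are cheaper to evaluate.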

- We seek faster estimates of the solution.
- The bound should be tight and informative.
- Theoretical VC bound:
Risk < Empirical Risk + Complexity (VC-dimension / m)

Loose and not always informative.

- Margin Radius bound:
Risk < R² / margin²

Where R is the radius of the smallest enclosing sphere of the data in feature space.

Tight and informative.

[Figure: Error versus Parameter, comparing the Bound with the LOO Error.]

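- One of the tightest sample-based bounds depends on the Rademacher complexity term, defined as follows:
$$R_m(F) = E_{P(x)} E_{\sigma} \left[ \sup_{f \in F} \left| \frac{2}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \right| \right]$$
where:

F is the class of functions mapping the domain of the input into R.

$E_{P(x)}$ is the expectation with respect to the probability distribution of the input data.

$E_{\sigma}$ is the expectation with respect to the $\sigma_i$, which are independent random variables uniform on {±1}.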

- Rademacher complexity measures the ability of the class of resulting functions to fit the input samples when they are assigned random class labels.
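For a simple function class this quantity can be estimated by sampling the random labels. The sketch below assumes the class of linear functions x ↦ w·x with ||w|| ≤ B and the normalization with a factor 2/m; for that class the supremum has the closed form (2B/m)·||Σ σ_i x_i||. It estimates only the expectation over σ for one fixed sample, not the outer expectation over the data. The class, the constant `B`, and the function names are assumptions for illustration.

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w|| <= B} |(2/m) sum_i sigma_i w.x_i|.

    For this class the supremum is attained at w parallel to sum_i sigma_i x_i,
    giving (2B/m) * ||sum_i sigma_i x_i||.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))  # random labels
    signed_sums = sigma @ X                              # shape (n_draws, d)
    return float(((2.0 * B / m) * np.linalg.norm(signed_sums, axis=1)).mean())

def jensen_upper_bound(X, B=1.0):
    """(2B/m) * sqrt(sum_i ||x_i||^2), an upper bound on the exact expectation."""
    m = X.shape[0]
    return (2.0 * B / m) * np.sqrt((X ** 2).sum())

rng = np.random.default_rng(2)
X_small = rng.normal(size=(50, 2))
X_large = rng.normal(size=(5000, 2))

r_small = empirical_rademacher_linear(X_small)
r_large = empirical_rademacher_linear(X_large)
print(r_small, jensen_upper_bound(X_small), r_large)
```

The estimate shrinks roughly like 1/√m as the sample grows, reflecting that a norm-bounded linear class becomes less able to fit random labels on larger samples.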

- The following bound holds true with probability (1-δ):
Where:

Êm is the error on the input data measured through a loss function h(.) with Lipschitz constant L. That is:

And the loss function can be one of:

Vapnik’s:

Bartlett & Mendelson’s: