
### Kernel-Based Methods

Presented by

Jason Friedman

Lena Gorelick

Advanced Topics in Computer and Human Vision

Spring 2003

- Structural Risk Minimization (SRM)
- Support Vector Machines (SVM)
- Feature Space vs. Input Space
- Kernel PCA
- Kernel Fisher Discriminant Analysis (KFDA)


Structural Risk Minimization (SRM)

- Definition:
- Training set with l observations: (x₁, y₁), …, (x_l, y_l). Each observation consists of a pair: a vector xᵢ and a class label yᵢ ∈ {-1, 1}
- Example: a 16×16 pixel image gives a vector of dimension 16×16 = 256

Structural Risk Minimization (SRM)

- The task: "Generalization" - find a mapping x ↦ y that performs well on unseen data
- Assumption: training and test data are drawn from the same probability distribution p(x, y), i.e. (x, y) is "similar" to (x₁, y₁), …, (x_l, y_l)

Structural Risk Minimization (SRM) – Learning Machine

- Definition:
- A learning machine is a family of functions {f(α)}, where α is a set of parameters.
- For the task of learning two classes, f(x, α) ∈ {-1, 1} ∀ x, α

Example: the class of oriented lines in ℝ²: sign(α₁x₁ + α₂x₂ + α₃)

Too much / too little capacity (illustration): a machine with too much capacity memorizes specifics of the training set ("Does it have the same # of leaves?" ⇒ overfitting), while a machine with too little capacity is too coarse ("Is the color green?" ⇒ underfitting).

Structural Risk Minimization (SRM) – Capacity vs. Generalization

- Definition:
- The capacity of a learning machine measures its ability to learn any training set without error.

Structural Risk Minimization (SRM) – Capacity vs. Generalization

- For small sample sizes overfitting or underfitting might occur
- Best generalization = right balance between accuracy and capacity

Structural Risk Minimization (SRM) – Capacity vs. Generalization

- Solution:
Restrict the complexity (capacity) of the function class.

- Intuition:
“Simple” function that explains most of the data is preferable to a “complex” one.

Structural Risk Minimization (SRM) -VC dimension

- What is a "simple"/"complex" function?
- Definition:
- Given l points (they can be labeled in 2^l ways)
- The set of points is shattered by the function class {f(α)} if for each labeling there is a function which correctly assigns those labels.

Structural Risk Minimization (SRM) -VC dimension

- Definition
- The VC dimension of {f(α)} is the maximum number of points that can be shattered by {f(α)}, and is a measure of capacity.

Structural Risk Minimization (SRM) -VC dimension

- Theorem:
The VC dimension of the set of oriented hyperplanes in ℝⁿ is n+1.

- Low # of parameters ⇒ low VC dimension (a brute-force illustration follows)
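As a hedged illustration of the theorem (not from the slides), the NumPy sketch below brute-force checks that three points in general position in ℝ² can be shattered by oriented lines sign(α₁x₁ + α₂x₂ + α₃), while four points on a square cannot. The random search for a separating line is a heuristic of my own, not a proof.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def f(x, alpha):
    # Oriented line in R^2: f(x, alpha) = sign(a1*x1 + a2*x2 + a3)
    return np.sign(x @ alpha[:2] + alpha[2])

def separable(points, labels, n_trials=20000):
    # Heuristic: random search for an oriented line realizing this labeling.
    for _ in range(n_trials):
        alpha = rng.normal(size=3)
        if np.all(f(points, alpha) == labels):
            return True
    return False

def shattered(points):
    # A set is shattered if every one of the 2^l labelings is realizable.
    return all(separable(points, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                 # general position
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])      # a square

print(shattered(three))  # True: 3 points can be shattered (VC dim of lines in R^2 is 3)
print(shattered(four))   # False: e.g. the XOR labeling is not linearly separable
```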

Structural Risk Minimization (SRM) -Bounds

- Definition:
Actual risk: R(α) = ∫ ½ |y − f(x, α)| dp(x, y)

- Goal: minimize R(α)
- But we can't measure the actual risk, since we don't know p(x, y)

Structural Risk Minimization (SRM) -Bounds

- Definition:
Empirical risk: R_emp(α) = (1/2l) Σᵢ |yᵢ − f(xᵢ, α)|

- R_emp(α) → R(α) as l → ∞, but for a small training set deviations might occur

Structural Risk Minimization (SRM) – Bounds

- Risk bound: with probability 1 − η,
R(α) ≤ R_emp(α) + √( ( h (ln(2l/h) + 1) − ln(η/4) ) / l )

- The second term is the confidence term; h is the VC dimension of the function class
- The bound is not valid for infinite VC dimension
- Note: the bound is independent of p(x, y)

Structural Risk Minimization (SRM) – Principal Method

- Principal method for choosing a learning machine for a given task (SRM):
- Divide the class of functions into nested subsets of increasing complexity
- Either calculate h for each subset, or get a bound on it
- Train each subset to achieve minimal empirical error
- Choose the subset with the minimal risk bound (a toy sketch follows)
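A toy sketch of this procedure (my own construction, not from the slides): the nested subsets are linear classifiers over polynomial features of growing degree, the VC dimension h of each subset is approximated by the feature-space dimension plus one, and the subset with the smallest value of the bound above is chosen. scikit-learn and the make_moons data are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

def vc_bound(r_emp, h, l, eta=0.05):
    # Empirical risk plus the VC confidence term, with probability 1 - eta.
    return r_emp + np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
l = len(y)

for degree in range(1, 6):                           # nested subsets S1 ⊂ S2 ⊂ ...
    clf = make_pipeline(PolynomialFeatures(degree),
                        LinearSVC(C=10.0, max_iter=50000)).fit(X, y)
    r_emp = np.mean(clf.predict(X) != y)             # (approximately) minimal empirical error
    n_features = clf[:-1].transform(X[:1]).shape[1]  # dimension of the polynomial feature space
    h = n_features + 1                               # VC dim of hyperplanes there (assumption)
    print(degree, r_emp, vc_bound(r_emp, h, l))      # pick the degree with the smallest bound
```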

- Structural Risk Minimization (SRM)
- Support Vector Machines (SVM)
- Feature Space vs. Input Space
- Kernel PCA
- Kernel Fisher Discriminant Analysis (KFDA)

Support Vector Machines (SVM)

- Currently the "en vogue" approach to classification
- Successful applications in bioinformatics, text, handwriting recognition, and image processing
- Introduced by Boser, Guyon and Vapnik, 1992
- SVMs are a particular instance of kernel machines

Linear SVM – Separable case

- Two given classes are linearly separable

Linear SVM - definitions

- Separating hyperplane H: w·x + b = 0
- w is normal to H
- |b|/||w|| is the perpendicular distance from H to the origin
- d+ (d-) is the shortest distance from H to the closest positive (negative) point.

Linear SVM - definitions

- If H is a separating hyperplane, then H1 and H2 denote the hyperplanes parallel to H that pass through the closest positive and negative points
- No training points fall between H1 and H2

Linear SVM - definitions

- By scaling w and b, we can require that
xᵢ·w + b ≥ +1 for yᵢ = +1
xᵢ·w + b ≤ −1 for yᵢ = −1

Or more simply: yᵢ(xᵢ·w + b) − 1 ≥ 0 ∀i

- Equality holds ⇔ xᵢ lies on H1 (xᵢ·w + b = +1) or H2 (xᵢ·w + b = −1)

Linear SVM - definitions

- Note: w is no longer a unit vector
- Margin is now 2 / ||w||
- Find hyperplane with the largest margin.

Linear SVM – maximizing margin

- Maximizing the margin ⇔ minimizing ‖w‖² (minimize ½‖w‖² subject to yᵢ(xᵢ·w + b) − 1 ≥ 0)
- ⇒ more room for unseen points to fall
- ⇒ restricts the capacity: the VC dimension of such margin hyperplanes can be bounded in terms of R²‖w‖², where R is the radius of the smallest ball around the data

Linear SVM – Constrained Optimization

- Introduce Lagrange multipliers αᵢ ≥ 0, one per constraint

- "Primal" formulation:
L_P = ½‖w‖² − Σᵢ αᵢ yᵢ(xᵢ·w + b) + Σᵢ αᵢ

- Minimize L_P with respect to w and b, requiring αᵢ ≥ 0

Linear SVM – Constrained Optimization

- The objective function is quadratic
- The linear constraints define a convex set
- The intersection of convex sets is a convex set
- ⇒ can formulate the "Wolfe dual" problem

Linear SVM – Constrained Optimization

- Maximize L_P with respect to the αᵢ, requiring that the gradients with respect to w and b vanish:
w = Σᵢ αᵢ yᵢ xᵢ and Σᵢ αᵢ yᵢ = 0

- Substituting into L_P gives the Wolfe dual:
L_D = Σᵢ αᵢ − ½ Σᵢⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ·xⱼ)

- Maximize L_D with respect to the αᵢ

Linear SVM – Constrained Optimization

- Using the Karush-Kuhn-Tucker conditions: αᵢ (yᵢ(xᵢ·w + b) − 1) = 0 ∀i
- If αᵢ > 0 then xᵢ lies either on H1 or on H2 ⇒ the solution is sparse in the αᵢ
- Those training points are called "support vectors". Their removal would change the solution (see the sketch below).
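A minimal scikit-learn sketch (not part of the original presentation) showing the sparsity of the solution: only a handful of training points receive αᵢ > 0, and refitting on the support vectors alone reproduces essentially the same hyperplane. The data set and parameters are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

svm = SVC(kernel="linear", C=10.0).fit(X, y)
print("training points :", len(X))
print("support vectors :", len(svm.support_))   # indices i with alpha_i > 0
print("alpha_i * y_i   :", svm.dual_coef_)      # signed multipliers of the support vectors

# Refit using only the support vectors: the hyperplane is (numerically) unchanged.
svm_sv = SVC(kernel="linear", C=10.0).fit(X[svm.support_], y[svm.support_])
print(svm.coef_, svm.intercept_)                # hyperplane from all 200 points
print(svm_sv.coef_, svm_sv.intercept_)          # essentially the same hyperplane from the SVs alone
```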

SVM – Test Phase

- Given an unseen sample x, we take the class of x to be sign(w·x + b) = sign(Σᵢ αᵢ yᵢ (xᵢ·x) + b)

Linear SVM – Non-separable case

- The separable case corresponds to an empirical risk of zero.
- For noisy data this might not be the minimum of the actual risk (overfitting)
- There is no feasible solution in the non-separable case

Linear SVM – Non-separable case

- Relax the constraints by introducing positive slack variables ξᵢ ≥ 0:
yᵢ(xᵢ·w + b) ≥ 1 − ξᵢ

- Σᵢ ξᵢ is an upper bound on the number of training errors

Linear SVM – Non-separable case

- Assign an extra cost to errors
- Minimize ½‖w‖² + C Σᵢ ξᵢ

where C is a penalty parameter chosen by the user

Linear SVM – Non-separable case

- Lagrange formulation again, with an additional Lagrange multiplier μᵢ for each constraint ξᵢ ≥ 0

- "Wolfe dual" problem - maximize the same L_D = Σᵢ αᵢ − ½ Σᵢⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ·xⱼ), subject to: 0 ≤ αᵢ ≤ C and Σᵢ αᵢ yᵢ = 0
- The solution: w = Σᵢ αᵢ yᵢ xᵢ

Linear SVM – Non-separable case

- Using the Karush-Kuhn-Tucker conditions:
- The solution is again sparse in the αᵢ

Nonlinear SVM

- A nonlinear decision function might be needed

Nonlinear SVM- Feature Space

- Map the data to a high-dimensional (possibly infinite-dimensional) feature space F via Φ: x ↦ Φ(x)
- The solution depends only on dot products Φ(xᵢ)·Φ(xⱼ)
- If there were a function k(xᵢ, xⱼ) such that k(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ) ⇒ no need to know Φ explicitly

Nonlinear SVM – Avoid the Curse

- Curse of dimensionality: the difficulty of an estimation problem increases drastically with the dimension
- But! Learning in F may still be simple if one uses a low-complexity function class (hyperplanes)

Nonlinear SVM-Kernel Functions

- Kernel functions exist!
- They effectively compute dot products in feature space: k(x, y) = Φ(x)·Φ(y)

- Can use a kernel without knowing Φ and F explicitly
- Given a kernel, Φ and F are not unique
- The F with the smallest dimension is called the minimal embedding space

Nonlinear SVM-Kernel Functions

- Mercer's condition: there exists a pair {Φ, F} such that k(x, y) = Φ(x)·Φ(y) iff, for any g(x) such that ∫ g(x)² dx is finite, ∫∫ k(x, y) g(x) g(y) dx dy ≥ 0

Nonlinear SVM-Kernel Functions

- Formulation of the algorithm in terms of kernels: maximize L_D = Σᵢ αᵢ − ½ Σᵢⱼ αᵢ αⱼ yᵢ yⱼ k(xᵢ, xⱼ), and classify with f(x) = sign(Σᵢ αᵢ yᵢ k(xᵢ, x) + b)

Nonlinear SVM-Kernel Functions

- Kernels frequently used: polynomial k(x, y) = (x·y + 1)^p, Gaussian RBF k(x, y) = exp(−‖x − y‖²/2σ²), sigmoid k(x, y) = tanh(κ(x·y) − δ) (a small verification sketch follows)
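A small sketch (my own, not from the slides) verifying the kernel trick: for the homogeneous degree-2 polynomial kernel in ℝ², k(x, y) = (x·y)² equals the dot product of an explicit feature map, and an RBF Gram matrix is assembled without ever forming Φ. The particular feature map ordering is an arbitrary choice.

```python
import numpy as np

def phi(x):
    # One explicit degree-2 feature map for R^2: (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_poly(x, y):
    return (x @ y) ** 2                                # computed entirely in input space

def k_rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(k_poly(x, y), phi(x) @ phi(y)))       # True: same value as the dot product in F

# Kernel (Gram) matrix of a small data set, never forming Phi explicitly
X = rng.normal(size=(5, 2))
K = np.array([[k_rbf(a, b) for b in X] for a in X])
print(K.shape, np.allclose(K, K.T))                    # symmetric, positive semi-definite
```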

Nonlinear SVM-Feature Space

- Example (handwritten digits): d = 256, p = 4 ⇒ dim(F) = C(d+p−1, p) = 183,181,376 (the number of degree-4 monomials in 256 variables)

- A hyperplane {w, b} in F requires dim(F) + 1 parameters
- Solving the SVM means adjusting only l + 1 parameters

SVM - Solution

- L_D is convex ⇒ the solution is global
- Two types of non-uniqueness:
- {w, b} is not unique
- {w, b} is unique, but the set {αᵢ} is not: prefer the set with fewer support vectors (sparser)

Methods of solution

- Quadratic programming packages
- Chunking
- Decomposition methods
- Sequential minimal optimization

Applications - SVM

Extracting Support Data for a Given Task

Bernhard Schölkopf, Chris Burges, Vladimir Vapnik

Proceedings, First International Conference on

Knowledge Discovery & Data Mining. 1995

Applications - SVM

- Input (USPS handwritten digits, 16×16 pixel images)
- Training set: 7300
- Testing set: 2000

- Constructed:
- 10 class/non-class SVM classifiers
- Take the class with maximal output (a rough sketch follows)
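A rough sketch of the same construction (not the original experiment): one class/non-class SVM per digit and classification by maximal output, using scikit-learn's 8×8 digits data as a stand-in for the 16×16 USPS set, so the numbers will not match the paper. Kernel and parameters are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Ten binary "class vs. non-class" machines, one per digit
machines = [SVC(kernel="poly", degree=3, coef0=1, C=10.0).fit(X_tr, y_tr == d)
            for d in range(10)]

# Classify by the machine with the largest decision value
scores = np.column_stack([m.decision_function(X_te) for m in machines])
y_pred = np.argmax(scores, axis=1)
print("test error:", np.mean(y_pred != y_te))
```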

Applications - SVM

- Three different types of classifiers for each digit (polynomial, RBF, and sigmoid kernel SVMs):

Applications - SVM

Predicting the optimal decision function with SRM

- Structural Risk Minimization (SRM)
- Support Vector Machines (SVM)
- Feature Space vs. Input Space
- Kernel PCA
- Kernel Fisher Discriminant Analysis (KFDA)

Feature Space vs. Input Space

- Suppose the solution is a linear combination of mapped patterns in feature space: w = Σᵢ αᵢ Φ(xᵢ)
- Cannot generally say that such a point has a pre-image z with Φ(z) = w

Feature Space vs. Input Space

- If there exists z such that Φ(z) = w, and k is an invertible function f_k of (x·y), i.e. k(x, y) = f_k(x·y), then z can be computed as
z = Σⱼ f_k⁻¹( Σᵢ αᵢ k(xᵢ, eⱼ) ) eⱼ

where {e₁, …, e_N} is an orthonormal basis of the input space.

Feature Space vs. Input Space

- The polynomial kernel k(x, y) = (x·y)^p is invertible when p is odd
- Then the pre-image of w is given by z = Σⱼ ( Σᵢ αᵢ k(xᵢ, eⱼ) )^(1/p) eⱼ

Feature Space vs. Input space

- In general, the pre-image cannot be found exactly
- Look for an approximation z such that ρ(z) = ‖w − Φ(z)‖² is small

- Structural Risk Minimization (SRM)
- Support Vector Machines (SVM)
- Feature Space vs. Input Space
- Kernel PCA
- Kernel Fisher Discriminant Analysis (KFDA)

PCA

- Regular PCA: find the direction u such that projecting the n points in d dimensions onto u gives the largest variance.
- u is an eigenvector of the covariance matrix C: Cu = λu.

Kernel PCA

- Extension to feature space:
- compute the covariance matrix C = (1/l) Σᵢ Φ(xᵢ)Φ(xᵢ)ᵀ (assuming centered data)
- solve the eigenvalue problem CV = λV
- ⇒ this reduces to an eigenproblem on the l×l kernel matrix K, Kα = lλα, with V = Σᵢ αᵢ Φ(xᵢ)

Kernel PCA – Extracting Features

- To extract features of a new pattern x with kernel PCA, project the pattern onto the first n eigenvectors: (Vᵏ·Φ(x)) = Σᵢ αᵢᵏ k(xᵢ, x), k = 1, …, n (a NumPy sketch follows)
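A compact NumPy sketch of kernel PCA (my own, assuming an RBF kernel): center the kernel matrix in feature space, solve the eigenproblem, normalize the expansion coefficients, and project new patterns onto the leading eigenvectors.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_pca_fit(X, n_components, sigma=1.0):
    l = len(X)
    K = rbf(X, X, sigma)
    one = np.full((l, l), 1.0 / l)
    Kc = K - one @ K - K @ one + one @ K @ one            # centering in feature space
    eigval, eigvec = np.linalg.eigh(Kc)                   # ascending eigenvalues
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    # Normalize so that each feature-space eigenvector V^k has unit length
    alphas = eigvec[:, :n_components] / np.sqrt(eigval[:n_components])
    return K, alphas

def kernel_pca_transform(X_train, K, alphas, X_new, sigma=1.0):
    l = len(X_train)
    K_new = rbf(X_new, X_train, sigma)
    one_l = np.full((l, l), 1.0 / l)
    one_n = np.full((len(X_new), l), 1.0 / l)
    K_new_c = K_new - one_n @ K - K_new @ one_l + one_n @ K @ one_l
    return K_new_c @ alphas                                # feature values (V^k . Phi(x))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
K, alphas = kernel_pca_fit(X, n_components=5)
print(kernel_pca_transform(X, K, alphas, X[:3]).shape)     # (3, 5)
```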

Kernel PCA

- Kernel PCA is used for:
- De-noising
- Compression
- Interpretation (Visualization)
- Extract features for classifiers

Kernel PCA - Toy example

- Regular PCA:

Applications – Kernel PCA

Kernel PCA Pattern Reconstruction via Approximate Pre-Images

B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller.

In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, pages 147-152, Berlin, 1998. Springer Verlag.

Applications – Kernel PCA

- Recall V = Σᵢ αᵢ Φ(xᵢ)
- Φ(x) can be reconstructed from its principal components
- Projection operator onto the first n components: PₙΦ(x) = Σ_{k=1..n} (Vᵏ·Φ(x)) Vᵏ

Applications – Kernel PCA

Denoising

- When PₙΦ(x) ≠ Φ(x) ⇒ can't guarantee the existence of a pre-image
- The first n directions → main structure
- The remaining directions → noise

Applications – Kernel PCA

- Find z that minimizes ρ(z) = ‖PₙΦ(x) − Φ(z)‖²
- Use gradient descent starting with x (a sketch follows)
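A hedged sketch of such a gradient descent for an RBF kernel (my own simplification, not the paper's exact procedure): assuming the projection is available as an expansion PₙΦ(x) = Σᵢ γᵢ Φ(xᵢ) with known coefficients γᵢ (centering terms folded in), ρ(z) reduces to 1 − 2 Σᵢ γᵢ k(xᵢ, z) + const, which is minimized over z starting from x. The toy coefficients at the end are purely illustrative.

```python
import numpy as np

def rbf(z, X, sigma):
    return np.exp(-((X - z) ** 2).sum(axis=1) / (2 * sigma ** 2))

def preimage(x, X_train, gamma, sigma=1.0, lr=0.1, n_steps=500):
    z = x.copy()
    for _ in range(n_steps):
        k = rbf(z, X_train, sigma)                     # k(x_i, z) for all training points
        # gradient of  k(z,z) - 2 sum_i gamma_i k(x_i, z);  k(z,z) = 1 is constant for RBF
        grad = (2.0 / sigma ** 2) * ((gamma * k)[:, None] * (z - X_train)).sum(axis=0)
        z = z - lr * grad
    return z

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
gamma = np.full(50, 1.0 / 50)                          # toy expansion coefficients (illustrative)
x_noisy = np.array([2.0, 2.0])
print(preimage(x_noisy, X_train, gamma))               # approximate pre-image ("denoised" point)
```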

Applications – Kernel PCA

- Input toy data: 3 point sources (100 points each) with Gaussian noise, σ = 0.1
- Using an RBF kernel

Applications – Kernel PCA

Compare to the linear PCA

Applications – Kernel PCA

Real world data

- Structural Risk Minimization (SRM)
- Support Vector Machines (SVM)
- Feature Space vs. Input Space
- Kernel PCA
- Kernel Fisher Discriminant Analysis (KFDA)

Fisher Linear Discriminant

Finds a direction w, projected on which the classes are “best” separated

Fisher Linear Discriminant

- For "best" separation, maximize J(w) = (μ₁ − μ₂)² / (s₁² + s₂²), where μᵢ is the projected mean of class i, and sᵢ is the projected standard deviation.

Fisher Linear Discriminant

- Equivalent to finding the w which maximizes J(w) = (wᵀ S_B w)/(wᵀ S_W w), where S_B = (m₁ − m₂)(m₁ − m₂)ᵀ is the between-class scatter, S_W = Σ_{i=1,2} Σ_{x∈Cᵢ} (x − mᵢ)(x − mᵢ)ᵀ is the within-class scatter, and mᵢ is the mean of class i

Fisher Linear Discriminant

- The solution is given by w ∝ S_W⁻¹(m₁ − m₂)

Kernel Fisher Discriminant

- Kernel formulation: maximize J(w) = (wᵀ S_B^Φ w)/(wᵀ S_W^Φ w) in the feature space F, where S_B^Φ and S_W^Φ are the between- and within-class scatter matrices of the mapped data Φ(xᵢ)

Kernel Fisher Discriminant

- From the theory of reproducing kernels, the solution can be expanded as w = Σᵢ αᵢ Φ(xᵢ)
- Substituting this into J(w) reduces the problem to maximizing J(α) = (αᵀMα)/(αᵀNα), where M = (M₁ − M₂)(M₁ − M₂)ᵀ with (Mᵢ)ⱼ = (1/lᵢ) Σ_{k∈Cᵢ} k(xⱼ, x_k), and N = Σ_{i=1,2} Kᵢ(I − 1_{lᵢ})Kᵢᵀ

Kernel Fisher Discriminant

- The solution is obtained by solving the generalized eigenproblem Mα = λNα (equivalently α ∝ N⁻¹(M₁ − M₂), since M has rank one)
- The projection of a new pattern x is then w·Φ(x) = Σᵢ αᵢ k(xᵢ, x)
- Find a suitable threshold on this projection to classify

Note: the matrices M and N are l×l rather than d×d (a NumPy sketch follows)
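A compact NumPy sketch of KFD (my own, following the cited Mika et al. formulation): build the l×l matrices from the kernel matrix, exploit the rank-one structure of M so that α ∝ N⁻¹(M₁ − M₂), add a small ridge μ to regularize N (a common practical choice), and project new patterns.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kfd_fit(X, y, sigma=1.0, mu=1e-3):
    K = rbf(X, X, sigma)                                   # l x l kernel matrix
    idx1, idx2 = np.where(y == 1)[0], np.where(y == -1)[0]
    M1, M2 = K[:, idx1].mean(axis=1), K[:, idx2].mean(axis=1)
    N = np.zeros_like(K)
    for idx in (idx1, idx2):
        Ki, li = K[:, idx], len(idx)                       # l x l_i block of the kernel matrix
        N += Ki @ (np.eye(li) - np.full((li, li), 1.0 / li)) @ Ki.T
    alpha = np.linalg.solve(N + mu * np.eye(len(K)), M1 - M2)   # alpha ~ N^{-1}(M1 - M2)
    return alpha

def kfd_project(X_train, alpha, X_new, sigma=1.0):
    return rbf(X_new, X_train, sigma) @ alpha              # w . Phi(x) = sum_j alpha_j k(x_j, x)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1, size=(50, 2)), rng.normal(loc=+1, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)
alpha = kfd_fit(X, y)
proj = kfd_project(X, alpha, X)
print(proj[y == 1].mean() > proj[y == -1].mean())          # classes separated along the projection
```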

Kernel Fisher Discriminant – Constrained Optimization

- The problem can also be formulated as a constrained optimization problem

Applications – Fisher Discriminant Analysis

Fisher Discriminant Analysis with Kernels

S. Mika et al.

In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999.

Applications – Fisher Discriminant Analysis

- Input (USPS handwritten digits):
- Training set: 3000

- Constructed:
- 10 class/non-class KFD classifiers
- Take the class with maximal output

Applications – Fisher Discriminant Analysis

- Results: 3.7% error on the ten-class task, using an RBF kernel with width parameter 0.3·256
- Compare to 4.2% using SVM
- KFDA vs. SVM

- Structural Risk Minimization (SRM)
- Support Vector Machines (SVM)
- Feature Space vs. Input Space
- Kernel PCA
- Kernel Fisher Discriminant Analysis (KFDA)
