Kernel based methods

Kernel – Based Methods

Presented by

Jason Friedman

Lena Gorelick

Advanced Topics in Computer and Human Vision

Spring 2003



Agenda…

  • Structural Risk Minimization (SRM)

  • Support Vector Machines (SVM)

  • Feature Space vs. Input Space

  • Kernel PCA

  • Kernel Fisher Discriminant Analysis (KFDA)



Agenda…

  • Structural Risk Minimization (SRM)

  • Support Vector Machines (SVM)

  • Feature Space vs. Input Space

  • Kernel PCA

  • Kernel Fisher Discriminant Analysis (KFDA)


Structural Risk Minimization (SRM)

  • Definition:

    • Training set with l observations: each observation consists of a pair (x_i, y_i), where x_i ∈ R^n is an input vector and y_i ∈ {−1, +1} is its label.

      (Example: a 16×16 pixel digit image gives x_i ∈ R^256, since 16×16 = 256.)


Structural Risk Minimization (SRM)

  • The task ("generalization"): find a mapping x ↦ y that predicts well on unseen data.

  • Assumption: training and test data are drawn from the same probability distribution, i.e. a new pair (x, y) is "similar" to (x1, y1), …, (xl, yl).


Structural Risk Minimization (SRM) – Learning Machine

  • Definition:

    • A learning machine is a family of functions {f(α)}, where α is a set of parameters.

    • For the task of learning two classes, f(x, α) ∈ {−1, 1} for all x and α.

Class of oriented lines in R^2: f(x, α) = sign(α1·x1 + α2·x2 + α3) (sketched below)
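As a concrete illustration (a minimal sketch, not from the original slides), the oriented-line family can be written as a tiny Python function whose parameter vector alpha = (a1, a2, a3) indexes the family:

```python
def oriented_line(x, alpha):
    """f(x, alpha) for the class of oriented lines in R^2: sign(a1*x1 + a2*x2 + a3)."""
    a1, a2, a3 = alpha
    return 1 if a1 * x[0] + a2 * x[1] + a3 >= 0 else -1

# Example: the line x1 + x2 - 1 = 0 labels (1, 1) as +1 and (0, 0) as -1.
print(oriented_line((1, 1), (1, 1, -1)), oriented_line((0, 0), (1, 1, -1)))
```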



Structural Risk Minimization (SRM) – Capacity vs. Generalization

  • Definition:

    • Capacity of a learning machine measures its ability to learn any training set without error.

[Figure: a toy classification example. An overly specific rule ("Does it have the same # of leaves?") has too much capacity and overfits; an overly coarse rule ("Is the color green?") has too little capacity and underfits.]


Structural Risk Minimization (SRM) – Capacity vs. Generalization

  • For small sample sizes overfitting or underfitting might occur

  • Best generalization = right balance between accuracy and capacity


Structural Risk Minimization (SRM) – Capacity vs. Generalization

  • Solution:

    Restrict the complexity (capacity) of the function class.

  • Intuition:

    “Simple” function that explains most of the data is preferable to a “complex” one.


Structural Risk Minimization (SRM) -VC dimension

  • What is a “simple”/”complex” function?

  • Definition:

    • Given l points (they can be labeled in 2^l ways),

    • the set of points is shattered by the function class {f(α)} if, for each of the 2^l labelings, there is a member of {f(α)} that correctly assigns those labels.


Structural Risk Minimization (SRM) -VC dimension

  • Definition

    • The VC dimension of {f(α)} is the maximum number of points that can be shattered by {f(α)}; it is a measure of capacity.


Structural Risk Minimization (SRM) -VC dimension

  • Theorem:

    The VC dimension of the set of oriented hyperplanes in R^n is n + 1.

  • Low number of parameters ⇒ low VC dimension (as an intuition; this does not hold in general).


Structural Risk Minimization (SRM) -Bounds

  • Definition: the actual risk is the expected test error,

      R(α) = ∫ ½ |y − f(x, α)| dp(x, y)

  • Goal: minimize R(α)

  • But we cannot measure the actual risk, since we do not know p(x, y)


Structural Risk Minimization (SRM) -Bounds

  • Definition: the empirical risk is the mean error on the training set,

      R_emp(α) = (1/2l) Σ_{i=1..l} |y_i − f(x_i, α)|

  • R_emp(α) → R(α) as l → ∞, but for a small training set large deviations may occur


Structural Risk Minimization (SRM) -Bounds

  • Risk bound: with probability (1 − η), the actual risk R(α) is bounded by the empirical risk plus a confidence term, where h is the VC dimension of the function class (the bound is written out below).

  • The bound is not valid for infinite VC dimension.

  • Note: the bound is independent of p(x, y).
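The bound itself is not reproduced in the transcript; the standard form of this VC bound (Vapnik) is

$$ R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) \;+\; \sqrt{\frac{h\,\big(\log(2l/h) + 1\big) - \log(\eta/4)}{l}} $$

where the square-root term is the confidence term and h is the VC dimension.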




Structural Risk Minimization (SRM) – Principal Method

  • A principled method for choosing a learning machine for a given task:



SRM

[Figure: the risk bound plotted against complexity; SRM selects the subset where the bound is minimal.]

  • Divide the class of functions into nested subsets

  • Either calculate h for each subset, or get a bound on it

  • Train each subset to achieve minimal empirical error

  • Choose the subset with the minimal risk bound



Agenda…

  • Structural Risk Minimization (SRM)

  • Support Vector Machines (SVM)

  • Feature Space vs. Input Space

  • Kernel PCA

  • Kernel Fisher Discriminant Analysis (KFDA)


Support Vector Machines (SVM)

  • Currently the “en vogue” approach to classification

  • Successful applications in bioinformatics, text, handwriting recognition, image processing

  • Introduced by Boser, Guyon and Vapnik, 1992

  • SVMs are a particular instance of kernel machines


Linear SVM – Separable case

  • Two given classes are linearly separable


Linear SVM - definitions

  • Separating hyperplane H: w·x + b = 0

  • w is normal to H

  • |b| / ||w|| is the perpendicular distance from H to the origin

  • d+ (d−) is the shortest distance from H to the closest positive (negative) point.



Linear SVM - definitions

  • If H is a separating hyperplane, then we can place two parallel hyperplanes H1 and H2 through the closest positive and negative points, respectively.

  • No training points fall between H1 and H2.


Linear SVM - definitions

  • By scaling w and b, we can require that x_i·w + b ≥ +1 for y_i = +1 and x_i·w + b ≤ −1 for y_i = −1,

    or more simply: y_i (x_i·w + b) − 1 ≥ 0 for all i

  • Equality holds ⇔ x_i lies on H1 or H2


Linear SVM - definitions

  • Note: w is no longer a unit vector

  • Margin is now 2 / ||w||

  • Find hyperplane with the largest margin.


Linear SVM – maximizing margin

  • Maximizing the margin ⇔ minimizing ||w||²

  • ⇒ more room for unseen points to fall on the correct side

  • ⇒ restricts the capacity: the VC dimension of large-margin hyperplanes can be bounded in terms of R²·||w||², where R is the radius of the smallest ball around the data


Linear SVM – Constrained Optimization

  • Introduce Lagrange multipliers α_i ≥ 0, one per constraint

  • "Primal" formulation: the Lagrangian L_P (written out below)

  • Minimize L_P with respect to w and b; require the derivatives of L_P with respect to w and b to vanish
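The Lagrangian referred to above is not reproduced in the transcript; its standard form is

$$ L_P \;=\; \tfrac{1}{2}\|w\|^2 \;-\; \sum_{i=1}^{l} \alpha_i\, y_i\,(x_i \cdot w + b) \;+\; \sum_{i=1}^{l} \alpha_i $$

Setting the derivatives with respect to w and b to zero gives $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$.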


Linear SVM – Constrained Optimization

  • Objective function is quadratic

  • Each linear constraint defines a convex set

  • The intersection of convex sets is a convex set

  • ⇒ can formulate the equivalent "Wolfe dual" problem


Linear SVM – Constrained Optimization

  • The solution: maximize L_P with respect to the α_i, subject to the requirement that the derivatives of L_P with respect to w and b vanish

  • Substituting those conditions into L_P gives the dual L_D (below)

  • Maximize L_D with respect to the α_i
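The dual itself is missing from the transcript; the standard Wolfe dual obtained by this substitution is

$$ L_D \;=\; \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\; x_i \cdot x_j $$

maximized subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.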


Linear SVM – Constrained Optimization

  • Using the Karush-Kuhn-Tucker (KKT) conditions:

  • If α_i > 0 then x_i lies on H1 or H2 ⇒ the solution is sparse in α_i

  • Those training points are called "support vectors"; their removal would change the solution


SVM – Test Phase

  • Given an unseen sample x, we take the class of x to be the sign of the decision function below.
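The decision rule (standard SVM form; the slide's own formula is not in the transcript) is

$$ f(x) \;=\; \operatorname{sign}\Big( \sum_{i \in SV} \alpha_i y_i\; x_i \cdot x \;+\; b \Big) $$

with $x_i \cdot x$ replaced by $k(x_i, x)$ in the nonlinear case discussed later.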


Linear SVM – Non-separable case

  • The separable case corresponds to an empirical risk of zero.

  • For noisy data this might not minimize the actual risk (overfitting).

  • The separable formulation has no feasible solution in the non-separable case.


Linear SVM – Non-separable case

  • Relax the constraints by introducing positive slack variables ξ_i:

      x_i·w + b ≥ +1 − ξ_i for y_i = +1,  x_i·w + b ≤ −1 + ξ_i for y_i = −1,  ξ_i ≥ 0

  • Σ_i ξ_i is an upper bound on the number of training errors


Linear SVM – Non-separable case

  • Assign an extra cost to errors

  • Minimize ½||w||² + C Σ_i ξ_i

    where C is a penalty parameter chosen by the user


Linear SVM – Non-separable case

  • Lagrange formulation again, now with additional Lagrange multipliers μ_i for the constraints ξ_i ≥ 0

  • "Wolfe dual" problem: maximize the same L_D as before, now subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0

  • The solution is again w = Σ_i α_i y_i x_i


Linear SVM – Non-separable case

  • Using the Karush-Kuhn-Tucker conditions:

  • The solution is again sparse in α_i


Nonlinear SVM

  • A nonlinear decision function might be needed


Nonlinear SVM- Feature Space

  • Map the data to a high-dimensional (possibly infinite-dimensional) feature space F via a map Φ: R^n → F

  • The solution depends on the data only through dot products Φ(x_i)·Φ(x_j)

  • If there were a function k(x_i, x_j) = Φ(x_i)·Φ(x_j) ⇒ no need to know Φ explicitly


Nonlinear SVM – Toy example

[Figure: a toy data set shown in input space and in feature space, where it becomes linearly separable.]


Nonlinear SVM – Avoid the Curse

  • Curse of dimensionality: the difficulty of an estimation problem increases drastically with the dimension

  • But: learning in F may still be simple if one uses a low-complexity function class (hyperplanes)


Nonlinear SVM-Kernel Functions

  • Kernel functions exist!

    • effectively compute dot products in feature space

  • They can be used without knowing Φ and F explicitly

  • Given a kernel, Φ and F are not unique

  • The F with the smallest dimension is called the minimal embedding space


Nonlinear SVM-Kernel Functions

  • Mercer's condition: there exists a pair {Φ, F} with k(x, y) = Φ(x)·Φ(y) iff, for any g(x) such that ∫ g(x)² dx is finite, ∫∫ k(x, y) g(x) g(y) dx dy ≥ 0.


Nonlinear SVM-Kernel Functions

  • Formulation of the algorithm in terms of kernels: every dot product x_i·x_j is replaced by k(x_i, x_j), e.g. L_D = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j)


Nonlinear SVM-Kernel Functions

  • Frequently used kernels (the standard choices are polynomial, Gaussian RBF and sigmoid) are sketched below:
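A minimal Python sketch of these kernels (parameter names and defaults are illustrative, not taken from the slides):

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, p=3, c=1.0):
    # k(x, y) = (x . y + c)^p
    return (np.dot(x, y) + c) ** p

def rbf_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    # k(x, y) = tanh(kappa * x . y - delta); satisfies Mercer's condition only for some parameters
    return np.tanh(kappa * np.dot(x, y) - delta)
```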


Nonlinear SVM-Feature Space

  For d = 256, p = 4 ⇒ dim(F) = 183,181,376

  • A hyperplane {w, b} in F requires dim(F) + 1 parameters

  • Solving the SVM means adjusting only l + 1 parameters (see the check below)
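As a quick check of the number above: the space of degree-p monomials over d inputs has dimension C(d + p − 1, p), which a one-liner reproduces (a sketch, not part of the original slides):

```python
from math import comb

d, p = 256, 4                # 16x16 pixel images, degree-4 monomials
print(comb(d + p - 1, p))    # 183181376, the dim(F) quoted on the slide
```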


SVM - Solution

  • L_D is convex ⇒ the solution is global

  • Two types of non-uniqueness:

    • {w, b} is not unique

    • {w, b} is unique, but the set {α_i} is not; prefer the expansion with fewer support vectors (sparser)





Methods of solution

  • Quadratic programming packages

  • Chunking

  • Decomposition methods

  • Sequential minimal optimization


Applications - SVM

Extracting Support Data for a Given Task

Bernhard Schölkopf Chris Burges Vladimir Vapnik

Proceedings, First International Conference on

Knowledge Discovery & Data Mining. 1995


Applications - SVM

  • Input (USPS handwritten digits, 16×16 pixel images):

    • Training set: 7300

    • Test set: 2000

  • Constructed:

    • 10 class/non-class SVM classifiers

    • Take the class with maximal output


Applications - SVM

  • Three different types of classifiers for each digit:



Applications - SVM

Predicting the Optimal Decision Functions - SRM




Agenda…

  • Structural Risk Minimization (SRM)

  • Support Vector Machines (SVM)

  • Feature Space vs. Input Space

  • Kernel PCA

  • Kernel Fisher Discriminant Analysis (KFDA)


Feature Space vs. Input Space

  • Suppose the solution is a linear combination of mapped points in feature space, w = Σ_i α_i Φ(x_i)

  • One cannot generally say that such a point has a pre-image, i.e. a z with Φ(z) = w


Feature Space vs. Input Space

  • If there exists z such that Φ(z) = w, and k is an invertible function f_k of the dot product, k(x, y) = f_k(x·y),

    then z can be computed componentwise as z_j = f_k^{-1}( Σ_i α_i k(x_i, e_j) ),

    where {e_1, …, e_N} is an orthonormal basis of the input space.


Feature Space vs. Input Space

  • The polynomial kernel k(x, y) = (x·y)^p is invertible in this sense when the degree p is odd

  • Then the pre-image of w is given by z = Σ_j ( Σ_i α_i k(x_i, e_j) )^{1/p} e_j


Feature Space vs. Input space

  • In general, the pre-image cannot be found exactly

  • Instead, look for an approximate pre-image z such that ||Φ(z) − w||² is small



Agenda…

  • Structural Risk Minimization (SRM)

  • Support Vector Machines (SVM)

  • Feature Space vs. Input Space

  • Kernel PCA

  • Kernel Fisher Discriminant Analysis (KFDA)


PCA

  • Regular PCA: find the direction u such that projecting the n points in d dimensions onto u gives the largest variance.

  • u is an eigenvector of the covariance matrix C: Cu = λu.


Kernel PCA

  • Extension to feature space:

    • compute the covariance matrix of the mapped (and centered) data, C = (1/l) Σ_i Φ(x_i) Φ(x_i)^T

    • solve the eigenvalue problem CV = λV

      ⇒ each eigenvector V with λ ≠ 0 lies in the span of Φ(x_1), …, Φ(x_l)


Kernel PCA

  • Define V in terms of dot products: V = Σ_i α_i Φ(x_i)

  • Then the problem becomes l λ α = K α,

    where K_ij = Φ(x_i)·Φ(x_j) = k(x_i, x_j)

    (an l × l problem rather than d × d)


Kernel PCA – Extracting Features

  • To extract features of a new pattern x with kernel PCA, project Φ(x) onto the first n eigenvectors: the k-th feature is (V^k · Φ(x)) = Σ_i α_i^k k(x_i, x). A sketch follows below.
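A minimal NumPy sketch of the whole procedure (assuming the kernel matrix K over the training set is already computed; centering of the test-point kernel values is omitted for brevity):

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA on an l x l kernel matrix K of the training data.
    Returns expansion coefficients alpha, one column per principal component."""
    l = K.shape[0]
    one = np.full((l, l), 1.0 / l)
    Kc = K - one @ K - K @ one + one @ K @ one      # center the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)           # solves  l*lambda*alpha = Kc*alpha
    order = np.argsort(eigvals)[::-1][:n_components]
    lams, alphas = eigvals[order], eigvecs[:, order]
    # scale alphas so the corresponding feature-space eigenvectors V have unit norm
    return alphas / np.sqrt(np.maximum(lams, 1e-12))

def kpca_features(k_x, alphas):
    """Project a new pattern x: k_x[i] = k(x, x_i) over the l training points."""
    return k_x @ alphas
```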


Kernel PCA

  • Kernel PCA is used for:

    • De-noising

    • Compression

    • Interpretation (Visualization)

    • Extract features for classifiers




Applications – Kernel PCA

Kernel PCA Pattern Reconstruction via Approximate Pre-Images

B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller.

In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, pages 147-152, Berlin, 1998. Springer Verlag.


Applications – Kernel PCA

  • Recall that each eigenvector is an expansion V^k = Σ_i α_i^k Φ(x_i)

  • Φ(x) can be reconstructed from its principal components

  • Projection operator onto the first n components: P_n Φ(x) = Σ_{k=1..n} (V^k · Φ(x)) V^k


Applications – Kernel PCA

Denoising

  • In general one cannot guarantee that P_n Φ(x) has an exact pre-image

  • The first n directions capture the main structure

  • The remaining directions mostly capture noise


Applications – Kernel PCA

  • Find z that minimizes ρ(z) = ||Φ(z) − P_n Φ(x)||²

  • Use gradient descent on ρ, starting at x (a sketch follows below)
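A rough sketch of that gradient descent for an RBF kernel; gamma holds the expansion coefficients of the projected point P_n Φ(x) over the training set, and the step size and iteration count are illustrative choices:

```python
import numpy as np

def rbf(X, z, sigma):
    """k(z, x_i) for every training point x_i (rows of X)."""
    return np.exp(-np.sum((X - z) ** 2, axis=1) / (2 * sigma ** 2))

def preimage_gd(x, X, gamma, sigma, eta=0.1, n_iter=200):
    """Approximate pre-image of P_n Phi(x) = sum_i gamma[i] * Phi(x_i) by
    gradient descent on rho(z) = ||Phi(z) - P_n Phi(x)||^2, started at the noisy x."""
    z = x.astype(float)
    for _ in range(n_iter):
        w = gamma * rbf(X, z, sigma)                          # gamma_i * k(z, x_i)
        grad = (2.0 / sigma ** 2) * ((z - X) * w[:, None]).sum(axis=0)
        z = z - eta * grad
    return z
```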


Applications – Kernel PCA

  • Input toy data: 3 point sources (100 points each) with Gaussian noise, σ = 0.1

  • Using an RBF kernel





Applications – Kernel PCA

Compare to the linear PCA




Agenda…

  • Structural Risk Minimization (SRM)

  • Support Vector Machines (SVM)

  • Feature Space vs. Input Space

  • Kernel PCA

  • Kernel Fisher Discriminant Analysis (KFDA)


Fisher Linear Discriminant

Find a direction w such that, when the data are projected onto it, the classes are "best" separated.


Fisher Linear Discriminant

  • For "best" separation, maximize J(w) = (μ̃_1 − μ̃_2)² / (s̃_1² + s̃_2²), where μ̃_i is the projected mean of class i and s̃_i the projected standard deviation (within-class scatter).


Fisher Linear Discriminant

  • Equivalent to finding the w that maximizes the Rayleigh quotient J(w) = (w^T S_B w) / (w^T S_W w), where S_B = (m_1 − m_2)(m_1 − m_2)^T is the between-class scatter matrix and S_W = Σ_{i=1,2} Σ_{x ∈ class i} (x − m_i)(x − m_i)^T is the within-class scatter matrix.


Fisher Linear Discriminant

  • The solution is given by w ∝ S_W^{-1} (m_1 − m_2).


Kernel Fisher Discriminant

  • Kernel formulation: maximize the same criterion J(w) = (w^T S_B^Φ w) / (w^T S_W^Φ w) in feature space, where S_B^Φ and S_W^Φ are the between- and within-class scatter matrices of the mapped data Φ(x_i).


Kernel Fisher Discriminant

  • From the theory of reproducing kernels, the solution can be expanded as w = Σ_i α_i Φ(x_i)

  • Substituting this into J(w) reduces the problem to maximizing J(α) = (α^T M α) / (α^T N α) over the expansion coefficients α



Kernel Fisher Discriminant

  • The solution is obtained by solving the generalized eigenproblem M α = λ N α (equivalently, α ∝ N^{-1}(M_1 − M_2))

  • The projection of a new pattern x is then w·Φ(x) = Σ_i α_i k(x_i, x)

  • Find a suitable threshold on this projection to classify

  • The problem is l × l rather than d × d (see the sketch below)
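A NumPy sketch of this training step, following the M, N formulation above; the small ridge term mu·I added to N is an assumption for numerical stability (Mika et al. likewise regularize N):

```python
import numpy as np

def kfd_fit(K, y, mu=1e-3):
    """Kernel Fisher Discriminant from an l x l kernel matrix K and labels y in {-1, +1}.
    Returns the expansion coefficients alpha (one per training point)."""
    l = K.shape[0]
    idx1, idx2 = np.where(y == 1)[0], np.where(y == -1)[0]
    # (M_i)_j = (1/l_i) * sum_{k in class i} K[j, k]  -- class means in the alpha representation
    M1, M2 = K[:, idx1].mean(axis=1), K[:, idx2].mean(axis=1)
    # within-class matrix N = sum_i K_i (I - 1_{l_i}) K_i^T, regularized by mu*I
    N = mu * np.eye(l)
    for idx in (idx1, idx2):
        Ki, li = K[:, idx], len(idx)
        N += Ki @ (np.eye(li) - np.full((li, li), 1.0 / li)) @ Ki.T
    # maximizing (alpha^T M alpha) / (alpha^T N alpha) gives alpha proportional to N^{-1}(M1 - M2)
    return np.linalg.solve(N, M1 - M2)

def kfd_project(K_new, alpha):
    """Project new patterns: K_new[n, i] = k(x_new_n, x_i) over the training points."""
    return K_new @ alpha
```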


Kernel Fisher Discriminant – Constrained Optimization

  • The problem can also be formulated as a constrained optimization problem.


Kernel Fisher Discriminant –Toy Example

[Figure: toy example comparing the direction found by KFDA with the first and second KPCA eigenvectors.]


Applications – Fisher Discriminant Analysis

Fisher Discriminant Analysis with Kernels

S. Mika et al.

In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999.


Applications – Fisher Discriminant Analysis

  • Input (USPS handwritten digits):

    • Training set: 3000

  • Constructed:

    • 10 class/non-class KFD classifiers

    • Take the class with maximal output


Applications – Fisher Discriminant Analysis

  • Results: 3.7% error on a ten-class classifier, using an RBF kernel with width parameter 0.3 · 256

  • Compare to 4.2% using SVM

  • KFDA vs. SVM



Summary…

  • Structural Risk Minimization (SRM)

  • Support Vector Machines (SVM)

  • Feature Space vs. Input Space

  • Kernel PCA

  • Kernel Fisher Discriminant Analysis (KFDA)

