A Kernel Approach for Learning From Almost Orthogonal Pattern*

CIS 525 Class Presentation

Professor: Slobodan Vucetic

Presenter: Yilian Qin

* B. Scholkopf et al., Proc. 13th ECML, Aug 19-23, 2002, pp. 511-528.

Presentation Outline
  • Introduction
    • Motivation
    • A brief review of SVM for linearly separable patterns
    • Kernel approach for SVM
    • Empirical kernel map
  • Problem: almost orthogonal patterns in feature space
    • An example
    • Situations leading to almost orthogonal patterns
  • Method to reduce large diagonals of Gram matrix
    • Gram matrix transformation
    • An approximate approach based on statistics
  • Experiments
    • Artificial data (String classification, Microarray data with noise, Hidden variable problem)
    • Real data (Thrombin binding, Lymphoma classification, Protein family classification)
  • Conclusions
  • Comments

Introduction
  • Support vector machine (SVM)
    • A powerful method for classification (or regression), with high accuracy comparable to neural networks
    • Exploits kernel functions to separate patterns in a high-dimensional space
    • The information from the training data is stored in the Gram matrix (kernel matrix)
  • The problem:
    • SVM does not perform well if the Gram matrix has large diagonal values

A Brief Review of SVM
  • For linearly separable patterns, find a separating hyperplane w·x + b = 0; its position depends only on the closest points (the support vectors)
  • To maximize the margin 2/||w||: minimize ||w||²/2 subject to yi(w·xi + b) ≥ 1 for all training points

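The margin computation above can be checked numerically. This is a minimal sketch with a hand-picked toy dataset and hyperplane (the values of `X`, `y`, `w`, `b` are illustrative, not the result of any optimizer):

```python
import numpy as np

# Toy linearly separable set and a candidate hyperplane w.x + b = 0
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])
b = 0.0

# Canonical-form constraints: y_i (w.x_i + b) >= 1 for every training point
margins = y * (X @ w + b)
print(margins)                    # all entries >= 1

# Geometric margin of the hyperplane: 2 / ||w||
print(2.0 / np.linalg.norm(w))
```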

Kernel Approach for SVM (1/3)
  • For linearly non-separable patterns
    • Nonlinear mapping function Φ(x): X → H, mapping the patterns into a new feature space H of higher dimension
    • For example: the XOR problem becomes linearly separable after such a mapping
    • The SVM is then trained in the new feature space H
  • The kernel trick:
    • Solving the above minimization problem directly requires: 1) the explicit form of Φ; 2) inner products in the high-dimensional space H
    • Both are avoided by a wise selection of kernel function with the property k(xi, xj) = Φ(xi)·Φ(xj)
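The kernel trick can be illustrated with the classic degree-2 polynomial kernel in 2D, where the explicit feature map is known. This is a minimal sketch (the function names `phi` and `poly_kernel` are mine, not from the paper):

```python
import numpy as np

# Explicit feature map for the degree-2 polynomial kernel in 2D:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that phi(x).phi(y) = (x.y)^2
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, y, d=2):
    return np.dot(x, y) ** d

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

lhs = poly_kernel(x, y)          # kernel evaluated in input space
rhs = np.dot(phi(x), phi(y))     # inner product in feature space H
print(lhs, rhs)                  # 16.0 16.0 : the two values agree
```

The kernel evaluates the inner product in H without ever constructing Φ(x), which is what makes very high-dimensional (or infinite-dimensional) feature spaces tractable.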

Kernel Approach for SVM (2/3)
  • Transform the problem with the kernel method
    • Expand w in the new feature space: w = Σi ai Φ(xi) = [Φ(x)]a, where [Φ(x)] = [Φ(x1), Φ(x2), …, Φ(xm)] and a = [a1, a2, …, am]T
    • Gram matrix: K = [Kij], where Kij = Φ(xi)·Φ(xj) = k(xi, xj) (symmetric!)
    • The (squared) objective function: ||w||² = aT[Φ(x)]T[Φ(x)]a = aTKa (sufficient condition for existence of an optimal solution: K is positive definite)
    • The constraints: yi{wTΦ(xi) + b} = yi{aT[Φ(x)]TΦ(xi) + b} = yi{aTKi + b} ≥ 1, where Ki is the ith column of K

Kernel Approach for SVM (3/3)
  • To predict new data x with a trained SVM: f(x) = sign(Σi ai k(xi, x) + b), where a and b are the optimal solution based on the training data, and m is the number of training points
  • The explicit form of k(xi, xj) is required for prediction on new data
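The prediction step can be sketched as follows. The expansion coefficients below are hand-picked for illustration (an actual SVM optimizer would produce them), and the names `predict` and `linear_kernel` are mine:

```python
import numpy as np

def linear_kernel(u, v):
    return float(np.dot(u, v))

def predict(x_new, X_train, a, b, kernel):
    # f(x) = sign(sum_i a_i * k(x_i, x) + b), using w = sum_i a_i phi(x_i)
    score = sum(a_i * kernel(x_i, x_new) for a_i, x_i in zip(a, X_train)) + b
    return 1 if score >= 0 else -1

# Toy 1-D example with hand-picked (not optimized) coefficients:
X_train = [np.array([2.0]), np.array([-2.0])]
a = [0.5, -0.5]    # expansion coefficients (would come from the SVM training)
b = 0.0
print(predict(np.array([3.0]), X_train, a, b, linear_kernel))   # 1
print(predict(np.array([-1.0]), X_train, a, b, linear_kernel))  # -1
```

Note that only kernel evaluations k(xi, x) against the training points are needed at prediction time, which is why an explicit form of k is required.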

Empirical Kernel Mapping
  • Assumption: m (the number of instances) is a sufficiently high dimension for the new feature space, i.e. the patterns will be linearly separable in the m-dimensional space Rm
  • Empirical kernel map: Φm(xi) = [k(xi, x1), k(xi, x2), …, k(xi, xm)]T = Ki
  • The SVM is then trained in Rm
  • The new Gram matrix Km associated with Φm(x): Km = [Kmij], where Kmij = Φm(xi)·Φm(xj) = Ki·Kj = KiTKj, i.e. Km = KTK = KKT
  • Advantage of the empirical kernel map: Km is positive (semi-)definite
    • Km = KKT = (UTDU)(UTDU)T = UTD²U (K is symmetric, U is a unitary matrix, D is diagonal)
    • This satisfies the sufficient condition of the above minimization problem
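The construction above is easy to verify numerically: building Km from the columns of K gives exactly K², and its eigenvalues are non-negative. A minimal sketch with a random linear-kernel Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))          # 5 patterns in R^3

# Gram matrix with a linear kernel: K_ij = x_i . x_j (symmetric)
K = X @ X.T

# Empirical kernel map: phi_m(x_i) = K_i (the i-th column of K),
# so Km_ij = K_i . K_j, i.e. Km = K^T K = K K^T = K^2 for symmetric K
Phi = K                                   # row i holds phi_m(x_i)
Km = Phi @ Phi.T

print(np.allclose(Km, K @ K))            # True

# Km is positive semi-definite: eigenvalues >= 0 (up to roundoff)
eigvals = np.linalg.eigvalsh(Km)
print(eigvals.min() >= -1e-9)            # True
```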
The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance

An Example of Almost Orthogonal Patterns
  • The Gram matrix with the linear kernel k(xi, xj) = xi·xj
  • A training dataset with almost orthogonal patterns produces large diagonals in the Gram matrix
  • w is the solution found by the standard SVM
  • Observation: each large entry in w corresponds to a column in X with only one large entry; w becomes a lookup table, and the SVM won't generalize well
  • A better solution: a w (shown on the original slide) that relies on the shared features rather than acting as a lookup table
Situations Leading to Almost Orthogonal Patterns
  • Sparsity of the patterns in the new feature space, e.g.
    • x = [0, 0, 0, 1, 0, 0, 1, 0]T
    • y = [0, 1, 1, 0, 0, 0, 0, 0]T
    • x·x, y·y >> x·y (large diagonals in the Gram matrix)
  • Some choices of kernel function may result in sparsity in the new feature space
    • String kernel (Watkins, 2000)
    • Polynomial kernel k(xi, xj) = (xi·xj)^d with large order d
      • If xi·xi > xi·xj for i ≠ j, then
      • k(xi, xi) >> k(xi, xj) for even moderately large d, due to the exponentiation
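Both situations can be demonstrated in a few lines. The first pair of vectors is the sparse example from the slide; the second pair (my own toy values) shows the polynomial amplification:

```python
import numpy as np

# Two sparse binary patterns: almost orthogonal, as in the example above
x = np.array([0, 0, 0, 1, 0, 0, 1, 0], dtype=float)
y = np.array([0, 1, 1, 0, 0, 0, 0, 0], dtype=float)

print(x @ x, y @ y, x @ y)   # 2.0 2.0 0.0 : diagonals dominate off-diagonals

# A polynomial kernel of large order d amplifies the effect: when
# x.x > x.y, (x.x)^d >> (x.y)^d already for moderate d
u = np.array([1.0, 1.0, 1.0, 0.0])
v = np.array([1.0, 0.0, 0.0, 1.0])
d = 6
print((u @ u) ** d, (u @ v) ** d)   # 729.0 vs 1.0
```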
Gram Matrix Transformation (1/2)
  • For a symmetric, positive definite Gram matrix K (or Km):
    • K = UTDU, where U is a unitary matrix and D is a diagonal matrix
    • Define f(K) = UTf(D)U, with f(D)ii = f(Dii), i.e. the function f operates on the eigenvalues λi of K
    • f(K) should preserve the positive definiteness of the Gram matrix
  • A sample procedure for Gram matrix transformation
    • (Optional) Compute the positive definite matrix A = sqrt(K)
    • Suppress the large diagonals of A to obtain a symmetric A', i.e. transform the eigenvalues of A: [λmin, λmax] → [f(λmin), f(λmax)]
    • Compute the positive definite matrix K' = (A')²
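The procedure can be sketched with an eigendecomposition. This is a minimal illustration, not the paper's exact recipe: the choice of diagonal-shrinking step (halving the diagonal of A) is my own stand-in for "suppress the large diagonals", and `spectral_transform` is an illustrative name:

```python
import numpy as np

def spectral_transform(K, f):
    # f(K) = U f(D) U^T: apply f to the eigenvalues of the symmetric matrix K
    w, U = np.linalg.eigh(K)              # K = U diag(w) U^T
    return (U * f(w)) @ U.T

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 10))
K = X @ X.T                               # positive semi-definite Gram matrix

# Step 1 (optional): A = sqrt(K), taken on the eigenvalues
A = spectral_transform(K, lambda w: np.sqrt(np.clip(w, 0.0, None)))
print(np.allclose(A @ A, K))              # True: A is a matrix square root

# Step 2: suppress the large diagonals of A, keeping A' symmetric
A_prime = A - 0.5 * np.diag(np.diag(A))   # illustrative shrinking, my choice

# Step 3: K' = (A')^2 is positive semi-definite because A' is symmetric
K_prime = A_prime @ A_prime
print(np.linalg.eigvalsh(K_prime).min() >= -1e-9)   # True
```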

Gram Matrix Transformation (2/2)
  • Effect of the matrix transformation: the original kernel values Φ(xi)·Φ(xj) are implicitly transformed into k'(xi, xj) = Φ'(xi)·Φ'(xj), but the explicit form of the new kernel function k' is not available
  • k' is required when the trained SVM is used to predict the testing data
  • A solution: include all test data in K before the matrix transformation K → K', i.e. the testing data has to be known at training time
    • a' and b' are then obtained from the portion of K' corresponding to the training data
    • If xi has been used in calculating K', the prediction on xi can simply use K'i, i = 1, 2, …, m+n, where m is the number of training points and n is the number of testing points

An Approximate Approach Based on Statistics
  • Strictly, the empirical kernel map Φm+n(x) over both training and testing data should be used to calculate the Gram matrix
  • Assuming the dataset size r is large, the map built on the training set alone is statistically a good approximation
  • Therefore, the SVM can simply be trained with the empirical map on the training set, Φm(x), instead of Φm+n(x)

Artificial Data (1/3)
  • String classification
    • String kernel function (Watkins, 2000)
    • Sub-polynomial kernel k(x, y) = [Φ(x)·Φ(y)]^P, 0 < P < 1; for sufficiently small P, the large diagonals of K can be suppressed
    • 50 strings (25 for training, 25 for testing), 20 trials
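The diagonal-suppressing effect of the sub-polynomial kernel can be shown on two sparse, non-negative toy vectors (my own illustrative values; `subpoly` is an illustrative name):

```python
import numpy as np

def subpoly(k_val, P):
    # Sub-polynomial transformation: k'(x, y) = k(x, y)^P with 0 < P < 1
    return k_val ** P

# Sparse non-negative patterns with a large diagonal-to-off-diagonal ratio
x = np.array([4.0, 0.0, 0.0, 1.0])
y = np.array([1.0, 3.0, 0.0, 0.0])

k_xx, k_xy = x @ x, x @ y         # 17.0 and 4.0
P = 0.25
ratio_before = k_xx / k_xy
ratio_after = subpoly(k_xx, P) / subpoly(k_xy, P)
print(ratio_before, ratio_after)  # the diagonal dominance shrinks
```

Raising kernel values to a power P < 1 compresses large values more than small ones, which is exactly the suppression of large diagonals the slide describes.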

Artificial Data (2/3)
  • Microarray data with noise (Alon et al., 1999)
    • 62 instances (22 positive, 40 negative), 2000 features in the original data
    • 10000 noise features were added (each entry non-zero with probability 1%)
  • The error rate for the SVM without noise addition is 0.18 ± 0.15

Artificial Data (3/3)
  • Hidden variable problem
    • 10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables
    • The original kernel is a polynomial kernel of order 4

Real Data (1/3)
  • Thrombin binding problem
    • 1909 instances, 139,351 binary features
    • 0.68% of the entries are non-zero
    • 8-fold cross-validation

Real Data (2/3)
  • Lymphoma classification (Alizadeh et al., 2000)
    • 96 samples, 4026 features
    • 10-fold cross-validation
    • Improved results compared with previous work (Weston, 2001)

Real Data (3/3)
  • Protein family classification (Murzin et al., 1995)
    • Small positive set, large negative set
    • Performance measured by the receiver operating characteristic (ROC) score against the rate of false positives: 1 is the best score, 0 the worst

Conclusions
  • The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed
  • The common situation of sparse vectors leading to large diagonals was identified and discussed
  • A Gram matrix transformation to suppress the large diagonals was proposed to improve performance in such cases
  • Experimental results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed

Comments
  • Strong points:
    • The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation method for suppressing them
    • The experiments are extensive
  • Weak points:
    • The application of the Gram matrix transformation may be severely restricted in forecasting or other applications in which the testing data is not known at training time
    • The proposed Gram matrix transformation method was not tested directly in the experiments; transformed kernel functions were used instead
    • Almost orthogonal patterns imply that multiple pattern vectors in the same direction rarely exist, so the necessary condition for the statistical approach to the pattern distribution is not satisfied