A Kernel Approach for Learning From Almost Orthogonal Patterns *


### A Kernel Approach for Learning From Almost Orthogonal Patterns*

CIS 525 Class Presentation

Professor: Slobodan Vucetic

Presenter: Yilian Qin

* B. Schölkopf et al., Proc. 13th ECML, Aug. 19-23, 2002, pp. 511-528.

Presentation Outline
• Introduction
• Motivation
• A brief review of SVM for linearly separable patterns
• Kernel approach for SVM
• Empirical kernel map
• Problem: almost orthogonal patterns in feature space
• An example
• Situations leading to almost orthogonal patterns
• Method to reduce large diagonals of Gram matrix
• Gram matrix transformation
• An approximate approach based on statistics
• Experiments
• Artificial data (String classification, Microarray data with noise, Hidden variable problem)
• Real data (Thrombin binding, Lymphoma classification, Protein family classification)
• Conclusions
Motivation
• Support vector machine (SVM)
• A powerful method for classification (or regression), with accuracy comparable to neural networks
• Exploits kernel functions to separate patterns in a high-dimensional space
• The information of the training data for an SVM is stored in the Gram matrix (kernel matrix)
• The problem:
• SVM doesn't perform well if the Gram matrix has large diagonal values

[Figure: two linearly separable classes of points and a separating hyperplane; the margin depends on the closest points]

A Brief Review of SVM

For linearly separable patterns, the margin is maximized by solving:

Minimize: (1/2)||w||²

Constraints: y_i(w · x_i + b) ≥ 1, for i = 1, …, m
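As a concrete illustration (a minimal sketch, assuming scikit-learn, which the slides do not mention), a very large C approximates this hard-margin problem:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable classes
X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM:
# minimize (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("margin width = 2/||w|| =", 2.0 / np.linalg.norm(w))
```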

Kernel Approach for SVM (1/3)
• For linearly non-separable patterns
• Nonlinear mapping function Φ: x ↦ Φ(x) ∈ H, mapping the patterns to a new feature space H of higher dimension
• For example: the XOR problem
• SVM in the new feature space:

Minimize: (1/2)||w||²

Constraints: y_i(w · Φ(x_i) + b) ≥ 1

• The kernel trick:
• Solving the above minimization problem requires: 1) the explicit form of Φ, and 2) inner products in the high-dimensional space H
• Simplification by wise selection of a kernel function with the property: k(x_i, x_j) = Φ(x_i) · Φ(x_j)
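A minimal sketch of the XOR example, again assuming scikit-learn: a degree-2 polynomial kernel implicitly provides the cross-term feature that makes XOR linearly separable in H.

```python
import numpy as np
from sklearn.svm import SVC

# The XOR problem: not linearly separable in the input space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# k(x_i, x_j) = (gamma * x_i . x_j + 1)^2 corresponds to a feature map
# containing the product x1*x2, which linearly separates XOR in H
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6).fit(X, y)
print(clf.predict(X))  # expected: [-1  1  1 -1]
```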

Kernel Approach for SVM (2/3)
• Transform the problem with the kernel method
• Expand w in the new feature space: w = Σ_i a_i Φ(x_i) = [Φ(x)] a, where [Φ(x)] = [Φ(x_1), Φ(x_2), …, Φ(x_m)] and a = [a_1, a_2, …, a_m]^T
• Gram matrix: K = [K_ij], where K_ij = Φ(x_i) · Φ(x_j) = k(x_i, x_j) (symmetric!)
• The (squared) objective function: ||w||² = a^T [Φ(x)]^T [Φ(x)] a = a^T K a (a sufficient condition for the existence of an optimal solution: K is positive definite)
• The constraints: y_i{w^T Φ(x_i) + b} = y_i{a^T [Φ(x)]^T Φ(x_i) + b} = y_i{a^T K_i + b} ≥ 1, where K_i is the i-th column of K
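These identities are easy to check numerically; the random Φ and a below are purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.random((4, 10))      # rows: mapped patterns Phi(x_i), m = 4
a = rng.random(4)

K = Phi @ Phi.T                # Gram matrix K_ij = Phi(x_i) . Phi(x_j)
w = Phi.T @ a                  # w = sum_i a_i Phi(x_i)

print(np.allclose(w @ w, a @ K @ a))   # ||w||^2 = a^T K a
print(np.allclose(Phi @ w, K @ a))     # w . Phi(x_i) = a . K_i
```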

Kernel Approach for SVM (3/3)
• To predict new data x with a trained SVM:

f(x) = sign( Σ_{i=1}^{m} a_i k(x_i, x) + b )

where a and b are the optimal solution based on the training data, and m is the number of training instances

• The explicit form of k(x_i, x_j) is required for the prediction of new data

Empirical Kernel Mapping
• Assumption: m (the number of instances) is a sufficiently high dimension for the new feature space, i.e. the patterns will be linearly separable in the m-dimensional space R^m
• Empirical kernel map: Φ_m(x_i) = [k(x_i, x_1), k(x_i, x_2), …, k(x_i, x_m)]^T = K_i
• The SVM in R^m:

Minimize: (1/2)||w||²

Constraints: y_i(w · Φ_m(x_i) + b) ≥ 1

• The new Gram matrix K^m associated with Φ_m(x):

K^m = [K^m_ij], where K^m_ij = Φ_m(x_i) · Φ_m(x_j) = K_i · K_j = K_i^T K_j, i.e. K^m = K^T K = K K^T (since K is symmetric)

• Advantage of the empirical kernel map: K^m is positive definite
• K^m = K K^T = (U^T D U)(U^T D U)^T = U^T D² U (K is symmetric, U is a unitary matrix, D is diagonal)
• This satisfies the sufficient condition of the above minimization problem
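A minimal NumPy sketch of the new Gram matrix (the matrix K below is a hypothetical example):

```python
import numpy as np

def empirical_gram(K):
    """New Gram matrix of the empirical kernel map:
    phi_m(x_i) = K_i (the i-th column of K), so K^m = K K^T
    (= K^T K, since K is symmetric)."""
    return K @ K.T

K = np.array([[4.0, 0.1, 0.2],
              [0.1, 3.0, 0.1],
              [0.2, 0.1, 5.0]])
Km = empirical_gram(K)
print(np.linalg.eigvalsh(Km))  # squared eigenvalues of K: all non-negative
```

Because the eigenvalues of K^m are the squares of those of K, K^m is positive semidefinite even when K itself is indefinite, which is the advantage noted above.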
The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance

An Example of Almost Orthogonal Patterns
• The Gram matrix with the linear kernel k(x_i, x_j) = x_i · x_j
• A training dataset with almost orthogonal patterns produces a Gram matrix with large diagonals
• w is the solution found by the standard SVM
• Observation: each large entry in w corresponds to a column in X with only one large entry, so w becomes a lookup table and the SVM won't generalize well
• A better solution puts its weight on features shared across patterns rather than on pattern-specific entries
Situations Leading to Almost Orthogonal Patterns
• Sparsity of the patterns in the new feature space, e.g.
• x = [0, 0, 0, 1, 0, 0, 1, 0]^T
• y = [0, 1, 1, 0, 0, 0, 0, 0]^T
• x · x ≈ y · y ≫ x · y (large diagonals in the Gram matrix)
• Some choices of kernel function may result in sparsity in the new feature space
• String kernel (Watkins, 2000)
• Polynomial kernel k(x_i, x_j) = (x_i · x_j)^d with large order d
• If x_i · x_i > x_i · x_j for i ≠ j, then k(x_i, x_i) ≫ k(x_i, x_j) for even moderately large d, because raising to the power d amplifies the gap
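A small numerical illustration of both effects, using hypothetical sparse random patterns:

```python
import numpy as np

rng = np.random.default_rng(0)
# Sparse binary patterns: ~1% of entries non-zero, so x.x ~ y.y >> x.y
X = (rng.random((5, 1000)) < 0.01).astype(float)

K = X @ X.T                              # linear kernel
off = ~np.eye(5, dtype=bool)
print(np.diag(K).mean(), K[off].mean())  # diagonal >> off-diagonal

# A polynomial kernel of large order d amplifies the gap further:
d = 5
Kd = K ** d                              # k(x_i, x_j) = (x_i . x_j)^d
print(np.diag(Kd).mean(), Kd[off].mean())
```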
Gram Matrix Transformation (1/2)
• For a symmetric, positive definite Gram matrix K (or K^m):
• K = U^T D U, where U is a unitary matrix and D is a diagonal matrix
• Define f(K) = U^T f(D) U, with f(D)_ii = f(D_ii), i.e. the function f operates on the eigenvalues λ_i of K
• f(K) should preserve the positive definiteness of the Gram matrix
• A sample procedure for Gram matrix transformation:
• (Optional) Compute the positive definite matrix A = sqrt(K)
• Suppress the large diagonals of A and obtain a symmetric A', i.e. transform the eigenvalues of A: [λ_min, λ_max] → [f(λ_min), f(λ_max)]
• Compute the positive definite matrix K' = (A')²
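A minimal sketch of the spectral transformation f(K) = U^T f(D) U. The shift f(λ) = λ - c with c ≤ λ_min is one simple choice that provably keeps K' positive semidefinite while leaving the off-diagonal entries of K unchanged; it is an illustration, not necessarily the f used in the paper's experiments.

```python
import numpy as np

def transform_gram(K, f):
    """f(K): apply f to the eigenvalues of the symmetric Gram matrix K
    (np.linalg.eigh returns K = U diag(lam) U^T); a non-negative f
    preserves positive semidefiniteness."""
    lam, U = np.linalg.eigh(K)
    lam = np.clip(lam, 0.0, None)     # guard against numerical negatives
    return (U * f(lam)) @ U.T         # U diag(f(lam)) U^T

# Hypothetical Gram matrix with large diagonals
K = np.full((3, 3), 1.0) + np.diag([99.0, 89.0, 79.0])
c = 0.95 * np.linalg.eigvalsh(K).min()
K_prime = transform_gram(K, lambda lam: lam - c)   # equals K - c*I
print(np.diag(K), np.diag(K_prime))  # diagonal suppressed, K' still PSD
```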

[Diagram: the implicit transformation. The original kernel k(x_i, x_j) = Φ(x_i) · Φ(x_j) yields the Gram matrix K; the transformation K' = f(K) implicitly defines a new map Φ'(x) and a new kernel k'(x_i, x_j) = Φ'(x_i) · Φ'(x_j).]

The optimal a' and b' are obtained from the portion of K' corresponding to the training data.

If x_i has been used in calculating K', the prediction on x_i can simply use K'_i, for i = 1, 2, …, m+n, where m is the number of training instances and n is the number of test instances.

Gram Matrix Transformation (2/2)
• Effect of the matrix transformation:
• The explicit form of the new kernel function k' is not available
• k' would be required when the trained SVM is used to predict the test data
• A solution: include all test data in K before the matrix transformation K → K', i.e. the test data has to be known at training time
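A sketch of this transductive recipe, assuming scikit-learn's precomputed-kernel SVC, the transform_gram helper from the previous sketch, and hypothetical data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_tr, y_tr = rng.random((20, 50)), rng.integers(0, 2, 20)   # hypothetical
X_te = rng.random((5, 50))

# Build the Gram matrix over training AND test patterns, then transform
X_all = np.vstack([X_tr, X_te])
K = X_all @ X_all.T                       # linear kernel, for illustration
c = 0.95 * np.linalg.eigvalsh(K).min()
K_prime = transform_gram(K, lambda lam: lam - c)

m = len(X_tr)
clf = SVC(kernel="precomputed").fit(K_prime[:m, :m], y_tr)
pred = clf.predict(K_prime[m:, :m])       # test rows vs. training columns
```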
An Approximate Approach Based on Statistics
• Strictly, the empirical kernel map Φ_{m+n}(x), built over both the training and test patterns, should be used to calculate the Gram matrix
• Assuming the dataset size is large, the additional n coordinates change the map little, so Φ_m(x) is statistically close to Φ_{m+n}(x)
• Therefore, the SVM can simply be trained with the empirical map on the training set, Φ_m(x), instead of Φ_{m+n}(x)
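In code, the approximation simply means representing every pattern, including unseen test points, by its kernel values against the m training patterns (hypothetical data; a linear base kernel for brevity):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_tr, y_tr = rng.random((30, 40)), rng.integers(0, 2, 30)   # hypothetical
X_te = rng.random((8, 40))

# Empirical kernel map over the training set only:
# phi_m(x) = [k(x, x_1), ..., k(x, x_m)]
Phi_tr = X_tr @ X_tr.T        # rows are phi_m(x_i)
Phi_te = X_te @ X_tr.T        # the same map applies to unseen test points

clf = SVC(kernel="linear").fit(Phi_tr, y_tr)   # linear SVM in R^m
pred = clf.predict(Phi_te)
```

Unlike the transductive recipe above, nothing about the test set needs to be known at training time.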
Artificial Data (1/3)
• String classification
• String kernel function (Watkins, 2000)
• Sub-polynomial kernel k(x, y) = [Φ(x) · Φ(y)]^p, 0 < p < 1; for sufficiently small p, the large diagonals of K can be suppressed
• 50 strings (25 for training and 25 for testing), 20 trials
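A sketch of the sub-polynomial transformation applied elementwise to a Gram matrix of non-negative kernel values (as a string kernel produces). Note that an elementwise power need not preserve positive definiteness, which is one reason it is paired with the empirical kernel map:

```python
import numpy as np

def subpolynomial(K, p=0.5):
    """Elementwise k'(x, y) = k(x, y)**p with 0 < p < 1; assumes the base
    kernel values are non-negative (true for string kernels). Small p
    compresses large values, suppressing the Gram diagonal."""
    return np.clip(K, 0.0, None) ** p

K = np.array([[50.0, 1.0], [1.0, 40.0]])
print(subpolynomial(K, p=0.2))  # diagonal shrinks from ~50x to ~2x off-diag
```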
Artificial Data (2/3)
• Microarray data with noise (Alon et al., 1999)
• 62 instances (22 positive, 40 negative), 2000 features in the original data
• 10,000 noise features were added (each entry non-zero with probability 1%)

The error rate for the SVM without noise addition is 0.18 ± 0.15.

Artificial Data (3/3)
• Hidden variable problem
• 10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables
• The original kernel is a polynomial kernel of order 4
Real Data (1/3)
• Thrombin binding problem
• 1909 instances, 139,351 binary features
• 0.68% of the entries are non-zero
• 8-fold cross-validation
Real Data (2/3)
• Lymphoma classification (Alizadeh et al., 2000)
• 96 samples, 4026 features
• 10-fold cross-validation
• Improved results observed compared with previous work (Weston, 2001)
Real Data (3/3)
• Protein family classification (Murzin et al., 1995)
• Small positive set, large negative set

[Figure: method scores from 0 (worst) to 1 (best) plotted against the rate of false positives]

Conclusions
• The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed
• The common situation in which sparse vectors lead to large diagonals was identified and discussed
• A method of Gram matrix transformation that suppresses the large diagonals was proposed to improve performance in such cases
• Experimental results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed