Classification (SVMs / Kernel method)


LP versus Quadratic programming

LP: linear constraints, linear objective function

LP can be solved in polynomial time.

In QP, the objective function contains a quadratic form.

For +ve semidefinite Q, the QP can also be solved in polynomial time
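For reference, the two problem classes can be written in a standard form (generic notation, not from the slides):

    % Linear program: linear objective, linear constraints
    \min_{x} \; c^T x \quad \text{s.t.} \quad Ax \le b
    % Quadratic program: quadratic objective (with matrix Q), linear constraints
    \min_{x} \; \tfrac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad Ax \le b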


Margin of separation

  • Suppose we find a separating hyperplane (β, β0) s.t.

    • For all +ve points x

      • βᵀx - β0 >= 1

    • For all -ve points x

      • βᵀx - β0 <= -1

  • What is the margin of separation?

Tx- 0=1

Tx- 0=0

Tx- 0=-1


Separating by a wider margin

  • Solutions with a wider margin are better.


Separating via misclassification

  • In general, data is not linearly separable

  • What if we also want to minimize the number of misclassified points?

  • Recall that each sample xi in our training set has a label yi ∈ {-1, 1}

  • For each point i, yi(βᵀxi - β0) should be positive

  • Define ξi >= max{0, 1 - yi(βᵀxi - β0)}

  • If xi is correctly classified (yi(βᵀxi - β0) >= 1), then ξi = 0

  • If xi is misclassified, or lies close to the boundary, then ξi > 0

  • We minimize Σi ξi
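A minimal numpy sketch of how these slack values could be computed for a given hyperplane (variable names are illustrative, not from the slides):

    import numpy as np

    def slack_values(X, y, beta, beta0):
        # Slack xi_i = max(0, 1 - y_i * (beta^T x_i - beta_0)).
        # X: (n_samples, n_features) data matrix; y: labels in {-1, +1}.
        margins = y * (X @ beta - beta0)        # y_i (beta^T x_i - beta_0)
        return np.maximum(0.0, 1.0 - margins)   # zero for confidently correct points

    # Total slack to be minimized:  slack_values(X, y, beta, beta0).sum()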


Support Vector machines (wide margin and misclassification)

  • Maximize margin while minimizing misclassification

  • Solved using non-linear optimization techniques

  • The problem can be reformulated so that it uses only dot products of the data points, which allows us to employ the kernel method.

  • This gives a lot of power to the method.
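Putting the two goals together gives the usual soft-margin formulation (a sketch; the trade-off constant C is standard notation, not from the slides):

    \min_{\beta,\,\beta_0,\,\xi} \; \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i
    \quad \text{s.t.} \quad y_i(\beta^T x_i - \beta_0) \ge 1 - \xi_i, \;\; \xi_i \ge 0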


Reformulating the optimization


Lagrangian relaxation

  • Goal

  • S.t.

  • We minimize



  • For fixed α >= 0, μ >= 0, we minimize the Lagrangian



  • Substituting (1)


  • Substituting (2,3), we have the minimization problem
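The equations on this slide did not survive the transcript; the standard derivation they appear to refer to runs roughly as follows (α and μ are the multipliers for the two constraint sets):

    L(\beta, \beta_0, \xi, \alpha, \mu)
      = \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i
        - \sum_i \alpha_i \bigl[ y_i(\beta^T x_i - \beta_0) - 1 + \xi_i \bigr]
        - \sum_i \mu_i \xi_i

    % Setting the derivatives to zero:
    % (1)  \partial L / \partial \beta   = 0  =>  \beta = \sum_i \alpha_i y_i x_i
    % (2)  \partial L / \partial \beta_0 = 0  =>  \sum_i \alpha_i y_i = 0
    % (3)  \partial L / \partial \xi_i   = 0  =>  \alpha_i = C - \mu_i

    % Substituting (1)-(3) leaves a problem in the \alpha_i alone:
    \max_{\alpha} \; \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j
    \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0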


Classification using SVMs

  • Under these conditions, the problem is a quadratic programming problem and can be solved using known techniques

  • Quiz: When we have solved this QP, how do we classify a point x?


The kernel method

  • The SVM formulation can be solved using QP on dot-products.

  • As these are wide-margin classifiers, they provide a more robust solution.

  • However, the true power of the SVM approach comes from using ‘the kernel method’, which allows us to work in higher dimensional (and non-linear) spaces



  • Let X be the set of objects

    • Ex: X = the set of samples in micro-arrays.

    • Each object x ∈ X is a vector of gene expression values

  • k: X × X -> R is a positive semidefinite kernel if

    • k is symmetric.

    • k is +ve semidefinite
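Spelled out, the positive semidefinite condition (the equation itself was lost from the slide) is the standard one:

    % For every finite set x_1, ..., x_n \in X and every c \in \mathbb{R}^n:
    \sum_{i=1}^{n} \sum_{j=1}^{n} c_i\, c_j\, k(x_i, x_j) \;\ge\; 0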


Kernels as dot-product

  • Quiz: Suppose the objects x are all real vectors (as in gene expression)

  • Define kL(x, x') = xᵀx' (the linear kernel)

  • Is kL a kernel? It is symmetric, but is it +ve semidefinite?


Linear kernel is +ve semidefinite

  • Recall that X is the matrix in which each column is a sample

    • X = [x1 x2 …]

  • By definition, the linear kernel matrix is KL = XᵀX

  • For any c
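The calculation the slide refers to is presumably the standard one:

    c^T K_L\, c \;=\; c^T X^T X\, c \;=\; (Xc)^T (Xc) \;=\; \lVert Xc \rVert^2 \;\ge\; 0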


Generalizing kernels

  • Any object can be represented by a feature vector in real space.



  • Note that the feature mapping could actually be non-linear.

  • On the flip side, every kernel can be represented as a dot-product in a high dimensional space.

  • Sometimes the kernel is easier to define than the mapping Φ
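A small numpy check of that claim, using the (illustrative) quadratic kernel k(x, x') = (xᵀx')² and its explicit feature map of all pairwise products:

    import numpy as np

    def phi(x):
        # Explicit feature map: all pairwise products x_i * x_j.
        return np.outer(x, x).ravel()

    def k_quad(x, xp):
        # The same quantity computed directly in the original space.
        return float(x @ xp) ** 2

    x, xp = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
    print(k_quad(x, xp), phi(x) @ phi(xp))   # both equal 20.25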


The kernel trick

  • If an algorithm for vectorial data is expressed exclusively in the form of dot-products, it can be changed to an algorithm on an arbitrary kernel

    • Simply replace the dot-product by the kernel


Kernel trick example

  • Consider a kernel k defined via a mapping Φ

    • k(x, x') = Φ(x)ᵀΦ(x')

  • It could be that Φ is very difficult to compute explicitly, but k is easy to compute

  • Suppose we define a distance function between two objects as the distance between their images, d(x, x') = ‖Φ(x) - Φ(x')‖

  • How do we compute this distance?
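Expanding the square shows the distance needs only kernel evaluations, never Φ itself:

    d(x, x')^2 \;=\; \lVert \Phi(x) - \Phi(x') \rVert^2
              \;=\; k(x, x) \;-\; 2\,k(x, x') \;+\; k(x', x')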


Kernels and SVMs

  • Recall that SVM based classification is described as
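The formulation being recalled did not survive the transcript; reconstructed from the dual derived earlier, it involves the data only through dot products:

    \max_{0 \le \alpha_i \le C,\; \sum_i \alpha_i y_i = 0} \;
      \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j
    \qquad
    f(x) = \operatorname{sign}\Bigl( \sum_i \alpha_i y_i \, x_i^T x - \beta_0 \Bigr)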


Kernels and SVMs

  • Applying the kernel trick

  • We can try kernels that are biologically relevant
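Replacing every dot product above with a kernel evaluation gives the kernelized classifier these bullets describe (a sketch):

    \max_{\alpha} \; \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)
    \qquad
    f(x) = \operatorname{sign}\Bigl( \sum_i \alpha_i y_i \, k(x_i, x) - \beta_0 \Bigr)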


Examples of kernels for vectors
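The examples themselves did not survive the transcript; the kernels named in the conclusion slide (linear, polynomial, RBF) have the standard forms:

    k_{\text{lin}}(x, x') = x^T x' \qquad
    k_{\text{poly}}(x, x') = (x^T x' + 1)^d \qquad
    k_{\text{RBF}}(x, x') = \exp\bigl( -\lVert x - x' \rVert^2 / 2\sigma^2 \bigr)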


String kernel

  • Consider a string s = s1, s2,…

  • Define an index set I as a subset of indices

  • s[I] is the substring limited to those indices

  • l(I) = span of I (the distance from its first to its last index)

  • W(I) = c^l(I), with c < 1

    • Weight decreases as span increases

  • For any string u of length k
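The coordinate for u, lost from the slide, is standardly defined as the total weight of all ways u occurs as a (possibly gapped) subsequence of s:

    \Phi_u(s) \;=\; \sum_{I \,:\, s[I] = u} W(I) \;=\; \sum_{I \,:\, s[I] = u} c^{\,l(I)}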



String Kernel

  • Map every string to a |Σ|^n dimensional space, indexed by all strings u of length up to n

  • The mapping is expensive, but given two strings s, t, the dot-product kernel k(s,t) = Φ(s)ᵀΦ(t) can be computed in O(n |s| |t|) time
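A deliberately naive (exponential-time) sketch, useful only for checking intuition on tiny strings; the efficient O(n |s| |t|) dynamic program mentioned above is not shown, and for brevity the space is indexed by strings u of length exactly n:

    from itertools import combinations, product

    def phi_u(s, u, c=0.5):
        # Phi_u(s): sum of c**span over all index sets I with s[I] == u.
        total = 0.0
        for I in combinations(range(len(s)), len(u)):
            if "".join(s[i] for i in I) == u:
                total += c ** (I[-1] - I[0] + 1)   # W(I) = c^{l(I)}
        return total

    def string_kernel(s, t, alphabet, n, c=0.5):
        # k(s, t) = sum over all strings u of length n of Phi_u(s) * Phi_u(t).
        return sum(phi_u(s, "".join(u), c) * phi_u(t, "".join(u), c)
                   for u in product(alphabet, repeat=n))

    print(string_kernel("cat", "car", "acrt", 2))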




SVM conclusion

  • SVMs are a generic scheme for classifying data with wide margins and few misclassifications

  • For data that is not easily represented as vectors, the kernel trick provides a standard recipe for classification

    • Define a meaningful kernel, and solve using SVM

  • Many standard kernels are available (linear, poly., RBF, string)


Classification review

  • We started out by treating the classification problem as one of separating points in high dimensional space

  • Obvious for gene expression data, but applicable to any kind of data

  • Question of separability, linear separation

  • Algorithms for classification

    • Perceptron

    • Lin. Discriminant

    • Max Likelihood

    • Linear Programming

    • SVMs

    • Kernel methods & SVM


Classification review

  • Recall that we considered 3 problems:

    • Group together samples in an unsupervised fashion (clustering)

    • Classify based on training data (often by learning a hyperplane that separates the classes).

    • Selection of marker genes that are diagnostic for the class. All other genes can be discarded, leading to lower dimensionality.


Dimensionality reduction

  • Many genes have highly correlated expression profiles.

  • By discarding some of the genes, we can greatly reduce the dimensionality of the problem.

  • There are other, more principled ways to do such dimensionality reduction.


Why is high dimensionality bad?

  • With a high enough dimensionality, all points can be linearly separated.

  • Recall that a point xi is misclassified if

    • it is +ve, but βᵀxi - β0 <= 0

    • it is -ve, but βᵀxi - β0 > 0

  • In the first case, choose δi s.t.

    • βᵀxi - β0 + δi >= 0

  • By adding a dimension for each misclassified point, we create a higher-dimensional hyperplane that perfectly separates all of the points!
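A small numpy sketch of this argument (data and names are illustrative): give each misclassified point its own extra coordinate, put a large enough weight on that coordinate, and every point lands on the correct side.

    import numpy as np

    # 1-D data that no single threshold separates; labels in {-1, +1}.
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([-1, +1, -1, +1])
    beta, beta0 = np.array([1.0]), 1.5                   # misclassifies points 1 and 2

    scores = X @ beta - beta0
    mis = np.where(y * scores <= 0)[0]                   # misclassified indices

    # One new coordinate per misclassified point (1 for that point, 0 elsewhere).
    X_aug = np.hstack([X, np.zeros((len(X), len(mis)))])
    beta_aug = np.concatenate([beta, np.zeros(len(mis))])
    for j, i in enumerate(mis):
        X_aug[i, 1 + j] = 1.0
        beta_aug[1 + j] = y[i] * (1.0 + abs(scores[i]))  # enough weight to flip the sign

    print(np.sign(X_aug @ beta_aug - beta0) == y)        # [ True  True  True  True]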


Principal Components Analysis

  • We get the intrinsic dimensionality of a data-set.


Principal Components Analysis

  • Consider the expression values of 2 genes over 6 samples.

  • Clearly, the expression of the two genes is highly correlated.

  • Projecting all the genes on a single line could explain most of the data.

  • This is a generalization of “discarding the gene”.



  • Consider the mean m of all points, and a vector φ emanating from the mean

  • Algebraically, projecting onto φ means that each sample x can be represented by a single value φᵀ(x - m)









Higher dimensions

  • Consider a set of 2 (more generally, k) orthonormal vectors φ1, φ2, …

  • Once projected, each sample x can be represented by a 2 (k) dimensional vector

    • (φ1ᵀ(x - m), φ2ᵀ(x - m))











How to project

  • The generic scheme allows us to project m dimensional data onto a k dimensional subspace.

  • How do we select the k ‘best’ dimensions?

  • The strategy used by PCA is one that maximizes the variance of the projected points around the mean



  • Suppose all of the data were to be reduced by projecting onto a single line through the mean, in direction φ.

  • How do we select the line φ?



PCA cont’d

  • Let each point xk map to x’k = m + ak φ. We want to minimize the squared error Σk ‖xk - x’k‖²

  • Observation 1: Each point xk maps to x’k = m + (φᵀ(xk - m)) φ

    • (i.e., ak = φᵀ(xk - m))





Proof of Observation 1

Differentiating w.r.t ak
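The differentiation itself was lost from the slide; assuming the error J = Σk ‖m + ak φ - xk‖² and ‖φ‖ = 1, it runs:

    \frac{\partial J}{\partial a_k}
      \;=\; 2\,\varphi^T (m + a_k \varphi - x_k)
      \;=\; 2\bigl( a_k - \varphi^T (x_k - m) \bigr) \;=\; 0
    \quad \Longrightarrow \quad a_k = \varphi^T (x_k - m)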


Minimizing PCA Error

  • To minimize the error, we must maximize φᵀSφ, where S is the scatter matrix of the data

  • Maximizing φᵀSφ subject to φᵀφ = 1 leads to Sφ = λφ, i.e. λ is an eigenvalue and φ the corresponding eigenvector.

  • Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.


PCA steps

  • X = starting matrix with n columns, m rows
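The remaining steps were lost from the slide; a minimal numpy sketch of the usual procedure (center the columns, form the scatter matrix, keep the top-k eigenvectors) would be:

    import numpy as np

    def pca_project(X, k):
        # X: m x n matrix with one sample per column; returns the k x n projections.
        m = X.mean(axis=1, keepdims=True)           # mean sample
        Xc = X - m                                  # center every column
        S = Xc @ Xc.T / X.shape[1]                  # m x m scatter / covariance matrix
        vals, vecs = np.linalg.eigh(S)              # eigenvalues in ascending order
        top = vecs[:, np.argsort(vals)[::-1][:k]]   # eigenvectors of the k largest eigenvalues
        return top.T @ Xc                           # coordinates phi_j^T (x - m)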




End of Lecture



ALL-AML classification

  • The two leukemias require different therapeutic regimens.

  • Usually distinguished through hematopathology

  • Can Gene expression be used for a more definitive test?

    • 38 bone marrow samples

    • Total mRNA was hybridized against probes for 6817 genes

    • Q: Are these classes separable?


Neighborhood analysis (cont’d)

  • Each gene is represented by an expression vector v(g) = (e1,e2,…,en)

  • Choose an idealized expression vector as center.

  • Discriminating genes will be ‘closer’ to the center (any distance measure can be used).

[Figure: a discriminating gene lies close to the idealized center]


Neighborhood analysis

  • Q: Are there genes whose expression correlates with one of the two classes?

  • A: For each class, create an idealized vector c

    • Compute the number of genes Nc whose expression ‘matches’ the idealized expression vector

    • Is Nc significantly larger than Nc* for a random center c*?


Neighborhood test

  • Distance measure used:

    • For any binary vector c, let the one entries denote class 1, and the 0 entries denote class 2

    • Compute the mean and std. dev. [μ1(g), σ1(g)] of expression in class 1, and likewise [μ2(g), σ2(g)] for class 2.

    • P(g,c) = [μ1(g) - μ2(g)] / [σ1(g) + σ2(g)]

    • N1(c,r) = {g | P(g,c) = r}

    • High density for some r is indicative of correlation with class distinction

    • Neighborhood is significant if a random center does not produce the same density.


Neighborhood analysis

  • #{g |P(g,c) > 0.3} > 709 (ALL) vs 173 by chance.

  • Class prediction should be possible using micro-array expression values.


Class prediction

  • Choose a fixed set of informative genes (based on their correlation with the class distinction).

    • The predictor is uniquely defined by the sample and the subset of informative genes.

  • For each informative gene g, define (wg,bg).

    • wg=P(g,c) (When is this +ve?)

    • bg = [μ1(g) + μ2(g)]/2

  • Given a new sample X

    • xg is the normalized expression value at g

    • Vote of gene g: vg = wg(xg - bg) (a +ve value is a vote for class 1, a negative value for class 2)
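A compact numpy sketch of this weighted-voting scheme (names are illustrative; the expression matrix is assumed to be genes x samples, with labels given as a numpy array):

    import numpy as np

    def train_predictor(E, labels):
        # E: genes x samples matrix restricted to the informative genes;
        # labels: 1 for class 1, 2 for class 2.
        E1, E2 = E[:, labels == 1], E[:, labels == 2]
        mu1, mu2 = E1.mean(axis=1), E2.mean(axis=1)
        sd1, sd2 = E1.std(axis=1), E2.std(axis=1)
        w = (mu1 - mu2) / (sd1 + sd2)     # w_g = P(g, c), the signal-to-noise correlation
        b = (mu1 + mu2) / 2.0             # b_g: midpoint between the class means
        return w, b

    def predict(x, w, b):
        # x: normalized expression vector of a new sample over the same genes.
        votes = w * (x - b)               # +ve vote -> class 1, -ve vote -> class 2
        v1, v2 = votes[votes > 0].sum(), -votes[votes < 0].sum()
        ps = (max(v1, v2) - min(v1, v2)) / (v1 + v2)   # prediction strength
        return (1 if v1 > v2 else 2), ps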


Prediction Strength

  • PS = [Vwin-Vlose]/[Vwin+Vlose]

    • Reflects the margin of victory

  • A 50 gene predictor is correct 36/38 (cross-validation)

  • Prediction accuracy on other samples: 100% (predictions made for 29/34 samples).

  • Median PS = 0.73

  • Other predictors between 10 and 200 genes all worked well.




Differentially expressed genes?

  • Do the predictive genes reveal any biology?

  • Initial expectation is that most genes would be of a hematopoietic lineage.

  • However, many genes encode

    • Cell cycle progression genes

    • Chromatin remodelling

    • Transcription

    • Known oncogenes

    • Leukemia targets (etoposide)


Relationship between ML and the Golub predictor

ML classification, when the covariance matrix is diagonal with identical variance across classes, is similar to Golub’s classifier


Automatic class discovery

  • The classification of different cancers has emerged over years of hypothesis-driven research.

  • Suppose you were given unlabeled samples of ALL/AML. Would you be able to distinguish the two classes?


Self Organizing Maps

  • SOMs were applied to group the 38 samples

  • Class A1 contained 24/25 ALL and 3/13 AML samples.

  • How can we validate this?

  • Use the labels to do supervised classification via cross-validation

  • A 20 gene predictor gave 34 accurate predictions, 1 error, and 2 of 3 uncertains


Comparing various error models



