# Classification (SVMs / Kernel method)


### Classification (SVMs / Kernel method)

Bafna/Ideker

### LP versus Quadratic programming
• LP: linear constraints, linear objective function.
• LP can be solved in polynomial time.
• In QP, the objective function contains a quadratic form.
• For +ve semidefinite $Q$, the QP can be solved in polynomial time.
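In standard form (stated here for reference; the slide does not spell these out), the two problems are

$$\text{LP:}\;\; \min_x\, c^T x \;\text{ s.t. } Ax \le b \qquad\qquad \text{QP:}\;\; \min_x\, \tfrac{1}{2}\, x^T Q x + c^T x \;\text{ s.t. } Ax \le b$$

When $Q$ is +ve semidefinite the quadratic objective is convex, which is what makes polynomial-time solution possible.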


### Margin of separation
• Suppose we find a separating hyperplane $(\beta, \beta_0)$ s.t.
• for all +ve points $x$: $\beta^T x - \beta_0 \ge 1$
• for all -ve points $x$: $\beta^T x - \beta_0 \le -1$
• What is the margin of separation?

Tx- 0=1

Tx- 0=0

Tx- 0=-1


### Separating by a wider margin
• Solutions with a wider margin are better: they are more robust to noise in the data.


### Separating via misclassification
• In general, data is not linearly separable.
• What if we also wanted to minimize the number of misclassified points?
• Recall that each sample $x_i$ in our training set has a label $y_i \in \{-1, 1\}$.
• For each point $i$, $y_i(\beta^T x_i - \beta_0)$ should be positive.
• Define $\xi_i \ge \max\{0,\; 1 - y_i(\beta^T x_i - \beta_0)\}$.
• If point $i$ is correctly classified ($y_i(\beta^T x_i - \beta_0) \ge 1$), then $\xi_i = 0$.
• If point $i$ is incorrectly classified, or close to the boundary, then $\xi_i > 0$.
• We must minimize $\sum_i \xi_i$ (a small numeric sketch follows below).
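As a concrete illustration (not part of the original slides; names are ours), a minimal numpy sketch that computes the slacks $\xi_i$ for a given hyperplane:

```python
import numpy as np

def slacks(X, y, beta, beta0):
    """X: n x d sample matrix, y: labels in {-1, +1}.
    Returns xi_i = max(0, 1 - y_i * (beta^T x_i - beta0)) for every point."""
    margins = y * (X @ beta - beta0)
    return np.maximum(0.0, 1.0 - margins)

# xi_i == 0: classified correctly with margin; xi_i > 0: inside the
# margin or misclassified. sum(slacks(...)) is the term the SVM minimizes.
```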


### Support Vector machines (wide margin and misclassification)
• Maximize the margin while minimizing misclassification.
• Solved using non-linear optimization techniques.
• The problem can be reformulated so that samples appear exclusively through dot products, which allows us to employ the kernel method.
• This gives a lot of power to the method.


### Lagrangian relaxation
• Goal: $\min_{\beta, \beta_0, \xi}\; \frac{1}{2}\|\beta\|^2 + C\sum_i \xi_i$
• s.t. $y_i(\beta^T x_i - \beta_0) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
• We minimize the Lagrangian
$$L = \frac{1}{2}\|\beta\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\big[y_i(\beta^T x_i - \beta_0) - 1 + \xi_i\big] - \sum_i \mu_i \xi_i$$


### Simplifying
• For fixed $\alpha \ge 0$, $\mu \ge 0$, we minimize the Lagrangian by setting its derivatives to zero:
$$\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_i \alpha_i y_i x_i \qquad (1)$$
$$\frac{\partial L}{\partial \beta_0} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C - \mu_i$$


### Substituting
• Substituting (1) back into the Lagrangian eliminates $\beta$, $\beta_0$, and $\xi$, leaving the dual:
$$\max_\alpha\; \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t.}\;\; 0 \le \alpha_i \le C,\;\; \sum_i \alpha_i y_i = 0$$


### Classification using SVMs
• Under these conditions, the problem is a quadratic programming problem and can be solved using known techniques.
• Quiz: when we have solved this QP, how do we classify a point $x$?
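For reference, the standard answer: by (1), the decision rule needs only dot products with the training points,

$$f(x) = \mathrm{sign}(\beta^T x - \beta_0) = \mathrm{sign}\Big(\sum_i \alpha_i y_i\, x_i^T x - \beta_0\Big).$$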


### The kernel method
• The SVM formulation can be solved using QP on dot products.
• As these are wide-margin classifiers, they provide a more robust solution.
• However, the true power of SVMs comes from using 'the kernel method', which allows us to move to higher-dimensional (and non-linear) spaces.


### Kernel
• Let $X$ be the set of objects.
• Ex: $X$ = the set of samples in micro-arrays; each object $x \in X$ is a vector of gene expression values.
• $k : X \times X \to \mathbb{R}$ is a positive semidefinite kernel if
• $k$ is symmetric: $k(x, x') = k(x', x)$, and
• $k$ is +ve semidefinite: for every finite set $x_1, \ldots, x_n$ and every $c \in \mathbb{R}^n$, $\sum_{i,j} c_i c_j\, k(x_i, x_j) \ge 0$.


### Kernels as dot products
• Quiz: suppose the objects $x$ are all real vectors (as in gene expression).
• Define the linear kernel $k_L(x, x') = x^T x'$.
• Is $k_L$ a kernel? It is symmetric, but is it +ve semidefinite?


### Linear kernel is +ve semidefinite
• Regard $X$ as a matrix in which each column is a sample: $X = [x_1\; x_2\; \ldots]$.
• By definition, the linear kernel matrix is $k_L = X^T X$.
• For any $c$:
$$c^T k_L c = c^T X^T X c = \|Xc\|^2 \ge 0$$
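A quick numerical sanity check of this fact (a numpy sketch, not part of the original slides):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(5, 8))  # 5 features, 8 samples
K = X.T @ X                                       # linear kernel (Gram) matrix
print(np.linalg.eigvalsh(K).min() >= -1e-10)      # True: no negative eigenvalues
```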


### Generalizing kernels
• Any object $x$ can be represented by a feature vector $\phi(x)$ in real space; the dot product $k(x, x') = \phi(x)^T \phi(x')$ then defines a kernel.


### Generalizing
• Note that the feature mapping $\phi$ could actually be non-linear.
• On the flip side, every kernel can be represented as a dot product in a high-dimensional space.
• Sometimes the kernel is easier to define than the mapping $\phi$.


### The kernel trick
• If an algorithm for vectorial data is expressed exclusively in terms of dot products, it can be turned into an algorithm over an arbitrary kernel:
• simply replace each dot product by the kernel.


### Kernel trick example
• Consider a kernel $k$ defined by a mapping $\phi$: $k(x, x') = \phi(x)^T \phi(x')$.
• It could be that $\phi$ is very difficult to compute explicitly, but $k$ is easy to compute.
• Suppose we define the distance between two objects as $d(x, x')^2 = \|\phi(x) - \phi(x')\|^2$.
• How do we compute this distance? Expanding the square leaves only dot products:
$$d(x, x')^2 = k(x, x) - 2\,k(x, x') + k(x', x')$$
(a numeric sketch follows below).
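A minimal numpy sketch of this identity, assuming an RBF kernel $k(x, x') = \exp(-\gamma\|x - x'\|^2)$, for which $\phi$ is infinite-dimensional and the kernel is the only practical route:

```python
import numpy as np

def rbf(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def kernel_distance_sq(x, xp, k=rbf):
    # d(x, x')^2 = k(x,x) - 2 k(x,x') + k(x',x'), without ever forming phi
    return k(x, x) - 2 * k(x, xp) + k(xp, xp)

x, xp = np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(kernel_distance_sq(x, xp))
```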


### Kernels and SVMs
• Recall that SVM-based classification is described as
$$f(x) = \mathrm{sign}\Big(\sum_i \alpha_i y_i\, x_i^T x - \beta_0\Big)$$


### Kernels and SVMs
• Applying the kernel trick, replace the dot product $x_i^T x$ by $k(x_i, x)$:
$$f(x) = \mathrm{sign}\Big(\sum_i \alpha_i y_i\, k(x_i, x) - \beta_0\Big)$$
• We can try kernels that are biologically relevant.


### String kernel
• Consider a string $s = s_1 s_2 \ldots$
• Define an index set $I = \{i_1 < i_2 < \ldots < i_k\}$ as a subset of indices; $s[I]$ is the substring of $s$ limited to those indices.
• $l(I) = i_k - i_1 + 1$ is the span of $I$.
• $W(I) = c^{l(I)}$ for a fixed $c < 1$.
• Weight decreases as span increases.
• For any string $u$ of length $k$, the corresponding feature of $s$ is
$$\phi_u(s) = \sum_{I:\; s[I] = u} W(I)$$

### String Kernel
• Map every string to a $|\Sigma|^n$-dimensional space, indexed by all strings $u$ of length up to $n$ (where $\Sigma$ is the alphabet).
• The mapping is expensive, but given two strings $s, t$, the dot-product kernel $k(s, t) = \phi(s)^T \phi(t)$ can be computed in $O(n\,|s|\,|t|)$ time (a brute-force sketch follows below).
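For intuition, a brute-force sketch of these definitions (function names are ours; it enumerates all index sets, so it is exponential, unlike the $O(n\,|s|\,|t|)$ dynamic program the slide refers to):

```python
from itertools import combinations, product

def phi_u(s, u, c=0.5):
    # phi_u(s) = sum of W(I) = c**l(I) over all index sets I with s[I] == u
    total = 0.0
    for I in combinations(range(len(s)), len(u)):
        if all(s[i] == ch for i, ch in zip(I, u)):
            total += c ** (I[-1] - I[0] + 1)  # l(I) = span of I
    return total

def naive_string_kernel(s, t, alphabet, k, c=0.5):
    # k(s, t) = sum over all u of length k of phi_u(s) * phi_u(t)
    return sum(phi_u(s, u, c) * phi_u(t, u, c)
               for u in map(''.join, product(alphabet, repeat=k)))

print(naive_string_kernel("cat", "cart", "acrt", 2))
```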


### SVM conclusion
• SVMs are a generic scheme for classifying data with wide margins and low misclassification.
• For data that is not easily represented as vectors, the kernel trick provides a standard recipe for classification:
• define a meaningful kernel, and solve using an SVM.
• Many standard kernels are available (linear, polynomial, RBF, string); see the usage sketch below.
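In practice this recipe is a few lines; a minimal sketch assuming scikit-learn is available (any QP-based SVM solver would do):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                 # 40 samples, 5 "genes"
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels from a linear rule

for kernel in ("linear", "poly", "rbf"):     # swap in different kernels
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, clf.score(X, y))           # training accuracy
```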


### Classification review
• We started out by treating the classification problem as one of separating points in high-dimensional space.
• Obvious for gene expression data, but applicable to any kind of data.
• Questions of separability and linear separation.
• Algorithms for classification:
• Perceptron
• Linear discriminant
• Maximum likelihood
• Linear programming
• SVMs
• Kernel methods & SVMs


### Classification review
• Recall that we considered 3 problems:
• Group samples together in an unsupervised fashion (clustering).
• Classify based on training data (often by learning a hyperplane that separates).
• Select marker genes that are diagnostic for the class. All other genes can be discarded, leading to lower dimensionality.


### Dimensionality reduction
• Many genes have highly correlated expression profiles.
• By discarding some of the genes, we can greatly reduce the dimensionality of the problem.
• There are other, more principled ways to do such dimensionality reduction.


• With a high enough dimensionality, all points can be linearly separated.
• Recall that a point $x_i$ is misclassified if
• it is +ve, but $\beta^T x_i - \beta_0 \le 0$, or
• it is -ve, but $\beta^T x_i - \beta_0 > 0$.
• In the first case, choose $\delta_i$ s.t. $\beta^T x_i - \beta_0 + \delta_i \ge 0$.
• By adding a dimension for each misclassified point (sketched below), we create a higher-dimensional hyperplane that perfectly separates all of the points!
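Concretely (our notation, stated for reference): append one new coordinate per misclassified point, equal to 1 for that point and 0 for all others, and extend the hyperplane with weight $\delta_i$ in that coordinate:

$$x_i \mapsto (x_i,\; 0, \ldots, 1, \ldots, 0), \qquad \beta \mapsto (\beta,\; \delta_1, \delta_2, \ldots)$$

The extended $\beta^T x_i$ gains exactly $\delta_i$ for point $i$ and is unchanged for every other point.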


### Principal Components Analysis
• We get the intrinsic dimensionality of a data set.


### Principal Components Analysis
• Consider the expression values of 2 genes over 6 samples.
• Clearly, the expression of the two genes is highly correlated.
• Projecting all the points onto a single line could explain most of the data.
• This is a generalization of "discarding the gene".


### Projecting
• Consider the mean $m$ of all points, and a vector $\phi$ emanating from the mean.
• Algebraically, this projection onto $\phi$ means that every sample $x$ can be represented by a single value $\phi^T(x - m)$.

[Figure: a point $x$ projected onto the line through $m$ along $\phi$; its coordinate on the line is $\phi^T(x - m)$]


### Higher dimensions
• Consider a set of 2 (in general, $k$) orthonormal vectors $\phi_1, \phi_2, \ldots$
• Once projected, every sample $x$ can be represented by a 2-dimensional (in general, $k$-dimensional) vector:
$$\phi_1^T(x - m),\;\; \phi_2^T(x - m),\; \ldots$$

[Figure: $x - m$ projected onto the orthonormal directions $\phi_1$ and $\phi_2$]


### How to project
• The generic scheme allows us to project an $m$-dimensional surface onto a $k$-dimensional one.
• How do we select the $k$ 'best' dimensions?
• The strategy used by PCA is to maximize the variance of the projected points around the mean.


### PCA
• Suppose all of the data were to be reduced by projecting onto a single line $\phi$ from the mean.
• How do we select the line $\phi$?

[Figure: candidate projection lines through the mean $m$]


### PCA cont'd
• Let each point $x_k$ map to $x'_k = m + a_k\phi$. We want to minimize the squared error $\sum_k \|x'_k - x_k\|^2$.
• Observation 1: each point $x_k$ maps to $x'_k = m + \big(\phi^T(x_k - m)\big)\phi$
• (i.e., $a_k = \phi^T(x_k - m)$).

[Figure: $x_k$, its projection $x'_k$, and the mean $m$]


### Proof of Observation 1
The error as a function of the coefficients $a_k$ is
$$J = \sum_k \|m + a_k\phi - x_k\|^2 = \sum_k \Big[a_k^2 - 2a_k\,\phi^T(x_k - m) + \|x_k - m\|^2\Big]$$
(using $\phi^T\phi = 1$). Differentiating w.r.t. $a_k$:
$$2a_k - 2\,\phi^T(x_k - m) = 0 \;\Rightarrow\; a_k = \phi^T(x_k - m)$$

### Minimizing PCA Error
• To minimize the error, we must maximize $\phi^T S \phi$, where $S = \sum_k (x_k - m)(x_k - m)^T$ is the scatter matrix.
• Maximizing $\phi^T S \phi$ subject to $\|\phi\| = 1$ yields $S\phi = \lambda\phi$: $\lambda$ is an eigenvalue, and $\phi$ the corresponding eigenvector.
• Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.


### PCA steps
• $X$ = starting matrix with $n$ columns (one sample $x_j$ per column), $m$ rows.
• Compute the mean of the columns and subtract it from every column.
• Compute the scatter matrix $S$ of the centered columns and its top $k$ eigenvectors.
• Represent each sample $x_j$ by its $k$ projected coordinates (see the sketch below).
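A minimal numpy sketch of these steps (function name ours):

```python
import numpy as np

def pca(X, k):
    """X: m x n matrix, one sample per column. Returns the k x n matrix
    of projected coordinates phi_i^T (x_j - mean)."""
    mean = X.mean(axis=1, keepdims=True)   # mean of all samples
    Xc = X - mean                          # center every column
    S = Xc @ Xc.T                          # scatter matrix, m x m
    evals, evecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    top = evecs[:, ::-1][:, :k]            # k leading eigenvectors
    return top.T @ Xc

X = np.random.default_rng(1).normal(size=(6, 20))  # e.g. 6 genes, 20 samples
print(pca(X, 2).shape)                             # (2, 20)
```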


### End of Lecture


### ALL-AML classification
• The two leukemias need different therapeutic regimens.
• They are usually distinguished through hematopathology.
• Can gene expression be used for a more definitive test?
• 38 bone marrow samples; total mRNA was hybridized against probes for 6817 genes.
• Q: are these classes separable?


### Neighborhood analysis (cont'd)
• Each gene is represented by an expression vector $v(g) = (e_1, e_2, \ldots, e_n)$.
• Choose an idealized expression vector as the center.
• Discriminating genes will be 'closer' to the center (any distance measure can be used).

[Figure: discriminating genes lie close to the idealized center]


### Neighborhood analysis
• Q: are there genes whose expression correlates with one of the two classes?
• A: for each class, create an idealized vector $c$.
• Compute the number of genes $N_c$ whose expression 'matches' the idealized expression vector.
• Is $N_c$ significantly larger than $N_{c^*}$ for a random $c^*$?


### Neighborhood test
• Distance measure used:
• For any binary vector $c$, let the 1 entries denote class 1, and the 0 entries denote class 2.
• Compute the mean and std. dev. $[\mu_1(g), \sigma_1(g)]$ of the expression of $g$ in class 1, and likewise $[\mu_2(g), \sigma_2(g)]$ in class 2.
• $P(g, c) = \dfrac{\mu_1(g) - \mu_2(g)}{\sigma_1(g) + \sigma_2(g)}$
• $N_1(c, r) = \{\,g : P(g, c) = r\,\}$
• A high density of genes at some $r$ is indicative of correlation with the class distinction (see the sketch below).
• A neighborhood is significant if a random center does not produce the same density.
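A small numpy sketch of the $P(g, c)$ score (function name ours):

```python
import numpy as np

def neighborhood_score(E, c):
    """E: genes x samples expression matrix; c: binary vector over samples
    (1 = class 1, 0 = class 2). Returns P(g, c) for every gene g."""
    c = np.asarray(c, dtype=bool)
    mu1, mu2 = E[:, c].mean(axis=1), E[:, ~c].mean(axis=1)
    sd1, sd2 = E[:, c].std(axis=1), E[:, ~c].std(axis=1)
    return (mu1 - mu2) / (sd1 + sd2)
```

Testing against random centers $c^*$ then amounts to permuting $c$ and recomputing the score distribution.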


### Neighborhood analysis
• $\#\{g : P(g, c) > 0.3\}$ = 709 (ALL) vs. 173 by chance.
• Class prediction should be possible using micro-array expression values.


### Class prediction
• Choose a fixed set of informative genes (based on their correlation with the class distinction).
• The predictor is uniquely defined by the sample and the subset of informative genes.
• For each informative gene $g$, define $(w_g, b_g)$:
• $w_g = P(g, c)$ (when is this +ve?)
• $b_g = [\mu_1(g) + \mu_2(g)]/2$
• Given a new sample $X$:
• $x_g$ is the normalized expression value at $g$
• vote of gene $g$ = $w_g(x_g - b_g)$ (a +ve value is a vote for class 1, a negative value for class 2); a sketch follows below.
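A minimal numpy sketch of the weighted-voting predictor (function name ours); it also computes the prediction strength defined on the next slide:

```python
import numpy as np

def weighted_vote(x, w, b):
    """x: expression of the informative genes in a new sample;
    w: weights w_g = P(g, c); b: midpoints b_g = (mu1 + mu2) / 2.
    Returns the predicted class and the prediction strength PS."""
    votes = w * (x - b)                        # signed vote of each gene
    v1 = votes[votes > 0].sum()                # total vote for class 1
    v2 = -votes[votes < 0].sum()               # total vote for class 2
    v_win, v_lose = max(v1, v2), min(v1, v2)
    ps = (v_win - v_lose) / (v_win + v_lose)   # margin of victory
    return (1 if v1 >= v2 else 2), ps
```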


### Prediction Strength
• $PS = (V_{win} - V_{lose}) / (V_{win} + V_{lose})$
• Reflects the margin of victory.
• A 50-gene predictor is correct on 36/38 samples (cross-validation).
• Prediction accuracy on other samples: 100% (predictions made for 29/34 samples).
• Median $PS$ = 0.73.
• Other predictors of between 10 and 200 genes all worked well.


### Performance


### Differentially expressed genes?
• Do the predictive genes reveal any biology?
• The initial expectation is that most genes would be of a hematopoietic lineage.
• However, many genes encode:
• cell cycle progression genes
• chromatin remodelling
• transcription
• known oncogenes
• targets of leukemia drugs (etoposide)


### Relationship between ML and the Golub predictor

Maximum likelihood (ML) classification, when the covariance matrix is diagonal with identical variance for the different classes, is similar to Golub's classifier.


### Automatic class discovery
• The classification of different cancers has come from years of hypothesis-driven research.
• Suppose you were given unlabeled samples of ALL/AML. Would you be able to distinguish the two classes?


### Self Organizing Maps
• An SOM was applied to group the 38 samples.
• Class A1 contained 24/25 ALL samples and 3/13 AML samples.
• How can we validate this?
• Use the labels to do supervised classification via cross-validation.
• A 20-gene predictor gave 34 accurate predictions, 1 error, and 2 of 3 uncertains.


### Conclusion
