
Classification (SVMs / Kernel method)

Presentation Transcript


  1. Classification (SVMs / Kernel method) Bafna/Ideker

  2. LP versus quadratic programming
  • LP: linear constraints, linear objective function. LP can be solved in polynomial time.
  • In QP, the objective function contains a quadratic form. For +ve semidefinite Q, the QP can be solved in polynomial time.
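
For reference (an addition, not from the slides), the standard QP form being contrasted with LP can be written as:

```latex
% General quadratic program: quadratic objective (the x^T Q x term), linear constraints.
% When Q is positive semidefinite the problem is convex and solvable in polynomial time;
% an LP is the special case Q = 0.
\[
\min_{x} \; \tfrac{1}{2}\, x^{T} Q\, x + c^{T} x
\qquad \text{subject to} \qquad A x \le b .
\]
```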

  3. Margin of separation
  • Suppose we find a separating hyperplane (β, β0) s.t.
  • For all +ve points x: βᵀx − β0 ≥ 1
  • For all −ve points x: βᵀx − β0 ≤ −1
  • What is the margin of separation?
  [Figure: the hyperplanes βᵀx − β0 = 1, βᵀx − β0 = 0, and βᵀx − β0 = −1]
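
A short worked step, not on the slide itself: the two bounding hyperplanes are parallel, so the margin follows directly from their offsets.

```latex
% The bounding hyperplanes beta^T x - beta_0 = +1 and beta^T x - beta_0 = -1 are parallel,
% so the margin is the gap between them measured along the normal direction beta:
\[
\text{margin} \;=\; \frac{(\beta_0 + 1) - (\beta_0 - 1)}{\lVert \beta \rVert}
            \;=\; \frac{2}{\lVert \beta \rVert},
\]
% hence maximizing the margin is equivalent to minimizing ||beta||^2.
```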

  4. Separating by a wider margin
  • Solutions with a wider margin are better.

  5. Separating via misclassification
  • In general, data is not linearly separable
  • What if we also wanted to minimize misclassified points?
  • Recall that each sample xi in our training set has a label yi ∈ {−1, 1}
  • For each point i, yi(βᵀxi − β0) should be positive
  • Define ξi ≥ max{0, 1 − yi(βᵀxi − β0)}
  • If xi is correctly classified (yi(βᵀxi − β0) ≥ 1), then ξi = 0
  • If xi is misclassified, or too close to the boundary, then ξi > 0
  • We must minimize Σi ξi
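
A minimal NumPy sketch (my own illustration; array names are hypothetical) of computing the slack values ξi for a candidate hyperplane:

```python
import numpy as np

def slacks(X, y, beta, beta0):
    """Hinge slack xi_i = max(0, 1 - y_i * (beta^T x_i - beta_0)) for each sample.

    X: (n_samples, n_features), y: labels in {-1, +1},
    beta: (n_features,), beta0: scalar offset.
    """
    margins = y * (X @ beta - beta0)          # y_i * (beta^T x_i - beta_0)
    return np.maximum(0.0, 1.0 - margins)     # zero for well-classified points

# Toy usage: two 2-D points, one on each side of the hyperplane x1 - x2 = 0.
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([+1, -1])
print(slacks(X, y, beta=np.array([1.0, -1.0]), beta0=0.0))  # -> [0., 0.]
```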

  6. Support Vector machines (wide margin and misclassification)
  • Maximize the margin while minimizing misclassification
  • Solved using non-linear optimization techniques
  • The problem can be reformulated to use only dot products of the variables, which allows us to employ the kernel method.
  • This gives the method a lot of power.
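
As a concrete illustration (not part of the slides), a soft-margin SVM can be fit with scikit-learn; the parameter C trades margin width against the total slack Σi ξi:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy Gaussian clouds as a stand-in for expression profiles of two classes.
X = np.vstack([rng.normal(-1, 1, size=(20, 2)), rng.normal(+1, 1, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0)    # small C -> wider margin, more slack allowed
clf.fit(X, y)
print(clf.support_vectors_.shape)    # the points that define the margin
print(clf.predict([[0.5, 0.5]]))     # classify a new sample
```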

  7. Reformulating the optimization

  8. Lagrangian relaxation
  • Goal: [the soft-margin objective], s.t. [the margin and slack constraints]
  • We minimize the Lagrangian (a reconstruction of these equations is sketched below)
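
The equations on slides 7–9 were lost in the transcript. A hedged reconstruction of the soft-margin objective and its Lagrangian, using the notation β, β0, ξ, α, μ adopted above:

```latex
% Soft-margin primal: a wide margin (small ||beta||) plus a small total slack.
\[
\min_{\beta,\,\beta_0,\,\xi} \;\; \tfrac{1}{2}\lVert\beta\rVert^{2} + C\sum_i \xi_i
\qquad \text{s.t.} \qquad
y_i\,(\beta^{T} x_i - \beta_0) \;\ge\; 1 - \xi_i, \qquad \xi_i \ge 0 .
\]
% Lagrangian, with multipliers alpha_i >= 0 and mu_i >= 0 for the two constraint families:
\[
L(\beta,\beta_0,\xi;\alpha,\mu) \;=\;
\tfrac{1}{2}\lVert\beta\rVert^{2} + C\sum_i \xi_i
\;-\; \sum_i \alpha_i\bigl[\, y_i(\beta^{T} x_i - \beta_0) - 1 + \xi_i \,\bigr]
\;-\; \sum_i \mu_i\, \xi_i .
\]
```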

  9. Simplifying
  • For fixed α ≥ 0 and μ ≥ 0, we minimize the Lagrangian over β, β0, and ξ

  10. Substituting
  • Substituting condition (1) into the Lagrangian (conditions (1)–(3) are sketched after slide 11)

  11. Substituting (2, 3), we have the minimization problem in the αi alone (sketched below)
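
The conditions (1)–(3) and the resulting problem did not survive the transcript. A hedged reconstruction, following the standard derivation the slides outline (written here as a maximization; minimizing its negative is equivalent):

```latex
% Setting the partial derivatives of L to zero gives the three conditions:
\[
\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_i \alpha_i y_i x_i \;\;(1),
\qquad
\frac{\partial L}{\partial \beta_0} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0 \;\;(2),
\qquad
\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i + \mu_i = C \;\;(3).
\]
% Substituting them back leaves a problem in the alpha_i alone, in which the data
% appear only through dot products x_i^T x_j:
\[
\max_{\alpha} \;\; \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^{T} x_j
\qquad \text{s.t.} \qquad 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0 .
\]
```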

  12. Classification using SVMs
  • Under these conditions, the problem is a quadratic programming problem and can be solved using known techniques
  • Quiz: When we have solved this QP, how do we classify a point x?
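
One possible answer to the quiz (not given on this slide): classify by the sign of the decision function, which again involves the training points only through dot products:

```latex
\[
f(x) \;=\; \operatorname{sign}\bigl(\beta^{T} x - \beta_0\bigr)
      \;=\; \operatorname{sign}\Bigl(\sum_i \alpha_i\, y_i\, x_i^{T} x \;-\; \beta_0\Bigr),
\]
% using condition (1); beta_0 can be recovered from any support vector with 0 < alpha_i < C.
```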

  13. The kernel method
  • The SVM formulation can be solved using QP on dot products.
  • As these are wide-margin classifiers, they provide a more robust solution.
  • However, the true power of the SVM approach comes from using 'the kernel method', which allows us to go to higher-dimensional (and non-linear) spaces.

  14. Kernel
  • Let X be the set of objects
  • Ex: X = the set of samples in micro-arrays
  • Each object x ∈ X is a vector of gene expression values
  • k: X × X → R is a positive semidefinite kernel if:
  • k is symmetric
  • k is +ve semidefinite

  15. Kernels as dot-products
  • Quiz: Suppose the objects x are all real vectors (as in gene expression)
  • Define kL(x, x′) = xᵀx′
  • Is kL a kernel? It is symmetric, but is it +ve semidefinite?

  16. Linear kernel is +ve semidefinite
  • Recall X as a matrix, such that each column is a sample: X = [x1 x2 …]
  • By definition, the linear kernel (Gram) matrix is kL = XᵀX
  • For any c: cᵀ(XᵀX)c = (Xc)ᵀ(Xc) = ‖Xc‖² ≥ 0
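
A small numerical sanity check of this argument (my own illustration): build a linear Gram matrix from random data and confirm that its eigenvalues, and the quadratic form cᵀKc, are non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 features (rows) x 8 samples (columns)
K = X.T @ X                        # linear kernel / Gram matrix, K[i, j] = x_i . x_j

eigvals = np.linalg.eigvalsh(K)    # eigenvalues of the symmetric matrix K
print(np.all(eigvals >= -1e-10))   # True: K is positive semidefinite (up to round-off)

# The quadratic-form argument directly: c^T K c = ||X c||^2 >= 0.
c = rng.normal(size=8)
print(c @ K @ c >= -1e-10, np.allclose(c @ K @ c, np.linalg.norm(X @ c) ** 2))
```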

  17. Generalizing kernels
  • Any object can be represented by a feature vector in real space.

  18. Generalizing
  • Note that the feature mapping could actually be non-linear.
  • On the flip side, every kernel can be represented as a dot-product in a high-dimensional space.
  • Sometimes the kernel is easier to define than the mapping Φ.

  19. The kernel trick
  • If an algorithm for vectorial data is expressed exclusively in the form of dot-products, it can be changed to an algorithm on an arbitrary kernel
  • Simply replace the dot-product by the kernel

  20. Kernel trick example
  • Consider a kernel k defined through a mapping Φ: k(x, x′) = Φ(x)ᵀΦ(x′)
  • It could be that Φ is very difficult to compute explicitly, but k is easy to compute
  • Suppose we define a distance function between two objects as d(x, x′) = ‖Φ(x) − Φ(x′)‖
  • How do we compute this distance?
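
A minimal sketch of the answer, assuming the distance is the usual feature-space Euclidean distance ‖Φ(x) − Φ(x′)‖: expand the squared norm so that only kernel evaluations remain. The kernel used in the toy check is my own choice.

```python
import numpy as np

def kernel_distance(x, xp, k):
    """Feature-space distance ||Phi(x) - Phi(x')|| from kernel calls only:
    ||Phi(x) - Phi(x')||^2 = k(x, x) - 2 k(x, x') + k(x', x')."""
    return np.sqrt(k(x, x) - 2.0 * k(x, xp) + k(xp, xp))

# Toy usage with a degree-2 polynomial kernel, whose explicit feature map
# we never need to write down.
poly2 = lambda a, b: (np.dot(a, b) + 1.0) ** 2
x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(kernel_distance(x, xp, poly2))
```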

  21. Kernels and SVMs
  • Recall that SVM-based classification is described by the dual problem above, which depends on the data only through the dot products xiᵀxj

  22. Kernels and SVMs
  • Applying the kernel trick: replace each dot product xiᵀxj with k(xi, xj)
  • We can try kernels that are biologically relevant

  23. Examples of kernels for vectors
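
The kernel formulas on this slide did not survive the transcript; below is a sketch of the standard vector kernels the deck names elsewhere (linear, polynomial, RBF), with hypothetical parameter defaults.

```python
import numpy as np

def linear_kernel(x, xp):
    return np.dot(x, xp)                                   # k(x, x') = x . x'

def polynomial_kernel(x, xp, degree=3, c=1.0):
    return (np.dot(x, xp) + c) ** degree                   # k(x, x') = (x . x' + c)^d

def rbf_kernel(x, xp, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xp) ** 2))          # k(x, x') = exp(-gamma ||x - x'||^2)

x, xp = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 1.0])
for k in (linear_kernel, polynomial_kernel, rbf_kernel):
    print(k.__name__, k(x, xp))
```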

  24. String kernel
  • Consider a string s = s1 s2 …
  • Define an index set I as a subset of indices
  • s[I] is the substring limited to those indices
  • l(I) = span of I (last index − first index + 1)
  • W(I) = c^l(I), with c < 1
  • Weight decreases as the span increases
  • For any string u of length k, the feature Φu(s) sums W(I) over all index sets I with s[I] = u

  25. String Kernel
  • Map every string to a |Σ|^n-dimensional space, indexed by all strings u of length up to n
  • The mapping is expensive, but given two strings s, t, the dot-product kernel k(s, t) = Φ(s)ᵀΦ(t) can be computed in O(n |s| |t|) time
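
The efficient O(n |s| |t|) dynamic program is not shown in the transcript; below is a deliberately naive (exponential-time) sketch of the same gap-weighted kernel, useful only to make the definition concrete. The function name and toy strings are my own.

```python
from itertools import combinations

def gap_weighted_string_kernel(s, t, k=2, c=0.5):
    """Naive gap-weighted subsequence kernel of order k, for illustration only.

    k(s, t) = sum over index sets I of s and J of t with s[I] == t[J], |I| = |J| = k,
              of c**span(I) * c**span(J), where span(I) = I[-1] - I[0] + 1.
    The slides' O(n|s||t|) dynamic program computes the same quantity efficiently.
    """
    def features(x):
        feats = {}
        for I in combinations(range(len(x)), k):
            u = "".join(x[i] for i in I)           # the subsequence s[I]
            span = I[-1] - I[0] + 1                # l(I)
            feats[u] = feats.get(u, 0.0) + c ** span
        return feats

    fs, ft = features(s), features(t)
    return sum(w * ft.get(u, 0.0) for u, w in fs.items())

print(gap_weighted_string_kernel("cat", "cart", k=2, c=0.5))
```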

  26. SVM conclusion
  • SVMs are a generic scheme for classifying data with wide margins and few misclassifications
  • For data that is not easily represented as vectors, the kernel trick provides a standard recipe for classification
  • Define a meaningful kernel, and solve using an SVM
  • Many standard kernels are available (linear, polynomial, RBF, string)

  27. Classification review
  • We started out by treating the classification problem as one of separating points in high-dimensional space
  • Obvious for gene expression data, but applicable to any kind of data
  • Question of separability, linear separation
  • Algorithms for classification: perceptron, linear discriminant, maximum likelihood, linear programming, SVMs
  • Kernel methods & SVMs

  28. Classification review
  • Recall that we considered 3 problems:
  • Group together samples in an unsupervised fashion (clustering)
  • Classify based on training data (often by learning a hyperplane that separates).
  • Selection of marker genes that are diagnostic for the class. All other genes can be discarded, leading to lower dimensionality.

  29. Dimensionality reduction
  • Many genes have highly correlated expression profiles.
  • By discarding some of the genes, we can greatly reduce the dimensionality of the problem.
  • There are other, more principled ways to do such dimensionality reduction.

  30. Why is high dimensionality bad?
  • With a high enough dimensionality, all points can be linearly separated.
  • Recall that a point xi is misclassified if:
  • it is +ve, but βᵀxi − β0 ≤ 0
  • it is −ve, but βᵀxi − β0 > 0
  • In the first case, choose δi s.t. βᵀxi − β0 + δi ≥ 0
  • By adding a dimension for each misclassified point, we create a higher-dimensional hyperplane that perfectly separates all of the points!

  31. Principal Components Analysis
  • We get the intrinsic dimensionality of a data-set.

  32. Principal Components Analysis
  • Consider the expression values of 2 genes over 6 samples.
  • Clearly, the expression of the two genes is highly correlated.
  • Projecting all the points onto a single line could explain most of the data.
  • This is a generalization of "discarding the gene".

  33. Projecting
  • Consider the mean of all points, m, and a vector e emanating from the mean
  • Algebraically, this projection onto e means that each sample x can be represented by a single value eᵀ(x − m)
  [Figure: x − m projected onto the direction e, giving the scalar eᵀ(x − m)]

  34. Higher dimensions
  • Consider a set of 2 (more generally, k) orthonormal vectors e1, e2, …
  • Once projected, each sample x can be represented by a 2- (k-) dimensional vector: (e1ᵀ(x − m), e2ᵀ(x − m), …)
  [Figure: x − m projected onto e1 and e2]

  35. How to project
  • The generic scheme allows us to project m-dimensional data into a k-dimensional space.
  • How do we select the k 'best' dimensions?
  • The strategy used by PCA is one that maximizes the variance of the projected points around the mean

  36. PCA
  • Suppose all of the data were to be reduced by projecting onto a single line (direction e) through the mean.
  • How do we select the line e?

  37. PCA cont'd
  • Let each point xk map to x'k = m + ak·e. We want to minimize the error Σk ‖x'k − xk‖²
  • Observation 1: Each point xk maps to x'k = m + eᵀ(xk − m)·e
  • (i.e., ak = eᵀ(xk − m))

  38. Proof of Observation 1
  • Differentiate the error w.r.t. ak and set it to zero (a reconstruction is sketched after slide 39)

  39. Minimizing PCA Error
  • To minimize the error, we must maximize eᵀSe (S is the scatter matrix of the data)
  • Setting up the constrained maximization (‖e‖ = 1) gives Se = λe, i.e., λ is an eigenvalue and e the corresponding eigenvector, with λ = eᵀSe.
  • Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.
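
The algebra behind slides 38–39 was lost in the transcript; here is a hedged reconstruction using the usual scatter-matrix derivation (the symbols e, S, and λ are my choices):

```latex
% Observation 1: minimize the squared reconstruction error over the coefficients a_k
% (with ||e|| = 1):
\[
J(e, a) = \sum_k \lVert m + a_k e - x_k \rVert^{2},
\qquad
\frac{\partial J}{\partial a_k} = 2 a_k - 2\, e^{T}(x_k - m) = 0
\;\Rightarrow\; a_k = e^{T}(x_k - m).
\]
% Substituting a_k back shows that minimizing J is equivalent to maximizing e^T S e,
% where S = sum_k (x_k - m)(x_k - m)^T is the scatter matrix. Maximizing subject to
% e^T e = 1 with a Lagrange multiplier lambda:
\[
\frac{\partial}{\partial e}\Bigl[\, e^{T} S\, e - \lambda\,(e^{T} e - 1) \,\Bigr] = 0
\;\Rightarrow\; S e = \lambda e,
\qquad e^{T} S e = \lambda .
\]
```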

  40. PCA steps
  • X = starting matrix with n columns, m rows
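
The remaining steps on this slide were lost in the transcript; here is a minimal NumPy sketch of the usual procedure (center, form the scatter/covariance matrix, keep the top-k eigenvectors, project). Function and variable names are my own.

```python
import numpy as np

def pca_project(X, k):
    """PCA on a data matrix X with one sample per column (m genes x n samples).

    Returns the top-k eigenvectors (as columns) and the k x n matrix of projections.
    """
    m = X.mean(axis=1, keepdims=True)            # mean expression profile
    Xc = X - m                                   # center every sample
    S = Xc @ Xc.T / X.shape[1]                   # scatter / covariance matrix (m x m)
    eigvals, eigvecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :k]                # k directions of largest variance
    return top, top.T @ Xc                       # projections: e_j^T (x - m)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 20))                     # toy data: 6 genes x 20 samples
E, Z = pca_project(X, k=2)
print(E.shape, Z.shape)                          # (6, 2) (2, 20)
```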

  41. End of Lecture


  43. ALL-AML classification
  • The two leukemias need different therapeutic regimens.
  • Usually distinguished through hematopathology
  • Can gene expression be used for a more definitive test?
  • 38 bone marrow samples
  • Total mRNA was hybridized against probes for 6,817 genes
  • Q: Are these classes separable?

  44. Neighborhood analysis (cont'd)
  • Each gene is represented by an expression vector v(g) = (e1, e2, …, en)
  • Choose an idealized expression vector as the center.
  • Discriminating genes will be 'closer' to the center (any distance measure can be used).

  45. Neighborhood analysis
  • Q: Are there genes whose expression correlates with one of the two classes?
  • A: For each class, create an idealized vector c
  • Compute the number of genes Nc whose expression 'matches' the idealized expression vector
  • Is Nc significantly larger than Nc* for a random c*?

  46. Neighborhood test
  • Distance measure used:
  • For any binary vector c, let the 1 entries denote class 1 and the 0 entries denote class 2
  • Compute the mean and std. dev. [μ1(g), σ1(g)] of expression in class 1, and also [μ2(g), σ2(g)] in class 2.
  • P(g, c) = [μ1(g) − μ2(g)] / [σ1(g) + σ2(g)]
  • N1(c, r) = {g | P(g, c) = r}
  • High density for some r is indicative of correlation with the class distinction
  • A neighborhood is significant if a random center does not produce the same density.
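
A small sketch (my own, with hypothetical variable names) of the signal-to-noise score P(g, c) computed for every gene at once:

```python
import numpy as np

def neighborhood_score(expr, c):
    """P(g, c) = (mu1(g) - mu2(g)) / (sigma1(g) + sigma2(g)) for every gene.

    expr: (n_genes, n_samples) expression matrix.
    c:    binary vector of length n_samples; 1 marks class 1, 0 marks class 2.
    """
    c = np.asarray(c, dtype=bool)
    mu1, mu2 = expr[:, c].mean(axis=1), expr[:, ~c].mean(axis=1)
    sd1, sd2 = expr[:, c].std(axis=1), expr[:, ~c].std(axis=1)
    return (mu1 - mu2) / (sd1 + sd2)

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 38))                 # toy: 100 genes x 38 samples
c = np.array([1] * 27 + [0] * 11)                 # e.g., 27 samples of one class, 11 of the other
P = neighborhood_score(expr, c)
print((P > 0.3).sum(), "genes with P(g, c) > 0.3")
```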

  47. Neighborhood analysis
  • #{g | P(g, c) > 0.3} > 709 (ALL) vs. 173 by chance.
  • Class prediction should be possible using micro-array expression values.

  48. Class prediction
  • Choose a fixed set of informative genes (based on their correlation with the class distinction).
  • The predictor is uniquely defined by the sample and the subset of informative genes.
  • For each informative gene g, define (wg, bg):
  • wg = P(g, c) (when is this +ve?)
  • bg = [μ1(g) + μ2(g)] / 2
  • Given a new sample X:
  • xg is the normalized expression value at g
  • Vote of gene g: vg = wg (xg − bg) (a +ve value is a vote for class 1, a negative value for class 2)
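
A sketch of the weighted-voting predictor and of the prediction strength defined on the next slide (variable names are my own):

```python
import numpy as np

def weighted_vote(x, w, b):
    """Weighted voting over the informative genes.

    x: normalized expression of the informative genes in one new sample.
    w: per-gene weights w_g = P(g, c); b: per-gene decision boundaries b_g.
    Returns (winning class, prediction strength PS).
    """
    votes = w * (x - b)                           # +ve votes back class 1, -ve back class 2
    v1 = votes[votes > 0].sum()                   # total vote for class 1
    v2 = -votes[votes < 0].sum()                  # total vote for class 2 (magnitude)
    v_win, v_lose = max(v1, v2), min(v1, v2)
    ps = (v_win - v_lose) / (v_win + v_lose)      # margin of victory in [0, 1]
    return (1 if v1 >= v2 else 2), ps

# Toy usage with three informative genes.
print(weighted_vote(np.array([1.2, -0.4, 0.8]),
                    w=np.array([0.9, -0.5, 0.3]),
                    b=np.array([0.1, 0.0, 0.2])))
```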

  49. Prediction Strength
  • PS = [Vwin − Vlose] / [Vwin + Vlose]
  • Reflects the margin of victory
  • A 50-gene predictor is correct on 36/38 samples (cross-validation)
  • Prediction accuracy on other samples: 100% (a prediction was made for 29/34 samples)
  • Median PS = 0.73
  • Other predictors with between 10 and 200 genes all worked well.

  50. Performance
