
Classification (SVMs / Kernel method)

Presentation Transcript


  1. Classification (SVMs / Kernel method) Bafna/Ideker

  2. LP versus quadratic programming
  • LP: linear constraints, linear objective function. LP can be solved in polynomial time.
  • In QP, the objective function contains a quadratic form. For +ve semidefinite Q, the QP can be solved in polynomial time.
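
For reference (an addition, not from the slides), the standard QP form being contrasted with LP can be written as:

```latex
% General quadratic program: quadratic objective (the x^T Q x term), linear constraints.
% When Q is positive semidefinite the problem is convex and solvable in polynomial time;
% an LP is the special case Q = 0.
\[
\min_{x} \; \tfrac{1}{2}\, x^{T} Q\, x + c^{T} x
\qquad \text{subject to} \qquad A x \le b .
\]
```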

  3. Margin of separation
  • Suppose we find a separating hyperplane (β, β0) s.t.
  • For all +ve points x: βᵀx − β0 ≥ 1
  • For all −ve points x: βᵀx − β0 ≤ −1
  • What is the margin of separation?
  [Figure: the hyperplanes βᵀx − β0 = 1, βᵀx − β0 = 0, and βᵀx − β0 = −1]
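
A short worked step, not on the slide itself: the two bounding hyperplanes are parallel, so the margin follows directly from their offsets.

```latex
% The bounding hyperplanes beta^T x - beta_0 = +1 and beta^T x - beta_0 = -1 are parallel,
% so the margin is the gap between them measured along the normal direction beta:
\[
\text{margin} \;=\; \frac{(\beta_0 + 1) - (\beta_0 - 1)}{\lVert \beta \rVert}
            \;=\; \frac{2}{\lVert \beta \rVert},
\]
% hence maximizing the margin is equivalent to minimizing ||beta||^2.
```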

  4. Separating by a wider margin
  • Solutions with a wider margin are better.

  5. Separating via misclassification
  • In general, data is not linearly separable
  • What if we also wanted to minimize misclassified points?
  • Recall that each sample xi in our training set has a label yi ∈ {−1, 1}
  • For each point i, yi(βᵀxi − β0) should be positive
  • Define ξi ≥ max{0, 1 − yi(βᵀxi − β0)}
  • If xi is correctly classified (yi(βᵀxi − β0) ≥ 1), then ξi = 0
  • If xi is misclassified, or too close to the boundary, then ξi > 0
  • We must minimize Σi ξi
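
A minimal NumPy sketch (my own illustration; array names are hypothetical) of computing the slack values ξi for a candidate hyperplane:

```python
import numpy as np

def slacks(X, y, beta, beta0):
    """Hinge slack xi_i = max(0, 1 - y_i * (beta^T x_i - beta_0)) for each sample.

    X: (n_samples, n_features), y: labels in {-1, +1},
    beta: (n_features,), beta0: scalar offset.
    """
    margins = y * (X @ beta - beta0)          # y_i * (beta^T x_i - beta_0)
    return np.maximum(0.0, 1.0 - margins)     # zero for well-classified points

# Toy usage: two 2-D points, one on each side of the hyperplane x1 - x2 = 0.
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([+1, -1])
print(slacks(X, y, beta=np.array([1.0, -1.0]), beta0=0.0))  # -> [0., 0.]
```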

  6. Support Vector machines (wide margin and misclassification)
  • Maximize the margin while minimizing misclassification
  • Solved using non-linear optimization techniques
  • The problem can be reformulated to use only dot products of the variables, which allows us to employ the kernel method.
  • This gives the method a lot of power.
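
As a concrete illustration (not part of the slides), a soft-margin SVM can be fit with scikit-learn; the parameter C trades margin width against the total slack Σi ξi:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy Gaussian clouds as a stand-in for expression profiles of two classes.
X = np.vstack([rng.normal(-1, 1, size=(20, 2)), rng.normal(+1, 1, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0)    # small C -> wider margin, more slack allowed
clf.fit(X, y)
print(clf.support_vectors_.shape)    # the points that define the margin
print(clf.predict([[0.5, 0.5]]))     # classify a new sample
```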

  7. Reformulating the optimization

  8. Lagrangian relaxation
  • Goal: [the soft-margin objective], s.t. [the margin and slack constraints]
  • We minimize the Lagrangian (a reconstruction of these equations is sketched below)
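
The equations on slides 7–9 were lost in the transcript. A hedged reconstruction of the soft-margin objective and its Lagrangian, using the notation β, β0, ξ, α, μ adopted above:

```latex
% Soft-margin primal: a wide margin (small ||beta||) plus a small total slack.
\[
\min_{\beta,\,\beta_0,\,\xi} \;\; \tfrac{1}{2}\lVert\beta\rVert^{2} + C\sum_i \xi_i
\qquad \text{s.t.} \qquad
y_i\,(\beta^{T} x_i - \beta_0) \;\ge\; 1 - \xi_i, \qquad \xi_i \ge 0 .
\]
% Lagrangian, with multipliers alpha_i >= 0 and mu_i >= 0 for the two constraint families:
\[
L(\beta,\beta_0,\xi;\alpha,\mu) \;=\;
\tfrac{1}{2}\lVert\beta\rVert^{2} + C\sum_i \xi_i
\;-\; \sum_i \alpha_i\bigl[\, y_i(\beta^{T} x_i - \beta_0) - 1 + \xi_i \,\bigr]
\;-\; \sum_i \mu_i\, \xi_i .
\]
```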

  9. Simplifying
  • For fixed α ≥ 0 and μ ≥ 0, we minimize the Lagrangian over β, β0, and ξ

  10. Substituting
  • Substituting condition (1) into the Lagrangian (conditions (1)–(3) are sketched after slide 11)

  11. Substituting (2, 3), we have the minimization problem in the αi alone (sketched below)
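
The conditions (1)–(3) and the resulting problem did not survive the transcript. A hedged reconstruction, following the standard derivation the slides outline (written here as a maximization; minimizing its negative is equivalent):

```latex
% Setting the partial derivatives of L to zero gives the three conditions:
\[
\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_i \alpha_i y_i x_i \;\;(1),
\qquad
\frac{\partial L}{\partial \beta_0} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0 \;\;(2),
\qquad
\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i + \mu_i = C \;\;(3).
\]
% Substituting them back leaves a problem in the alpha_i alone, in which the data
% appear only through dot products x_i^T x_j:
\[
\max_{\alpha} \;\; \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^{T} x_j
\qquad \text{s.t.} \qquad 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0 .
\]
```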

  12. Classification using SVMs
  • Under these conditions, the problem is a quadratic programming problem and can be solved using known techniques
  • Quiz: When we have solved this QP, how do we classify a point x?
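
One possible answer to the quiz (not given on this slide): classify by the sign of the decision function, which again involves the training points only through dot products:

```latex
\[
f(x) \;=\; \operatorname{sign}\bigl(\beta^{T} x - \beta_0\bigr)
      \;=\; \operatorname{sign}\Bigl(\sum_i \alpha_i\, y_i\, x_i^{T} x \;-\; \beta_0\Bigr),
\]
% using condition (1); beta_0 can be recovered from any support vector with 0 < alpha_i < C.
```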

  13. The kernel method
  • The SVM formulation can be solved using QP on dot products.
  • As these are wide-margin classifiers, they provide a more robust solution.
  • However, the true power of the SVM approach comes from using 'the kernel method', which allows us to go to higher-dimensional (and non-linear) spaces.

  14. Kernel
  • Let X be the set of objects
  • Ex: X = the set of samples in micro-arrays
  • Each object x ∈ X is a vector of gene expression values
  • k: X × X → R is a positive semidefinite kernel if:
  • k is symmetric
  • k is +ve semidefinite

  15. Kernels as dot-products
  • Quiz: Suppose the objects x are all real vectors (as in gene expression)
  • Define kL(x, x′) = xᵀx′
  • Is kL a kernel? It is symmetric, but is it +ve semidefinite?

  16. Linear kernel is +ve semidefinite
  • Recall X as a matrix, such that each column is a sample: X = [x1 x2 …]
  • By definition, the linear kernel (Gram) matrix is kL = XᵀX
  • For any c: cᵀ(XᵀX)c = (Xc)ᵀ(Xc) = ‖Xc‖² ≥ 0
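
A small numerical sanity check of this argument (my own illustration): build a linear Gram matrix from random data and confirm that its eigenvalues, and the quadratic form cᵀKc, are non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 features (rows) x 8 samples (columns)
K = X.T @ X                        # linear kernel / Gram matrix, K[i, j] = x_i . x_j

eigvals = np.linalg.eigvalsh(K)    # eigenvalues of the symmetric matrix K
print(np.all(eigvals >= -1e-10))   # True: K is positive semidefinite (up to round-off)

# The quadratic-form argument directly: c^T K c = ||X c||^2 >= 0.
c = rng.normal(size=8)
print(c @ K @ c >= -1e-10, np.allclose(c @ K @ c, np.linalg.norm(X @ c) ** 2))
```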

  17. Generalizing kernels
  • Any object can be represented by a feature vector in real space.

  18. Generalizing
  • Note that the feature mapping could actually be non-linear.
  • On the flip side, every kernel can be represented as a dot-product in a high-dimensional space.
  • Sometimes the kernel is easier to define than the mapping Φ.

  19. The kernel trick
  • If an algorithm for vectorial data is expressed exclusively in the form of dot-products, it can be changed to an algorithm on an arbitrary kernel
  • Simply replace the dot-product by the kernel

  20. Kernel trick example
  • Consider a kernel k defined through a mapping Φ: k(x, x′) = Φ(x)ᵀΦ(x′)
  • It could be that Φ is very difficult to compute explicitly, but k is easy to compute
  • Suppose we define a distance function between two objects as d(x, x′) = ‖Φ(x) − Φ(x′)‖
  • How do we compute this distance?
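
A minimal sketch of the answer, assuming the distance is the usual feature-space Euclidean distance ‖Φ(x) − Φ(x′)‖: expand the squared norm so that only kernel evaluations remain. The kernel used in the toy check is my own choice.

```python
import numpy as np

def kernel_distance(x, xp, k):
    """Feature-space distance ||Phi(x) - Phi(x')|| from kernel calls only:
    ||Phi(x) - Phi(x')||^2 = k(x, x) - 2 k(x, x') + k(x', x')."""
    return np.sqrt(k(x, x) - 2.0 * k(x, xp) + k(xp, xp))

# Toy usage with a degree-2 polynomial kernel, whose explicit feature map
# we never need to write down.
poly2 = lambda a, b: (np.dot(a, b) + 1.0) ** 2
x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(kernel_distance(x, xp, poly2))
```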

  21. Kernels and SVMs
  • Recall that SVM-based classification is described by the dual problem above, which depends on the data only through the dot products xiᵀxj

  22. Kernels and SVMs
  • Applying the kernel trick: replace each dot product xiᵀxj with k(xi, xj)
  • We can try kernels that are biologically relevant

  23. Examples of kernels for vectors
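
The kernel formulas on this slide did not survive the transcript; below is a sketch of the standard vector kernels the deck names elsewhere (linear, polynomial, RBF), with hypothetical parameter defaults.

```python
import numpy as np

def linear_kernel(x, xp):
    return np.dot(x, xp)                                   # k(x, x') = x . x'

def polynomial_kernel(x, xp, degree=3, c=1.0):
    return (np.dot(x, xp) + c) ** degree                   # k(x, x') = (x . x' + c)^d

def rbf_kernel(x, xp, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xp) ** 2))          # k(x, x') = exp(-gamma ||x - x'||^2)

x, xp = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 1.0])
for k in (linear_kernel, polynomial_kernel, rbf_kernel):
    print(k.__name__, k(x, xp))
```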

  24. String kernel
  • Consider a string s = s1 s2 …
  • Define an index set I as a subset of indices
  • s[I] is the substring limited to those indices
  • l(I) = span of I (last index − first index + 1)
  • W(I) = c^l(I), with c < 1
  • Weight decreases as the span increases
  • For any string u of length k, the feature Φu(s) sums W(I) over all index sets I with s[I] = u

  25. String Kernel
  • Map every string to a |Σ|^n-dimensional space, indexed by all strings u of length up to n
  • The mapping is expensive, but given two strings s, t, the dot-product kernel k(s, t) = Φ(s)ᵀΦ(t) can be computed in O(n |s| |t|) time
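
The efficient O(n |s| |t|) dynamic program is not shown in the transcript; below is a deliberately naive (exponential-time) sketch of the same gap-weighted kernel, useful only to make the definition concrete. The function name and toy strings are my own.

```python
from itertools import combinations

def gap_weighted_string_kernel(s, t, k=2, c=0.5):
    """Naive gap-weighted subsequence kernel of order k, for illustration only.

    k(s, t) = sum over index sets I of s and J of t with s[I] == t[J], |I| = |J| = k,
              of c**span(I) * c**span(J), where span(I) = I[-1] - I[0] + 1.
    The slides' O(n|s||t|) dynamic program computes the same quantity efficiently.
    """
    def features(x):
        feats = {}
        for I in combinations(range(len(x)), k):
            u = "".join(x[i] for i in I)           # the subsequence s[I]
            span = I[-1] - I[0] + 1                # l(I)
            feats[u] = feats.get(u, 0.0) + c ** span
        return feats

    fs, ft = features(s), features(t)
    return sum(w * ft.get(u, 0.0) for u, w in fs.items())

print(gap_weighted_string_kernel("cat", "cart", k=2, c=0.5))
```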

  26. SVM conclusion
  • SVMs are a generic scheme for classifying data with wide margins and few misclassifications
  • For data that is not easily represented as vectors, the kernel trick provides a standard recipe for classification
  • Define a meaningful kernel, and solve using an SVM
  • Many standard kernels are available (linear, polynomial, RBF, string)

  27. Classification review
  • We started out by treating the classification problem as one of separating points in high-dimensional space
  • Obvious for gene expression data, but applicable to any kind of data
  • Question of separability, linear separation
  • Algorithms for classification: perceptron, linear discriminant, maximum likelihood, linear programming, SVMs
  • Kernel methods & SVMs

  28. Classification review
  • Recall that we considered 3 problems:
  • Group together samples in an unsupervised fashion (clustering)
  • Classify based on training data (often by learning a hyperplane that separates).
  • Selection of marker genes that are diagnostic for the class. All other genes can be discarded, leading to lower dimensionality.

  29. Dimensionality reduction
  • Many genes have highly correlated expression profiles.
  • By discarding some of the genes, we can greatly reduce the dimensionality of the problem.
  • There are other, more principled ways to do such dimensionality reduction.

  30. Why is high dimensionality bad?
  • With a high enough dimensionality, all points can be linearly separated.
  • Recall that a point xi is misclassified if:
  • it is +ve, but βᵀxi − β0 ≤ 0
  • it is −ve, but βᵀxi − β0 > 0
  • In the first case, choose δi s.t. βᵀxi − β0 + δi ≥ 0
  • By adding a dimension for each misclassified point, we create a higher-dimensional hyperplane that perfectly separates all of the points!

  31. Principal Components Analysis
  • We get the intrinsic dimensionality of a data-set.

  32. Principal Components Analysis
  • Consider the expression values of 2 genes over 6 samples.
  • Clearly, the expression of the two genes is highly correlated.
  • Projecting all the points onto a single line could explain most of the data.
  • This is a generalization of "discarding the gene".

  33. Projecting
  • Consider the mean of all points, m, and a vector e emanating from the mean
  • Algebraically, this projection onto e means that each sample x can be represented by a single value eᵀ(x − m)
  [Figure: x − m projected onto the direction e, giving the scalar eᵀ(x − m)]

  34. Higher dimensions
  • Consider a set of 2 (more generally, k) orthonormal vectors e1, e2, …
  • Once projected, each sample x can be represented by a 2- (k-) dimensional vector: (e1ᵀ(x − m), e2ᵀ(x − m), …)
  [Figure: x − m projected onto e1 and e2]

  35. How to project
  • The generic scheme allows us to project m-dimensional data into a k-dimensional space.
  • How do we select the k 'best' dimensions?
  • The strategy used by PCA is one that maximizes the variance of the projected points around the mean

  36. PCA
  • Suppose all of the data were to be reduced by projecting onto a single line (direction e) through the mean.
  • How do we select the line e?

  37. PCA cont'd
  • Let each point xk map to x'k = m + ak·e. We want to minimize the error Σk ‖x'k − xk‖²
  • Observation 1: Each point xk maps to x'k = m + eᵀ(xk − m)·e
  • (i.e., ak = eᵀ(xk − m))

  38. Proof of Observation 1
  • Differentiate the error w.r.t. ak and set it to zero (a reconstruction is sketched after slide 39)

  39. Minimizing PCA Error
  • To minimize the error, we must maximize eᵀSe (S is the scatter matrix of the data)
  • Setting up the constrained maximization (‖e‖ = 1) gives Se = λe, i.e., λ is an eigenvalue and e the corresponding eigenvector, with λ = eᵀSe.
  • Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.
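
The algebra behind slides 38–39 was lost in the transcript; here is a hedged reconstruction using the usual scatter-matrix derivation (the symbols e, S, and λ are my choices):

```latex
% Observation 1: minimize the squared reconstruction error over the coefficients a_k
% (with ||e|| = 1):
\[
J(e, a) = \sum_k \lVert m + a_k e - x_k \rVert^{2},
\qquad
\frac{\partial J}{\partial a_k} = 2 a_k - 2\, e^{T}(x_k - m) = 0
\;\Rightarrow\; a_k = e^{T}(x_k - m).
\]
% Substituting a_k back shows that minimizing J is equivalent to maximizing e^T S e,
% where S = sum_k (x_k - m)(x_k - m)^T is the scatter matrix. Maximizing subject to
% e^T e = 1 with a Lagrange multiplier lambda:
\[
\frac{\partial}{\partial e}\Bigl[\, e^{T} S\, e - \lambda\,(e^{T} e - 1) \,\Bigr] = 0
\;\Rightarrow\; S e = \lambda e,
\qquad e^{T} S e = \lambda .
\]
```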

  40. PCA steps
  • X = starting matrix with n columns, m rows
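
The remaining steps on this slide were lost in the transcript; here is a minimal NumPy sketch of the usual procedure (center, form the scatter/covariance matrix, keep the top-k eigenvectors, project). Function and variable names are my own.

```python
import numpy as np

def pca_project(X, k):
    """PCA on a data matrix X with one sample per column (m genes x n samples).

    Returns the top-k eigenvectors (as columns) and the k x n matrix of projections.
    """
    m = X.mean(axis=1, keepdims=True)            # mean expression profile
    Xc = X - m                                   # center every sample
    S = Xc @ Xc.T / X.shape[1]                   # scatter / covariance matrix (m x m)
    eigvals, eigvecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :k]                # k directions of largest variance
    return top, top.T @ Xc                       # projections: e_j^T (x - m)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 20))                     # toy data: 6 genes x 20 samples
E, Z = pca_project(X, k=2)
print(E.shape, Z.shape)                          # (6, 2) (2, 20)
```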

  41. End of Lecture


  43. ALL-AML classification
  • The two leukemias need different therapeutic regimens.
  • Usually distinguished through hematopathology
  • Can gene expression be used for a more definitive test?
  • 38 bone marrow samples
  • Total mRNA was hybridized against probes for 6,817 genes
  • Q: Are these classes separable?

  44. Neighborhood analysis (cont'd)
  • Each gene is represented by an expression vector v(g) = (e1, e2, …, en)
  • Choose an idealized expression vector as the center.
  • Discriminating genes will be 'closer' to the center (any distance measure can be used).

  45. Neighborhood analysis
  • Q: Are there genes whose expression correlates with one of the two classes?
  • A: For each class, create an idealized vector c
  • Compute the number of genes Nc whose expression 'matches' the idealized expression vector
  • Is Nc significantly larger than Nc* for a random c*?

  46. Neighborhood test
  • Distance measure used:
  • For any binary vector c, let the 1 entries denote class 1 and the 0 entries denote class 2
  • Compute the mean and std. dev. [μ1(g), σ1(g)] of expression in class 1, and also [μ2(g), σ2(g)] in class 2.
  • P(g, c) = [μ1(g) − μ2(g)] / [σ1(g) + σ2(g)]
  • N1(c, r) = {g | P(g, c) = r}
  • High density for some r is indicative of correlation with the class distinction
  • A neighborhood is significant if a random center does not produce the same density.
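
A small sketch (my own, with hypothetical variable names) of the signal-to-noise score P(g, c) computed for every gene at once:

```python
import numpy as np

def neighborhood_score(expr, c):
    """P(g, c) = (mu1(g) - mu2(g)) / (sigma1(g) + sigma2(g)) for every gene.

    expr: (n_genes, n_samples) expression matrix.
    c:    binary vector of length n_samples; 1 marks class 1, 0 marks class 2.
    """
    c = np.asarray(c, dtype=bool)
    mu1, mu2 = expr[:, c].mean(axis=1), expr[:, ~c].mean(axis=1)
    sd1, sd2 = expr[:, c].std(axis=1), expr[:, ~c].std(axis=1)
    return (mu1 - mu2) / (sd1 + sd2)

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 38))                 # toy: 100 genes x 38 samples
c = np.array([1] * 27 + [0] * 11)                 # e.g., 27 samples of one class, 11 of the other
P = neighborhood_score(expr, c)
print((P > 0.3).sum(), "genes with P(g, c) > 0.3")
```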

  47. Neighborhood analysis
  • #{g | P(g, c) > 0.3} > 709 (ALL) vs. 173 by chance.
  • Class prediction should be possible using micro-array expression values.

  48. Class prediction
  • Choose a fixed set of informative genes (based on their correlation with the class distinction).
  • The predictor is uniquely defined by the sample and the subset of informative genes.
  • For each informative gene g, define (wg, bg):
  • wg = P(g, c) (when is this +ve?)
  • bg = [μ1(g) + μ2(g)] / 2
  • Given a new sample X:
  • xg is the normalized expression value at g
  • Vote of gene g: vg = wg (xg − bg) (a +ve value is a vote for class 1, a negative value for class 2)
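
A sketch of the weighted-voting predictor and of the prediction strength defined on the next slide (variable names are my own):

```python
import numpy as np

def weighted_vote(x, w, b):
    """Weighted voting over the informative genes.

    x: normalized expression of the informative genes in one new sample.
    w: per-gene weights w_g = P(g, c); b: per-gene decision boundaries b_g.
    Returns (winning class, prediction strength PS).
    """
    votes = w * (x - b)                           # +ve votes back class 1, -ve back class 2
    v1 = votes[votes > 0].sum()                   # total vote for class 1
    v2 = -votes[votes < 0].sum()                  # total vote for class 2 (magnitude)
    v_win, v_lose = max(v1, v2), min(v1, v2)
    ps = (v_win - v_lose) / (v_win + v_lose)      # margin of victory in [0, 1]
    return (1 if v1 >= v2 else 2), ps

# Toy usage with three informative genes.
print(weighted_vote(np.array([1.2, -0.4, 0.8]),
                    w=np.array([0.9, -0.5, 0.3]),
                    b=np.array([0.1, 0.0, 0.2])))
```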

  49. Prediction Strength
  • PS = [Vwin − Vlose] / [Vwin + Vlose]
  • Reflects the margin of victory
  • A 50-gene predictor is correct on 36/38 samples (cross-validation)
  • Prediction accuracy on other samples: 100% (a prediction was made for 29/34 samples)
  • Median PS = 0.73
  • Other predictors with between 10 and 200 genes all worked well.

  50. Performance
