1 / 50

Microarray Analysis: Gene Expression Classification

This course covers topics on gene regulation, proteomic and transcript profiling, and classification algorithms for gene expression analysis using microarray data.

Download Presentation

Microarray Analysis: Gene Expression Classification

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. L15:Microarray analysis (Classification) Bafna

  2. Final exam syllabus • Take home • The entire course, but emphasis will be given to post-midterm lectures • HMMs, • Gene-finding, • mass spectrometry, • Micro-array analysis,

  3. Biol. Data analysis: Review Assembly Protein Sequence Analysis Sequence Analysis/ DNA signals Gene Finding Bafna

  4. Other static analysis is possible Genomic Analysis/ Pop. Genetics Assembly Protein Sequence Analysis Sequence Analysis Gene Finding Bafna ncRNA

  5. A Static picture of the cell is insufficient • Each Cell is continuously active, • Genes are being transcribed into RNA • RNA is translated into proteins • Proteins are PT modified and transported • Proteins perform various cellular functions • Can we probe the Cell dynamically? • Which transcripts are active? • Which proteins are active? • Which proteins interact? Gene Regulation Proteomic profiling Transcript profiling Bafna

  6. Protein quantification viaLC-MS Maps Peptide 2 I Peptide 1 m/z time Peptide 2 elution • A peptide/feature can be labeled with the triple (M,T,I): • monoisotopic M/Z, centroid retention time, and intensity • An LC-MS map is a collection of features x x x x x x x x x x x x x x x x x x x x m/z time Bafna

  7. Map 1 (normal) Map 2 (diseased) LC-Map Comparison for Quantification Bafna

  8. Quantitation: transcript versus Protein Expression Sample 1 Sample2 Protein 1 35 4 Protein 2 Protein 3 Our Goal is to construct a matrix as shown for proteins, and RNA, and use it to identify differentially expressed transcripts/proteins Bafna

  9. Transcript quantification • Instead of counting protein molecules, we count active transcripts in the cell. Bafna

  10. Transcript quantification with RNA-seq Bafna

  11. Quantification via microarrays Bafna

  12. Quantitation: transcript versus Protein Expression Sample 1 Sample2 Sample 1 Sample 2 Protein 1 35 4 100 20 mRNA1 Protein 2 mRNA1 Protein 3 mRNA1 mRNA1 mRNA1 Our Goal is to construct a matrix as shown for proteins, and RNA, and use it to identify differentially expressed transcripts/proteins Bafna

  13. Gene Expression Data • Gene Expression data: • Each row corresponds to a gene • Each column corresponds to an expression value • Can we separate the experiments into two or more classes? • Given a training set of two classes, can we build a classifier that places a new experiment in one of the two classes. s1 s2 s g Bafna

  14. Formalizing Classification • Classification problem: Find a surface (hyperplane) that will separate the classes • Given a new sample point, its class is then determined by which side of the surface it lies on. • How do we find the hyperplane? How do we find the side that a point lies on? 1 2 3 4 5 6 1 2 g1 3 1 .9 .8 .1 .2 .1 g2 .1 0 .2 .8 .7 .9 Bafna

  15. Basic geometry • What is ||x||2 ? • What is x/||x|| • Dot product? x=(x1,x2) y Bafna

  16. Dot Product • Let  be a unit vector. • |||| = 1 • Recall that • Tx = ||x|| cos  • What is Tx if x is orthogonal (perpendicular) to ? x   Tx = ||x|| cos  Bafna

  17. Find the unit vector that is perpendicular (normal to the hyperplane) Hyperplane • How can we define a hyperplane L? Bafna

  18. Points on the hyperplane • Consider a hyperplane L defined by unit vector , and distance 0 • Notes; • For all x  L, xTmust be the same, xT = 0 • For any two points x1, x2, • (x1- x2)T =0 x2 x1 Bafna

  19. Hyperplane properties • Given an arbitrary point x, what is the distance from x to the plane L? • D(x,L) = (Tx -0) • When are points x1 and x2 on different sides of the hyperplane? x 0 Bafna

  20. + x2 - x1 Separating by a hyperplane • Input: A training set of +ve & -ve examples • Goal: Find a hyperplane that separates the two classes. • Classification: A new point x is +ve if it lies on the +ve side of the hyperplane, -ve otherwise. • The hyperplane is represented by the line • {x:-0+1x1+2x2=0} Bafna

  21. +  x2 - x1 Error in classification • An arbitrarily chosen hyperplane might not separate the test. We need to minimize a mis-classification error • Error: sum of distances of the misclassified points. • Let yi=-1 for misclassified +ve example i, • yi=1 otherwise. • Other definitions are also possible. Bafna

  22. Gradient Descent • The function D() defines the error. • We follow an iterative refinement. In each step, refine  so the error is reduced. • Gradient descent is an approach to such iterative refinement. D() D’()  Bafna

  23. Rosenblatt’s perceptron learning algorithm Bafna

  24. Classification based on perceptron learning • Use Rosenblatt’s algorithm to compute the hyperplane L=(,0). • Assign x to class 1 if f(x) >= 0, and to class 2 otherwise. Bafna

  25. Perceptron learning • If many solutions are possible, it does not choose between solutions • If data is not linearly separable, it will converge to some local minimum. • Time of convergence is not well understood Bafna

  26. Supervised classification • We have learnt a generic scheme for understanding biological data • Represent each data point (gene expression samples) as a vector in high dimensional space. • In supervised classification, we have two classes of points Bafna

  27. +  x2 - x1 Supervised classification • We have learnt a generic scheme for understanding biological data • Represent each data point (gene expression samples) as a vector in high dimensional space. • In supervised classification, we have two classes of points. • We separate the classes using a linear surface (hyperplane) • The perceptron algorithm is one of the simplest to implement Bafna

  28. End of Lecture Bafna

  29. +  x2 - x1 Linear Discriminant analysis • Provides an alternative approach to classification with a linear function. • Project all points, including the means, onto vector . • We want to choose  such that • Difference of projected means is large. • Variance within group is small Bafna

  30. + + 1 x2 x2 - - 2 x1 x1 Choosing the right  • 1 is a better choice than 2 as the variance within a group is small, and difference of means is large. • How do we compute the best ? Bafna

  31. Linear Discriminant analysis • Fisher Criterion Bafna

  32. +  x2 - x1 LDA cont’d • What is the projection of a point x onto ? • Ans: Tx • What is the distance between projected means? x Bafna

  33. LDA Cont’d Fisher Criterion Bafna

  34. LDA Therefore, a simple computation (Matrix inverse) is sufficient to compute the ‘best’ separating hyperplane Bafna

  35. Supervised classification summary • Most techniques for supervised classification are based on the notion of a separating hyperplane. • The ‘optimal’ separation can be computed using various combinatorial (perceptron), algebraic (LDA), or statistical (ML) analyses. Bafna

  36. The dynamic nature of the cell • Proteomic and transcriptomic analyses provide a snapshot of the active proteins (transcripts) in a cell in a particular state (columns of the matrix) • This snapshot allows us to classify the state of the cell. • Other queries are also possible (not in the syllabus). • Which genes behave in a similar fashion? Each row describes the behaviour of genes under different conditions. Solved by ‘clustering’ rows. • Find a subset of genes that is predictive of the state of the cell. Using PCA, and other dimensionality reduction techniques. Bafna

  37. Silly Quiz • Who are these people, and what is the occasion? Bafna

  38. Genome Sequencing and Assembly Bafna

  39. DNA Sequencing • DNA is double-stranded • The strands are separated, and a polymerase is used to copy the second strand. • Special bases terminate this process early. Bafna

  40. PCA: motivating example • Consider the expression values of 2 genes over 6 samples. • Clearly, the expression of g1 is not informative, and it suffices to look at g2 values. • Dimensionality can be reduced by discarding the gene g1 g1 g2 Bafna

  41. Principle Components Analysis • Consider the expression values of 2 genes over 6 samples. • Clearly, the expression of the two genes is highly correlated. • Projecting all the genes on a single line could explain most of the data. • This is a generalization of “discarding the gene”. Bafna

  42. Projecting • Consider the mean of all points m, and a vector emanating from the mean • Algebraically, this projection on  means that all samples x can be represented by a single value T(x-m)  m x x-m = T T(x-m) Bafna M

  43. Higher dimensions • Consider a set of 2 (k) orthonormal vectors 1, 2… • Once proejcted, each samplemeans that all samples x can be represented by 2 (k) dimensional vector • 1T(x-m), 2T(x-m) 2 1 m x 1T(x-m) x-m 1T = Bafna M

  44. How to project • The generic scheme allows us to project an m dimensional surface into a k dimensional one. • How do we select the k ‘best’ dimensions? • The strategy used by PCA is one that maximizes the variance of the projected points around the mean Bafna

  45. PCA • Suppose all of the data were to be reduced by projecting to a single line  from the mean. • How do we select the line ? m Bafna

  46. PCA cont’d • Let each point xk map to x’k=m+ak. We want to minimize the error • Observation 1: Each point xk maps to x’k = m + T(xk-m) • (ak= T(xk-m)) xk  x’k m Bafna

  47. Proof of Observation 1 Differentiating w.r.t ak Bafna

  48. Minimizing PCA Error • To minimize error, we must maximize TS • By definition, = TS implies that  is an eigenvalue, and  the corresponding eigenvector. • Therefore, we must choose the eigenvector corresponding to the largest eigenvalue. Bafna

  49. PCA steps • X = starting matrix with n columns, m rows X xj Bafna

  50. Bafna

More Related