1 / 37

Evaluation of Two Methods to Cluster Gene Expression Data

Evaluation of Two Methods to Cluster Gene Expression Data. Odisse Azizgolshani Adam Wadsworth Protein Pathways SoCalBSI. Overview:. Background information Statement of the project Materials and methods Results Discussion and conclusion Acknowledgements. Microarray Data.

maddy
Download Presentation

Evaluation of Two Methods to Cluster Gene Expression Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Evaluation of Two Methods to Cluster Gene Expression Data Odisse Azizgolshani Adam Wadsworth Protein Pathways SoCalBSI

  2. Overview: • Background information • Statement of the project • Materials and methods • Results • Discussion and conclusion • Acknowledgements

  3. Microarray Data • Transcriptional response of genes to variations in cellular states • Cellular States: Mutations, Compound-Treated • The data values are the log ratios of the level of gene expression in the mutant or compound-treated state over the level of expression in the wild-type state

  4. Clustering • Clustering: Organizing into groups genes with similar expression profiles • Correlation Coefficient: The metric used to determine the similarity between two expression profiles • Hierarchical Clustering: A way of forming a multi-level hierarchy of gene expression profiles, which can be cut off at certain places to form gene clusters • Project: Evaluating two different methods of hierarchically clustering expression data

  5. EXPRESSION PROFILES Hierarchical ClusteringMethod 1 Correlation Calculations GENE CORRELATIONS Linking genes by expression similarity DENDROGRAM gene 1 gene 2 gene 3 gene 4 gene 5

  6. Method 1 Example Hughes, T.R., et al. (2000). Functional Discovery via a Compendium of Expression Profiles. Cell 102, 109-126.

  7. EXPRESSION PROFILES Hierarchical ClusteringMethod 2 Correlation Calculations GENE CORRELATIONS 1 GENE CORRELATIONS 2 Correlation of Correlations Linking genes by correlation similarity DENDROGRAM gene 1 gene 2 gene 3 gene 4 gene 5

  8. Method 2 Example Provided by Matteo Pelligrini, Protein Pathways.

  9. Applications of Clustering • Functional Genomics: Gaining information about the possible function of genes with unknown function Looking at the function of genes that cluster together with genes of unknown function • Diagnostics: Tissues from clinical samples can be clustered together to determine disease subtypes (e.g. tumor classification)

  10. Project Details • Project Question: In the process of hierarchically clustering gene expression data, which metric generates better clusters: 1. The correlation of gene expression ratios (Method 1) 2. The correlation of the correlations (Method 2) • Dataset: Yeast microarray gene expression data (6317 genes, 300 strains)* • Programming Environment: MATLAB v 6.5 *Hughes, T.R., et al. (2000). Functional Discovery via a Compendium of Expression Profiles. Cell 102, 109-126.

  11. Two Approaches • Problem: Determining the quality of clusters formed so as to evaluate the two clustering methods • Approach I: Determine the quality of the clusters by seeing if genes with the same function have clustered together more often in one method over the other method • Approach II: Determine the quality of the clusters by analyzing the variances of the clusters and seeing if there is a difference between clustering methods

  12. Approach I • Gene Function Analysis • If clusters contain lots of genes with the same function (i.e. transcription, then the clustering method is good. • Two Function Annotation Options • 2221 annotated genes with 318 different functions obtained from http://mips.gsf.de/genre/proj/yeast/index.jsp • 1155 annotated genes with 99 different functions obtained from http://genome.ad.jp/kegg/kegg2.html

  13. 6317 genes ANNOTATE Approach I Steps 2000 genes CLUSTER Cluster 1 Cluster 2 Cluster 3 1000 genes 600 genes 400 genes For both annotation options… • Out of the 6317 yeast genes, select only those genes that have known functions • Cluster the genes according to the two methods • For each cluster, compare each gene to every other gene in that cluster and see how many pairs have the same function • If a cluster contains n genes, then there are (n)(n-1) / 2 gene pairs to compare NUMBER OF PAIRS TO COMPARE 499500 pairs 179700 pairs 79800 pairs 759000 pairs total NUMBER OF PAIRS THAT HAVE SAME FUNCTION 150000 pairs 40200 pairs 3204 pairs 193404 pairs same 193404 / 759000 = 0.25 When the genes are partitioned into three clusters, 25% of the gene pairs have the same function

  14. Approach I Results

  15. Approach II (1): • Determining the quality of the clusters based on their volume. • Comparing the average volume of the clusters generated by method 1 and method 2.

  16. Approach II (2): • If there are M genes in each cluster and for each gene, N experiments are chosen: We’ll have: • M vectors in a N-dimensional space that can be visualized as M points. • The M points generate an ellipsoid if M > dimensionality of the space. • The closer the points to each other, the more correlated they are together, and the smaller the volume of the cluster. • The smaller the volume of the ellipsoid (cluster), the better the quality of the cluster.

  17. M original points in the 3-D space ? ? ? j i k ? D2 Centered ellipsoid with known axes

  18. Approach II (3): • To compute the volume of the cluster, we first compute its covariance matrix. • We then use Principal Components Analysis (PCA) to estimate the dimensions of the cluster. • PCA will construct a new space using N orthogonal linear combinations of old vectors of the space. (Each linear combination is a principal component.)

  19. Approach II (4): • In the new space, the ellipsoid is transformed into a centered ellipsoid, and the covariance matrix is diagonalized. • The axes of the centered ellipsoid are the elements on the diagonal of the diagonalized matrix, which are the variance of the data points in the new space along the principal components. • The volume of the ellipsoid = 4/3 x  x D1xD2 x…x DN

  20. M original points in the 3-D space ? ? Covariance matrix 0.0058 0.0015 0.0057 0.0015 0.0068 0.0023 0.0057 0.0023 0.0232 ? j i k v2 v1 PCA and diagonalizing the covariance matrix v3 Diagonalized Covariance Matrix 0.0039 0 0 0 0.0066 0 0 0 0.0253 D2 D1 D3 V1 = c1i + c2j + c3k V2 = c4i + c5j + c6k V3 = c7i + c8j + c9k D1 M points in the new 3-D space

  21. Approach II (5): Question: • Is one of the methods systematically generating smaller ellipsoids?

  22. Volume Calculation Results:

  23. Similarity of the Clusters from Two Methods: • Make a KxK matrix (K: the number of clusters in each method) whose elements are: aij = Difference (cluster i in method I , cluster j in method II) N((A,B)) • Difference (A,B) = N(AB) + N(AB) where: • A and B are two sets (Here: cluster i from method I and cluster j from method II) • (A,B): symmetrical difference of A and B • N((A,B)) = N(A-B) + N(B-A) • AB: Union of the two sets • AB: Intersect of the two sets • 0 < Difference(A,B) < 1

  24. An Example: A B A B B A 26 94 10 11 1 2 6 32 21 7 89 43 1 43 89 22 16 73 1 3 54 76 98 6 45 11 88 23 13 A-B = {1,2,32,7,89} N(A-B) = 5 B-A = {26,94,10,11} N(B-A) = 4 N(AB) = N(A-B) + N(B-A) = 5 + 4 = 9 AB = {1,2,6,32,21,7,89,43,26,94,10,11} N(AB) = 12 AB = {6,21,43} N(AB) = 3 Dissimilarity score (A,B) = 9/(12 + 3 ) = 0.6 N(A-B) = A N(B-A) = B N(AB) = N(A) + N(B) N(AB) = N(A) + N(B) N(AB) = 0 Dissimilarity score = 1 A = B A – B = B – A = Ø N(AB) = 0 Dissimilarity score(A,B) = 0 A: cluster i from method I B: cluster j from method II

  25. Results: Dissimilarity Matrix Dissimilarity score for cluster 1 from method 1 and cluster 2 from method 2 K: the number of clusters generated for each method

  26. Discussion and Conclusion (1): • Conclusions: • Neither approach can favor one method over the other with certainty; however, • Approach I favors method I when the number of clusters is small. • In the range of 1-100 clusters, while approach I favors method I, approach II fluctuates in choosing the better method or the other. • The efficiency of both approaches in clustering genes is dependant on the number of clusters. • The similarity of the clusters from method I and method II decreases as the number of clusters increases; in fact, the two methods generate very different clusters.

  27. Discussion and Conclusion (2): • Problems faced and future questions: • What is the best cutoff value for clustering? • In approach I, not all genes were annotated, so around 2/3 of the dataset was ignored. • Gene annotations are somewhat arbitrary. • What are other ways to quantify the quality of clusters? • Memory problem: We couldn’t include all the genes and all the experiments at the same time to analyze the quality of clusters.

  28. Acknowledgments: Special thanks to: • Our mentor: Dr. Matteo Pellegrini • Protein Pathways team: Dr. Darin Taverna Dr. Peter Bowers Dr. Mike Thompson Leon Kopelevich • SoCalBSI faculty: Dr. Jamil Momand Dr. Silvia Heubach Dr. Sandra Sharp Dr. Elizabeth Torres Dr. Wendie Johnston Dr. Jennifer Faust Dr. Nancy Warter-Perez Dr. Beverly Krilowicz • NIH and NSF : whose funding made this internship possible.

  29. Appendix I: Covariance (1) • The covariance of two features is the measure of how the two features vary together. • If they both have an increasing or decreasing trend, c ij> 0. • If one decreases while the other one increases, c ij < 0. • If the changes of one is independent of the changes of the other, c ij = 0. * *: http://www.engr.sjsu.edu/~knapp/HCIRODPR/PR_Mahal/cov.htm

  30. Appendix I: Covariance (2) • If we have M variables and each variable has N measurements, the covariance matrix can be obtained as below:  (xi - x) (yi - y ) cij = M Where: cij ( i  j ) is the covariance of (measurement i and measurement j) for all M variables. • The diagonals are the variances of each measurement. • Variance: A measure of how much the points vary around the mean.

  31. Appendix I: Diagonalizing The Covariance Matrix  (xi - x) (yi - y ) cij = M data = -0.0300 0.2500 0.0340 0.1430 0.2230 0.1900 -0.0230 0.0410 -0.0110 -0.0060 0.1780 0.3100 -0.0400 0.2120 -0.0560 covariance_data = 0.0058 0.0015 0.0057 0.0015 0.0068 0.0023 0.0057 0.0023 0.0232 covariance_data = cov(data); Covariance matrix Diagonalize the Covariance matrix [V, D] = eig (covariance_data); V = -0.9285 -0.2370 0.2859 0.2856 -0.9478 0.1417 0.2374 0.2133 0.9477 D = 0.0039 0 0 0 0.0066 0 0 0 0.0253

  32. Appendix II: Eigenvalues and Eigenvectors An eigenvector of an nxn matrix is a nonzero vector xsuch that Ax =x forsome scalar . A scalar  is called an eigenvalue of A if there is a nontrivial solution x of Ax = x; such an x is called an eigenvector corresponding to .* *: Lay, C. David. Linear Algebra and Its Applications. 3rd ed. P. 303

  33. Appendix III: Principal Components Analysis (PCA) data = -0.0300 0.2500 0.0340 0.1430 0.2230 0.1900 -0.0230 0.0410 -0.0110 -0.0060 0.1780 0.3100 -0.0400 0.2120 -0.0560 [ pcs , newdata , variances , t2 ] = princomp (data) ; PCA newdata = -0.0576 -0.0691 -0.0417 0.1359 -0.0512 0.0896 -0.1278 0.1178 0.0352 0.2006 0.0524 -0.0644 -0.1511 -0.0499 -0.0188 t2 = 1.2996 3.1981 3.0600 3.0737 1.3686 variances = 0.0253 0.0066 0.0039 pcs = 0.2859 -0.2370 0.9285 0.1417 -0.9478 -0.2856 0.9477 0.2133 -0.2374

  34. Volume Calculation Results (20 random experiments, 10-20 clusters:

  35. Volume Calculation Results: (300 experiments chosen first, then the dimensionality of the space was reduced to 20):

  36. Method 1 vs. Method 2 in a More Conceptual View: • Method 1 links together the two genes that have the most similar expression patterns. • Method 2 links together the two genes whose correlation with all other genes is most similar; i.e. it looks at a genes in a more global view (in a context of all other genes).

  37. Appendix IV: 1 2 3 4 5 4 4 1 2 2 1 5 3 3 5 3 3 1 2 1 2 2 5 3 3 4 5 4 4 5 1

More Related