
Clustering Large Data Sets in Gene expression analysis Daniel Weaver



  1. Clustering Large Data Sets in Gene expression analysis Daniel Weaver

  2. Overview • What is “Gene Expression”? • Scientific questions and clustering techniques

  3. “The Central Dogma” • The arrows represent the transfer or flow of information. • DNA and RNA store information in a base-4 code (the four nucleotides). • Proteins store information in a base-20 code (the 20 amino acids). • DNA → (Transcription) → RNA → (Translation) → Protein

  4. What’s in a name? • DNA → RNA = “Transcription” • because the information is exactly copied (or “transcribed”) from one base-4 system (DNA) to an equivalent base-4 system (RNA). Think of a monk transcribing a scroll. • RNA → Protein = “Translation” • because the information is converted from a base-4 system (RNA) to a base-20 system (protein). Think of a monk translating a scroll into a new language.
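The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is treated as the coding strand (so transcription reduces to replacing T with U), the codon table is a small excerpt of the standard genetic code, and the function names `transcribe` and `translate` are made up for this example.

```python
# Partial codon table (excerpt of the standard genetic code);
# only the codons needed for the example sequence are listed.
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "AAA": "K",  # lysine
    "UAA": "*",  # stop
}


def transcribe(dna: str) -> str:
    """Copy a coding-strand DNA sequence into RNA: same base-4 code, T becomes U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read RNA three bases at a time and map each codon to an amino acid."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "*":  # a stop codon ends the protein
            break
        protein.append(amino)
    return "".join(protein)


rna = transcribe("ATGGCTTGGAAATAA")
protein = translate(rna)
print(rna, protein)
```

Three bases per codon give 4^3 = 64 possible codons, which is why a base-4 code can address all 20 amino acids.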

  5. What is a “gene”? • “A gene is a segment of DNA that contains all the information necessary to code for some function.” • A gene is also the unit of information that is transferred through Transcription and Translation.

  6. Switching genes on (or off) • Purpose: to correctly control the amount of active functional (protein) product present in the cell or organism. • [Figure: a gene with its Promoter and Enhancer labeled. Figure taken, with permission, from Alberts et al., Molecular Biology of the Cell.]

  7. Presence vs. expression • All cells have the same set of genes. • Different cell types express different subsets of their genes. • Constitutive genes are expressed in most cell types. • Cell-type specific genes are expressed in only a few cell types. • [Figure: genes A, B, and C, shown twice.]

  8. Gene expression responds to the environment • Changes to the cell’s internal or external environment can lead to changes in gene expression. • Most human diseases manifest through a mis-regulation of gene expression. • [Figure: genes A, B, and C, shown twice.]

  9. Microarrays and related technologies

  10. Example: raw microarray data • [Figure: raw microarray image. Legend: one spot color = more abundant in cell type A; another = more abundant in cell type B; a third = equally abundant in both cell types.]

  11. Interpreting raw data • Most gene expression detection data sets are expressed as a ratio of Red:Green (experiment:control) signal. • Frequently use a normalized log(red:green) ratio: for gene X, $X_i = \log(\mathrm{ratio}_i) \big/ \left[\sum_j \log^2(\mathrm{ratio}_j)\right]^{1/2}$, such that the Euclidean length of X is 1. • Interpreted raw data are tabulated in an Entity-by-Entity table, Genes-by-Experiments.
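A minimal sketch of this normalization, assuming one red:green ratio per experiment for a single gene. The function name `normalize_log_ratios` and the sample ratios are invented for illustration, and the choice of base 2 is an assumption (the slide only says "log").

```python
import math


def normalize_log_ratios(ratios):
    """Return the log ratios of one gene, scaled to Euclidean length 1."""
    logs = [math.log2(r) for r in ratios]  # base 2 is an assumption
    length = math.sqrt(sum(v * v for v in logs))
    if length == 0:
        raise ValueError("all ratios equal 1; the vector cannot be normalized")
    return [v / length for v in logs]


# Hypothetical red:green ratios for one gene across four experiments.
x = normalize_log_ratios([2.0, 0.5, 4.0, 1.0])
print(x)
print(sum(v * v for v in x))  # squared Euclidean length of the result
```

The base of the logarithm does not change the result: switching base multiplies every log ratio by the same constant, which the division by the vector length removes.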

  12. Gene-by-Experiment table • Gene expression analysis is a variant of classic data mining – looking for informative patterns in the rows and columns of this type of table.

  13. Data volumes • ~120,000 genes in the human genome. • Expression detection techniques can take from 1 to 50 measurements simultaneously on each gene. • Many, diverse Gene and Experiment attributes. • In 3-5 years, 10^5+ data sets will be available for analysis. • Data volumes ranging from tens of Gb to a few Tb.
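The quoted range can be checked with back-of-envelope arithmetic. The sketch below reads the slide's data set count as 10^5 and makes two assumptions that are not in the original: every data set covers all genes, and each measurement is stored as a 4-byte value.

```python
GENES = 120_000       # gene count quoted on the slide
DATA_SETS = 10**5     # data set count, read as 10^5
BYTES_PER_VALUE = 4   # assumption: one 32-bit number per measurement


def total_bytes(measurements_per_gene: int) -> int:
    """Raw storage needed if every data set measures every gene."""
    return GENES * measurements_per_gene * DATA_SETS * BYTES_PER_VALUE


low = total_bytes(1)    # one measurement per gene
high = total_bytes(50)  # fifty measurements per gene
print(f"{low / 1e9:.0f} GB to {high / 1e12:.1f} TB")
```

Under these assumptions the estimate spans 48 GB to 2.4 TB, which is consistent with the range given on the slide.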

  14. Analyzing Gene expression data • What genes are (or are not) expressed? • In different cells • Under different external conditions • In different disease states • How much does their expression change? • Does the change in expression correlate with other observed parameters? • Handled with descriptive statistics

  15. Clustering and Classifying gene expression • Scientific questions to be answered • Clustering techniques that are being applied • Lots of room and need for novel statistical and computational analyses

  16. Clustering Gene expression data • Functionally classify novel genes • Identify co-regulated gene groups • Identify diagnostic gene expression patterns

  17. Functionally Classifying Genes • Problem: • Genome sequencing projects identify many previously unstudied genes. • Can one use the genes’ expression patterns to cluster genes that have similar function?

  18. Inputs and outputs • Inputs • A set of genes whose functional classification is known. • A set of genes whose functional classification is unknown. • Gene expression data sets for all the genes. • Desired Output • A “best fit” functional classification for each of the novel genes.

  19. Examples • Brown et al. (2000) PNAS 97(1), 262-267. • Input: • Log normalized data from 79 experiments on 2,467 genes • Trained on 2/3 of the genes, tested on the remaining third. • Classifiers tried include Support Vector Machines (SVMs) and four other machine learning algorithms (Parzen, FLD, C4.5, MOC1). • SVMs performed best, using the kernel $K(X,Y) = (X \cdot Y + 1)^d$ with d = 1, 2, or 3. • This kernel implicitly transforms the data into a higher-dimensional space where it is easier to identify a separating hyperplane. • Sensitivity = ~0.6
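To make the "higher dimensional space" remark concrete, here is a small sketch of the kernel K(X,Y) = (X·Y + 1)^d. It is not the code used by Brown et al.; the function names and the two sample vectors are hypothetical. For d = 2 and two-dimensional input, the kernel value equals an ordinary dot product in a six-dimensional feature space, which the example verifies numerically.

```python
import math


def polynomial_kernel(x, y, d=1):
    """K(X, Y) = (X . Y + 1)^d, the kernel form quoted on the slide."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1) ** d


def feature_map_d2(v):
    """Explicit feature map for d = 2 and 2-dimensional input (6 features)."""
    v1, v2 = v
    r2 = math.sqrt(2)
    return [v1 * v1, v2 * v2, r2 * v1 * v2, r2 * v1, r2 * v2, 1.0]


# Two made-up expression vectors.
x, y = [1.0, 2.0], [3.0, 4.0]
implicit = polynomial_kernel(x, y, d=2)
explicit = sum(a * b for a, b in zip(feature_map_d2(x), feature_map_d2(y)))
print(implicit, explicit)
```

The kernel never builds the six-dimensional vectors, which matters when the input has 79 dimensions and the explicit feature space would be far larger.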

  20. Co-regulated genes • Problem: • Biological processes typically involve genes of many functional categories. • Knowledge of what genes act coordinately can help direct drug development. • [Figure: Expression Group 1, Expression Group 2, Expression Group 3.]

  21. Inputs and Outputs • Inputs • Gene expression data for all genes of interest • (Information about the experimental conditions in which the gene expression data sets were collected) • Desired Outputs • Ordering of the input genes into sets of genes with related expression patterns

  22. Examples • Eisen et al. (1998) PNAS 95: 14863-14868 • Input: • Log normalized data from 12 experiments on 2,467 genes • Performed pair-wise average linkage cluster analysis, using a modified Pearson correlation coefficient metric • Genes that cluster together are displayed in a dendrogram wherein the branch lengths correlate to the degree of similarity
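The general idea of average linkage clustering on a correlation metric can be sketched as below. Two substitutions are made and should be read as assumptions: the standard Pearson correlation stands in for the modified coefficient of Eisen et al., and the linkage between two clusters is the mean of all pairwise gene similarities. The gene names and expression profiles are invented.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Standard Pearson correlation (undefined for constant profiles)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles):
    """Merge the most similar clusters until one remains.

    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    names = list(profiles)
    sim = {
        frozenset(pair): pearson(profiles[pair[0]], profiles[pair[1]])
        for pair in combinations(names, 2)
    }

    def linkage(a, b):
        # Mean pairwise similarity between the members of two clusters.
        total = sum(sim[frozenset((g, h))] for g in a for h in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical expression profiles over four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.1, 2.0, 0.8],
}
merges = average_linkage(profiles)
for a, b, s in merges:
    print(a, b, round(s, 3))
```

The merge order and similarities are the information a dendrogram draws: early, high-similarity merges become short branches, and the final merge joins the two anticorrelated groups.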

  23. Examples • Tavazoie et al. (1999) Nature Genetics 22:281-285. • Inputs: • “Variance-normalized” data from 15 experiments on 6,220 genes. Variance normalization is $X'_{ij} = (X_{ij} - \bar{X}_i)/\mathrm{stdev}(X_i)$ for gene i in experiment j, where $\bar{X}_i$ is the mean of gene i over all experiments. • Used Euclidean distance as the metric and performed k-means clustering, programmed to find 10, 30, and 60 centroids. • Gene clusters were shown to contain functionally related genes as expected.
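The two steps, variance normalization followed by k-means with Euclidean distance, can be sketched as follows. This is not the procedure of Tavazoie et al. in detail: the use of the sample standard deviation, the seeding of centroids with the first k rows, the fixed iteration count, and the toy data are all assumptions made to keep the example short and deterministic.

```python
import math
import statistics


def variance_normalize(row):
    """Subtract the gene's mean and divide by its standard deviation."""
    mean = statistics.mean(row)
    sd = statistics.stdev(row)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in row]


def kmeans(points, k, iterations=20):
    """Plain k-means with Euclidean distance; returns (labels, centroids)."""
    # Assumption: seed with the first k points; real use would pick seeds randomly.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical genes-by-experiments table: two rising and two falling genes.
raw = [
    [1, 2, 3, 4],
    [4, 3, 2, 1],
    [10, 20, 30, 40],
    [8, 6, 4, 2],
]
normalized = [variance_normalize(row) for row in raw]
labels, centroids = kmeans(normalized, k=2)
print(labels)
```

Normalizing first is what lets the gene measured at ten times the scale fall into the same cluster as the other rising gene: after normalization only the shape of the profile remains.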

  24. Diagnostic expression patterns • Problem: • Many diseases cannot be reliably distinguished through traditional techniques (microscopy, pathology, etc.) • Given gene expression data from diseased tissue, is there a set of genes that correctly distinguishes the diseases (as judged by other criteria)?

  25. Inputs and Outputs • Inputs • Gene expression data for all genes (available) • Information about the patients afflicted with the complex disease of interest. • Desired output • The minimal set of genes that accurately partitions the disease, i.e. the minimal diagnostic gene expression pattern.

  26. Examples • Alizadeh et al. (2000) Nature 403: 503-511. • Input: • Log normalized data from 96 experiments on 4,026 genes (out of 17,856 measured). • The 96 experiments were performed on cancer biopsies from patients with Diffuse Large B-cell Lymphoma (DLBCL). • Pair-wise average linkage cluster analysis, using a modified Pearson correlation coefficient metric (Eisen et al., 1998). • Two previously unknown DLBCL sub-types distinguished by small gene clusters (~40 genes and ~70 genes) • Subtypes correspond to prognosis: • “GC B-like” → 76% survivorship • “Activated B-like” → 16% survivorship (Overhead)

  27. Summary • Current techniques include supervised and unsupervised classification • Three main scientific questions: • Functionally classifying genes • Identifying co-regulated sets of genes • Identifying diagnostic expression “fingerprints” • Data sets are relatively small now, but growing rapidly. • Classification draws from the expression data and from other domain knowledge. • Lots of room and need for novel statistical and computational analyses

  28. Further Reading: Clustering Gene Expression Data • [1] Alizadeh et al. (2000) Nature 403: 503-511. • [2] Alon et al. (1999) PNAS 96: 6745-6750. • [3] Butte and Kohane (2000) Proceedings of Pacific Sym. Biocomputing. • [4] Brown et al. (2000) PNAS 97: 262-267. • [5] Eisen et al. (1998) PNAS 95: 14863-14868. • [6] Iyer et al. (1999) Science 283: 83-87. • [7] Raychaudhuri et al. (2000) Proceedings of Pacific Sym. Biocomputing. • [8] Roberts et al. (2000) Science 287: 873-880. • [9] Ross et al. (2000) Nature Genetics 24: 227-235. • [10] Scherf et al. (2000) Nature Genetics 24: 236-244. • [11] Spellman et al. (1998) Mol Biol Cell 9: 3273-3297. • [12] Tamayo et al. (1999) PNAS 96: 2907-2912. • [13] Tavazoie et al. (1999) Nature Genetics 22: 281-285. • [14] Zhu and Zhang (2000) Proceedings of Pacific Sym. Biocomputing.

  29. Further Reading: Other related gene expression papers • Holstege et al. (1998) Cell 95: 717-728. • DeRisi et al. (1996) Nature Genetics 14: 457-460. • Schena et al. (1995) Science 270: 467-470. • DeRisi et al. (1997) Science 278: 680-686. • Hilsenbeck et al. (1999) J. Natl. Cancer Inst. 91: 453-459.

  30. Expression Data sets • European Bioinformatics Institute (EBI) (links to refs. 4, 5, 6, 11) • Main microarray page • http://www.ebi.ac.uk/microarray/ • Microarray public data set page (this is a great portal site from which you can browse to many of the published data sets) • http://industry.ebi.ac.uk/~brazma/Data-mining/microarray.html • National Human Genome Research Institute (NHGRI) • Main page • http://www.nhgri.nih.gov/DIR/LCG/15K/HTML/ • Data set download page • ftp://kronos.nhgri.nih.gov/pub/outgoing/olga/old/ • National Cancer Institute (NCI) (refs. 9 and 10) • Main page • http://discover.nci.nih.gov/ • Data set download page • http://discover.nci.nih.gov/nature2000/ • Lymphoma data set (ref. 1) • Main page • http://llmpp.nih.gov/ • Data set download page • http://llmpp.nih.gov/lymphoma/

  31. Daniel Weaver
