
Large scale genomic data mining

Large scale genomic data mining. Curtis Huttenhower, 11-14-09. Harvard School of Public Health, Department of Biostatistics. Greatest Biological Discoveries? Are We There Yet? Species Diversity of Environmental Samples. How much biology is out there? How much have we found?


Presentation Transcript


  1. Large scale genomic data mining. Curtis Huttenhower, 11-14-09. Harvard School of Public Health, Department of Biostatistics

  2. Greatest Biological Discoveries?

  3. Are We There Yet? • How much biology is out there? • How much have we found? • How fast are we finding it? [Figures: Species Diversity of Environmental Samples; Human Proteins with Annotated Biological Roles (# distinct roles); Age-Adjusted Citation Rates for Major Sequencing Projects. Credits: Schloss and Handelsman, 2006; Matt Hibbs]

  4. Are We There Yet? • How much biology is out there? Lots! • How much have we found? Not nearly all. • How fast are we finding it? Not fast enough. Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results. [Figures: Species Diversity of Environmental Samples; Human Proteins with Annotated Biological Roles (# distinct roles); Age-Adjusted Cost per Citation for Major Sequencing Projects. Credits: Schloss and Handelsman, 2006; Matt Hibbs]

  5. Outline 1. Methodology: Algorithms for mining genome-scale datasets 2. Microscopic: Microbial communities and functional metagenomics 3. Macroscopic: Functional genomic data in a large prospective cohort

  6. A Framework for Functional Genomics. 100Ms of gene pairs × 1Ks of datasets. Example output: P(G2-G5|Data) = 0.85. [Figures: per-dataset frequency histograms over Low Correlation to High Correlation, Not coloc. vs. Coloc., and Dissim. vs. Similar (Low Similarity to High Similarity)]
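The slide reports a posterior such as P(G2-G5|Data) = 0.85 computed from many datasets, each contributing a histogram of evidence. A minimal sketch of how such a posterior could be combined, assuming a naive Bayes model with independent datasets (the slides say only "Bayesian"; the function name, the prior, and the likelihood values are illustrative assumptions, not the author's implementation):

```python
import math


def posterior_functional(prior, likelihoods):
    """Posterior probability that a gene pair is functionally related.

    Assumption (not stated in the slides): datasets are conditionally
    independent given the relationship, i.e. a naive Bayes model.

    prior       -- P(related) before seeing any data, strictly in (0, 1)
    likelihoods -- one (P(obs | related), P(obs | unrelated)) tuple per
                   dataset, read off that dataset's two histograms
    """
    # Work in log-odds so that thousands of datasets do not underflow.
    log_odds = math.log(prior / (1.0 - prior))
    for p_related, p_unrelated in likelihoods:
        log_odds += math.log(p_related / p_unrelated)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical evidence for one gene pair from three datasets:
# high correlation, colocalized, and an uninformative similarity bin.
evidence = [(0.6, 0.2), (0.5, 0.25), (0.3, 0.3)]
print(posterior_functional(0.1, evidence))
```

Each dataset shifts the log-odds by the log-likelihood ratio of its observed bin, so an uninformative dataset (equal likelihoods) leaves the posterior unchanged.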

  7. A Framework for Functional Genomics. Functional Relationship. [Figure labels: Golub 1999, Butte 2000, Whitfield 2002, Hansen 1998]

  8. Predicted Functional Interaction Networks. Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases. [Figures: Global interaction network; Metabolism network; Fibroblast network; Colon cancer network]

  9. Functional Mapping: Mining Integrated Networks. Predicted relationships between genes. The average strength of these relationships indicates how cohesive a process is. [Figure: Cell cycle genes; edge scale from Low Confidence to High Confidence]

  10. Functional Mapping: Mining Integrated Networks. Predicted relationships between genes. [Figure: Cell cycle genes; edge scale from Low Confidence to High Confidence]

  11. Functional Mapping: Mining Integrated Networks. Predicted relationships between genes. The average strength of these relationships indicates how associated two processes are. [Figure: Cell cycle genes and DNA replication genes; edge scale from Low Confidence to High Confidence]

  12. Functional Mapping: Scoring Functional Associations. How can we formalize these relationships? Any sets of genes G1 and G2 in a network can be compared using four measures: • Edges between their genes • Edges within each set • The background edges incident to each set • The baseline of all edges in the network. Stronger connections between the sets increase association. Stronger within-set connections or nonspecific background connections decrease association.
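The four measures on this slide can be combined into a single ratio. The sketch below is one plausible combination, not a formula given in the slides: it assumes a dense symmetric weight matrix, disjoint gene sets with at least two genes each, mean edge weight as the summary of each edge group, and the hypothetical name `functional_association`. It is chosen so that between-set edges raise the score, within-set and background edges lower it, and an unstructured network scores about 1.

```python
import numpy as np


def _mean_offdiag(w, idx):
    """Mean edge weight among distinct genes in idx (self-edges excluded)."""
    sub = w[np.ix_(idx, idx)]
    n = len(idx)
    return float((sub.sum() - np.trace(sub)) / (n * (n - 1)))


def functional_association(w, g1, g2):
    """Illustrative FA score for disjoint gene sets g1 and g2.

    w is a symmetric (genes x genes) matrix of predicted relationship
    confidences. The formula is an assumption for illustration:
        FA = (between / background) * (baseline / within)
    """
    g1, g2 = np.asarray(g1), np.asarray(g2)
    everyone = np.arange(w.shape[0])
    union = np.union1d(g1, g2)
    rest = np.setdiff1d(everyone, union)

    between = w[np.ix_(g1, g2)].mean()            # edges between the sets
    within = 0.5 * (_mean_offdiag(w, g1)          # edges inside each set
                    + _mean_offdiag(w, g2))
    background = w[np.ix_(union, rest)].mean()    # edges leaving the sets
    baseline = _mean_offdiag(w, everyone)         # all edges in the network
    return float((between / background) * (baseline / within))


# Toy network: uniform weak edges, plus strong edges between two gene sets.
w = np.full((8, 8), 0.2)
w[np.ix_([0, 1], [2, 3])] = 0.8
w[np.ix_([2, 3], [0, 1])] = 0.8
print(functional_association(w, [0, 1], [2, 3]))
```

Dividing by the background and within-set terms is what keeps promiscuous or internally dense gene sets from looking associated with everything.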

  13. Functional Mapping: Bootstrap p-values. Scoring functional associations is great… …how do you interpret an association score? • For gene sets of arbitrary sizes? • In arbitrary graphs? • Each with its own bizarre distribution of edges? For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Empirically, the null distribution is approximately normal with mean 1. Standard deviation is asymptotic in the sizes of both gene sets. This maps FA scores to p-values for any gene sets and underlying graph. [Figures: Histograms of FAs for random sets; Null distribution σs for one graph]
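The resampling procedure described on this slide can be sketched as follows. This is a hedged illustration rather than the author's code: `empirical_null` and `normal_upper_p` are hypothetical names, the scoring function is passed in by the caller, and the standard deviation is estimated directly for one pair of set sizes instead of being fitted as a function of both sizes as the slide suggests.

```python
import math
import numpy as np


def empirical_null(score_fn, n_genes, size1, size2, n_samples=1000, seed=0):
    """Mean and standard deviation of scores for random disjoint gene sets.

    score_fn(g1, g2) returns an association score for two index arrays.
    """
    rng = np.random.default_rng(seed)
    scores = np.empty(n_samples)
    for i in range(n_samples):
        # Draw both sets at once without replacement so they are disjoint.
        picked = rng.choice(n_genes, size=size1 + size2, replace=False)
        scores[i] = score_fn(picked[:size1], picked[size1:])
    return float(scores.mean()), float(scores.std(ddof=1))


def normal_upper_p(observed, mean, sd):
    """One-sided p-value of observed under a normal null N(mean, sd^2)."""
    z = (observed - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy usage with a stand-in score: mean between-set weight over global mean.
rng = np.random.default_rng(1)
w = rng.random((50, 50))
score = lambda a, b: float(w[np.ix_(a, b)].mean() / w.mean())
mu, sigma = empirical_null(score, n_genes=50, size1=5, size2=5)
print(mu, sigma, normal_upper_p(1.5, mu, sigma))
```

Relying on the normal approximation means the null only needs a mean and a standard deviation per size pair, which is far cheaper than storing a full empirical distribution for every combination of set sizes.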

  14. Functional Mapping: Functional Associations Between Processes. [Figure: map of processes: Hydrogen Transport, Electron Transport, Cellular Respiration, Cell Redox Homeostasis, Aldehyde Metabolism, Protein Processing, Peptide Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Energy Reserve Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance. Legend: Edges show associations between processes (Moderately Strong to Very Strong)]

  15. Functional Mapping: Functional Associations Between Processes. [Figure: same map of processes. Legend: Edges show associations between processes (Moderately Strong to Very Strong); Borders show data coverage of processes (Sparsely Covered to Well Covered)]

  16. Functional Mapping: Functional Associations Between Processes. [Figure: same map of processes. Legend: Edges show associations between processes (Moderately Strong to Very Strong); Nodes show cohesiveness of processes (Below Baseline, Baseline (genomic background), Very Cohesive); Borders show data coverage of processes (Sparsely Covered to Well Covered)]

  17. Functional Maps: Focused Data Summarization. Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

  18. Functional Maps: Focused Data Summarization. How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information? • Functional mapping • Very large collections of genomic data • Specific predicted molecular interactions • Pathway, process, or disease associations • Underlying experimental results and functional activities in data

  19. HEFalMp: Predicting Human Gene Function [Screenshot: HEFalMp]

  20. HEFalMp: Predicting Human Genetic Interactions [Screenshot: HEFalMp]

  21. HEFalMp: Analyzing Human Genomic Data [Screenshot: HEFalMp]

  22. HEFalMp: Understanding Human Disease [Screenshot: HEFalMp]

  23. Outline 1. Methodology: Algorithms for mining genome-scale datasets 2. Microscopic: Microbial communities and functional metagenomics 3. Macroscopic: Functional genomic data in a large prospective cohort

  24. Microbial Communities and Functional Metagenomics. With Jacques Izard, Wendy Garrett. • Metagenomics: data analysis from environmental samples • Microflora: environment includes us! • Pathogen collections of “single” organisms form similar communities • Another data integration problem • Must include datasets from multiple organisms • What questions can we answer? • What pathways/processes are present/over/under-enriched in a newly sequenced microbe/community? • What’s shared within community X? What’s different? What’s unique? • How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, … • Current functional methods annotate ~50% of synthetic data, <5% of environmental data. [Figure: gene labels DLD, ARG1, LPD1, PDPK1, PKH2, PKH1, ARG2, CAR1, PKH3, AGA, LLC1.3, pdk-1, T21F4.1, W04B5.5, R04B3.2]

  25. Data Integration for Microbial Communities. ~350 available expression datasets, ~25 species. • Data integration should work just as well in microbes as it does in yeast and humans • We know an awful lot about some microorganisms and almost nothing about others • Sequence-based and network-based tools for function transfer both work in isolation • We can use data integration to leverage both and mine out additional biology. [Figures: gene labels DLD, ARG1, LPD1, PDPK1, PKH2, PKH1, ARG2, CAR1, PKH3, AGA, LLC1.3, pdk-1, T21F4.1, W04B5.5, R04B3.2. Credits: Weskamp et al 2004; Kanehisa et al 2008; Flannick et al 2006; Tatusov et al 1997]

  26. Functional Maps for Functional Metagenomics. KO1: YG1, YG2, YG3; KO2: YG4; KO3: YG6; … ECG1, ECG2; PAG1; ECG3, PAG2; … [Figure: genes YG1 to YG17 grouped into ortholog clusters KO1 to KO9]

  27. Functional Maps for Functional Metagenomics
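The slide maps genes (YG1, YG2, …) onto ortholog clusters (KO1, KO2, …). A minimal sketch of "projecting" a gene-level network into that orthologous space, assuming that edges are averaged per cluster pair (the aggregation rule, the function name `project_to_orthologs`, and the edge weights are illustrative assumptions, not taken from the slides):

```python
from collections import defaultdict


def project_to_orthologs(gene_edges, ko_members):
    """Collapse a gene-gene network into an ortholog-cluster network.

    gene_edges -- {(gene_a, gene_b): weight}
    ko_members -- {cluster_id: [genes in that cluster]}
    Returns {(cluster_a, cluster_b): mean weight}, with cluster_a < cluster_b.
    Genes without a cluster and edges inside a single cluster are skipped.
    """
    gene_to_ko = {g: ko for ko, genes in ko_members.items() for g in genes}
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (a, b), weight in gene_edges.items():
        ka, kb = gene_to_ko.get(a), gene_to_ko.get(b)
        if ka is None or kb is None or ka == kb:
            continue
        key = tuple(sorted((ka, kb)))
        sums[key] += weight
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Cluster membership as listed on the slide; edge weights are made up.
ko_members = {"KO1": ["YG1", "YG2", "YG3"], "KO2": ["YG4"], "KO3": ["YG6"]}
gene_edges = {("YG1", "YG4"): 0.9, ("YG2", "YG4"): 0.5,
              ("YG4", "YG6"): 0.2, ("YG5", "YG6"): 0.7}
print(project_to_orthologs(gene_edges, ko_members))
```

Working in cluster space is what lets networks from different organisms be compared or combined, at the cost of discarding genes that have no ortholog assignment (YG5 in this toy input).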

  28. Validating Orthology-Based Functional Mapping. Does unweighted data integration predict functional relationships? What is the effect of “projecting” through an orthologous space? [Figures: four panels, GO and KEGG evaluations, each comparing individual datasets against unsupervised integration; y-axis log(Precision/Random), x-axis Recall]
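The evaluation plots use log(Precision/Random) against Recall. A short sketch of how such a curve can be computed from ranked predictions and a gold standard such as GO or KEGG co-annotation; the function name is hypothetical, the log base (2) is an assumption since the slides do not state it, and tied scores are not given special treatment:

```python
import math


def log_precision_over_random(scores, labels):
    """Curve of (recall, log2(precision / random precision)).

    scores -- predicted confidence per gene pair
    labels -- 1 if the pair is a known positive in the gold standard, else 0
    Random precision is the fraction of positives among all pairs, i.e. the
    precision expected from an uninformative ranking.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_positives = sum(labels)
    random_precision = total_positives / len(labels)
    curve = []
    true_positives = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives == 0:
            continue  # log of zero precision is undefined
        precision = true_positives / k
        recall = true_positives / total_positives
        curve.append((recall, math.log2(precision / random_precision)))
    return curve


print(log_precision_over_random([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

A value of 0 on the y-axis means the predictions do no better than chance, which makes curves comparable across gold standards with very different positive rates.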

  29. Validating Orthology-Based Functional Mapping. [Figure: yeast genes YG1 to YG17 partitioned into a holdout set (uncharacterized “genome”) and random subsets (characterized “genomes”)]

  30. Validating Orthology-Based Functional Mapping

  31. Validating Orthology-Based Functional Mapping. Can subsets of the yeast genome predict a held-out subset’s functional maps? Can subsets of the yeast genome predict a held-out subset’s interactome? • What have we learned? • Yeast is incredibly well-curated • KEGG tends to be more specific than GO • Predicting interactomes by projecting through functional maps works decently in the absolute best case. [Figures: GO and KEGG panels for each question; values shown, panel assignment not recoverable: 0.30, 0.37, 0.68, 0.48, 0.40, 0.43, 0.39, 0.25, 0.27, 0.39]

  32. Functional Maps for Functional Metagenomics. • Now, what happens if you do this for characterized microbes? • ~10 (somewhat) well-characterized species • 1-35 datasets each • Integrate within species • Evaluate using KEGG • Then cross-validate by holding out species. Check back soon for more results, preliminary data on metagenomes. [Figure: KEGG evaluation of unsupervised integrations; y-axis log(Precision/Random), x-axis Recall]

  33. Efficient Computation For Biological Discovery. Massive datasets and genomes require efficient algorithms and implementations. • Sleipnir C++ library for computational functional genomics • Data types for biological entities • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. • Network communication, parallelization • Efficient machine learning algorithms • Generative (Bayesian) and discriminative (SVM) • And it’s fully documented! It’s also speedy: improves on Bayes Net Toolbox by ~22x in memory usage and up to >100x in runtime.

  34. Efficient Computation For Biological Discovery. Massive datasets and genomes require efficient algorithms and implementations. • Sleipnir C++ library for computational functional genomics • Data types for biological entities • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. • Network communication, parallelization • Efficient machine learning algorithms • Generative (Bayesian) and discriminative (SVM) • And it’s fully documented! [Figure: original vs. current processing times; values shown, pairing not recoverable: 8 hours, 30 years, 1 minute, 2 months, 18 hours, 2.5 hours]

  35. Outline 1. Methodology: Algorithms for mining genome-scale datasets 2. Microscopic: Microbial communities and functional metagenomics 3. Macroscopic: Functional genomic data in a large prospective cohort

  36. Current Work: Molecular Mechanisms in a Colorectal Cancer Cohort. With Shuji Ogino, Charlie Fuchs. Cohorts: Health Professionals Follow-Up Study; Nurses’ Health Study. Data: ~3,100 gastrointestinal subjects; ~3,800 tissue samples; ~1,450 colon cancer samples; ~2,100 cancer mutation tests; ~1,200 LINE-1 methylation; ~1,150 CpG island methylation; ~775 gene expression; ~700 TMA immunohistochemistry. • LINE-1 Methylation • Repetitive element making up ~20% of mammalian genomes • Very easy to assay methylation level (%) • Good proxy for whole-genome methylation level • DASL Gene Expression • Gene expression analysis from paraffin blocks • Thanks to Todd Golub, Yujin Hoshida

  37. Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation. Nonnegative matrix factorization of genes × tumors into clusters C1, C2, C3, C4. [Figure: gene groups labeled Cell cycle regulation; Chr. 19 rearrangement, membrane receptors/channels; Angiogenesis, proliferation; HSC signature; Neural/ESC signature; BRCA interactors, chrom. stability factors]
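The subtypes on this slide come from nonnegative matrix factorization of a genes × tumors expression matrix. The slides do not say which NMF algorithm or settings were used; the sketch below assumes the standard Lee and Seung multiplicative updates for squared error, with hypothetical function names and a toy matrix in place of real expression data.

```python
import numpy as np


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix v (genes x tumors) as w @ h.

    w (genes x k) holds one "metagene" per column; h (k x tumors) holds each
    tumor's loading on those metagenes. Multiplicative updates keep both
    factors nonnegative; eps guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_tumors = v.shape
    w = rng.random((n_genes, k)) + eps
    h = rng.random((k, n_tumors)) + eps
    for _ in range(n_iter):
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h


def assign_clusters(h):
    """Assign each tumor to the metagene on which it loads most strongly."""
    return h.argmax(axis=0)


# Toy data: two gene programs, each active in a different group of tumors.
v = np.zeros((6, 6))
v[:3, :3] = 1.0
v[3:, 3:] = 1.0
w, h = nmf(v, k=2)
print(assign_clusters(h))
```

With k = 4 the same call would yield four clusters analogous to C1 through C4; in practice the factorization is usually repeated from several random starts because the result depends on initialization.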

  38. Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation. Note that these regulatory programs do not appear to correspond with demographics or common pathologic markers… Testing now for correlation with outcome. • Hypotheses? • Two main pathways to proliferation: • HSC program + BAX • ESC/NSC program • Two main pathways to deregulation: • Angiogenesis + chrom. instability • Cell cycle disruption (MSI?) [Figure: overlap diagram of gene signatures from Subramanian et al, 2005: Hematopoietic Stem Cell Signature, Neural Stem Cell Signature, Embryonic Stem Cell Signature; labels CD133 + Bcl-X(L), CD44 + CD166, Chr. 19q, BAX; counts shown, region assignment not recoverable: 166, 799, 945, 195, 678, 18, 146, 7, 8, 325]

  39. Epigenetics of Colorectal Cancer: LINE-1 methylation levels. Lower LINE-1 methylation associates with poor colon cancer prognosis (Ogino et al, 2008). LINE-1 methylation varies remarkably between individuals… …but it is highly correlated within individuals (ρ = 0.718, p < 0.01). What does it all mean?? What is the biological mechanism linking LINE-1 methylation to colon cancer?

  40. Epigenetics of Colorectal Cancer: LINE-1 methylation levels. Lower LINE-1 methylation associates with poor colon cancer prognosis (Ogino et al, 2008). This suggests linkage to a cancer-related pathway. LINE-1 methylation varies remarkably between individuals… This suggests a genetic effect. …but it is highly correlated within individuals (ρ = 0.718, p < 0.01). Is anything different about these outliers? This suggests a copy number variation. What is the biological mechanism linking LINE-1 methylation to colon cancer?

  41. Epigenetics of Colorectal Cancer: LINE-1 methylation levels. What is the biological mechanism linking LINE-1 methylation to colon cancer? • Preliminary Data • 10 genes differentially expressed even using simple methods • 1/3 are from the same family with known GI tumor prognostic value • 1/3 are X-chromosome testis/cancer-specific antigens • 1/2 fall in the same cytogenetic band, which is also a known CNV hotspot • HEFalMp links to a cascade of antigens/membrane receptors/TFs • Cell adhesion p-value ≈ 0, moderate correlation in many cancer arrays • GSEA pulls out a wide range of proliferation up (E2F), immune response down; need to regress out prognosis confounds. Check back in a couple of months!

  42. Outline 1. Methodology: Algorithms for mining genome-scale datasets 2. Microscopic: Microbial communities and functional metagenomics 3. Macroscopic: Functional genomic data in a large prospective cohort. Summary points: • Bayesian system for genomic data integration • HEFalMp system for human data analysis and integration • Functional mapping to statistically summarize large data collections • Integration for microbial communities and metagenomics • Network alignment and mapping for microbial community analysis • Sleipnir software for efficient large scale data mining • Demographic/molecular/genomic data for ~1,000 colorectal cancers • Ongoing analysis of gene activity and LINE-1 methylation

  43. Thanks! Shuji Ogino, Charlie Fuchs, Jacques Izard, Wendy Garrett, Hilary Coller, Erin Haley, Tsheko Mutungu, Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz. NIGMS. Interested? We’re recruiting students and postdocs! Biostatistics Department http://huttenhower.sph.harvard.edu http://function.princeton.edu/hefalmp http://function.princeton.edu/sleipnir
