
Answering biological questions using large genomic data collections

Curtis Huttenhower, 10-05-09. Harvard School of Public Health, Department of Biostatistics.


Presentation Transcript


  1. Answering biological questions using large genomic data collections. Curtis Huttenhower, 10-05-09. Harvard School of Public Health, Department of Biostatistics

  2. A Definition of Computational Functional Genomics. Prior knowledge; Genomic data; Gene ↓ Function; Gene ↓ Gene; Data ↓ Function; Function ↓ Function

  3. MEFIT: A Framework for Functional Genomics. Related Gene Pairs: BRCA1/BRCA2 0.9; BRCA1/RAD51 0.8; RAD51/TP53 0.85; … [Figure axes: Frequency; Low Correlation to High Correlation]

  4. MEFIT: A Framework for Functional Genomics. Related Gene Pairs: BRCA1/BRCA2 0.9; BRCA1/RAD51 0.8; RAD51/TP53 0.85; … Unrelated Gene Pairs: BRCA2/SOX2 0.1; RAD51/FOXP2 0.2; ACTR1/H6PD 0.15; … [Figure axes: Frequency; Low Correlation to High Correlation]
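
One way to read the related/unrelated histograms on this slide is as class-conditional frequencies that can be turned into a probability of functional relationship. The sketch below assumes a simple naive-Bayes-style calculation over binned correlations; the slides do not spell out MEFIT's actual procedure, and the bin edges, pseudocount, prior, and most of the toy correlation values are illustrative assumptions.

```python
from collections import Counter

# Toy gold standard (hypothetical): correlations observed for gene pairs that
# are known to be related or unrelated. The first three values in each list
# echo the numbers shown on the slide; the rest are made up.
related = [0.9, 0.8, 0.85, 0.7, 0.3]
unrelated = [0.1, 0.2, 0.15, 0.6, -0.1]

def bin_of(corr, edges=(0.25, 0.65)):
    """Map a correlation to a bin index: 0 = low, 1 = medium, 2 = high."""
    return sum(corr >= e for e in edges)

def bin_frequencies(values, n_bins=3, pseudocount=1.0):
    """Smoothed frequency of each bin, an estimate of P(bin | class)."""
    counts = Counter(bin_of(v) for v in values)
    total = len(values) + pseudocount * n_bins
    return [(counts[b] + pseudocount) / total for b in range(n_bins)]

def posterior_related(corr, prior=0.5):
    """Bayes' rule: P(related | observed correlation bin)."""
    b = bin_of(corr)
    p_rel = bin_frequencies(related)[b] * prior
    p_unrel = bin_frequencies(unrelated)[b] * (1.0 - prior)
    return p_rel / (p_rel + p_unrel)

high = posterior_related(0.9)  # a highly correlated pair
low = posterior_related(0.1)   # a weakly correlated pair
```

With several datasets, the same per-dataset likelihoods could be multiplied together under an independence assumption, which is what makes this style of integration scale to many experiments.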

  5. MEFIT: A Framework for Functional Genomics. Functional Relationship; Golub 1999; Butte 2000; Whitfield 2002; Hansen 1998

  6. MEFIT: A Framework for Functional Genomics. Biological Context: Functional area; Tissue; Disease; … Functional Relationship; Golub 1999; Butte 2000; Whitfield 2002; Hansen 1998

  7. Functional Interaction Networks. Global interaction network. Currently have data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases. MEFIT. Vacuolar transport network; Autophagy network; Translation network

  8. Predicting Gene Function Predicted relationships between genes Low Confidence High Confidence Cell cycle genes

  9. Predicting Gene Function Predicted relationships between genes Low Confidence High Confidence Cell cycle genes

  10. Predicting Gene Function Predicted relationships between genes Low Confidence High Confidence These edges provide a measure of how likely a gene is to specifically participate in the process of interest. Cell cycle genes
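
The idea that a gene's edges to known process members measure how specifically it participates in that process can be sketched as follows. This is a minimal illustration, not the method behind the slide: the scoring rule (mean confidence to process genes minus mean confidence to all other genes), the gene names, and the edge weights are all hypothetical.

```python
def edge(network, a, b):
    """Undirected edge confidence; pairs absent from the network count as 0."""
    return network.get(frozenset((a, b)), 0.0)

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def process_score(network, gene, process_genes, all_genes):
    """Connectivity to the process, corrected by connectivity to everything else."""
    inside = [g for g in process_genes if g != gene]
    outside = [g for g in all_genes if g != gene and g not in process_genes]
    return (mean(edge(network, gene, g) for g in inside)
            - mean(edge(network, gene, g) for g in outside))

# Hypothetical predicted network: cc1..cc3 stand in for known cell cycle genes,
# x and y are unannotated candidates, bg1 and bg2 are unrelated background genes.
network = {frozenset(pair): w for pair, w in [
    (("cc1", "cc2"), 0.9), (("cc1", "cc3"), 0.8), (("cc2", "cc3"), 0.7),
    (("x", "cc1"), 0.8), (("x", "cc2"), 0.6), (("x", "cc3"), 0.7),
    (("x", "bg1"), 0.1), (("x", "bg2"), 0.2),
    (("y", "cc1"), 0.2), (("y", "cc2"), 0.1), (("y", "cc3"), 0.3),
    (("y", "bg1"), 0.2), (("y", "bg2"), 0.2),
]}
cell_cycle = {"cc1", "cc2", "cc3"}
all_genes = sorted({g for pair in network for g in pair})

score_x = process_score(network, "x", cell_cycle, all_genes)
score_y = process_score(network, "y", cell_cycle, all_genes)
```

Subtracting the background term is one way to capture "specifically": a gene that is weakly connected to everything scores near zero, while a gene connected mainly to process members scores high.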

  11. Functional Associations Between Contexts Predicted relationships between genes The average strength of these relationships indicates how cohesive a process is. Low Confidence High Confidence Cell cycle genes

  12. Functional Associations Between Contexts Predicted relationships between genes Low Confidence High Confidence Cell cycle genes

  13. Functional Associations Between Contexts Predicted relationships between genes The average strength of these relationships indicates how associated two processes are. Low Confidence High Confidence Cell cycle genes DNA replication genes
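
The two averages described on slides 11 and 13 (within a process for cohesiveness, between two processes for association) can be sketched as below. The gene names and edge confidences are hypothetical, and treating missing edges as zero is an assumption made for the illustration.

```python
from itertools import combinations, product

def edge(network, a, b):
    """Undirected edge confidence; pairs absent from the network count as 0."""
    return network.get(frozenset((a, b)), 0.0)

def cohesiveness(network, genes):
    """Average edge confidence over all pairs of genes within one process."""
    pairs = list(combinations(sorted(genes), 2))
    return sum(edge(network, a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def association(network, genes_a, genes_b):
    """Average edge confidence over all pairs spanning two processes."""
    pairs = [(a, b) for a, b in product(sorted(genes_a), sorted(genes_b)) if a != b]
    return sum(edge(network, a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

# Hypothetical predicted network: cc* stand in for cell cycle genes,
# dr* for DNA replication genes.
network = {frozenset(pair): w for pair, w in [
    (("cc1", "cc2"), 0.9), (("cc1", "cc3"), 0.8), (("cc2", "cc3"), 0.7),
    (("dr1", "dr2"), 0.9),
    (("cc1", "dr1"), 0.6), (("cc1", "dr2"), 0.5), (("cc2", "dr1"), 0.4),
    (("cc2", "dr2"), 0.5), (("cc3", "dr1"), 0.6), (("cc3", "dr2"), 0.4),
]}
cell_cycle = {"cc1", "cc2", "cc3"}
dna_replication = {"dr1", "dr2"}

cc_cohesiveness = cohesiveness(network, cell_cycle)
cc_dr_association = association(network, cell_cycle, dna_replication)
```

In practice both numbers would need to be compared against a genomic background, since a raw average depends on how densely the data cover the genes involved.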

  14. Functional Associations Between Processes. Edges: associations between processes (Moderately Strong to Very Strong). Processes shown: Hydrogen Transport; Electron Transport; Cellular Respiration; Cell Redox Homeostasis; Aldehyde Metabolism; Protein Processing; Peptide Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Energy Reserve Metabolism; Protein Depolymerization; Organelle Fusion; Organelle Inheritance

  15. Functional Associations Between Processes. Edges: associations between processes (Moderately Strong to Very Strong). Borders: data coverage of processes (Sparsely Covered to Well Covered). Processes shown: Hydrogen Transport; Electron Transport; Cellular Respiration; Cell Redox Homeostasis; Aldehyde Metabolism; Protein Processing; Peptide Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Energy Reserve Metabolism; Protein Depolymerization; Organelle Fusion; Organelle Inheritance

  16. Functional Associations Between Processes. Edges: associations between processes (Moderately Strong to Very Strong). Nodes: cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive). Borders: data coverage of processes (Sparsely Covered to Well Covered). Processes shown: Hydrogen Transport; Electron Transport; Cellular Respiration; Cell Redox Homeostasis; Aldehyde Metabolism; Protein Processing; Peptide Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Energy Reserve Metabolism; Protein Depolymerization; Organelle Fusion; Organelle Inheritance. Gene callouts: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …

  17. HEFalMp: Predicting human gene function HEFalMp

  18. HEFalMp: Predicting human genetic interactions HEFalMp

  19. HEFalMp: Analyzing human genomic data HEFalMp

  20. HEFalMp: Understanding human disease HEFalMp

  21. Validating Human Predictions With Erin Haley, Hilary Coller Autophagy 5½ of 7 predictions currently confirmed Predicted novel autophagy proteins Luciferase (Negative control) ATG5 (Positive control) LAMP2 RAB11A Not Starved Starved (Autophagic)

  22. Comprehensive Validation of Computational Predictions With David Hess, Amy Caudy Genomic data Prior knowledge Computational Predictions of Gene Function SPELL Hibbs et al 2007 bioPIXIE Myers et al 2005 MEFIT Retraining Genes predicted to function in mitochondrion organization and biogenesis New known functions for correctly predicted genes Laboratory Experiments Growth curves Petite frequency Confocal microscopy

  23. Evaluating the Performance of Computational Predictions. Genes involved in mitochondrion organization and biogenesis: 106 Original GO Annotations; 135 Under-annotations; 82 Novel Confirmations, First Iteration; 17 Novel Confirmations, Second Iteration. 340 total: >3x previously known genes in ~5 person-months
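
The arithmetic behind the "340 total: >3x" summary on slide 23 can be checked directly. The snippet below uses only the four counts printed on that slide; the dictionary keys are just labels copied from it. The breakdown on slide 24 differs slightly, so it is not used here.

```python
# Counts as printed on slide 23.
counts = {
    "Original GO annotations": 106,
    "Under-annotations": 135,
    "Novel confirmations, first iteration": 82,
    "Novel confirmations, second iteration": 17,
}

total = sum(counts.values())
fold_increase = total / counts["Original GO annotations"]

print(f"total genes: {total}")
print(f"fold increase over original annotations: {fold_increase:.2f}")
```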

  24. Evaluating the Performance of Computational Predictions. Genes involved in mitochondrion organization and biogenesis. Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated. 106 Original GO Annotations; 95 Under-annotations; 40 Confirmed Under-annotations; 80 Novel Confirmations, First Iteration; 17 Novel Confirmations, Second Iteration. 340 total: >3x previously known genes in ~5 person-months

  25. Functional Maps: Focused Data Summarization. ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

  26. Functional Maps: Focused Data Summarization. ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA How can a researcher take advantage of all this data to study his/her favorite gene/pathway/disease without losing information? • Functional mapping • Very large collections of genomic data • Specific predicted molecular interactions • Pathway, process, or disease associations • Underlying experimental results and functional activities in data

  27. Thanks! Hilary Coller, Erin Haley, Tsheko Mutungu, Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz, Shuji Ogino, Charlie Fuchs. Interested? I’m accepting students and postdocs! http://www.huttenhower.org http://function.princeton.edu/hefalmp NIGMS

  28. Next Steps: Microbial Communities • Data integration is off to a great start in humans • Complex communities of distinct cell types • Very sparse prior knowledge • Concentrated in a few specific areas • Variation across populations • Critical to understand mechanisms of disease

  29. Next Steps: Microbial Communities • What about microbial communities? • Complex communities of distinct species/strains • Very sparse prior knowledge • Concentrated in a few specific species/strains • Variation across populations • Critical to understand mechanisms of disease

  30. Next Steps: Microbial Communities. ~120 available expression datasets; ~70 species • Data integration works just as well in microbes as it does in humans • We know an awful lot about some microorganisms and almost nothing about others • Purely sequence-based and purely network-based tools for function transfer both fall short • We need data integration to take advantage of both and mine out useful biology! [Figure labels: DLD, ARG1, LPD1, PDPK1, PKH2, PKH1, ARG2, CAR1, PKH3, AGA, LLC 1.3, pdk-1, T21 F4.1, W04B5.5, R04 B3.2] Weskamp et al 2004; Kanehisa et al 2008; Flannick et al 2006; Tatusov et al 1997

  31. Next Steps: Functional Metagenomics • Metagenomics: data analysis from environmental samples • Microflora: environment includes us! • Another data integration problem • Must include datasets from multiple organisms • Another context-specificity problem • Now “context” can also mean “species” • What questions can we answer? • How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, … • What’s shared within community X? What’s different? What’s unique? • What’s perturbed in disease state Y? One organism, or many? Host interactions? • Current methods annotate ~50% of synthetic data, <5% of environmental data [Figure labels: DLD, ARG1, LPD1, PDPK1, PKH2, PKH1, ARG2, CAR1, PKH3, AGA, LLC 1.3, pdk-1, T21 F4.1, W04B5.5, R04 B3.2]
