
Supervised and unsupervised methods for large scale genomic data integration

Supervised and unsupervised methods for large scale genomic data integration. Curtis Huttenhower, 03-25-10. Harvard School of Public Health, Department of Biostatistics. Greatest Biological Discoveries? Are We There Yet? Species Diversity of Environmental Samples.


Presentation Transcript


  1. Supervised and unsupervised methods for large scale genomic data integration. Curtis Huttenhower, 03-25-10. Harvard School of Public Health, Department of Biostatistics.

  2. Greatest Biological Discoveries?

  3. Are We There Yet? • How much biology is out there? • How much have we found? • How fast are we finding it? Figure panels: Species Diversity of Environmental Samples; Human Proteins with Annotated Biological Roles (# Distinct Roles); Age-Adjusted Citation Rates for Major Sequencing Projects. Credits: Fierer 2008; Matt Hibbs.

  4. Are We There Yet? • How much biology is out there? Lots! • How much have we found? Not nearly all. • How fast are we finding it? Not fast enough. Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results. Figure panels: Species Diversity of Environmental Samples; Human Proteins with Annotated Biological Roles (# Distinct Roles); Age-Adjusted Cost per Citation for Major Sequencing Projects. Credits: Fierer 2008; Matt Hibbs.

  5. Outline 1. Big picture: Algorithms for mining genome-scale datasets 2. Details: Recovering mechanistic detail from high-throughput data 3. Applications: Microbial communities and functional metagenomics

  6. A framework for functional genomics. Input: 100Ms of gene pairs × 1Ks of datasets. Example output: P(G2-G5 | Data) = 0.85. Figure panels: per-dataset frequency histograms (Low Correlation to High Correlation; Not coloc. vs. Coloc.; Dissim. vs. Similar; Low Similarity to High Similarity).
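
Slide 6 reports a posterior such as P(G2-G5 | Data) = 0.85 for one gene pair. A minimal sketch of how per-dataset evidence could be combined into such a posterior, assuming a naive Bayes model in which datasets are conditionally independent. The function name, the prior, and every probability below are hypothetical illustrations, not values from the talk.

```python
def posterior_related(prior, likelihoods):
    """Return P(related | evidence) under a naive Bayes assumption.

    prior       -- P(related) before looking at any dataset
    likelihoods -- one (P(obs | related), P(obs | unrelated)) pair per dataset
    """
    p_related = prior
    p_unrelated = 1.0 - prior
    for given_related, given_unrelated in likelihoods:
        # Each dataset multiplies in its own likelihood (independence assumption).
        p_related *= given_related
        p_unrelated *= given_unrelated
    # Normalize so the two hypotheses sum to one.
    return p_related / (p_related + p_unrelated)


# Hypothetical evidence for one gene pair from three datasets:
# two informative datasets and one uninformative one (equal likelihoods).
evidence = [(0.6, 0.2), (0.7, 0.3), (0.5, 0.5)]
posterior = posterior_related(0.05, evidence)
```

With these made-up numbers the prior of 0.05 rises to about 0.27. The third dataset leaves the result unchanged because its two likelihoods are equal.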

  7. Functional network prediction and analysis. HEFalMp currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases. Figure panels: Global interaction network; Metabolism network; Signaling network; Gut community network.

  8. HEFalMp: Predicting human gene function HEFalMp

  9. HEFalMp: Predicting human genetic interactions HEFalMp

  10. HEFalMp: Analyzing human genomic data HEFalMp

  11. HEFalMp: Understanding human disease HEFalMp

  12. Meta-analysis for unsupervised functional data integration (Huttenhower 2006; Hibbs 2007; Evangelou 2007). Simple regression: all datasets are equally accurate. Random effects: variation within and among datasets and interactions.
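
Slide 12 contrasts treating all datasets as equally accurate with a random-effects model that allows variation within and among datasets. A minimal sketch of that contrast for one gene pair, assuming Fisher z-transformed correlations and the standard DerSimonian-Laird estimator of between-dataset variance. This is a textbook stand-in, not necessarily the model used in the cited papers, and the function name and inputs are hypothetical.

```python
import math


def combine_correlations(correlations, sample_sizes):
    """Pool per-dataset correlations for one gene pair.

    Returns (fixed_effect_r, random_effects_r, tau2), where tau2 is the
    DerSimonian-Laird estimate of between-dataset variance.
    """
    # Fisher z-transform: z is roughly normal with variance 1 / (n - 3).
    zs = [math.atanh(r) for r in correlations]
    variances = [1.0 / (n - 3) for n in sample_sizes]

    # Fixed effect: inverse-variance weights, no between-dataset variation.
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fixed = sum(w * z for w, z in zip(weights, zs)) / total

    # Cochran's Q measures how much datasets disagree beyond sampling noise.
    q = sum(w * (z - fixed) ** 2 for w, z in zip(weights, zs))
    c = total - sum(w * w for w in weights) / total
    tau2 = max(0.0, (q - (len(zs) - 1)) / c) if c > 0 else 0.0

    # Random effects: add tau2 to each variance, which flattens the weights.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * z for w, z in zip(re_weights, zs)) / sum(re_weights)

    return math.tanh(fixed), math.tanh(pooled), tau2


# Hypothetical inputs: datasets that agree, then datasets that disagree.
agree = combine_correlations([0.5, 0.5, 0.5], [20, 30, 40])
disagree = combine_correlations([0.9, 0.1], [103, 13])
```

When the datasets agree, tau2 is zero and both estimates coincide. When they disagree, the random-effects estimate is pulled away from the largest dataset because the weights become more even.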

  13. Meta-analysis for unsupervised functional data integration (Huttenhower 2006; Hibbs 2007; Evangelou 2007). Following up with a semi-supervised approach.

  14. Functional mapping: mining integrated networks. Predicted relationships between genes. The strength of these relationships indicates how cohesive a process is. Figure: Chemotaxis (edges from Low Confidence to High Confidence).

  15. Functional mapping: mining integrated networks. Predicted relationships between genes. Figure: Chemotaxis (edges from Low Confidence to High Confidence).

  16. Functional mapping: mining integrated networks. Predicted relationships between genes. The strength of these relationships indicates how associated two processes are. Figure: Chemotaxis and Flagellar assembly (edges from Low Confidence to High Confidence).

  17. Functional Mapping: Scoring Functional Associations. How can we formalize these relationships? • Any sets of genes G1 and G2 in a network can be compared using four measures: • Edges between their genes • Edges within each set • The background edges incident to each set • The baseline of all edges in the network. Stronger connections between the sets increase association. Stronger within-set self-connections or nonspecific background connections decrease association.
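
The four measures on slide 17 can be combined in many ways, and the slide does not give a formula. The sketch below assumes one simple ratio form, (between / background) × (baseline / within), chosen only because it rises with between-set edges, falls with within-set and background edges, and equals 1 on a uniform graph. The function name, the edge representation, and the toy graph are all hypothetical.

```python
from itertools import combinations


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def association_score(weights, genes, set1, set2):
    """Score two gene sets from mean edge weights (assumed ratio form).

    weights -- dict mapping frozenset({a, b}) to an edge weight
    genes   -- every gene in the network
    """
    genes = sorted(genes)
    s1, s2 = set(set1), set(set2)
    both = s1 | s2

    def w(a, b):
        return weights.get(frozenset((a, b)), 0.0)

    between = mean(w(a, b) for a in s1 for b in s2 if a != b)
    within = mean(w(a, b) for s in (s1, s2) for a, b in combinations(sorted(s), 2))
    background = mean(w(a, b) for a in both for b in genes if b not in both)
    baseline = mean(w(a, b) for a, b in combinations(genes, 2))
    if within == 0.0 or background == 0.0:
        raise ValueError("score undefined without within-set and background edges")
    return (between / background) * (baseline / within)


# Toy network of six genes in which every edge has weight 0.5.
genes = list("abcdef")
uniform = {frozenset(pair): 0.5 for pair in combinations(genes, 2)}
flat_score = association_score(uniform, genes, {"a", "b"}, {"c", "d"})

# Strengthen only the edges that run between the two sets.
linked = dict(uniform)
for a in "ab":
    for b in "cd":
        linked[frozenset((a, b))] = 0.9
linked_score = association_score(linked, genes, {"a", "b"}, {"c", "d"})
```

On the uniform graph the score is 1, and strengthening only the between-set edges pushes it above 1, matching the qualitative behavior the slide describes.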

  18. Functional Mapping: Bootstrap p-values. • Scoring functional associations is great… …how do you interpret an association score? • For gene sets of arbitrary sizes? • In arbitrary graphs? • Each with its own bizarre distribution of edges? For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Null distribution is approximately normal with mean 1. Empirically! Standard deviation is asymptotic in the sizes of both gene sets. Maps FA scores to p-values for any gene sets and underlying graph. Figure panels: Histograms of FAs for random sets; Null distribution σs for one graph.
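
The procedure on slide 18 can be sketched as follows: score many randomly chosen gene-set pairs of the given sizes, summarize the scores with a normal distribution, and read a p-value from its upper tail. This is a simplified illustration under those assumptions. `toy_score` is a hypothetical stand-in for a real association score, and the one-sided test is a choice made here, not something stated on the slide.

```python
import random
from statistics import NormalDist, mean, stdev


def null_parameters(score, genes, size1, size2, trials=200, seed=0):
    """Estimate mean and standard deviation of scores for random gene sets."""
    rng = random.Random(seed)
    genes = sorted(genes)
    scores = []
    for _ in range(trials):
        # Draw two disjoint random sets of the requested sizes.
        picked = rng.sample(genes, size1 + size2)
        scores.append(score(picked[:size1], picked[size1:]))
    return mean(scores), stdev(scores)


def p_value(observed, mu, sigma):
    """One-sided p-value: chance that a random pair scores at least this high."""
    return 1.0 - NormalDist(mu, sigma).cdf(observed)


def toy_score(set1, set2):
    """Hypothetical score: mean of an arbitrary pairwise weight, centered near 1."""
    pairs = [(a, b) for a in set1 for b in set2]
    return sum(((a * b) % 7 + 1) / 4.0 for a, b in pairs) / len(pairs)


mu, sigma = null_parameters(toy_score, range(1, 41), size1=5, size2=8)
p_typical = p_value(mu, mu, sigma)
p_extreme = p_value(mu + 3 * sigma, mu, sigma)
```

A score equal to the null mean gets a p-value of 0.5, while a score three standard deviations above it gets a p-value near 0.001. Because the slide notes that the standard deviation depends on both set sizes, the null parameters would need to be estimated (or interpolated) per size pair.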

  19. Functional Mapping: Functional Associations Between Processes. Legend: Edges show associations between processes (Moderately Strong to Very Strong); Nodes show cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive); Borders show data coverage of processes (Sparsely Covered to Well Covered). Processes shown: Hydrogen Transport; Electron Transport; Cellular Respiration; Cell Redox Homeostasis; Aldehyde Metabolism; Protein Processing; Peptide Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Energy Reserve Metabolism; Protein Depolymerization; Organelle Fusion; Organelle Inheritance.

  20. Functional Mapping: Functional Associations Between Processes. Legend: Edges show associations between processes (Moderately Strong to Very Strong); Nodes show cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive); Borders show data coverage of processes (Sparsely Covered to Well Covered).

  21. Functional Maps: Focused Data Summarization. ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

  22. Functional Maps: Focused Data Summarization. ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information? • Functional mapping • Very large collections of genomic data • Specific predicted molecular interactions • Pathway, process, or disease associations • Underlying experimental results and functional activities in data

  23. Outline 1. Big picture: Algorithms for mining genome-scale datasets 2. Details: Recovering mechanistic detail from high-throughput data 3. Applications: Microbial communities and functional metagenomics

  24. How do functional interactions become pathways? • Gene expression • Physical PPIs • Genetic interactions • Colocalization • Sequence • Protein domains • Regulatory binding sites • …

  25. Simultaneous inference of physical, genetic, regulatory, and functional networks. With Chris Park, Olga Troyanskaya. Figure labels: Functional genomic data; Functional interactions; Regulatory interactions; Post-transcriptional regulation; Phosphorylation; Metabolic interactions; Protein complexes.

  26. Learning a compendium of interaction networks. Train one SVM per interaction type. Resolve consistency using a hierarchical Bayes net.
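
The first step on slide 26, one SVM per interaction type, can be sketched with a small linear SVM trained by stochastic subgradient descent on the L2-regularized hinge loss. This is a simplified stand-in, assuming two made-up features per gene pair and two illustrative interaction types. It does not cover the hierarchical Bayes net step, and all names and data are hypothetical.

```python
def train_linear_svm(samples, labels, rate=0.1, penalty=0.001, epochs=50):
    """Fit weights and bias by subgradient descent on the regularized hinge loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            margin = label * (sum(w * x for w, x in zip(weights, features)) + bias)
            # Regularization shrinks the weights a little on every step.
            weights = [w * (1.0 - rate * penalty) for w in weights]
            if margin < 1.0:
                # Margin violated: move the boundary toward classifying this pair.
                weights = [w + rate * label * x for w, x in zip(weights, features)]
                bias += rate * label
    return weights, bias


def predict(model, features):
    weights, bias = model
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1


# Hypothetical features for six gene pairs (for example, two evidence scores).
pairs = [(2.0, 2.0), (3.0, 1.0), (2.0, -2.0), (-2.0, 2.0), (-3.0, -1.0), (-2.0, -2.0)]

# Hypothetical +1/-1 labels per interaction type for the same six pairs.
labels_by_type = {
    "physical": [1, 1, 1, -1, -1, -1],
    "regulatory": [1, 1, -1, 1, -1, -1],
}

# One independent classifier per interaction type.
models = {name: train_linear_svm(pairs, y) for name, y in labels_by_type.items()}
predictions = {
    name: [predict(model, pair) for pair in pairs] for name, model in models.items()
}
```

Each interaction type gets its own independent model, so the same gene pair can receive different predictions for different types. Reconciling those predictions is the job of the second step on the slide.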

  27. Learning a compendium of interaction networks. Both presence/absence and directionality of interactions are accurately inferred. Figure: AUC (scale 0.5 to 1.0).

  28. Using network compendia to predict complete pathways. With David Hess. Additional 20 novel synthetic lethality predictions tested, 14 confirmed (>100x better than random). Figure legend: Confirmed; Unconfirmed.

  29. Interactive aligned network viewer: coming soon! Graphle

  30. Outline 1. Big picture: Algorithms for mining genome-scale datasets 2. Details: Recovering mechanistic detail from high-throughput data 3. Applications: Microbial communities and functional metagenomics

  31. Microbial Communities and Functional Metagenomics. With Jacques Izard, Wendy Garrett. • Metagenomics: data analysis from environmental samples • Microflora: environment includes us! • Pathogen collections of “single” organisms form similar communities • Another data integration problem • Must include datasets from multiple organisms • What questions can we answer? • What pathways/processes are present/over/under-enriched in a newly sequenced microbe/community? • What’s shared within community X? What’s different? What’s unique? • How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, … • Current functional methods annotate ~50% of synthetic data, <5% of environmental data. Figure labels: DLD, ARG1, ARG2, LPD1, PDPK1, PKH1, PKH2, PKH3, CAR1, AGA, LLC1.3, pdk-1, T21F4.1, W04B5.5, R04B3.2.

  32. Data Integration for Microbial Communities. ~300 available expression datasets, ~30 species. • Data integration works just as well in microbes as it does in yeast and humans • We know an awful lot about some microorganisms and almost nothing about others • Sequence-based and network-based tools for function transfer both work in isolation • We can use data integration to leverage both and mine out additional biology. Figure labels: DLD, ARG1, ARG2, LPD1, PDPK1, PKH1, PKH2, PKH3, CAR1, AGA, LLC1.3, pdk-1, T21F4.1, W04B5.5, R04B3.2. Citations: Weskamp et al 2004; Kanehisa et al 2008; Flannick et al 2006; Tatusov et al 1997.

  33. Functional network prediction from diverse microbial data. Expression data: 876 raw datasets; 486 bacterial expression experiments; 310 postprocessed datasets; 304 normalized coexpression networks in 27 species. Interaction data: 154796 raw interactions; 307 bacterial interaction experiments; 114786 postprocessed interactions. Result: integrated functional interaction networks in 15 species. Figure: E. coli integration (← Precision ↑, Recall ↓).

  34. Functional maps for cross-species knowledge transfer. Following up with unsupervised and partially anchored network alignment. Figure axis: ← Precision ↑, Recall ↓.

  35. Efficient Computation For Biological Discovery. Massive datasets and genomes require efficient algorithms and implementations. • Sleipnir C++ library for computational functional genomics • Data types for biological entities • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. • Network communication, parallelization • Efficient machine learning algorithms • Generative (Bayesian) and discriminative (SVM) • And it’s fully documented! It’s also speedy: microbial data integration computation takes <3 hrs.

  36. Outline 1. Big picture: Algorithms for mining genome-scale datasets • Bayesian and unsupervised methods for data integration • HEFalMp system for human data analysis and integration • Functional mapping to statistically summarize large data collections 2. Details: Recovering mechanistic detail from high-throughput data • Simultaneous inference of an interaction network compendium • Accurate prediction of interaction types and directionality • Validated pathways and specific individual interactions in yeast 3. Applications: Microbial communities and functional metagenomics • Integration for microbial communities and metagenomics • Sleipnir software for efficient large scale data mining

  37. Thanks! Jacques Izard, Hilary Coller, Erin Haley, Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong, Wendy Garrett, Sarah Fortune, Tracy Rosebrock. http://huttenhower.sph.harvard.edu/sleipnir http://function.princeton.edu/hefalmp NIGMS

  38. Validating Human Predictions. With Erin Haley, Hilary Coller. Autophagy: 5½ of 7 predictions currently confirmed. Figure labels: Predicted novel autophagy proteins (LAMP2, RAB11A); Luciferase (negative control); ATG5 (positive control); Not Starved vs. Starved (Autophagic).

  39. Functional maps for cross-species knowledge transfer. O1: G1, G2, G3; O2: G4; O3: G6; … ECG1, ECG2; BSG1; ECG3, BSG2; … Figure: network diagram with nodes G1 to G17 and O1 to O9.

  40. Functional maps for functional metagenomics. GOS 4441599.3, Hypersaline Lagoon, Ecuador + KEGG Pathways + Integrated functional interaction networks in 27 species. Mapping organisms into phyla (Env. Organisms; Pathogens). Mapping genes into pathways. Mapping pathways into organisms.

  41. Functional maps for functional metagenomics. Legend: Edges show process association in obesity (Less Coregulated; Baseline (no change); More Coregulated); Nodes show process cohesiveness in obesity (Very Downregulated; Baseline (no change); Very Upregulated).

  42. Current Work: Molecular Mechanisms in a Colorectal Cancer Cohort. With Shuji Ogino, Charlie Fuchs. Cohorts: Health Professionals Follow-Up Study; Nurses' Health Study. Counts: ~3,100 gastrointestinal subjects; ~2,100 cancer mutation tests; ~3,800 tissue samples; ~1,200 LINE-1 methylation; ~1,450 colon cancer samples; ~1,150 CpG island methylation; ~775 gene expression; ~700 TMA immunohistochemistry. • LINE-1 Methylation • Repetitive element making up ~20% of mammalian genomes • Very easy to assay methylation level (%) • Good proxy for whole-genome methylation level • DASL Gene Expression • Gene expression analysis from paraffin blocks • Thanks to Todd Golub, Yujin Hoshida

  43. Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation. Nonnegative matrix factorization (Tumors × Genes) into clusters C1, C2, C3, C4. Gene group labels: Cell cycle regulation; Chr. 19 rearrangement, membrane receptors/channels; Angiogenesis, proliferation; HSC signature; Neural/ESC signature; BRCA interactors, chrom. stability factors.
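
Slide 43 applies nonnegative matrix factorization to a genes × tumors expression matrix. A minimal sketch of the technique, assuming the classic Lee-Seung multiplicative updates for squared error and assigning each tumor to the component with the largest coefficient. The toy matrix, the rank of 2, and all names are hypothetical, and the slide does not say which NMF variant was used.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error of v by the product w * h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for vr, whr in zip(v, wh) for x, y in zip(vr, whr))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor v (genes x tumors) into nonnegative w (genes x rank), h (rank x tumors)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # Update h: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # Update w: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def assign_clusters(h):
    """Assign each tumor (column of h) to its dominant component."""
    return [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]


# Toy expression matrix: 4 genes (rows) x 4 tumors (columns), two rough blocks.
v = [[5, 4, 0, 1], [4, 5, 1, 0], [0, 1, 5, 4], [1, 0, 4, 5]]
initial = nmf(v, 2, iterations=0)
fitted = nmf(v, 2)
clusters = assign_clusters(fitted[1])
```

The multiplicative updates keep every entry nonnegative and do not increase the reconstruction error, which is why the fitted factors reconstruct the matrix better than the random starting point. Results depend on the random seed, so in practice several restarts are usually compared.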

  44. Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation. Figure labels: CD133 + Bcl-X(L); CD44 + CD166; Chr. 19q; BAX; Hematopoietic Stem Cell Signature; Neural Stem Cell Signature; Embryonic Stem Cell Signature; Subramanian et al, 2005. Figure counts: 166, 799, 945, 195, 678, 18, 146, 7, 8, 325. Note that these regulatory programs do not appear to correspond with demographics or common pathologic markers… Testing now for correlation with outcome. • Hypotheses? • Two main pathways to proliferation: • HSC program + BAX • ESC/NSC program • Two main pathways to deregulation: • Angiogenesis + chrom. instability • Cell cycle disruption (MSI?)

  45. Epigenetics of Colorectal Cancer: LINE-1 methylation levels. Lower LINE-1 methylation associates with poor colon cancer prognosis. LINE-1 methylation varies remarkably between individuals… …but it is highly correlated within individuals (ρ = 0.718, p < 0.01). Ogino et al, 2008. What does it all mean? What is the biological mechanism linking LINE-1 methylation to colon cancer?

  46. Epigenetics of Colorectal Cancer: LINE-1 methylation levels. Lower LINE-1 methylation associates with poor colon cancer prognosis. LINE-1 methylation varies remarkably between individuals… …but it is highly correlated within individuals (ρ = 0.718, p < 0.01). Is anything different about these outliers? Ogino et al, 2008. This suggests linkage to a cancer-related pathway. This suggests a copy number variation. This suggests a genetic effect. What is the biological mechanism linking LINE-1 methylation to colon cancer?

  47. Epigenetics of Colorectal Cancer: LINE-1 methylation levels. • Preliminary Data • 10 genes differentially expressed even using simple methods • 1/3 are from the same family with known GI tumor prognostic value • 1/3 are X-chromosome testis/cancer-specific antigens • 1/2 fall in the same cytogenetic band, which is also a known CNV hotspot • HEFalMp links to a cascade of antigens/membrane receptors/TFs • Cell adhesion p-value ≈ 0, moderate correlation in many cancer arrays • GSEA pulls out a wide range of proliferation up (E2F), immune response down; need to regress out prognosis correlates. Check back in a couple of months! What is the biological mechanism linking LINE-1 methylation to colon cancer?
