
Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority

Sourav Chatterji, UC Davis Genome Center, schatterji@ucdavis.edu


Presentation Transcript


  1. Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority • Sourav Chatterji • UC Davis Genome Center • schatterji@ucdavis.edu

  2. Background

  3. The Microbial World

  4. Exploring the Microbial World • Culturing • Majority of microbes currently unculturable. • No ecological context. • Molecular Surveys (e.g. 16S rRNA) • “who is out there?” • “what are they doing?”

  5. Environmental Shotgun Sequencing

  6. Interpreting Metagenomic Data • Nature of Metagenomic Data • Mosaic • Fragmentary • New Sequencing Technologies • Enormous amount of data • Short Reads

  7. Overview of Talk • Metagenomic Binning • PhyloMetagenomics • The Big Picture/ Future Work

  8. Overview of Talk • Metagenomic Binning • Background • CompostBin [to appear in RECOMB 2008] • PhyloMetagenomics • The Big Picture

  9. Metagenomic Binning • Classification of sequences by taxa

  10. Current Binning Methods • Assembly • Align with Reference Genome • Database Search [MEGAN, BLAST] • Phylogenetic Analysis • DNA Composition [TETRA, PhyloPythia]

  11. Current Binning Methods • Need closely related reference genomes. • Poor performance on short fragments. • Sanger sequence reads 500-1000 bp long. • Current assembly methods unreliable • Complex Communities Hard to Bin.

  12. Genome Signatures • Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? • Yes [Karlin et al. 1990s] • What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

  13. DNA-composition metrics • The K-mer Frequency Metric • CompostBin uses hexamers

  14. DNA-composition metrics • Working with K-mers for Binning. • Curse of Dimensionality: O(4^K) independent dimensions. • Statistical noise increases with decreasing fragment lengths. • Project data into a lower dimensional space to decrease noise. • Principal Component Analysis.
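A minimal sketch of the K-mer frequency metric, assuming a plain normalized count over the 4^K possible K-mers (K = 6 gives the 4096 hexamer dimensions). The function name `kmer_frequencies` is hypothetical, and any strand symmetrization or other normalization CompostBin may apply is not shown.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over "ACGT".
    """
    alphabet = "ACGT"
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # None for k-mers containing N etc.
        if j is not None:
            counts[j] += 1
            total += 1
    if total == 0:
        return [0.0] * len(index)
    return [c / total for c in counts]
```

Each read becomes one point in this 4^K-dimensional space; a short read contributes few K-mers to many dimensions, which is the noise problem the slide describes.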

  15. PCA separates species • Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]

  16. Effect of Skewed Relative Abundance • Panels: Abundance 1:1, Abundance 20:1 • B. anthracis and L. monocytogenes

  17. A Weighting Scheme • For each read, find overlap with other sequences

  18. A Weighting Scheme • Calculate the redundancy of each position (figure example: 4, 5, 5, 3). • Weight is the inverse of the average redundancy.
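A minimal sketch of the weighting step, under two assumptions that are not stated on the slides: overlaps have already been found and are given as half-open intervals in the read's own coordinates, and the read itself counts once toward the redundancy of each of its positions. The name `read_weight` is hypothetical.

```python
def read_weight(read_length, overlaps):
    """Weight of a read = 1 / (average redundancy over its positions).

    overlaps: list of (start, end) half-open intervals, in read coordinates,
    each marking the part of this read covered by one other sequence.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    redundancy = [1] * read_length  # assumption: the read covers itself once
    for start, end in overlaps:
        for pos in range(max(0, start), min(read_length, end)):
            redundancy[pos] += 1
    average = sum(redundancy) / read_length
    return 1.0 / average
```

Reads from an abundant species overlap many other reads and so receive small weights, which is what keeps them from dominating the projection.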

  19. Weighted PCA • Calculate the weighted mean $\mu_w = \sum_{i=1}^{N} w_i X_i$. • Calculate the weighted covariance matrix $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$. • Principal components are the eigenvectors of $M_w$. • Use the first three PCs for further analysis.
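The weighted PCA steps on this slide can be sketched with NumPy as follows. This is an illustration, not the CompostBin implementation: the function name `weighted_pca` is hypothetical, and normalizing the weights to sum to 1 is an assumption made here so that the weighted mean is a proper average.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the top eigenvectors of the weighted covariance.

    X: (N, D) array, one feature vector (e.g. K-mer frequencies) per read.
    w: (N,) array of non-negative read weights.
    Returns (projected points, principal components, weighted mean).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # assumption: weights sum to 1
    mu = w @ X                           # weighted mean
    centered = X - mu
    M = (centered * w[:, None]).T @ centered   # weighted covariance matrix
    eigvals, eigvecs = np.linalg.eigh(M)       # M is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]       # columns = principal components
    return centered @ components, components, mu
```

With uniform weights this reduces to ordinary PCA up to a constant scaling of the covariance matrix.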

  20. Weighted PCA separates species • Panels: PCA, Weighted PCA • B. anthracis and L. monocytogenes, abundance 20:1

  21. Un-supervised Classification

  22. Semi-Supervised Classification • 31 Marker Genes [courtesy Martin Wu] • Omni-present • Relatively Immune to Lateral Gene Transfer • Reads containing these marker genes can be classified with high reliability.

  23. Semi-supervised Classification • Use a semi-supervised version of the normalized cut algorithm

  24. The Semi-supervised Normalized Cut Algorithm • Calculate the K-nearest neighbor graph (KNN-graph) from the point set. • Update the KNN-graph with information from marker genes. • Bisect the graph using the normalized-cut algorithm.
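The three steps above can be sketched as follows. Several choices here are assumptions rather than details from the talk: the graph uses unweighted, symmetrized K-nearest-neighbor edges; marker-gene information is applied by adding an edge between reads with the same label and removing edges between reads with different labels; and the bisection uses the standard spectral relaxation of the normalized cut (second eigenvector of the normalized Laplacian, split at zero). The name `semi_supervised_ncut` is hypothetical.

```python
import numpy as np


def semi_supervised_ncut(points, k=3, labels=None):
    """Bisect a point set; returns a boolean array giving each point's side.

    points: (N, D) array-like of projected reads.
    labels: optional dict {point index: taxon label} from marker genes.
    """
    X = np.asarray(points, dtype=float)
    n = len(X)
    labels = labels or {}

    # Step 1: K-nearest-neighbor graph (symmetrized, unweighted).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d[i])[:k]] = 1.0
    W = np.maximum(W, W.T)

    # Step 2: update the graph with marker-gene labels (assumed scheme).
    for i, li in labels.items():
        for j, lj in labels.items():
            if i != j:
                W[i, j] = 1.0 if li == lj else 0.0

    # Step 3: spectral relaxation of the normalized cut.
    deg = W.sum(axis=1)
    deg[deg == 0] = 1e-12                # guard against isolated points
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return fiedler >= 0
```

Splitting at zero is the simplest threshold; it assumes the graph is connected, and which side is labeled `True` is arbitrary because the sign of an eigenvector is arbitrary.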

  25. Generalization to multiple bins • Apply algorithm recursively. • Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

  26. Generalization to multiple bins • Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

  27. Testing • Simulate Metagenomic Sequencing • Variables • Number of species • Relative abundance • GC content • Phylogenetic Diversity • Test on a “real” dataset where answer is well-established.
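A minimal sketch of simulated metagenomic sequencing for such tests. The sampling model is an assumption: a read is drawn from a genome with probability proportional to its relative abundance times its length, start positions are uniform, and no sequencing errors are modeled. The name `simulate_reads` and its parameters are hypothetical.

```python
import random


def simulate_reads(genomes, abundances, n_reads, read_length=1000, seed=0):
    """Sample labeled reads from a mixture of genomes.

    genomes: dict {name: sequence string}, each at least read_length long.
    abundances: dict {name: relative abundance (cell count ratio)}.
    Returns a list of (source name, read sequence) pairs.
    """
    rng = random.Random(seed)
    names = list(genomes)
    # DNA contributed by a species scales with abundance * genome length.
    weights = [abundances[name] * len(genomes[name]) for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = genomes[name]
        start = rng.randrange(len(genome) - read_length + 1)
        reads.append((name, genome[start:start + read_length]))
    return reads
```

Keeping the source name with each read gives the ground truth needed to score a binning result; varying `abundances` reproduces the 1:1 versus 20:1 scenarios.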

  28. Future Directions • Holy Grail : Complex Communities • Semi-supervised methods • More marker genes • Semi-supervised projection? • Hybrid Methods • Assembly Information • Population Genetic Information

  29. Overview of Talk • Metagenomic Binning • Phylo-Metagenomics • Applications • Incorporating Alignment Accuracy • The Big Picture/ Future Work

  30. Population Structure of Communities Garcia Martin et al., Nat. Biotechnology (2006)

  31. Gene Family Characterization Yooseph et al., PLoS Biology (2007)

  32. Manual Masking • Requires skilled and tedious manual intervention • Subjective and non-reproducible • Impractical for high-throughput data • Frequently ignored: “garbage in, garbage out”

  33. Gblocks

  34. Probabilistic Masking using pair-HMMs • Probabilistic formulation of alignment problem. • Can answer additional questions • Alignment Reliability • Sub-optimal Alignments Durbin et al., Cambridge University Press (1998)

  35. Probabilistic Masking • What is the probability that residues $x_i$ and $y_j$ are homologous? • Posterior probability that residues $x_i$ and $y_j$ are homologous: $\Pr[x_i \diamond y_j] = \frac{\Pr[x_i \diamond y_j, x, y]}{\Pr[x, y]}$ • Can be calculated efficiently for all pairs (and gaps) in quadratic time.
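The quadratic-time calculation can be sketched with the forward and backward algorithms on the basic three-state pair-HMM (match, gap in x, gap in y) described in Durbin et al. All parameter values below (`delta`, `eps`, `tau`, `p_match`) and the uniform DNA emission model are illustrative assumptions, as is the function name `match_posteriors`; probabilities are kept in linear space, so this is only suitable for short sequences.

```python
def match_posteriors(x, y, delta=0.1, eps=0.3, tau=0.05, p_match=0.6):
    """post[i][j] = posterior probability that x[i] is aligned to y[j]."""
    q = 0.25                                   # gap-state emission (uniform)

    def p(a, b):                               # match-state pair emission
        return p_match / 4 if a == b else (1 - p_match) / 12

    n, m = len(x), len(y)
    a_mm = 1 - 2 * delta - tau                 # match -> match
    a_gm = 1 - eps - tau                       # gap -> match

    def grid():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass; the begin state behaves like the match state.
    fM, fX, fY = grid(), grid(), grid()
    fM[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    a_mm * fM[i - 1][j - 1]
                    + a_gm * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    total = tau * (fM[n][m] + fX[n][m] + fY[n][m])   # Pr[x, y]

    # Backward pass.
    bM, bX, bY = grid(), grid(), grid()
    bM[n][m] = bX[n][m] = bY[n][m] = tau
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            diag = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            down = q * bX[i + 1][j] if i < n else 0.0
            right = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = a_mm * diag + delta * (down + right)
            bX[i][j] = a_gm * diag + eps * down
            bY[i][j] = a_gm * diag + eps * right

    # Posterior = joint probability of the pair and the sequences / Pr[x, y].
    return [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]
```

Both passes fill three (n+1) by (m+1) tables, which is where the quadratic running time comes from. A production version would work in log space or rescale to avoid underflow.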

  36. Scoring Multiple Alignments • Calculate the “posterior probability matrix” and distances $d_{ij}$ between every pair of sequences. • Weighted “sum of pairs” score for column $r$: $\frac{\sum_{i,j} d_{ij} \Pr[r_i \diamond r_j]}{\sum_{i,j} d_{ij}}$
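A small sketch of the weighted sum-of-pairs column score, assuming the pairwise posteriors for the column and the pairwise distances are already available as dictionaries keyed by sequence-index pairs. The name `column_score` is hypothetical, and how pairs involving a gap are treated is left to the caller, since the slide does not specify it.

```python
def column_score(pair_posteriors, distances):
    """Distance-weighted average of pairwise posteriors for one column.

    pair_posteriors: {(i, j): Pr[r_i and r_j are homologous]} for the
        sequence pairs (i, j) that have residues in this column.
    distances: {(i, j): d_ij} for at least the same pairs.
    """
    numerator = sum(distances[pair] * prob
                    for pair, prob in pair_posteriors.items())
    denominator = sum(distances[pair] for pair in pair_posteriors)
    return numerator / denominator if denominator > 0 else 0.0
```

For example, with posteriors 1.0 and 0.5 and distances 1.0 and 3.0, the score is (1.0 + 1.5) / 4.0 = 0.625; the more distant pair counts more.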

  37. Testing The Balibase 3.0 Benchmark Database

  38. Testing • Realign sequences using MSA programs such as ClustalW. • Sensitivity: of all correctly aligned columns, the fraction masked as good. • Specificity: of all incorrectly aligned columns, the fraction masked as bad.
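The two measures defined above can be computed as in this sketch, assuming each alignment column has a ground-truth flag (correct or not, from the benchmark reference) and a mask flag (kept as good or not). The name `masking_performance` is hypothetical.

```python
def masking_performance(correct, masked_good):
    """Return (sensitivity, specificity) of a column-masking method.

    correct: list of bools, True if the column is correctly aligned.
    masked_good: list of bools, True if the mask keeps the column as good.
    """
    if len(correct) != len(masked_good):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(correct, masked_good))
    n_correct = sum(1 for c, _ in pairs if c)
    n_incorrect = len(pairs) - n_correct
    kept_correct = sum(1 for c, g in pairs if c and g)
    dropped_incorrect = sum(1 for c, g in pairs if not c and not g)
    sensitivity = kept_correct / n_correct if n_correct else float("nan")
    specificity = dropped_incorrect / n_incorrect if n_incorrect else float("nan")
    return sensitivity, specificity
```

A mask that keeps every column has sensitivity 1 and specificity 0, so the two numbers need to be read together.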

  39. Performance • Prob Mask: sensitivity 97%, specificity 93% • Gblocks: sensitivity 53%, specificity 94%

  40. The Final Result A Phylogenetic Database/Pipeline (with Martin Wu)

  41. Overview of Talk • Metagenomic Binning • Phylo-Metagenomics • The Big Picture/ Future Work

  42. Population Structure • How to integrate information from multiple markers? Venter et al., Science (2004)

  43. Species-species Interactions

  44. Interactions in Microbial Communities

  45. Time Series Data Ruan et al., Bioinformatics (2006)

  46. Interaction Networks in Microbial Communities Ruan et al., Bioinformatics (2006)

  47. Functional Profiling • Prediction of Metabolic Pathway • Prediction of Gene Function

  48. Functional Profiling (with Binning) McCutcheon and Moran, PNAS (2007)
