CompostBin: A DNA composition based metagenomic binning algorithm
Presentation Transcript

  1. CompostBin: A DNA composition based metagenomic binning algorithm • Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen • UC Davis • schatterji@ucdavis.edu

  2. Overview of Talk • Metagenomics and the binning problem. • CompostBin

  3. The Microbial World

  4. Exploring the Microbial World • Culturing • Majority of microbes currently unculturable. • No ecological context. • Molecular Surveys (e.g. 16S rRNA) • “who is out there?” • “what are they doing?”

  5. Metagenomics

  6. Interpreting Metagenomic Data • Nature of Metagenomic Data • Mosaic • Intraspecies polymorphism • Fragmentary • New Sequencing Technologies • Enormous amount of data • Short Reads

  7. Metagenomic Binning • Classification of sequences by taxa

  8. Binning in Action • Glassy Winged Sharpshooter (Homalodisca coagulata). • Feeds on plant xylem (poor in organic nutrients). • Microbial Endosymbionts

  9. Current Binning Methods • Assembly • Align with Reference Genome • Database Search [MEGAN, BLAST] • Phylogenetic Analysis • DNA Composition [TETRA, PhyloPythia]

  10. Current Binning Methods • Need closely related reference genomes. • Poor performance on short fragments. • Sanger sequence reads 500-1000 bp long. • Current assembly methods unreliable • Complex Communities Hard to Bin.

  11. Overview of Talk • Metagenomics and the binning problem. • CompostBin

  12. Genome Signatures • Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? • Yes [Karlin et al. 1990s] • What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

  13. Imperfect World • Horizontal Gene Transfer • Recent Estimates [Ge et al. 2005] • Varies between 0-6% of genes. • Typically ~2%. • But… • Amelioration

  14. DNA-composition metrics • The K-mer Frequency Metric • CompostBin uses hexamers
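
A minimal sketch of a K-mer frequency vector, for illustration only. The function name `kmer_frequencies`, the normalization to relative frequencies, and the choice to skip K-mers containing ambiguous bases are assumptions; the slides state only that CompostBin uses hexamers (K = 6).

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k=6):
    """Return the relative frequency of every k-mer over ACGT in seq."""
    # Index k-mers in lexicographic order: AAAAAA, AAAAAC, ...
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # None for k-mers with N or other symbols
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    # Assumption: normalize so that reads of different length are comparable.
    return counts / total if total > 0 else counts


# A read is mapped to a point in 4**6 = 4096 dimensions.
read_vector = kmer_frequencies("ACGTTGCAACGTAGCTAGCTAGGCTA")
```

Each read becomes one row of a feature matrix, which is the input to the dimensionality reduction step.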

  15. DNA-composition metrics • Working with K-mers for Binning. • Curse of Dimensionality: O(4^K) independent dimensions. • Statistical noise increases with decreasing fragment lengths. • Project data into a lower dimensional space to decrease noise. • Principal Component Analysis.
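
The projection step can be sketched with ordinary (unweighted) PCA. This is a generic illustration, not the CompostBin implementation; the function name `pca_project`, the `n_components` parameter, and the synthetic input matrix are assumptions.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project the rows of X onto their first principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


# Synthetic stand-in for a (reads x K-mer frequencies) matrix.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 16))
low_dim = pca_project(features, n_components=3)
```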

  16. PCA separates species • Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]

  17. Effect of Skewed Relative Abundance • Panels: Abundance 20:1 and Abundance 1:1 • B. anthracis and L. monocytogenes

  18. A Weighting Scheme • For each read, find overlap with other sequences.

  19. A Weighting Scheme • Calculate the redundancy of each position (figure example: 4, 5, 5, 3). • Weight is inverse of average redundancy.
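
A small sketch of the weighting rule, under stated assumptions. The helper names `position_redundancy` and `read_weight` are hypothetical, overlaps are assumed to be already known as half-open intervals in read coordinates, and counting the read itself toward redundancy is an assumption the slides do not settle.

```python
import numpy as np


def position_redundancy(read_length, overlaps):
    """Per-position redundancy of a read.

    overlaps: (start, end) half-open intervals, in read coordinates,
    covered by other reads.
    """
    depth = np.ones(read_length)  # assumption: the read counts itself once
    for start, end in overlaps:
        depth[max(start, 0):min(end, read_length)] += 1
    return depth


def read_weight(redundancy):
    """Weight is the inverse of the average redundancy."""
    return 1.0 / np.mean(redundancy)


# The redundancies shown on the slide: 4, 5, 5, 3 (average 4.25).
w = read_weight([4, 5, 5, 3])
```

Reads from an abundant species overlap many other reads and so receive small weights, which offsets the skew in relative abundance.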

  20. Weighted PCA • Calculate weighted mean $\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$. • Calculate weighted covariance matrix $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$. • PCs are eigenvectors of $M_w$. • Use first three PCs for further analysis.
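
The weighted PCA steps on this slide can be sketched as follows. The function name `weighted_pca` and the synthetic data are assumptions, and the mean is divided by N as on the slide, which gives a true weighted average only if the weights are scaled to sum to N.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Weighted PCA: rows of X are reads, w holds one weight per read."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    N = X.shape[0]
    mu_w = (w[:, None] * X).sum(axis=0) / N  # weighted mean
    Xc = X - mu_w
    M_w = (w[:, None] * Xc).T @ Xc           # sum_i w_i (X_i - mu)(X_i - mu)^T
    eigvals, eigvecs = np.linalg.eigh(M_w)   # M_w is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]
    pcs = eigvecs[:, order]                  # columns are the leading PCs
    return Xc @ pcs, pcs


rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 6))
w_demo = np.ones(40)  # uniform weights reduce to ordinary PCA
projected, components = weighted_pca(X_demo, w_demo)
```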

  21. Weighted PCA separates species • Panels: PCA and Weighted PCA • B. anthracis and L. monocytogenes, abundance 20:1

  22. Un-supervised Classification ?

  23. Semi-Supervised Classification • 31 Marker Genes [courtesy Martin Wu] • Omni-present • Relatively Immune to Lateral Gene Transfer • Reads containing these marker genes can be classified with high reliability.

  24. Semi-supervised Classification • Use a semi-supervised version of the normalized cut algorithm.

  25. The Semi-supervised Normalized Cut Algorithm • Calculate the K-nearest neighbor graph from the point set. • Update graph with marker information. • If two nodes are from the same species, add an edge between them. • If two nodes are from different species, remove any edge between them. • Bisect the graph using the normalized-cut algorithm.
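
The steps above can be sketched as below. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, edges are unweighted, and the bisection uses the common spectral relaxation of the normalized cut (second eigenvector of the normalized Laplacian, split at zero), which also assumes the graph is connected.

```python
import numpy as np


def knn_graph(points, k):
    """Symmetric, unweighted K-nearest-neighbor adjacency matrix."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    W = np.zeros_like(d)
    for i in range(len(points)):
        W[i, np.argsort(d[i])[:k]] = 1.0
    return np.maximum(W, W.T)


def apply_markers(W, labels):
    """labels maps node index -> species for reads that carry marker genes."""
    W = W.copy()
    for a in labels:
        for b in labels:
            if a != b:
                # Same species: add an edge. Different species: remove it.
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    return W


def normalized_cut_bisect(W):
    """Boolean mask splitting the nodes into two groups."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)      # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]    # second-smallest eigenvector
    return fiedler > 0                   # assumption: threshold at zero


# Toy point set standing in for reads projected onto the first PCs.
rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(4, 1, (20, 3))])
graph = apply_markers(knn_graph(pts, k=6), {0: "A", 1: "A", 20: "B"})
mask = normalized_cut_bisect(graph)
```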

  26. Generalization to multiple bins • Apply algorithm recursively. • Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
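
Recursive application of a bisection can be sketched generically. The slides do not say when the recursion stops, so `should_split` is a hypothetical, caller-supplied stopping rule, and the one-dimensional toy `bisect` stands in for the normalized-cut bisection.

```python
import numpy as np


def recursive_bins(points, bisect, should_split, indices=None):
    """Split points into bins by applying bisect recursively.

    bisect(subset) returns a boolean mask over the subset;
    should_split(subset) is an assumed stopping criterion.
    """
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) < 2 or not should_split(points[indices]):
        return [indices]
    mask = bisect(points[indices])
    inside, outside = indices[mask], indices[~mask]
    if len(inside) == 0 or len(outside) == 0:
        return [indices]  # degenerate cut: keep the bin whole
    return (recursive_bins(points, bisect, should_split, inside)
            + recursive_bins(points, bisect, should_split, outside))


# Toy 1-D example with three well-separated groups.
values = np.array([0.0, 0.1, 5.0, 5.1, 20.0, 20.1])
bins = recursive_bins(
    values,
    bisect=lambda v: v > v.mean(),
    should_split=lambda v: v.max() - v.min() > 1.0,
)
```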

  27. Generalization to multiple bins • Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

  28. Testing • Simulate Metagenomic Sequencing • Sanger Reads • Variables • Number of species • Relative abundance • GC content • Phylogenetic Diversity • Test on a “real” dataset where answer is well-established.
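
A rough sketch of how such a simulation might look. Everything here is an assumption made for illustration: random sequences with a chosen GC content stand in for real genomes, genomes are taken to be of equal size so that abundance maps directly to read counts, and read lengths are drawn uniformly from the 500-1000 bp Sanger range mentioned earlier in the talk.

```python
import random


def random_genome(length, gc, rng):
    """Random sequence with the given expected GC fraction."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def simulate_reads(genomes, abundance, n_reads, rng, min_len=500, max_len=1000):
    """Sample (species, read) pairs in proportion to relative abundance."""
    names = list(genomes)
    weights = [abundance[name] for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights)[0]
        genome = genomes[name]
        length = rng.randint(min_len, max_len)  # inclusive bounds
        start = rng.randrange(len(genome) - length + 1)
        reads.append((name, genome[start:start + length]))
    return reads


rng = random.Random(0)
genomes = {
    "species_a": random_genome(20000, 0.61, rng),
    "species_b": random_genome(20000, 0.65, rng),
}
# Skewed relative abundance of 20:1, as in the weighting example.
reads = simulate_reads(genomes, {"species_a": 20, "species_b": 1}, 200, rng)
```

Because each read keeps its source label, binning accuracy can be scored directly against the known answer.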

  29. Results

  30. Conclusions/Future Directions • Satisfactory performance • No Training on Existing Genomes • Sanger Reads • Low number of Species • Future Work • Holy Grail: Complex Communities • Semi-supervised projection? • Hybrid Assembly/Binning

  31. Acknowledgements • UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann • UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan • Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff