1 / 64

Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Challenges for metagenomic data analysis and lessons from viral metagenomes [What would you do if sequencing were free?]. Rob Edwards San Diego State University Fellowship for Interpretation of Genomes The Burnham Institute for Medical Research. Outline.

huyen
Download Presentation

Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Challenges for metagenomic data analysis and lessons from viral metagenomes[What would you do if sequencing were free?] Rob Edwards San Diego State University Fellowship for Interpretation of Genomes The Burnham Institute for Medical Research

  2. Outline • How and why we sequence environments • Viral metagenomics • Marine stories • Human stories • Pyrosequencing • Mine story • Is there a Future?

  3. Why Metagenomics? • What is there? • How many are there? • What are they doing?

  4. How do you sequence the environment? • Extract DNA

  5. CsCl step gradient 1.1 g ml-1 1.35 g ml-1 1.5 g ml-1 1.7 g ml-1

  6. CsCl step gradient

  7. How do you sequence the environment? • Extract DNA • Create library

  8. David Mead - Linker-Amplified Shotgun Libraries (LASLs) Soil Extraction Kit This method produces high coverage libraries of over 1 million clones from as little as 1 ng DNA Breitbart (2002) PNAS

  9. How do you sequence the environment? • Extract DNA • Create library • Sequence fragments

  10. Outline • How and why we sequence environments • Viral metagenomics • Marine stories • Human stories • Pyrosequencing • Mine story • Is there a Future?

  11. Why Phages? • Phages are viruses that infect bacteria • 10:1 ratio of phages:bacteria • 1031 phages on the planet • Specific interactions (probably) • one virus : one host • Small genome size • Higher coverage • Horizontal gene transfer • 1025-1028 bp DNA per year in the oceans

  12. Uncultured Viruses 200 liters water 5-500 g fresh fecal matter Concentrate and purify viruses Epifluorescent Microscopy Extract nucleic acids DNA/RNA LASL Sequence

  13. Bioinformatics • BLASTagainst NR • blastx, tblastn, tblastx • BLAST against boutique databases • Complete phage genomes, ACLAME, Other libraries, 16S • Parsing to present data in a useful format

  14. BLAST and Parsing • http://phage.sdsu.edu/blast • Submit BLAST to local and remote databases • Local (as fast as possible) • NCBI (one search every 3 seconds) • Many concurrent searches • One search versus 1,000 searches • Parse data into tables • Access to taxonomy etc

  15. Most Viral Genes are Unknown Known 22% Unknown 78% TBLAST (E<0.001) 3,093 sequences Breitbart (2002) PNAS Rohwer (2003) Cell

  16. GenBank has more than doubled since 2001 … 60 billion base pairs 60 million sequences

  17. GenBank has more than doubled since 2001 …but the fraction of unknowns remains constant Edwards (2005) Nature Rev. Microbiol.

  18. All of the new genes in the databases are coming from environmental sequences

  19. Outline • How and why we sequence environments • Viral metagenomics • Marine stories • Human stories • Pyrosequencing • Mine story • Is there a Future?

  20. Human-associated viruses • More bacteria than somatic cells by at least an order of magnitude • More phages than bacteria by an order of magnitude • Sample the bacteria in the intestine by sampling their phage

  21. Most Viral DNA Sequences in Adult Human Feces are Unknown Phages Eukaryotic Viruses 6% Known 40% Unknown 60% Phages 94% TBLAST (E<0.001) 532 sequences Breitbart (2003) J. Bacteriol.

  22. Adults Versus Babies No bacteria or viruses in 1st fecal sample Abundant bacterial and viral communities by 1 week of age • >108 VLP/g feces

  23. Baby Feces Viruses • Most sequences are unknown (≈70%) • Similarities to phages from Lactococcus, Lactobacillus, Listeria, Streptococcus, and other Gram positive hosts • From microarray studies, sequences are stable in the baby over a 3 month period • Same types of phage as present in adult feces • one identical sequence to an unrelated adult!

  24. DNA viruses in feces are phages. Feces ≠ intestines. RNA viruses?

  25. Most Human RNA Viruses are Known Other Plant Viruses 9% Unknown 8% Other 26% Pepper Mild Mottle Virus 65% Known 92% TBLAST (E<0.001) ≈36,000 sequences Zhang (2006) PLoS Biology

  26. Pepper Mild Mottle Virus (PMMV) • ssRNA virus; ≈6 kb genome • Related to Tobacco Mosaic Virus • Infects members of Capsicum family • Widely distributed – spread through seeds • Fruits are small, malformed, mottled • Rod-shaped virions Viral particles in fecal sample TOBACCO MOSAIC VIRUS http://www.rothamsted.bbsrc.ac.uk/ppi/links/pplinks/virusems/

  27. S1 S2 S3 S4 S5 S6 S7 S8 S9 PMMV PMMV is common in Human Feces Fecal samples Extract total RNA RT-PCR for PMMV San Diego : 78% people are positive Singapore : 67% people are positive 10-50 fold increase in feces compared to food 106-109 PMMV copies per gram dry weight of feces

  28. Which Foods Contain PMMV? Chili powder Chili sauces NOT FOUND IN FRESH PEPPERS Hong Kong chili sauce Hong Kong green chili Pork noodle red chili Vegetarian chili Indian curry Chinese food Chicken rice

  29. Koch’s Postulates Thesunmachine.net http://www.sweatnspice.com

  30. Human microbial metagenome is more important than human genome

  31. Outline • How and why we sequence environments • Viral metagenomics • Marine stories • Human stories • Pyrosequencing • Mine story • Is there a Future?

  32. How do you sequence the environment? • Extract DNA • Create library • Sequence fragments Everything so far from 40,000 sequences This is so 2004

  33. How do you sequence the environment? • Extract DNA • Pyrosequence

  34. 454 Pyrosequencing • DNA extraction from environment • Whole genome amplification } • Emulsion-based PCR • Luciferase-based sequencing SDSU } 454 Inc. Margulies (2005) Nature

  35. 454 Sequence Data(Only from Rohwer Lab) • 21 libraries • 10 microbial, 11 phage • 597,340,328 bp total • 20% of the human genome • 50% of all complete and partial microbial genomes • 5,769,035 sequences • Average 274,716 per library • Average read length 103.5 bp • Av. read length has not increased in 7 months

  36. Growth of sequence data 600 million bp 6 million reads

  37. Cost of sequencing • One reaction = $10,000 • One reaction = 250,000 reads • 250 reads = $10 • 1 read = 4¢ • 1 read = 100bp • 1 bp = 0.04¢ ($400 per 1x 1,000,000 bp) • Sanger sequencing ca. $1/rxn, 0.2¢/bp • real cost ca. $5/rxn, 1¢/bp 454 sequencing does cot require cloning, arraying etc.

  38. Bioinformatics • 597,340,328 bp total • 5,769,035 sequences • 7 months • Existing tools are not sufficient

  39. Current Pipeline http://phage.sdsu.edu/~rob/Pyrosequencing/ • Dereplicate • BLAST against • 16S • Complete phage • nr (SEED) • subsystems

  40. Sequencing is cheap and easy. Bioinformatics is neither.

  41. Outline • How and why we sequence environments • Viral metagenomics • Marine stories • Human stories • Pyrosequencing • Mine story • Is there a Future?

  42. The Soudan Mine, Minnesota Red Stuff Oxidized Black Stuff Reduced

  43. Red and Black Samples Are Different Black stuff Cloned and 454 sequenced 16S are indistinguishable Cloned Red Red

  44. Annotation of metagenomes by subsystems A subsystem is a group of genes that work together • Metabolism • Pathway • Cellular structures • Anything an annotator thinks is interesting

  45. There are different amounts of metabolism in each environment

  46. There are different amounts ofsubstrates in each environment Red Stuff Black Stuff

  47. But are the differences significant? • Sample 10,000 proteins from site 1 • Count frequency of each subsystem • Repeat 20,000 times • Repeat for sample 2 • Combine both samples • Sample 10,000 proteins 20,000 times • Build 95% CI • Compare medians from sites 1 and 2 with 95% CI Rodriguez-Brito (2006). In Review

More Related