
Creating the Genomic Encyclopedia for Bacteria and Archaea

Creating the Genomic Encyclopedia for Bacteria and Archaea. Rick Stevens Eddy Rubin Argonne National Laboratory Joint Genome Institute The University of Chicago Berkeley Lab. Rob Edwards, Jonathan A. Eisen, Ross Overbeek, George Garrity,


Presentation Transcript


  1. Creating the Genomic Encyclopedia for Bacteria and Archaea
     Rick Stevens (Argonne National Laboratory, The University of Chicago); Eddy Rubin (Joint Genome Institute, Berkeley Lab)
     Rob Edwards, Jonathan A. Eisen, Ross Overbeek, George Garrity, Veronika Vonstein, Sveta Gerdes, Folker Meyer, Kevin White, Tim Lilburn, Barney Whitman, et al.

  2. The Basic Idea of the Project
     • To build an enterprise that can take advantage of the expected exponential improvements of sequencing capabilities to sequence “all known” cultured and described prokaryotes
     • Ride the expected “Moore’s law” of sequencing capability
     • To develop a distributed high-throughput “industrial” approach to the cultivation, characterization, sequencing, annotation and analysis of prokaryotic genomes
     • Build a team from groups that have expertise and track records
     • To build and curate a database of genome sequences, metabolic reconstructions, and standardized phenotype assays associated with each target organism
     • Streamline the release of data, provide a foundation for derivative projects

  3. Concept of the Bergey’s/GEBA Sequencing Project
     • A fixed-cost annual investment
     • Each year more can be sequenced as sequencing costs decrease and as cultivation efficiencies improve based on experience
     • Leverage the expected improvement of sequencing costs
     • Address the overall scope within 5 to 6 years
     • Increase the number of near-complete sequences per year
     • Optimize the choice of organisms to maximize diversity at each stage
     • Exploit the Bergey’s Trust and International Committee on Systematics of Prokaryotes for taxonomic coverage (e.g. Garrity and Whitman)
     • Involve the microbiology community for prioritization
     • Industrialize the pipeline
     • Biological Resource Centers to produce and characterize type material
     • DOE JGI, NIAID/DMID Centers, NSF/USDA Centers for sequencing
     • Laboratories for bioinformatics (Argonne, JGI, TIGR, ORNL, etc.)
     • Universities and laboratories for modeling and analysis
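
The slide's goal of choosing organisms "to maximize diversity at each stage" can be illustrated with a small sketch. The slides do not say which selection algorithm the project used, so the code below is only one plausible reading: a greedy farthest-point rule that repeatedly picks the candidate farthest from everything already sequenced. The function name `pick_diverse`, the toy distance matrix, and the idea of feeding it pairwise 16S distances are all assumptions made for illustration.

```python
from typing import List, Sequence


def pick_diverse(dist: Sequence[Sequence[float]],
                 already: List[int],
                 k: int) -> List[int]:
    """Greedily pick up to k organisms, each time taking the one whose
    distance to its nearest already-covered organism is largest.

    dist    -- symmetric pairwise distance matrix (hypothetically 16S-based)
    already -- indices of organisms that are already sequenced
    """
    n = len(dist)
    chosen = set(already)
    # Distance from every organism to the nearest covered one.
    if chosen:
        nearest = [min(dist[i][j] for j in chosen) for i in range(n)]
    else:
        nearest = [float("inf")] * n

    picked: List[int] = []
    for _ in range(min(k, n - len(chosen))):
        best = max((i for i in range(n) if i not in chosen),
                   key=lambda i: nearest[i])
        picked.append(best)
        chosen.add(best)
        # The new pick may now be the nearest covered neighbour of others.
        nearest = [min(nearest[i], dist[i][best]) for i in range(n)]
    return picked


# Toy example: five organisms placed on a line, organism 0 already sequenced.
positions = [0.0, 0.1, 0.5, 0.9, 1.0]
dist = [[abs(a - b) for b in positions] for a in positions]
print(pick_diverse(dist, already=[0], k=2))  # picks the far end, then the middle
```

In the toy example the first pick is the organism farthest from the one already sequenced, and the second is the one that fills the largest remaining gap, which is the behaviour the slide describes in words.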

  4. The Question is not if, but When and How?
     • Why should we want to accelerate this transition?
     • Why not just let it happen as a matter of course?
     • What is in the current sequencing pipeline?

     | Domain    | Completed Genomes | Ongoing/In the Pipeline |
     |-----------|-------------------|-------------------------|
     | Archaeal  | 29                | 56                      |
     | Bacterial | 397               | 991                     |
     | Eukaryal  | 44                | 631                     |

     • The existing process of bottom-up selection of organisms for sequencing is leaving many important groups underrepresented; closure will take a long time
     • There are groups that are well represented in the literature, but not in the sequencing databases
     • Underrepresentation is also an issue in environmental sequencing data

  5. Tapping into prokaryotic biodiversity - Industrial Biotechnology
     • Rapidly growing field
     • By 2010 biocatalysis will be used in production of 60% of fine chemicals (McKinsey analysis)
     • In the US, coordinated by the USDA Biobased Products and Bioenergy Coordination Council (BBCC)
     • Applications:
       • pharmaceuticals
       • food ingredients (sweeteners, vitamins)
       • feed additives and other agrochemicals
       • organic solvents
       • polymer raw materials
       • biofuels
     • Advantages over chemical methods:
       • exquisite substrate specificity
       • excellent chemo-, regio- and stereoselectivity
       • environmentally friendly “green chemistry” based on biorenewables
     • Needed:
       • novel enzymes and pathways
       • “Periodic table” of biochemical transformations
     Straathof et al. 2002. Curr Opin Biotechnol 13:548-56: ~150 compounds are currently produced on industrial scale using biocatalysts.
     Examples: Hans E. Schoemaker, et al. 2003. Science 299:1694-97

  6. Analysis of 1000s of new bacterial genomes will likely yield completely novel pathways and enzymes for industrial applications
     • Examples of recently discovered biocatalytic transformations of novel organic functional groups:
       • Hydroxylaminobenzene mutase
       • Aldoxime dehydratase
       • Azetidine-2-carboxylate hydrolase
       • Benzylsuccinate synthase
       • Phenylboronic acid oxygenase
     • Current approaches to discovery of new enzymes:
       • Screening environmental samples by enrichment cultures (BUT: only <<1% of prokaryotes are currently culturable)
       • Metagenome approach: cloning & expression of DNA samples in a surrogate host, then screening for desired function (BUT: only known functions can be screened for, new biochemistry cannot be discovered)
       • Sequence-based discovery (growing explosively, generating knowledge base for basic sciences and biotechnological applications)
     L.P. Wackett. 2004. Current Opinion in Biotechnology, 15:280–284
     • Still to be discovered: enzymes involved in the biosynthesis or catabolism of approximately 40 naturally occurring chemical functional groups

  7. Building the Case
     • There is a disparity between the literature and the existing genomes
     • We can’t fully exploit the community’s historical knowledge and investments without closing this gap
     • There is a disparity between the rank/abundance curves from 16S studies and from environmental sequencing projects and the existing genomes
     • We can’t fully understand the new datasets without closing this gap (i.e. lack of complete sequence coverage of known culturables is holding back future work)
     • There are likely to be new biochemical pathways and novel enzymes in the set of culturable but unsequenced organisms; sequencing non-cultured organisms to expand diversity
     • These represent the low-hanging fruit for discovery since the investment has already been made in determining culture conditions
     • A comprehensive database produced under controlled conditions that includes phenotype data and genotype data will accelerate research in understanding the genotype-phenotype relationship
     • Genome-scale reconstruction and modeling will be dramatically accelerated by comprehensive databases that include phenotype data

  8. Estimated Sequencing Rates
     [Pipeline diagram; stage labels: Selection of Targets; Produce DNA; Sequencing; Assembly; Rapid Annotation (24 Hours); Database Repository; Phenotype Prediction; Model Generation; Metabolic Reconstruction]

  9. Technical Feasibility FAQ
     • How many genomes would the project propose to sequence?
       • About 5000 over 5-7 years
     • Who would produce the biomass needed for DNA extraction?
       • Type culture centers until enrichment and environmental methods mature
     • Will the biomass/DNA be available for distribution?
       • Yes, both the DNA and the libraries could be stored for distribution
     • What throughput is needed for DNA production?
       • From ~300 taxa per year at the beginning of the project to 2000 per year at the end
     • What combinations of sequencing technologies need to be employed?
       • Sanger and pyrosequencing initially, others as they come online
     • What throughput is needed for annotation?
       • 24-hour turnaround from assembled sequence to initial availability; this has already been achieved at Argonne, TIGR and elsewhere
     • Is it possible to have a standard set of phenotype assays given the broad spectrum of organisms and conditions?
       • We are considering Biolog as a model, but it is too limited
     • How would the genomes be selected and prioritized?
       • At each cycle we choose genomes (e.g. via 16S) to minimize the diversity gaps
       • Community input would be solicited to ensure the project is tracking the community’s interests
     • Is it necessary to “close” the genomes?
       • We think not. Libraries would be archived for groups that might be interested in closing.
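
The FAQ quotes a ramp from ~300 taxa per year to 2000 per year and a total of about 5000 genomes over 5-7 years. A short calculation shows these figures are roughly consistent with each other. The slides do not state the shape of the ramp, so the sketch below assumes constant year-over-year growth (a geometric ramp); the function name `ramp` is hypothetical.

```python
from typing import List


def ramp(start: float, end: float, years: int) -> List[float]:
    """Yearly throughput growing by a constant factor from start to end.

    Assumption: geometric growth. The slides give only the two endpoints.
    """
    if years < 2:
        raise ValueError("need at least two years to define a ramp")
    ratio = (end / start) ** (1.0 / (years - 1))
    return [start * ratio ** y for y in range(years)]


# Endpoints taken from the FAQ: ~300 taxa/yr at the start, 2000/yr at the end.
for years in (5, 6, 7):
    per_year = ramp(300, 2000, years)
    print(years, "years:", [round(x) for x in per_year],
          "total", round(sum(per_year)))
```

Under this assumption the cumulative totals come out at roughly 4800, 5700 and 6600 genomes for 5, 6 and 7 years, which brackets the "about 5000" target. A linear ramp would give somewhat higher totals for the same endpoints.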

  10. The Project Would Provide a Comprehensive Set of Genome Sequences for:
     • Biofuels, and bioproduction of alternative feedstocks
     • Understanding and managing the microbial carbon cycle
     • Soil and subsurface microbial ecology
     • Bioremediation and bioconversion of waste streams
     • Evolution and microbial ecological dynamics
     • Context for environmental sequencing and metagenomics
     • Basis for developing predictive models of phenotypes
     • Source of components for synthetic biology
     • Improving our understanding of cultivability
     • Dramatically improving the reliability and quality of genome annotations

  11. How Many Known Cultured Organisms?
     • The latest version of the Prokaryotic Taxonomic Outline will contain 7951 named species of Bacteria and Archaea.
     • Of these, 178 are non-cultivable or not represented by viable type material.
     • An additional 1222 are synonyms.
     • Of the 6543 type strains for which viable material is reportedly deposited, we have assembled a minimal set of 6389 strains that are available from 16 major public culture collections or biological resource centers in the US, Europe, and Asia.
     • The remaining 154 are in minor or non-public collections.
     • This information is derived from Release 6.1 of the Taxonomic Outline of the Prokaryotes, which will be published in 2007 and is current through May 2006.
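
The counts on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative. Subtracting the non-cultivable species and the synonyms from the 7951 named species gives 6551 rather than the quoted 6543. The slide does not explain the difference of 8, so the code simply reports it instead of assuming a cause.

```python
# Figures as quoted on the slide.
named_species = 7951
not_cultivable = 178
synonyms = 1222
viable_type_strains = 6543      # quoted on the slide, not derived here
in_major_collections = 6389

# Strains left over for minor or non-public collections.
remaining_minor = viable_type_strains - in_major_collections

# Simple subtraction from the named-species total.
by_subtraction = named_species - not_cultivable - synonyms
unexplained = by_subtraction - viable_type_strains

print("minor/non-public collections:", remaining_minor)
print("named minus exclusions:", by_subtraction)
print("difference from quoted 6543:", unexplained)
```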

  12. What Has Been Sequenced or Is in Play
     • Of the 6400 strains available from public sources
     • About 380 are human, animal or plant pathogens
     • On the order of 1/3 to 1/2 of the known pathogens have been sequenced
     • 360 complete prokaryotic genomes published
     • 56 archaeal and 940 bacterial genomes in progress
     • From 897 prokaryotic genomes in progress in GOLD
     • ~400 are pathogens (many duplicate taxa)
     • ~221 are supported by DOE (156 biotech, 51 environment)
     • Approximately 5000 prokaryotes not yet in play
     • We estimate about 4800 non-pathogen taxa

  13. Strain Distribution in Collections

     | Collection | Strains |
     |------------|---------|
     | **US Collections / BRCs** | |
     | American Type Culture Collection (ATCC) | 4027 |
     | USDA ARS Collection (NRRL) | 223 |
     | **European Collections** | |
     | Deutsche Sammlung von Mikroorganismen und Zellkulturen (DSMZ) | 1302 |
     | Culture Collection, University of Göteborg (CCUG) | 183 |
     | Pasteur Institute (CIP) | 170 |
     | Laboratory for Microbiology, Gent (LMG) | 101 |
     | National Collection of Industrial and Marine Bacteria | 25 |
     | French Collection of Phytopathogens (CFBP) | 15 |
     | National Collection of Type Cultures (NCTC) | 12 |
     | National Collection of Phytopathogenic Bacteria | 11 |
     | **Asia** | |
     | Japan Collection of Microorganisms (JCM) | 185 |
     | Institute of Fermentation, Osaka (IFO) | 34 |
     | Korean Collection of Type Cultures (KCTC) | 28 |
     | Institute of Applied Microbiology, Tokyo (IAM) | 26 |
     | National Institute of Technology and Evaluation (NBRC) | 24 |
     | All-Russian Collection of Microorganisms (VKM) | 13 |

  14. Distribution of Genome Sizes in the Pipeline
     Average sequence: ~4 Mbp

  15. Getting Value from the Genomes
     • Genomes would be assembled by the groups doing the sequencing
     • Assembled contigs would be sent to the initial high-throughput annotation server for draft annotations and immediately published online
     • The accumulated (additional) genomes will be used to improve annotations (gene calls, functional coupling)
     • Genomes will be integrated into databases to support comparative analysis and evolutionary analysis
     • Annotated genomes can be used to semi-automatically construct genome-scale models which could be used to make metabolic phenotype predictions

  16. Background
     • Online at http://www.sequencingbergeys.org
     • Login required (just ask us)
     • Guest read-only access after the meeting?
     • Make maximum information available: Bergey hierarchy, NCBI taxonomy, 16S RNA, strain collections, GOLD, SEED

  17. List of organisms for sequencing - based on 16S clusters

  18. Cluster Page: select strain for cluster

  19. “Bergey” Browser

  20. Species Page
