
Steven L. Salzberg The Institute for Genomic Research and Johns Hopkins University

Data Management in a High-Throughput, Science-based Genome Center NIGMS Protein Structure Initiative Workshop on Data Management. Steven L. Salzberg The Institute for Genomic Research and Johns Hopkins University. How can you run 50 projects in parallel and: Maintain production



Presentation Transcript


  1. Data Management in a High-Throughput, Science-based Genome Center. NIGMS Protein Structure Initiative Workshop on Data Management. Steven L. Salzberg, The Institute for Genomic Research and Johns Hopkins University

  2. How can you run 50 projects in parallel and: • Maintain production • Generate consistent, high-quality data • Share data and software with the scientific community • Publish research of the highest quality • Adapt quickly to new technologies

  3. Genomes completed and published by TIGR and our collaborators, 1995-present
     Organism | Reference
     Arabidopsis thaliana | Lin et al., Nature 402: 761-8 (2000)
     Archaeoglobus fulgidus | Klenk et al., Nature 390: 364-370 (1997)
     Bacillus anthracis Ames | Read et al., Nature 423: 81-86 (2003)
     Bacillus anthracis Florida | Read et al., Science 296: 2028-33 (2002)
     Borrelia burgdorferi | Fraser et al., Nature 390: 580-586 (1997)
     Brucella suis | Paulsen et al., PNAS 99 (2002)
     Caulobacter crescentus | Nierman et al., PNAS 98 (2001)
     Chlamydia pneumoniae | Read et al., Nucl. Acids Res. 28 (2000)
     Chlamydia muridarum | Read et al., Nucl. Acids Res. 28 (2000)
     Chlamydophila caviae | Read et al., Nucl. Acids Res. 31 (2003)
     Chlorobium tepidum | Eisen et al., PNAS 99: 9509-9514 (2002)
     Coxiella burnetii RSA 493 | Seshadri et al., PNAS 100: 5455-60 (2003)
     Deinococcus radiodurans | White et al., Science 286 (1999)
     Enterococcus faecalis | Paulsen et al., Science 299: 2071-2074 (2003)
     Haemophilus influenzae | Fleischmann et al., Science 269 (1995)
     Helicobacter pylori | Tomb et al., Nature 388: 539-547 (1997)
     Methanococcus jannaschii | Bult et al., Science 273: 1058-1073 (1996)
     Mycobacterium tuberculosis | Fleischmann et al., J. Bact. 184 (2002)
     Mycoplasma genitalium | Fraser et al., Science 270: 397-403 (1995)
     Neisseria meningitidis | Tettelin et al., Science 287 (2000)
     Oryza sativa (rice) chr 10 | Wing et al., Science 300: 1566-1569 (2003)
     Plasmodium falciparum | Gardner et al., Nature 419: 531-534 (2002)
     Plasmodium yoelii | Carlton et al., Nature 419: 512-519 (2002)
     Porphyromonas gingivalis | Nelson et al., J. Bact., in revision
     Pseudomonas putida | Nelson et al., Envir. Microbiol. (2002)
     Shewanella oneidensis | Heidelberg et al., Nat. Biotech. 20 (2002)
     Streptococcus agalactiae | Tettelin et al., PNAS 99 (2002)
     Streptococcus pneumoniae | Tettelin et al., Science 293 (2001)
     Sulfolobus islandicus virus | Arnold et al., Virology 15: 252-66 (2000)
     Thermotoga maritima | Nelson et al., Nature 399: 323-329 (1999)
     Treponema pallidum | Fraser et al., Science 281: 375-388 (1998)
     Vibrio cholerae | Heidelberg et al., Nature 406 (2000)

  4. Genomes in progress or recently completed: Acidithiobacillus ferrooxidans; Bacillus anthracis Kruger B; Burkholderia mallei; Clostridium perfringens ATCC13124; Dehalococcoides ethenogenes; Desulfovibrio vulgaris; Ehrlichia chaffeensis; Ehrlichia sennetsu; Geobacter sulfurreducens; Listeria monocytogenes; Methylococcus capsulatus; Mycobacterium avium 104; Mycobacterium smegmatis; Pseudomonas syringae; Staphylococcus aureus; Staphylococcus epidermidis; Treponema denticola; Wolbachia sp.; Anaplasma phagocytophila; Bacillus cereus 10987; Bacteroides forsythes; Brucella ovis; Baumannia cicadellinicola; Campylobacter jejuni; Carboxydothermus hydrogenoformans; Colwellia sp. 34H; Dichelobacter nodosus; Fibrobacter succinogenes; Prevotella intermedia; Pseudomonas fluorescens; Silicibacter pomeroyi DSS-3; Streptococcus agalactiae A909; Streptococcus gordonii; Streptococcus mitis; Streptococcus pneumoniae 670; Acidobacterium capsulatum; Bacillus anthracis A01055; Bacillus anthracis A0402; Bacillus anthracis Ames 0581; Burkholderia thailandensis; Campylobacter coli RM2228; Campylobacter upsaliensis RM3195; Clostridium perfringens SM101; Epulopiscium fishelonii; Hyphomonas neptunium; Listeria monocytogenes F6854; Listeria monocytogenes H7858; Mycoplasma arthritidis; Mycoplasma capricolum; Myxococcus xanthus; Prevotella ruminicola; Pyrococcus furiosus; Verrucomicrobium spinosum; Actinomyces naeslundii; Bacillus anthracis A0071; Bacillus anthracis Kruger B; Erwinia chrysanthemi; Gemmata obscuriglobus; Mycobacterium tuberculosis; Ruminococcus albus; Streptococcus sobrinus; Aspergillus fumigatus; Brugia malayi; Coccidioides immitis; Cryptococcus neoformans; Entamoeba histolytica; Oryza sativa Chromosome 3 & 10; Plasmodium vivax; Schistosoma mansoni; Solanum spp.; Tetrahymena thermophila; Toxoplasma gondii; Theileria parva; Trichomonas vaginalis; Trypanosoma brucei; Trypanosoma cruzi

  5. A Whole-Genome Shotgun Sequencing Project • Shotgun sequencing: Library construction, Colony picking, Template preparation, Sequencing reactions, Base calling, Sequence files • Genome Assembly: Assembler -> Genome scaffold, Ordered contig set, Gap closure (Combinatorial PCR, POMP), sequence editing, Re-assembly; ONE ASSEMBLY! (per molecule) • Annotation: Gene finding, Homology searches, Function assignments, Metabolic pathways, Gene families, Comparative genomics, Transcriptional/translational regulatory elements, Repetitive sequences • Data release: Publication, www.tigr.org • Downstream research: Microarray studies, Vaccine and drug development, Human disease studies • Also labeled on the diagram: LIMS entry point

  6. Sequence Data Management • Professional software engineers (in continual contact with lab staff) • Separate research staff • Computational research, separate from the production “pipeline” (genome assembly, gene finding, sequence alignment) • Biology/genomics research staff

  7. Joint Technology Center • TIGR doubled its sequencing capacity in a 2-month period, Dec-Jan 2002-3 • We moved our entire facility to a new building and tripled its capacity in June-July 2003 • All databases, network connections, LIMS software continued operating smoothly throughout

  8. Sequence LIMS Processes at TIGR • Items tracked: Colony Plate, Culture Plate, DNA Plate, Reaction Plate, Chromatogram Files • Instrument: DNA Sequencer (ABI 3730xl)
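The chain of tracked items on this slide can be sketched as a small lineage model. This is only an illustration, assuming the items form a linear chain in the order listed on the slide; the class, function, and ID names are hypothetical and are not taken from TIGR's LIMS.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stage order as listed on the slide; treating it as a strict linear
# chain is an assumption made for this sketch.
STAGES = [
    "colony_plate",
    "culture_plate",
    "dna_plate",
    "reaction_plate",
    "chromatogram_file",
]


@dataclass
class Item:
    """One tracked item (hypothetical structure, not TIGR's schema)."""
    item_id: str
    stage: str
    parent: Optional["Item"] = None


def derive(parent: Item, item_id: str) -> Item:
    """Create the next-stage item from its parent, enforcing stage order."""
    idx = STAGES.index(parent.stage)
    if idx + 1 >= len(STAGES):
        raise ValueError(f"{parent.stage} is the last stage")
    return Item(item_id, STAGES[idx + 1], parent)


def lineage(item: Item) -> List[str]:
    """Walk back to the first stage, returning IDs from oldest to newest."""
    chain = []
    node: Optional[Item] = item
    while node is not None:
        chain.append(node.item_id)
        node = node.parent
    return chain[::-1]


# Made-up identifiers for illustration.
colony = Item("CP001", "colony_plate")
culture = derive(colony, "CU001")
dna = derive(culture, "DN001")
reaction = derive(dna, "RX001")
chromatogram = derive(reaction, "RX001_A01")
print(lineage(chromatogram))
```

Keeping a parent link on every item means any chromatogram can be traced back to its colony plate, which is the kind of question a sequencing LIMS has to answer when a read fails QC.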

  9. LIMS-Database Interactions at TIGR (circa 2001) • One database per sequencing project • Tools shown: Tracker (Create/Edit Rxn Sheet, Create Gel Sheet), Gel Sheet Maker, Uploader, Map, Ricky • Tables shown: library, template, sample, reaction, gel, sequence, feature, bases

  10. Finishing Center – Sequencing Center Data Interchange (mid-2003) • Exchanged between the Finishing Center (FC) and the Sequencing Center (SC): DNA, Data, QC, Reaction Lists • Data: Reads (bases; quality; chromatograms + positions; revision; trimming info; insert mapping/pairing; chemistry, read end, etc.), Library info (size estimators), Vectors used • QC: Yield, Randomness, Percent good quality, Percent contaminant • Reaction Lists (on existing clones): Insert Id, Primer
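Three of the QC quantities named on this slide (yield, percent good quality, percent contaminant) could be computed along the following lines. The slide does not define them, so the definitions, thresholds, and field names below are assumptions chosen for illustration; randomness is left out because no definition can be inferred from the slide.

```python
from typing import Dict, List


def qc_summary(
    reads: List[dict],
    attempted: int,
    q_threshold: int = 20,      # assumed quality-score cutoff for a "good" base
    min_good_bases: int = 100,  # assumed minimum good bases for a "good" read
) -> Dict[str, float]:
    """Summarize a batch of reads.

    Each read is a dict with hypothetical keys:
      "quals"       - list of per-base quality scores
      "contaminant" - True if the read matched a contaminant screen
    """
    n = len(reads)
    good = sum(
        1
        for r in reads
        if sum(q >= q_threshold for q in r["quals"]) >= min_good_bases
    )
    contaminated = sum(1 for r in reads if r["contaminant"])
    return {
        # Reads obtained as a percentage of reactions attempted.
        "yield_pct": 100.0 * n / attempted if attempted else 0.0,
        "good_quality_pct": 100.0 * good / n if n else 0.0,
        "contaminant_pct": 100.0 * contaminated / n if n else 0.0,
    }


# Made-up batch: 5 reactions attempted, 4 reads returned.
batch = [
    {"quals": [30] * 150, "contaminant": False},
    {"quals": [30] * 120 + [10] * 30, "contaminant": False},
    {"quals": [10] * 200, "contaminant": True},
    {"quals": [25] * 90, "contaminant": False},
]
summary = qc_summary(batch, attempted=5)
print(summary)
```

Reporting the metrics per library or per plate, rather than per project, would make it easier for the two centers to spot a single bad batch.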

  11. IT Support • High-quality computers and systems support is absolutely critical • At the same time, IT support should be invisible (ideally) • TIGR has 15 full-time, professional IT staff: systems administrators, database administrators, web administrators, network administrators, desktop support

  12. IT Infrastructure • 10 Compaq Alpha ES40s, max 32 GB RAM (high-end computing) • 15 UltraSPARC and SunFire servers (database and web services) • 400 Pentium-based Linux computers (grid computing) • Gigabit backbone network • Network-attached file storage (NetApp, EMC)

  13. IT example: Grid computing facility [Chart: pool compute cycles vs. owner compute cycles; January 2001, 1-week snapshot]

  14. High-throughput, automated annotation • 10 bioinformatics engineers maintain software pipeline • Can completely process a bacterial genome in one day • Manage all data uploads to GenBank • Specialized analyses for publications

  15. Manual annotation: ~10 genes / day • Eight bacterial genome annotators • Inspection of: search results, TIGRFam matches, experimentally characterized genes, literature references (abstracts and more) • Assignment of: common name, role category, genetic name, EC number

  16. Genome Annotation Processes Website dbase ...plus 10 more annotators

  17. Manatee: a collaborative tool • Manual Annotation Tool, Etc Etc… • Open Source: manatee.sourceforge.net • Based on Chado relational schema • Several installations • one week to install • Fully documented • API • User manual • Installation • Testing • Unit, integration testing • Deployment • Quarterly training classes

  18. Gene Identification Information Gene Ontology and Cellular Role Graphical Display of Analyses Textual Display of Analyses Gene Information Page

  19. Experimentally characterized proteins indicated by color Pair-wise Alignment Summary

  20. Summary of Genome Information

  21. Gene Information Page Online help system

  22. Annotation pipeline (Chugga Chugga) • Published: 33 • Completed: 18 • Closure: 20 • High-throughput sequencing: 22 • Library construction: 19 • Trend: more closely related genomes • Diagram labels: TigrDB, Gene coords, Seq/pep files, Search results, Families, GenBank

  23. Annotation research example: position effect

  24. [Figure: genome-wide overview of transporters and central metabolism (glycolysis, ED and PPP, TCA and glyoxylate bypass, amino acid pathways), with gene identifiers labeling each transporter and enzyme]

  25. TIGRFams • Heavily curated multiple alignments based on protein families of the same function • Proposed “cure” for transitive annotation • Based on Hidden Markov Models (HMMs) • Approaching 2,000 families • Complete assignments to Gene Ontology • Cutoff scores for each family: Trusted (automated name assignment) and Noise (manual inspection required) • Downloadable • Fully integrated into the InterPro database
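The per-family cutoff idea can be sketched as a small decision rule. This is an illustration only: the slide says the trusted cutoff permits automated name assignment and the noise cutoff calls for manual inspection, and the sketch assumes that hits scoring below the noise cutoff are simply not treated as matches. The family names and cutoff values are invented.

```python
from typing import Dict, NamedTuple


class FamilyCutoffs(NamedTuple):
    trusted: float  # at or above: name can be assigned automatically
    noise: float    # between noise and trusted: needs manual inspection


# Hypothetical families and cutoff values, for illustration only.
CUTOFFS: Dict[str, FamilyCutoffs] = {
    "FAM_A": FamilyCutoffs(trusted=250.0, noise=120.0),
    "FAM_B": FamilyCutoffs(trusted=80.0, noise=40.0),
}


def classify_hit(family: str, score: float) -> str:
    """Decide how an HMM hit should be handled, given its family's cutoffs."""
    cutoffs = CUTOFFS[family]
    if score >= cutoffs.trusted:
        return "auto_assign"
    if score >= cutoffs.noise:
        return "manual_inspection"
    # Assumption: below the noise cutoff the hit is not considered a match.
    return "no_match"


hits = [("FAM_A", 310.5), ("FAM_A", 150.0), ("FAM_B", 12.3)]
decisions = [classify_hit(family, score) for family, score in hits]
print(decisions)
```

Storing the cutoffs per family, rather than using one global threshold, is what lets a curated family be strict where paralogs score highly and lenient where the family is divergent.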

  26. TIGRFAMs: Genome coverage

  27. Multiple Genome Annotation • Genes -> usual pipeline • BLAST, Pfam, COG, TIGRFam, InterPro, etc. • Cluster genes based on common properties using hierarchical clustering • Display • Can annotators select (grab) subsets of genes for reliable assignment?
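Clustering genes by shared properties can be sketched with a small agglomerative procedure. The slide does not say which distance or linkage was used, so this sketch assumes Jaccard distance over each gene's set of search hits and single linkage with a distance threshold; gene names and property labels are made up.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 minus (shared properties / all properties); 0.0 means identical sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_genes(
    genes: Dict[str, Set[str]], max_distance: float = 0.5
) -> List[Set[str]]:
    """Single-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters until the closest pair is
    farther apart than max_distance.
    """
    clusters: List[Set[str]] = [{name} for name in sorted(genes)]

    def dist(c1: Set[str], c2: Set[str]) -> float:
        # Single linkage: distance between the closest pair of members.
        return min(jaccard_distance(genes[x], genes[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        (i, j), d = min(
            (
                ((i, j), dist(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda pair: pair[1],
        )
        if d > max_distance:
            break
        clusters[i] |= clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical genes, each with the set of families/domains it hit.
genes = {
    "gene1": {"dom_X", "dom_Y", "cog_1"},
    "gene2": {"dom_X", "dom_Y", "cog_1"},
    "gene3": {"dom_X", "dom_Y"},
    "gene4": {"dom_Z", "cog_2"},
}
result = cluster_genes(genes)
print(result)
```

An annotator could then review each cluster as a unit and assign one name to all of its members, which is the kind of subset selection the slide asks about.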

  28. Malate dehydrogenase Lactate dehydrogenase

  29. Sybil Comparative System • Open Source software • Complete, portable data management system for genome annotation • Chado relational database (developed by FlyBase and other collaborators) • Extensive graphical interface • Priority “use case”: management of genes and genomes for identification of pathogen-related genes

  30. Sybil Overview

  31. Tabular views of conserved synteny, orthologs, and BLAST matches across multiple genomes (panels A, B, C)

  32. Open Source Software at TIGR: Perl Artistic License • Manatee • Sybil (prototypes) • Annotation Engine • MUMmer (large-scale genome alignment) • BAMBUS (Assembly/scaffolding) • Glimmer, GlimmerM, Exonomy (Gene finding) • TM4 (Microarray tools) • Chado/BSML

  33. Conclusions • Professional software and support staff are critical to a large, high-throughput research project • Scientists benefit from frequent interactions with production line staff • High-quality support allows scientific staff to devote more effort to scientific discovery
