
NCBI Molecular Biology Resources

NCBI Molecular Biology Resources. A Field Guide part 1. September 29, 2004 ICGEB. NCBI Resources. About NCBI The NCBI Entrez System NCBI Sequence Databases NCBI Genomic Resources ** Intermission **


Presentation Transcript


  1. NCBI Molecular Biology Resources: A Field Guide, part 1 • September 29, 2004 • ICGEB

  2. NCBI Resources • About NCBI • The NCBI Entrez System • NCBI Sequence Databases • NCBI Genomic Resources ** Intermission ** • NCBI Precomputed Resources • Behind the scenes

  3. Bethesda, MD The National Institutes of Health

  4. The National Center for Biotechnology Information • Created as a part of NLM in 1988 • Establish public databases • Perform research in computational biology • Develop software tools for sequence analysis • Disseminate biomedical information

  5. Number of Users and Hits Per Day, 1997 to 2003 [chart; dips marked at Christmas & New Year’s Days] • Currently averaging 10,000,000 to 35,000,000 hits per day!

  6. Countries of Origin

  7. Web Access: http://www.ncbi.nlm.nih.gov

  8. http://www.ncbi.nlm.nih.gov/About/index.html

  9. A part of the NCBI Bookshelf • Part 1. The Databases • Part 2. Data Flow and Processing • Part 3. Querying and Linking the Data • Part 4. User Support

  10. OMIM - A catalogue of genes involved with human disease processes - Detailed clinical and reference information - Curated and maintained by Johns Hopkins - Links to PubMed and sequence databases

  11. The Entrez System [diagram of the databases linked through Entrez]: Gene, UniGene, CancerChromosomes, UniSTS, HomoloGene, SNP, PopSet, Genome, Nucleotide, GEO, Books, Taxonomy, PubMed, GEO Datasets, MeSH, OMIM, Protein, PMC, Journals, Domains, 3D Domains, Structure

  12. Taxonomy

  13. zebrafish

  14.   

  15. The Global Entrez search engine

  16. Types of Databases
     • Primary Databases
        • Original submissions by experimentalists
        • Database staff review and may organize the data, but we don’t add or modify information
        • Records are “owned” and updated by their authors
        • Examples: GenBank, SNP, GEO
     • Derivative Databases
        • Human-curated (compilation and correction of data)
           • Examples: Gene (LocusLink), Structure & Literature databases
        • Computationally-Derived
           • Example: UniGene
        • Combination
           • Examples: RefSeq, Genome Assembly, Domain databases

  17. Primary vs. Derivative Sequence Databases [diagram of sequence fragments flowing between databases] • GenBank: submitted by Labs and Sequencing Centers; updated ONLY by submitters • RefSeq, Genome Assembly, UniGene: produced by Curators and Algorithms; updated continually by NCBI

  18. How to Query a Particular Database • General form: term1[tag delimiter] op term2[tag delimiter] op … • op = AND, OR, NOT • Boolean operators MUST be in ALL CAPS! • tag delimiter = Entrez indexing field • Examples of tag delimiters: Organism, Journal, User compounds, Author
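The query pattern on this slide can be sketched in a few lines of Python. This is only an illustration, not NCBI code: `tag` and `combine` are hypothetical helper names, and the field abbreviations `title` and `orgn` are the ones used in the search examples elsewhere in this presentation.

```python
OPERATORS = {"AND", "OR", "NOT"}

def tag(term, field=None):
    """Attach an Entrez field tag to a term: tag("human", "orgn") -> 'human[orgn]'."""
    return f"{term}[{field}]" if field else term

def combine(first, *rest):
    """Join terms with Boolean operators.

    `rest` alternates operator, term, operator, term, ...
    Operators are upper-cased because the slide says they MUST be in ALL CAPS.
    """
    if len(rest) % 2 != 0:
        raise ValueError("expected operator/term pairs")
    parts = [first]
    for op, term in zip(rest[0::2], rest[1::2]):
        op = op.upper()
        if op not in OPERATORS:
            raise ValueError(f"unknown operator: {op}")
        parts.extend([op, term])
    return " ".join(parts)

query = combine(tag("thyroid peroxidase", "title"), "and", tag("human", "orgn"))
print(query)
```

Upper-casing the operator inside `combine` is a design choice for this sketch, so that a lower-case `and` cannot slip into the query as an ordinary search word.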

  19. Sample Query Brauninger a c-src kinase Organism Journal User compounds Author

  20. Using Fields to Find Records
     • Search fields: Accession, All Fields, Author, EC/RN Number, Feature Key, Filter, Gene Name, Issue, Journal, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Volume
     • Most useful search field: [Organism]
        • human[orgn] …or… bacteria[orgn]
     • Useful search terms in [Properties] field:
        • srcdb: “source database” ( srcdb genbank[prop] )
        • gbdiv: “genbank division” ( gbdiv est[prop] )
        • biomol: “biomolecular type” ( biomol mrna[prop] )

  21. Using Field Limits (each query is followed by its result count)
     #1: thyroid peroxidase    335
     #2: thyroid peroxidase AND human[orgn]    291
     #3: thyroid peroxidase[title] AND human[orgn]    166
     #4: #3 AND srcdb refseq[prop]    5
     #5: #3 AND srcdb ddbj/embl/genbank[prop]    161
     #6: #5 AND gbdiv est[prop]    20
     #7: #5 AND gbdiv pri[prop]    141
     #8: #7 AND biomol genomic[prop]    25
     #9: #7 AND biomol mrna[prop]    116
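The `#n` references in this search history can be unfolded mechanically. The sketch below is a hypothetical illustration (the names `HISTORY`, `expand` and `count` are not part of Entrez): it stores the nine queries and counts from the slide, expands the `#n` shorthand into full query text, and checks that each pair of limits splits its parent set exactly.

```python
import re

# Search history copied from the slide: number -> (query text, result count).
HISTORY = {
    1: ("thyroid peroxidase", 335),
    2: ("thyroid peroxidase AND human[orgn]", 291),
    3: ("thyroid peroxidase[title] AND human[orgn]", 166),
    4: ("#3 AND srcdb refseq[prop]", 5),
    5: ("#3 AND srcdb ddbj/embl/genbank[prop]", 161),
    6: ("#5 AND gbdiv est[prop]", 20),
    7: ("#5 AND gbdiv pri[prop]", 141),
    8: ("#7 AND biomol genomic[prop]", 25),
    9: ("#7 AND biomol mrna[prop]", 116),
}

def expand(number):
    """Recursively replace each '#k' with the parenthesised text of query k."""
    query, _count = HISTORY[number]
    return re.sub(r"#(\d+)", lambda m: f"({expand(int(m.group(1)))})", query)

def count(number):
    return HISTORY[number][1]

print(expand(4))
# On the slide, each pair of limits partitions its parent result set:
print(count(4) + count(5) == count(3))  # RefSeq + DDBJ/EMBL/GenBank
print(count(6) + count(7) == count(5))  # EST + PRI divisions
print(count(8) + count(9) == count(7))  # genomic + mRNA
```

The three sums hold for the numbers shown (5 + 161 = 166, 20 + 141 = 161, 25 + 116 = 141), which is a quick way to see that each added limit narrows the previous set rather than starting a new search.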

  22. Complex searches you can do with Preview/Index • Terms used (and indexed) in Entrez fields can be searched to gain useful information! • Example: How many rat UniGene clusters contain at least one mRNA? • Select the UniGene database. • Find all the rat records: rat [organism] • Find those that have ≥ 1 mRNAs (“not 0”), using the NOT operator.

  23. Complex Queries with Preview/Index NOT 0 [mRNA Count]

  24. Primary Sequence Database: GenBank • Nucleotide-only sequence database • Archival in nature • Submission of GenBank Data to NCBI • Direct submissions of individual records via Web (BankIt, Sequin) • Batch submissions of bulk sequences via Email (EST, GSS, STS) • FTP accounts for Sequencing Centers

  25. GenBank growth, ’83 to ’04 [chart: Sequence Records (millions) and Total Base Pairs (billions) by year] • GenBank Release 143: 37.3 million records, 41.8 billion nucleotides • Average doubling time ≈ 14 months
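To make the doubling-time figure concrete, here is a small sketch of the arithmetic. It assumes growth at a constant 14-month doubling time, which is a simplification of the chart, and the function names are hypothetical. It is meant to show what "doubling every 14 months" implies, not to forecast GenBank's size.

```python
DOUBLING_MONTHS = 14  # "Average doubling time ≈ 14 months" from the slide

def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Factor by which the database grows over `months` at a constant doubling time."""
    return 2 ** (months / doubling_months)

def projected_size(current, months, doubling_months=DOUBLING_MONTHS):
    """Size after `months`, starting from `current` (any unit, e.g. billions of bases)."""
    return current * growth_factor(months, doubling_months)

# Growth over one year under this assumption.
print(f"growth per year: x{growth_factor(12):.2f}")

# Two doubling times (28 months) quadruple the Release 143 total of 41.8 billion bases.
print(f"after 28 months: {projected_size(41.8, 28):.1f} billion bases")
```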

  26. GenBank Release 143, August 2004 • 37,343,937 Records • 41,808,045,653 Nucleotides • >170,000 Species • 160 Gigabytes • 657 files • full release every two months • incremental and cumulative updates daily • available only through internet: ftp://ftp.ncbi.nih.gov/genbank/

  27. The International Sequence Database Collaboration [diagram: three partner databases that each take submissions and updates] • GenBank (NCBI, NIH): Entrez, Sequin, BankIt, ftp • EMBL (EBI): SRS • DDBJ (CIB, NIG): getentry

  28. Organization of GenBank: GenBank Divisions (gbdiv)
     Records are divided into 17 Divisions: 1 Patent (11 files), 5 High Throughput, 11 Traditional.
     • BULK (High Throughput) Divisions: Batch Submission (Email and FTP) • Inaccurate • Poorly characterized
        EST (335) Expressed Sequence Tag
        GSS (116) Genome Survey Sequence
        HTG (61) High Throughput Genomic
        STS (5) Sequence Tagged Site
        HTC (6) High Throughput cDNA
     • Traditional Divisions: Direct Submissions (Sequin and BankIt) • Accurate • Well characterized
        PRI (28) Primate
        PLN (12) Plant and Fungal
        BCT (10) Bacterial and Archaeal
        INV (6) Invertebrate
        ROD (13) Rodent
        VRL (3) Viral
        VRT (7) Other Vertebrate
        MAM (1) Mammalian (ex. ROD and PRI)
        PHG (1) Phage
        SYN (1) Synthetic (cloning vectors)
        UNA (1) Unannotated
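The division codes lend themselves to a simple lookup. The sketch below is illustrative only: the dictionaries copy the codes and names from this slide, the function names are hypothetical, and the filter string follows the `gbdiv est[prop]` form shown on the "Using Fields to Find Records" slide. The Patent division is left out because the slide does not give its code.

```python
# Division code -> name, as listed on the slide.
BULK = {
    "EST": "Expressed Sequence Tag",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic",
    "STS": "Sequence Tagged Site",
    "HTC": "High Throughput cDNA",
}
TRADITIONAL = {
    "PRI": "Primate",
    "PLN": "Plant and Fungal",
    "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate",
    "ROD": "Rodent",
    "VRL": "Viral",
    "VRT": "Other Vertebrate",
    "MAM": "Mammalian (ex. ROD and PRI)",
    "PHG": "Phage",
    "SYN": "Synthetic (cloning vectors)",
    "UNA": "Unannotated",
}

def is_bulk(code):
    """True for the high-throughput (batch-submitted) divisions."""
    return code.upper() in BULK

def division_filter(code):
    """Build a [Properties] search term such as 'gbdiv est[prop]'."""
    code = code.upper()
    if code not in BULK and code not in TRADITIONAL:
        raise KeyError(f"unknown division: {code}")
    return f"gbdiv {code.lower()}[prop]"

print(division_filter("EST"), is_bulk("EST"))
print(division_filter("pri"), is_bulk("pri"))
```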

  29. File Formats of the Sequence Databases • Each sequence is represented by a text record called a flat file. • GenBank/GenPept (useful for scientists) • FASTA (the simplest format) • ASN.1 & XML (useful for programmers)

  30. A Traditional “GenBank” Record
     Callouts on the slide: LOCUS line = Length, molecule type (mRNA = cDNA, DNA = genomic), Division, Date of most recent modification • DEFINITION = Title • ACCESSION = Accession Number • VERSION = Accession.Version and GI Number • ORGANISM = NCBI’s Taxonomy • REFERENCE = References

     LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
     DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
     ACCESSION   AF062069
     VERSION     AF062069.2  GI:7144484
     KEYWORDS    .
     SOURCE      Atlantic horseshoe crab.
       ORGANISM  Limulus polyphemus
                 Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
                 Xiphosura; Limulidae; Limulus.
     REFERENCE   1  (bases 1 to 3808)
       AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
                 Greenberg,R.M. and Smith,W.C.
       TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
       JOURNAL   J. Neurosci. (1998) In press
     REFERENCE   2  (bases 1 to 3808)
       AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
                 Greenberg,R.M. and Smith,W.C.
       TITLE     Direct Submission
       JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
                 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
     REFERENCE   3  (bases 1 to 3808)
       AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
                 Greenberg,R.M. and Smith,W.C.
       TITLE     Direct Submission
       JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
                 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
       REMARK    Sequence update by submitter
     COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
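The fields called out on this slide can be pulled from the LOCUS and VERSION lines with plain string splitting. The sketch below assumes exactly the layout shown on the slide; LOCUS lines in other records may carry additional fields, which this hypothetical parser does not handle. For real work a maintained parser would be the safer choice.

```python
def parse_locus(line):
    """Parse a LOCUS line laid out as: LOCUS name length 'bp' molecule division date."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("LOCUS line does not match the layout assumed here")
    return {
        "name": fields[1],
        "length": int(fields[2]),   # sequence length in base pairs
        "molecule": fields[4],      # mRNA = cDNA, DNA = genomic
        "division": fields[5],      # GenBank division code, e.g. INV
        "date": fields[6],          # date of most recent modification
    }

def parse_version(line):
    """Parse a VERSION line laid out as: VERSION accession.version GI:number."""
    keyword, acc_ver, gi = line.split()
    if keyword != "VERSION" or not gi.startswith("GI:"):
        raise ValueError("VERSION line does not match the layout assumed here")
    accession, version = acc_ver.split(".")
    return {"accession": accession, "version": int(version), "gi": int(gi[3:])}

locus = parse_locus("LOCUS AF062069 3808 bp mRNA INV 02-MAR-2000")
version = parse_version("VERSION AF062069.2 GI:7144484")
print(locus)
print(version)
```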

  31. Lower down in the GenBank Record
     Callouts on the slide: Feature Table • GenPept Protein ID ( /protein_id="AAC16332.2" /db_xref="GI:7144485" )

     FEATURES             Location/Qualifiers
          source          1..3808
                          /organism="Limulus polyphemus"
                          /db_xref="taxon:6850"
                          /tissue_type="lateral eye"
          CDS             258..3302
                          /note="N-terminal protein kinase domain; C-terminal myosin
                          heavy chain head; substrate for PKA"
                          /codon_start=1
                          /product="myosin III"
                          /protein_id="AAC16332.2"
                          /db_xref="GI:7144485"
                          /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                          NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                          EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                          SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                          ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                          PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
     BASE COUNT      201 a    689 c    782 g   1136 t
     ORIGIN
             1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
          3781 aagatacagt aactagggaa aaaaaaaa
     //
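As a worked example of reading the feature table, the CDS location `258..3302` can be turned into a length and a codon count. The sketch below handles only simple `start..end` locations (1-based, inclusive); compound locations and partial-end markers are outside its scope, and the function names are hypothetical.

```python
def parse_location(location):
    """Parse a simple 'start..end' feature location (1-based, inclusive)."""
    start, end = (int(part) for part in location.split(".."))
    if start > end:
        raise ValueError("start must not exceed end")
    return start, end

def span_length(location):
    """Number of bases covered; both endpoints count, hence the + 1."""
    start, end = parse_location(location)
    return end - start + 1

cds_bp = span_length("258..3302")
codons, leftover = divmod(cds_bp, 3)
print(cds_bp, codons, leftover)  # 3045 bases = 1015 codons, no bases left over
```

With `/codon_start=1` the reading frame starts at base 258, and the span divides evenly into 1015 codons, which is consistent with the "complete cds" in the record's DEFINITION line.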

  32. FASTA format
     >gi|4680720|gb|M17755.2|HUMTPOC Homo sapiens thyroid peroxidase (TPO) mRNA, complete cds
     GAGGCAATTGAGGCGCCCATTTCAGAAGAGTTACAGCCGTGAAAATTACTCAGCAGTGCAGTTGGCTGAG
     AAGAGGAAAAAAGAATGAGAGCGCTGGCTGTGCTGTCTGTCACGCTGGTTATGGCCTGCACAGAAGCCTT
     CTTCCCCTTCATCTCGAGAGGGAAAGAACTCCTTTGGGGAAAGCCTGAGGAGTCTCGTGTCTCTAGCGTC
     TTGGAGGAAAGCAAGCGCCTGGTGGACACCGCCATGTACGCCACGATGCAGAGAAACCTCAAGAAAAGAG
     GAATCCTTTCTGGAGCTCAGCTTCTGTCTTTTTCCAAACTTCCTGAGCCAACAAGCGGAGTGATTGCCCG
     AGCAGCAGAGATAATGGAAACATCAATACAAGCGATGAAAAGAAAAGTCAACCTGAAAACTCAACAATCA
     CAGCATCCAACGGATGCTTTATCAGAAGATCTGCTGAGCATCATTGCAAACATGTCTGGATGTCTCCCTT
     ACATGCTGCCCCCAAAATGCCCAAACACTTGCCTGGCGAACAAATACAGGCCCATCACAGGAGCTTGCAA
     CAACAGAGACCACCCCAGATGGGGCGCCTCCAACACGGCCCTGGCACGATGGCTCCCTCCAGTCTATGAG
     GACGGCTTCAGTCAGCCCCGAGGCTGGAACCCCGGCTTCTTGTACAACGGGTTCCCACTGCCCCCGGTCC
     GGGAGGTGACAAGACATGTCATTCAAGTTTCAAATGAGGTTGTCACAGATGATGACCGCTATTCTGACCT
     CCTGATGGCATGGGGACAATACATCGACCACGACATCGCGTTCACACCACAGAGCACCAGCAAAGCTGCC
     ...
     >gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]
     MRALAVLSVTLVMACTEAFFPFISRGKELLWGKPEESRVSSVLEESKRLVDTAMYATMQRNLKKRGILSG
     AQLLSFSKLPEPTSGVIARAAEIMETSIQAMKRKVNLKTQQSQHPTDALSEDLLSIIANMSGCLPYMLPP
     KCPNTCLANKYRPITGACNNRDHPRWGASNTALARWLPPVYEDGFSQPRGWNPGFLYNGFPLPPVREVTR
     HVIQVSNEVVTDDDRYSDLLMAWGQYIDHDIAFTPQSTSKAAFGGGSDCQMTCENQNPCFPIQLPEEARP
     AAGTACLPFYRSSAACGTGDQGALFGNLSTANPRQQMNGLTSFLDASTVYGSSPALERQLRNWTSAEGLL
     RVHGRLRDSGRAYLPFVPPRAPAACAPEPGNPGETRGPCFLAGDGRASEVPSLTALHTLWLREHNRLAAA
     LKALNAHWSADAVYQEARKVVGALHQIITLRDYIPRILGPEAFQQYVGPYEGYDSTANPTVSNVFSTAAF
     RFGHATIHPLVRRLDASFQEHPDLPGLWLHQAFFSPWTLLRGGGLDPLIRGLLARPAKLQVQDQLMNEEL
     TERLFVLSNSSTLDLASINLQRGRDHGLPGYNEWREFCGLPRLETPADLSTAIASRSVADKILDLYKHPD
     NIDVWLGGLAENFLPRARTGPLFACLIGKQMKALRDGDWFWWENSHVFTDAQRRELEKHSLSRVICDNTG
     LTRVPMDAFQVGKFPEDFESCDSITGMNLEAWRETFPQDDKCGFPESVENGDFVHCEESGRRVLVYSCRH
     GYELQGREQLTCTQEGWDFQPPLCKDVNECADGAHPPCHASARCRNTKGGFQCLCADPYELGDDGRTCVD
     ...
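Because FASTA is just a `>` header line followed by sequence lines, a reader for it fits in a few lines. The following is a minimal sketch, not a replacement for an established parser: `read_fasta` and `parse_defline` are hypothetical names, and the header parser assumes the `gi|number|db|accession|title` layout seen in the two records on this slide. The sample text uses only the first lines of those records.

```python
def read_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def parse_defline(header):
    """Split a 'gi|number|db|accession|title' header; fall back to the raw title."""
    parts = header.split("|", 4)
    if len(parts) == 5 and parts[0] == "gi" and parts[1].isdigit():
        return {"gi": int(parts[1]), "db": parts[2],
                "accession": parts[3], "title": parts[4].strip()}
    return {"title": header}

sample = """>gi|4680720|gb|M17755.2|HUMTPOC Homo sapiens thyroid peroxidase (TPO) mRNA, complete cds
GAGGCAATTGAGGCGCCCATTTCAGAAGAGTTACAGCCGTGAAAATTACTCAGCAGTGCAGTTGGCTGAG
AAGAGGAAAAAAGAATGAGAGCGCTGGCTGTGCTGTCTGTCACGCTGGTTATGGCCTGCACAGAAGCCTT
>gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]
MRALAVLSVTLVMACTEAFFPFISRGKELLWGKPEESRVSSVLEESKRLVDTAMYATMQRNLKKRGILSG
"""

for header, sequence in read_fasta(sample):
    info = parse_defline(header)
    print(info["accession"], len(sequence), info["title"])
```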

  33. Abstract Syntax Notation: ASN.1 [diagram labels: GenPept, GenBank, ASN.1, FASTA Protein, FASTA Nucleotide]
     Seq-entry ::= set {
       level 1 ,
       class nuc-prot ,
       descr {
         title "Human thyroid peroxidase mRNA, partial cds., and translated products" ,
         source {
           org {
             taxname "Homo sapiens" ,
             common "human" ,
             db {
               {
                 db "taxon" ,
                 tag id 9606 } } ,
             orgname {
               name binomial {
                 genus "Homo" ,
                 species "sapiens" } ,
               lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo" ,

  34. Bulk Divisions
     • Batch Submission and htg (email and ftp)
     • Inaccurate
     • Poorly Characterized
     • Expressed Sequence Tag: 1st pass single read cDNA
     • Genome Survey Sequence: 1st pass single read gDNA
     • High Throughput Genomic: incomplete sequences of genomic clones
     • Sequence Tagged Site: PCR-based mapping reagents

  35. EST Division: Expressed Sequence Tags • gbdiv_est[Properties]
     [diagram] nucleus (30,000 genes) • RNA gene products • make cDNA library (80-100,000 unique cDNA clones in library) • isolate unique clones • sequence once from each end (5’ and 3’)
     Read fragments shown in the diagram: gatccantgccatacg, ctcgccaattcnntcg

     >IMAGE:275615 5' mRNA sequence
     GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
     TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
     TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
     GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
     TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
     AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
     TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
     >IMAGE:275615 3', mRNA sequence
     NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
     TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
     AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
     CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
     GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

  36. Genome Sequencing - HTG, GSS, (WGS) [diagram] • Whole BAC insert (or genome) • shredding • cloning • isolating • sequencing: GSS division or trace archive • assembly: Draft Sequence (HTG division) • whole genome shotgun assemblies (traditional division)

  37. HTG Division: Honeybee Draft Sequences • Unfinished sequences of BACs • Gaps and unordered pieces • Finished sequences move to traditional GenBank division

  38. Other Primary Databases • GEO (Gene Expression Omnibus) • Searchable microarray data repository • SNP (Single Nucleotide Polymorphism) • Allelic variations (including minisatellites/simple sequence repeats and insertions/deletions)

  39. Redesigned with new features • Submit and update data • Query the database: • gene identifiers • field information • sequence • Browse datasets • Download data

  40. GEO record types [diagram] • GPL: Platform descriptions (Submitted by Manufacturer*) • GSM: Raw/processed spot intensities from a single slide/chip (Submitted by Experimentalists) • GSE: Grouping of slide/chip data, “a single experiment” (Submitted by Experimentalists) • GDS: Grouping of experiments (Curated by NCBI) • Searched through Entrez GEO Datasets and Entrez GEO
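The four accession prefixes on this slide can be told apart with a simple lookup. This is a hypothetical sketch: the descriptions are copied from the slide, the function name `classify` is an assumption, and the example accessions GDS177 and GSM827 are the ones shown on the next slide.

```python
# Prefix -> description, as given on the slide.
GEO_RECORD_TYPES = {
    "GPL": "Platform descriptions",
    "GSM": "Raw/processed spot intensities from a single slide/chip",
    "GSE": "Grouping of slide/chip data (a single experiment)",
    "GDS": "Grouping of experiments",
}

def classify(accession):
    """Split a GEO accession such as 'GDS177' into (prefix, number, description)."""
    accession = accession.strip()
    prefix, number = accession[:3].upper(), accession[3:]
    if prefix not in GEO_RECORD_TYPES or not number.isdigit():
        raise ValueError(f"not a GEO accession of the kind shown here: {accession!r}")
    return prefix, int(number), GEO_RECORD_TYPES[prefix]

for acc in ("GDS177", "GSM827"):
    print(classify(acc))
```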

  41. GDS177: CMV infection of HFF cells • FHCRC non-commercial human 18K array • Comparison of gene expression profiles of HFF cells infected with CMV strains • src1: CMV infected fibroblasts • src2: uninfected fibroblasts • GSM827: FHCMV-T-1 • GSM825: FHCMV-T-2 • GSM828: FHCMV-T-3 • GSM829: FHCMV-H-1 • GSM830: FHCMV-H-2 • GSM831: FHCMV-H-3 • GSM832: CMV_AD169-2 • GSM833: CMV_AD169-3 • Expression
