
NCBI


eugenia

Presentation Transcript


  1. NCBI
     • Created as a part of NLM in 1988
     • Establish public databases
     • Research in computational biology
     • Develop software tools for sequence analysis
     • Disseminate biomedical information
     • Tools: BLAST (1990), Entrez (1992)
     • GenBank (1992)
     • Free MEDLINE (PubMed, 1997)
     • Human genome (2001)

  2. NCBI Home Page: www.ncbi.nlm.nih.gov
     To learn more, visit the “Site Map” and “About NCBI” web pages.

  3. Entrez: An Integrated Database Search and Retrieval System

  4. Entrez: The (ever) Expanding Entrez System
     [Figure. Databases shown: PubMed, Nucleotide, UniGene, Protein, Journals, Structure, CDD, Genome, PopSet, SNP, OMIM, 3D Domains, Taxonomy, UniSTS, ProbeSet, Books]
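Entrez databases can also be queried programmatically. The sketch below only builds a search URL for NCBI's E-utilities `esearch` endpoint and does not contact the server. The helper name `esearch_url`, the choice of the `protein` database, and the search term are assumptions made for illustration; they are not taken from the slides.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public E-utilities search endpoint (assumed here as the programmatic
# counterpart of an Entrez web search).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db, term, retmax=20):
    """Build (but do not send) an esearch request URL for one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return ESEARCH + "?" + query

# Hypothetical query: protein records matching a free-text term.
url = esearch_url("protein", "muscle creatine kinase")
params = parse_qs(urlsplit(url).query)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return an XML list of matching record identifiers; that step is left out so the sketch runs offline.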

  5. Literature Databases
     • PubMed
     • Books
     • PubMed Central
     • Journals
     • Online Mendelian Inheritance in Man (OMIM)

  6. Molecular Sequence Databases
     • Sequence Databases
       • Nucleotide (GenBank)
       • Taxonomy
       • PopSet
       • Protein
     • Marker Databases
       • Single Nucleotide Polymorphisms (SNPs, dbSNP)
       • Sequence Tagged Sites (STSs, dbSTS)
       • Expressed Sequence Tags (ESTs, dbEST)
       • UniGene

  7. Molecular Databases
     • Primary Databases
       • Original submissions by experimentalists
       • Database staff organize but don’t add additional information
       • Example: GenBank
     • Derivative Databases
       • Human curated
         • Compilation and correction of data
         • Example: SWISS-PROT, NCBI RefSeq mRNA
       • Computationally derived
         • Example: UniGene
       • Combinations
         • Example: NCBI Genome Assembly

  8. [Figure. Labels: Labs, GenBank, Curators, RefSeq, Algorithms, UniGene, Genome Assembly; illustrated with example sequence fragments]

  9. The International Nucleotide Sequence Database Collaboration
     [Figure. Labels: GenBank (NIH, NCBI, Entrez); DDBJ (NIG, CIB, Get Entry); EMBL (EBI, SRS)]

  10. Entrez Nucleotide: GenBank 71%, DDBJ 19%, EMBL 9%, RefSeq 1%, PDB 0.01%

  11. What is GenBank? NCBI’s Primary Sequence Database
      • Nucleotide-only sequence database
      • Archival in nature
      • GenBank data
        • Direct submissions of individual records (BankIt, Sequin)
        • Batch submissions via email (EST, GSS, STS)
        • ftp accounts established for sequencing centers
      • Data shared among three collaborating databases:
        • GenBank
        • DNA Database of Japan (DDBJ)
        • European Molecular Biology Laboratory Database (EMBL)

  12. The Old Way (from Fran Lewitter, Whitehead Institute)

  13. GenBank: NCBI’s Primary Sequence Database
      Release 136, June 2003
      • 25,592,865 records (18,197,119 in June 2002)
      • 32,528,249,295 nucleotides (22,616,937,182 in June 2002)
      • 110,000+ species
      • 121 gigabytes of data
      • Full release every two months
      • Incremental and cumulative updates daily
      • Available only through the internet:
        • ftp://ftp.ncbi.nih.gov/genbank/
        • ftp://genbank.sdsc.edu/pub
        • ftp://bio-mirror.net/biomirror/genbank/
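The two pairs of figures on this slide (June 2002 versus June 2003) imply a year-over-year growth rate. As a quick worked calculation, using only the numbers quoted above (the function name `percent_growth` is illustrative), records grew by roughly 41% and nucleotides by roughly 44%:

```python
def percent_growth(old, new):
    """Relative increase from old to new, expressed as a percentage."""
    return (new - old) / old * 100.0

# Figures quoted on the slide for GenBank Release 136.
records_growth = percent_growth(18_197_119, 25_592_865)
nucleotides_growth = percent_growth(22_616_937_182, 32_528_249_295)

print(f"records:     {records_growth:.1f}% more than June 2002")
print(f"nucleotides: {nucleotides_growth:.1f}% more than June 2002")
```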

  14. GenBank Divisions
      Traditional Divisions
        BCT   Bacterial/Archaeal
        INV   Invertebrate
        MAM   Mammalian (excluding ROD/PRI)
        PHG   Phage
        PLN   Plant/Fungal
        PRI   Primate
        ROD   Rodent
        SYN   Synthetic (cloning vectors)
        VRL   Viral
        VRT   Other Vertebrate
      Bulk Sequence Divisions
        EST   Expressed Sequence Tag
        STS   Sequence Tagged Site
        GSS   Genome Survey Sequence
        HTGS  High Throughput Genomic Sequence
        HTC   High Throughput cDNA

  15. A Traditional GenBank Record
      [Figure. Annotated fields: Locus Field, Molecule Type, Modification Date, Definition Line, GenBank Division, GI (GenInfo), Keywords, Taxonomy, Submission Field]

  16. [Figure. Annotated parts: Feature Table, GenPept Record, Genomic DNA Sequence]

  17. Bulk Sequence Divisions
        EST   Expressed Sequence Tag
        STS   Sequence Tagged Site
        HTGS  High Throughput Genomic Sequence
      • Batch submission, e-mail, or ftp
      • Inaccurate
      • Poorly characterized

  18. EST Division: Expressed Sequence Tags
      [Figure. Labels: nucleus (30,000 genes); RNA gene products; make cDNA library; 80-100,000 unique cDNA clones in library; 5’ and 3’ ends; example reads gatccantgccatacg and ctcgccaattcnntcg]
      • Isolate unique clones
      • Sequence once from each end
      >IMAGE:275615 5' mRNA sequence
      GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
      >IMAGE:275615 3', mRNA sequence
      NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
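Because each EST is sequenced only once, uncalled bases (written N or n) are common in these records. The sketch below parses FASTA-formatted text and reports the fraction of N positions per read. The function names and the headers `read_1` and `read_2` are hypothetical; the two short sequences are the lowercase example reads from this slide.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; concatenate them.
            records[header] += line
    return records

def ambiguous_fraction(seq):
    """Fraction of positions called as N (unknown base), case-insensitive."""
    return seq.upper().count("N") / len(seq) if seq else 0.0

# Hypothetical headers; sequences are the short reads shown on the slide.
example = """>read_1
gatccantgccatacg
>read_2
ctcgccaattcnntcg
"""

fractions = {name: ambiguous_fraction(seq)
             for name, seq in parse_fasta(example).items()}
for name, frac in fractions.items():
    print(name, f"{frac:.4f}")
```

The same two functions can be applied to the full IMAGE:275615 reads by pasting them in place of `example`.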

  19. What is UniGene? A gene-oriented view of sequence entries
      • MegaBlast-based automated sequence clustering
      • Nonredundant set of gene-oriented clusters
      • Each cluster represents a unique gene
      • Provides information on tissue-specific expression and map locations
      • Includes well-characterized genes and novel ESTs
      • Useful for gene discovery and selection of mapping reagents

  20. EST hits to Homo sapiens muscle creatine kinase mRNA
      [Figure. Labels: Query Sequence (muscle creatine kinase mRNA), 3’ EST Hits, 5’ EST Hits]

  21. UniGene Entry for H. sapiens Muscle Creatine Kinase

  22. STS Division: Sequence Tagged Sites
      • Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
      • PCR with STS primers gives one product per genome
      • Basis of Radiation Hybrid Mapping
      • UniGene
      • Genome Assembly
      • Related resource: Electronic PCR

  23. UniSTS: Database of Mapped Markers

  24. HTG Division: High Throughput Genome
      Same accession numbers, different versions:
      • phase 1 (HTG), Acc = AC109609.1: unfinished, may be unordered, with gaps
      • phase 2 (HTG), Acc = AC109609.6: unfinished, oriented, ordered, may have gaps
      • phase 3 (ROD), Acc = AC109609.10: finished, no gaps
      40,000 to > 50,000 bp
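Since the accession stays fixed while the version suffix increases, tools that track a record should compare versions as integers rather than as text. A minimal sketch using the three identifiers from this slide (the function name `split_accession` is illustrative):

```python
def split_accession(accver):
    """Split 'ACCESSION.VERSION' into (accession, integer version)."""
    accession, _, version = accver.partition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {accver!r}")
    return accession, int(version)

records = ["AC109609.1", "AC109609.10", "AC109609.6"]

# Numeric comparison picks the newest version.
latest = max(records, key=lambda r: split_accession(r)[1])

# Plain string sorting gets this wrong: '.10' sorts before '.6'.
naive_latest = sorted(records)[-1]

print(latest, naive_latest)
```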

  25. HTG Division: High Throughput Genome

  26. RefSeq: NCBI’s Derivative Sequence Database
      • Curated transcripts and proteins
        • Reviewed
        • Human, mouse, rat, fruit fly, zebrafish, Arabidopsis
      • Human model transcripts and proteins
      • Assembled genomic regions (contigs)
        • Draft human genome
        • Mouse genome
      • Chromosome records
        • Microbial
        • Viral
        • Organelle

  27. Reference Sequences (group labels on the slide: Curated, Automated)
        Chromosome      NC_000000
        mRNA            NM_000000
        protein         NP_000000
        Gene            NG_000000
        Contig          NT_000000, NW_000000
        RNA             NR_000000
        Model mRNA      XM_000000
        Model protein   XP_000000
        Model RNA       XR_000000
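The two-letter prefix before the underscore identifies the kind of RefSeq record, so it can be used as a lookup key. The table below copies the prefixes and labels from this slide; the function name `classify_refseq` is illustrative, and prefixes not listed on the slide simply return `None`.

```python
# Prefix-to-type table as listed on the slide.
REFSEQ_PREFIXES = {
    "NC": "Chromosome",
    "NM": "mRNA",
    "NP": "protein",
    "NG": "Gene",
    "NT": "Contig",
    "NW": "Contig",
    "NR": "RNA",
    "XM": "Model mRNA",
    "XP": "Model protein",
    "XR": "Model RNA",
}

def classify_refseq(accession):
    """Return the record type for a RefSeq accession, or None if unrecognized."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())

for acc in ["NC_002695.1", "NM_000000", "XP_000000", "AC109609.10"]:
    print(acc, classify_refseq(acc))
```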

  28. RefSeq Chromosomes: NC_
      LOCUS       NC_002695    5498450 bp    DNA   circular  BCT   02-OCT-2001
      DEFINITION  Escherichia coli O157:H7, complete genome.
      ACCESSION   NC_002695
      VERSION     NC_002695.1  GI:15829254
      KEYWORDS    .
      SOURCE      Escherichia coli O157:H7.
        ORGANISM  Escherichia coli O157:H7
                  Bacteria; Proteobacteria; gamma subdivision;
                  Enterobacteriaceae; Escherichia.
      REFERENCE   1  (sites)
        AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
                  Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H.,
                  Iida,T., Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T.,
                  Honda,T., Sasakawa,C. and Shinagawa,H.
        TITLE     Complete nucleotide sequence of the prophage VT2-Sakai
                  carrying the verotoxin 2 genes of the enterohemorrhagic
                  Escherichia coli O157:H7 derived from the Sakai outbreak
        JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
        MEDLINE   20198780
         PUBMED   10734605
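The LOCUS line packs several fields into one row. As a rough sketch, it can be split on whitespace, assuming all seven fields (name, length, unit, molecule type, topology, division, date) are present as they are in this record; real flat files may omit the topology, which this simplified parser does not handle. The function name `parse_locus` and the dictionary keys are illustrative.

```python
def parse_locus(line):
    """Parse a whitespace-separated LOCUS line into a dict.

    Assumes exactly the seven fields seen in the slide's example follow
    the LOCUS keyword; this is a simplification of the real format.
    """
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS":
        raise ValueError("expected 'LOCUS' followed by seven fields")
    name, length, unit, molecule, topology, division, date = tokens[1:]
    return {
        "name": name,
        "length": int(length),
        "unit": unit,
        "molecule": molecule,
        "topology": topology,
        "division": division,
        "date": date,
    }

locus = parse_locus(
    "LOCUS       NC_002695    5498450 bp    DNA   circular  BCT   02-OCT-2001"
)
print(locus)
```

Here the division code `BCT` matches the Bacterial/Archaeal entry in the GenBank divisions list.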

  29. RefSeq Contig: NT_, NW_

  30. Curated RefSeq Records: NM_, NP_

  31. Alignment Generated Transcripts: XM_, XP_

  32. REFSEQ: Summary

  33. BLAST: a starting point for most bioinformatics related problems…

  34. BLAST

  35. One BLAST, many flavors
