1 / 39

NCBI Molecular Biology Resources

NCBI Molecular Biology Resources. Sequin BankIt WebIn SAKURA. Entrez. Submissions Updates. Submissions Updates. Submissions Updates. SRS. getentry. The International Sequence Database Collaboration. NIH. NCBI. GenBank. EMBL. DDBJ. EBI. CIB. NIG. EMBL.

bayle
Download Presentation

NCBI Molecular Biology Resources

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. NCBI Molecular Biology Resources

  2. Sequin • BankIt • WebIn • SAKURA Entrez • Submissions • Updates • Submissions • Updates • Submissions • Updates SRS getentry The International Sequence Database Collaboration NIH NCBI GenBank EMBL DDBJ EBI CIB NIG EMBL

  3. Primary vs. Derivative Databases C C GA ATT GA UniGene GA C ATT GA Algorithms C TATAGCCG Sequencing Centers ACGTGC ACGTGC TTGACA ATTGACTA CGTGA UniSTS EST GenBank Updated continually by NCBI STS Updated ONLY by submitters RefSeq: Annotation Pipeline GSS HTG INV VRT PHG VRL PRI ROD PLN MAM BCT ACGTGC RefSeq: LocusLink and Genomes Pipelines Curators TATAGCCG AGCTCCGATA CCGATGACAA Labs

  4. Primary Databases Original submissions by experimentalists Remember biologyʼs Central Dogma: DNA → RNA → protein. Primary refers to one dimensional ʻsymbolʼ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or polynucleotide. Content controlled by the submitter Examples: GenBank, SNP, GEO, PubChem Substance Derivative Databases Built from primary data Content controlled by third party (NCBI) Examples: Refseq, TPA, RefSNP, UniGene, Protein, Structure, Conserved Domain, PubChem Compound Types of Databases

  5. Entrez Global Query is an integrated search and retrieval system for databases of National Center for Biotechnology Information (NCBI). It provides access to all NCBI databases simultaneously with a single query string and user interface. Support boolean operators and search term tags to limit parts of the search statement to particular fields. This returns a unified results page, that shows the number of hits for the search in each of the databases, which are also links to actual search results for that particular database. A text search / retrieval engine NOT A DATABASE. A tool for finding biologically linked data. A virtual workspace for manipulating large datasets. What is Entrez?

  6. Each record is assigned a UID unique integer identifier for internal tracking GI number for Nucleotide Each record is given a Document Summary a summary of the record’s content (DocSum) Each record is assigned links to biologically related UIDs Each record is indexed by data fields [author], [title], [organism], and many others Entrez Databases

  7. The Entrez System Journals UniGene Books SNP PubMed UniSTS PubMed Central Nucleotide PopSet Protein ProbeSet Entrez Genome Structure Taxonomy CDD OMIM 3D Domains

  8. The Entrez System: Text Searches

  9. Entrez Taxonomy The backbone of NCBI [organism]

  10. GenBank: Primary Data (97.9%) original submissions by experimentalists submitters retain editorial control of records archival in nature RefSeq: Derivative Data (2.1%) curated by NCBI staff NCBI retains editorial control of records record content is updated continually An Entrez Database - Nucleotide

  11. Nucleotide only sequence database Archival in nature Each record is assigned a stable accession number GenBank Data Direct submissions (traditional records ) Batch submissions (EST, GSS, STS) ftp accounts (genome data) Three collaborating databases GenBank DNA Database of Japan (DDBJ) European Molecular Biology Laboratory (EMBL) Database What is GenBank?NCBI’s Primary Sequence Database

  12. Header Feature Table Sequence A Traditional GenBank Record LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004 DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds. ACCESSION AY182241 VERSION AY182241.2 GI:32265057 KEYWORDS . SOURCE Malus x domestica (cultivated apple) ORGANISM Malus x domestica Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus. REFERENCE 1 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit JOURNAL Planta 219, 84-94 (2004) REFERENCE 2 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REFERENCE 3 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REMARK Sequence update by submitter COMMENT On Jun 26, 2003 this sequence version replaced gi:27804758. FEATURES Location/Qualifiers source 1..1931 /organism="Malus x domestica" /mol_type="mRNA" /cultivar="'Law Rome'" /db_xref="taxon:3750" /tissue_type="peel" gene 1..1931 /gene="AFS1" CDS 54..1784 /gene="AFS1" /note="terpene synthase" /codon_start=1 /product="(E,E)-alpha-farnesene synthase" /protein_id="AAO22848.2" /db_xref="GI:32265058" /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI LSLLFQPLVN" ORIGIN 1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat 61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg 121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt 181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga 241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt 1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt 1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa 1921 aaaaaaaaaa a // The Flatfile Format

  13. An Example Record – M17755 Indexing for Nucleotide UID 4680720 Field Indexed Terms [primary accession] M17755 [title] Homo sapiens thyroid peroxidase (TPO) mRNA… [organism] Homo sapiens [sequence length] 3060 [modification date] 1999/04/26 [properties] biomol mrna gbdiv pri srcdb genbank

  14. M17755: Feature Table TPO[gene name] CDS position in bp thyroiditis [text word] thyroid peroxidase [protein name] protein accession

  15. Sequence: 99.99% Accurate The sequence itself is not indexed… Use BLAST for that!

  16. RefSeq database is a collection of taxonomically diverse, non-redundant and richly annotated sequences representing naturally occurring molecules of DNA, RNA, and protein. Non-redundant nucleotide and protein sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Updated to reflect current sequence data and biology Each RefSeq is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration (INSDC). Similar to a review article, a RefSeq integrates information across multiple sources at a given time hence provides a foundation for uniting sequence data with genetic and functional information. They are generated to provide reference standards for multiple purposes ranging from genome annotation to reporting locations of sequence variation in medical records. RefSeq: NCBI’s Derivative Sequence Database

  17. The common Refseq accession prefix

  18. Entrez Gene and RefSeq Gene GenBank RefSeq Nucleotide • Entrez Gene is the central depository for information about a gene available at NCBI, and often provides links to sites beyond NCBI • Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs) • Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases

  19. Entrez Gene: RefSeq Annotations

  20. If your organism does not have RefSeqs… UniGene : gene-based clusters of cDNAs and ESTs WGS sequences in Entrez Nucleotide (wgs[prop]) Trace Archive Beyond RefSeq

  21. What is UniGene? A gene-oriented view of sequence entries • MegaBlast based automated sequence clustering • Now informed by genome hits New! • Nonredundant set of gene oriented clusters • Each cluster a unique gene • Information on tissue types and map locations • Includes known genes and uncharacterized ESTs • Useful for gene discovery and selection of mapping reagents

  22. Organisms in UniGene Top Ten 1. Human 2. Rice 3. Mouse 4. Cow 5. Wheat 6. Zebrafish 7. Pig 8. Chicken 9. Frog (X. laevis) 10. Frog (X. tropicalis)

  23. Finding UniGene Clusters by link by Entrez search

  24. UniGene Cluster for TPO

  25. GenPept (DDBJ, EMBL, GenBank)4,444,405 RefSeq 1,753,167 PIR 222,395 Swiss Prot 189,005 PDB 68,621 PRF 12,079 Third Party Annotation 4,219 Total 6,693,891 Entrez Protein

  26. Protein Sources and Links PIR no mRNA! RefSeq  NM_000537 SWISS-PROT nomRNA! GenPept  M17755

  27. PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published world-wide. It has 12 million records dating back to 1966. In order to impose uniformity and consistency to the indexing of biomedical literature MeSH vocabulary is used for indexing journal articles for MEDLINE. MeSH is the acronym for "Medical Subject Headings." MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM.

  28. Taxonomy Browser is… • browser for the major divisions of living organisms • (archaea, bacteria, eukaryota, viruses) • taxonomy information such as genetic codes • molecular data on extinct organisms

  29. Structure site includes… • Molecular Modelling Database (MMDB) • biopolymer structures obtained from • the Protein Data Bank (PDB) • Cn3D (a 3D-structure viewer) • vector alignment search tool (VAST)

  30. OMIM is… • Online Mendelian Inheritance in Man • catalog of human genes and genetic disorders • edited by Dr. Victor McKusick, others at JHU

  31. SWISS-PROT Manually curated high-quality annotations, less data GenPept/TREMBL Translated coding sequences from GenBank/EMBL Few annotations, more up to date PIR Phylogenetic-based annotations All 3 now combining efforts to form UniProt (http://www.uniprot.org) General Protein Databases

  32. Swissprot format http://us.expasy.org/sprot/userman.html

  33. Sequence data only: cannot be browsed, can only be searched using a sequence Combine sequences from more than one database Examples: NR Nucleic (genbank+EMBL+DDBJ+PDB DNA) NR Protein (SWISS-PROT+TrEMBL+GenPept+PDB protein) Non-redundant Databases

  34. Protein domain databases • Pfam (http://www.sanger.ac.uk/Software/Pfam/) Collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families • SMART (a Simple Modular Architecture Research Tool) Identification and annotation of genetically mobile domains and the analysis of domain architectures (http://smart.embl-heidelberg.de/help/smart_about.shtml • CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) Combines SMART and Pfam databases Easier and quicker search

  35. Scan Prosite (http://www.expassy.org/prosite) and PRINTS (http://bioinf.man.ac.uk/dbbrowser/PRINTS/) Store conserved motifs occurring in nucleic acid or protein sequences Motifs can be stored as consensus sequences, alignments, or using statistical representations such as residue frequency tables Sequence Motif Databases

More Related