
Biological Databases



Presentation Transcript


  1. Biological Databases
     What types of data are available?
     What is a database?
     What are GenBank and Entrez?
     What does a typical entry look like?
     How does one use the database?
     BIO520 Bioinformatics, Jim Lund

  2. NCBI Biological Databases: Central Dogma-o-centric
     • Genomic DNA sequence
     • mRNA/cDNA sequence
     • Protein sequence
     • Protein 3D structure
     • Literature (Function)

  3. Biological Data
     • Genomic DNA sequence (complete)
     • mRNA/cDNA sequence
     • Gene expression data (NEW)
        • Microarrays, SAGE
        • Expression catalogs
     • Protein sequence
     • Protein interaction/complex data (NEW)
     • Protein 3D structure
     • Literature (Function)
     • Organism databases (NEW)
     • Annotation and classification projects (NEW)

  4. What is a Biological Database?
     An organized body of persistent data and associated computer software for updating, querying, and retrieving data records.
     • Collection of records and files
     • Organized for a particular purpose
     • The database is separate from the interface and can have several interfaces.
     • NCBI Protein can be searched by protein name or using BLAST (Basic Local Alignment Search Tool).

  5. Common database features
     • Relational databases
        • Tables
        • Relationships between tables
     • Version control
     • Consistency enforcement
     • Multiauthor/multiuser with security

  6. BIO 520 Student Database (table "BIO520")

     Name   ID    Grade
     Amy    123   A
     Joe    456   B
     Sue    789   C

     Diagram labels: Table (the whole grid), Column (one named field), Record (one row), Value (one cell).
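
The student table above can be sketched as a relational table using Python's built-in sqlite3 module. This is only an illustration: the table name `students`, the column names, and the helper functions are assumptions chosen for the example, not anything the course specifies.

```python
import sqlite3

# The three records shown on the slide: (name, id, grade).
STUDENTS = [("Amy", 123, "A"), ("Joe", 456, "B"), ("Sue", 789, "C")]


def build_student_db():
    """Create an in-memory database holding the example table."""
    conn = sqlite3.connect(":memory:")
    # Each column has a name and a type; the ID column identifies a record.
    conn.execute(
        "CREATE TABLE students (name TEXT, id INTEGER PRIMARY KEY, grade TEXT)"
    )
    conn.executemany("INSERT INTO students VALUES (?, ?, ?)", STUDENTS)
    conn.commit()
    return conn


def grade_for(conn, student_id):
    """Return the grade value stored in the record with this ID, or None."""
    row = conn.execute(
        "SELECT grade FROM students WHERE id = ?", (student_id,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    db = build_student_db()
    print(grade_for(db, 456))
```

The query interface (SQL here) is separate from the stored data, which is the same separation of database and interface described on slide 4.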

  7. GenBank Entry

     LOCUS       BC005255     495 bp    mRNA    linear   PRI 23-JUN-2006
     DEFINITION  Homo sapiens insulin, mRNA (cDNA clone IMAGE:3950204),
                 complete cds.
     ACCESSION   BC005255
     VERSION     BC005255.1  GI:13528923
     KEYWORDS    MGC.
     SOURCE      Homo sapiens (human)
       ORGANISM  Homo sapiens
                 Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
                 Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates;
                 Haplorrhini; Catarrhini; Hominidae; Homo.
     FEATURES             Location/Qualifiers
          source          1..495
                          /organism="Homo sapiens"
          gene            1..495
                          /gene="INS"
                          /db_xref="GeneID:3630"
          CDS             60..392
                          /gene="INS"
                          /translation="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCG
                          ERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSL
                          YQLENYCN"
     ORIGIN
             1 agccctccag gacaggctgc atcagaagag gccatcaagc agatcactgt ccttctgcca
             …
           421 ccgcctcctg caccgagaga gatggaataa agcccttgaa ccaacaaaaa aaaaaaaaaa
           481 aaaaaaaaaa aaaaa
     //

  8. The CORE: DDBJ, EMBL, and GenBank

  9. GenBank DNA Sequence Database
     • GenBank/EMBL/DDBJ mirror & exchange sequence records.
     • Primary vs. secondary databases
        • nr (non-redundant database)
     • Primary vs. secondary records
        • Sequence vs. inferred property (coding region)

  10. Primary vs. Derivative Databases
     • Primary Databases
        • Original submissions by experimentalists
        • Content controlled by the submitter
        • Examples: GenBank, SNP, GEO
     • Derivative Databases
        • Built from primary data
        • Content controlled by third party (NCBI)
        • Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

  11. A Traditional GenBank Record: The Flatfile Format (formatted text)

     Header:

     LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
     DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
                 complete cds.
     ACCESSION   AY182241
     VERSION     AY182241.2  GI:32265057
     KEYWORDS    .
     SOURCE      Malus x domestica (cultivated apple)
       ORGANISM  Malus x domestica
                 Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
                 Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons;
                 core eudicots; rosids; eurosids I; Rosales; Rosaceae;
                 Maloideae; Malus.
     REFERENCE   1  (bases 1 to 1931)
       AUTHORS   Pechous,S.W. and Whitaker,B.D.
       TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
                 synthase cDNA from peel tissue of apple fruit
       JOURNAL   Planta 219, 84-94 (2004)
     REFERENCE   2  (bases 1 to 1931)
       AUTHORS   Pechous,S.W. and Whitaker,B.D.
       TITLE     Direct Submission
       JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
                 USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205,
                 Beltsville, MD 20705, USA
     REFERENCE   3  (bases 1 to 1931)
       AUTHORS   Pechous,S.W. and Whitaker,B.D.
       TITLE     Direct Submission
       JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
                 USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205,
                 Beltsville, MD 20705, USA
       REMARK    Sequence update by submitter
     COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.

     Feature Table:

     FEATURES             Location/Qualifiers
          source          1..1931
                          /organism="Malus x domestica"
                          /mol_type="mRNA"
                          /cultivar="'Law Rome'"
                          /db_xref="taxon:3750"
                          /tissue_type="peel"
          gene            1..1931
                          /gene="AFS1"
          CDS             54..1784
                          /gene="AFS1"
                          /note="terpene synthase"
                          /codon_start=1
                          /product="(E,E)-alpha-farnesene synthase"
                          /protein_id="AAO22848.2"
                          /db_xref="GI:32265058"
                          /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                          NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                          EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                          DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                          GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                          LSLLFQPLVN"

     Sequence:

     ORIGIN
             1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
            61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
           121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
           181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
           241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
     //

  12. GenBank Entry

     LOCUS       PCU30791     1234 bp    mRNA            PLN       31-MAY-1996
     DEFINITION  Pneumocystis carinii carinii form 6 guanine nucleotide binding
                 protein alpha subunit (pcg1) mRNA, complete cds.
     ACCESSION   U30791
     NID         g1345098
     VERSION     U30791.1  GI:1345098

     Callouts: Unique ID; Version Control.

  13. Content: Taxonomy

     SOURCE      Pneumocystis carinii f. sp. carinii.
       ORGANISM  Pneumocystis carinii f. sp. carinii
                 Eukaryota; Fungi; Ascomycota; Archiascomycetes;
                 Pneumocystidaceae; Pneumocystis.

     Callouts: NCBI Taxonomy; can change.

  14. Reference

     REFERENCE   1  (bases 1 to 1234)
       AUTHORS   Smulian,A.G., Ryan,M., Staben,C. and Cushion,M.
       TITLE     Signal transduction in Pneumocystis carinii: characterization
                 of the genes (pcg1) encoding the alpha subunit of the G
                 protein (PCG1) of Pneumocystis carinii carinii and
                 Pneumocystis carinii ratti
       JOURNAL   Infect. Immun. 64 (3), 691-701 (1996)
       PUBMED    96186460

     • Unique cross reference
     • Can be >1 reference

  15. Features

     FEATURES             Location/Qualifiers
          source          1..1234
                          /organism="Pneumocystis carinii f. sp. carinii"
                          /strain="Form 6"
                          /note="450 kb chromosome"
                          /db_xref="taxon:38081"
          5'UTR           1..90
          gene            91..1155
                          /gene="pcg1"

     Correct?

  16. CDS

          CDS             91..1155
                          /gene="pcg1"
                          /note="G-protein alpha subunit"
                          /codon_start=1
                          /product="guanosine nucleotide binding protein alpha subunit"
                          /protein_id="AAC49295.1"
                          /db_xref="PID:g1345099"
                          /db_xref="GI:1345099"
                          /translation="MGCCFSATYNQDTLRSKEIE
                          SYLRQEQEHACHEAKILLLGAGES…

     Callouts: Related info in another database; INFERRED.

  17. DNA

     BASE COUNT      421 a    171 c    195 g    447 t
     ORIGIN
             1 tgaattctaa attttatatt …
          1201 … tattttttta tgctccagat aaaa
     //

  18. GenBank entries
     • Combination of required (LOCUS, SOURCE) and optional fields.
     • The entry is hierarchical; some fields contain subfields.
        • REFERENCE -> AUTHORS
     • Some fields can appear multiple times (REFERENCE, /gene).
     • Some fields are numerical, others are text. Some fields contain free text; others use a controlled vocabulary or a database ID.

  19. Other GenBank output formats
     • FASTA
        • Simple, little annotation information
        • Easy to use
        • Common denominator format
     • ASN.1
        • Computer friendly, human unfriendly
     • XML, INSDSeqXML, TinySeqXML
     • Graph (graphical map of seq features)
     …and more

  20. DNA Sequence Files: Common formats
     • GenBank (used by VectorNTI)
     • FASTA
     • GCG
        • Accelrys GCG (Genetics Computer Group) package
        • formerly GCG Wisconsin Package
     Many others!

  21. FASTA
     One annotation line only!

     >gi|1345098|gb|U30791.1|PCU30791
     TGAATTCTAAATTTTATATTTCTAATTGCATTTTATATTTTTGATAATACTAGATTTATTCCTGGAAACT
     TAAATTAGTTATTTTAAGTTATGGGATGTTGTTTTTCTGCTACATATAACCAAGATACACTTCGTTCCAA
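
Because FASTA has just one annotation line per sequence, reading it takes very little code. The sketch below is a minimal reader written for illustration (the function name `parse_fasta` is not from any library); it treats every line starting with ">" as a header and joins the following lines into one sequence string. The sample is the fragment shown on the slide, which is only the start of the full entry.

```python
SAMPLE_FASTA = """\
>gi|1345098|gb|U30791.1|PCU30791
TGAATTCTAAATTTTATATTTCTAATTGCATTTTATATTTTTGATAATACTAGATTTATTCCTGGAAACT
TAAATTAGTTATTTTAAGTTATGGGATGTTGTTTTTCTGCTACATATAACCAAGATACACTTCGTTCCAA
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    for name, seq in parse_fasta(SAMPLE_FASTA):
        print(name, len(seq))
```

Everything else in the GenBank record (features, references, taxonomy) has no place in this format, which is why converting to FASTA loses annotation.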

  22. Submitting sequences to GenBank
     • Sequin: stand-alone sequence submission tool.
     • BankIt: web-based sequence submission.

  23. GenBank is an ARCHIVE. The literature and secondary databases are the knowledge sources. There are many additional NCBI annotation databases.

  24. NCBI annotation databases!
     GenBank -> RefSeq (single sequence for each gene)
     • Entrez Gene (gene-based links to annotation sources)
     • HomoloGene (homologs)
     • OMIM
     • Conserved domains, 3D domains
     • GEO (gene expression datasets)
     • DNA, protein, 3D structures
     • Interaction data
     • Links to other databases!
     • NCBI Genomes
     • NCBI Map Viewer

  25. Finding and editing DNA files
     • Find DNA: Entrez
     • Downloading files
     • Format conversion
     • Sequence viewing/editing

  26. Entrez
     • Database searching/browsing
     • Example: Pneumocystis G-proteins
        • PCR a cDNA to express in E. coli
        • Read about it and related genes
        • Check similarity to related G-proteins
        • View the 3D structure??
     • http://www.ncbi.nlm.nih.gov/Entrez/
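
Entrez can also be queried from a program rather than a browser. The slides do not cover this, so treat the following as a side sketch under stated assumptions: it uses NCBI's E-utilities `efetch` endpoint with the `db`, `id`, `rettype`, and `retmode` parameters, and the helper name `efetch_url` is invented for the example. It only builds the request URL for the Pneumocystis record from slide 12 and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service (assumed current).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta", retmode="text"):
    """Build a URL that asks Entrez for one record in the given format."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EFETCH_URL}?{query}"


if __name__ == "__main__":
    # FASTA and GenBank flatfile views of the same record.
    print(efetch_url("U30791.1"))
    print(efetch_url("U30791.1", rettype="gb"))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out on purpose; check NCBI's usage guidelines before sending automated requests.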

  27. Entrez Neighbors: Protein
     [Diagram: a Protein record and its Entrez links to 3D Structure, Literature (citation), other Protein records (BLASTP), and DNA (encoding).]

  28. Mapping the menagerie of biological databases

  29. Nucleic Acid Manipulations
     • On the web:
        • Baylor Human Genome Center (BCM): http://searchlauncher.bcm.tmc.edu/seq-util/seq-util.html
        • European Bioinformatics Institute (EBI): http://www.ebi.ac.uk/Tools/misc.html

  30. Readseq
     DNA/protein sequence format conversion
     • Download program: http://iubio.bio.indiana.edu/soft/molbio/readseq
     • Use online: http://www.ebi.ac.uk/cgi-bin/readseq.cgi and http://searchlauncher.bcm.tmc.edu/seq-util/readseq.html
     Beware information loss!

  31. Reverse Complementing
     5’-GAATCA-3’ reverse-complements to 5’-TGATTC-3’, NOT 5’-ACTAAG-3’ (the sequence merely reversed).
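
The distinction between reverse-complementing and merely reversing can be checked in a few lines. This is a minimal sketch that handles only the four unambiguous bases (A, C, G, T) in either case; the function name is chosen for the example.

```python
# Translation table mapping each base to its complement, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    # Complement every base, then reverse the order.
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    seq = "GAATCA"
    print(reverse_complement(seq))  # the opposite strand
    print(seq[::-1])                # reversal only: not a valid answer
```

Applying the function twice returns the original sequence, which is a handy sanity check.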

  32. Sequence Statistics
     • Nucleotide frequencies (di, tri…)
     • UV absorbance
     • MW
     • Tm
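
Two of the statistics above are easy to sketch: k-mer (mono-, di-, tri-nucleotide) counts and a rough melting temperature. For Tm the example assumes the simple Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a rule of thumb for short oligonucleotides; the slides do not say which formula the course uses. Function names are invented for the example.

```python
from collections import Counter


def kmer_counts(seq, k=1):
    """Count overlapping k-mers: k=1 bases, k=2 dinucleotides, and so on."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def wallace_tm(seq):
    """Approximate Tm in degrees C by the Wallace rule (assumed formula)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    oligo = "GAATCA"
    print(dict(kmer_counts(oligo)))
    print(dict(kmer_counts(oligo, k=2)))
    print(wallace_tm(oligo))
```

Primer-design programs use more detailed thermodynamic models, so values from this rule should be read as rough estimates only.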

  33. Restriction Map
     • Linear vs circular
     • Enzyme sets
     • Which enzymes, where they cut
     • Gel simulation
        • Gel-to-map is MUCH harder!!
     • Useful for:
        • Cloning
        • Southern blots
        • Specialized mol bio techniques
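
The linear-versus-circular point and the gel-simulation idea can be sketched as follows. The code is an illustration under simple assumptions: it matches an exact recognition site on one strand (sufficient for palindromic sites such as EcoRI's GAATTC), reports 0-based site start positions, and computes fragment lengths from cut positions supplied by the caller. The test sequences and function names are made up.

```python
def find_sites(seq, site, circular=False):
    """Return 0-based start positions of a recognition site in seq."""
    seq, site = seq.upper(), site.upper()
    # For a circular molecule a site may span the junction, so append the
    # first len(site) - 1 bases to the end before searching.
    search = seq + seq[:len(site) - 1] if circular else seq
    positions = []
    start = search.find(site)
    while start != -1:
        positions.append(start)
        start = search.find(site, start + 1)
    return positions


def fragment_lengths(seq_len, cuts, circular=False):
    """Fragment sizes produced by cutting at the given positions."""
    cuts = sorted(cuts)
    if not cuts:
        return [seq_len]
    if circular:
        # n cuts in a circle give n fragments; one cut just linearizes it.
        return [(cuts[(i + 1) % len(cuts)] - c) % seq_len or seq_len
                for i, c in enumerate(cuts)]
    bounds = [0] + cuts + [seq_len]
    return [b - a for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    dna = "TTGAATTCAAGAATTCGG"  # hypothetical 18-base test sequence
    starts = find_sites(dna, "GAATTC")
    # EcoRI cuts after the G, i.e. one base past the site start.
    cuts = [s + 1 for s in starts]
    print(starts, fragment_lengths(len(dna), cuts))
    print(fragment_lengths(len(dna), cuts, circular=True))
```

Going from sequence to fragment sizes is direct, as shown; recovering a map from gel fragment sizes is the much harder inverse problem the slide warns about.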

  34. Translation/ORFs
     • Translation table
        • Standard vs non-standard
     • Frame (1,2,3,4,5,6)
     • Segmental translation (exon-intron)
     • Primary translation vs mature polypeptide
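
Six-frame translation with the standard table can be sketched compactly. Assumptions in this example: the standard genetic code only, frames 1 to 3 read the given strand at offsets 0 to 2 and frames 4 to 6 read the reverse complement at offsets 0 to 2 (numbering conventions differ between programs), incomplete trailing codons are dropped, and codons with unexpected characters become "X".

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate from the first base; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return {frame number: protein} for all six reading frames."""
    forward = seq.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[offset + 4] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frames("ATGGCCTAA").items():
        print(frame, protein)
```

A non-standard table (for example a mitochondrial code) would need a different `AMINO_ACIDS` string; segmental translation across exons would require joining the exon sequences before calling `translate`.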

  35. Sequence Annotation and Editing
     • Text editor
        • Notepad
        • Word processor
        • vi
     • Artemis
     • Sequin
        • NCBI’s GenBank entry creation/viewing tool
     [Figure: the rows MWGTCC and IIIIII shown twice to compare fonts.] Nonproportional fonts (Courier, monospaced…)

  36. Primer design program: Primer3 http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi

  37. Primary vs. Derivative Databases
     • Primary Databases
        • Original submissions by experimentalists
        • Content controlled by the submitter
        • Examples: GenBank, SNP, GEO
     • Derivative Databases
        • Built from primary data
        • Content controlled by third party (NCBI)
        • Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

  38. Other NCBI Databases
     • Structure: imported structures (PDB)
        • Cn3D viewer, NCBI curation
     • CDD: conserved domain database
        • Protein families (COGs and KOGs)
        • Single domains (PFAM, SMART, CD)
     • dbSNP: nucleotide polymorphism
     • Gene: gene records
        • Unifies LocusLink and Microbial Genomes

  39. HomoloGene Cluster

  40. Entrez Protein: Derivative Database

  41. Redundant Proteins

     GenPept:
     >gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
     MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
     EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
     >gi|13905126|gb|AAH06850.1| MutL protein homolog 1 ...
     MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
     EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
     >gi|1079787|gb|AAA82079.1| DNA mismatch repair prot...
     MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
     EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

     NCBI RefSeq:
     >gi|4557757|ref|NP_000240.1| MutL protein homolog 1...
     MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
     EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

     Swiss-Prot:
     >gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
     MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
     EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

     PRF:
     >gi|741682|prf||2007430A DNA mismatch repair protei...
     MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
     EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
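
The redundancy shown above, where one protein sequence appears under several source-database identifiers, is what a non-redundant set such as nr (slide 9) removes. The sketch below groups identical sequences under a single entry. It illustrates the idea only and is not NCBI's procedure; the record names and toy sequences are hypothetical, since the slide's sequences are truncated.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (header, sequence) pairs so each distinct sequence appears once.

    Returns a dict mapping each sequence to the list of headers sharing it.
    """
    groups = defaultdict(list)
    for header, seq in records:
        groups[seq.upper()].append(header)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical records standing in for entries from different sources.
    records = [
        ("genpept_entry_1", "MSFVAGVIRR"),
        ("refseq_entry", "MSFVAGVIRR"),
        ("swissprot_entry", "MSFVAGVIRR"),
        ("unrelated_entry", "MALWMRLLPL"),
    ]
    for seq, headers in collapse_identical(records).items():
        print(seq, headers)
```

Only exact matches are merged here; near-identical sequences stay separate.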

  42. RefSeq: NCBI’s Derivative Sequence Database
     • Curated transcripts and proteins
        • reviewed
        • human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
     • Model transcripts and proteins
     • Assembled Genomic Regions (contigs)
        • human
        • mouse
        • rat
     • Chromosome records
        • Human genome
        • microbial
        • organelle
     Also listed on the slide: chicken, honeybee, sea urchin, zebrafish, cow, dog, black poplar
     Entrez query: srcdb_refseq[Properties]
     FTP: ftp://ftp.ncbi.nih.gov/refseq/release/

  43. RefSeq Accession Numbers

     mRNAs and Proteins
        NM_123456   Curated mRNA
        NP_123456   Curated Protein
        NR_123456   Curated non-coding RNA
        XM_123456   Predicted mRNA
        XP_123456   Predicted Protein
        XR_123456   Predicted non-coding RNA
     Gene Records
        NG_123456   Reference Genomic Sequence
     Chromosome
        NC_123455   also microbial replicons, organelle genomes, human chromosomes
     Assemblies
        NT_123456   Contig
        NW_123456   WGS Supercontig

     http://www.ncbi.nlm.nih.gov/RefSeq/key.html
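
Since the two-letter prefix carries the record type, an accession can be classified with a simple lookup. The table below copies only the prefixes listed on the slide (RefSeq defines others), and the function name is invented for the example.

```python
# Prefix meanings as listed on the slide; not the complete RefSeq list.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Chromosome, microbial replicon, or organelle genome",
    "NT": "Contig",
    "NW": "WGS supercontig",
}


def classify_accession(accession):
    """Return the record type for a RefSeq accession, or None if unknown."""
    prefix, underscore, _ = accession.strip().partition("_")
    if not underscore:
        # No underscore: not a RefSeq-style accession (e.g. GenBank U30791).
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NP_000240.1", "XM_123456", "U30791.1"):
        print(acc, classify_accession(acc))
```

The underscore is what distinguishes RefSeq accessions from primary GenBank accessions such as U30791 or BC005255 seen in earlier slides.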
