
Nucleotide Sequence Databases


Presentation Transcript


  1. Nucleotide Sequence Databases: Your guide to genes & genomes

  2. Nucleotide Sequence Databases • First generation • GenBank is a representative example • started as sort of a museum to preserve knowledge of a sequence from first discovery • great repositories, particularly for long-term study of bioinformatic data • flat files; not built for (and not great at) querying

  3. Nucleotide Sequence Databases • Second generation: • Entrez Gene is an example • information is gene-centric (not just sequence-centric) • all sequence information for a given gene can be found in one place

  4. Nucleotide Sequence Databases • Third generation: • Ensembl is a good example • Information is organized around whole genomes; not only a specific gene’s structure, but its context: • position of this gene relative to others • strand orientation • how gene relates to presence or absence of biochemical functions in organism

  5. Prokaryotes (& Archaea) • microscopic organisms • single cell • no nucleus • simple genome: • typically a single, circular DNA molecule • roughly 600,000 – 8 million base pairs • about 70% or more of genome codes for proteins

  6. Prokaryotes (& Archaea) • genes generally don’t overlap • no introns; mRNA is collinear with gene sequence • protein sequences derived by translating longest ORF (ATG to STOP) spanning gene-transcript sequence (source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm)
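
The "longest ORF" rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a gene finder: it assumes a plain DNA string, scans the forward strand only, and uses the three standard stop codons. The function name `longest_orf` is invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def longest_orf(seq: str) -> str:
    """Return the longest ATG-to-stop ORF on the forward strand ('' if none)."""
    seq = seq.upper()
    best = ""
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from this ATG until the first in-frame stop.
        for pos in range(start, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                orf = seq[start:pos + 3]
                if len(orf) > len(best):
                    best = orf
                break
    return best
```

A real analysis would also scan the reverse complement and allow for alternative start codons; both are left out here to keep the sketch short.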

  7. Thought for today … source: http://www.scicomics.com/uploads/prokaryote.jpg

  8. Eukaryotes • way more complicated • genes found in cell nucleus • genome size: from about 10 million to several billion base pairs or more (the human genome is roughly 3 billion) • much lower gene density than prokaryotes: in human chromosomes, about one gene for every 100,000 base pairs source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm

  9. Eukaryotes • much less efficient than prokaryotes; less than 5% of human genome codes for protein • genes transcribed after a promoter region; but process may be strongly influenced by sequence elements relatively far away source: http://www.cit.gu.edu.au/~anthony/dungeon/balcony/

  10. Eukaryotes • Gene sequences and mRNA/protein sequences not collinear; only exons are retained in mature mRNA that encodes protein • A single gene may (and often does) exhibit more than one mRNA and protein form

  11. GenBank • First example: prokaryotic gene • point your browser to: http://www.ncbi.nlm.nih.gov/entrez • choose Nucleotide from the Search pull-down menu • in For box, type X01714 and click Go • Click the link labeled X01714 • Can “Send To Text” if you want to save the file
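
The slide describes the browser route. As an alternative, the same record can be requested from a script through NCBI's E-utilities `efetch` endpoint. The sketch below is not part of the original presentation; the helper names `efetch_url` and `fetch_record` are made up for illustration, and the fetch needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, rettype: str = "gb") -> str:
    """Build an efetch URL for a nucleotide record ('gb' = GenBank flat file)."""
    params = {"db": "nuccore", "id": accession,
              "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def fetch_record(accession: str, rettype: str = "gb") -> str:
    """Download the record as text (requires network access)."""
    with urlopen(efetch_url(accession, rettype)) as resp:
        return resp.read().decode("utf-8")


# Example (not run here): text = fetch_record("X01714")
```

Passing `rettype="fasta"` asks for the sequence alone rather than the full flat file.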

  12. GenBank fields • LOCUS • size of sequence (in base pairs) • nature of molecule (e.g. DNA or RNA) • topology (linear or circular) • DEFINITION: brief description of gene • ACCESSION: unique identifier for this (and some other) databases • VERSION: accession number plus a version suffix that increases each time the sequence is revised; older records also list a GI number here
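
To make the LOCUS line concrete, here is a small parsing sketch. It assumes the common whitespace-separated layout (name, length, "bp", molecule type, topology, division, date); older or unusual records may not follow it. The sample line in the comment is invented and is not the real X01714 record.

```python
def parse_locus(line: str) -> dict:
    """Split a GenBank LOCUS line into named fields (assumes 8 tokens)."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS line layout")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),   # size in base pairs
        "molecule": tokens[4],      # e.g. DNA or mRNA
        "topology": tokens[5],      # linear or circular
        "division": tokens[6],      # GenBank division code
        "date": tokens[7],
    }


# Hypothetical example line, for illustration only:
# parse_locus("LOCUS       EXAMPLE01    1234 bp    DNA     circular BCT 01-JAN-2000")
```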

  13. GenBank fields • KEYWORDS: list of terms related to entry; can be used for keyword searching for related data • SOURCE: common name of relevant organism • ORGANISM: complete id, with taxonomic classification • note that ORGANISM is indented under SOURCE; this indicates that ORGANISM is a subordinate term, or subsection, of SOURCE
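
The indentation rule (ORGANISM nested under SOURCE) is what lets a program recover the hierarchy. The sketch below assumes top-level keywords start in column 1, sub-keywords are indented by a few spaces, and continuation lines are indented by 12 or more. The function name and the sample organism in the comment are hypothetical. Repeated keywords such as REFERENCE would overwrite each other in this simple version.

```python
def parse_header(lines):
    """Group header lines as {KEYWORD: {"value": str, "sub": {SUBKEY: str}}}."""
    record = {}
    parent = None   # most recent top-level keyword
    last = None     # (dict, key) that a continuation line extends
    for line in lines:
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        text = line.strip()
        if indent >= 12:                       # continuation of previous field
            if last is not None:
                d, k = last
                d[k] = (d[k] + " " + text).strip()
            continue
        key, _, value = text.partition(" ")
        value = value.strip()
        if indent == 0:                        # top-level keyword, e.g. SOURCE
            parent = key
            record[key] = {"value": value, "sub": {}}
            last = (record[key], "value")
        else:                                  # subordinate keyword, e.g. ORGANISM
            if parent is None:
                raise ValueError("sub-keyword before any top-level keyword")
            record[parent]["sub"][key] = value
            last = (record[parent]["sub"], key)
    return record


# Hypothetical input, for illustration only:
# parse_header(["SOURCE      Examplus hypotheticus",
#               "  ORGANISM  Examplus hypotheticus",
#               "            Bacteria; Exampleota."])
```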

  14. GenBank fields • REFERENCE: credits author(s) who initially determined the sequence; includes subsections: • AUTHORS • TITLE • JOURNAL • PUBMED • COMMENT: free-formatted text that doesn’t fit in another category

  15. GenBank fields • FEATURES: table describing gene regions and associated biological properties • source: origin of specific regions of sequence; useful for distinguishing cloning vectors from host sequences • promoter: precise coordinates of promoter element in the sequence; may be more than one of these • misc_feature: in this example, indicates (putative) location of transcription start (mRNA synthesis) • RBS (ribosome binding site): location of last upstream element • CDS (coding sequence): describes the ORF

  16. GenBank fields: FEATURES: CDS • gives coordinates from initial nucleotide (ATG) to last nucleotide of stop codon (TAA) • several lines follow, listing protein products, reading frame to use, genetic code to apply and several IDs for the protein sequence • /translation section gives computer translation of sequence into amino acid sequence
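
The /translation value can be reproduced by cutting the CDS coordinates out of the sequence and translating codon by codon. The sketch below assumes the standard genetic code, a forward-strand CDS in reading frame 1, and the 1-based inclusive coordinates that GenBank uses. The function name and the toy sequence in the comment are illustrative only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_cds(seq: str, start: int, end: int) -> str:
    """Translate positions start..end (1-based, inclusive) into protein."""
    cds = seq[start - 1:end].upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = "".join(CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
                      for i in range(0, len(cds), 3))
    # Drop the trailing '*' that stands for the stop codon.
    return protein[:-1] if protein.endswith("*") else protein


# Toy example: translate_cds("GGATGAAACCCGGGTAAGG", 3, 17) covers ATG..TAA
```

Records that specify a different genetic code or reading frame would need a different table or an offset, which this sketch does not handle.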

  17. Last Section: sequence itself • This is the most important section in terms of analysis using other tools • Can isolate just this section and save the file, as follows: • Choose FASTA from the Display pull-down menu (top of page) • Choose Text in the Send To pull-down menu • Use File/Save As to save the file • use “Text” as file type • give the file a name that you’ll know to associate with this particular sequence
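
Once the FASTA file is saved, most tools read it the same way: a header line starting with ">" followed by one or more sequence lines. A minimal reader might look like the following; the function name is an assumption, and the file name in the usage comment is a placeholder for whatever name you chose when saving.

```python
def read_fasta(text: str):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:          # finish the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:            # sequence line of current record
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Usage (placeholder file name):
# with open("my_sequence.fasta") as handle:
#     for header, seq in read_fasta(handle.read()):
#         print(header, len(seq))
```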

  18. Example 2: eukaryotic mRNA • Can obtain this example by searching Nucleotide database for U90223 • Similar to prokaryote example, because the record is an mRNA, which codes directly for a protein, rather than genomic DNA with introns • Notes on example: • KEYWORDS field is empty: this is an example of an incomplete annotation • remember, you’re looking at a primary database! • FEATURES field contains some new terms: • sig_peptide: location of mitochondrial targeting sequence • mat_peptide: exact boundaries of mature peptide

  19. Example 3: Eukaryotic gene • Can obtain this record by searching Nucleotide for AF018430 • General information: • LOCUS: same info as previous examples – note the locus name is different from the accession number this time • DEFINITION: specifies exon; remember, protein-coding regions in eukaryotes are not contiguous as in prokaryotes • SEGMENT: indicates this is the second of 4; you’d need all 4 to reconstruct the mRNA that codes for the protein

  20. Eukaryotic gene: FEATURES section • source subsection includes a /map section: • indicates chromosome (15) • arm (q means long arm) • cytogenetic band (q21.1)

  21. Eukaryotic gene: FEATURES section • gene subsection: describes how to reconstruct the mRNAs found in this and separate entries: • the strings that begin “AF” refer to the GenBank entries (remember, this one was AF018430), and the numbers represent the nucleotide positions from the entries • if a set of numbers (example: 1..1177) is NOT preceded by an entry indicator, it’s from the current entry • The < and > signs mark a partial feature: the feature extends beyond the given start (<) or end (>) position, so the true boundary lies outside the listed coordinates
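
The reading rules on this slide translate directly into a small parser. The sketch below handles only simple forward-strand spans inside an optional join(...); it does not cover complement(), single-base positions, or other location operators. The accessions in the comment (XX000001.1, XX000003.1) are placeholders, not the real neighbouring entries, and the function name is made up.

```python
import re

# entry prefix (optional), '<' flag, start, '..', '>' flag, end
SEGMENT = re.compile(
    r"(?:(?P<entry>[A-Za-z0-9_.]+):)?(?P<lt><)?(?P<start>\d+)"
    r"\.\.(?P<gt>>)?(?P<end>\d+)"
)


def parse_join(location: str, current_entry: str) -> list:
    """Split a join(...) location into per-entry coordinate segments."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[5:-1]
    segments = []
    for part in inner.split(","):
        m = SEGMENT.fullmatch(part.strip())
        if m is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        segments.append({
            # No prefix means the span belongs to the current entry.
            "entry": m.group("entry") or current_entry,
            "start": int(m.group("start")),
            "end": int(m.group("end")),
            "starts_before": m.group("lt") is not None,  # '<' present
            "ends_after": m.group("gt") is not None,     # '>' present
        })
    return segments


# Hypothetical example:
# parse_join("join(XX000001.1:<1..200,1..1177,XX000003.1:5..>300)", "AF018430")
```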

  22. Eukaryotic gene: FEATURES section • mRNA section: can be read in a similar manner to the gene section • note that there are two mRNA sections (each followed by a CDS section) • first section describes the mRNA for the mitochondrial form of the protein • second section describes the mRNA for the nuclear form • exon section: indicates position of exon(s) in sequence

  23. Retrieving GenBank entries without accession numbers • Search Nucleotide for specific product you’re interested in; for example: human[organism] AND dUTPase[Protein name] • this search yields several entries; can click the Links link to the right of one of these (AF018432) and choose Related Sequences from the pull-down that appears • retrieves several more entries, some DNA and some mRNA • terms used in the titles of these entries can give us additional search criteria: human[organism] AND “dUTP pyrophosphatase”[Title] • yields somewhat different set of entries
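
The same fielded queries can be sent from a script through NCBI's E-utilities `esearch` endpoint instead of the search box. This sketch only builds the URL; it is an addition for illustration, not part of the original slides. The helper name `esearch_url` and the default `retmax` value are assumptions.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term: str, retmax: int = 20) -> str:
    """Build an esearch URL for the Nucleotide database."""
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    # urlencode escapes the brackets and spaces in the fielded query.
    return f"{ESEARCH}?{urlencode(params)}"


# Example query from the slide:
# esearch_url("human[organism] AND dUTPase[Protein name]")
```

The response lists matching record IDs, which can then be retrieved one by one.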
