
Bioinformatics Part 2: The Primary Databases


evette

Presentation Transcript


  1. Bioinformatics Part 2: The Primary Databases • The nucleic acid and protein databases • Database content: what’s in the databases and how are the records structured? • Searching of and retrieval from the databases

  2. The Primary Nucleic Acid and Protein Databases • Nucleic acids • GenBank, EMBL, DDBJ • Proteins • SWISS-PROT/TrEMBL, PIR and others • Search and retrieval tools • Entrez, Sequence Retrieval System (SRS)

  3. Using the Primary Databases [Figure: diagram labelled “GenBank, etc.”, “SWISS-PROT, etc.”, “Entrez”, “SRS”, “Query” and “Result”]

  4. A Nucleic Acid Database: GenBank • Nucleotide sequences of genes and parts of genes: highly annotated • Sequence tagged sites (STSs) • Expressed sequence tags (ESTs) • Genome survey sequences (GSSs) • High throughput genomic sequences (HTGs) • Nucleotide sequences that form part of a patent

  5. Sequence Tagged Sites (STSs) • A short DNA sequence, up to 500 nucleotides in length • Unique in the genome • Location in the genome is known • Can be detected using the polymerase chain reaction (PCR) • Act as “beacons” or “landmarks” for genome mapping studies
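As a rough illustration of the "unique in the genome" and length criteria, the sketch below counts how often a candidate sequence occurs on either strand of a toy genome string. The function names and the toy sequences are invented for this example. Real STSs are validated by PCR and mapping data, not by string matching.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def count_occurrences(genome: str, candidate: str) -> int:
    """Count (possibly overlapping) matches of candidate on both strands."""
    genome = genome.upper()
    total = 0
    # A set avoids double counting when the candidate equals its own
    # reverse complement.
    for pattern in {candidate.upper(), reverse_complement(candidate)}:
        start = genome.find(pattern)
        while start != -1:
            total += 1
            start = genome.find(pattern, start + 1)
    return total


def looks_like_sts(genome: str, candidate: str, max_length: int = 500) -> bool:
    """True if candidate is at most max_length long and occurs exactly once."""
    return (
        len(candidate) <= max_length
        and count_occurrences(genome, candidate) == 1
    )


# Toy data, made up for illustration only.
TOY_GENOME = "AAACCCGGGTTTACGACG"
print(looks_like_sts(TOY_GENOME, "ACGACG"))  # occurs once
print(looks_like_sts(TOY_GENOME, "ACG"))     # occurs more than once
```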

  6. Expressed Sequence Tags (ESTs) • Similar to STSs, but obtained from cDNA instead of genomic DNA • Unlike STSs, ESTs may not be unique, as some genes have very similar or even identical sequences • An EST may also be an STS • ESTs provide an indication of gene density • January 2002: almost 4 million ESTs identified for the human genome

  7. Genome Survey Sequences (GSSs) • Random “single pass read” genome survey sequences • Cosmid/BAC/YAC end sequences • Exon-trapped genomic sequences (exon trapping is a technique that removes introns from a cloned segment of genomic DNA) • Alu PCR sequences (Alu PCR amplifies genomic DNA between Alu repeats, which are short, dispersed elements found in the human genome)

  8. High Throughput Genomic Sequences (HTGs) • An “unfinished” HTG would contain a few contigs (each at least 2 kbp in length), with gaps, possibly unordered, and derived from a single genomic DNA clone • A “finished” HTG would be the assembled sequence with no gaps, and with annotations [Figure: diagram with panels labelled “unfinished” and “finished”]

  9. What’s in a “Full” GenBank Record? • LOCUS, DEFINITION, ACCESSION, KEYWORDS • SOURCE • REFERENCE - including publication details if available • COMMENT • FEATURES - exons, introns, location of coding sequence (CDS), translation of CDS, etc. • BASE COUNT • ORIGIN - the nucleic acid sequence
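The record layout listed above can be made concrete with a small parser. This is a minimal sketch, assuming the usual flat-file convention that top-level keywords start in the first column, continuation lines are indented, and `//` ends the record. The toy record, its accession `TOY0001`, and the function names are invented for illustration; for real work an established parser is a safer choice than this hand-rolled one.

```python
# Top-level keywords named on the slide; anything else falls back to
# the first word of the line.
KNOWN_KEYWORDS = (
    "LOCUS", "DEFINITION", "ACCESSION", "KEYWORDS", "SOURCE",
    "REFERENCE", "COMMENT", "FEATURES", "BASE COUNT", "ORIGIN",
)


def match_keyword(line: str) -> str:
    """Return the top-level keyword that starts this (non-indented) line."""
    for keyword in KNOWN_KEYWORDS:
        if line.startswith(keyword):
            return keyword
    return line.split()[0]


def split_genbank_record(text: str) -> dict[str, str]:
    """Group the lines of one flat-file record under their keyword."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if line and not line[0].isspace():
            current = match_keyword(line)
            sections.setdefault(current, []).append(line[len(current):].strip())
        elif current is not None and line.strip():
            sections[current].append(line.strip())  # continuation line
    return {key: "\n".join(p for p in parts if p) for key, parts in sections.items()}


def extract_sequence(sections: dict[str, str]) -> str:
    """Drop position numbers and spaces from the ORIGIN block."""
    return "".join(ch for ch in sections.get("ORIGIN", "") if ch.isalpha())


# Toy record, made up for illustration only.
TOY_RECORD = """\
LOCUS       TOY0001      12 bp    DNA
DEFINITION  Toy record for illustration only.
ACCESSION   TOY0001
KEYWORDS    .
SOURCE      synthetic construct
FEATURES             Location/Qualifiers
     CDS             1..12
BASE COUNT        3 a      3 c      3 g      3 t
ORIGIN
        1 acgtacgtac gt
//
"""

sections = split_genbank_record(TOY_RECORD)
print(sections["DEFINITION"])
print(extract_sequence(sections))
```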

  10. The Entrez Nucleotides Search and Retrieval System • By default, all major nucleotide databases (GenBank, EMBL, etc.) are searched • Allows limits to be placed on the search (e.g., to a particular field such as keyword, organism, etc.) • Allows subsets of the databases to be searched • Accepts Boolean operators (AND, OR, NOT) • Previous searches can be combined • Results can be saved to a clipboard
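Field limits and Boolean operators can also be combined in a query string built in code. The sketch below only constructs the string and a request URL; it does not contact any server. It assumes NCBI's E-utilities `esearch.fcgi` endpoint and the `[Organism]` and `[Title]` field tags, none of which appear on the slides, and the helper names are invented for this example.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities search endpoint (not mentioned on the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term: str, name: str) -> str:
    """Limit a term to one field, e.g. "Homo sapiens"[Organism]."""
    return f'"{term}"[{name}]'


def combine(operator: str, *clauses: str) -> str:
    """Join clauses with a Boolean operator and wrap them in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(clauses) + ")"


def esearch_url(query: str, db: str = "nucleotide", retmax: int = 20) -> str:
    """Build a search URL; the caller decides whether to fetch it."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


# Human sequences whose title mentions insulin or glucagon.
query = combine(
    "AND",
    field("Homo sapiens", "Organism"),
    combine("OR", field("insulin", "Title"), field("glucagon", "Title")),
)
print(query)
print(esearch_url(query))
```

Nesting `combine` calls mirrors the way previous searches can be combined in the web interface: each inner clause is a complete query in its own right.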

  11. http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi

  12. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide

  13. http://www.ncbi.nlm.nih.gov/genome/guide/human/

  14. The Sequence Retrieval System (SRS) • Performs a similar function to Entrez • SRS can search across several databases simultaneously • Databases to be searched can be defined by the user • Uses a single interface to design the query

  15. http://srs.ebi.ac.uk/

  16. A Protein Database: SWISS-PROT/TrEMBL • SWISS-PROT was created in 1986 to provide highly curated, richly annotated records of protein sequences • TrEMBL (containing translations of coding sequences in EMBL) was created in 1996 to provide a supplement to SWISS-PROT • TrEMBL provides less detailed information than SWISS-PROT but allows access to recent sequences • Advanced searching is available through the Sequence Retrieval System (SRS)

  17. What’s in a SWISS-PROT Record? • Identification (ID), accession (AC), dates of entry and modification (DT), description (DE), gene name (GN), organism details (OS, OC, etc.) • Reference details (RN, RA, etc.) • Comments (CC) • Database cross-references (DR) • Keywords (KW) • Feature table (FT) • Amino acid sequence (SQ)
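The two-letter line codes above lend themselves to a simple line-by-line reader. The following is a minimal sketch, assuming each line starts with its code followed by the content from the sixth column, that sequence lines after `SQ` are indented, and that `//` ends the entry. The toy entry, its accession `T00000`, and the function name are invented for illustration.

```python
def parse_swissprot_record(text: str) -> tuple[dict[str, list[str]], str]:
    """Collect lines by two-letter code and return (fields, sequence)."""
    fields: dict[str, list[str]] = {}
    sequence_parts: list[str] = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, content = line[:2], line[5:].strip()
        if code == "SQ":
            in_sequence = True
            fields.setdefault("SQ", []).append(content)  # sequence header
        elif in_sequence and not code.strip():
            # Indented residue lines; drop the spacing between blocks.
            sequence_parts.append(content.replace(" ", ""))
        elif code.strip():
            fields.setdefault(code, []).append(content)
    return fields, "".join(sequence_parts)


# Toy entry, made up for illustration only.
TOY_ENTRY = """\
ID   TOY_EXAMPLE    STANDARD;      PRT;    12 AA.
AC   T00000;
DE   Toy protein for illustration only.
OS   Synthetic construct.
KW   Hypothetical.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

fields, sequence = parse_swissprot_record(TOY_ENTRY)
print(fields["DE"])
print(sequence, len(sequence))
```

Keeping every code as a list preserves repeated lines (for example several `DR` or `RA` lines) in their original order.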

  18. http://www.expasy.org/sprot/
