Biology 4900
Sponsored Links
This presentation is the property of its rightful owner.
1 / 63

Biology 4900 PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Biology 4900. Biocomputing. Chapter 2. Molecular Databases and Data Analysis. Literature Databases. Online databases available at CSU Galileo JSTOR Online databases at other sites

Download Presentation

Biology 4900

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Biology 4900

Biology 4900


Chapter 2

Chapter 2

Molecular Databases and Data Analysis

Literature databases

Literature Databases

  • Online databases available at CSU

    • Galileo

    • JSTOR

  • Online databases at other sites

    • PubMed. If you find a useful article, you can check PubMed Central to see if it is available online for free.

  • Where to get articles

    • PubMed Central

    • GIL

    • Interlibrary loan

Sources of molecular data

Sources of Molecular Data














*Expressed Sequence Tags

Molecular databases

Molecular Databases

  • Primary Database

    • Archival - sequences submitted directly from experimental sequencing results

      • Very little interpretation

      • Anyone can submit; accuracy not checked

      • Examples

        • Nucleic Acid: EMBL, DDJB, GenBANK

        • Protein: Swiss-Prot, PIR, PDB

  • Secondary Databases

    • Curated– sequences are validated/checked and may be annotated

      • Refseq (nucleic acids and proteins, but limited to certain organisms)

      • TrEMBL, GenPept, Uniprot

Nucleic acid databases

Nucleic Acid Databases

  • Contain:

    • Nucleic acid sequences

      • Chain termination method (Sanger sequencing)

        • Used for sequences 100-1000 bp

      • Whole Genome Shotgun (WGS) Sequencing

        • Used for sequences >1000 bp

        • DNA chopped into little chunks

        • Sequenced using chain termination method (reads)

        • Numerous, overlapping reads are collected and assembled into sequence (computational methods)

    • Annotations for each sequence

      • Putative identification of open reading frames (ORFs = parts of gene that encode protein) in sequence

      • Putative intron(excised)/exon(retained) locations

      • Authors, dates, publication, etc.

Biology 4900

International Nucleotide Sequence Database Collaboration(Public nucleotide and protein sequence databases)

Name: GenBank

Location: National Institutes of Health, National Center for Biotechnology Information


Daily Info sharing

Daily Info sharing



Daily Info sharing

Name: European Molecular Biology Laboratory (EMBL)

Location: European Bioinformatics Institute (EBI)

Name: DNA Database of Japan (DDBJ)

Location: National Institute of Genetics, Mishima



  • As of April 2011, There were approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions.

  • Read the following paper:

  • Home Page:

Homo sapiens14.9 billion bases

Mus musculus8.9b

Rattus norvegicus6.5b

Bos taurus5.4b

Zea mays 5.0b

Sus scrofa4.8b

Danio rerio3.1b

Strongylocentrotus purpurata1.4b

Oryza sativa (japonica)1.2b

Nicotiana tabacum 1.2b

Most sequenced organisms

Genbank home page

GenBank Home Page

Ncbi resources

NCBI Resources

  • PubMed


  • OMIM

  • Taxonomy Browser

  • Structure

Ncbi key features pubmed

NCBI key features: PubMed

  • National Library of Medicine's search service

  • 21 million citations from MEDLINE & others (as of 2011)

  • Links to other online journals


  • Starting point for most research

Biology 4900

Literature Searches through PubMed

Biology 4900

Use the pull-down menu to access related resources such as Medical Subject Headings (MeSH)

Biology 4900

A “how to” pull-down menu links to tutorials

Biology 4900

Use “Advanced search” to limit by author, year, language, etc.

Biology 4900

PubMed search strategies

Try the tutorial

Use boolean queries (capitalize AND, OR, NOT)

lipocalin AND disease

Try using limits (see Advanced search)

There are links to find Entrez entries and external resources

Biology 4900

1 AND 2



lipocalin AND disease

(504 results)

1 OR 2



lipocalin OR disease

(2,500,000 results)

1 NOT 2



lipocalin NOT disease

(2,370 results)

Biology 4900

Save Searches, Save Results, Get Papers

Pubmed author search

PubMed Author Search

Scholar google search

Scholar Google Search


  • Includes references that may not be found in PubMed

Biology 4900

NCBI key features

  • A search from NCBI main page will search:

  • the scientific literature;

  • DNA and protein sequence databases;

  • 3D protein structure data;

  • population study data sets;

  • assemblies of complete genomes

  • String search

    • Search by author, date, keyword, publication, etc.

Classroom exercise:

Author searches

Paper searches

Protein searches

Ncbi key features blast

NCBI key features: BLAST

  • BLAST is…

  • Basic Local Alignment Search Tool

  • NCBI's sequence similarity search tool

  • supports analysis of DNA and protein databases


Biology 4900

NCBI key features: OMIM

  • Online Mendelian Inheritance in Man

  • Catalog of human genes and genetic disorders

Biology 4900

NCBI key features: Taxonomy Browser

  • Browser for the major divisions of living organisms

  • (archaea, bacteria, eukaryota, viruses)

  • Taxonomy information such as genetic codes

  • Molecular data on extinct organisms

  • Useful to find a protein or gene from a species

Biology 4900

NCBI key features: Structure

  • Molecular Modelling Database (MMDB)

  • biopolymer structures obtained from

  • the Protein Data Bank (PDB)

  • Cn3D (a 3D-structure viewer)

  • vector alignment search tool (VAST)

Biology 4900


  • A 3D-structure viewer

  • Must download (

  • Use to align structures identified as similar by VAST

Example researching beta globin

Example: Researching beta globin

  • Beta globin is protein, so it will be found in 3 different types of databases




Entrez Protein





GenBank dbGSS

GenBank dbHTGS

GenBank dbSTS

GenBank Entrez Gene

GenBank dbEST


Gene Expression Omnibus

*Because RNA is unstable, it can be transcribed into complementary DNA (cDNA)

Necessary yet annoying definitions

Necessary (yet annoying) Definitions

  • Sequence Tagged Site (STS): Small DNA fragments with both DNA sequence data and mapping data (genes assigned to chromosomes)

  • Expressed Sequence Tags (EST): Partial DNA sequence of a complementary (cDNA) clone

    • Typically these are randomly-selected cDNA clones sequenced on a single strand (300-800 bp)

    • Useful for identifying novel genes

    • Higher rate of error



  • Unique Gene (Unigene) Project to create gene-oriented clusters by partitioning ESTs into non-redundant sets


    • Ultimately there should be only 1 cluster per gene

    • Usually more than 1 due to errors

    • Types of errors

      • 2 or more clusters may represent different parts of the same gene

      • Sequence errors

      • Cloning artifacts (DNA transcribed during creation of cDNA that doesn’t correspond to authentic transcript)


Unigene Cluster



Genbank flatfile


  • A format for organizing genomic sequence data. Includes the following:

  • Sequence and annotations

  • Header

    • Locus name or accession number: unique to sequence description

    • Size: number of nucleotide bases or amino acid residues

    • Molecule: DNA, RNA, strandedness (ds, ss), and type of RNA or DNA

    • Genbank division code: 18 divisions (PRI = primate, PLN = plant, BAC = bacterial, etc.)

    • Date of last modification

  • Definition Line: brief description of sequence (e.g. source organism, protein/gene name, function)

  • Accession: unique identifier for a record

  • Version

    • May be more than one accession

    • Record modification (accession.1; accession.2)

    • GI: is specific to version; may be more than one

  • Keywords

  • Source: organism or clone description

  • Reference: publications that discuss data reported

  • Authors and Journal publication info

  • PubMed identifier: link to sequence record (abstract)

  • Features: vary (chromosomal info., coding info, protein id, % of each nucleotide)

  • Sequence Data

Jump to example

Biology 4900

What is an accession number?

An accession number is label that is used to identify a sequence. It is a (unique) string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

X02775GenBank genomic DNA sequence

NT_030059Genomic contig (overlapping DNA fragments)

Rs7079946dbSNP (single nucleotide polymorphism)

N91759.1An expressed sequence tag (1 of 170)

NM_006744RefSeq DNA sequence (from a transcript)

NP_007635RefSeq protein

AAC02945GenBank protein

Q28369SwissProt protein

1KT7Protein Data Bank structure record




Biology 4900

NCBI’s important RefSeq project: best representative sequences

RefSeq (accessible via the main page of NCBI)

provides an expertly curated accession number that

corresponds to the most stable, agreed-upon “reference”

version of a sequence.

RefSeq identifiers include the following formats:

Complete genomeNC_######

Complete chromosomeNC_######

Genomic contigNT_######

mRNA (DNA format)NM_###### e.g. NM_006744

ProteinNP_###### e.g. NP_006735

Unigene name search oncomodulin

UniGene Name Search: Oncomodulin

All results listed

Allows filtering

Unigene name search select human oncomodulin

UniGene Name Search: Select Human Oncomodulin

  • 4 Expressed Sequence Tags from 1 complementary DNA library

  • Identifies chromosome and map position on chromosome

  • Compares cluster transcripts with refseq proteins

Unigene name search select human oncomodulin1

UniGene Name Search: Select Human Oncomodulin

Click on link for menu of other links:

Conserved domains

Gene summary

Protein sequence

Clicking on Protein sequence link then takes you to predicted protein sequence file (NP_006179.2) 

Unigene name search select human oncomodulin2

UniGene Name Search: Select Human Oncomodulin





Once here, you can:

Open FASTA file


Identify and view conserved domains

See related proteins

Biology 4900

Access to sequences: Gene at NCBI

Gene is a great starting point: it collects

key information on each gene/protein from

major databases. It covers all major organisms.

Example: RefSeq provides a curated, optimal accession number for each DNA (NM_000518 for beta globin DNA corresponding to mRNA) or protein (NP_000509)

These references should be more reliable data

Gene name search oncomodulin

Gene Name Search: Oncomodulin

Returns list of gene entries for oncomodulin for different organisms

Click on a highlighted link to see details

Gene name search select human oncomodulin

Gene Name Search: Select Human Oncomodulin

Summary of all gene information, including mapping (when available). Note that this sequence has been validated as a RefSeq.

Scrolling down, you can find link to protein data through UniProt. 

Gene name search link to oncomodulin protein

Gene Name Search: Link to Oncomodulin Protein

Protein name search oncomodulin

Protein Name Search: Oncomodulin

Notice that I filtered this search so that results show only human oncomodulin

Biology 4900

You can change the display (as shown)…

Biology 4900

FASTA format:

versatile, compact with one header line

followed by a string of nucleotides or amino acids

in the single letter code

Biology 4900

Comparison of Gene to other resources

Gene: collects key information on each gene/protein from major databases. It covers all major organisms.

UniGene: Database with information on where in a body, when in development, and how abundantly a transcript is expressed

HomoloGene: Gathers information on sets of related proteins based on common genetic ancestry.

Homologene name search oncomodulin

Homologene Name Search: Oncomodulin

Provides list of homologous (related) genes

Homologene name search oncomodulin1

Homologene Name Search: Oncomodulin

Shows conserved domains of protein sequences. If you click on graphic, takes you to summary of domain/family information.

Expasy to access protein and dna sequences

ExPASy to access protein and DNA sequences

  • ExPASy (Expert Protein Analysis System) sequence retrieval system

  • Visit

  • Similar to Entrez for NCBI

Example: Search for calmodulin

Jump to Prosite

Biology 4900


a centralized protein database (

This is separate from NCBI, and interlinked.

Uniprot calmodulin

UniProt: Calmodulin

  • Search Results for bovine calmodulin (P62157)

Protein secondary structure pdbsum embl ebi

Protein Secondary Structure: PDBSum (EMBL-EBI)


  • Either enter PDB file or can load new/existing sequence

Biology 4900

ExPASy: vast proteomics resources (

Biology 4900

Genome Browsers

Genomic DNA is organized in chromosomes. Genome browsers display ideograms (pictures) of chromosomes, with user-selected “annotation tracks” that display many kinds of information.

The two most essential human genome browsers are at Ensembl and UCSC. We will focus on UCSC (but the two are equally important). The browser at NCBI is not commonly used.

Biology 4900

Ensembl genome browser (



Biology 4900


beta globin

Biology 4900

Ensembl output for

beta globin includes views of chromosome 11 (top), the region (middle), and a detailed view (bottom).

There are various horizontal annotation tracks.

Biology 4900

The UCSC Genome Browser

  • This browser’s focus is on humans and other eukaryotes

  • you can select which tracks to display (and how much information for each track)

  • tracks are based on data generated by the UCSC team and by the broad research community

  • you can create “custom tracks” of your own data! Just format a spreadsheet properly and upload it

  • The Table Browser is equally important as the more visual Genome Browser, and you can move between the two

Biology 4900

[1] Visit, click Genome Browser

[2] Choose organisms, enter query (beta globin), hit submit

Biology 4900

[4] On the UCSC Genome Browser:

--choose which tracks to display

Protein databases

Protein Databases

  • What do they contain?

    • Amino acid sequences

      • Primary sequence

        • Direct submissions - protein sequencing

        • SWISS-PROT, PIR

      • Secondary sequence

        • Translations - putative proteins resulting from modifying (i.e. intron splicing) nucleic acid sequence

        • GenPept, TrEMBL

      • Structure

        • Protein Data Bank

    • Annotations

      • Function, domains, etc.

Swiss prot


  • Created by Amos Bairoch in 1986 at the Department of Medical Biochemistry in Geneva

  • Maintained by the Swiss Institute of Bio-informatics (SIB) and funded by GeneBio

  • Few redundancies

  • Direct submission (from sequencing, not translation)


  • PIR (The Protein Information Resource) was created by M.O. Dayhoff in 1965

  • Maintained by many

  • In 2004, joined with other databases (Swiss-Prot and TrEMBL) to become part of the UniProt consortium

Protein data bank

Protein Data Bank

  • Archive of 3-D structural data of biological macromolecules

  • Based on experimental data

  • Managed by the Research Collaboratory for Structural Bioinformatics (RCSB)

    • Rutgers & UCSD

  • As of January 11, 2012 contained 78477 structures

  • ~ 5000 membrane proteins

Pdb source of protein sequence and structure data

PDB: Source of protein sequence and structure data

  • Login