Sajid Khan

Chapter 2. Databases

Bioinformatics databases

What is a database?


Database Nomenclature.


Data repositories, data marts, and data warehouses differ primarily in the diversity of data sources that contribute to their contents.


Database Models

Defines data organization (schema)

Relational

  • Entities and relationships stored in tables
  • Predefined schema
  • Examples: Oracle, DB2, MySQL, PostgreSQL

Object-oriented

  • Stores data as objects (i.e., structures with a predefined type)
  • Examples: Versant, Jasmine, Objectivity

Semi-structured

  • Schema dynamically defined within the data (self-describing)
  • Flexible description of data with complex relationships
  • Example: XML databases
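The contrast between a predefined schema and self-describing data can be sketched in a few lines of Python. This is only an illustration: the table names (gene, protein), their columns, and the small XML fragment are assumptions made up for the example, not part of any real database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational model: the schema is declared before any data is stored.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (
        gene_id  INTEGER PRIMARY KEY,
        symbol   TEXT NOT NULL,
        organism TEXT NOT NULL
    );
    CREATE TABLE protein (
        protein_id INTEGER PRIMARY KEY,
        gene_id    INTEGER NOT NULL REFERENCES gene(gene_id),
        name       TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO gene VALUES (1, 'Brca1', 'Mus musculus')")
conn.execute(
    "INSERT INTO protein VALUES "
    "(1, 1, 'Breast cancer type 1 susceptibility protein homolog')"
)

# The relationship between the two entities is recovered with a join.
rows = conn.execute(
    "SELECT g.symbol, g.organism, p.name "
    "FROM gene AS g JOIN protein AS p ON p.gene_id = g.gene_id"
).fetchall()

# Semi-structured model: the tags inside the data describe its structure.
element = ET.fromstring(
    "<gene symbol='Brca1'><organism>Mus musculus</organism></gene>"
)
self_described = {child.tag: child.text for child in element}
```

In the relational half, adding a new kind of attribute means altering the table definition; in the semi-structured half, a new tag can simply appear in the data.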

Bioinformatic Databases

Useful information

  • DNA sequences
  • Conserved DNA domains
  • Genomes
  • Gene expression (ESTs, microarrays)
  • Protein sequences
  • Protein 3D structure
  • Protein families
  • Mutations / polymorphisms / SNPs
  • Metabolic pathways
  • Chemical compounds (ligands)
  • Biomedical literature (journal papers, online books…)


Classification schemes

  • Database design – relational, object-oriented…
  • Data type – DNA, RNA, EST, protein…
  • Organism – bacteria, virus, human…
  • Accessibility – public, academic, commercial
  • Data source – primary, derived
  • Data entry – manually curated, computationally derived
  • Focus – sequence-oriented, gene-oriented

Resulting in many bioinformatics databases…

Bioinformatics Databases Issues

  • Naming
      • Multiple names for same chemical
      • Arising from multiple biological disciplines, conventions
  • Redundancy
      • Multiple entries for same DNA / protein sequence
      • Arising from multiple experiments & biological disciplines
      • Example – redundant GenBank entries for E. coli dUTPase: 4 separate publications (X01714, V01578, L10328, AE000441)
  • Data annotation & formats
      • Multiple data for single gene: sequence, location, expression, structure, function…
      • Resulting in multiple data annotations & formats
  • Data integration
      • Combining data from multiple bioinformatic databases
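
The redundancy issue can be illustrated with a minimal sketch that groups entries sharing an identical sequence. The entry labels and the short sequences below are invented for the example; they are not real GenBank records. Real redundancy detection also has to cope with partial overlaps and reverse complements, which this sketch ignores.

```python
from collections import defaultdict


def group_redundant(records):
    """Return lists of entry labels whose sequences are identical.

    `records` is an iterable of (label, sequence) pairs; the comparison
    ignores letter case only.
    """
    by_sequence = defaultdict(list)
    for label, sequence in records:
        by_sequence[sequence.upper()].append(label)
    # Keep only sequences that occur under more than one label.
    return [sorted(labels) for labels in by_sequence.values() if len(labels) > 1]


# Hypothetical entries: two submissions of the same sequence, one distinct.
records = [
    ("ENTRY_A", "ATGAAAAAAATCGAC"),
    ("ENTRY_B", "atgaaaaaaatcgac"),
    ("ENTRY_C", "ATGGCTAGCTAA"),
]
redundant = group_redundant(records)
```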

Major Bioinformatics Databases

  • DNA sequences: GenBank, RefSeq, UniGene
  • Protein sequences: Swiss-Prot, PIR-PSD, GenPept, TrEMBL, NR, RefSeq
  • Protein structure: Protein Data Bank (PDB)
  • Gene expression: Gene Expression Omnibus (GEO)
  • Biomedical publications: PubMed / MedLine

Derived databases

  • Compiled from data in primary databases
  • Manually curated (human selection & correction)
      • Advantages – high quality
      • Disadvantages – high expense, low volume
      • Examples: Swiss-Prot, PIR-PSD, RefSeq
  • Computational derivation (automatically generated)
      • Advantages – inexpensive, up-to-date
      • Disadvantages – lower quality
      • Examples: GenPept, TrEMBL, UniGene, COGs

Primary databases

  • Original submissions by researchers
  • Staff organizes information only
  • Generally sequence oriented
  • Examples: GenBank, PDB


Data Management / Databases Connection

Data Management. In this data-management scenario for a pharmacogenomic laboratory, data of various types are acquired from a variety of sources, incorporated into the data warehouse, used by a variety of applications, and archived for future use. Data created locally may be published electronically, serve as the basis for a paper publication, and may be used in a variety of applications, from drug discovery to genetic…



Pharmacogenomics and Aggression

Typical Electronic Medical Record (EMR) Contents.

The EMR contains both objective signs, such as physical examination findings, as well as subjective patient symptoms, including chief complaint and review of systems.


To illustrate the data-management issues associated with a biotech research effort that depends on multiple, disparate systems and accompanying databases, assume that the laboratory depicted focuses on understanding the genetic basis for aggression, with a goal of creating new, more effective medications to control the behavior.


RNA-Protein Codon Transcription Wheel.

The 64 possible codons represent the 20 common amino acids, as well as one start (ATG, which also codes for methionine) and three stop (TAG, TAA and TGA) markers. Redundancies normally occur in the last nucleotide of the three-letter codon.
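
The counts in the caption can be checked with a short sketch that builds the standard genetic code from its usual compact listing (bases in T, C, A, G order, DNA alphabet). The variable names are chosen for this example only.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(codons, AMINO_ACIDS))

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

# Two-base prefixes for which the third nucleotide does not matter at all.
fourfold_prefixes = [
    a + b
    for a in BASES
    for b in BASES
    if len({CODON_TABLE[a + b + c] for c in BASES}) == 1
]


def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For eight of the sixteen two-base prefixes, all four choices of the third nucleotide give the same amino acid, which is the redundancy the caption refers to.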


Integration of Clinical Data

To create an EMR capable of supporting efficient data mining, a data dictionary is used to impose a standard format and vocabulary on data stored in the clinical data mart.
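
A data dictionary of this kind can be sketched as a lookup table that maps the field names and values used by different source systems onto one standard vocabulary. Everything in the example (field names, aliases, allowed values) is hypothetical and only meant to show the idea.

```python
# Hypothetical data dictionary: for each standard field, the names it may
# appear under in source systems and, where applicable, a controlled vocabulary.
DATA_DICTIONARY = {
    "sex": {
        "aliases": {"sex", "gender"},
        "values": {"m": "male", "male": "male", "f": "female", "female": "female"},
    },
    "chief_complaint": {
        "aliases": {"chief_complaint", "chief complaint", "cc"},
        "values": None,  # free text, only whitespace is normalized
    },
}


def standardize(raw_record):
    """Rename fields and map values according to DATA_DICTIONARY."""
    clean = {}
    for raw_field, raw_value in raw_record.items():
        key = raw_field.strip().lower()
        for standard_name, rule in DATA_DICTIONARY.items():
            if key in rule["aliases"]:
                value = str(raw_value).strip()
                if rule["values"] is not None:
                    # A KeyError here flags a value outside the vocabulary.
                    value = rule["values"][value.lower()]
                clean[standard_name] = value
                break
        else:
            raise KeyError(f"field not in data dictionary: {raw_field!r}")
    return clean


standardized = standardize({"Gender": "F", "CC": " headache "})
```

Rejecting unknown fields and values, rather than passing them through, is one possible design choice; it keeps the data mart consistent at the cost of more manual review.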

Integration of Bioinformatics Data

Like clinical data, bioinformatics data from a variety of sources and in numerous formats are combined in a data mart to enhance data management.


From Data to Knowledge

Search / retrieval tool for multiple linked databases

  • Papers – biomedical literature (PubMed)
  • Nucleotide – sequence database (GenBank)
  • Protein – sequence database
  • Structure – 3D macromolecular structures
  • Genome – complete genome assemblies
  • OMIM – Online Mendelian Inheritance in Man
  • Taxonomy – organisms in GenBank
  • ProbeSet – gene expression and microarray datasets

Common identifiers for bioinformatic data

  • Locus name
  • Accession numbers
  • GenInfo ID
  • PubMed ID

Locus name

  • Original identifiers of GenBank records
  • LOCUS line in GenBank entries
  • First 3 letters of organism followed by code for gene
  • Example: HUMBB for human β-globin region
  • Unmaintainable due to growth of data
  • Homologous genes not named the same

Data is stored / presented in a variety of formats

  • GenBank
  • SwissProt
  • ASN.1



Bioinformatic Database Formats

SwissProt format

  • Defined by SWISS-PROT database
  • Includes annotation, other info
  • Example

    AC   P48754; Q60957; Q60983;
    DT   01-FEB-1996 (Rel. 33, Created)
    DT   01-NOV-1997 (Rel. 35, Last sequence update)
    DT   16-OCT-2001 (Rel. 40, Last annotation update)
    DE   Breast cancer type 1 susceptibility protein homolog.
    OS   Mus musculus (Mouse).
    OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    OC   Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
    OX   NCBI_TaxID=10090;
    RN   [1]
    RX   MEDLINE=96177659; PubMed=8634697;
    RA   Abel K.J., Xy J., Yin G.Y., Lyons R.H., Meisler M.H., Weber B.L.;
    RT   "Mouse Brca1: localization sequence analysis and identification of
    RT   evolutionarily conserved domains.";
    RL   Hum. Mol. Genet. 4:2265-2273(1995)…
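
Because every line of this format starts with a two-letter code, a few lines of Python are enough to group the content by code. The sketch below is a simplification that assumes each line is "code, spaces, content"; it does not interpret the individual fields beyond splitting the accession list, and the function name is chosen for the example.

```python
from collections import defaultdict


def parse_swissprot_lines(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        code, _, content = line.partition(" ")
        fields[code].append(content.strip())
    return dict(fields)


# A few lines taken from the example entry above.
sample = """\
AC   P48754; Q60957; Q60983;
DT   01-FEB-1996 (Rel. 33, Created)
DT   01-NOV-1997 (Rel. 35, Last sequence update)
DE   Breast cancer type 1 susceptibility protein homolog.
OS   Mus musculus (Mouse).
OX   NCBI_TaxID=10090;
"""
entry = parse_swissprot_lines(sample)
accessions = [a.strip() for a in entry["AC"][0].rstrip(";").split(";")]
```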

FASTA format

  • Used by FASTA tools
  • Comment line followed by sequence data
  • No annotation, just sequence
  • Example

    >gi|1040960|gb|U35641.1|MMU35641 Mus musculus Brca1 mRNA…

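A FASTA file can be read with a very small parser, sketched below. The header is the one from the example; the sequence lines are a made-up placeholder, since the real sequence is not shown on the slide. The header-splitting helper assumes the gi|number|gb|accession|locus layout seen in this one example, not every FASTA header.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_header(header):
    """Split a 'gi|...|gb|...|locus description' header into named parts."""
    identifiers, _, description = header.partition(" ")
    parts = identifiers.split("|")
    return {
        "gi": parts[1],
        "accession_version": parts[3],
        "locus": parts[4],
        "description": description,
    }


# Placeholder sequence, NOT the real Brca1 mRNA.
fasta_text = """>gi|1040960|gb|U35641.1|MMU35641 Mus musculus Brca1 mRNA
ACGTACGTAC
GGTTAA
"""
fasta_records = read_fasta(fasta_text)
header_fields = split_ncbi_header(fasta_records[0][0])
```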





GenBank format

  • Flat file format used by GenBank
  • Annotation, author, version, etc…
  • Example (just the top)

    LOCUS       MMU35641     5538 bp    mRNA    linear   ROD 18-OCT-1996
    DEFINITION  Mus musculus Brca1 mRNA, complete cds.
    VERSION     U35641.1  GI:1040960
    SOURCE      house mouse strain=C57Bl/6.
      ORGANISM  Mus musculus
                Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
    REFERENCE   1  (bases 1 to 5538)
      AUTHORS   Sharan,S.K., Wims,M. and Bradley,A.
      TITLE     Murine Brca1: sequence and significance for human missense
      JOURNAL   Hum. Mol. Genet. 4 (12), 2275-2278 (1995)
      MEDLINE   96177660
      PUBMED    8634698
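
The LOCUS line packs several of the identifiers and classifications discussed earlier into one whitespace-separated line. A minimal sketch for pulling them apart follows; it assumes the eight-token layout of the example line and would need extra cases for records that omit or add fields. The dictionary keys are names chosen for this example.

```python
def parse_locus_line(line):
    """Split a GenBank LOCUS line laid out like the example above."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS line layout: {line!r}")
    return {
        "locus_name": tokens[1],
        "length_bp": int(tokens[2]),
        "molecule_type": tokens[4],
        "topology": tokens[5],
        "division": tokens[6],
        "date": tokens[7],
    }


locus = parse_locus_line(
    "LOCUS       MMU35641     5538 bp    mRNA    linear   ROD 18-OCT-1996"
)
```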


ASN.1 (Abstract Syntax Notation One)

  • International standard
  • Semi-structured format
  • Base format for NCBI data
  • Example

    Seq-entry ::= set {
      level 1 ,
      class nuc-prot ,
      descr {
        title "Mus musculus Brca1 mRNA, and translated products" ,
        source {
          org {
            taxname "Mus musculus" ,
            db {
              db "taxon" ,
              id 10090 } } ,
            orgname {
              binomial {
                genus "Mus" ,
                species "musculus" } , …

eXtensible Markup Language (XML)

  • Open standard for semi-structured data, uses tags like HTML
  • Document split into content (XML), style (XSL), linking (XLL)
  • Example

    <?xml version="1.0"?>
    <GBSeq_strandedness value="not-set">0</GBSeq_strandedness>
    <GBSeq_moltype value="mrna">5</GBSeq_moltype>
    <GBSeq_topology value="linear">1</GBSeq_topology>
    <GBSeq_definition>Mus musculus Brca1 mRNA, complete cds</GBSeq_definition>


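Because the tags describe the data, a generic XML parser can read such a record without format-specific code. The sketch below uses Python's standard xml.etree.ElementTree module. The enclosing GBSeq root element is an assumption added here so that the fragment is well-formed; the slide shows only the inner elements.

```python
import xml.etree.ElementTree as ET

# The inner elements come from the example; the <GBSeq> wrapper is assumed.
xml_text = """<?xml version="1.0"?>
<GBSeq>
  <GBSeq_strandedness value="not-set">0</GBSeq_strandedness>
  <GBSeq_moltype value="mrna">5</GBSeq_moltype>
  <GBSeq_topology value="linear">1</GBSeq_topology>
  <GBSeq_definition>Mus musculus Brca1 mRNA, complete cds</GBSeq_definition>
</GBSeq>
"""

root = ET.fromstring(xml_text)

# Map each tag to its 'value' attribute (if any) and its text content.
xml_record = {child.tag: (child.get("value"), child.text) for child in root}
definition = root.findtext("GBSeq_definition")
```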

Format conversion

  • Frequently tools handle only one of the data formats
  • Use software to transform between formats
      • ReadSeq, SeqIO

Perl (Practical Extraction and Report Language)

  • Portable C-like interpreted scripting language
  • Powerful pattern matching, string processing operations
  • Frequently used to extract / process bioinformatic data

BioPerl

  • Collection of Perl classes designed for bioinformatic tools
  • Sequence analysis, alignment, format conversion, I/O, automating bioinformatic analyses, parsing results, creating GUIs, managing persistent storage in an RDBMS…
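
What a format conversion involves can be sketched by hand for the simplest case, GenBank flat file to FASTA. The sketch is written in Python rather than Perl and is not how ReadSeq or SeqIO work internally; it handles only single-line LOCUS and DEFINITION fields and the ORIGIN sequence block. The toy record is made up for the example.

```python
def genbank_to_fasta(genbank_text, width=60):
    """Convert one simplified GenBank record to FASTA text."""
    name, definition, chunks, in_origin = None, "", [], False
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            chunks.extend(line.split()[1:])
    sequence = "".join(chunks).upper()
    if name is None or not sequence:
        raise ValueError("no LOCUS line or no sequence found")
    body = "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
    return f">{name} {definition}\n{body}\n"


# Made-up toy record, not a real GenBank entry.
toy_genbank = """\
LOCUS       TOY0001       24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up toy record.
ORIGIN
        1 atggctagct agctagctaa tgac
//
"""
fasta_output = genbank_to_fasta(toy_genbank)
```

Note that the conversion is lossy: the annotation in the GenBank record has no place in FASTA, which is one reason dedicated conversion tools are preferred over ad hoc scripts.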


Thank You

Contact : gpgcm_bc ( Yahoo Group )