Sajid Khan. Chapter 2. Databases. Bioinformatics databases.

slide4

Database Nomenclature

What is a database?

Data repositories, data marts, and data warehouses differ primarily in the diversity of data sources that contribute to their contents.

slide5

Database Models

Defines data organization (schema)

  • Relational
      • Entities and relationships stored in tables
      • Predefined schema
      • Examples: Oracle, DB2, MySQL, PostgreSQL
  • Object-oriented
      • Stores data as objects (i.e., structures with predefined type)
      • Examples: Versant, Jasmine, Objectivity
  • Semi-structured
      • Schema dynamically defined within data (self-describing)
      • Flexible description of data with complex relationships
      • Example: XML databases
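As a minimal sketch of the relational model listed above, the following Python snippet uses the standard-library sqlite3 module to store two entities and one relationship in tables with a predefined schema. The table and column names (organism, gene, organism_id) are hypothetical and chosen only for illustration; they do not come from any particular bioinformatics database.

```python
import sqlite3

# In-memory database; the schema is declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organism (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE gene (
        id          INTEGER PRIMARY KEY,
        symbol      TEXT NOT NULL,
        organism_id INTEGER NOT NULL REFERENCES organism(id)
    );
""")

conn.execute("INSERT INTO organism (id, name) VALUES (1, 'Mus musculus')")
conn.execute("INSERT INTO gene (symbol, organism_id) VALUES ('Brca1', 1)")

# The relationship is recovered with a join over the foreign key.
rows = conn.execute("""
    SELECT gene.symbol, organism.name
    FROM gene JOIN organism ON gene.organism_id = organism.id
""").fetchall()
print(rows)
```

A semi-structured store would instead carry the field names inside each record, which is the trade-off the list above describes.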

Bioinformatic Databases

Useful information

  • DNA sequences
  • Conserved DNA domains
  • Genomes
  • Gene expression (ESTs, microarrays)
  • Protein sequences
  • Protein 3D structure
  • Protein families
  • Mutations / polymorphisms / SNPs
  • Metabolic pathways
  • Chemical compounds (ligands)
  • Biomedical literature (journal papers, online books…)

slide6

Classification schemes

  • Database design – relational, object-oriented…
  • Data type – DNA, RNA, EST, protein…
  • Organism – bacteria, virus, human…
  • Accessibility – public, academic, commercial
  • Data source – primary, derived
  • Data entry – manually curated, computationally derived
  • Focus – sequence-oriented, gene-oriented

Resulting in many bioinformatics databases…

Bioinformatics Databases Issues

  • Naming
      • Multiple names for the same chemical
      • Arising from multiple biological disciplines and conventions
      • Example: [figure not included in transcript]
slide7

Bioinformatics Databases Issues

  • Redundancy
      • Multiple entries for same DNA / protein sequence
      • Arising from multiple experiments & biological disciplines
      • Example – redundant GenBank entries for E. coli dUTPase: 4 separate publications (X01714, V01578, L10328, AE000441)
  • Data annotation & formats
      • Multiple data for single gene: sequence, location, expression, structure, function…
      • Resulting in multiple data annotations & formats
  • Data integration
      • Combining data from multiple bioinformatic databases
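The redundancy issue can be illustrated with a small Python sketch that groups entries by their sequence, so that several accession numbers pointing at the same sequence become visible. The accession labels and sequences below are placeholders, not real GenBank data, and `group_by_sequence` is a hypothetical helper. Matching on exact sequence identity is a simplifying assumption, since real redundant entries may differ in length or contain sequencing differences.

```python
from collections import defaultdict

def group_by_sequence(records):
    """Map each distinct sequence to the sorted accessions that carry it."""
    groups = defaultdict(list)
    for accession, sequence in records:
        # Normalise case and whitespace so trivially different copies match.
        key = "".join(sequence.split()).upper()
        groups[key].append(accession)
    return {seq: sorted(accs) for seq, accs in groups.items()}

# Placeholder records (accession, sequence); NOT real database entries.
records = [
    ("ACC0001", "atggct aaa"),
    ("ACC0002", "ATGGCTAAA"),
    ("ACC0003", "ATGCCC"),
]

# Keep only the sequences that appear under more than one accession.
redundant = {s: a for s, a in group_by_sequence(records).items() if len(a) > 1}
print(redundant)
```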
slide8

Major Bioinformatics Databases

  • DNA sequences
      • GenBank, RefSeq, UniGene
  • Protein sequences
      • Swiss-Prot, PIR-PSD, GenPept, TrEMBL, NR, RefSeq
  • Protein structure
      • Protein Data Bank (PDB)
  • Gene expression
      • Gene Expression Omnibus (GEO)
  • Biomedical publications
      • PubMed / MEDLINE

Derived databases

  • Compiled from data in primary databases
  • Manually curated (human selection & correction)
      • Advantages – high quality
      • Disadvantages – high expense, low volume
      • Examples: Swiss-Prot, PIR-PSD, RefSeq
  • Computational derivation (automatically generated)
      • Advantages – inexpensive, up-to-date
      • Disadvantages – lower quality
      • Examples: GenPept, TrEMBL, UniGene, COGs

Primary databases

  • Original submissions by researchers
  • Staff organizes information only
  • Generally sequence oriented
  • Examples: GenBank, PDB

slide9

Data Management / Databases Connection

Data Management. In this data-management scenario for a pharmacogenomic laboratory, data of various types are acquired from a variety of sources, incorporated into the data warehouse, used by a variety of applications, and archived for future use. Data created locally may be published electronically, serve as the basis for a paper publication, and be used in a variety of applications, from drug discovery to genetic engineering.

slide10

Pharmacogenomics and Aggression

Typical Electronic Medical Record (EMR) Contents.

The EMR contains both objective signs, such as physical examination findings, and subjective patient symptoms, including chief complaint and review of systems.

Aggression

To illustrate the data-management issues associated with a biotech research effort that depends on multiple, disparate systems and accompanying databases, assume that the laboratory depicted focuses on understanding the genetic basis for aggression, with a goal of creating new, more effective medications to control the behavior.

slide11

RNA-Protein Codon Transcription Wheel.

The 64 possible codons represent the 20 common amino acids, as well as one start (ATG) and three stop (TAG, TAA, and TGA) markers. Redundancies normally occur in the last nucleotide of the three-letter codon.
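The counts in this caption can be checked with a short Python sketch that builds the standard genetic code as a lookup table. The compact 64-letter string below lists one amino acid per codon in T, C, A, G order, with '*' marking a stop; `translate` is a hypothetical helper added for illustration, and the input sequence is a short illustrative one.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
amino_acids = set(CODON_TABLE.values()) - {"*"}

def translate(dna):
    """Translate a DNA coding sequence codon by codon, ending at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(len(CODON_TABLE), len(amino_acids), stop_codons)
print(translate("ATGGATTTATCTGCCTAA"))
```

The redundancy in the last nucleotide shows up directly in the table: for example, all four codons beginning with GG map to glycine.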

slide12

Integration of Clinical Data

To create an EMR capable of supporting efficient data mining, a data dictionary is used to impose a standard format and vocabulary on data stored in the clinical data mart.

Integration of Bioinformatics Data. Like clinical data, bioinformatics data from a variety of sources and in numerous formats are combined in a data mart to enhance data management.
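One way to picture a data dictionary imposing a standard format and vocabulary is the Python sketch below. Every field name, alias, and conversion rule here is a hypothetical example rather than part of any real EMR standard; the idea is only that incoming records from different sources are renamed and coerced to one agreed form before they enter the data mart.

```python
# Hypothetical data dictionary: canonical field name -> accepted source names
# and a converter that enforces the standard format.
DATA_DICTIONARY = {
    "patient_id": {"aliases": {"patient_id", "pid", "mrn"}, "convert": str},
    "weight_kg": {"aliases": {"weight_kg", "wt", "weight"}, "convert": float},
    "sex": {
        "aliases": {"sex", "gender"},
        "convert": lambda v: {"m": "male", "f": "female"}.get(
            str(v).strip().lower(), str(v).strip().lower()
        ),
    },
}

# Reverse lookup from any accepted source name to its canonical field.
ALIAS_TO_FIELD = {
    alias: field
    for field, spec in DATA_DICTIONARY.items()
    for alias in spec["aliases"]
}

def normalize(record):
    """Rename fields to canonical names and coerce values; drop unknown fields."""
    clean = {}
    for name, value in record.items():
        field = ALIAS_TO_FIELD.get(name.strip().lower())
        if field is not None:
            clean[field] = DATA_DICTIONARY[field]["convert"](value)
    return clean

normalized = normalize({"MRN": 1042, "Wt": "71.5", "Gender": "F", "notes": "n/a"})
print(normalized)
```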

slide13

From Data to Knowledge

Search / retrieval tool for multiple linked databases

  • Papers – biomedical literature (PubMed)
  • Nucleotide – sequence database (GenBank)
  • Protein – sequence database
  • Structure – 3D macromolecular structures
  • Genome – complete genome assemblies
  • OMIM – Online Mendelian Inheritance in Man
  • Taxonomy – organisms in GenBank
  • ProbeSet – gene expression and microarray datasets

Common identifiers for bioinformatic data

  • Locus name
  • Accession numbers
  • GenInfo ID
  • PubMed ID

Locus name: original identifiers of GenBank records

  • LOCUS line in GenBank entries
  • Originally: first 3 letters of organism followed by code for gene
  • Example: HUMBB for human β-globin region
  • Problems
      • Unmaintainable due to growth of data
      • Homologous genes not named the same
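Unlike locus names, accession numbers follow a fixed pattern, which makes them easy to validate mechanically. The Python sketch below assumes the classic nucleotide accession shapes (one letter plus five digits, or two letters plus six digits) with an optional ".version" suffix. This is a simplification: other accession shapes exist and are not covered, and `parse_accession` is a hypothetical helper.

```python
import re

# Assumed pattern: 1 letter + 5 digits or 2 letters + 6 digits,
# optionally followed by ".<version>".
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")

def parse_accession(text):
    """Return (accession, version or None), or None if the text does not match."""
    match = ACCESSION_RE.match(text.strip().upper())
    if match is None:
        return None
    accession, version = match.groups()
    return accession, (int(version) if version else None)

for candidate in ["U35641.1", "X01714", "AE000441", "HUMBB"]:
    print(candidate, parse_accession(candidate))
```

A locus name such as HUMBB does not fit the pattern and is rejected, which reflects why fixed-format accession numbers are easier to maintain than hand-assigned names.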

Data are stored / presented in a variety of formats

  • FASTA
  • GenBank
  • SwissProt
  • ASN.1
  • XML

slide14

Bioinformatic Database Formats

SwissProt format

  • Defined by SWISS-PROT database
  • Includes annotation, other info
  • Example

ID   BRC1_MOUSE STANDARD; PRT; 1812 AA.
AC   P48754; Q60957; Q60983;
DT   01-FEB-1996 (Rel. 33, Created)
DT   01-NOV-1997 (Rel. 35, Last sequence update)
DT   16-OCT-2001 (Rel. 40, Last annotation update)
DE   Breast cancer type 1 susceptibility protein homolog.
GN   BRCA1.
OS   Mus musculus (Mouse).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
OX   NCBI_TaxID=10090;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=C57BL/6; TISSUE=Embryo;
RX   MEDLINE=96177659; PubMed=8634697;
RA   Abel K.J., Xy J., Yin G.Y., Lyons R.H., Meisler M.H., Weber B.L.;
RT   "Mouse Brca1: localization sequence analysis and identification of
RT   evolutionarily conserved domains.";
RL   Hum. Mol. Genet. 4:2265-2273(1995)…
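Because every line of this format begins with a two-letter line code, a record can be read with very little code. The Python sketch below groups line contents by code; `parse_flat_file` is a hypothetical helper, and the sample is a shortened copy of the entry above. It is a rough illustration, not a complete parser for the format.

```python
from collections import defaultdict

def parse_flat_file(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # The line code is everything before the first space.
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)

entry = """\
ID   BRC1_MOUSE STANDARD; PRT; 1812 AA.
AC   P48754; Q60957; Q60983;
DE   Breast cancer type 1 susceptibility protein homolog.
GN   BRCA1.
OX   NCBI_TaxID=10090;
"""

fields = parse_flat_file(entry)
entry_name = fields["ID"][0].split()[0]
accessions = [a.strip() for a in fields["AC"][0].rstrip(";").split(";")]
print(entry_name, accessions)
```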

FASTA format

  • Used by FASTA tools
  • Comment line followed by sequence data
  • No annotation, just sequence
  • Example

>gi|1040960|gb|U35641.1|MMU35641 Mus musculus Brca1 mRNA…
GGCACGAGGATCCAGCACCTCTCTTGGGGCTTCTCCGTCCTCGGCGCTTGGAAGTAC
GATCTTTTTTCTCGGAGAAAAGTTCACTGGAACTGGAAGAAATGGATTTATCTGCC
GTCCAAATTCAAGAAGTACAAAATGTCCTTCATGCTATGCAGAAAATCTTAGAGTGT
CCGATCTGTTTGGAACTGATCAAAGAACCTGTTTCCACAAAGTGTGACCACATATTT
TGCAAATTTTGTATGCTGAAACTTCTTAACCAGAAGAAAGGGCCTTCACAATGTCCT
TTGTGTAAGAATGAGATAACCAAAAGGAGCCTACAGGGAAGCACAAGGTTTAGTCAG
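Reading this format needs only a few lines of code, which is a large part of its popularity. The Python sketch below yields (header, sequence) pairs and then splits the NCBI-style comment line on '|' to recover the identifiers. `read_fasta` is a hypothetical helper, and the sample uses only the first line of sequence from the example above, split across two lines to show that wrapped lines are rejoined.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

fasta = """\
>gi|1040960|gb|U35641.1|MMU35641 Mus musculus Brca1 mRNA
GGCACGAGGATCCAGCACCTCTCTTGGGGC
TTCTCCGTCCTCGGCGCTTGGAAGTAC
"""

records = list(read_fasta(fasta))
header, sequence = records[0]
# NCBI-style comment lines separate identifiers with '|'.
ids = header.split()[0].split("|")
print(ids, len(sequence))
```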

GenBank format

  • Flat file format used by GenBank
  • Annotation, author, version, etc…
  • Example (just the top)

LOCUS       MMU35641     5538 bp    mRNA    linear   ROD 18-OCT-1996
DEFINITION  Mus musculus Brca1 mRNA, complete cds.
ACCESSION   U35641
VERSION     U35641.1  GI:1040960
KEYWORDS    .
SOURCE      house mouse strain=C57Bl/6.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 5538)
  AUTHORS   Sharan,S.K., Wims,M. and Bradley,A.
  TITLE     Murine Brca1: sequence and significance for human missense
            mutations
  JOURNAL   Hum. Mol. Genet. 4 (12), 2275-2278 (1995)
  MEDLINE   96177660
   PUBMED   8634698
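The identifiers discussed earlier (locus name, accession number, GenInfo ID) all appear in the first lines of such a record. The Python sketch below pulls them out of a shortened header; `parse_genbank_header` is a hypothetical helper that treats any line starting in the first column as a new keyword and appends indented lines to the preceding one, which is a simplification of the real layout.

```python
def parse_genbank_header(text):
    """Collect top-level keyword fields; indented lines extend the previous field."""
    fields, current = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Keyword is the first word; the rest of the line is its value.
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
        elif current is not None:
            fields[current] += " " + line.strip()
    return fields

header = """\
LOCUS       MMU35641     5538 bp    mRNA    linear   ROD 18-OCT-1996
DEFINITION  Mus musculus Brca1 mRNA, complete cds.
ACCESSION   U35641
VERSION     U35641.1  GI:1040960
"""

fields = parse_genbank_header(header)
locus_name, length, unit = fields["LOCUS"].split()[:3]
accession_version, gi_field = fields["VERSION"].split()
gi_number = gi_field.split(":")[1]
print(locus_name, int(length), unit, accession_version, gi_number)
```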

slide15

Bioinformatic Database Formats

ASN.1 format

  • International standard
  • Semi-structured format
  • Base format for NCBI data
  • Example

Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Mus musculus Brca1 mRNA, and translated products" ,
    source {
      org {
        taxname "Mus musculus" ,
        db {
          {
            db "taxon" ,
            tag
              id 10090 } } ,
        orgname {
          name
            binomial {
              genus "Mus" ,
              species "musculus" } , …

eXtensible Markup Language (XML)

  • Open standard for semi-structured data, uses tags like HTML
  • Document split into content (XML), style (XSL), linking (XLL)
  • Example

<?xml version="1.0"?>
<!DOCTYPE GBSeq PUBLIC "-//NCBI//NCBI GBSeq/EN"
"http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd">
<GBSet>
<GBSeq>
<GBSeq_locus>MMU35641</GBSeq_locus>
<GBSeq_length>5538</GBSeq_length>
<GBSeq_strandedness value="not-set">0</GBSeq_strandedness>
<GBSeq_moltype value="mrna">5</GBSeq_moltype>
<GBSeq_topology value="linear">1</GBSeq_topology>
<GBSeq_division>ROD</GBSeq_division>
<GBSeq_update-date>18-OCT-1996</GBSeq_update-date>
<GBSeq_create-date>25-OCT-1995</GBSeq_create-date>
<GBSeq_definition>Mus musculus Brca1 mRNA, complete cds</GBSeq_definition>
<GBSeq_primary-accession>U35641</GBSeq_primary-accession>
<GBSeq_accession-version>U35641.1</GBSeq_accession-version>
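Because the tags describe the data, a generic XML parser can read such a record without format-specific code. The Python sketch below uses the standard-library xml.etree.ElementTree module on a small fragment in the style of the example above. The fragment is shortened and closed by hand so that it is well-formed, and the DOCTYPE declaration is omitted; both are simplifications for illustration.

```python
import xml.etree.ElementTree as ET

# A small, closed fragment in the style of the example; the DOCTYPE is omitted.
xml_text = """\
<GBSet>
  <GBSeq>
    <GBSeq_locus>MMU35641</GBSeq_locus>
    <GBSeq_length>5538</GBSeq_length>
    <GBSeq_division>ROD</GBSeq_division>
    <GBSeq_primary-accession>U35641</GBSeq_primary-accession>
  </GBSeq>
</GBSet>
"""

root = ET.fromstring(xml_text)
# Turn each record into a dictionary keyed by tag name without the prefix.
records = [
    {child.tag.replace("GBSeq_", ""): child.text for child in seq}
    for seq in root.findall("GBSeq")
]
print(records)
```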

Format conversion

  • Frequently tools handle only one of the data formats
  • Use software to transform between formats
      • ReadSeq, SeqIO

Perl (Practical Extraction and Report Language)

  • Portable C-like interpreted scripting language
  • Powerful pattern matching, string processing operations
  • Frequently used to extract / process bioinformatic data

BioPerl

  • Collection of Perl classes designed for bioinformatic tools
  • Sequence analysis, alignment, format conversion, I/O, automating bioinformatic analyses, parsing results, creating GUIs, managing persistent storage in RDBMS…
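To give a feel for what a format converter does, here is a hand-rolled Python sketch that turns a GenBank-style flat-file record into FASTA. It is not ReadSeq or BioPerl's SeqIO, and `genbank_to_fasta` is a hypothetical helper. The sample record is an assumption: it is cut down to three header fields, and its ORIGIN section holds only the first 30 bases of the sequence, since the slides show just the top of the GenBank entry.

```python
def genbank_to_fasta(text, width=60):
    """Convert one GenBank-style flat-file record to FASTA text."""
    accession, definition, bases, in_origin = "", "", [], False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            bases.extend(line.split()[1:])
        elif line.startswith("VERSION"):
            accession = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    sequence = "".join(bases).upper()
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{accession} {definition}"] + wrapped) + "\n"

# Shortened, illustrative record; the ORIGIN section is truncated on purpose.
record = """\
DEFINITION  Mus musculus Brca1 mRNA, complete cds.
VERSION     U35641.1  GI:1040960
ORIGIN
        1 ggcacgagga tccagcacct ctcttggggc
//
"""

fasta_text = genbank_to_fasta(record)
print(fasta_text, end="")
```

The conversion is lossy in one direction: the annotation in the flat file has no place in FASTA, so only the identifier, definition, and sequence survive.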

slide16

Thank You

Contact : gpgcm_bc ( Yahoo Group )