
Database Technology in Bioinformatics

Philip McNeil

European Bioinformatics Institute

The Information Challenge

Database Technologies

Which Do You Choose?

Data Modelling

Some Database Features In Use at EBI

Many new data intensive methodologies

Combinatorial Chemistry

Genomics (including Structural Genomics)

High Throughput Screening

Proteomics

Transgenics

Microarrays

Today’s Research

Generates ever increasing amounts of data:

Size

Complexity

Integration

‘Data Waves’
Protein:  G R Y S P L E M
DNA:      CAGTAGTGCACATCATTCGTCAATGCATACTGCACTAACCACACAGTAC

Protein:  G R Y S P L N M
DNA:      CAGTAAAGCACATCATTCGTCAATGCATACTGCACTAACCACACAGTAC

Molecular biology has become information intensive
Nucleotide Sequence

Protein Sequence

Protein Structure

Protein Function

Macromolecular Information

Data Complexity

Biological information all interrelated

DNA sequence, Protein sequence, Structure, Function

Specialist databases for organisms

HIV, Drosophila, C. elegans

Specialist databases for functions

Eukaryotic promoters, Transcription factors

Specialist databases for diseases and genes

P53, Haemophilia B

Proliferation of Databases
Artificial boundaries between databases

Coarse links between databases

Multitude of exchange formats

Lack of robustness

Varied quality

But…
Improved quality and integrity of data

Data need to be well structured and robustly defined

Flexible infrastructure to meet rapidly changing requirements

Open frameworks, management and analysis tools

Integrate diverse data sources

Meeting the Information Challenge
Evolution of DBMS Technology

Adapted from: Barry - The Object Database Handbook (1996)

Essentially four different types:

File System (Flat Files)

Relational database (RDBMS)

Object oriented database (ODBMS)

Object-relational database (ORDBMS)

Coming?

XML

Database Systems In Use Today
All computers have them!

Most of the world’s data still lives in old file systems as legacy data

Many bioinformatics databases are still distributed as flat files:

EMBL-Bank

MSD/PDB

SWISS-PROT/TrEMBL

File Systems
Well understood, mature technology

Most widely used DBMS

Standards:

SQL92

although all vendors used proprietary extensions

SQL99

Support for objects & other extensions

SQL2003

XML

Relational Databases
Extended the SQL92 data model:

User defined, complex data types

Types, subtypes, inheritance

References (‘OIDs’)

Now supported by the SQL99 standard

Many major relational databases now have object extensions

Object-Relational Databases
Persistent data store for objects created by object oriented programming languages

Language binding: C++, Smalltalk, Java

Standard:

Object query language (OQL)

Not implemented by many vendors

Now mainly used in niche areas

CAD/CAM, AI, telecomms

Object Oriented Databases
Three different types defined by the XML:DB initiative:

Native XML Database

XML Enabled Database

Hybrid XML Database

XML Databases
Defines a model for an XML document and stores and retrieves documents according to that model

XML document is the fundamental unit of storage (cf. row in a relational database)

Can be built on various underlying storage models (RDBMS, OODBMS, indexed compressed files)

Native XML Database
Has an added XML mapping layer

Original XML metadata & structure may be lost

Data retrieved as XML may not have originated in XML

Data manipulation via e.g. DOM or SAX or via SQL

Oracle, Microsoft, IBM use this approach

XML Enabled Database
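
A minimal sketch of the mapping-layer idea, using only Python's standard `xml.etree.ElementTree` and `sqlite3` modules rather than any vendor's XML feature. The `<entry>` document, the `entry` table and the function names `shred` and `publish` are assumptions made for illustration. The document is shredded into rows and XML is later regenerated from a query result, so parts of the original document that have no column to live in (here, a comment) do not survive the round trip.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document; element and attribute names are illustrative only.
SOURCE_XML = """<entries>
  <!-- curator note: checked by hand -->
  <entry id="E1"><name>lysozyme</name><length>129</length></entry>
  <entry id="E2"><name>insulin</name><length>51</length></entry>
</entries>"""


def shred(conn, xml_text):
    """Map each <entry> element onto one row of a relational table."""
    conn.execute("CREATE TABLE entry (id TEXT PRIMARY KEY, name TEXT, length INTEGER)")
    root = ET.fromstring(xml_text)
    for e in root.findall("entry"):
        conn.execute(
            "INSERT INTO entry VALUES (?, ?, ?)",
            (e.get("id"), e.findtext("name"), int(e.findtext("length"))),
        )


def publish(conn, min_length):
    """Run an SQL query and wrap the result rows as XML."""
    root = ET.Element("entries")
    rows = conn.execute(
        "SELECT id, name, length FROM entry WHERE length >= ? ORDER BY id",
        (min_length,),
    )
    for id_, name, length in rows:
        e = ET.SubElement(root, "entry", id=id_)
        ET.SubElement(e, "name").text = name
        ET.SubElement(e, "length").text = str(length)
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
shred(conn, SOURCE_XML)
regenerated = publish(conn, 100)  # XML built from rows, not the stored document
print(regenerated)
```

The XML that comes back was never stored as XML: it is generated from whatever the SQL query returns, which is the sense in which data retrieved as XML may not have originated in XML.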
Can store complex data (e.g. PDB)

Can be indexed (at least for simple datatypes)

Can be made publicly available in simple form

Well suited for human browsing

Avoid cost of database software

Platform independent

Easy to prepare for WWW

Cheap

Flat Files
Low data reliability, security & integrity

decentralised data and therefore decentralised control

Inadequate data structuring

difficult to provide adequate model of ‘the real world’

variety of formats - lack of robustness

Difficult to get answers to ad-hoc queries

no query language; data files are distinct

sophisticated query tools have been developed - e.g. SRS

Low responsiveness to change

data and programs are not independent

hard to integrate

Limitations of Flat File ‘Database’
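
The "no query language" point can be sketched as follows. The record layout is a hypothetical, heavily simplified format (a two-letter line code, a value, and `//` as record terminator) and the accession numbers are invented; it is not the real EMBL-Bank, PDB or SWISS-PROT format. An identifier index is easy to build, but any other question has to be answered by program code that scans every record.

```python
# Hypothetical, simplified flat-file records: a two-letter line code, then a value,
# with "//" ending each record. Real bioinformatics formats are far richer.
FLAT_FILE = """\
ID   X00001
OS   Drosophila melanogaster
SQ   CAGTAGTGCACATC
//
ID   X00002
OS   Caenorhabditis elegans
SQ   CAGTAAAGCACATC
//
"""


def parse_records(text):
    """Yield one dict per record, keyed by line code."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if record:
                yield record
            record = {}
        elif line.strip():
            code, value = line[:2], line[2:].strip()
            record[code] = value


def build_index(records):
    """Index records by identifier, the one lookup a flat file supports cheaply."""
    return {r["ID"]: r for r in records}


records = list(parse_records(FLAT_FILE))
index = build_index(records)

# An "ad-hoc query" has to be written as program code that scans every record.
drosophila_ids = [r["ID"] for r in records if r["OS"].startswith("Drosophila")]
print(drosophila_ids)
```

Because the parser is tied to the line codes, a change to the file layout means changing the program as well, which is the data/program dependence listed above.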
Store large amounts of relatively simple data as tables of rows & columns

Scalability

Sound theoretical basis

High security & reliability

Performance

Query optimization

Parallel processing

Strong support, tools, etc.

Benefits of Relational Databases
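
As a small illustration of rows, columns and declarative queries, the sketch below uses Python's built-in `sqlite3` module. The table names, columns and values are invented for the example and do not come from any EBI schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE sequence (
    accession   TEXT PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    length      INTEGER NOT NULL CHECK (length > 0)
);
""")
conn.executemany("INSERT INTO organism VALUES (?, ?)",
                 [(1, "Drosophila melanogaster"), (2, "Caenorhabditis elegans")])
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)",
                 [("X00001", 1, 1200), ("X00002", 2, 850), ("X00003", 1, 430)])

# An ad-hoc question answered declaratively; the engine decides how to execute it.
rows = conn.execute("""
    SELECT o.name, COUNT(*) AS n, SUM(s.length) AS total_length
    FROM sequence AS s JOIN organism AS o USING (organism_id)
    GROUP BY o.name
    ORDER BY o.name
""").fetchall()
print(rows)
```

The `CHECK` and `NOT NULL` clauses show the integrity side: the database itself rejects a sequence of length zero, whichever program tries to insert it.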
Cannot adequately support complex data

complex data are stored as ‘BLOBS’

BLOBS can be retrieved, but not searched, indexed or manipulated

Restricted set of data types even for less complex data

Numbers, character strings, dates

An inadequate model of ‘the real world’

entity/relationship model

loss of semantics

Expensive

Limitations of Relational Databases
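
The BLOB limitation can be sketched with `sqlite3` and `struct` from the Python standard library. The `structure` table and the packing layout (three little-endian doubles per atom) are assumptions for the example. The database returns the value intact, but filtering on its contents has to happen in the application after decoding.

```python
import sqlite3
import struct

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE structure (entry_id TEXT PRIMARY KEY, coords BLOB)")

# Pack (x, y, z) triples as little-endian doubles into one opaque value.
atoms = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
blob = b"".join(struct.pack("<3d", *xyz) for xyz in atoms)
conn.execute("INSERT INTO structure VALUES (?, ?)", ("entry1", blob))

# The database can hand the BLOB back, but it knows nothing about its contents:
# finding atoms with x > 3 means fetching the value and decoding it in the application.
(stored,) = conn.execute(
    "SELECT coords FROM structure WHERE entry_id = ?", ("entry1",)
).fetchone()
decoded = list(struct.iter_unpack("<3d", stored))
matches = [xyz for xyz in decoded if xyz[0] > 3.0]
print(matches)
```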
Close to relational model, but benefit from some OO concepts

Can handle large amounts of complex data

‘smart BLOBS’

Plug-in extensibility

cartridges & datablades

Good ad-hoc query capability: SQL99

High security & reliability

Benefits of Object-Relational Databases
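
A loose analogy for plug-in extensibility, not an example of a real cartridge or datablade: SQLite lets an application register its own function with the engine through `sqlite3.Connection.create_function`, after which SQL queries can call it. The `gc_content` function and the `fragment` table are hypothetical.

```python
import sqlite3


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide string."""
    if not seq:
        return None
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("gc_content", 1, gc_content)

conn.execute("CREATE TABLE fragment (name TEXT PRIMARY KEY, bases TEXT)")
conn.executemany("INSERT INTO fragment VALUES (?, ?)",
                 [("f1", "GGCC"), ("f2", "ATAT"), ("f3", "ATGC")])

# The domain-specific operation is now usable inside an ordinary ad-hoc query.
rich = conn.execute(
    "SELECT name FROM fragment WHERE gc_content(bases) >= 0.5 ORDER BY name"
).fetchall()
print(rich)
```

The point of the analogy is that the domain knowledge sits behind a name the query language can use, so the data stay searchable instead of being an opaque value.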
A compromise solution, merging two paradigms

underlying model still relational

Less than perfect support for object extensions

Even more expensive

Limitations of Object-Relational Databases
Support complex data structures

Provide a much better implementation of ‘the real world’ model

object oriented models map well - support for OO concepts

little loss of semantics

Vendor-specific:

Good performance?

Scalability?

Ad-hoc queries are possible with OQL

Closely integrated with programming languages

Benefits of ODBMSs
Hard to learn?

Vendor Specific:

Difficult to query?

Few currently support OQL; a few support SQL

Queries may have to be written in a 3GL, e.g. C++

Performance?

Security?

Reliability?

Scalability?

Backup & Recovery?

Also expensive

Few tools

Limitations of ODBMSs
It depends on the data and what you want to do with them

Which Do You Choose?

Michael Stonebraker: “Object-relational DBMS - The Next Wave”, Illustra whitepaper, http://www.informix.com/informix/corpinfo/zines/whitpprs

Most major data repositories in molecular biology have moved to using commercial RDBMS packages to manage their collections

Most groups still collect and deliver the information using flat file protocols and formats – XML is becoming dominant here

DBMS to store, flat files to communicate
Start with a conceptual model

Can be done using different approaches

objects

entity relationship

This can be implemented using different physical database systems and programming languages (not always without difficulty!)

Remember - your database will only be as good as the data model it supports

Modelling Comes First
UML stands for Unified Modeling Language

The UML combines elements from

Data Modelling concepts (Entity Relationship Diagrams)

Business Modelling (work flow)

Object Modelling

Component Modelling

UML is the OMG standard language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system

UML
  • Classes
  • Attributes
  • Links
  • Operations
    • Set and get (implicit)
  • Checks and constraints

[Figure: UML class diagram with the classes Study, Experiment, Conditions and ExpDim, joined by 1-to-1 and 1-to-many (*) links. Attributes shown: +details: String, +name: String, +serial: Int, +ndim: Int, +temperature: Float, +pH: Float, +dim: Int; operation shown: +__init__()]

UML: Basics
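
One way the class, attribute and link names on this slide might be written down in code, as Python dataclasses. The transcript does not say which attribute belongs to which class, so the grouping below is an assumption, as are the two constraints; it is a sketch of the modelling concepts, not the model used at EBI.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExpDim:
    dim: int


@dataclass
class Conditions:
    temperature: float
    pH: float

    def __post_init__(self):
        # "Checks and constraints": an illustrative range check, assumed here.
        if not 0.0 <= self.pH <= 14.0:
            raise ValueError("pH must lie between 0 and 14")


@dataclass
class Experiment:
    serial: int
    name: str
    ndim: int
    conditions: Conditions
    dims: List[ExpDim] = field(default_factory=list)  # a 1-to-* link

    def __post_init__(self):
        # Illustrative consistency check between an attribute and a link.
        if len(self.dims) > self.ndim:
            raise ValueError("more ExpDim objects than ndim allows")


@dataclass
class Study:
    details: str
    experiments: List[Experiment] = field(default_factory=list)  # a 1-to-* link


study = Study(details="example study")
study.experiments.append(
    Experiment(serial=1, name="exp1", ndim=2,
               conditions=Conditions(temperature=298.0, pH=7.0),
               dims=[ExpDim(1), ExpDim(2)])
)
print(len(study.experiments[0].dims))
```

Attribute access stands in for the implicit set and get operations; the `__post_init__` methods are where checks and constraints would sit.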

Sequence Schema

Database designed for queries and analysis

Facilitate the synchronisation with other databases

Repository for derived data

Modular

The MSD Data Warehouse
[Figure: pipeline diagram with the stages Deposition, Stage1, Warehouse and Search-Warehouse; the arrows are labelled replication, transformation and distribution]

From Deposition to Distribution

[Figure: the hierarchy Exp. Result, Assembly, Chains, Residues, Atoms shown alongside schema entities labelled ENTRY, ASSEMBLY, CHAIN, RESIDUE, ATOM and MODEL, with further labels ALT and DATA]

Representing Macromolecular Structures
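
A minimal relational sketch of part of the hierarchy named on this slide (entry, chain, residue, atom), using `sqlite3`. The column names and sample values are assumptions, and the real MSD schema is certainly more elaborate; the sketch only shows how each level can reference its parent and how a query walks the hierarchy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE entry   (entry_id   TEXT PRIMARY KEY);
CREATE TABLE chain   (chain_id   INTEGER PRIMARY KEY,
                      entry_id   TEXT NOT NULL REFERENCES entry(entry_id),
                      code       TEXT NOT NULL);
CREATE TABLE residue (residue_id INTEGER PRIMARY KEY,
                      chain_id   INTEGER NOT NULL REFERENCES chain(chain_id),
                      name       TEXT NOT NULL,
                      serial     INTEGER NOT NULL);
CREATE TABLE atom    (atom_id    INTEGER PRIMARY KEY,
                      residue_id INTEGER NOT NULL REFERENCES residue(residue_id),
                      name       TEXT NOT NULL,
                      x REAL, y REAL, z REAL);
""")
conn.execute("INSERT INTO entry VALUES ('entry1')")
conn.execute("INSERT INTO chain VALUES (1, 'entry1', 'A')")
conn.executemany("INSERT INTO residue VALUES (?, ?, ?, ?)",
                 [(1, 1, "GLY", 1), (2, 1, "ARG", 2)])
conn.executemany("INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, "N", 0.0, 0.0, 0.0), (2, 1, "CA", 1.5, 0.0, 0.0),
                  (3, 2, "N", 3.0, 1.0, 0.0)])

# Walk the hierarchy with joins: number of atoms per residue within one entry.
atoms_per_residue = conn.execute("""
    SELECT r.name, COUNT(a.atom_id)
    FROM entry e
    JOIN chain c   ON c.entry_id = e.entry_id
    JOIN residue r ON r.chain_id = c.chain_id
    JOIN atom a    ON a.residue_id = r.residue_id
    WHERE e.entry_id = 'entry1'
    GROUP BY r.residue_id
    ORDER BY r.serial
""").fetchall()
print(atoms_per_residue)
```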

Need for staging databases

Transformation Mechanism

Deposition – Warehouse refresh

Replication/Distribution

Query optimisation

Interfacing with the warehouse (API, web, management tools)

Technical Details
Moving from a complex normalised model that enforces integrity to a simple, efficient, user-oriented model

Based on a flexible, metadata-driven mechanism developed in-house to overcome Oracle limitations

Models composite entities and their dependencies

Allows incremental transformation for complex queries

Transformation
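
The move from a normalised model to a flat, user-oriented one can be sketched with `sqlite3`. Table and column names are hypothetical, and the sketch rebuilds the summary table from scratch on every refresh; it does not attempt the metadata-driven, incremental mechanism described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised source tables (hypothetical names).
CREATE TABLE entry   (entry_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE chain   (chain_id INTEGER PRIMARY KEY,
                      entry_id TEXT REFERENCES entry(entry_id), code TEXT);
CREATE TABLE residue (residue_id INTEGER PRIMARY KEY,
                      chain_id INTEGER REFERENCES chain(chain_id), name TEXT);

INSERT INTO entry   VALUES ('entry1', 'example protein');
INSERT INTO chain   VALUES (1, 'entry1', 'A'), (2, 'entry1', 'B');
INSERT INTO residue VALUES (1, 1, 'GLY'), (2, 1, 'ARG'), (3, 2, 'TYR');
""")


def refresh_chain_summary(conn):
    """Rebuild a flat, query-friendly table from the normalised ones."""
    conn.executescript("""
    DROP TABLE IF EXISTS chain_summary;
    CREATE TABLE chain_summary AS
        SELECT e.entry_id, e.title, c.code AS chain_code,
               COUNT(r.residue_id) AS n_residues
        FROM entry e
        JOIN chain c ON c.entry_id = e.entry_id
        LEFT JOIN residue r ON r.chain_id = c.chain_id
        GROUP BY c.chain_id;
    """)


refresh_chain_summary(conn)
# End users query one wide table and never need to know about the joins.
summary = conn.execute(
    "SELECT entry_id, chain_code, n_residues FROM chain_summary ORDER BY chain_code"
).fetchall()
print(summary)
```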
Mechanism to allow incremental replication and distribution of the data warehouse

End user only has to run a script periodically that downloads and applies modifications

Works across different platforms and database vendors, from MS Access and MySQL up to Oracle

Mechanism flexible enough to adapt to different schemas

Replication
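
A sketch of what such a periodic script could look like, assuming the master site publishes a numbered change log. The log format, the `replication_state` table and the function names are all invented for the example; the presentation does not describe the actual mechanism at this level of detail. The replica remembers the last change it applied, so running the script again applies only what is new.

```python
import sqlite3

# Hypothetical change log from the master site: (sequence number, SQL, parameters).
CHANGE_LOG = [
    (1, "INSERT INTO entry VALUES (?, ?)", ("entry1", "first title")),
    (2, "INSERT INTO entry VALUES (?, ?)", ("entry2", "second title")),
    (3, "UPDATE entry SET title = ? WHERE entry_id = ?", ("revised title", "entry1")),
]


def init_replica(conn):
    """Create the replicated table plus a one-row bookkeeping table."""
    conn.execute("CREATE TABLE entry (entry_id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE replication_state (last_applied INTEGER NOT NULL)")
    conn.execute("INSERT INTO replication_state VALUES (0)")
    conn.commit()


def apply_changes(conn, change_log):
    """Apply only the changes newer than the last one already applied."""
    (last,) = conn.execute("SELECT last_applied FROM replication_state").fetchone()
    applied = 0
    with conn:  # one transaction: either all pending changes land or none do
        for seq, sql, params in sorted(change_log, key=lambda c: c[0]):
            if seq <= last:
                continue
            conn.execute(sql, params)
            conn.execute("UPDATE replication_state SET last_applied = ?", (seq,))
            applied += 1
    return applied


replica = sqlite3.connect(":memory:")
init_replica(replica)
first_run = apply_changes(replica, CHANGE_LOG[:2])  # an earlier download
second_run = apply_changes(replica, CHANGE_LOG)     # a later download that overlaps it
print(first_run, second_run)
```

Keeping the statements in plain SQL with parameters is one way such a log could stay portable across database vendors, though real dialect differences would need more care than this sketch shows.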
Mechanism for deriving and keeping up to date additional information

Altering existing software to read and write to the warehouse instead of flat files

Linking DSSP with the sql-wrapper library to derive secondary structure

Active-site information

Derived Data
[Figure: diagram of mappings between PDB (ATOMS <-> SEQRES), SWISS-PROT / TrEMBL, CATH, SCOP, PFAM, PROSITE, InterPro and GO. Labels on the links include Common Domains definition, Mapping, Curated, HMM predicted and Mapping start / end, with the MSD, SWISS-PROT and SCOP groups named as curators]

Integration & Distribution of Sequence & Structure Data