Contents of this Talk

  • [Used as intro to Genome Databases Seminar, 2002]

  • Overview of bioinformatics

  • Motivations for genome databases

  • Analogy of virus reverse-eng to genome analysis

  • Questions to ask of a genome DB


Overview of Genome Databases

Peter D. Karp, Ph.D.

SRI International

www-db.stanford.edu/dbseminar/seminar.html


Talk Overview

  • Definition of bioinformatics

  • Motivations for genome databases

  • Computer virus analogy

  • Issues in building genome databases


Definition of Bioinformatics

  • Computational techniques for management and analysis of biological data and knowledge

    • Methods for disseminating, archiving, interpreting, and mining scientific information

  • Computational theories of biology

  • Genome Databases is a subfield of bioinformatics


Motivations for Bioinformatics

  • Growth in molecular-biology knowledge (literature)

  • Genomics

    • Study of genomes through DNA sequencing

    • Industrial Biology


Example Genomics Datatypes

  • Genome sequences

    • DOE Joint Genome Institute

      • 511M bases in Dec 2001

      • 11.97G bases since Mar 1999

  • Gene and protein expression data

  • Protein-protein interaction data

  • Protein 3-D structures


Genome Databases

  • Experimental data

    • Archive experimental datasets

    • Retrieving past experimental results should be faster than repeating the experiment

    • Capture alternative analyses

    • Lots of data, simpler semantics

  • Computational symbolic theories

    • Complex theories become too large to be grasped by a single mind

    • The database is the theory

    • Biology is very much concerned with qualitative relationships

    • Less data, more complex semantics


Bioinformatics

  • Distinct intellectual field at the intersection of CS and molecular biology

  • Distinct field because researchers in the field should know CS, biology, and bioinformatics

  • Spectrum from CS research to biology service

  • Rich source of challenging CS problems

  • Large, noisy, complex data-sets and knowledge-sets

  • Biologists and funding agencies demand working solutions


Bioinformatics Research

  • algorithms + data structures = programs

  • algorithms + databases = discoveries

  • Combine sophisticated algorithms with the right content:

    • Properly structured

    • Carefully curated

    • Relevant data fields

    • Proper amount of data


Goals of Systems Biology

  • Catalog the molecular parts lists of cells

  • Understand the function(s) of each part

  • Understand how those parts interact to produce the behavior of a cell or organism

  • Understand the evolution of those molecular parts


Analogy: Genome Analysis and Virus Analysis

  • Given: Virus binary executable file for known machine architecture

  • Reverse engineer the program

    • Procedures

    • Call graph

    • Specifications for I/O behavior of the program and all procedures

  • Capture and publish an annotated analysis of the virus

  • Comparative analysis of related viruses


Genome Analysis

  • Example: M. tuberculosis genome

  • Given: 4.4Mbp of DNA (genome)

  • Infer:

    • Molecular parts list of Mtb

    • A model of the biochemical machinery of Mtb cell

  • DNA is a blueprint for the program of life


Start

  • Virus analysis: 4.4Mbyte binary program

  • Genome analysis: 4.4Mbp DNA sequence


Step 1

  • Virus analysis: Distinguish code from data segments; find procedure boundaries

  • Genome analysis: Distinguish coding from non-coding regions (gene finding)
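
The genome side of Step 1 can be illustrated with a deliberately simplified open-reading-frame (ORF) scan. This is only a sketch of the idea, not the method used in the talk: the function name `find_orfs`, the `min_codons` threshold, and the restriction to the forward strand are all assumptions made for illustration. Real gene finders use statistical models and examine both strands.

```python
# Toy ORF scan (illustrative assumption, not a production gene finder).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) index pairs of forward-strand ORFs.

    An ORF here starts at the first ATG seen in a reading frame and ends
    just after the next in-frame stop codon; `end` is exclusive.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it is long enough (stop codon included).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

For example, `find_orfs("CCATGAAATAGCC")` reports one candidate coding region covering `ATGAAATAG`; everything outside the reported spans would be treated as non-coding by this toy model.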


Step 2

  • Virus analysis: Predict semantics of procedures

  • Genome analysis: Predict gene functions

[Diagram with nodes A, B, C, D]


Step 3

  • Virus analysis: Predict procedure call graph

  • Genome analysis: Predict biochemical and gene networks

[Diagrams with nodes A, B, C, D]
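
Both a procedure call graph and a biochemical or gene network can be stored as a directed graph. The sketch below assumes an adjacency-set representation and invents the edges among nodes A to D, since the connections drawn on the slide are not recoverable from the transcript. The helper name `downstream` is likewise hypothetical.

```python
from collections import deque

# Hypothetical edges among the slide's nodes A-D (the real diagram's
# edges are unknown); each key maps to the nodes it calls or activates.
NETWORK = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D"},
    "D": set(),
}


def downstream(graph, source):
    """Return every node reachable from `source` via breadth-first search."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

In the virus reading, `downstream(NETWORK, "A")` lists the procedures that A can transitively invoke; in the genome reading, it lists the network components that A can influence.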


Step 4

  • Virus analysis: Predict conditions under which procedures are invoked

  • Genome analysis: Predict expression of network fragments

[Diagram with nodes A, B, C, D, Q, R, S]


Step 5

  • Virus analysis: Infer complete program specification

  • Genome analysis: Formulate dynamic cellular simulation


Step 6

  • Virus analysis: Internet publishing of structured program annotation with explanations, references, commentary

  • Genome analysis: Internet publishing of structured genome annotation with explanations, references, commentary


Step 7

  • Virus analysis: Comparative analysis of viruses; evolutionary relationships among viruses

  • Genome analysis: Comparative analysis of genomes; evolutionary relationships among genomes


Step 8

  • Virus analysis: Identify measures to disable virus or prevent its spread

  • Genome analysis: Identify target proteins for anti-microbial drug discovery

[Diagram with nodes A, B, C, D, Q, R, S]


Database of Viruses

  • Create a database that stores

    • Binaries for all viruses

    • All annotation of virus programs by different investigators

    • Comparative analyses

  • Support

    • Remote API access

    • Click-at-a-time browsing


Reference on Major Genome Databases

  • Nucleic Acids Research Database Issue

  • http://nar.oupjournals.org/content/vol30/issue1/

    • 112 databases


Questions to Ask of a New Genome Database


What are Database Goals and Requirements?

  • How many users?

  • What expertise do users have?

  • What problems will database be used to solve?


What is its Organizing Principle?

  • Different DBs partition the space of genome information in different dimensions

  • Experimental methods (Genbank, PDB)

  • Organism (EcoCyc, Flybase)


What is its Level of Interpretation?

  • Laboratory data

  • Primary literature (Genbank)

  • Review (SwissProt, MetaCyc)

  • Does DB model disagreement?


What are its Semantics and Content?

  • What entities and relationships does it model?

  • How does its content overlap with similar DBs?

  • How many entities of each type are present?

  • Sparseness of attributes and statistics on attribute values
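
The last two questions can be answered mechanically once a database can be exported as records. The following sketch assumes each record is a dictionary with a `type` key plus arbitrary attributes; the function name `content_profile` and the record layout are hypothetical, chosen only to show how entity counts and attribute sparseness might be computed.

```python
from collections import Counter, defaultdict


def content_profile(records):
    """Summarize a list of dict records that each carry a 'type' key.

    Returns (counts, fill) where counts[type] is the number of entities
    of that type and fill[type][attr] is the fraction of those entities
    whose attribute is present and non-empty.
    """
    counts = Counter(r["type"] for r in records)
    filled = defaultdict(Counter)
    for r in records:
        for attr, value in r.items():
            # Missing, None, and empty-string values all count as unfilled.
            if attr != "type" and value not in (None, ""):
                filled[r["type"]][attr] += 1
    fill = {
        t: {attr: n / counts[t] for attr, n in attrs.items()}
        for t, attrs in filled.items()
    }
    return counts, fill
```

A fill fraction well below 1.0 for an attribute signals that queries depending on it will silently miss many entities, which is the practical meaning of sparseness here.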


What are Sources of its Data?

  • Potential information sources

    • Laboratory instruments

    • Scientific literature

      • Manual entry

      • Natural-language text mining

    • Direct submission from the scientific community

      • Genbank

  • Modification policy

    • DB staff only

    • Submission of new entries by scientific community

    • Update access by scientific community


What DBMS is Employed?

  • None

  • Relational

  • Object oriented

  • Frame knowledge representation system
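
For readers unfamiliar with the last option, a frame system stores knowledge as frames (classes and instances) whose slots can be inherited from parent frames. The sketch below is a minimal illustration under that general definition; the class `Frame`, its depth-first inheritance rule, and the example frames are assumptions and do not reproduce any particular frame system.

```python
class Frame:
    """A named frame with local slots and zero or more parent frames."""

    def __init__(self, name, parents=(), **slots):
        self.name = name
        self.parents = list(parents)
        self.slots = dict(slots)

    def get(self, slot):
        """Return the local slot value, else the first value inherited
        depth-first from the parents; raise KeyError if none defines it."""
        if slot in self.slots:
            return self.slots[slot]
        for parent in self.parents:
            try:
                return parent.get(slot)
            except KeyError:
                continue
        raise KeyError(f"{self.name} has no slot {slot!r}")


# Hypothetical class and instance frames for illustration only.
protein = Frame("Protein", polymer_of="amino acids")
enzyme = Frame("Enzyme", parents=[protein], role="catalyst")
example = Frame("example-enzyme", parents=[enzyme], gene="example-gene")
```

Here `example.get("polymer_of")` resolves through `Enzyme` to `Protein`, which is the kind of inherited semantics a relational table would need extra joins or conventions to express.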


Distribution / User Access

  • Multiple distribution forms enhance access

  • Browsing access with visualization tools

  • API

  • Portability


What Validation Approaches are Employed?

  • None

  • Declarative consistency constraints

  • Programmatic consistency checking

  • Internal vs external consistency checking

  • What types of systematic errors might DB contain?
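
The difference between declarative constraints and programmatic checking can be sketched with Python's standard `sqlite3` module. The schema, table names, and the overlap check are hypothetical examples, not taken from any database named in the talk. The constraints live in the schema and are enforced by the DBMS on every insert, while the overlap check is ordinary code run on demand.

```python
import sqlite3


def make_db():
    """Create an in-memory DB whose schema carries declarative constraints."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.execute(
        "CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )
    conn.execute(
        """CREATE TABLE gene (
               id INTEGER PRIMARY KEY,
               organism_id INTEGER NOT NULL REFERENCES organism(id),
               start_pos INTEGER NOT NULL CHECK (start_pos >= 1),
               end_pos INTEGER NOT NULL,
               CHECK (end_pos > start_pos)
           )"""
    )
    return conn


def overlapping_genes(conn):
    """Programmatic check: list gene pairs in one organism whose coordinates
    overlap. Overlaps can be biologically real, so these are flags for
    curator review rather than definite errors."""
    return conn.execute(
        """SELECT a.id, b.id
           FROM gene a JOIN gene b
             ON a.organism_id = b.organism_id AND a.id < b.id
            AND a.start_pos <= b.end_pos AND b.start_pos <= a.end_pos
           ORDER BY a.id, b.id"""
    ).fetchall()
```

With this schema, inserting a gene whose end precedes its start, or one that refers to a nonexistent organism, raises `sqlite3.IntegrityError`. Both are internal consistency checks; external consistency (agreement with other databases or the literature) needs separate tooling.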


Database Documentation

  • Schema and its semantics

  • Format

  • API

  • Data acquisition techniques

  • Validation techniques

  • Size of different classes

  • Coverage of subject matter

  • Sparseness of attributes

  • Error rates


Relationship of Database Field to Bioinformatics

  • Scientists generally ignorant of basic DB principles

    • Complex queries vs click-at-a-time access

    • Data model

    • Defined semantics for DB fields

    • Controlled vocabularies

    • Regular syntax for flatfiles

    • Automated consistency checking

  • Most biologists take one programming class

  • Evolution of typical genome database

  • Finer points of DB research off their radar screen

  • Handful of DB researchers work in bioinformatics
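
Three of the principles listed above (regular flatfile syntax, controlled vocabularies, automated consistency checking) can be shown together in a small parser. The `TAG - value` line format, the `//` record terminator, and the evidence vocabulary are invented for this sketch and are not claimed to match any real database's flatfile format.

```python
import re

# Hypothetical grammar: each data line is "TAG - value"; "//" ends a record.
LINE = re.compile(r"^([A-Z][A-Z0-9-]*) - (.+)$")

# Hypothetical controlled vocabulary for the EVIDENCE tag.
EVIDENCE_TERMS = {"EXPERIMENTAL", "COMPUTATIONAL", "LITERATURE"}


def parse_flatfile(text):
    """Parse records into dicts mapping each tag to a list of its values.

    Raises ValueError on any line that violates the grammar or uses an
    EVIDENCE value outside the controlled vocabulary.
    """
    records, current = [], {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "//":
            records.append(current)
            current = {}
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: does not match 'TAG - value'")
        tag, value = match.groups()
        if tag == "EVIDENCE" and value not in EVIDENCE_TERMS:
            raise ValueError(f"line {lineno}: unknown evidence term {value!r}")
        current.setdefault(tag, []).append(value)
    if current:
        raise ValueError("last record is missing its '//' terminator")
    return records
```

Because the syntax is regular, a malformed file fails at load time with a line number instead of producing subtly wrong records downstream.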


Database Field

  • For many years, the majority of bioinformatics DBs did not employ a DBMS

    • Flatfiles were the rule

    • Scientists want to see the data directly

    • Commercial DBMSs too expensive, too complex

    • DBAs too expensive

  • Most scientists do not understand

    • Differences between BA, MS, PhD in CS

    • CS research vs applications

    • Implications for project planning, funding, bioinformatics research


Recommendation

  • Teaching scientists programming is not enough

  • Teaching scientists how to build a DBMS is irrelevant

  • Teach scientists basic aspects of databases and symbolic computing

    • Database requirements analysis

    • Data models, schema design

    • Knowledge representation, ontologies

    • Formal grammars

    • Complex queries

    • Database interoperability
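
To make "complex queries" concrete, the sketch below asks one question of a whole database instead of clicking through one entry at a time. It uses Python's standard `sqlite3` module; the two-table schema, the placeholder names, and the function names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with placeholder pathways and enzymes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pathway (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE enzyme (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            pathway_id INTEGER REFERENCES pathway(id)
        );
        INSERT INTO pathway VALUES (1, 'pathway-1'), (2, 'pathway-2');
        INSERT INTO enzyme VALUES
            (1, 'enzyme-a', 1), (2, 'enzyme-b', 1), (3, 'enzyme-c', 2);
        """
    )
    return conn


def pathways_with_min_enzymes(conn, minimum):
    """Return (pathway name, enzyme count) for pathways with at least
    `minimum` enzymes, using a single join-and-aggregate query."""
    return conn.execute(
        """SELECT p.name, COUNT(e.id)
           FROM pathway p JOIN enzyme e ON e.pathway_id = p.id
           GROUP BY p.id, p.name
           HAVING COUNT(e.id) >= ?
           ORDER BY p.name""",
        (minimum,),
    ).fetchall()
```

Answering the same question by browsing would require visiting every pathway page and counting by hand, which is why a defined data model and query access matter as a database grows.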

