Contents of this Talk

Contents of this Talk. [Used as intro to Genome Databases Seminar, 2002] Overview of bioinformatics Motivations for genome databases Analogy of virus reverse-eng to genome analysis Questions to ask of a genome DB. Overview of Genome Databases. Peter D. Karp, Ph.D. SRI International

Presentation Transcript
Contents of this Talk
  • [Used as intro to Genome Databases Seminar, 2002]
  • Overview of bioinformatics
  • Motivations for genome databases
  • Analogy of virus reverse-eng to genome analysis
  • Questions to ask of a genome DB

Overview of Genome Databases

Peter D. Karp, Ph.D.

SRI International

[email protected]

www-db.stanford.edu/dbseminar/seminar.html

Talk Overview
  • Definition of bioinformatics
  • Motivations for genome databases
  • Computer virus analogy
  • Issues in building genome databases
Definition of Bioinformatics
  • Computational techniques for management and analysis of biological data and knowledge
    • Methods for disseminating, archiving, interpreting, and mining scientific information
  • Computational theories of biology
  • Genome databases are a subfield of bioinformatics
Motivations for Bioinformatics
  • Growth in molecular-biology knowledge (literature)
  • Genomics
    • Study of genomes through DNA sequencing
    • Industrial Biology
Example Genomics Datatypes
  • Genome sequences
    • DOE Joint Genome Institute
      • 511M bases in Dec 2001
      • 11.97G bases since Mar 1999
  • Gene and protein expression data
  • Protein-protein interaction data
  • Protein 3-D structures
Genome Databases
  • Experimental data
    • Archive experimental datasets
    • Retrieving past experimental results should be faster than repeating the experiment
    • Capture alternative analyses
    • Lots of data, simpler semantics
  • Computational symbolic theories
    • Complex theories become too large to be grasped by a single mind
    • The database is the theory
    • Biology is very much concerned with qualitative relationships
    • Less data, more complex semantics
Bioinformatics
  • Distinct intellectual field at the intersection of CS and molecular biology
  • Distinct field because researchers in the field should know CS, biology, and bioinformatics
  • Spectrum from CS research to biology service
  • Rich source of challenging CS problems
  • Large, noisy, complex data-sets and knowledge-sets
  • Biologists and funding agencies demand working solutions
Bioinformatics Research
  • algorithms + data structures = programs
  • algorithms + databases = discoveries
  • Combine sophisticated algorithms with the right content:
    • Properly structured
    • Carefully curated
    • Relevant data fields
    • Proper amount of data
Goals of Systems Biology
  • Catalog the molecular parts lists of cells
  • Understand the function(s) of each part
  • Understand how those parts interact to produce the behavior of a cell or organism
  • Understand the evolution of those molecular parts
Analogy: Genome Analysis and Virus Analysis
  • Given: Virus binary executable file for known machine architecture
  • Reverse engineer the program
    • Procedures
    • Call graph
    • Specifications for I/O behavior of the program and all procedures
  • Capture and publish an annotated analysis of the virus
  • Comparative analysis of related viruses
Genome Analysis
  • Example: M. tuberculosis genome
  • Given: 4.4Mbp of DNA (genome)
  • Infer:
    • Molecular parts list of Mtb
    • A model of the biochemical machinery of Mtb cell
  • DNA is a blueprint for the program of life
Start

4.4Mbyte binary program

4.4Mbp DNA sequence

Step 1

Distinguish code from data segments

Find procedure boundaries

Distinguish coding from non-coding regions: gene finding
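
As a rough illustration of what "distinguish coding from non-coding regions" can mean in code, the sketch below scans the forward strand for open reading frames: a start codon followed by codons up to the first in-frame stop. This is a deliberately naive stand-in for real gene finders, which use statistical models; the function name `find_orfs` and the `min_codons` threshold are assumptions made for this example, not something the talk specifies.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) slices of forward-strand ORFs.

    An ORF here is ATG followed by in-frame codons up to and including
    the first stop codon. min_codons counts codons before the stop.
    This is a toy heuristic, not a production gene finder.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                j = i + 3
                # Walk codon by codon until a stop codon or the end.
                while j + 3 <= len(dna) and dna[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 <= len(dna):  # an in-frame stop codon was found
                    if (j - i) // 3 >= min_codons:
                        orfs.append((i, j + 3))
                    i = j + 3
                    continue
            i += 3
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATTTGGGTAACC"
    for start, end in find_orfs(sequence):
        print(start, end, sequence[start:end])
```

Only the forward strand is scanned; a fuller sketch would also scan the reverse complement.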

Step 2

Predict semantics of procedures

[Figure: diagram with nodes A, B, C, D]

Predict gene functions

Step 3

Predict procedure call graph

[Figure: graph diagrams over nodes A, B, C, D]

Predict biochemical and gene networks
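
Both sides of this step produce a directed graph: procedures that call procedures, or genes and reactions that feed into one another. A minimal sketch of that shared data structure is an adjacency mapping plus a breadth-first traversal. The edges below are hypothetical (the slide's actual edges are not recoverable from the transcript), and `reachable` is a name chosen for this example.

```python
from collections import deque

# Hypothetical edges: "A": ["B", "C"] means A calls (or feeds into) B and C.
CALL_GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}


def reachable(graph, start):
    """Return nodes reachable from start, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    print(reachable(CALL_GRAPH, "A"))
```

The same traversal answers "what does this procedure eventually invoke?" and "what lies downstream of this gene product?".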

Step 4

Predict conditions under which procedures are invoked

[Figure: diagram with nodes A, B, C, D, Q, R, S]

Predict expression of network fragments

Step 5

Infer complete program specification

Formulate dynamic cellular simulation

Step 6

Internet publishing of structured program annotation with explanations, references, commentary

Internet publishing of structured genome annotation with explanations, references, commentary

Step 7

Comparative analysis of viruses

Evolutionary relationships among viruses

Comparative analysis of genomes

Evolutionary relationships among genomes

Step 8

Identify measures to disable virus or prevent its spread

[Figure: diagram with nodes A, B, C, D, Q, R, S]

Identify target proteins for anti-microbial drug discovery

Database of Viruses
  • Create a database that stores
    • Binaries for all viruses
    • All annotation of virus programs by different investigators
    • Comparative analyses
  • Support
    • Remote API access
    • Click-at-a-time browsing
Reference on Major Genome Databases
  • Nucleic Acids Research Database Issue
  • http://nar.oupjournals.org/content/vol30/issue1/
    • 112 databases
What are Database Goals and Requirements?
  • How many users?
  • What expertise do users have?
  • What problems will database be used to solve?
What is its Organizing Principle?
  • Different DBs partition the space of genome information in different dimensions
  • Experimental methods (Genbank, PDB)
  • Organism (EcoCyc, Flybase)
What is its Level of Interpretation?
  • Laboratory data
  • Primary literature (Genbank)
  • Review (SwissProt, MetaCyc)
  • Does DB model disagreement?
What are its Semantics and Content?
  • What entities and relationships does it model?
  • How does its content overlap with similar DBs?
  • How many entities of each type are present?
  • Sparseness of attributes and statistics on attribute values
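
One way to quantify "sparseness of attributes" is the fraction of entries that carry a non-empty value for each attribute. The sketch below assumes entries are available as Python dictionaries; the function name `attribute_fill_rates` and the sample records are invented for illustration.

```python
def attribute_fill_rates(records):
    """Map each attribute name to the fraction of records that fill it.

    A value counts as missing if the key is absent, None, or "".
    """
    attributes = sorted({key for rec in records for key in rec})
    total = len(records)
    rates = {}
    for attr in attributes:
        filled = sum(1 for rec in records if rec.get(attr) not in (None, ""))
        rates[attr] = filled / total if total else 0.0
    return rates


if __name__ == "__main__":
    # Hypothetical entries, not taken from any real database.
    sample = [
        {"name": "geneA", "function": "kinase"},
        {"name": "geneB", "function": None},
        {"name": "geneC"},
        {"name": "geneD", "function": ""},
    ]
    print(attribute_fill_rates(sample))
```

A low fill rate for an attribute suggests that queries relying on it will silently miss most entries.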
What are Sources of its Data?
  • Potential information sources
    • Laboratory instruments
    • Scientific literature
      • Manual entry
      • Natural-language text mining
    • Direct submission from the scientific community
      • Genbank
  • Modification policy
    • DB staff only
    • Submission of new entries by scientific community
    • Update access by scientific community
What DBMS is Employed?
  • None
  • Relational
  • Object oriented
  • Frame knowledge representation system
Distribution / User Access
  • Multiple distribution forms enhance access
  • Browsing access with visualization tools
  • API
  • Portability
What Validation Approaches are Employed?
  • None
  • Declarative consistency constraints
  • Programmatic consistency checking
  • Internal vs external consistency checking
  • What types of systematic errors might DB contain?
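
The difference between declarative constraints and programmatic checking can be sketched with Python's built-in `sqlite3` module. The `gene` table, its columns, and the "length divisible by three" rule are assumptions chosen for this example. The `CHECK` clauses are declarative (the DBMS rejects bad rows at insert time), while `genes_with_partial_codons` is a programmatic check run after the fact to flag entries for review.

```python
import sqlite3


def build_db():
    """Create an in-memory database with declarative constraints."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE gene (
            name      TEXT PRIMARY KEY,
            start_pos INTEGER NOT NULL CHECK (start_pos >= 1),
            end_pos   INTEGER NOT NULL,
            CHECK (end_pos > start_pos)
        )
        """
    )
    return conn


def try_insert(conn, name, start_pos, end_pos):
    """Insert a gene; return False if a declared constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO gene VALUES (?, ?, ?)", (name, start_pos, end_pos)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def genes_with_partial_codons(conn):
    """Programmatic check: genes whose length is not a multiple of 3."""
    rows = conn.execute(
        "SELECT name FROM gene "
        "WHERE (end_pos - start_pos + 1) % 3 != 0 ORDER BY name"
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    db = build_db()
    print(try_insert(db, "geneA", 1, 300))
    print(try_insert(db, "geneB", 500, 400))
    print(try_insert(db, "geneC", 10, 20))
    print(genes_with_partial_codons(db))
```

Both checks here are internal consistency checks; external checking would compare entries against an independent source.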
Database Documentation
  • Schema and its semantics
  • Format
  • API
  • Data acquisition techniques
  • Validation techniques
  • Size of different classes
  • Coverage of subject matter
  • Sparseness of attributes
  • Error rates
Relationship of Database Field to Bioinformatics
  • Scientists generally ignorant of basic DB principles
    • Complex queries vs click-at-a-time access
    • Data model
    • Defined semantics for DB fields
    • Controlled vocabularies
    • Regular syntax for flatfiles
    • Automated consistency checking
  • Most biologists take one programming class
  • Evolution of typical genome database
  • Finer points of DB research off their radar screen
  • Handful of DB researchers work in bioinformatics
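
To make "complex queries vs click-at-a-time access" concrete, the sketch below answers a question in a single declarative query that would otherwise require visiting one entry page per gene. The two-table schema (`gene`, `reaction`) and all rows are hypothetical and exist only to make the example runnable with Python's built-in `sqlite3` module.

```python
import sqlite3


def make_toy_db():
    """Build a tiny in-memory database with invented genes and pathways."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (name TEXT PRIMARY KEY, product TEXT NOT NULL);
        CREATE TABLE reaction (enzyme TEXT NOT NULL, pathway TEXT NOT NULL);
        """
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?)",
        [("geneA", "enzyme1"), ("geneB", "enzyme2"), ("geneC", "enzyme2")],
    )
    conn.executemany(
        "INSERT INTO reaction VALUES (?, ?)",
        [
            ("enzyme1", "glycolysis"),
            ("enzyme2", "glycolysis"),
            ("enzyme2", "tca_cycle"),
        ],
    )
    conn.commit()
    return conn


def genes_per_pathway(conn):
    """Count distinct genes whose product catalyzes a step in each pathway."""
    return conn.execute(
        """
        SELECT r.pathway, COUNT(DISTINCT g.name)
        FROM gene AS g
        JOIN reaction AS r ON r.enzyme = g.product
        GROUP BY r.pathway
        ORDER BY r.pathway
        """
    ).fetchall()


if __name__ == "__main__":
    print(genes_per_pathway(make_toy_db()))
```

The query only works because the fields have defined semantics: `gene.product` and `reaction.enzyme` must draw on the same controlled vocabulary for the join to be meaningful.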
Database Field
  • For many years, the majority of bioinformatics DBs did not employ a DBMS
    • Flatfiles were the rule
    • Scientists want to see the data directly
    • Commercial DBMSs too expensive, too complex
    • DBAs too expensive
  • Most scientists do not understand
    • Differences between BA, MS, PhD in CS
    • CS research vs applications
    • Implications for project planning, funding, bioinformatics research
Recommendation
  • Teaching scientists programming is not enough
  • Teaching scientists how to build a DBMS is irrelevant
  • Teach scientists basic aspects of databases and symbolic computing
    • Database requirements analysis
    • Data models, schema design
    • Knowledge representation, ontologies
    • Formal grammars
    • Complex queries
    • Database interoperability
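
As one illustration of why formal grammars and a regular flatfile syntax matter, the sketch below parses a flatfile whose grammar is small enough to state in full: every line is either `KEY: value` or a `//` record terminator. This format is an assumption invented for the example, not the format of any particular database, and `parse_flatfile` is a hypothetical name. Because the syntax is regular, malformed lines can be rejected with a precise error instead of being silently misread.

```python
import re

# Assumed grammar: record := (KEY ": " value NEWLINE)* "//" NEWLINE
LINE_RE = re.compile(r"^([A-Z][A-Z0-9_]*): (.*)$")


def parse_flatfile(text):
    """Parse text into a list of records mapping each key to its values."""
    records = []
    current = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line == "//":
            records.append(current)
            current = {}
        elif line.strip() == "":
            continue  # blank lines between fields are tolerated
        else:
            match = LINE_RE.match(line)
            if match is None:
                raise ValueError(f"line {lineno}: expected 'KEY: value'")
            # Repeated keys accumulate, so multi-valued fields are kept.
            current.setdefault(match.group(1), []).append(match.group(2))
    if current:
        raise ValueError("last record is missing its '//' terminator")
    return records


if __name__ == "__main__":
    example = "ID: geneA\nSYNONYM: a1\nSYNONYM: a2\n//\nID: geneB\n//\n"
    print(parse_flatfile(example))
```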