
Semantic Modeling of Biological Sequences

Semantic Modeling of Biological Sequences. Sudha Ram, Eller Professor, Department of Management Information Systems, Eller School of Management, The University of Arizona. March 5, 2004.

Presentation Transcript


  1. Semantic Modeling of Biological Sequences • Sudha Ram, Eller Professor • Department of Management Information Systems, Eller School of Management, The University of Arizona • March 5, 2004

  2. Road Map • Background • Semantics of DNA sequences and Primary protein structures • Semantics of 3-D protein structures • Summary and Future Work

  3. Background • Human Genome Project (HGP) started in 1990 by the U.S. Department of Energy and the National Institutes of Health • Goal: to sequence the 24 distinct chromosomes comprising the human genome • Completed in April 2003, earlier than expected • Achievements: • Determined the complete sequence of 3 billion DNA subunits, identified all human genes • Stored all the data in databases

  4. Post-Genomic Era “New generalizations and higher order biological laws are being approached but may be obscured by the simple mass of data” --- Morowitz et al., 1987

  5. More Challenges • Usage and analysis of the data require: • Ad hoc and complicated queries • Efficient data browsing and retrieval • Integrated data sources • Effective and user-friendly data presentation • Example query: find all genes that are structurally similar to a given gene and expressed similarly over a specific DNA microarray dataset

  6. Current Databases • Major DNA sequence databases: • GenBank (Gene Bank) • DDBJ (DNA Data Bank of Japan) • EMBL (European Molecular Biology Laboratory) • Other databases: • Different Types • Different Scales • Different Models --Bioinformatics Databases and Systems

  7. Current Data Models • Data models: • Flatfile (ASN.1) • Relational • XML and its extensions (BSML) • Others • Drawbacks?

  8. Research Motivation • Usage and analysis of the data require: • Ad hoc and complicated queries • Efficient data browsing and retrieval • Integrated data sources • Effective and user-friendly data presentation • Existing sequence/structure databases are not able to provide these capabilities: • Flatfile format hides the semantics of the data • Relationships/hierarchies are not clear • They do not support ad hoc and complicated queries

  9. DNA Sequences • Linear Sequences • DNA sequences • Genetic information carrier • Composed of nucleotides • Primary protein sequences • Composed of amino acids

  10. Protein Building Blocks • Proteins are the most important macromolecules in the factory of living cells that perform various biological tasks • A protein is composed of 20 kinds of amino acids, also known as subunits or residues

  11. Protein Structures • Protein 3-D structures • Intermolecular and intramolecular chemical forces force the linear primary sequence to be folded into 3-D structures to reach the minimum energy/most stable state • Structures determine properties or functions

  12. Levels of Protein Structures - I • Primary (Linear): each building block (amino acid) can be represented by a letter (of the English alphabet) • Secondary: the chain of covalently linked amino acids is further organized into regularly repeating patterns by hydrogen bonding

  13. Levels of Protein Structures—II • Tertiary: Alpha helices and beta sheets fold themselves further into a "chain", cross-linking with one another via their side chains. • Quaternary: For proteins with more than one chain, interaction can occur between the chains themselves.

  14. Previous Work • “A sequence is a mapping between a collection of similarly structured records and the positions of an ordering domain” ---- Seshadri et al., 1995 • Different kinds of sequences are just different combinations of an Ordering Domain and a Collection of Records
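
The definition quoted on this slide can be read as a plain mapping from positions to records. The following is a minimal sketch under that reading; the function name `make_sequence` and the sample data are hypothetical and are not taken from the cited paper.

```python
from datetime import date
from typing import Dict, Hashable, Iterable, TypeVar

R = TypeVar("R")  # type of the "similarly structured records"


def make_sequence(positions: Iterable[Hashable], records: Iterable[R]) -> Dict[Hashable, R]:
    """Pair each position of an ordering domain with exactly one record."""
    positions = list(positions)
    records = list(records)
    if len(positions) != len(records):
        raise ValueError("each position needs exactly one record")
    return dict(zip(positions, records))


# Ordering domain = integer positions, records = nucleotide letters.
dna = make_sequence(range(1, 5), "ATGC")

# Ordering domain = calendar dates, records = made-up readings.
readings = make_sequence([date(2004, 3, 5), date(2004, 3, 6)], [20.5, 21.0])
```

Only the ordering domain and the record type change between the two calls, which is the point of the second bullet.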

  15. Time Sequences - I • “For time is just this: number of movements in respect to the ‘before’ and ‘after’” (Aristotle) • We want to capture attributes of the movements • We want to know the order/time of the movements • Time is continuous • Temporal databases deal with the semantics of ordered sequences of data values in the time domain

  16. Time Sequences - II • “Time sequence is basically the sequence of values in the time domain for a single entity instance” --- Segev et al., 1987 • Time sequences can be: • Step-wise constant • Discrete • Continuous

  17. Biological Sequences—III • Basic model for sequence can be adapted:

  18. Other Sequences • Process Sequences • Sequences of processes and subprocesses • Multimedia sequence • Streams of multimedia data • Image • Audio

  19. Why a New Model for Biological Sequences? • Time sequences are continuous; biological sequences are discrete • Biological sequences carry more semantics, such as sequences and their subsequences • In time sequences, some time points have no data; in biological sequences, each position has its own data • Usage of biological data requires that the sequence data be represented and analyzed in different ways

  20. Relation to Gene Ontologies? • An ontology defines biological properties associated with sequence data, whereas we model the semantics of the sequence data itself • No protein structure ontology exists • Both contribute to database integration

  21. What about Relational technology? • Relational technology doesn’t support data structures as complicated as biological sequences • Hierarchy and semantics are hidden in relations • We can never emphasize semantics too much!

  22. DNA Sequences Semantic Model • DNA sequences and primary protein sequences: Ram, S. and Wei, W., Semantic Modeling of Biological Sequences. in Thirteenth Annual Workshop On Information Technology and Systems (WITS'03), Seattle, Washington, December 2003.

  23. Entity Classes - I • ATOMS • Superclass of families of atoms • Collection of atomic components of biological sequence • Domain: Possible components • LINEARORDER • Set of positions (integers) in the sequence • Domain: (1, j) where j is the length of the sequence

  24. Entity Classes - II • SEQUENCES • Ordered list of (ATOM, LINEARORDER) pairs • SUBSEQUENCES • Part of a sequence • Associated with biological activities
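
The four entity classes on the two slides above could be sketched roughly as follows. This is an illustration only: the class names `BioSequence` and `Subsequence`, the restriction of the ATOMS domain to the four DNA letters, and the `activity` field are assumptions, not part of the published model.

```python
from dataclasses import dataclass
from typing import List, Tuple

# ATOMS domain: the possible components. Assumed here to be the DNA letters.
DNA_ATOMS = frozenset("ACGT")


@dataclass(frozen=True)
class BioSequence:
    """SEQUENCES: an ordered list of (ATOM, LINEARORDER) pairs."""
    name: str
    atoms: str

    def __post_init__(self) -> None:
        unknown = set(self.atoms) - DNA_ATOMS
        if unknown:
            raise ValueError(f"atoms outside the domain: {sorted(unknown)}")

    @property
    def linear_order(self) -> range:
        """LINEARORDER: positions 1..j, where j is the sequence length."""
        return range(1, len(self.atoms) + 1)

    def pairs(self) -> List[Tuple[str, int]]:
        return list(zip(self.atoms, self.linear_order))


@dataclass(frozen=True)
class Subsequence:
    """SUBSEQUENCES: part of a sequence, tied to a biological activity."""
    parent: BioSequence
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    activity: str

    def __post_init__(self) -> None:
        if not 1 <= self.start <= self.end <= len(self.parent.atoms):
            raise ValueError("subsequence must lie inside its parent sequence")

    @property
    def atoms(self) -> str:
        return self.parent.atoms[self.start - 1:self.end]


demo = BioSequence("demo", "ATGCGT")
site = Subsequence(demo, 2, 4, activity="hypothetical binding site")
```

Keeping positions 1-based matches the LINEARORDER domain stated on the slide; the conversion to Python's 0-based slicing happens only inside `Subsequence.atoms`.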

  25. New Constructs-I • Sequential Aggregate • It is aggregation of ATOMS and LINEARORDER • It is sequential because order matters • Normal Aggregate • To indicate whole-part relationship • Example: Course and students

  26. New Constructs - II • Fragment • Sequences are segmented • Fragments can overlap
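
One way to picture overlapping fragments is a sliding window over the positions of a sequence. The slide does not say how fragments are produced, so the fixed window size and overlap below are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fragment:
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def fragment(length: int, size: int, overlap: int) -> List[Fragment]:
    """Cut positions 1..length into windows of `size` sharing `overlap` positions."""
    if size < 1 or not 0 <= overlap < size:
        raise ValueError("need size >= 1 and 0 <= overlap < size")
    step = size - overlap
    fragments: List[Fragment] = []
    start = 1
    while start <= length:
        end = min(start + size - 1, length)
        fragments.append(Fragment(start, end))
        if end == length:
            break
        start += step
    return fragments


def share_positions(a: Fragment, b: Fragment) -> bool:
    """True when two fragments have at least one position in common."""
    return a.start <= b.end and b.start <= a.end


pieces = fragment(length=10, size=4, overlap=1)
```

With these arguments, neighbouring fragments share exactly one position, while the first and last fragments share none.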

  27. Relationships • Ternary Sequential Aggregation • Fragment:

  28. Utility of DNA Sequence Model • Semantics of sequence data captured • Ad hoc queries are possible • Example queries: • How many subsequences are fragmented from a specific sequence? • Find a particular sequence and display the segment from the 2nd to the 200th position • Find all the sequences that share one or more specific subsequences
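
The three queries on this slide can be sketched against a tiny in-memory store. The sequence identifiers, the data, and the function names below are all made up for illustration; a real system would run such queries against a database built on the semantic model.

```python
from typing import Dict, List, Tuple

# Hypothetical store: sequence id -> residues.
SEQUENCES: Dict[str, str] = {
    "seq1": "ATGCGTAC",
    "seq2": "GGATGCAA",
    "seq3": "TTTTCCCC",
}

# Hypothetical fragments: (sequence id, start, end), 1-based and inclusive.
FRAGMENTS: List[Tuple[str, int, int]] = [
    ("seq1", 1, 4),
    ("seq1", 3, 8),
    ("seq2", 3, 6),
]


def count_fragments(seq_id: str) -> int:
    """How many subsequences are fragmented from a specific sequence?"""
    return sum(1 for sid, _, _ in FRAGMENTS if sid == seq_id)


def segment(seq_id: str, start: int, end: int) -> str:
    """Display the segment between two 1-based, inclusive positions."""
    return SEQUENCES[seq_id][start - 1:end]


def sequences_sharing(subsequence: str) -> List[str]:
    """Find all sequences that contain a given subsequence."""
    return sorted(sid for sid, atoms in SEQUENCES.items() if subsequence in atoms)
```

For example, `segment("seq1", 2, 5)` plays the role of "display a segment from 2nd to 200th" on a shorter toy sequence.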

  29. Protein Databases • The Protein Data Bank, PDB (http://www.rcsb.org/pdb/), is the only worldwide archive of experimentally determined three-dimensional structures of proteins • Data are stored in flatfiles • This format records the primary and secondary structure of proteins using groups of coordinates. It does not record the tertiary and quaternary structures. No relationships among structures at different levels are captured.

  30. Protein Structure Semantic Model

  31. Entity Classes • ATOMS: This entity class models the chemical atoms (C, H, O, N, etc.) in the protein structure, each of them uniquely identified. • RESIDUES: This entity class represents amino acid subunits, which are the basic building blocks of protein structures. • PRIMARY STRUCTURE • SECONDARY STRUCTURE • TERTIARY STRUCTURE • QUATERNARY STRUCTURE

  32. Relationships—I • Spatial-Aggregate • P: represents a point using x, y and z coordinates in degrees • T is the temperature at which the structure is determined

  33. Relationships—II • Sequential-Aggregate • LL is the list length • X is the position of the residue in the list • Position of any atom has to be less than or equal to the length
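
The constraint on this slide (a position X may not exceed the list length LL) can be checked when a primary structure is assembled. The sketch below is an assumption about how that check might look; the class and function names are hypothetical, and the uniqueness of positions is an extra assumption not stated on the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PositionedResidue:
    residue: str  # e.g. a three-letter amino acid code
    x: int        # X: position of the residue in the list


def build_primary_structure(residues: List[PositionedResidue], ll: int) -> List[str]:
    """Order residues by X, enforcing 1 <= X <= LL (LL is the list length)."""
    seen = set()
    for item in residues:
        if not 1 <= item.x <= ll:
            raise ValueError(f"position {item.x} is outside 1..{ll}")
        if item.x in seen:  # assumed: one residue per position
            raise ValueError(f"position {item.x} is used twice")
        seen.add(item.x)
    return [item.residue for item in sorted(residues, key=lambda r: r.x)]


chain = build_primary_structure(
    [PositionedResidue("GLY", 2), PositionedResidue("MET", 1)], ll=2
)
```

The order of the input list does not matter, because the sequential aggregate is defined by the positions rather than by insertion order.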

  34. Relationships—III • Spatial-Bonding • Represent the strength and length of the chemical forces among atoms • By describing the semantics of these bonds at each level using additional annotations, we can differentiate between these bonds as they apply to different levels of protein structures

  35. Relationships—IV • An example of annotated relationship • For secondary structures • B: Bond • BE: Bond energy • BL: Bond length

  36. Utility of Protein Structure Model • Example queries: • Find the sequence of amino acids for this protein structure • Find a set of forces similar to this, and the resulting 3-D structure • Give me all the hydrogen bonds that contribute to the secondary structure
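
The last query on this slide becomes a simple filter once each bond carries the B, BE and BL annotations together with the structural level it belongs to. The sketch below assumes such a record layout; the field names, the units, and every numeric value are placeholders rather than measured data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Bond:
    kind: str                 # B: bond type, e.g. "hydrogen"
    energy: float             # BE: bond energy (unit assumed: kJ/mol)
    length: float             # BL: bond length (unit assumed: angstrom)
    level: str                # structural level the bond contributes to
    atom_ids: Tuple[int, int]  # identifiers of the two bonded atoms


def bonds_for(bonds: List[Bond], kind: str, level: str) -> List[Bond]:
    """Return the bonds of one type that contribute to one structural level."""
    return [b for b in bonds if b.kind == kind and b.level == level]


# Placeholder values for illustration only.
BONDS = [
    Bond("hydrogen", 20.0, 2.9, "secondary", (12, 47)),
    Bond("hydrogen", 18.0, 3.0, "tertiary", (30, 211)),
    Bond("disulfide", 250.0, 2.05, "tertiary", (88, 140)),
]

secondary_h_bonds = bonds_for(BONDS, kind="hydrogen", level="secondary")
```

Storing the level as an annotation on the bond is what lets the same bond type be distinguished across secondary, tertiary and quaternary structure.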

  37. New Operators based on Semantics • Sequence • Subsequence • Aggregate • Comparison of Sequences and Subsequences -- Identical -- Similar -- Partial • Allen’s Predicates: Before, After, Meets, During, Starts, Finishes, Contains, Overlaps.
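
The eight predicates listed above can be sketched over subsequences given as closed ranges of integer positions. Because positions are discrete, this sketch takes "meets" to mean that one range ends at the position immediately before the other starts; that reading, and the class name `Span`, are assumptions rather than definitions from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A subsequence as a closed range of 1-based positions."""
    start: int
    end: int

    def __post_init__(self) -> None:
        if not 1 <= self.start <= self.end:
            raise ValueError("need 1 <= start <= end")


def before(a: Span, b: Span) -> bool:
    return a.end + 1 < b.start        # a gap separates a from b


def after(a: Span, b: Span) -> bool:
    return before(b, a)


def meets(a: Span, b: Span) -> bool:
    return a.end + 1 == b.start       # adjacent, no gap and no shared position


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.start <= a.end < b.end


def starts(a: Span, b: Span) -> bool:
    return a.start == b.start and a.end < b.end


def finishes(a: Span, b: Span) -> bool:
    return a.end == b.end and a.start > b.start


def during(a: Span, b: Span) -> bool:
    return b.start < a.start and a.end < b.end


def contains(a: Span, b: Span) -> bool:
    return during(b, a)
```

For instance, `meets(Span(1, 3), Span(4, 6))` holds under this discrete reading, while `before` holds only when at least one position lies between the two ranges.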

  38. Future Research • Our ultimate goal is biological sequence database integration • Additional semantic constructs • Semantic reconciliation among databases • Case studies
