MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe

MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe

Overview • MoBIoS Project • Motivation • The challenge • Established similarity measures • Metric-space distance measure • Disk-based metric tree index • MoBIoS as a DBMS • Application of MoBIoS

MoBIoS Project • Molecular Biological Information System • Project at UT-Austin center for computational biology and bioinformatics. • DBMS based on metric-space indexing techniques, object-relational model of genomic and proteomic data types and a database query language that embodies the semantics of genomic and proteomic data.

Motivation Develop a DBMS to power Biological Information System

The Challenge • Established biological model of similarity measure do not form a metrics. • Scalable disk-based metric-indexes suffer from the Curse of dimensionality

Established Similarity Measure (I) • Sequence Homology • Query Sequence • Database of sequences • Substitution Matrix (PAM / BLOSUM) • Similarity Measure • Global Sequence Alignment (Edit distance) • Local Sequence Alignment (Most important)

Established Similarity Measure (II) • Local Sequence Alignment • A local sequence alignment query asks, given a query sequence S, a database of sequences T and a similarity matrix corresponding to an evolutionary model, return all subsequences of T that are sufficiently similar to a subsequence of S • Main issue: Result is a set of answer. • A metric distance function must return a single value for each pair of argument

Established Similarity Measure (III) • Global Sequence Alignment • Given an alphabet A , a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment for two sequences s and t is to find a strings a and b which are obtained from s and t respectively by inserting spaces either into or at the ends of s and t and whose score computed using M is at a maximum (Similarity measure) over all pairs of such strings obtained from s and t. (example) • Issue: Result maybe negative since substitution matrix is based on log-odd probability. Similarity measure favors greater positive number.

Metric-space Distance measure (I) • Homology Search • Query Sequence: Sub strings of length q (q-grams) • Database of sequences: Metric indexed records of fixed length q (indexed q-grams) strings. • Substitution Matrix (mPAM) • Similarity Measure (distance measure) • Local Alignments is computed from global alignment.

Metric-space Distance measure (II) • mPAM substitution Matrix • Accepted Point Mutation Model. • PAM calculates scores based on frequency in which individual pairs of amino acids substituted for each other. • mPAM instead of calculating frequency of substitutions (PAM), computes expected time between substitution. • mPAM has been validated.(Validation)

Metric-space Distance measure (III) • Computing Local Alignment from Global Alignment (Algorithm) • Offline • Divide database of sequence into sub strings (q-grams) • Build metric-space index structure on q-grams • Online • Divide query sequence into sub strings (q-grams) • Using global alignment as a distance function to match query q-grams.

Disk-based metric-tree index • Phases • Initialization • Searching • Query performance metric • Number of disk I/O ( nodes visited) • Number of distance computation • Options Exploited • M-Tree • Generalized Hyper plane tree • MVP-Tree (optimal)

M-Tree initialization Best case : O(nlogn); worst case: O(n3) Generalized Hyper plane (GH-Tree) initialization Best case : O(nlogn); worst case: O(n2) GH-tree: Bi-direction M-Tree: Bottom-up In practice, both M-Tree and GH-Tree scale linearly Disk-based metric-tree index (initialization)

Disk-based metric-tree index (Searching)

Mckoi (Java RDBMS). Plus metric-space indexing Plus Biological data types Plus biological semantics Life science data store Biological sequence data Mass-spectrometry protein signature MoBIoS as a DBMS (I)

MoBIoS as a DBMS (III) • Language Extension • M-SQL • Data type Extension • Data type for Sequences (DNA,RNA,peptide) • Data type for Mass spectrum • Semantics Extension • Subsequence Operators • Local alignment

MoBIoS as a DBMS (IV) • Semantics Extension • Similarity (metric distance) between data types • mPAM250 • Cosine distance • Lk norms • Keys Extension • Primary key (metrickey) • Index (metric)

Application of MoBIoS (I) • MS/MS Protein Identification • Breakdown protein into fragments called peptide using a protease enzyme • Identify protein by using a mass-spectrometer to measure the mass-charge ratio of the fragments and comparing the experiment result to a database of precomputed spectra.

M-SQL Solution Create table protein_sequences (accesion_id int, sequence peptide, primary metrickey(sequence, mPAM250); Create table digested_sequences (accession_id int, fragment peptide, enzyme varchar, ms_peak int, primary key(enzyme, accession_id); Create index fragment_sequence on digested_sequences (fragment) metric(mPAM250); Create table mass_spectra (accession_id int, enzyme varchar, spectrum spectrum, primary metrickey(spectrum, cosine_distance); Application of MoBIoS(II)

Application of MoBIoS(III) • M-SQL Solution SELECT Prot.accesion_id, Prot.sequence FROM protein_sequences Prot, digested_sequences DS,mass_spectra MS WHERE MS.enzyme = DS.enzyme = E and Cosine_Distance(S, MS.spectrum, range1) and DS.accession_id = MS.accession_id = Prot.accesion_id and DS.ms_peak = P and MPAM250(PS, DS.sequence, range2)

MoBIoS Molecular Biological Information System DBMS specialized for storage, retrieval and mining of biological data Sequence Database and query sequence is divided into q-grams and Database is indexed offline. BLAST Basic Local Alignment Search Tool Utility specialized for retrieval and mining of biological data outside a database Only query sequence is divide and hot-point index is done at query time BLAST vs MoBIoS

MoBIoS Demo • MoBIoS: http://ccvweb.csres.utexas.edu:9080/msfound/ccForm.jsp • PDB : http://www.rcsb.org/pdb/

Conclusion • Biological data is not random and very likely exhibit the intrinsic structure necessary for metric-space indexing to succeed.

References • http://www.cs.utexas.edu/users/mobios/Publications/miranker-mobios-final-03.pdf • http://www.cs.utexas.edu/users/mobios/Publications/mao-bibe-03.pdf • http://www.cs.utexas.edu/users/mobios/ • http://www.mckoi.com/database/

Appendix Return

Appendix I- Metric A metric-space is a set of objects S, with a distance function d, such that given any three objects x, y, z, • Non-Negativity d(x,y) > 0 for x = y; d(x,y) = 0 for x = y • Symmetry d(x,y) = d(y,x) • Triangular inequality d(x,y) + d(y,z) = d(x,y) Return

Appendix II - Sequence • 2 RNA sequences from a DNA strand. Return

Appendix III - PAM Percent Accepted Mutation(PAM) A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence. (e.g PAM250) A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odd probability. Return

Appendix IV – PAM250 • At this evolutionary distance (250 substitutions per hundred residues) Return

Appendix V - BLOSUM Blocks Substitution Matrix (BLOSUM) A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related ( e.g BLOSUM62) A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odd probability Return

Appendix VI – BLOSUM62 • BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical Return

Appendix VII – mPAM250 • Expected time based on 250 PAM distance as a unit. Return

Based on benchmark query set by Smith-Waterman. Graph shows ROC50 values (Receiver Operating Characteristics) Negative x- axis indicate mPAM has better performance Difference between ROC50 values using mPAM and PAM250 Appendix VIII – mPAM Validation Return

Appendix IX - Distance measure Global Sequence Alignment Given an alphabet A , a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment for two sequences s and t is to find a strings a and b which are obtained from s and t respectively by inserting spaces either into or at the ends of s and t and whose score computed using M is at a maximum (Similarity measure) or minimum (distance measure) over all pairs of such strings obtained from s and t. Return

Appendix X – Homology Search Build Index Structure(Offline) • Divide the database sequences into a set of overlapping sub strings of length q (q-grams) with step size 1. • Build a metric-space index D based on global alignment to support constant time lookup of exact match. Homology Search Query (Online) • Divide the query sequence W into overlapping sub string , F = {wi | i =0..| W |-q }, of length q with step size 1. • For each wi in F, run range query Q(wi, r) against database D to find a set of matching q-grams, Ri = f i,j | d( f i,j , wi) <= r, f i,j E D wi E F }, where d is the distance function. • Using a greedy heuristic algorithm to extend and chain all fragments in R0UR1U…Rw-t to deduce the result of homology search based on local alignment for query W Return

Appendix XI - GSA Return

MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe

MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe

Presentation Transcript

SNP Resources: Finding SNPs Discovery and Databases

A Metric-based Framework for Automatic Taxonomy Induction

Darwin’s Voyage of Discovery

College Connection

BVSR

Unit 1 Introduction to DBMS

Biological Rhythms and Sleep

Ch3 – Metric Conversions

Biological Rhythms and Sleep

Chapter 6 Biological Oxidation

Introduction to Database Management Systems (DBMS)

Joint pathology

SCM 330 Ocean Discovery through Technology

EECS 800 Research Seminar Mining Biological Data

Protein – protein interaction

Air Pollution

Space and Time

BIOASSAY TECHNIQUES FOR DRUG DISCOVERY AND DEVELOPMENT

Database Design Specialist

5 Characteristics of a Great Presenter