
MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe


Presentation Transcript


  1. MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe

  2. Overview • MoBIoS Project • Motivation • The challenge • Established similarity measures • Metric-space distance measure • Disk-based metric tree index • MoBIoS as a DBMS • Application of MoBIoS

  3. MoBIoS Project • Molecular Biological Information System • Project at the UT-Austin Center for Computational Biology and Bioinformatics. • DBMS based on metric-space indexing techniques, an object-relational model of genomic and proteomic data types, and a database query language that embodies the semantics of genomic and proteomic data.

  4. Motivation • Develop a DBMS to power biological information systems.

  5. The Challenge • Established biological similarity measures do not form a metric. • Scalable disk-based metric indexes suffer from the curse of dimensionality.

  6. Established Similarity Measure (I) • Sequence Homology • Query Sequence • Database of sequences • Substitution Matrix (PAM / BLOSUM) • Similarity Measure • Global Sequence Alignment (Edit distance) • Local Sequence Alignment (Most important)

  7. Established Similarity Measure (II) • Local Sequence Alignment • A local sequence alignment query asks: given a query sequence S, a database of sequences T and a similarity matrix corresponding to an evolutionary model, return all subsequences of T that are sufficiently similar to a subsequence of S. • Main issue: the result is a set of answers. • A metric distance function must return a single value for each pair of arguments.

  8. Established Similarity Measure (III) • Global Sequence Alignment • Given an alphabet A and a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment of two sequences s and t is the problem of finding strings a and b, obtained from s and t respectively by inserting spaces into or at the ends of s and t, whose score computed using M is a maximum (similarity measure) over all pairs of such strings obtained from s and t. (example) • Issue: the result may be negative, since the substitution matrix is based on log-odds probabilities. A similarity measure favors larger positive numbers.

  9. Metric-space Distance measure (I) • Homology Search • Query Sequence: substrings of length q (q-grams) • Database of sequences: metric-indexed records of fixed-length strings (indexed q-grams). • Substitution Matrix (mPAM) • Similarity Measure (distance measure) • Local alignment is computed from global alignment.

  10. Metric-space Distance measure (II) • mPAM substitution Matrix • Accepted Point Mutation Model. • PAM calculates scores based on the frequency with which individual pairs of amino acids substitute for each other. • mPAM, instead of calculating the frequency of substitutions (as PAM does), computes the expected time between substitutions. • mPAM has been validated. (Validation)

  11. Metric-space Distance measure (III) • Computing Local Alignment from Global Alignment (Algorithm) • Offline • Divide the database of sequences into substrings (q-grams) • Build a metric-space index structure on the q-grams • Online • Divide the query sequence into substrings (q-grams) • Use global alignment as the distance function to match the query q-grams against the index.

  12. Disk-based metric-tree index • Phases • Initialization • Searching • Query performance metric • Number of disk I/Os (nodes visited) • Number of distance computations • Options Exploited • M-Tree • Generalized hyperplane tree (GH-Tree) • MVP-Tree (optimal)

  13. Disk-based metric-tree index (Initialization) • M-Tree initialization (bottom-up) • Best case: O(n log n); worst case: O(n^3) • Generalized hyperplane tree (GH-Tree) initialization (bi-directional) • Best case: O(n log n); worst case: O(n^2) • In practice, both M-Tree and GH-Tree scale linearly.

  14. Disk-based metric-tree index (Searching)
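
The searching slide has no transcribed content, so the following is only an illustration of the general idea that metric indexes rely on: the triangle inequality lets a range query discard objects without computing their distance to the query. It is a minimal single-pivot filter, not the M-Tree, GH-Tree or MVP-Tree used by MoBIoS. The class name `PivotIndex`, the Hamming distance and the sample strings are all assumptions made for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class PivotIndex:
    """Single-pivot filter: a toy stand-in for a metric-tree index."""

    def __init__(self, objects, d, pivot):
        self.d = d
        self.pivot = pivot
        # Initialization phase: one distance computation per object.
        self.entries = [(obj, d(pivot, obj)) for obj in objects]

    def range_search(self, query, r):
        """Return (hits, number of distance computations performed)."""
        d_query_pivot = self.d(query, self.pivot)
        hits, computed = [], 1
        for obj, d_pivot_obj in self.entries:
            # Triangle inequality: d(q, o) >= |d(q, p) - d(p, o)|.
            # If that lower bound already exceeds r, o cannot match.
            if abs(d_query_pivot - d_pivot_obj) > r:
                continue
            computed += 1
            if self.d(query, obj) <= r:
                hits.append(obj)
        return hits, computed


index = PivotIndex(["AAAA", "AAAT", "TTTT", "ACGT"], hamming, pivot="AAAA")
hits, computed = index.range_search("AAAT", r=1)
print(hits, computed)
```

Counting `computed` mirrors the "number of distance computations" performance metric named on the previous slide; a tree-structured index applies the same pruning test to whole subtrees rather than to single objects.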

  15. MoBIoS as a DBMS (I) • Mckoi (Java RDBMS) • Plus metric-space indexing • Plus biological data types • Plus biological semantics • Life science data store • Biological sequence data • Mass-spectrometry protein signature

  16. MoBIoS as a DBMS (III) • Language Extension • M-SQL • Data type Extension • Data type for Sequences (DNA,RNA,peptide) • Data type for Mass spectrum • Semantics Extension • Subsequence Operators • Local alignment

  17. MoBIoS as a DBMS (IV) • Semantics Extension • Similarity (metric distance) between data types • mPAM250 • Cosine distance • Lk norms • Keys Extension • Primary key (metrickey) • Index (metric)
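
Two of the distance functions listed above can be written in a few lines. This is a sketch under stated assumptions: spectra are represented as plain lists of non-zero numeric intensities, and cosine distance is taken as one minus the cosine similarity, which is one common definition. The slides do not say which variant MoBIoS uses, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def lk_distance(u, v, k=2):
    """Lk (Minkowski) distance; k=1 is Manhattan, k=2 is Euclidean."""
    return sum(abs(a - b) ** k for a, b in zip(u, v)) ** (1.0 / k)


print(lk_distance([0, 0], [3, 4], k=2))   # Euclidean distance
print(lk_distance([0, 0], [3, 4], k=1))   # Manhattan distance
print(cosine_distance([1, 0], [0, 1]))    # orthogonal vectors
```

Note that one minus the cosine similarity is not guaranteed to satisfy the triangle inequality; the angle between the vectors (the arccosine of the similarity) is the usual choice when a true metric is required.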

  18. Application of MoBIoS (I) • MS/MS Protein Identification • Break the protein down into fragments called peptides using a protease enzyme. • Identify the protein by using a mass spectrometer to measure the mass-to-charge ratio of the fragments and comparing the experimental result to a database of precomputed spectra.

  19. Application of MoBIoS (II) • M-SQL Solution CREATE TABLE protein_sequences (accession_id int, sequence peptide, primary metrickey(sequence, mPAM250)); CREATE TABLE digested_sequences (accession_id int, fragment peptide, enzyme varchar, ms_peak int, primary key(enzyme, accession_id)); CREATE INDEX fragment_sequence ON digested_sequences (fragment) metric(mPAM250); CREATE TABLE mass_spectra (accession_id int, enzyme varchar, spectrum spectrum, primary metrickey(spectrum, cosine_distance));

  20. Application of MoBIoS (III) • M-SQL Solution SELECT Prot.accession_id, Prot.sequence FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS WHERE MS.enzyme = DS.enzyme AND DS.enzyme = E AND Cosine_Distance(S, MS.spectrum, range1) AND DS.accession_id = MS.accession_id AND MS.accession_id = Prot.accession_id AND DS.ms_peak = P AND MPAM250(PS, DS.fragment, range2)

  21. BLAST vs MoBIoS • MoBIoS (Molecular Biological Information System) • DBMS specialized for storage, retrieval and mining of biological data. • Both the sequence database and the query sequence are divided into q-grams, and the database is indexed offline. • BLAST (Basic Local Alignment Search Tool) • Utility specialized for retrieval and mining of biological data outside a database. • Only the query sequence is divided, and the hot-point index is built at query time.

  22. MoBIoS Demo • MoBIoS: http://ccvweb.csres.utexas.edu:9080/msfound/ccForm.jsp • PDB : http://www.rcsb.org/pdb/

  23. Conclusion • Biological data is not random and very likely exhibits the intrinsic structure necessary for metric-space indexing to succeed.

  24. References • http://www.cs.utexas.edu/users/mobios/Publications/miranker-mobios-final-03.pdf • http://www.cs.utexas.edu/users/mobios/Publications/mao-bibe-03.pdf • http://www.cs.utexas.edu/users/mobios/ • http://www.mckoi.com/database/

  25. Appendix

  26. Appendix I - Metric A metric space is a set of objects S with a distance function d such that, for any three objects x, y, z: • Non-negativity: d(x,y) > 0 for x ≠ y; d(x,y) = 0 for x = y • Symmetry: d(x,y) = d(y,x) • Triangle inequality: d(x,y) + d(y,z) ≥ d(x,z)
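
The three axioms can be checked by brute force on a small sample, which is a handy sanity test when deciding whether a candidate distance function is usable with a metric index. The sketch below is illustrative only: `is_metric` and the sample data are assumptions, and passing on a sample does not prove the axioms hold in general.

```python
from itertools import product


def is_metric(objects, d, tol=1e-12):
    """Brute-force check of the metric axioms on a finite sample."""
    for x, y in product(objects, repeat=2):
        if x == y and abs(d(x, y)) > tol:
            return False  # d(x, x) must be 0
        if x != y and d(x, y) <= 0:
            return False  # distinct objects need a positive distance
        if abs(d(x, y) - d(y, x)) > tol:
            return False  # symmetry
    for x, y, z in product(objects, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:
            return False  # triangle inequality
    return True


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def squared_difference(a, b):
    """Not a metric: squaring breaks the triangle inequality."""
    return (a - b) ** 2


print(is_metric(["ACGT", "ACGA", "TCGA", "TTTT"], hamming))
print(is_metric([0, 1, 2], squared_difference))
```

The squared difference fails because d(0,1) + d(1,2) = 2 is smaller than d(0,2) = 4, the same kind of violation that rules out raw similarity scores as index keys.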

  27. Appendix II - Sequence • 2 RNA sequences from a DNA strand.

  28. Appendix III - PAM • Percent Accepted Mutation (PAM) • A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence (e.g. PAM250). • A unit to quantify the amount of evolutionary change in a protein sequence. • Based on log-odds probability.

  29. Appendix IV – PAM250 • At this evolutionary distance (250 substitutions per hundred residues)

  30. Appendix V - BLOSUM • Blocks Substitution Matrix (BLOSUM) • A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins (e.g. BLOSUM62). • A unit to quantify the amount of evolutionary change in a protein sequence. • Based on log-odds probability.

  31. Appendix VI – BLOSUM62 • BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical

  32. Appendix VII – mPAM250 • Expected time based on 250 PAM distance as a unit.

  33. Appendix VIII – mPAM Validation • Based on benchmark query set by Smith-Waterman. • Graph shows ROC50 values (Receiver Operating Characteristic): the difference between ROC50 values using mPAM and PAM250. • Negative x-axis values indicate that mPAM has better performance.

  34. Appendix IX - Distance measure • Global Sequence Alignment • Given an alphabet A and a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment of two sequences s and t is the problem of finding strings a and b, obtained from s and t respectively by inserting spaces into or at the ends of s and t, whose score computed using M is a maximum (similarity measure) or a minimum (distance measure) over all pairs of such strings obtained from s and t.
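
The minimizing form of global alignment can be computed with the standard dynamic-programming recurrence. The sketch below is an illustration, not the MoBIoS implementation: `sub_cost` stands in for a substitution matrix such as mPAM250, whose actual values are not given in the slides, and the unit costs used in the example are an assumption that reduces the computation to plain edit distance.

```python
def global_alignment_distance(s, t, sub_cost, gap_cost=1):
    """Minimum total cost of aligning all of s with all of t."""
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        dp[i][0] = i * gap_cost
    for j in range(1, cols):
        dp[0][j] = j * gap_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # match or substitute
                dp[i - 1][j] + gap_cost,                          # space inserted in t
                dp[i][j - 1] + gap_cost,                          # space inserted in s
            )
    return dp[-1][-1]


def unit_cost(a, b):
    """Hypothetical cost table: 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


print(global_alignment_distance("kitten", "sitting", unit_cost))  # 3
```

With unit costs the result is a metric. For a real substitution table, the costs themselves must satisfy the metric axioms for the alignment distance to be usable as an index key, which is the role mPAM plays in the slides.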

  35. Appendix X – Homology Search • Build Index Structure (Offline) • Divide the database sequences into a set of overlapping substrings of length q (q-grams) with step size 1. • Build a metric-space index D based on global alignment to support constant-time lookup of exact matches. • Homology Search Query (Online) • Divide the query sequence W into overlapping substrings of length q with step size 1: F = {w_i | i = 0..|W|-q}. • For each w_i in F, run the range query Q(w_i, r) against database D to find the set of matching q-grams R_i = {f_{i,j} | d(f_{i,j}, w_i) ≤ r, f_{i,j} ∈ D, w_i ∈ F}, where d is the distance function. • Use a greedy heuristic algorithm to extend and chain all fragments in R_0 ∪ R_1 ∪ … ∪ R_{|W|-q} to deduce the result of the homology search based on local alignment for query W.
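
The offline and online steps above can be sketched end to end on toy data. Several simplifications are assumptions made for the example: the "index" is a flat list scanned linearly rather than a metric tree, Hamming distance stands in for the global-alignment distance, and the final greedy extend-and-chain step is omitted. All function names are hypothetical.

```python
def qgrams(seq, q):
    """Overlapping substrings of length q with step size 1."""
    return [seq[i:i + q] for i in range(len(seq) - q + 1)]


def hamming(a, b):
    """Stand-in distance for equal-length q-grams."""
    return sum(x != y for x, y in zip(a, b))


def build_database(sequences, q):
    """Offline step: one (sequence id, offset, q-gram) record per q-gram."""
    return [
        (seq_id, offset, gram)
        for seq_id, seq in sequences.items()
        for offset, gram in enumerate(qgrams(seq, q))
    ]


def range_query(db, w, r, d=hamming):
    """Q(w, r): every database q-gram within distance r of w."""
    return [(seq_id, offset, gram) for seq_id, offset, gram in db if d(gram, w) <= r]


def search(db, query, q, r):
    """Online step: R_i for each query q-gram w_i, keyed by i."""
    return {i: range_query(db, w, r) for i, w in enumerate(qgrams(query, q))}


db = build_database({"s1": "ACGTAC", "s2": "TTGTAA"}, q=3)
for i, matches in search(db, "CGTA", q=3, r=1).items():
    print(i, matches)
```

Keeping the offsets in each record is what would let a chaining step join hits from consecutive query q-grams that also sit at consecutive positions in the same database sequence.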

  36. Appendix XI - GSA
