
Clustering Sequences in a Metric Space

Clustering Sequences in a Metric Space. The MoBIoS Project. Rui Mao, Daniel P. Miranker, Jacob N. Sarvela and Weijia Xu Department of Computer Sciences, University of Texas Austin, TX 78712, USA {rmao, miranker}@cs.utexas.edu


Presentation Transcript


  1. Clustering Sequences in a Metric Space The MoBIoS Project • Rui Mao, Daniel P. Miranker, Jacob N. Sarvela and Weijia Xu • Department of Computer Sciences, University of Texas • Austin, TX 78712, USA • {rmao, miranker}@cs.utexas.edu • Research supported in part by the Texas Higher Education Coordinating Board, Texas Advanced Research Program.

  2. Immediate Goal: Use Metric Space Indexing to Support Homology Search • Develop a tree-based index structure to speed up homology search. • Maintain the use of an evolutionary model of similarity. • Deliver full Smith-Waterman sensitivity.

  3. Metric Space • A metric space [CPRZ97a] is a pair, M=(D,d), where D is a domain of indexing keys and d is a distance function with the following properties: • d(Ox,Oy) = d(Oy,Ox) (symmetry) • d(Ox,Oy) > 0 for Ox ≠ Oy, and d(Ox,Ox) = 0 (non-negativity) • d(Ox,Oy) <= d(Ox,Oz) + d(Oz,Oy) (triangle inequality)
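The three properties can be checked mechanically on a small sample of keys. The following is a minimal sketch, not part of the original slides: the helper name `is_metric` and the use of Hamming distance on equal-length peptides are illustrative assumptions, chosen only because Hamming distance is a simple, well-known metric.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Check symmetry, non-negativity and the triangle inequality of d
    on a finite sample of points (hypothetical helper, for illustration)."""
    for x, y in product(points, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:
            return False                      # symmetry violated
        if x == y and abs(d(x, y)) > tol:
            return False                      # d(x, x) must be 0
        if x != y and d(x, y) <= 0:
            return False                      # distinct keys need d > 0
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            return False                      # triangle inequality violated
    return True

def hamming(a, b):
    """Number of mismatching positions (assumes equal-length strings)."""
    return sum(ca != cb for ca, cb in zip(a, b))

peptides = ["ACDE", "ACDF", "GCDF", "GHKL"]
print(is_metric(peptides, hamming))                       # True
print(is_metric([0, 1, 2], lambda a, b: (a - b) ** 2))    # False: 4 > 1 + 1
```

A check on a sample can only refute the metric properties, never prove them for the whole domain, but it is a quick way to see why a raw similarity score fails.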

  4. Metric Space Indexing • Metric space indexing exploits intrinsic clustering of the data. • Hierarchical structure • avoids a linear scan of the entire database. • in the best case leads to search time logarithmic in the database size.
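The reason an index can skip parts of the database is the triangle inequality: if the distances from a reference object (a pivot) to the stored objects are precomputed, any object o with |d(q,p) - d(p,o)| > r cannot lie within radius r of the query q. The sketch below shows this pruning rule with a single pivot. It is an illustration under assumptions (Hamming distance as the metric, the hypothetical names `build_pivot_table` and `range_query`), not the tree structure described in the slides, which applies the same rule hierarchically.

```python
def hamming(a, b):
    """Stand-in metric: mismatching positions of equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def build_pivot_table(db, pivot, d):
    """Precompute the distance from the pivot to every database object."""
    return [(d(pivot, o), o) for o in db]

def range_query(table, pivot, q, r, d):
    """Return all objects within distance r of q, plus the number of
    distance computations actually performed at query time."""
    dq = d(q, pivot)
    computed = 1                       # the query-to-pivot distance
    hits = []
    for dpo, o in table:
        if abs(dq - dpo) > r:          # triangle inequality: o is too far
            continue
        computed += 1
        if d(q, o) <= r:
            hits.append(o)
    return hits, computed

db = ["AAAA", "AAAT", "AATT", "TTTT", "GGGG", "CCCC"]
table = build_pivot_table(db, "AAAA", hamming)
hits, computed = range_query(table, "AAAA", "AAAT", 1, hamming)
print(hits, computed)
```

In this toy run three of the six objects are discarded without computing their distance to the query; a hierarchical index discards whole subtrees the same way.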

  5. Challenges • Local alignment does not form a metric. • Local alignment produces a set of answers; a distance function produces a single number. • Popular evolutionary models (PAM) are not metrics. • PAM matrices are based on log-odds • Negative values • PAM matrices are asymmetric • Let Pr(x,y) be the probability that amino acid x mutates to amino acid y. • Pr(x,y) ≠ Pr(y,x) • More similar sequences score higher, not lower • Identical sequences must be distance 0 apart

  6. From PAM to mPAM - Symmetry [Diagram labels: X, Y, Z] • PAM: The PAM matrix was computed from the frequency of one amino acid mutating to another. • mPAM: We model a pair of amino acids, one in each sequence, as having evolved from a common ancestor [Gonnet & Korostensky]. • The probability that amino acid y and amino acid z are from the same ancestor amino acid x is: Pr(y,z) = f(x)Pr(x,y)Pr(x,z)
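The common-ancestor model is symmetric by construction, which can be seen by swapping the two descendants in the formula. The short derivation below assumes that f(x) denotes the background frequency of the ancestor amino acid x; the slide does not define f, so that reading is an assumption.

```latex
\begin{align*}
\Pr(y,z) &= f(x)\,\Pr(x,y)\,\Pr(x,z) \\
         &= f(x)\,\Pr(x,z)\,\Pr(x,y) \\
         &= \Pr(z,y)
\end{align*}
```

Because the product is commutative, the same holds if the expression is summed over all candidate ancestors x, although the slide states only the single-ancestor term.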

  7. From PAM to mPAM – Distance vs. Similarity [Figure: mPAM matrix] • PAM computed log-odds based on the frequency of mutations. • mPAM: Compute the expected time for a particular mutation to occur. • More frequent mutations will occur, on average, in less time.

  8. Computing Local Alignments from an Index • Divide the database into small fixed-size pieces • Build a metric-space index based on global alignment • Divide the query into small fixed-size pieces • For each query piece, use the index to find results based on global alignment. • Like BLAST's hot-spot index, but fully sensitive • Chain the results together • Intuitively like BLAST's extension of hot spots • The best algorithm is the last step of “A Sublinear Algorithm for Approximate Keyword Searching” [Myers94]
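The steps above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: Hamming distance stands in for global alignment under mPAM, a linear scan stands in for the metric-space index, database pieces overlap while query pieces do not, and grouping hits by diagonal (database offset minus query offset) is a deliberately simple substitute for the chaining step of [Myers94].

```python
from collections import defaultdict

def hamming(a, b):
    """Stand-in for a global-alignment distance on equal-length pieces."""
    return sum(x != y for x, y in zip(a, b))

def pieces(seq, k, step):
    """Fixed-size pieces of length k, taken every `step` positions."""
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]

def search(db_seq, query, k=4, radius=1):
    """Find database pieces close to each query piece, then group the
    hits by diagonal so that consistent hits line up for chaining."""
    db_pieces = pieces(db_seq, k, step=1)            # overlapping pieces
    diagonals = defaultdict(list)
    for qpos, qpiece in pieces(query, k, step=k):    # non-overlapping pieces
        for dpos, dpiece in db_pieces:               # linear scan = "index"
            if hamming(qpiece, dpiece) <= radius:
                diagonals[dpos - qpos].append((qpos, dpos))
    return dict(diagonals)

db_seq = "MKTAYIAKQRQISFVK"
print(search(db_seq, "AYIAKQRQ"))   # exact substring of db_seq
print(search(db_seq, "AYLAKQRQ"))   # same region with one substitution
```

Both queries produce two hits on the same diagonal, which is the signal a chaining step would extend into a local alignment.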

  9. Initial Results: M-Tree • M-tree [CPRZ97b] is an open-source metric-space indexing package. • Results for global alignment of yeast peptide sequences of length 10. • Compare the M-tree clustering result with the farthest-first traversal bulk-load clustering result.

  10. For a set of queries, the average fraction of leaves visited during a search (leaves visited divided by the total number of leaves) decreases as the database size grows. This shows that the database is clustered well.

  11. The covering radius of the routing objects at each level of the M-tree decreases while descending the tree. This shows that the database is hierarchically clustered well.

  12. The covering radius of the routing objects at each level of the tree built by farthest-first traversal bulk load decreases while descending the tree. This shows that the database is hierarchically clustered well. The radii here are significantly smaller than those of the M-tree, which suggests that we can build a new index structure that is better than M-tree.
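The slides do not spell out the bulk-load procedure. The sketch below shows the standard greedy farthest-first traversal, under the assumption that it is used to pick the routing objects of one tree level: start from an arbitrary object, then repeatedly add the object farthest from all centers chosen so far. Integers with absolute difference as the metric are a stand-in for peptide pieces under mPAM.

```python
def farthest_first(points, k, d):
    """Greedy farthest-first traversal: pick k centers and report the
    resulting covering radius (largest distance to the nearest center)."""
    centers = [points[0]]                              # arbitrary first center
    dist = [d(p, centers[0]) for p in points]          # distance to nearest center
    while len(centers) < k:
        i = max(range(len(points)), key=dist.__getitem__)   # farthest point
        centers.append(points[i])
        dist = [min(dist[j], d(points[j], points[i]))
                for j in range(len(points))]
    return centers, max(dist)

points = [0, 1, 2, 10, 11, 20]
centers, radius = farthest_first(points, 3, lambda a, b: abs(a - b))
print(centers, radius)   # [0, 20, 10] 2
```

Each added center can only shrink the covering radius, which is consistent with the decreasing radii reported for the bulk-loaded tree.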

  13. Long-term Goal: Biologists Need a New Kind of DBMS • Traditional relational databases: • Data is dynamic • Workload: regular, exact, periodic queries • Billing • Customer service • Transactional inventory, bank accounts • Biological databases: • Data is write-once. • Workload: • Ad-hoc queries based on evolutionary criteria • Data clustering (mining) • Biological data types are non-relational • Biological data types do cluster in metric spaces • Genomic/proteomic sequences • Mass-spectrometer signatures • Molecular models

  14. MoBIoS Architecture (Molecular Biological Information System)

  15. Storage Manager • Metric-space index structure: • Persistent representation • Multiple hierarchical trees • Choice of metric distance functions, including user-defined • Results in: • Efficient clustering of the database • Search time logarithmic in the database size

  16. Query Engine • MoBIoS SQL (M-SQL) • Built-in biological data types: • Sequence • Mass-spectra data • Embodies evolutionary semantics of bioinformatic investigation • Examples: • Homology look-up • Gene fusion experiment

  17. M-SQL Program for MS/MS Protein Identification
      // Return proteins in the intersection of recorded spectra sufficiently
      // similar, range1, to the measured spectra of the first MS, and proteins
      // which have a digested fragment computed to be sufficiently similar in
      // sequence to the sequencing determined by the second MS
      // Database is loaded with genomic and proteomic information
      Create table protein_sequences (accession_id int, sequence peptide, …,
          primary metrickey(sequence, mPAM250));
      Create table digested_sequences (accession_id int, fragment peptide,
          enzyme varchar, ms_peak int…,
          primary key(enzyme, accession_id));
      Create index fragment_sequence on digested_sequences (fragment)
          metric(mPAM250);
      Create table mass_spectra (accession_id int, enzyme varchar,
          spectrum spectrum,
          primary metrickey(spectrum, cosine_distance));
      SELECT Prot.accession_id, Prot.sequence
      FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
      WHERE MS.enzyme = DS.enzyme = E
        and Cosine_Distance(S, MS.spectrum, range1)
        and DS.accession_id = MS.accession_id = Prot.accession_id
        and DS.ms_peak = P
        and MPAM250(PS, DS.sequence, range2)

  18. Mining Engine (design still open) • Primitives for relating clusters to each other • Gene expression • Protein family • Possible syntax

  19. Applications • Homology search • Proteomics • MS/MS and ion-trap MS need both MS signature and sequence data to analyze results • Gene expression • Built-in clustering algorithms • Sequence assembly

  20. Properties of Biological Databases • Data is write-once. • Workload: • Ad-hoc retrieval queries based on evolutionary criteria • Data clustering and categorization (mining) • Many biological data types are non-relational • Genomic/proteomic sequences • Structural and functional annotations to sequences. • Mass-Spectrometer signatures • Molecular Models

  21. Existing Methods • BLAST • Builds an index of the query sequence, then linearly scans the database • BLAT • Builds an index of the database, then searches it based on exact matches of fixed-length segments • SST • Tree-structured index for vector-space objects

  22. From PAM matrix to mPAM • The PAM matrix is one of the most commonly used substitution matrices for computing the similarity between two peptide sequences under an evolutionary model. • The PAM matrix cannot be used directly with metric distance indexing techniques. • Similarity scores do not have the reflexivity property. • There are negative values. • It does not satisfy the triangle inequality. • Figure 2: Log-odds matrix for 250 PAMs (Dayhoff 1978).

  23. Metric-Space Indexing to Speed Homology Search • Split the database and build metric space index structure • Split the query sequence • Search the query segments in the metric indexing database • Chain the search results

  24. Results

  25. Min length: 3 • Max length: 80 • Threshold: ln 10 (2.303) • Segment number: 1M • Trial number: 1M • Bucket number: 100 • Sequential search range: 80

  26. References • [CPRZ97a] P. Ciaccia, M. Patella, F. Rabitti, and P. Zezula. Indexing metric spaces with M-tree. In Atti del Quinto Convegno Nazionale SEBD, Verona, Italy, June 1997. • [CPRZ97b] P. Ciaccia, M. Patella, and P. Zezula. M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces. Proc. VLDB, 1997. • [DSO78] M. O. Dayhoff, R. Schwartz, and B. C. Orcutt. Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, Ed. M. O. Dayhoff, 1978. • [MT] The M-Tree Project Homepage, http://www-db.deis.unibo.it/Mtree/index.html • [SW81] Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.
