Longest Common Rigid Subsequence

Bin Ma and Kaizhong Zhang, Department of Computer Science, University of Western Ontario, Ontario, Canada.

Presentation Transcript


  1. Longest Common Rigid Subsequence
     Bin Ma and Kaizhong Zhang
     Department of Computer Science, University of Western Ontario, Ontario, Canada

  2. (Rigid) Subsequence
     • Subsequence:
         COMBINATORIALPATTERNMATCHING
         CPM
     • Rigid Subsequence:
         0123456789012345678901234567
         COMBINATORIALPATTERNMATCHING
         CPM, (13,7)
       The letters C, P, M sit at positions 0, 13, 20, so the gaps between adjacent letters are 13 and 7.
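
The definition on this slide can be sketched in Python. This is a minimal illustration, not code from the talk: the function name `rigid_occurrences` and the representation of a rigid subsequence as a string of letters plus a tuple of gaps are assumptions chosen to mirror the "CPM, (13,7)" notation.

```python
from itertools import accumulate

def rigid_occurrences(text, letters, gaps):
    """Return every start index where `letters` occurs in `text` with exactly `gaps`.

    Hypothetical helper: gaps[k] is the distance between letters[k] and letters[k + 1].
    """
    if len(gaps) != len(letters) - 1:
        raise ValueError("need exactly one gap between each pair of adjacent letters")
    # Offsets of each letter relative to the first one, e.g. (13, 7) -> [0, 13, 20].
    offsets = [0, *accumulate(gaps)]
    span = offsets[-1]
    return [
        start
        for start in range(len(text) - span)
        if all(text[start + off] == ch for off, ch in zip(offsets, letters))
    ]
```

For the slide's example, `rigid_occurrences("COMBINATORIALPATTERNMATCHING", "CPM", (13, 7))` returns `[0]`, while the same letters with gaps `(1, 1)` do not occur at all, even though CPM is an ordinary subsequence.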

  3. Common (Rigid) Subsequence
     • Longest Common Subsequence (LCS)
         • combinatorial pattern matching
         • longest common rigid subsequence
         LCS: comnienc
     • Longest Common Rigid Subsequence (LCRS)
         • combinatorial pattern matching
         • longest common rigid subsequence
         LCRS: comni, (1,1,3,5)
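
For two strings, a common rigid subsequence corresponds to a set of matching columns in one ungapped alignment (one relative shift) of the strings, so the LCRS can be found by trying every shift. The sketch below assumes this shift-based formulation; the name `lcrs_two` is hypothetical, and spaces are treated as ordinary characters.

```python
def lcrs_two(a, b):
    """Return (letters, gaps) of a longest common rigid subsequence of a and b.

    Sketch only: tries every ungapped alignment, O(len(a) * len(b)) comparisons.
    """
    best = []  # positions in `a` of the largest matching set found so far
    for shift in range(-(len(a) - 1), len(b)):
        # Under this shift, a[i] is aligned with b[i + shift].
        matches = [
            i
            for i in range(len(a))
            if 0 <= i + shift < len(b) and a[i] == b[i + shift]
        ]
        if len(matches) > len(best):
            best = matches
    letters = "".join(a[i] for i in best)
    gaps = tuple(q - p for p, q in zip(best, best[1:]))
    return letters, gaps
```

On the slide's pair, `lcrs_two("combinatorial pattern matching", "longest common rigid subsequence")` returns `("comni", (1, 1, 3, 5))`.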

  4. Previous Results
     • LCS and LCRS of two strings:
         • Solvable in polynomial time.
     • LCS of many strings:
         • Cannot be approximated within ratio n^δ, for some constant δ > 0, in polynomial time unless P = NP (Jiang and Li 1995, SIAM J. Comput.).
         • For random instances, a simple greedy algorithm gives an almost optimal solution with only a small error.
     • LCRS of many strings:
         • Exponential-time algorithms.
         • Our CPM paper addresses its time complexity.

  5. Motivation in Bioinformatics
     • In biochemistry, a motif is a recurring pattern in DNA or protein sequences.
     • A protein motif (SH3 domain binding motif) in J. Biological Chemistry 269:24034-9.
     • Many motifs can be found in the PROSITE database at ExPASy.

  6. Motivation
     • Rigoutsos and Floratos proposed the following problem (Bioinformatics 14:55-67, 1998):
         • Given n strings and a positive number K, find a longest "rigid pattern" (rigid subsequence) that occurs in at least K of the n strings.
     • When K = n, this is LCRS.
     • Exponential-time algorithms were studied.
     • NP-hardness was unknown.

  7. Our Results
     • LCRS is MAX-SNP hard.
     • Therefore, Rigoutsos and Floratos' problem is also MAX-SNP hard.
     • For random instances, there is an algorithm that solves LCRS in quasi-polynomial average running time.
     • The algorithm also works for Rigoutsos and Floratos' problem with simple modifications.

  8. MAX-SNP hard
     • L-reduction from Max-Cut
     [Figure: constructed string made of vertex and edge blocks separated by delimiters]

  9. The construction of each edge
     [Figure: three possible configurations of aaa, aba, bab in an ungapped alignment; they contribute 0, 1, and 1, respectively]

  10. The Algorithm
      • Let S_i be the set of length-i common rigid subsequences.
      • We only need to prove that
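
The slide only defines the sets S_i, so the following level-wise enumeration is an assumption about how they might be computed, not the authors' algorithm. It relies on the fact that every prefix of a common rigid subsequence is itself a common rigid subsequence, so S_{i+1} can be built by extending the members of S_i by one letter. The names `occurs`, `extend_level`, and `common_rigid_levels`, and the encoding of a pattern as a tuple of `(offset, char)` pairs, are hypothetical.

```python
def occurs(pattern, text):
    """True if the rigid pattern, a tuple of (offset, char) pairs, occurs in text."""
    span = pattern[-1][0]
    return any(
        all(text[start + off] == ch for off, ch in pattern)
        for start in range(len(text) - span)
    )

def extend_level(level, strings):
    """Build S_{i+1} from S_i by appending one letter to the right of each pattern."""
    first, rest = strings[0], strings[1:]
    next_level = set()
    for pattern in level:
        span = pattern[-1][0]
        for start in range(len(first) - span):
            if not all(first[start + off] == ch for off, ch in pattern):
                continue
            # Every later position of this occurrence yields a candidate extension.
            for pos in range(start + span + 1, len(first)):
                cand = pattern + ((pos - start, first[pos]),)
                if cand not in next_level and all(occurs(cand, s) for s in rest):
                    next_level.add(cand)
    return next_level

def common_rigid_levels(strings):
    """Return [S_1, S_2, ...], stopping before the first empty level."""
    level = {
        ((0, ch),)
        for ch in set(strings[0])
        if all(ch in s for s in strings[1:])
    }
    levels = []
    while level:
        levels.append(level)
        level = extend_level(level, strings)
    return levels
```

The last non-empty level holds the longest common rigid subsequences. The running time is governed by the total size of the levels, which is why a bound on the sizes of the S_i matters.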

  11. Sketch of Proof
      • For each rigid subsequence in S_i, the probability that it occurs in one random string of length n
      • The probability that it occurs in every input string
      • The total number of length-i rigid subsequences
      • The proof splits into two cases: i <= 2 log n and i > 2 log n.
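
A standard union-bound estimate for the first two quantities can be sketched as follows. It is an illustration under stated assumptions, not necessarily the bound used in the talk: the symbols $m$ (number of input strings) and $\Sigma$ (alphabet) are introduced here, and each input string is assumed to consist of $n$ independent, uniformly random letters from $\Sigma$.

```latex
% p is a fixed rigid subsequence with i letters.
% At one fixed start position, all i letters must match:
\Pr[\,p \text{ occurs at a fixed position}\,] = |\Sigma|^{-i}
% Union bound over at most n start positions in one string:
\Pr[\,p \text{ occurs in one random string}\,] \le n\,|\Sigma|^{-i}
% Independence of the m input strings:
\Pr[\,p \text{ occurs in every input string}\,] \le \bigl(n\,|\Sigma|^{-i}\bigr)^{m}
```

Multiplying the last bound by the number of candidate length-$i$ rigid subsequences gives an upper bound on the expected size of S_i.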

  12. Acknowledgement • Supported by NSERC, PREA and CRC.
