New tabulation and dynamic programming based techniques for sequence similarity problems

Szymon Grabowski, Lodz University of Technology, Institute of Applied Computer Science, Łódź, Poland. sgrabow@kis.p.lodz.pl. Sept. 2014

Agenda
• (Naïve) dynamic programming.
• Four Russians.
• Main LCS results.
• Bille & Farach-Colton technique.
• Our improvement of the BFC algorithm.
• Our LCS result with sparse DP.
• Algorithmic applications (Levenshtein distance, LCTS, MerLCS).
• Conclusions & open problems.

Dynamic Programming (DP)
• Everybody knows…
• Quadratic cost for 2 sequences (a cell "in the middle" cannot be computed before the previous rows/columns are known).
• Speedup ideas: tabulation (aka Four Russians), bit-parallelism, sparse dynamic programming, compressing the input sequences.
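For reference, the quadratic baseline can be sketched as follows. This is the textbook LCS-length recurrence kept to two rows of the matrix; the function name `lcs_length` is chosen here for illustration only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b.

    Textbook DP: O(len(a) * len(b)) time, O(len(b)) extra space.
    """
    n = len(b)
    prev = [0] * (n + 1)              # DP row for the previous prefix of a
    for i in range(1, len(a) + 1):
        curr = [0] * (n + 1)          # DP row for a[:i]
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                # A match extends the best solution on the diagonal.
                curr[j] = prev[j - 1] + 1
            else:
                # Otherwise inherit from the upper or the left neighbor.
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[n]
```

Each cell depends on its upper, left, and upper-left neighbors, which is why a cell cannot be filled in before the preceding rows and columns.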

DP made (slightly) faster
If we can process blocks of b × b symbols in O(1) time, we immediately obtain O(mn / b^2) time. We can do it (Masek & Paterson, 1980), e.g., for a binary alphabet and b = log n / 4 → O(mn / log^2 n) time. The idea is to precompute the results for all possible block inputs (short enough strings are guaranteed to repeat) and to represent the DP values in a differential manner.
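The block idea can be illustrated with a small sketch. It is not the Masek & Paterson construction itself: instead of precomputing a table for all possible inputs up front, it memoizes block results lazily with `functools.lru_cache`, and the block size `t` is a free parameter rather than a function of log n. The names `block` and `lcs_four_russians` are hypothetical, and both inputs are assumed to be Python strings (so that snippets are hashable).

```python
from functools import lru_cache


@lru_cache(maxsize=None)              # plays the role of the lookup table
def block(sa, sb, top, left):
    """Solve one block from differentially encoded borders.

    sa, sb : snippets of A (block rows) and B (block columns)
    top    : len(sb) bits, horizontal differences along the row above
    left   : len(sa) bits, vertical differences along the column to the left
    Returns (bottom, right) in the same differential encoding.
    """
    h, w = len(sa), len(sb)
    cell = [[0] * (w + 1) for _ in range(h + 1)]   # values relative to the corner
    for j in range(w):
        cell[0][j + 1] = cell[0][j] + top[j]
    for i in range(h):
        cell[i + 1][0] = cell[i][0] + left[i]
    for i in range(1, h + 1):
        for j in range(1, w + 1):
            if sa[i - 1] == sb[j - 1]:
                cell[i][j] = cell[i - 1][j - 1] + 1
            else:
                cell[i][j] = max(cell[i - 1][j], cell[i][j - 1])
    bottom = tuple(cell[h][j + 1] - cell[h][j] for j in range(w))
    right = tuple(cell[i + 1][w] - cell[i][w] for i in range(h))
    return bottom, right


def lcs_four_russians(a, b, t=2):
    """LCS length computed block by block, with blocks of at most t x t cells."""
    top = [0] * len(b)                # row 0 of the DP matrix is all zeros
    for i0 in range(0, len(a), t):
        sa = a[i0:i0 + t]
        left = (0,) * len(sa)         # column 0 of the DP matrix is all zeros
        new_top = []
        for j0 in range(0, len(b), t):
            sb = b[j0:j0 + t]
            bottom, right = block(sa, sb, tuple(top[j0:j0 + t]), left)
            new_top.extend(bottom)    # bottom border feeds the next stripe
            left = right              # right border feeds the next block
        top = new_top
    # The last row starts at 0, so its differences sum to the LCS length.
    return sum(top)
```

Because adjacent DP cells differ by 0 or 1, each border fits in one bit per cell, which keeps the number of distinct block inputs small enough to tabulate.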

LCS, selected results (time complexity)
• Standard DP: O(mn).
• Tabulation (Masek & Paterson, 1980): O(mn / log^2 n) for a constant alphabet.
• Tabulation (Bille & Farach-Colton, 2008): O(mn (log log n)^2 / log^2 n) for an integer alphabet.
• Bit-parallelism (Allison & Dix, 1986, …): O(mn / w), where w ≥ log n is the machine word size (in bits).
• Sparse DP: Hunt & Szymanski, 1977: O(r log log n), where r is the number of matches; Eppstein, Galil, Giancarlo & Italiano, 1992: O(D log log(min{D, mn / D})), where D ≤ r is the number of dominant matches.
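To make the bit-parallel entry concrete, here is a minimal sketch. It uses the add/subtract update rule commonly attributed to Hyyrö's later formulation, not necessarily the exact Allison & Dix variant, and it relies on Python's unbounded integers in place of machine words, so it illustrates the logic rather than the O(mn / w) cost. The name `lcs_bit_parallel` is hypothetical.

```python
def lcs_bit_parallel(a, b):
    """LCS length via a bit-vector encoding of one DP column."""
    m = len(a)
    mask = (1 << m) - 1               # keep only the m low bits
    match = {}                        # symbol -> bitmask of its positions in a
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)
    v = mask                          # all ones: no LCS increments yet
    for ch in b:
        u = v & match.get(ch, 0)
        v = ((v + u) | (v - u)) & mask
    # Each zero bit in v marks a row where the DP value increases by one.
    return m - bin(v).count("1")
```

With real machine words, the vector is split into ⌈m / w⌉ words and the carry of the addition is propagated between them.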

LCS, selected results, cont’d
• Sparse DP: Sakai, 2012: O(m + min{D, p(m-q)} + n), where p = LCS(A, B), q = LCS(A[1…m], B).
• LZ78-compressed input: Crochemore, Landau & Ziv-Ukelson, 2003: O(hmn / log n), for a constant alphabet, where h ≤ 1 is the entropy of the inputs (for a binary alphabet).
• RLE-compressed input: several results, incl. Liu, Wang & Lee, 2008: O(min{nl, km}), where l, m are RLE-compressed sequence lengths.
• SLP-compressed input: Gawrychowski, 2012: O(kn sqrt(log(n / k))), where k is the total length of the SLP-compressed sequences.

The technique of Bille & Farach-Colton
For an integer alphabet of size σ, the Masek & Paterson result can easily be modified to run in O(mn log^2 σ / log^2 n) time. This is fine for small σ, but not if σ = n^c, c > 0. Bille & Farach-Colton use alphabet mapping in superblocks: use superblocks of size, e.g., log^3 n × log^3 n and divide each superblock into blocks of size (log n / log log n) × (log n / log log n).

BFC, cont’d
That is, for the current text snippet from A of length log^3 n, extract up to log^3 n distinct symbols and encode the current snippet of A and the current snippet of B accordingly (one extra symbol for "something else" is needed in snippet B). O(log log n) bits per encoded symbol are then enough, the overall mapping time is negligible (a BST can be used, with a log(superblock size) factor per symbol), and the total time is O(mn (log log n)^2 / log^2 n).

BFC, alphabet mapping example
Blocks of size 3 × 3, superblocks of size 9 × 9.
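The alphabet mapping inside one superblock can be sketched as below. This is an illustration under assumptions: a Python dict stands in for the BST mentioned above, and the function name `remap_superblock` is hypothetical.

```python
def remap_superblock(snippet_a, snippet_b):
    """Re-encode two snippets over a small local alphabet.

    Symbols of snippet_a get codes 0, 1, 2, ... in order of first occurrence.
    Symbols of snippet_b that do not occur in snippet_a all share one extra
    code, since they cannot match anything in snippet_a anyway.
    Returns (encoded_a, encoded_b, local_alphabet_size).
    """
    codes = {}
    for sym in snippet_a:
        if sym not in codes:
            codes[sym] = len(codes)
    other = len(codes)                # the "something else" symbol
    encoded_a = [codes[s] for s in snippet_a]
    encoded_b = [codes.get(s, other) for s in snippet_b]
    return encoded_a, encoded_b, other + 1
```

For example, `remap_superblock([1000, 7, 1000, 42], [42, 5, 7, 99])` yields `([0, 1, 0, 2], [2, 3, 1, 3], 4)`: the symbols 5 and 99 collapse into the extra code 3, and every match between the two snippets is preserved.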

Our technique (Alg 1)
Use the BFC alphabet mapping in superblocks. But use many LUTs (instead of 1), yet with modified input. One LUT per horizontal stripe (of length n).
• The LUT input:
  • snippet of A,
  • left block border (1 bit per cell),
  • upper block border (1 bit per cell).
• No snippet of B as part of the input, as it is fixed for a given LUT! (Re-use LUTs for repeating snippets of B.)
• Thanks to it, we work on rectangular (not square) "portrait"-oriented blocks of size (log n / log log n) × (log n).
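A back-of-the-envelope count of the blocks, assuming that each block is resolved by one O(1) LUT access and that LUT construction and alphabet mapping are lower-order terms (both assumptions are made here for illustration and are not stated on this slide):

```latex
\#\text{blocks}
  \;=\; \frac{mn}{\dfrac{\log n}{\log\log n}\cdot \log n}
  \;=\; \frac{mn \,\log\log n}{\log^{2} n}
```

Under these assumptions the block count suggests a saving of a log log n factor relative to the O(mn (log log n)^2 / log^2 n) bound obtained with square blocks of side log n / log log n.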

One horizontal stripe (4 blocks of 5 × 5)
[Figure: a stripe of the DP matrix, with axes labelled seq A and seq B.] Red arrows: explicitly stored LCS values; black arrows: diff-encoded LCS values. 05550 and 34023: text snippets encoded with reference to a superblock (not shown). The diagonally shaded cells are the block output cells.

LCS, first result (Alg 1)

Output-dependent algorithm
We work in blocks of (b+1) × (b+1), but divide them into sparse ones, which have ≤ K matches, and dense ones, with > K matches. Key observation: knowing the top row and the leftmost column of a block, plus the locations of all matches in it, is enough to compute this block. That is, the text snippets are not needed!

Where sparse DP meets tabulation
• A sparse block input:
  • top row: b bits (diff encoding),
  • leftmost column: b bits (diff encoding),
  • match locations: each in log(b^2) bits, totalling O(K log b) bits.
• (Output: even less.)
Hence, if K log b + b = O(log n) (with a small enough constant), we can use a LUT for all sparse blocks and compute each of them in constant time.
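The key observation can be checked with a short sketch that fills a block from its differentially encoded borders and the match positions only, never looking at the text snippets. For simplicity the borders are kept outside the h × w block here (rather than inside a (b+1) × (b+1) block), and the name `block_from_matches` is hypothetical.

```python
def block_from_matches(top, left, matches):
    """Compute a block's output borders without the text snippets.

    top     : horizontal differences (0/1) along the row above the block
    left    : vertical differences (0/1) along the column left of the block
    matches : set of (i, j) pairs, 1-based within the block, where the
              symbols of the two sequences are equal
    Returns (bottom, right) as tuples of 0/1 differences.
    """
    h, w = len(left), len(top)
    cell = [[0] * (w + 1) for _ in range(h + 1)]   # values relative to the corner
    for j in range(w):
        cell[0][j + 1] = cell[0][j] + top[j]
    for i in range(h):
        cell[i + 1][0] = cell[i][0] + left[i]
    for i in range(1, h + 1):
        for j in range(1, w + 1):
            if (i, j) in matches:
                cell[i][j] = cell[i - 1][j - 1] + 1
            else:
                cell[i][j] = max(cell[i - 1][j], cell[i][j - 1])
    bottom = tuple(cell[h][j + 1] - cell[h][j] for j in range(w))
    right = tuple(cell[i + 1][w] - cell[i][w] for i in range(h))
    return bottom, right
```

Since the triple (top, left, matches) fully determines the output, it can serve as the LUT key; for a sparse block it takes about 2b + K log(b^2) bits.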

Dense blocks
Dense blocks are partitioned into smaller blocks, which will then be processed by our technique from Alg 1. The smaller block sizes are: (log n / log log n) × (b).

If the fraction of dense blocks in the matrix is 0 < f_d ≤ 1, then the total time complexity (w/o preprocessing!) is: [formula not preserved in this transcript].
Parameter choice: b = O(log n) (otherwise the LUT build costs will dominate), but also b = ω(log n / sqrt(log log n)) (otherwise this algorithm will never beat Alg 1). This implies K = Θ(log n / log log n), with an appropriate constant.
For a small enough r (= total number of matches in the matrix) we may get O(mn / log^2 n) from the above formula; alas, in the preprocessing we have to find and encode all matches in all sparse blocks, in O(n + r) time.

LCS, second result (Alg 2)

Alg 2 niche
• Considering the results of:
  • Eppstein et al., 1992,
  • Sakai, 2012,
  • Alg 1,
• we obtain the following niche in which Alg 2 is the winner: [formula not preserved in this transcript].

Simple generalization of Th. 1 and 2

Longest common transposition-invariant subsequence (LCTS)
LCTS = LCS under the best key transposition. (In music, a transposition shifts a sequence of notes, i.e., pitches, up or down by a constant interval.)
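The definition can be turned into a brute-force sketch: try every transposition that creates at least one match and keep the best LCS. This is only a baseline for illustration, with plain quadratic DP per transposition; sequences are assumed to be lists of integer pitches, and the names `lcs_length` and `lcts_length` are hypothetical.

```python
def lcs_length(a, b):
    """Textbook O(len(a) * len(b)) LCS length."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcts_length(a, b):
    """Best LCS length over all transpositions t, comparing a[i] + t with b[j]."""
    if not a or not b:
        return 0
    # Only differences b[j] - a[i] can produce a match, so only they are tried.
    transpositions = {y - x for x in a for y in b}
    return max(lcs_length([x + t for x in a], b) for t in transpositions)
```

For instance, `lcts_length([60, 62, 64], [65, 67, 69])` is 3, reached by shifting the first sequence up by 5.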

Navarro, Grabowski, Mäkinen, Deorowicz, 2005; Deorowicz, 2006 → apply the BFC technique for each transposition.
New algorithm: let us call the transpositions with at least mn log log n / σ matches dense, and the others sparse. Apply Alg 1 to the dense transpositions and Alg 2 to the sparse ones.
Overall time: [formula not preserved in this transcript].
[Table: for LCTS, known results and a new one; contents not preserved in this transcript.]

Merged LCS (MerLCS)
A bioinformatics problem on 3 sequences: given sequences A, B and P, return a longest sequence T that is a subsequence of P and can be split into two subsequences T’ and T’’ such that T’ is a subsequence of A and T’’ is a subsequence of B. |A| = n, |B| = m, |P| = u.
Known results:
• Peng, Yang, Huang, Tseng & Hor, 2010: O(lmn) time, where l ≤ n is the result length.
• Deorowicz & Danek, 2013: O(⌈u / w⌉ mn log w) time.
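A plain cubic DP that follows directly from the problem definition is sketched below. It is not the algorithm of any of the cited papers; the recurrence is derived here from the definition, and the name `merlcs_length` is hypothetical.

```python
def merlcs_length(a, b, p):
    """Length of the merged LCS of a and b with respect to p.

    M[i][j][k] is the answer for the prefixes a[:i], b[:j], p[:k].
    Runs in O(len(a) * len(b) * len(p)) time and space.
    """
    n, m, u = len(a), len(b), len(p)
    M = [[[0] * (u + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(1, u + 1):
                best = M[i][j][k - 1]                 # skip p[k-1]
                if i > 0:
                    best = max(best, M[i - 1][j][k])  # skip a[i-1]
                    if a[i - 1] == p[k - 1]:          # a[i-1] joins T'
                        best = max(best, M[i - 1][j][k - 1] + 1)
                if j > 0:
                    best = max(best, M[i][j - 1][k])  # skip b[j-1]
                    if b[j - 1] == p[k - 1]:          # b[j-1] joins T''
                        best = max(best, M[i][j - 1][k - 1] + 1)
                M[i][j][k] = best
    return M[n][m][u]
```

For A = "ab", B = "cd", P = "acbd" the result is 4: T = "acbd" splits into T’ = "ab" and T’’ = "cd".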

Our result for MerLCS
DP matrix property: Deorowicz and Danek noticed that M(i, j, k) is equal to or larger by 1 than any of its three neighbors: M(i – 1, j, k), M(i, j – 1, k), M(i, j, k – 1).
We generalize our result on 2 sequences to 3 sequences (input: 3 text snippets plus 3 two-dimensional walls instead of one-dimensional borders!) to obtain O(mnu / log^(3/2) n) time for MerLCS, if u = Ω(n^c) for some c > 0.

Conclusions
• Tabulation (= Four Russians) is a classic DP-boosting technique. Interestingly, we managed to (slightly) improve its application to the LCS / edit distance problem.
• Applying tabulation may be even better for a sparse matrix.
• These techniques also work for a few problems other than LCS and edit distance.

Open problems
• Can we improve the tabulation-based result on compressible sequences?
• Can we adapt our technique(s) to problems in which the conditions from Lemma 3 (or Lemma 7, involving 3 sequences) are relaxed, that is, consecutive DP cells may (sometimes) differ by more than a constant? Exemplary problem: SEQ-EC-LCS (Chen & Chao, 2011; Deorowicz & Grabowski, 2014).
