## The Longest Common Subsequence Problem and Its Variants

**The Longest Common Subsequence Problem and Its Variants**
Chang-Biao Yang (楊昌彪), Department of Computer Science and Engineering, National Sun Yat-sen University, http://www.nsysu.edu.tw

**Outline**
• Introduction to Bioinformatics
• Traditional LCS Algorithms
• Our Works
  • Block Edit Problems
  • LCS of Run-Length Encoded Strings
  • Merged LCS Problem
  • Mosaic LCS Problem
• Conclusions

**Animal Cell (Nucleus, Cytoplasm, Cell Membrane)**
• DNA is located in the nucleolus, inside the cell nucleus.

**DNA and RNA**
• Nucleotides: adenine (A), guanine (G), cytosine (C), thymine (T), uracil (U)
• DNA (deoxyribonucleic acid): {A, G, C, T} (base pairs: G≡C, A=T)
• RNA (ribonucleic acid): {A, G, C, U} (base pairs: G≡C, A=U, G-U)

**DNA Length**
• The total length of human DNA is about $3 \times 10^9$ (3 billion) base pairs.
• 1% to 1.5% of the DNA sequence is useful.
• Number of human genes: 30,000 to 40,000
  • This is a conclusion of the Human Genome Project (1990-2003).
  • The originally expected number was 100,000.

**DNA, Genes and Proteins**
• Figure labels: DNA (TCCAACGGTGCTGAGGTGCAC), gene, protein
• DNA: the program for cell processes
• Proteins: execute cell processes

**Amino Acids**
• Amino acids are the basic units of proteins; there are 20 kinds.

**Traditional Dynamic Programming (DP) for the Longest Common Subsequence (LCS) Problem**

**The Longest Common Subsequence (LCS) Problem**
• A string: S1 = “TAGTCACG”
• A subsequence of S1: obtained by deleting 0 or more symbols (not necessarily consecutive) from S1, e.g. G, AGC, TATC, AGACG
• Common subsequences of S1 = “TAGTCACG” and S2 = “AGACTGTC”: GG, AGC, AGACG
• Longest common subsequence (LCS): for S1 = TAGTCACG and S2 = AGACTGTC, an LCS is AGACG.

**Applications of LCS**
• The edit distance of two strings or files (number of deletions and insertions)
  • S1: TAGTCACG, S2: AGACTGTC
  • Operations: DMMDDMMIMII (D = delete, M = match, I = insert)
• Spoken word recognition
• Similarity of two biological sequences (DNA or protein)
• Sequence alignment

**The Traditional LCS Algorithm**
• $S_1 = a_1 a_2 \cdots a_m$ and $S_2 = b_1 b_2 \cdots b_n$
• $A_{i,j}$ denotes the length of the longest common subsequence of $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$.
• Dynamic programming:
$$A_{i,j} = \begin{cases} A_{i-1,j-1} + 1 & \text{if } a_i = b_j \\ \max\{A_{i-1,j},\ A_{i,j-1}\} & \text{if } a_i \neq b_j \end{cases}$$
• $A_{0,0} = A_{0,j} = A_{i,0} = 0$ for $1 \le i \le m$, $1 \le j \le n$.
• Time complexity: $O(mn)$
• Figure labels: $a_1 a_2 \cdots a_{i-1} a_i$ and $b_1 b_2 \cdots b_{j-1} b_j$

**LCS and Edit Distance**
• Edit distance = |S1| + |S2| - 2 × |LCS(S1, S2)|
• Example: for S1 = TAGTCACG and S2 = AGACTGTC, whose LCS has length 5, the edit distance is 8 + 8 - 2 × 5 = 6 (three deletions and three insertions).

**Sequence Alignment**
• S1 = TAGTCACG, S2 = AGACTGTC
• Alignment 1: ----TAGTCACG over AGACT-GTC---
• Alignment 2: TAGTCAC-G-- over -AG--ACTGTC
• Which one is better?
• Different gap penalties can be set as parameters for different purposes.

**Gap Penalty for Sequence Alignment**
• [symbol missing in the source] is the gap penalty.
• Suppose [formula missing in the source]

**Example for Sequence Alignment**
• TAGTCAC-G-- aligned with -AG--ACTGTC

**MSA, ET and LCS**
• Figure labels: multiple sequence alignment (MSA), LCS, phylogeny (evolutionary tree, ET)

**Hunt-Szymanski LCS Algorithm**
• By extending the idea of the RSK (Robinson-Schensted-Knuth) algorithm for the longest increasing subsequence problem, the LCS problem can be solved in $O(r \log n)$ time, where $r$ denotes the number of matches.
• This algorithm is faster than the traditional dynamic programming when $r$ is small.

**Matching Pairs in the Hunt-Szymanski Algorithm**
• Input sequences: TAGTCACG and AGACTGTC
• Matching pairs: [figure missing in the source]

**Example for Hunt-Szymanski Algorithm**
• Matches are inserted in row-major order, with the columns of each row visited backward.
• Time complexity: $O(r \log n)$, where $r$ is the number of matches; each match needs $O(\log n)$ time for a binary search.

**Block Edit Problems**
• Operations: block copy, block deletion and block move.
• Shapira and Storer (2002) proved that the problem is NP-hard when recursive block-move operations are allowed.
• Various approximations have been proposed.
• Our assumptions (restricted edit sequence):
• A series of edit operations is performed from left to right on the source string X.
• No two block-edit operations are performed on overlapping regions of X.

**Restricted Edit Sequence**
• Figure panels: (a) general (recursive) edit operations; (b) restricted edit sequence

**Definitions of the Problems (1/2)**
• Let P(o, c) denote a block edit problem:
  • o: a composition of block-edit operations
  • c: the class of cost measures
• The block-copy operations:
  • External copy: copy a substring of X to $W_i$
  • Internal copy: copy a valid substring of $W_{i-1}$ to $W_i$
  • Shifted copy: copy a shifted substring

**Definitions of the Problems (2/2)**
• The cost measures that can be chosen:
  • Constant cost: $p_{copy}$
  • Linear cost: $p_s + k \times p_e$
  • Nested cost: $p_{copy} + d_c(A, B)$
• Three problems are defined in our work:
  • P(EIS,C)
  • P(EI,L)
  • P(EI,N)

**Problem 1: P(EIS,C) (External, Internal, Shifted, Constant)**
• External and internal copies are allowed at constant cost.
• Shifted copies are allowed at constant cost.
• It can be solved by a straightforward DP algorithm in $O(nm^2 (n + m) |\Sigma|)$ time.
• We propose an $O(nm)$-time DP algorithm with
  • $O(n + m^2)$ preprocessing time in the worst case
  • $O(n + m \log m)$ preprocessing time in the average case

**Recurrence DP Formula for P(EIS,C)**
• [recurrence missing in the source]
• Straightforward implementation: $O(nm^2 (n + m) |\Sigma|)$ time.

**Functions and Operations (1)**
• Character operations: [formula missing in the source]
• Block deletions: [formula missing in the source]

**Functions and Operations (2)**
• External copies: [formula missing in the source]
• Internal copies: [formula missing in the source]

**Functions and Operations (3)**
• Shifted copies: [formula missing in the source]

**Preprocessing for P(EIS,C)**
• For external copies:
  • Build a suffix tree $T(X^R \# Y^R \$)$ to find the common substrings between X and Y.
• For internal copies:
  • Build a suffix tree $T(Y^R)$ to find the valid common substrings to be copied from working string $W_i$ to $W_{i+1}$.
• For shifted copies:
  • Compute the differential strings X' and Y' of X and Y.
  • Find the valid common substrings for external / internal copies.

**Preprocessing: Longest Common Prefixes (LCP) and Suffix Trees**

**Problem 2: P(EI,L) (External, Internal, Linear)**
• The cost of each copy or deletion is an initial penalty plus a linear extension penalty.

**Problem 3: P(EI,N) (External, Internal, Nested)**
• The copied strings can be further edited with character-edit operations.

**LCS of Run-Length Encoded Strings**
• Run-length encoding (RLE) compression: aaaaabbbccccdd → a5b3c4d2
• Input:
  • RLE string X: length n, k runs
  • RLE string Y: length m, l runs
• Output:
  • LCS between X and Y.

**Dark & Light Blocks**
• Divide the DP lattice into k × l blocks.
• Dark blocks: matched blocks
• Light blocks: mismatched blocks

**Results of Bunke and Csirik (1995)**
• Lemma 1 (dark block): [formula missing in the source]
• Lemma 2 (light block): [formula missing in the source]
• Only the boundaries of the blocks are needed.

**Results of Liu et al. (2008)**
• A complex modified DP formula which computes the DP lattice row by row.
• Only the bottom boundaries of the blocks are needed.
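
To make the run-length encoded setting concrete, the following is a minimal Python sketch of a baseline only: it decodes both RLE strings and applies the traditional O(mn) recurrence, so it does not exploit the dark/light block structure and is not the method of Bunke and Csirik or of Liu et al. The function names (`rle_encode`, `rle_decode`, `lcs_length`, `rle_lcs_length_baseline`) and the second test string are assumptions made for illustration.

```python
from itertools import groupby


def rle_encode(s):
    """Return the runs of s as (symbol, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(runs):
    """Expand (symbol, run_length) pairs back into a plain string."""
    return "".join(ch * count for ch, count in runs)


def lcs_length(s1, s2):
    """Length of an LCS of s1 and s2 via the traditional DP recurrence."""
    m, n = len(s1), len(s2)
    # a[i][j] = LCS length of s1[:i] and s2[:j]; row 0 and column 0 stay 0.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                a[i][j] = a[i - 1][j - 1] + 1
            else:
                a[i][j] = max(a[i - 1][j], a[i][j - 1])
    return a[m][n]


def rle_lcs_length_baseline(runs_x, runs_y):
    """Baseline (hypothetical helper): decode both inputs, then run the plain DP."""
    return lcs_length(rle_decode(runs_x), rle_decode(runs_y))


if __name__ == "__main__":
    x = rle_encode("aaaaabbbccccdd")
    print(x)  # [('a', 5), ('b', 3), ('c', 4), ('d', 2)]
    y = rle_encode("aabbbbcd")  # illustrative second string, not from the slides
    print(rle_lcs_length_baseline(x, y))  # 7
    print(lcs_length("TAGTCACG", "AGACTGTC"))  # 5
```

On the slide example, `lcs_length("TAGTCACG", "AGACTGTC")` gives 5, which agrees with the LCS AGACG and with an edit distance of 8 + 8 - 2 × 5 = 6. The baseline takes time proportional to the product of the decoded lengths, which is the cost the block-based results aim to reduce by working only on block boundaries.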