The Longest Common Subsequence Problem and Its Variants

Presentation Transcript

  1. The Longest Common Subsequence Problem and Its Variants • Chang-Biau Yang (楊昌彪), Department of Computer Science and Engineering, National Sun Yat-sen University • http://www.nsysu.edu.tw

  2. Outline • Introduction to Bioinformatics • Traditional LCS Algorithms • Our Works • Block Edit Problems • LCS of Run-Length Encoded Strings • Merged LCS Problem • Mosaic LCS Problem • Conclusions

  3. Introduction to Bioinformatics

  4. Animal Cell (nucleus, cytoplasm, cell membrane) • DNA is located in the nucleolus, inside the cell nucleus.

  5. DNA and RNA • Nucleotides: adenine (A), guanine (G), cytosine (C), thymine (T), uracil (U) • DNA (deoxyribonucleic acid): {A, G, C, T} (base pairs: G≡C, A=T) • RNA (ribonucleic acid): {A, G, C, U} (base pairs: G≡C, A=U, G-U)

  6. DNA Double Helix

  7. DNA Length • The total length of human DNA is about $3 \times 10^9$ (3 billion) base pairs. • Only 1% to 1.5% of the DNA sequence is useful. • Number of human genes: 30,000 to 40,000 • Conclusion from the Human Genome Project (1990-2003) • The originally expected number was 100,000.

  8. From DNA via RNA to Protein

  9. DNA, Genes and Proteins • DNA: program for cell processes • Proteins: execute cell processes • [Figure: a gene within the DNA sequence TCCAACGGTGCTGAGGTGCAC and the protein it produces]

  10. Promoter and Gene

  11. Amino Acids • Amino acids are the basic units of proteins; there are 20 kinds.

  12. Protein Structure

  13. Traditional Dynamic Programming (DP) for the Longest Common Subsequence (LCS) Problem

  14. The Longest Common Subsequence (LCS) Problem • A string: S1 = “TAGTCACG” • A subsequence of S1: obtained by deleting zero or more symbols from S1 (the deleted symbols need not be consecutive), e.g. G, AGC, TATC, AGACG • Common subsequences of S1 = “TAGTCACG” and S2 = “AGACTGTC”: GG, AGC, AGACG • Longest common subsequence (LCS): S1: TAGTCACG, S2: AGACTGTC, LCS: AGACG

  15. Applications of LCS • The edit distance of two strings or files (number of deletions and insertions). S1: TAGTCACG, S2: AGACTGTC, operations (D = delete, M = match, I = insert): DMMDDMMIMII • Spoken word recognition • Similarity of two biological sequences (DNA or protein) • Sequence alignment

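  16. The Traditional LCS Algorithm • $S_1 = a_1 a_2 \cdots a_m$ and $S_2 = b_1 b_2 \cdots b_n$ • $A_{i,j}$ denotes the length of the longest common subsequence of $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$. • Dynamic programming:
      $$A_{i,j} = \begin{cases} A_{i-1,j-1} + 1 & \text{if } a_i = b_j \\ \max\{A_{i-1,j},\, A_{i,j-1}\} & \text{if } a_i \ne b_j \end{cases}$$
      $$A_{0,0} = A_{0,j} = A_{i,0} = 0 \quad \text{for } 1 \le i \le m,\ 1 \le j \le n$$
      • Time complexity: $O(mn)$ • [Figure: the prefixes $a_1 a_2 \cdots a_{i-1} a_i$ and $b_1 b_2 \cdots b_{j-1} b_j$]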
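The recurrence on slide 16 can be sketched in Python as follows. This is a minimal illustration, not code from the presentation: the function name `lcs` and the traceback tie-breaking rule are choices made here, so the subsequence it returns may differ from the one shown on the slides while having the same length.

```python
def lcs(s1: str, s2: str) -> str:
    """Return one longest common subsequence of s1 and s2 (O(mn) time and space)."""
    m, n = len(s1), len(s2)
    # A[i][j] = LCS length of s1[:i] and s2[:j]; row 0 and column 0 stay 0.
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                A[i][j] = A[i - 1][j - 1] + 1
            else:
                A[i][j] = max(A[i - 1][j], A[i][j - 1])

    # Trace back from A[m][n] to recover one optimal subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif A[i - 1][j] >= A[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


result = lcs("TAGTCACG", "AGACTGTC")
print(len(result))  # length of an LCS for the example strings
```

For the example strings TAGTCACG and AGACTGTC the returned subsequence has length 5, matching the length of AGACG.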

  17. LCS and Edit Distance • Edit distance = |S1| + |S2| - 2 * |LCS(S1, S2)|
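As a quick check of this identity, the sketch below computes the insertion/deletion edit distance from the LCS length. The function names are illustrative, and substitutions are not counted as an operation here, in line with the "deletions and insertions" definition used on slide 15.

```python
def lcs_length(s1: str, s2: str) -> int:
    """LCS length using two rows of the DP table (O(mn) time, O(n) space)."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(s1: str, s2: str) -> int:
    """Edit distance with insertions and deletions only."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


print(indel_distance("TAGTCACG", "AGACTGTC"))  # 8 + 8 - 2 * 5 = 6
```

The value 6 agrees with the operation string DMMDDMMIMII on slide 15, which contains three deletions and three insertions.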

  18. Sequence Alignment • S1 = TAGTCACG, S2 = AGACTGTC • Two possible alignments:
      ----TAGTCACG
      AGACT-GTC---
      and
      TAGTCAC-G--
      -AG--ACTGTC
      • Which one is better? • We can set different gap penalties as parameters for different purposes.
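How the choice of parameters decides which alignment is "better" can be sketched with a standard global alignment by dynamic programming (the Needleman-Wunsch scheme). The slides do not name the scoring scheme they use, so the default values `match=1`, `mismatch=-1` and `gap=-1` below are assumptions for illustration only; integer scores are assumed so that the traceback comparisons are exact.

```python
def global_alignment(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_s1, aligned_s2) for an optimal global alignment."""
    m, n = len(s1), len(s2)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap          # s1 prefix aligned against gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap          # s2 prefix aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back to build one optimal alignment.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(s1[i - 1]); bottom.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(s1[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(s2[j - 1]); j -= 1
    return score[m][n], "".join(reversed(top)), "".join(reversed(bottom))


total, row1, row2 = global_alignment("TAGTCACG", "AGACTGTC")
print(total)
print(row1)
print(row2)
```

With `match=1`, `gap=0` and a strongly negative `mismatch`, the optimal score equals the LCS length, which is one way to see the LCS as a special case of alignment.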

  19. Gap Penalty for Sequence Alignment • [symbol not preserved in the transcript] is the gap penalty. • Suppose [scoring values not preserved in the transcript]

  20. Example for Sequence Alignment
      TAGTCAC-G--
      -AG--ACTGTC

  21. PAM250 Score Matrix for Protein Alignment

  22. MSA, ET and LCS • Multiple sequence alignment (MSA) • LCS • Phylogeny (evolutionary tree, ET)

  23. Hunt-Szymanski LCS Algorithm • By extending the idea of the RSK (Robinson-Schensted-Knuth) algorithm for the longest increasing subsequence problem, the LCS problem can be solved in O(r log n) time, where r denotes the number of matches (O((r + n) log n) when building the lists of matches is included). • This algorithm is faster than the traditional dynamic programming if r is small.

  24. The Pairs of Matching in Hunt-Szymanski Algorithm • Input sequences: TAGTCACG and AGACTGTC • Pairs of matching:
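The matching pairs for these two input sequences can be enumerated with a few lines of Python. This is an illustrative sketch: the function name `match_pairs` is made up here, indices are 1-based, and the pairs are listed with rows ascending and columns descending, which is the insertion order described on the next slide.

```python
def match_pairs(s1: str, s2: str):
    """List 1-based pairs (i, j) with s1[i-1] == s2[j-1].

    Rows (i) ascend; within a row, columns (j) descend.
    """
    return [(i, j)
            for i, a in enumerate(s1, start=1)
            for j in range(len(s2), 0, -1)
            if a == s2[j - 1]]


pairs = match_pairs("TAGTCACG", "AGACTGTC")
print(len(pairs))   # r, the number of matches
print(pairs[:4])
```

Every symbol of TAGTCACG occurs exactly twice in AGACTGTC, so r = 16 for this example, compared with 64 cells in the full DP lattice.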

  25. Example for Hunt-Szymanski Algorithm • The insertion order is row major and column backward. • Time complexity: O(r log n), where r is the number of matches. • Each match needs O(log n) time for binary search.
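The insertion procedure can be sketched as below. This is a compact length-only version written for illustration, not the presentation's own code: the `thresholds` list and the function name are assumptions of this sketch, and recovering the subsequence itself would need back-pointers that are omitted here.

```python
from bisect import bisect_left
from collections import defaultdict


def hunt_szymanski_lcs_length(s1: str, s2: str) -> int:
    """LCS length by processing matches row by row, columns in decreasing order."""
    # positions[ch] = increasing list of indices where ch occurs in s2.
    positions = defaultdict(list)
    for j, ch in enumerate(s2):
        positions[ch].append(j)

    # thresholds[k] = smallest column at which a common subsequence
    # of length k + 1 can end, given the rows processed so far.
    thresholds = []
    for ch in s1:                                   # row-major order
        for j in reversed(positions.get(ch, [])):   # columns backward
            k = bisect_left(thresholds, j)          # O(log n) binary search
            if k == len(thresholds):
                thresholds.append(j)
            else:
                thresholds[k] = j
    return len(thresholds)


print(hunt_szymanski_lcs_length("TAGTCACG", "AGACTGTC"))
```

Visiting the columns of one row in decreasing order prevents two matches from the same row from extending each other, so each symbol of the first string is used at most once.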

  26. Time and Space Complexities for LCS

  27. Block Edit Problems

  28. Motivation – Finding Similar Codes

  29. Block Edit Problems • Operations: block copy, block deletion and block move. • Shapira and Storer (2002) proved that the problem is NP-hard when recursive block-move operations are allowed. • Various approximations have been proposed. • Our assumptions (restricted edit sequence): • A series of edit operations is performed from left to right on the source string X. • No two block-edit operations are performed on overlapping regions of X.

  30. A Series of Block Edit Operations

  31. Restricted Edit Sequence (a) General (recursive) edit operations (b) Restricted edit sequence

  32. Definitions of the Problems (1/2) • Let P(o, c) denote a block edit problem: • o: a composition of block-edit operations • c: the class of cost measures • The block-copy operations: • External copy: copy a substring of $X$ to $W_i$ • Internal copy: copy a valid substring of $W_{i-1}$ to $W_i$ • Shifted copy: copy a shifted substring

  33. Definitions of the Problems (2/2) • The cost measures that can be chosen: • Constant cost: $p_{copy}$ • Linear cost: $p_s + k \times p_e$ • Nested cost: $p_{copy} + d_c(A, B)$ • Three problems are defined in our work: • P(EIS,C) • P(EI,L) • P(EI,N)
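The three cost measures can be written out as small functions. The transcript does not define the symbols, so this sketch assumes that k is the length of the copied or deleted block, that p_s and p_e are the initial and per-symbol extension penalties, and that d_c(A, B) is a unit-cost character edit distance (insertions, deletions, substitutions) between the copied block A and its edited form B. All function names and numeric values are hypothetical.

```python
def constant_cost(p_copy):
    """Constant cost: every block copy costs p_copy, whatever its length."""
    return p_copy


def linear_cost(k, p_s, p_e):
    """Linear cost: initial penalty p_s plus p_e per symbol (k assumed to be block length)."""
    return p_s + k * p_e


def char_edit_distance(a, b):
    """Unit-cost character edit distance; an assumed stand-in for d_c(A, B)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def nested_cost(a, b, p_copy):
    """Nested cost: copy penalty plus the cost of editing the copied block a into b."""
    return p_copy + char_edit_distance(a, b)


print(linear_cost(k=4, p_s=2, p_e=1))         # 2 + 4 * 1
print(nested_cost("abc", "abd", p_copy=5))    # 5 + 1 substitution
```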

  34. Problem 1 -- P(EIS,C) – External, Internal, Shifted, Constant • External and internal copies are allowed at constant cost. • Shifted copies are allowed at constant cost. • It can be solved by a straightforward DP algorithm in $O(nm^2(n + m)|\Sigma|)$ time. • We propose an $O(nm)$ time DP algorithm with • $O(n + m^2)$ preprocessing time in the worst case • $O(n + m \log m)$ preprocessing time in the average case

  35. Recurrence DP Formula for P(EIS,C) • Straightforward implementation: $O(nm^2(n + m)|\Sigma|)$ time.

  36. Functions and Operations (1) • Character operations: • Block deletions:

  37. Functions and Operations (2) • External copies: • Internal copies:

  38. Functions and Operations (3) • Shifted copies:

  39. Preprocessing for P(EIS,C) • For external copies: • Build a suffix tree $T(X^R \# Y^R \$)$ to find the common substrings between $X$ and $Y$. • For internal copies: • Build a suffix tree $T(Y^R)$ to find the valid common substrings to be copied from working string $W_i$ to $W_{i+1}$. • For shifted copies: • Compute the differential strings $X'$ and $Y'$ of $X$ and $Y$. • Find the valid common substrings for external / internal copies.
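The role of differential strings can be illustrated with a short sketch. The transcript does not define them, so this assumes the common definition in which symbols are integers and the differential string holds the differences between adjacent symbols; under that assumption, a block and a copy of it shifted by a constant have identical differential strings, so shifted copies can be found by ordinary substring matching on $X'$ and $Y'$.

```python
def differential_string(x):
    """Differences between adjacent symbols (assumed definition; symbols are integers)."""
    return [b - a for a, b in zip(x, x[1:])]


block = [3, 5, 4, 7]                 # hypothetical integer-coded block
shifted = [v + 2 for v in block]     # the same block shifted by +2

print(differential_string(block))    # [2, -1, 3]
print(differential_string(shifted))  # [2, -1, 3]
```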

  40. Preprocessing - Suffix Trees

  41. Preprocessing – Longest Common Prefixes (LCP) and Suffix trees

  42. Finding and Maintaining the Range Minimum in Constant Time
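One standard way to answer range-minimum queries in constant time is a sparse table, sketched below. This is a generic textbook structure shown for illustration: the slide content was not preserved, so it may differ from the method used in the presentation, and it handles only a static array, not the "maintaining" part of the slide title. The class and method names are made up here.

```python
class SparseTableRMQ:
    """Static range-minimum queries: O(n log n) preprocessing, O(1) per query."""

    def __init__(self, values):
        # table[t][i] = min(values[i : i + 2**t])
        self.table = [list(values)]
        n = len(values)
        span = 1
        while 2 * span <= n:
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + span])
                               for i in range(n - 2 * span + 1)])
            span *= 2

    def query(self, lo, hi):
        """Minimum of values[lo:hi]; requires 0 <= lo < hi <= len(values)."""
        level = (hi - lo).bit_length() - 1   # largest power of two that fits
        row = self.table[level]
        # Two overlapping windows of length 2**level cover [lo, hi).
        return min(row[lo], row[hi - (1 << level)])


rmq = SparseTableRMQ([5, 2, 4, 7, 1, 3])
print(rmq.query(0, 3))  # minimum of [5, 2, 4]
print(rmq.query(0, 6))  # minimum of the whole array
```

Applied to an LCP array, such a query gives the length of the longest common prefix of two suffixes, which is why range minima appear alongside the suffix-tree preprocessing.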

  43. Problem 2 -- P(EI,L) – External, Internal, Linear • The cost of each copy or deletion consists of an initial penalty plus a linear extension penalty.

  44. Problem 3 -- P(EI,N) – External, Internal, Nested • The copied strings can be further edited with character-edit operations.

  45. Summary of Block Edit Problems

  46. LCS of Run-Length Encoded Strings

  47. LCS of Run-Length Encoded Strings • Run-length encoding (RLE) compression: aaaaabbbccccdd → a5b3c4d2 • Input: • RLE string X: length n, k runs • RLE string Y: length m, l runs • Output: • LCS between X and Y.
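The encoding step can be sketched with the standard library's `itertools.groupby`. The function names are illustrative; the textual form simply concatenates each symbol with its run length, as in the slide's example.

```python
from itertools import groupby


def rle_encode(s: str):
    """Return the runs of s as (symbol, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_to_text(runs) -> str:
    """Render runs in the slide's notation, e.g. a5b3c4d2."""
    return "".join(f"{ch}{count}" for ch, count in runs)


runs = rle_encode("aaaaabbbccccdd")
print(runs)               # [('a', 5), ('b', 3), ('c', 4), ('d', 2)]
print(rle_to_text(runs))  # a5b3c4d2
```

Here the string has length n = 14 and k = 4 runs.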

  48. Dark & Light Blocks • Divide the DP lattice into k × l blocks. • Dark blocks: matched blocks • Light blocks: mismatched blocks
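The block decomposition can be sketched as follows: each pair of runs, one from X and one from Y, defines one block, and the block is dark when the two runs carry the same symbol. The input strings below are hypothetical and the function names are made up for this illustration.

```python
from itertools import groupby


def runs(s: str):
    """Runs of s as (symbol, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def block_colors(x: str, y: str):
    """k x l grid of labels: 'dark' where the run symbols match, else 'light'."""
    return [["dark" if cx == cy else "light" for cy, _ in runs(y)]
            for cx, _ in runs(x)]


# Hypothetical inputs: X = a2b3 (k = 2 runs), Y = a1b2 (l = 2 runs).
for row in block_colors("aabbb", "abb"):
    print(row)
```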

  49. Results of Bunke and Csirik (1995) • Lemma 1 (Dark block): • Lemma 2 (Light block): • Only the boundaries of the blocks are needed.

  50. Results of Liu et al. (2008) • A complex modified DP formula which computes the DP lattice row by row. • Only the bottom boundaries of the blocks are needed.