
Text Indexing and Dictionary Matching with One Error

Text Indexing and Dictionary Matching with One Error. Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee. Speaker: C. W. Cheng.


Presentation Transcript


  1. Text Indexing and Dictionary Matching with One Error. Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee. Speaker: C. W. Cheng.

  2. Problem Definition • The Indexing Problem: • Input: A text T of length n over alphabet Σ, a pattern P of length m over alphabet Σ, and an integer k. • Output: All occurrences of P in T with at most k mismatches.

  3. Main idea • In this algorithm, we construct a suffix tree and a prefix tree of the text T (the prefix tree is the suffix tree of the reversed text). For each j = 1, 2, …, m, we look up the prefix P1…Pj-1 in the prefix tree and the suffix Pj+1…Pm in the suffix tree. If both occur in T with exactly one text position between them, an approximate match with at most one mismatch occurs.

  4. Processing • 1. Construct a suffix tree ST of the text string T and a suffix tree STR of the string TR, where TR = tn … t1 is the reversed text.

  5. Ex: T=AGCAGAT TR=TAGACGA

  6. Ex: T=AGCAGAT TR=TAGACGA

  7. Processing • 2. For each of the suffix trees, link all leaves of the suffix tree in a left-to-right order.

  8. Ex: T=AGCAGAT TR=TAGACGA

  9. Processing • 3. For each of the suffix trees, set pointers from each tree node v to its leftmost leaf vl and rightmost leaf vr in the linked list.
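Steps 2 and 3 rely on the fact that, when the leaves are read left to right, all leaves below one node form a single contiguous run of the leaf list. A minimal sketch of that idea follows. It assumes a sorted list of suffixes as a stand-in for the suffix tree; the helper names `sorted_suffix_starts` and `leaf_interval` are hypothetical, and the quadratic scans are for illustration only.

```python
def sorted_suffix_starts(text):
    """1-based start positions of all suffixes, in lexicographic order.

    This mimics reading the leaves of a suffix tree from left to right.
    """
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def leaf_interval(text, order, prefix):
    """Return (lo, hi), the 0-based inclusive indices into `order` of the
    suffixes that start with `prefix`, or None if no suffix does.

    The pair plays the role of the pointers (vl, vr) of the node reached
    by spelling out `prefix` from the root.
    """
    hits = [k for k, i in enumerate(order) if text.startswith(prefix, i - 1)]
    if not hits:
        return None
    # Suffixes sharing a prefix are adjacent in sorted order,
    # so the first and last hit delimit the whole run.
    return hits[0], hits[-1]


order = sorted_suffix_starts("AGCAGAT")
print(order)                                # [4, 1, 6, 3, 5, 2, 7]
print(leaf_interval("AGCAGAT", order, "AG"))  # (0, 1)
```

With the interval in hand, the set of leaf labels under a node is just a slice of the leaf list, which is what the query step later intersects.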

  10. Ex: T=AGCAGAT TR=TAGACGA

  11. Processing • 4. Designate each leaf in ST by the starting location of its suffix. Designate each leaf in STR by n – i + 3, where i is the starting position of the leaf’s suffix in TR.
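The offset 3 in step 4 can be checked by a short position count. The following is a reconstruction from the definitions on this slide (1-based positions, $T^R = t_n \dots t_1$), offered as a reading aid rather than as the paper's own argument; $e$ is a symbol introduced here for the last text position of the matched prefix.

```latex
\begin{align*}
T^R[i] &= t_{n-i+1}
  && \text{(the suffix of } T^R \text{ starting at } i \text{ begins with } t_{n-i+1}\text{)} \\
e &= n - i + 1
  && \text{(if } P_{j-1} \dots P_1 \text{ is read from there, } P_1 \dots P_{j-1} \text{ ends at } t_e\text{)} \\
e + 1 &= n - i + 2
  && \text{(text position aligned with the possibly mismatching } P_j\text{)} \\
e + 2 &= n - i + 3
  && \text{(text position where } P_{j+1} \dots P_m \text{ must start)}
\end{align*}
```

Under this reading, a leaf of STR and a leaf of ST carry the same label exactly when the prefix occurrence and the suffix occurrence are separated by one text position.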

  12. Ex: T=AGCAGAT TR=TAGACGA

  13. Query Processing • For j = 1, …, m do • 1. Find node v, the location of Pj+1 … Pm in ST, if such a node exists. • 2. Find node w, the location of Pj-1 … P1 in STR, if such a node exists. • 3. If v and w both exist, let V[vl … vr] and W[wl … wr] be the values of the leaves under v and w, and find their intersection I. If I is non-empty, an approximate match occurs on Ti-j … Ti-j+m-1, for all i ∈ I.
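The query loop can be sketched as follows. This is a naive stand-in, not the data structure from the talk: the leaf sets V and W are computed by scanning the text instead of walking suffix trees, and the intersection is a plain set intersection instead of a range query. The function name `one_mismatch_occurrences` is hypothetical. Labels follow the convention of step 4: a suffix occurrence is labelled by its start position, and a prefix occurrence ending at text position e is labelled e + 2, so a common label i corresponds to a window starting at i - j.

```python
def one_mismatch_occurrences(text, pattern):
    """1-based start positions where `pattern` matches `text` with at
    most one mismatch, found by the prefix/suffix split for each j."""
    n, m = len(text), len(pattern)
    starts = set()
    for j in range(1, m + 1):           # j = pattern position allowed to mismatch
        suffix = pattern[j:]            # P[j+1..m]
        prefix = pattern[:j - 1]        # P[1..j-1]
        # V: start positions of `suffix` in T (labels of leaves under v).
        V = {i for i in range(1, n + 2) if text.startswith(suffix, i - 1)}
        # W: for each occurrence of `prefix` starting at s, the label is
        # (end position) + 2 = s + j (labels of leaves under w).
        W = {s + j for s in range(1, n + 2) if text.startswith(prefix, s - 1)}
        for i in V & W:
            starts.add(i - j)           # window T[i-j .. i-j+m-1]
    return sorted(starts)


print(one_mismatch_occurrences("actgacctcagctta", "ctaa"))  # [2, 7, 12]
```

Collecting the starts in a set matters: an exact occurrence is found once for every j, so a list would report it m times.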

  14. Example Ex: T=actgacctcagctta P=ctaa k=1

  15. Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T

  16. Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR

  17. Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=1 v=Pj+1…Pm=taa w=Pj-1…P1=ε V[vl…vr]=∅

  18. Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=1 v=Pj+1…Pm=taa w=Pj-1…P1=ε V[vl…vr]=∅ W[wl…wr]={3,12,…,14} I=∅

  19. Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=2 v=Pj+1…Pm=aa w=Pj-1…P1=c V[vl…vr]=∅

  20. Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=2 v=Pj+1…Pm=aa w=Pj-1…P1=c V[vl…vr]=∅ W[wl…wr]={4,8,9,14,11} I=∅

  21. Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=3 v=Pj+1…Pm=a w=Pj-1…P1=tc V[vl….vr]={15,5,1,10}

  22. Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=3 v=Pj+1…Pm=a w=Pj-1…P1=tc V[vl…vr]={15,5,1,10} W[wl…wr]={5,10,15} I={15,5,10}

  23. When j=3, the intersection of V={15,5,1,10} and W={5,10,15} is I={5,10,15}. Therefore an approximate match occurs on Ti-j…Ti-j+m-1, for all i ∈ I: T2…T5, T7…T10, T12…T15. T=actgacctcagctta P=ctaa
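The result of the worked example can be cross-checked without any index by sliding the pattern over the text and counting mismatches. This brute-force check is not part of the algorithm in the talk; `hamming` and `brute_force_occurrences` are hypothetical helper names.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def brute_force_occurrences(text, pattern, k=1):
    """1-based starts of all windows of `text` within k mismatches of `pattern`."""
    m = len(pattern)
    return [s + 1 for s in range(len(text) - m + 1)
            if hamming(text[s:s + m], pattern) <= k]


print(brute_force_occurrences("actgacctcagctta", "ctaa"))  # [2, 7, 12]
```

The three windows are ctga, ctca and ctta, each differing from ctaa only in the third character, which is why they all surface at j=3.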

  24. Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T j=4 v=Pj+1…Pm=ε w=Pj-1…P1=atc V[vl…vr]={15,5,…,13}

  25. Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR j=4 v=Pj+1…Pm=ε w=Pj-1…P1=atc V[vl…vr]={15,5,…,13} W[wl…wr]=∅ I=∅

  26. Range Query Problem • In step 3, given nodes v and w, we want to find the leaves that appear both in interval [vl … vr] and in the interval [wl … wr], where the four end points of the two intervals are defined in step P.3 of the preprocessing. Thus, we are seeking a solution to the range query problem.

  27. Problem Definition of Range Query • Input: Let V=[v1,v2 … vn] and W=[w1,w2 … wn] be two permutation arrays, where n is the number of elements. Four constants i, j, k and l, where both i+k-1 ≤ n and j+l-1 ≤ n. • Output: The elements common to V[i … i+k-1] and W[j … j+l-1].

  28. Example: V=[8,5,1,4,3,7,6,2] W=[3,6,4,7,2,1,5,8] i=3,k=4 j=2,l=5 Output: the intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6]

  29. Preprocessing [Figure: V and W with positions indexed 1 to 8.]

  30. Preprocessing [Figure: V and W with positions indexed 1 to 8.]

  31. Preprocessing [Figure: V and W with positions indexed 1 to 8.] The intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6] is {1,4,7}.
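The example can be reproduced with a few lines of code. The sketch below treats each value as a point whose coordinates are its position in V and its position in W, so the query asks for the points inside a rectangle. It answers the query by a linear scan, not with the grid data structure cited in the talk; `range_intersection` and its explicit 1-based inclusive bounds are hypothetical choices made for this illustration.

```python
def range_intersection(V, W, v_lo, v_hi, w_lo, w_hi):
    """Values that appear both in V[v_lo..v_hi] and in W[w_lo..w_hi].

    V and W are permutations of the same values; all bounds are 1-based
    and inclusive.
    """
    # Position of every value in W, so membership in W's range is O(1).
    pos_in_w = {value: idx for idx, value in enumerate(W, start=1)}
    return {V[p - 1] for p in range(v_lo, v_hi + 1)
            if w_lo <= pos_in_w[V[p - 1]] <= w_hi}


V = [8, 5, 1, 4, 3, 7, 6, 2]
W = [3, 6, 4, 7, 2, 1, 5, 8]
print(sorted(range_intersection(V, W, 3, 6, 2, 6)))  # [1, 4, 7]
```

The scan costs time proportional to the length of the V range, which is what a dedicated range-searching structure is meant to avoid when only a few points fall in the rectangle.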

  32. Time Complexity of Range Query Problem • By using Overmars' algorithm [O88], the range query problem can be solved with preprocessing time … and query time …, where k is the number of points in the range. • [O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988, pp. 254-275.

  33. Time Complexity • For the indexing problem, the preprocessing time is … and the query can be implemented in time …, where tocc is the number of occurrences of the pattern in the text with one error.

  34. The Dictionary Matching Problem

  35. Problem Definition • The Dictionary Matching Problem • Input: • 1. A dictionary P = {p1, …, ps}, where pi, i = 1, …, s, are patterns over alphabet Σ; the dictionary size is the sum of the lengths of all the dictionary patterns. • 2. A text T of length n over alphabet Σ. • 3. An integer k. • Output: • All occurrences of any dictionary pattern in T with at most k mismatches.

  36. Main idea • In this algorithm, we construct a suffix tree and a prefix tree of D, the concatenation of all patterns in the dictionary. For each j = 1, 2, …, n, we match the text to the left of position j (tj-1 … t1) in the prefix tree and the text to the right of position j (tj+1 … tn) in the suffix tree. If a pattern prefix and the corresponding pattern suffix are both found, an approximate match with one error occurs.

  37. Processing • 1. Construct a suffix tree SD of string D and suffix tree SDR of the string DR, where D is the concatenation of all dictionary patterns, with a separator at the end of each pattern, and where DR is the reversal of string D.

  38. Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$

  39. Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$

  40. Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$

  41. Processing • 2. Modify suffix trees SD and SDR as follows. For each separator which is treefirst but not edgefirst, i.e., it appears on an edge (u,v) labeled σ$σ', where σ≠ε, break (u,v) into (u,w) and (w,v). Label (u,w) with σ and (w,v) with $σ'.
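The edge-breaking rule acts only on the edge label, so it can be illustrated on strings alone. The sketch below assumes the caller has already established that the separator is treefirst (the first separator on the path from the root); it only checks the edgefirst condition. `split_edge_at_separator` is a hypothetical helper, and a real implementation would also create the new node w and rewire the tree.

```python
def split_edge_at_separator(label, sep="$"):
    """Split an edge label sigma + sep + sigma' into (sigma, sep + sigma').

    Returns None when no split is needed: either the label has no
    separator, or the separator is already the first character of the
    edge (edgefirst).
    """
    k = label.find(sep)
    if k <= 0:
        return None
    return label[:k], label[k:]


print(split_edge_at_separator("CA$GCTGA$GCA$"))  # ('CA', '$GCTGA$GCA$')
print(split_edge_at_separator("$GCA$"))          # None
```

After the split, every node that spells out a complete pattern suffix has an outgoing edge starting with the separator, which the next preprocessing step relies on.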

  42. Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$

  43. Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$

  44. Preprocessing • 3. Scan suffix tree SD, respectively SDR, and modify as follows. For each vertex v consider the associated string L(v), i.e., the string from the root to v. Label v with all the locations of the pattern suffixes, resp. prefixes, that are equal to L(v). To implement this note that all the relevant suffixes share a prefix of L(v)$. So, go to edge (v,w) with label beginning with $, assuming such exists, and scan the subtree rooted at w to find all relevant suffixes.

  45. Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$

  46. Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$

  47. Query Processing • For j = 1,…., n do • 1. Find node v, the location of the longest prefix of tj+1 … tn in SD. • 2. Find node w, the location of the longest prefix of tj-1 … t1 in SDR. • 3. Find intersection of markings of nodes on the path from the root to v in SD and on the path from the root to w in SDR.
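What the intersection in step 3 has to deliver can be stated directly: for a text position j, a pattern p matches with its q-th character aligned to tj if the first q-1 characters of p end at tj-1 and the remaining characters start at tj+1. The sketch below checks this by direct string comparison, so it is a brute-force stand-in for the two tree walks and the marking intersection, under that reading of the slide. `dictionary_matches_one_mismatch` is a hypothetical name.

```python
def dictionary_matches_one_mismatch(text, patterns):
    """Sorted (start, pattern) pairs, with 1-based start, for every
    dictionary pattern that matches `text` with at most one mismatch."""
    n = len(text)
    found = set()
    for j in range(1, n + 1):                # j = text position allowed to mismatch
        for p in patterns:
            for q in range(1, len(p) + 1):   # q = pattern position aligned with t_j
                start = j - q + 1            # window is T[start .. end]
                end = start + len(p) - 1
                if start < 1 or end > n:
                    continue
                left_ok = text[start - 1:j - 1] == p[:q - 1]   # prefix ends at t_(j-1)
                right_ok = text[j:end] == p[q:]                # suffix starts at t_(j+1)
                if left_ok and right_ok:
                    found.add((start, p))
    return sorted(found)


print(dictionary_matches_one_mismatch("acagccga", ["tca", "gctga", "gca"]))
# [(1, 'gca'), (1, 'tca'), (4, 'gca'), (4, 'gctga')]
```

As in the indexing case, results are collected in a set because an exact occurrence would otherwise be reported once for each aligned position.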

  48. Example T=acagccga P={tca,gctga,gca} k=1

  49. Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=tca$gctga$gca$ DR=acg$agtcg$act$ T=acagccga

  50. Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=tca$gctga$gca$ DR=acg$agtcg$act$ T=acagccga
