1 / 68

Recuperació de la informació

Recuperació de la informació. Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot http://static.ppurl.com/chmview-V1JRYFF-BnMAZgFqD1NVOlZ0VzMMZgdqUDABMwI9BWc=/0001.html

Download Presentation

Recuperació de la informació

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Recuperació de la informació • Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto • Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot http://static.ppurl.com/chmview-V1JRYFF-BnMAZgFqD1NVOlZ0VzMMZgdqUDABMwI9BWc=/0001.html • Algorithms on strings (2001) M. Crochemore, C. Hancart and T. Lecroq • http://www-igm.univ-mlv.fr/~lecroq/string/index.html

  2. String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns • Exact matching: • The patterns ---> Data structures for the patterns • 1 pattern ---> The algorithm depends on |p| and || • k patterns ---> The algorithm depends on k, |p| and || • Extensions • Regular Expressions • The text ----> Data structure for the text (suffix tree, ...) • Approximate matching: • Dynamic programming • Sequence alignment (pairwise and multiple) • Sequence assembly: hash algorithm • Probabilistic search: Hidden Markov Models

  3. Extended string matching • Classes of characters: when in some DNA files or patterns there are new characters as N or R that means N={A,C,G,T} and R={G,A}. • Bounded length gaps: we find pattern as ATx(2,3)TA where x(2,3) means any 2 or 3 characters. • Optional characters: we find pattern as AC?ACT?T?A where C? means that C may or may not appear in the text. • Wild cards: we find pattern as AT*TA where * means an arbitrary long string. • Repeatable characters: we find pattern as AT[TA]*AT where [TA]* means that TA can appear zero or more times..

  4. Classes of characters There are classes of characters represented by one symbol. For instace the IUPAC code for the DNA alphabet is: R = {G,A} Y = {T,C} K = {G,T} M = {A,C} S = {G,C} W = {A,T} B = {G,T,C } D = {G,A,T} H = {A,C,T} V = {G,C,A} N = {A,G,C,T} (any) 1. Classes of characters in the tetx. There are characters in the text that represent sets of simbols 2. Classes of characters in the pattern. There are characters in the pattern that represent sets of simbols

  5. Extended alphabets First part Classes in the text

  6. Classes in the text: Brute force algorithm • How the comparison is made? Text : over 2|∑| Pattern over  From left to right: prefix We need the operation: belongs to a set ? ? • Which is the next position of the window? Text : Pattern : The window is shifted only one cell

  7. Classes in the text: Brute force algorithm When || < computer word Every subset of  is represented by a string of bits of length |  |. For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=( , , , )

  8. Classes in the text: Brute force algorithm When || < computer word Every subset of  is represented by a string of bits of length |  |. For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=( , , , )

  9. Classes in the text: Brute force algorithm When || < computer word Every subset of  is represented by a string of bits of length |  |. For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “A belongs to setX” is made with ...

  10. Classes in the text: Brute force algorithm A T G T A A T G T A When || < computer word Every subset of  is represented by a string of bits of length |  |. For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “A belongs to set X” is made with I(A) and I(X) >0 G T A R T R N A G G A ... I(A) & I(T)>0

  11. Classes in the text: Brute force algorithm A T G T A A T G T A A T G T A When || < computer word Every subset of  is represented by a string of bits of length |  |. For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “A belongs to set X” is made with I(A) and I(X) >0 G T A R T R N A G G A ... I(A) & I(T)>0 I(T) & I(T)>0 I(G) & I(R)>0 I(T) & I(A)>0 I(A) & I(R)>0

  12. Classes in the text: Brute force algorithm A T G T A A T G T A A T G T A When || < computer word Every subset of  is represented by a string of bits of length |  |. For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “A belongs to set X” is made with I(A) and I(X) >0 G T A R T R N A G G A ... I(A) & I(T)>0 I(T) & I(T)>0 I(G) & I(R)>0 I(T) & I(A)>0 I(A) & I(R)>0 I(T) & I(R)>0 I(A) & I(N)>0 ... Which is the cost?

  13. Classes in the text Experimental efficiency (Navarro & Raffinot) BNDM : Backward Nondeterministic Dawg Matching | | BOM : Backward Oracle Matching 64 32 16 Horspool 8 BOM BNDM 4 Long. pattern 2 w 2 4 8 16 32 64 128 256

  14. Classes in the text: Horspool algorithm • How the comparison is made? Text : Pattern : Sufix search • Which is the next position of the window? a Text : Pattern : Shift until the next ocurrence of “a” (or “t”,”r”,…) in the pattern: a a a a a a We need a shift table with the extended alphabet.

  15. Classes in the text :Horspool example A 4 C 5 G 2 T 1 R ? … N ? Given the pattern ATGTA • The shift table is:

  16. Classes in the text :Horspool example A 4 C 5 G 2 T 1 R 2 … N ? Given the pattern ATGTA • The shift table is:

  17. Classes in the text :Horspool example text : G T A R T R N A A G G A … A T G T A A T G T A A T G T A A 4 C 5 G 2 T 1 R 2 … N 1 Given the pattern ATGTA • The shift table is:

  18. Classes in the text :Horspool example text : G T A R T R N A A G G A ... A T G T A A T G T A A T G T A A T G T A A 4 C 5 G 2 T 1 R 2 … N 1 Given the pattern ATGTA • The shift table is: …

  19. Classes in the text Experimental efficiency (Navarro & Raffinot) BNDM : Backward Nondeterministic Dawg Matching | | BOM : Backward Oracle Matching 64 32 16 Horspool 8 BOM BNDM 4 Long. pattern 2 w 2 4 8 16 32 64 128 256

  20. Classes in the text: BNDM algorithm • How the comparison is made? Search for suffixes of T that are factors of the pattern x Text : Pattern : …that is denoted as D2 = 1 0 0 0 1 0 0 Once the next character x is read D3 = D2<<1 & B(x) B(x): mask of x in the pattern P. For instance, if B(x) = ( 0 0 1 1 0 0 0) D = (0 0 0 1 0 0 0) & (0 0 1 1 0 0 0 ) = (0 0 0 1 0 0 0 ) • Which is the next position of the window? Depends on the value of the leftmost bit of D

  21. Classes in the text : BNDM example Given the pattern ATGTA B(A) = ( 1 0 0 0 1 ) B(R)=( ) B(C) = ( 0 0 0 0 0 ) … B(G) = ( 0 0 1 0 0 ) B(N)=( ) B(T) = ( 0 1 0 1 0 ) • The masks of bits are

  22. Classes in the text : BNDM example Given the pattern ATGTA B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1) B(C) = ( 0 0 0 0 0 ) … B(G) = ( 0 0 1 0 0 ) B(N)=( ) B(T) = ( 0 1 0 1 0 ) • The masks of bits are

  23. Classes in the text : BNDM example • text : G T A R T R N A G G A C G ... A T G T A A T G T A Given the pattern ATGTA B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1) B(C) = ( 0 0 0 0 0 ) … B(G) = ( 0 0 1 0 0 ) B(N)=(1 1 1 1 1) B(T) = ( 0 1 0 1 0 ) • The masks of bits are D1 = ( 0 1 0 1 0 ) D2 = ( 1 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 1 0 0 ) D2 = ( 0 1 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )

  24. Classes in the text : BNDM example • text : G T A R T R N A G G A C G ... A T G T A A T G T A Given the pattern ATGTA B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1) B(C) = ( 0 0 0 0 0 ) … B(G) = ( 0 0 1 0 0 ) B(N)=(1 1 1 1 1) B(T) = ( 0 1 0 1 0 ) • The masks of bits are D1 = ( 0 1 0 1 0 ) D2 = ( 1 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 1 0 0 ) D2 = ( 0 1 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 ) D1 = ( 1 0 0 0 1 ) D2 = ( 0 0 0 1 0 ) & ( 1 1 1 1 1 ) = ( 0 0 0 1 0 ) D3 = ( 0 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 0 0 1 0 0 ) D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 ) D5 = ( 1 0 0 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 0 0 0)

  25. Classes in the text : BNDM example • text : G T A R T R N A G G A C G ... A T G T A A T G T A A T G T A Given the pattern ATGTA B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1) B(C) = ( 0 0 0 0 0 ) … B(G) = ( 0 0 1 0 0 ) B(N)=(1 1 1 1 1) B(T) = ( 0 1 0 1 0 ) • The masks of bits are D1 = ( 0 1 0 1 0 ) D2 = ( 1 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 1 0 0 ) D2 = ( 0 1 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 ) D1 = ( 1 0 0 0 1 ) D2 = ( 0 0 0 1 0 ) & ( 1 1 1 1 1 ) = ( 0 0 0 1 0 ) D3 = ( 0 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 0 0 1 0 0 ) D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 ) D5 = ( 1 0 0 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 0 0 0) …

  26. Classes in the text Experimental efficiency (Navarro & Raffinot) BNDM : Backward Nondeterministic Dawg Matching | | BOM : Backward Oracle Matching 64 32 16 Horspool 8 BOM BNDM 4 Long. pattern 2 w 2 4 8 16 32 64 128 256

  27. BOM algorithm (Backward Oracle Matching) • How the comparison is made? Text : Pattern : Automata: Factor Oracle Check if the suffix is a factor • Which is the next position of the window? The position determined by the last character of the text with a transition in the automata

  28. Classes in the text: BOM example G T A G T T A G T A … and we try to find… : G T A R T R N A A T G… The we build the AFO of the inverse pattern of ATGTATG A T G T A T G It’s not possible any improvement!

  29. Multiple string matching 8 | | (5 strings) Wu-Manber 4 SBOM lmin 2 5 10 15 20 25 30 35 40 45 8 Wu-Manber (10 strings) (100 strings) 4 SBOM 8 Wu-Manber Ad AC 2 SBOM 4 5 10 15 20 25 30 35 40 45 Ad AC 2 5 10 15 20 25 30 35 40 45 Wu-Manber 8 (1000 strings) SBOM 4 Ad AC 2 5 10 15 20 25 30 35 40 45

  30. Classes in the text: Set Horspool algorithm • How the comparison is made? By suffixes Text : Patterns: Trie of all inverse patterns • Which is the next position of the window? ?

  31. Set Horspool algorithm 1. Construct the trie of GTATGTA, GTAT, TAATA i GTGTA G T A T A T G G T A T A A T A A 1 C 4 (lmin) G 2 T 1 3. Determine the shift table Search for ATGTATG,TATG,ATAAT,ATGTG 2. Determine lmin=4 4. Find the patterns

  32. Classes in the text: Set Horspool G T A T A T G G T A T A A T A A Search for the patterns ATGTATG,TATG,ATAAT,ATGTG text: ARTGNCTATGTGACA… It’s not possible any improvement!

  33. Multiple string matching 8 | | (5 strings) Wu-Manber 4 SBOM lmin 2 5 10 15 20 25 30 35 40 45 8 Wu-Manber (10 strings) (100 strings) 4 SBOM 8 Wu-Manber Ad AC 2 SBOM 4 5 10 15 20 25 30 35 40 45 Ad AC 2 5 10 15 20 25 30 35 40 45 Wu-Manber 8 (1000 strings) SBOM 4 Ad AC 2 5 10 15 20 25 30 35 40 45

  34. Classes in the text: SBOM algorithm • How the comparison is made? Text : Pattern : Automata: Factor Oracle (Inverse patterns of length lmin) Check if the suffix is a factor of any pattern • Which is the next position of the window? The position determined by the last character of the text with a transition in the automata

  35. Classes in the text: SBOM example Search for the patterns ATGTATG, TAATG,TAATAAT i AATGTG G T A G T T A 1 4 A G T A A A T 2 3 text: ACATN C TAGC TA TA ATAATGTATG It’s not possible any improvement!

  36. Extended alphabets Classes in the: text pattern Horspool ✓ BNDM ✓ BOM ✗ Set-Horspool ✗ SBOM ✗

  37. Extended search Second part Classes in the pattern

  38. Classes in the pattern: Brute force algorithm • How the comparison is made? Text : over  Pattern : over 2|∑| From left to right: prefix We need the operation: belongs to a set ? ? • Which is the next position of the window? Text : Pattern : The window is shifted only one cell

  39. Classes in the pattern: Brute force algorithm A T N T R A T N T R When || < computer word Every subset is represented by a string of bits of length |  |. For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=(1,0,1,0,),..., I(N)=(1,1,1,1) Then the operation “A belongs to set X” is made with I(A) and I(X) >0 G T A C T A G A G G A C G T A T G T A C T G ... I(T) and I(R) >0 I(A) and I(R) >0 I(T) and I(T) >0 I(C) and I(N) >0 I(A) and I(T) >0 …

  40. Classes in the text Experimental efficiency (Navarro & Raffinot) BNDM : Backward Nondeterministic Dawg Matching | | BOM : Backward Oracle Matching 64 32 16 Horspool 8 BOM BNDM 4 Long. pattern 2 w 2 4 8 16 32 64 128 256

  41. Classes in the pattern: Horspool algorithm • How the comparison is made? Text : Pattern : Sufix search • Which is the next position of the window? a Text : Pattern : Shift until the next ocurrence of “a” in the pattern: a a a a a a We need a preprocessing phase to construct the shift table.

  42. Classes in the pattern: Horspool example A C G T Given the pattern ATNTR • The shift table is:

  43. Classes in the pattern: Horspool example A 2 C G T Given the pattern ATNTR • The shift table is:

  44. Classes in the pattern: Horspool example A 2 C 2 G T Given the pattern ATNTR • The shift table is:

  45. Classes in the pattern: Horspool example A 2 C 2 G 2 T Given the pattern ATNTR • The shift table is:

  46. Classes in the pattern: Horspool example A 2 C 2 G 2 T 1 text : G T A C T A G A T A T G A G ... A T N T R A T N T R A T N T R A T N T R A T N T R Given the pattern ATNTR • The shift table is:

  47. Classes in the pattern: Horspool example A 2 C 2 G 2 T 1 text : G T A C T A G A T A T G A G ... A T N T R A T N T R A T N T R A T N T R A T N T R A T G T A Given the pattern ATNTR • The shift table is: Shorter shifts!

  48. Classes in the text Experimental efficiency (Navarro & Raffinot) BNDM : Backward Nondeterministic Dawg Matching | | BOM : Backward Oracle Matching 64 32 16 Horspool 8 BOM BNDM 4 Long. pattern 2 w 2 4 8 16 32 64 128 256

  49. Classes in the text: BNDM algorithm • How the comparison is made? Search for suffixes of T that are factors of the pattern x Text : Pattern : …that is denoted as D2 = 1 0 0 0 1 0 0 Once the next character x is read D3 = D2<<1 & B(x) B(x): mask of x in the pattern P. For instance, if B(x) = ( 0 0 1 1 0 0 0) D = (0 0 0 1 0 0 0) & (0 0 1 1 0 0 0 ) = (0 0 0 1 0 0 0 ) • Which is the next position of the window? Depends on the value of the leftmost bit of D

  50. Classes in the pattern : BNDM example B(A) = ( ) B(C) = ( ) B(G) = ( ) B(T) = ( ) • The masks of bits of symbols are Given the pattern ATNTR

More Related