1 / 26

Recuperació de la informació

Recuperació de la informació. Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot Algorithms on strings (2001) M. Crochemore, C. Hancart and T. Lecroq

Download Presentation

Recuperació de la informació

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Recuperació de la informació • Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto • Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot • Algorithms on strings (2001) M. Crochemore, C. Hancart and T. Lecroq • http://www-igm.univ-mlv.fr/~lecroq/string/index.html

  2. String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns • Exact matching: • The patterns ---> Data structures for the patterns • 1 pattern ---> The algorithm depends on |p| and || • k patterns ---> The algorithm depends on k, |p| and || • Extensions • Regular Expressions • The text ----> Data structure for the text (suffix tree, ...) • Approximate matching: • Dynamic programming • Sequence alignment (pairwise and multiple) • Sequence assembly: hash algorithm • Probabilistic search: Hidden Markov Models

  3. Regular expression A regular expression ℛ is a string on the set of simbols ΣU { ε, |, · , * , (, ) } which is recursively defined as: • ε (empty character) is a regular expression • A character of Σis a regular expression • ( ℛ ) is a regular expression • ℛ1 ·ℛ2is a regular expression • ℛ1 |ℛ2is a regular expression • ℛ * is a regular expression

  4. Regular lenguage The lenguage defined by a regular expression ℛis the set of strings generated by ℛ. The problem of searching for a regular expression in the text T is to find all the factors in T that belong to the lenguage.

  5. Methods Parse tree Search with bit-parallel Thompson automata DFA Search with deterministic finit automata Regular expression NFA Strings found

  6. Methods Parse tree Search with bit-parallel Thompson automata DFA Search with deterministic finit automata Regular expression NFA Strings found

  7. Search with a deterministic finit automata 2 b 3 b b 0 1 a b b 12 b a 0 1 a 3 Given the regular expression bb*(b|b*a) the NFA is As it’s not possible to spell the text out the NFA, the NFA is transformed into a DFA … What is the cost? And the search process…

  8. Search example with DFA The search on the text: b b b a a b a a b b … b b 12 b a 0 1 a 3 Given the regular expression bb*(b|b*a)and the NFA: …

  9. Methods Parse tree Search with bit-parallel Thompson automata DFA Search with deterministic finit automata Regular expression NFA Strings found

  10. Parse tree . ℛ ℛ1 ℛ2 | * ℛ1 ℛ2 ℛ Is a tree such that: - internal nodes are labeled by operators - leaves are labeled by characters of Σ and ε ( ℛ ) ℛ1 ·ℛ2 ℛ1 |ℛ2 ℛ *

  11. Parse tree: example . | * . b * b Given the regular expression bb*(b|b*a) the parse tree is: b b a

  12. NFA (Thompson automaton) a For a character a of Σ: . ℛ1 ℛ2 | ε ε ε ε ℛ1 ℛ2 * ε ε ℛ ε From the regular expression or from the parse tree we define the automaton:

  13. Thompsom automaton construction b b a b b bb*(b|b*a) . | b * . b b a * b

  14. NFA: ε-closure (states ε-equivalents) bb*(b|b*a) b 6 7 b a 2 3 5 8 9 b 0 1 4 12 b 10 11 ε 1 3 4 5 7 9 11 1, 2, 4, 5, 6, 8, 10 2, 3, 4, 5, 6, 8,10 4, 5, 6, 8, 10 5, 6, 8 6, 7, 8 9, 12 11, 12

  15. Bit-parallel Thompsom algorithm ε 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 D 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 D 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 -> 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 bb*(b|b*a) b 6 7 b a 2 3 5 8 9 b 0 1 4 12 b 11 10 Text: ababbbaab The bit-vector D mark the active states: at the begining At every step we shift to the right followed by an “and” operator with the mask of the last read character… The masks are (a) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …and the ε-closure extension of active states.

  16. Bit-parallel Thompsom algorithm D 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 -> 0 1 0 0 0 0 0 0 0 0 0 0 0 b bb*(b|b*a) 6 7 b a 2 3 5 8 9 ε 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 b 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 0 1 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 -> 0 1 0 0 0 0 0 0 0 0 0 0 0 (a) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

  17. Bit-parallel Thompsom algorithm D 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 -> 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 b bb*(b|b*a) 6 7 b a 2 3 5 8 9 ε 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 b 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 -> 0 1 0 0 0 0 0 0 0 0 0 0 0 (a) 0 0 0 0 0 0 0 0 0 1 0 0 0 (b) 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

  18. Bit-parallel Thompsom algorithm b bb*(b|b*a) 6 7 b a 2 3 5 8 9 b E 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 1 0 1 1 0 1 1 1 0 1 0 1 0 0 -> 0 0 1 1 0 1 1 1 0 1 0 1 0

  19. Bit-parallel Thompsom algorithm b bb*(b|b*a) 6 7 b a 2 3 5 8 9 b E 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 1 0 1 1 0 1 1 1 0 1 0 1 0 0 -> 0 0 1 1 0 1 1 1 0 1 0 1 0 (a) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

  20. Bit-parallel Thompsom algorithm D 0 1 2 3 4 5 6 7 8 9 10 11 12 2 0 0 0 0 0 0 0 0 0 1 0 0 1 b bb*(b|b*a) 6 7 b a 2 3 5 8 9 b E 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 1 0 1 1 0 1 1 1 0 1 0 1 0 0 -> 0 1 1 1 0 1 1 1 0 1 0 1 0 (a) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 1

  21. Bit-parallel Thompsom algorithm D 0 1 2 3 4 5 6 7 8 9 10 11 12 2 0 0 0 0 0 0 0 0 0 1 0 0 1 b bb*(b|b*a) 6 7 b a 2 3 5 8 9 b E 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 1 0 1 1 0 1 1 1 0 1 0 1 0 0 -> 0 1 1 1 0 1 1 1 0 1 0 1 0 -> 0 0 0 0 0 0 0 0 0 0 1 0 0 (a) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 1 0 0 0 0 0 0 0 0 1 0 0 1

  22. Bit-parallel Thompsom algorithm D 0 1 2 3 4 5 6 7 8 9 10 11 12 2 0 0 0 0 0 0 0 0 0 1 0 0 1 b bb*(b|b*a) 6 7 b a 2 3 5 8 9 b E 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 1 0 1 1 0 1 1 1 0 1 0 1 0 0 -> 0 1 1 1 0 1 1 1 0 1 0 1 0 -> 0 0 0 0 0 0 0 0 0 0 1 0 0 (a) 0 0 0 0 0 0 0 0 0 1 0 0 0 (b) 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 1 0 0 1

  23. Bit-parallel Thompsom algorithm D 0 1 2 3 4 5 6 7 8 9 10 11 12 2 0 0 0 0 0 0 0 0 0 1 0 0 1 b bb*(b|b*a) 6 7 b a 2 3 5 8 9 b E 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 1 0 1 1 1 0 1 0 1 0 0 -> 0 1 1 1 0 1 1 1 0 1 0 1 0 -> 0 0 0 0 0 0 0 0 0 0 1 0 0 (a) 0 0 0 0 0 0 0 0 0 1 0 0 0 (b) 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 1

  24. Bit-parallel Thompsom algorithm b bb*(b|b*a) 6 7 b a 2 3 5 8 9 b E 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 -> 0 1 0 0 0 0 0 0 0 0 0 0 0

  25. Bit-parallel Thompsom algorithm b bb*(b|b*a) 6 7 b a 2 3 5 8 9 b E 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 -> 0 1 0 0 0 0 0 0 0 0 0 0 0 (a) 0 0 0 0 0 0 0 0 0 1 0 0 0

  26. Bit-parallel Thompsom algorithm b bb*(b|b*a) 6 7 b a 2 3 5 8 9 b E 1 1, 2, 4, 5, 6, 8, 10 3 2, 3, 4, 5, 6, 8,10 4 4, 5, 6, 8, 10 5 5, 6, 8 7 6, 7, 8 9 9, 12 11 11, 12 0 1 4 12 b 11 10 B 0 1 2 3 4 5 6 7 8 9 10 11 12 a 0 0 0 0 0 0 0 0 0 1 0 0 0 b 0 1 0 1 0 0 0 1 0 0 0 1 0 Text: ababbbaab D 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 0 0 0 0 0 0 0 0 0 0 0 0 -> 0 1 0 0 0 0 0 0 0 0 0 0 0 (a) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

More Related