# Boyer Moore Searches on Binary Texts

##### Presentation Transcript

1. Accelerating Boyer Moore Searches on Binary Texts. Shmuel Tomi Klein and Miri Kopel Ben-Nissan, Bar Ilan University, Israel

2. Outline: Background and motivation; Boyer Moore algorithm; New binary variant; Analysis; Experiments; Summary

3. Pattern matching: an important application of automata (KMP, BDM, BM). Boyer & Moore: match backwards! [Figure: the text "this-is-a-sample-text---" with a pattern aligned against it.]

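4. Boyer-Moore algorithm, mismatch case 1 (delta1): b does not occur in x. [Figure: text y contains b followed by u; pattern x contains a followed by u. The suffix u has matched, b mismatches a, and x, which contains no b, is shifted past b.]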

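5. Boyer-Moore algorithm, mismatch case 2 (delta1): b occurs in x. [Figure: same text y and pattern x as in case 1; x is shifted so that its rightmost b, after which x contains no b, is aligned with the mismatching b in y.]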
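
The two delta1 cases can be sketched in a few lines of Python. This is a minimal illustration of the classical character-level table, not code from the talk; the function names `build_delta1` and `delta1_lookup` are hypothetical. It follows the original Boyer-Moore convention in which delta1 is the distance from the rightmost occurrence of a character to the end of the pattern, and equals the pattern length when the character does not occur.

```python
def build_delta1(pattern):
    """Map each pattern character to its distance from the pattern's end.

    Later occurrences overwrite earlier ones, so the stored value always
    refers to the rightmost occurrence (case 2 on the slides).
    """
    m = len(pattern)
    delta1 = {}
    for j, ch in enumerate(pattern):
        delta1[ch] = m - 1 - j
    return delta1


def delta1_lookup(delta1, ch, m):
    """Return delta1 for ch; characters absent from the pattern give m (case 1)."""
    return delta1.get(ch, m)


table = build_delta1("example")
print(delta1_lookup(table, "p", 7))  # 'p' occurs in the pattern
print(delta1_lookup(table, "s", 7))  # 's' does not occur
```

With the pattern "example", a character that is absent from the pattern allows the largest possible jump, which is why delta1 does most of the work on large alphabets.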

6. Boyer-Moore algorithm, mismatch case 3 (delta2): u reoccurs in x, preceded by c ≠ a. [Figure: x is shifted so that the reoccurrence cu is aligned with the matched u in y.]

7. Boyer-Moore algorithm, mismatch case 4 (delta2): only a suffix v of u reoccurs in x. [Figure: x is shifted so that its occurrence of v is aligned with the suffix v of the matched u in y.]

8. Boyer-Moore example. [Figure: the pattern "example" is slid along the text "here is a simple example" in successive steps; the shifts are labeled delta1 and delta2.]

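9. Problems of binary Boyer & Moore: bit-level processing; in the character case most work is done by delta1, but on binary text delta1 is useless. [Figure: character text "this-is-a-sample-text---" with a pattern, next to binary text 0100101101011101000100110101001 with pattern 1101100.]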
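
To see why delta1 helps so little at the bit level, the classical table can be built for the binary pattern shown on the slide. The sketch below assumes the standard character-level definition of delta1 (distance from the rightmost occurrence to the end of the pattern); `build_delta1` is a hypothetical helper name.

```python
def build_delta1(pattern):
    """Classical delta1: distance from a symbol's rightmost occurrence to the end."""
    m = len(pattern)
    return {ch: m - 1 - j for j, ch in enumerate(pattern)}


binary_pattern = "1101100"          # the 7-bit pattern from the slide
table = build_delta1(binary_pattern)
print(table)                        # only two entries: one for '1', one for '0'
print(max(table.values()), "out of a possible", len(binary_pattern))
```

Both symbols occur near the end of almost any binary pattern, so the entries stay small compared with the pattern length.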

10. Need for binary Boyer & Moore: compressed matching. Given E(T) and P, look for E(P) in E(T) rather than for P in D(E(T)). Suggested solution: BBBMM (Blocked Binary Boyer Moore Matching).
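
The idea of searching for E(P) in E(T) can be illustrated as follows. This is only a sketch: the prefix-free code `CODE` is a made-up toy table standing in for a Huffman code, the names `encode` and `find_in_compressed` are hypothetical, and the bit strings are kept as Python strings of '0' and '1' for readability. It uses Python's built-in substring search, not BBBMM.

```python
# Hypothetical prefix-free code, chosen only for illustration.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def encode(s):
    """Return E(s) as a string of '0'/'1' characters."""
    return "".join(CODE[ch] for ch in s)


def find_in_compressed(text, pattern):
    """Return the bit offset of the first occurrence of E(P) in E(T), or -1.

    The text is never decoded: the pattern is encoded once and the search
    runs directly on the encoded bit sequence.
    """
    return encode(text).find(encode(pattern))


print(encode("abacabad"))
print(find_in_compressed("abacabad", "cab"))
```

The reported position is a bit offset into E(T), not a character index into T.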

11. BBBMM. [Figure: a text block Text[i] of k bits is compared with the pattern block Pat[sh, j]; further labels: sh, sl.]
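
One way to read the Pat[sh, j] notation is as a table holding, for each bit offset sh within a block, the pattern cut into k-bit blocks. The sketch below works under that assumption with k = 8. It adds a mask per block for the bits the pattern does not cover, which is an assumption of this sketch rather than something the slide states, and it scans every position naively with no delta1 or delta2 shifts. It is therefore not the authors' BBBMM algorithm, only an illustration of block-wise comparison. All function names are hypothetical.

```python
K = 8  # block size in bits; the text is a Python bytes object


def shifted_blocks(pattern_bits, sh):
    """Return (values, masks) for the pattern placed sh bits into a block."""
    m = len(pattern_bits)
    n_blocks = (sh + m + K - 1) // K
    total = n_blocks * K
    bits = ("0" * sh + pattern_bits).ljust(total, "0")
    mask = ("0" * sh + "1" * m).ljust(total, "0")   # 1 = bit covered by pattern
    values = [int(bits[j:j + K], 2) for j in range(0, total, K)]
    masks = [int(mask[j:j + K], 2) for j in range(0, total, K)]
    return values, masks


def build_pat_table(pattern_bits):
    """Table indexed by sh in 0..K-1, analogous in spirit to Pat[sh, j]."""
    return [shifted_blocks(pattern_bits, sh) for sh in range(K)]


def blocked_search(text, pattern_bits):
    """Return all bit offsets at which pattern_bits occurs in text (bytes)."""
    table = build_pat_table(pattern_bits)
    hits = []
    for i in range(len(text)):
        for sh, (values, masks) in enumerate(table):
            if i + len(values) > len(text):
                continue  # pattern would run past the end of the text
            # Compare whole blocks from right to left, as Boyer-Moore does.
            if all((text[i + j] & masks[j]) == values[j]
                   for j in reversed(range(len(values)))):
                hits.append(i * K + sh)
    return hits


text = bytes([0b01001011, 0b01011101])   # first 16 bits of the slide's binary text
print(blocked_search(text, "10110"))
```

Each comparison here handles up to eight bits at once, which is the source of the lower comparison counts that a blocked method aims for.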

12. BBBMM: more information in the binary case. [Figure: ASCII strings "ffghabdgttiocb" and "sbgghj"; binary blocks 01100010 and 01101010.]

13. BBBMM: extended delta1. [Figure: text blocks i - 1, i, i + 1 of T aligned against P; bit strings shown: 101, 101, 100, 101, 01.]

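14. BBBMM: total size of delta1 tables (formula not preserved in the transcript). If too large, use a limit value; the size of the delta1 tables is then reduced (formula not preserved). [Figure labels: K, T, P, sl, k.]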

15. BBBMM. Original delta1: increase of text pointer. BBBMM delta1: shift size. Mismatch not in last block: Correct[sh, j]. [Figure: T and P.]

16. BBBMM: delta2. [Figure: T and P.]

17. Analysis. Assumption: random input (reasonable for compressed text). Expected # comparisons till mismatch, bit-wise and blocked (formulas not preserved in the transcript).
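
The formulas on this slide are not preserved, so the following is a simplified model and not a reconstruction of them. It assumes independent uniform random bits, ignores the end of the pattern, and treats every block as a full k-bit block. Under those assumptions the number of comparisons up to and including the first mismatch is geometric, with match probability 1/2 per bit and 2^-k per block. The function name `expected_comparisons` is hypothetical.

```python
from fractions import Fraction


def expected_comparisons(match_prob):
    """Mean of a geometric count: comparisons up to and including the first mismatch."""
    return 1 / (1 - match_prob)


k = 8
bitwise = expected_comparisons(Fraction(1, 2))       # one bit per comparison
blocked = expected_comparisons(Fraction(1, 2 ** k))  # one k-bit block per comparison
print(bitwise, blocked, float(blocked))
```

In this idealized model a blocked comparison almost always fails on the first block, while a bit-wise scan needs two comparisons on average.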

18. Analysis. Expected # bits shifted after mismatch. Bit-wise: M. Blocked: M'.

19. Experiments. Texts: English Bible (2.5 MB), World Factbook (1.5 MB). Text: Huffman encoded, k = 8. Patterns: random substrings of lengths 10 to 500.

20. Experiments: average # comparisons between shifts. [Plot: x-axis length of pattern (100 to 500); y-axis 1.1 to 1.5; curves: Bit-wise, Blocked.]

21. Experiments: average size of shifts. [Plot: x-axis length of pattern (100 to 500); y-axis 20 to 100; curves: Blocked, Bit-wise.]

22. Experiments: average # comparisons for 1000 bits. [Plot: x-axis length of pattern (100 to 500); y-axis 100 to 500; curves: Bit-wise, BDM, Blocked.]

23. Experiments: time to locate first occurrence (ms). [Plot: x-axis length of pattern (100 to 500); y-axis 50 to 300; curves: Bit-wise, BDM, Turbo-BDM, Blocked.]

24. Summary. Blocked variant of BM. Faster than alternatives; overhead 1-10 K. Extensions: ASCII, words instead of characters.

25. Thank you !