Improved Two-Way Bit-parallel Search

Branislav Durian, Tamanna Chhabra, Sukhpal Singh Ghuman, Tommi Hirvola, Hannu Peltola, Jorma Tarhio

String matching • String matching can be classified into: • Exact string matching • Approximate string matching • k mismatches • k errors

Our Approach • New sublinear variations of the Shift-Or, Shift-And, and Shift-Add algorithms, all of which apply bit-parallelism. • Key idea: a two-way loop over j in which the text characters ti−j and ti+j are handled together. • The next alignment starts at ti+m. [Figure: text positions ti−j, ti, and ti+j, with spans of length m]

Bit-parallelism • Takes advantage of the internal parallelism of bit operations inside a computer word. • Many values packed into a single word are updated with a single operation. • As a result, many operations of an algorithm can be performed faster.

Previous algorithms: BNDM and its variants • BNDM (Backward Nondeterministic DAWG Matching) is the bit-parallel simulation of an earlier algorithm called BDM (Backward DAWG Matching). • BDM scans the alignment window from right to left and skips characters using a suffix automaton.
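The right-to-left scan and the skipping described above can be sketched in a few lines. This is a minimal illustration of the textbook BNDM scheme, not code from the presentation; the function name `bndm` and the use of a Python dict for the occurrence vectors are assumptions made here.

```python
def bndm(pattern, text):
    """Return 0-based start positions of pattern in text (textbook BNDM sketch)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    mask = (1 << m) - 1
    # B[c] has bit (m-1-k) set when pattern[k] == c (pattern stored reversed).
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - k))
    hits = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = mask                      # every factor of the pattern is still possible
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)   # read the window right to left
            j -= 1
            if D & (1 << (m - 1)):    # the scanned suffix is a prefix of the pattern
                if j > 0:
                    last = j          # remember the shift that aligns this prefix
                else:
                    hits.append(pos)  # the whole window matched
            D = (D << 1) & mask
        pos += last                   # skip ahead without reading every character
    return hits
```

For instance, `bndm("abcab", "xabcabcabx")` reports the two occurrences that start at 0-based positions 1 and 4.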

Previous algorithms: Shift-Or and its variants • The Shift-Or algorithm was the first string matching algorithm applying bit-parallelism. • Operands in the algorithm are bit-vectors, and the essential bit-vector containing the state of the automaton is called the state vector. • The state vector is updated with the bit-shift and OR operations.
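The state-vector update can be made concrete with a short sketch of the classical one-way Shift-Or. The name `shift_or` and the dict-based occurrence table are assumptions for illustration; a bit of the state vector is 0 when the corresponding pattern prefix matches.

```python
def shift_or(pattern, text):
    """Return 0-based start positions of pattern in text (classical Shift-Or sketch)."""
    m = len(pattern)
    if m == 0:
        return []
    mask = (1 << m) - 1
    # Occurrence vectors: bit k of B[c] is 0 exactly when pattern[k] == c.
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, mask) & ~(1 << k)
    D = mask                          # all ones: no prefix matched yet
    hits = []
    for i, c in enumerate(text):
        # One shift and one OR update all m states at once.
        D = ((D << 1) | B.get(c, mask)) & mask
        if (D & (1 << (m - 1))) == 0: # top bit 0: the full pattern ends at i
            hits.append(i - m + 1)
    return hits
```

Each text character costs one shift and one OR, regardless of how many prefixes are alive, as long as m fits in a machine word (Python integers hide that limit).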

Previous algorithms: For the k-mismatches problem • Shift-Add is a bit-parallel algorithm for the k-mismatches problem. • A state vector D of m states is used to represent the state of the search.
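A simplified sketch of the Shift-Add idea follows. It packs one mismatch counter per pattern position into a single integer and updates all of them with one shift and one addition. As an assumption made here for brevity, each counter is given enough bits to hold a count up to m, so no overflow handling is needed; the original algorithm uses narrower fields with overflow bits instead. The name `shift_add` is hypothetical.

```python
def shift_add(pattern, text, k):
    """Return (start, mismatches) for every window with at most k mismatches.

    Simplified sketch: counters are wide enough that they never overflow.
    """
    m = len(pattern)
    if m == 0:
        return []
    b = m.bit_length()                # field width: holds any count in 0..m
    ones = sum(1 << (j * b) for j in range(m))   # a 1 in every field
    full = (1 << (m * b)) - 1
    # T[c]: field j is 0 when pattern[j] == c, otherwise 1 (one mismatch).
    T = {}
    for j, c in enumerate(pattern):
        T[c] = T.get(c, ones) - (1 << (j * b))
    state = 0
    hits = []
    for i, c in enumerate(text):
        # Shift every counter one position up, then add this character's mismatches.
        state = ((state << b) + T.get(c, ones)) & full
        errors = state >> ((m - 1) * b)          # counter of the full-length state
        if i >= m - 1 and errors <= k:
            hits.append((i - m + 1, errors))
    return hits
```

With pattern `"abc"`, text `"abdabc"`, and k = 1, the sketch reports the window at position 0 with one mismatch and the exact occurrence at position 3.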

Our Algorithms • Exact string matching • TSO (Two-way Shift-Or) • TSA (Two-way Shift-And) • Approximate string matching with k mismatches • TSAdd (Two-way Shift-Add) • Tuned Shift-Add

TSO • TSO (Two-way Shift-Or) uses the same occurrence vectors B for characters as the original Shift-Or. • The outer loop traverses the text with a fixed step of m characters. At each step i, an alignment window ti−m+1, …, ti+m−1 is inspected.

Example of the inner loop of TSO.

Example (Cont…)
T = … x a b c a b c a b x …
a: D = 1 0 1 1 0
j = 1: c 1 1 0 1 1, b 0 1 1 0 1; D = 1 0 1 1 0
j = 2: b 0 1 1 0 1, c 1 1 0 1 1; D = 1 0 1 1 0

Example (Cont…)
T = … x a b c a b c a b x …
D = 1 0 1 1 0
j = 3: a 1 0 1 1 0, a 1 0 1 1 0; D = 1 0 1 1 0
j = 4: x 1 1 1 1 1, b 0 1 1 1 1; D = 1 0 1 1 0
E = 0 1 0 0 1
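The two-way loop can be sketched as follows. This is an illustrative reading of the slides, not the authors' code: the function name `tso`, the handling of the text boundary, and reporting positions instead of counting with popcount are assumptions. The pattern `abcab` used in the usage note is inferred from the occurrence vectors in the example; the slides do not state it. Bit k of D stands for the candidate occurrence that starts at position i − k, and a 0 bit means the candidate is still alive.

```python
def tso(pattern, text):
    """Return 0-based start positions of pattern in text (two-way Shift-Or sketch)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    mask = (1 << m) - 1
    # Same occurrence vectors as Shift-Or: bit k of B[c] is 0 when pattern[k] == c.
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, mask) & ~(1 << k)
    hits = []
    for i in range(m - 1, n, m):              # fixed step of m characters
        D = B.get(text[i], mask)              # start from the middle character
        j = 1
        while D != mask and j < m:
            left = (B.get(text[i - j], mask) << j) & mask
            # Past the end of the text, treat the character as a mismatch (assumption).
            right_src = B.get(text[i + j], mask) if i + j < n else mask
            right = right_src >> j
            D |= left | right                 # handle t[i-j] and t[i+j] together
            j += 1
        # Every 0 bit left in D is an occurrence; E = ~D & mask marks them.
        for k in range(m - 1, -1, -1):
            if ((D >> k) & 1) == 0:
                hits.append(i - k)
    return hits
```

For `tso("abcab", "xabcabcabx")`, the first window is centred on the middle `a` and ends with D = 10110 and E = 01001, the same final values as in the example; the two set bits of E correspond to the occurrences starting at 0-based positions 1 and 4.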

TSA • Shift-And is a dual method of Shift-Or. • TSA applies Shift-And and is a dual method of TSO.
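The duality can be illustrated by flipping the bit convention of a two-way Shift-Or loop: a 1 bit now marks a live candidate, OR becomes AND, and the bits vacated by a shift are filled with ones so that they impose no constraint. This is a sketch under those assumptions, with the hypothetical name `tsa`; the presentation does not show the authors' formulation.

```python
def tsa(pattern, text):
    """Return 0-based start positions of pattern in text (two-way Shift-And sketch)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    mask = (1 << m) - 1
    # Shift-And occurrence vectors: bit k of B[c] is 1 when pattern[k] == c.
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << k)
    hits = []
    for i in range(m - 1, n, m):
        D = B.get(text[i], 0)
        j = 1
        while D != 0 and j < m:
            low_fill = (1 << j) - 1           # bits a left shift leaves empty
            high_fill = mask ^ (mask >> j)    # bits a right shift leaves empty
            left = ((B.get(text[i - j], 0) << j) | low_fill) & mask
            right_src = B.get(text[i + j], 0) if i + j < n else 0
            right = (right_src >> j) | high_fill
            D &= left & right
            j += 1
        for k in range(m - 1, -1, -1):        # each 1 bit is an occurrence at i - k
            if (D >> k) & 1:
                hits.append(i - k)
    return hits
```

The loop exits early when D becomes 0, which mirrors the all-ones exit condition of the Shift-Or variant.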

TSAdd for k mismatches • The two-way approach succeeds in exact matching because of its simple analogy to the one-way algorithms (Shift-Or, Shift-And). • Key trick: use the overflow bits in the state vector D. • A logical AND operation is applied between the occurrence vector and the right-shifted, complemented state vector. • This idea is applied in the Two-way Shift-Add.

Tuned Shift-Add • Tuned Shift-Add is a minimalist version of the Shift-Add algorithm. • If the bit-vectors fit into a computer register, the worst- and average-case complexity of the original Shift-Add algorithm is O(n). • The original Shift-Add algorithm uses an overflow vector in addition to the state vector.

Analysis - TSO • TSO is linear in the worst case and sublinear in the average case. • The outer loop of TSO is executed n/m times. In each round, the inner loop is executed at most m − 1 times. • The most trivial implementation of popcount requires O(m) time, so the total time in the worst case is O(nm/m) = O(n). • The same analysis applies to TSA.
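The worst-case bound can be written out as a short calculation. This is a sketch under the slide's own assumptions (n/m outer rounds, at most m − 1 inner iterations per round, an O(m) popcount); the constants c_1 and c_2 are introduced here only for illustration.

```latex
T_{\mathrm{worst}}(n)
  \le \left\lceil \frac{n}{m} \right\rceil \bigl( c_1 (m-1) + c_2\, m \bigr)
  = O\!\left( \frac{n}{m} \cdot m \right)
  = O(n)
```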

Analysis - TSAdd • The outer loop of TSAdd is executed n/m times, and in each iteration O(m) text characters are read and O(m) occurrences are reported. • Thus, the total time complexity is O(n/m) · O(m + m) = O(n) in the worst case. • In the average case TSAdd is sublinear. This can be seen from the test results, where the search time decreases as m gets larger.

Analysis - Tuned Shift-Add • The worst- and average-case complexity of the original Shift-Add algorithm is O(n). • Tuned Shift-Add is also linear.

Experimental Results • In the test runs we used binary, DNA, and English texts. • The best execution times are boxed in the tables presented on the following slides. • The tables show that our algorithms run faster than the previous algorithms, especially for larger pattern lengths.

Search time (ms) for binary data, by pattern length

Search time (ms) for DNA data, by pattern length

Search time (ms) for English data, by pattern length

Algorithms for k mismatches: search times (ms) for k = 1

Search times (ms) for k=2

Search times (ms) for k=3

Conclusion • The new algorithms and their tuned versions are efficient both in theory and practice. • They run in linear time in the worst case and in sublinear time in the average case.

THANK YOU
