## Multiple Pattern Matching Revisited


Robert Susik¹, Szymon Grabowski¹, Kimmo Fredriksson²

¹ Lodz University of Technology, Institute of Applied Computer Science, Łódź, Poland
² University of Eastern Finland, School of Computing, Kuopio, Finland

PSC, Prague, Sept. 2014

**Multiple pattern matching**

• The problem:
  • report all positions i, 1 ≤ i ≤ n, of the text T[1..n] at which one of the r patterns P[1..m] matches, with the text and the patterns over a common integer alphabet of size σ.
• Usage:
  • antivirus scanning,
  • intrusion detection,
  • web searches,
  • etc.

**Related work**

• Aho–Corasick (1975), works in linear time,
• Commentz–Walter (1979), based on the Boyer–Moore (BM) algorithm, a suffix-based approach,
• Fredriksson and Grabowski (2009), an average-optimal filtering variant of the classic AC algorithm,
• Wu and Manber (1994), based on backward matching over a sliding text window.

[Figures: Aho–Corasick trie for he, she, his and hers; Commentz–Walter trie for he, she, his and hers; Wu and Manber Boyer–Moore approach.]

Images taken from: S.M. Vidanagamachchi, S.D. Dewasurendra, R.G. Ragel, M. Niranjan, "Commentz-Walter: Any Better than Aho-Corasick for Peptide Identification?", November 2012; Koloud Al-Khamaiseh, Int. Journal of Engineering Research and Applications, July 2014.

**Related work**

• DAWG-match (Crochemore et al., 1999) and MultiBDM (Crochemore & Rytter, 1994), based on backward matching, linear in the worst case, complex,
• Multi-BNDM (Navarro & Raffinot, 1998), a simplified, bit-parallel version,
• Set Backward Oracle Matching (Allauzen & Raffinot, 1999), similar to the above but simpler and very efficient in practice,
• Succinct Backward DAWG Matching (Fredriksson, 2003), practical for huge pattern sets thanks to its succinct index,
• Faro & Külekci (2012), use of SSE technology, e.g. the wsfp (word-size fingerprint instruction) operation, used to identify text blocks that may contain a matching pattern,
• Salmela et al.
tried a similar approach to ours (not very successful for short patterns in their tests), 2006.

**Shift-Or (Baeza-Yates & Gonnet, 1992)**

• Shift-Or simulates a non-deterministic finite automaton (NFA) with bit-parallelism
• Bit-parallelism:
  • Frequently used in stringology when the results of single operations are booleans or small integers
  • Many operations (even w, the computer word size) can be performed in parallel
  • Reinvented several times, but BY-G (1992) is the best known

**Shift-Or at work**

T = gcatcgcagagat, P = gcaga

Preprocessing: B[] holds one bit-vector per alphabet symbol, m·σ bits in total (shown in pattern order, 0 where the symbol occurs):

    P    = gcaga
    B[g] = 01101
    B[c] = 10111
    B[a] = 11010

Search:

    V := ~0; i := 0
    while i < n do
        V := (V << 1) | B[T[i]]
        if (m–1)-th bit of V is 0 then report match ending at position i
        i := i + 1

**Shift-Or**

• Pros:
  • Fast: O(n⌈m/w⌉) time in the worst case
  • when m = O(w), it is linear in time
• Cons:
  • The average case is the same as the worst case, but faster methods are possible

**Average Optimal Shift-Or (AOSO) (Fredriksson & Grabowski, 2005, 2009)**

• Motivation:
  • Improve the average case of Shift-Or
• Idea:
  • Sample T every k symbols: T’ = t0, tk, t2k, …
  • Need to match k subpatterns of P: P0, …, Pk–1, each sampled in the same way as T, starting from 0, 1, …, k–1
  • When some subpattern matches, verify whether there is a true match

**AOSO – example**

T = gcatcgcagagat, P = gcagag (k = 2)
T’ = gaccggt
P0 = gaa, P1 = cgg

Processing (one step per slide):

    T’ = g.a.c.c.g.g.t
    P0 = g a a
    P1 = c g g

• first three steps: no match of subpattern
• fourth step: match of subpattern!
verification in T – success
• next two steps: no match of subpattern

**AOSO**

• Pros:
  • Faster than Shift-Or: O(n log_σ(m) / m) time in the average case
• Cons:
  • Needs verification to exclude false matches, which is not a big problem in practice

**Multi-pattern AOSO (MAOSO)**

• Idea:
  • Merge the r input patterns into one superimposed pattern
  • Check only the one superimposed pattern, then exclude false matches
• Example (for r = 2):
  • P0 = ATGG, P1 = ACTA
  • Merging: P* = [A][TC][GT][GA]

**MAOSO – some details**

• Build the bit-vectors as in Shift-Or, marking a position as matching for a symbol if any of the patterns has that symbol at that position of the superimposed pattern
• Use AOSO for such a superimposed pattern
• Problem: if r is large and (especially) σ is small, there are many verifications

**Q-grams**

Idea: group q successive text characters into supersymbols. New alphabet size: σ^q. Enlarging the alphabet may reduce the number of comparisons between the text and the pattern.

**Alphabet mapping**

Map the large alphabet of σ symbols to a smaller alphabet of σ’ symbols. We achieve this using a bin-packing method.

[Figure: example of a new alphabet, σ’ = 4]

**Multi AOSO on q-Grams (MAG)**

• The super-alphabet reduces the number of verifications. The probability of a match is p = O(qr / σ^q), so the verification probability is O(p^⌊m/(kq)⌋) and the cost is O(rqm)
• q-gram based search makes the steps bigger (equal to q); in other words, the text is shorter (n/q)
• FAOSO runs in O(n/k · ⌈(m/q)/w⌉) time in our case, where w is the number of bits in a computer word (typically 64).

**Simple Multi AOSO on q-Grams (SMAG)**

• Simpler version of the previously mentioned method. In this case the whole text is encoded prior to starting the actual search algorithm, which is then more streamlined.
• Total complexity is Ω(n), the time to encode the text.
• A little faster search, but a much longer preprocessing phase.
• May be useful if the text is searched multiple times in a short period and we have space to store it in encoded form.

**Experimental results**

• Hardware: Intel Core i3 2100 3.1 GHz CPU with 128 KB L1, 512 KB L2 and 3 MB L3 cache, 4 GB of 1333 MHz DDR3 RAM
• Compiler: g++ version 4.8.1 with -O3 optimization
• OS: 64-bit Ubuntu with kernel 3.11.0-17
• Text: taken from the Pizza & Chili Corpus (http://pizzachili.dcc.uchile.cl), 200 MB each
• Tests: all source codes were obtained from their authors and compiled on the same test machine (some of them cannot handle long patterns, i.e. m = 64).

**Conclusions**

• Our work can be seen as a new and quite successful combination of known building blocks.
• The presented algorithm, MAG, usually wins against its competitors on the three test datasets (english, proteins and dna).
• One of the key successful ideas was alphabet quantization (binning), which is performed in a greedy manner, after sorting the original alphabet by frequency.

**Future work**

• Different alphabet mapping techniques could improve efficiency.
• Is it possible to choose the algorithm’s parameters so as to reach average optimality (for m = O(w))?
• SSE instructions seem to offer great opportunities, especially for bit-parallel algorithms.
• Dense codes (e.g., ETDC) for words or q-grams not only serve for compressing data (texts), but also enable faster pattern searches (our preliminary results are rather promising).
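
**Appendix: a minimal sketch of the sampling + superimposition idea**

The Shift-Or, AOSO and MAOSO slides above describe three ingredients: bit-parallel Shift-Or, sampling the text every k symbols, and merging r equal-length patterns into one superimposed pattern followed by verification. The Python sketch below combines them in the simplest way we could think of; it is not the authors' implementation. The names `build_masks` and `sampled_multi_shift_or` are hypothetical, positions are 0-based, q-grams and alphabet mapping are left out, and the k subpattern offsets are handled in k separate passes (a tuned implementation would more likely process them together in one machine word).

```python
def build_masks(subpatterns):
    """Shift-Or masks for a superimposed pattern.

    Bit i of masks[c] is 0 iff at least one subpattern has symbol c at
    position i; `full` (all ones) is the mask of symbols that never occur.
    """
    length = len(subpatterns[0])
    full = (1 << length) - 1
    masks = {}
    for sub in subpatterns:
        for i, c in enumerate(sub):
            masks[c] = masks.get(c, full) & ~(1 << i)
    return masks, full


def sampled_multi_shift_or(text, patterns, k=2):
    """Return sorted (position, pattern) pairs for equal-length patterns."""
    m = len(patterns[0])
    if any(len(p) != m for p in patterns) or not 1 <= k <= m:
        raise ValueError("patterns must share one length m, and 1 <= k <= m")
    sub_len = m // k                # sampled symbols compared per candidate
    pattern_set = set(patterns)
    sampled = text[::k]             # T' = t0, tk, t2k, ...
    hits = []
    for j in range(k):              # one pass per subpattern offset j
        subs = [p[j::k][:sub_len] for p in patterns]
        masks, full = build_masks(subs)
        top = 1 << (sub_len - 1)
        state = full                # all ones: no prefix matched yet
        for s, c in enumerate(sampled):
            state = ((state << 1) | masks.get(c, full)) & full
            if state & top == 0:    # superimposed subpattern ends at T'[s]
                start = (s - sub_len + 1) * k - j
                candidate = text[start:start + m] if start >= 0 else ""
                if candidate in pattern_set:   # verification step
                    hits.append((start, candidate))
    return sorted(hits)


# The AOSO example from the slides: T = gcatcgcagagat, P = gcagag, k = 2.
print(sampled_multi_shift_or("gcatcgcagagat", ["gcagag"], k=2))
# prints [(5, 'gcagag')]
```

With k = 1 the sketch degenerates to plain multi-pattern Shift-Or over the superimposed pattern. Larger k reads fewer text symbols but compares only ⌊m/k⌋ symbols per candidate, so more candidates reach the verification step, which is the trade-off noted on the MAOSO slide.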