
Multiple Pattern Matching Revisited

Multiple Pattern Matching Revisited. Robert Susik¹, Szymon Grabowski¹, Kimmo Fredriksson². ¹ Lodz University of Technology, Institute of Applied Computer Science, Łódź, Poland; ² University of Eastern Finland, School of Computing, Kuopio, Finland. PSC, Prague, Sept. 2014.



Presentation Transcript


  1. Multiple Pattern Matching Revisited. Robert Susik¹, Szymon Grabowski¹, Kimmo Fredriksson². ¹ Lodz University of Technology, Institute of Applied Computer Science, Łódź, Poland; ² University of Eastern Finland, School of Computing, Kuopio, Finland. PSC, Prague, Sept. 2014

  2. Multiple pattern matching • The problem: • given a text T[1..n] and r patterns, each of the form P[1..m], report every text position i, 1 ≤ i ≤ n, at which at least one of the patterns occurs; the text and the patterns are over a common integer alphabet of size σ. • Usage: • antivirus scanning, • intrusion detection, • web searches, • etc.

  3. Related work • Aho–Corasick (1975), works in linear time, • Commentz-Walter (1979), based on the Boyer–Moore (BM) algorithm, a suffix-based approach, • Fredriksson and Grabowski (2009), an average-optimal filtering variant of the classic AC algorithm, • Wu and Manber (1994), based on backward matching over a sliding text window. [Figures: Aho–Corasick trie for he, she, his and hers; Commentz-Walter trie for he, she, his and hers; Wu and Manber Boyer–Moore approach.] Images taken from: S.M. Vidanagamachchi, S.D. Dewasurendra, R.G. Ragel, M. Niranjan, "Commentz-Walter: Any Better than Aho-Corasick for Peptide Identification?", November 2012; Koloud Al-Khamaiseh, Int. Journal of Engineering Research and Applications, July 2014

  4. Related work • DAWG-match (Crochemore et al., 1999) and MultiBDM (Crochemore & Rytter, 1994), based on backward matching, linear in the worst case but complex, • Multi-BNDM (Navarro & Raffinot, 1998), a simplified, bit-parallel version, • Set Backward Oracle Matching (Allauzen & Raffinot, 1999), similar to the above but simpler and very efficient in practice, • Succinct Backward DAWG Matching (Fredriksson, 2003), practical for huge pattern sets thanks to the use of a succinct index, • Faro & Külekci (2012), use of SSE technology, e.g. the wsfp (word-size fingerprint instruction) operation, to identify text blocks that may contain a matching pattern, • Salmela et al. (2006) tried an approach similar to ours (not very successful for short patterns in their tests).

  5. Shift-Or (Baeza-Yates & Gonnet, 1992) • Shift-Or simulates a non-deterministic finite automaton (NFA) using bit-parallelism • Bit-parallelism: • Frequently used in stringology when the results of single operations are Booleans or small integers • Many operations (up to w, the computer word size) can be performed in parallel • Reinvented several times, but BY-G (1992) is the best known

  6. Shift-Or at work
     Example: T = gcatcgcagagat, P = gcaga
     Preprocessing (the bit for pattern position 0 is written leftmost):
       B[g] = 01101
       B[c] = 10111
       B[a] = 11010
     B[] holds one bit-vector for each alphabet symbol, m·σ bits in total.
     Search:
       V := ~0; i := 0
       while i < n do
         V := (V << 1) | B[T[i]]
         if the (m–1)-th bit of V is 0 then report a match ending at position i
         i := i + 1
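The pseudocode on this slide can be turned into a short runnable sketch. The Python version below is an illustration only, not the authors' code: the function name `shift_or` is hypothetical, Python's unbounded integers stand in for a w-bit machine word, and symbols absent from the pattern get an all-ones vector by default.

```python
def shift_or(text: str, pattern: str) -> list[int]:
    """Return the 0-based END positions of all occurrences of pattern in text."""
    m = len(pattern)
    mask = (1 << m) - 1                  # keep only the m low bits
    # Preprocessing: B[c] has a 0 at bit j exactly when pattern[j] == c.
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, mask) & ~(1 << j)
    V = mask                             # all ones: no prefix matched yet
    ends = []
    for i, c in enumerate(text):
        # Shift brings in a 0 at bit 0; OR-ing with B[c] kills non-matching prefixes.
        V = ((V << 1) | B.get(c, mask)) & mask
        if V & (1 << (m - 1)) == 0:      # bit m-1 clear: whole pattern matched
            ends.append(i)
    return ends


print(shift_or("gcatcgcagagat", "gcaga"))  # the slide's example text and pattern
```

Bit j of `V` is 0 exactly when the pattern prefix of length j + 1 matches the text ending at the current position, so a clear top bit signals a full occurrence.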

  7. Shift-Or • Pros: • Fast: O(nm / w) time in the worst case • when m = O(w), the time is linear • Cons: • The average case is the same as the worst case, although faster methods are possible

  8. Average Optimal Shift-Or (AOSO) (Fredriksson & Grabowski, 2005, 2009) • Motivation: • Improve the average case of Shift-Or • Idea: • Sample T every k symbols: T’ = t0, tk, t2k, … • Match k subpatterns of P: P0, …, Pk–1, each sampled in the same way as T, starting from positions 0, 1, …, k–1 • When some subpattern matches, verify whether there is a true match

  9. AOSO – example
     T = gcatcgcagagat, P = gcagag
     T’ = gaccggt, P0 = gaa, P1 = cgg
     Processing:
       T’ = g.a.c.c.g.g.t
       P0 = g a a
       P1 = c g g
     no match of subpattern

  12. AOSO – example
     T = gcatcgcagagat, P = gcagag
     T’ = gaccggt, P0 = gaa, P1 = cgg
     Processing:
       T’ = g.a.c.c.g.g.t
       P1 =       c.g.g
     match of subpattern (P1)! verification in T – success

  15. AOSO • Pros: • Faster than Shift-Or: O(n log_σ(m) / m) time in the average case • Cons: • Needs verification to exclude false matches, which is not a big problem in practice
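The sampling idea behind AOSO can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the name `aoso_sketch` is hypothetical, each subpattern is truncated to ⌊m/k⌋ symbols, and a plain substring scan over the sampled text stands in for the bit-parallel Shift-Or filter that AOSO actually uses.

```python
def aoso_sketch(text: str, pattern: str, k: int) -> list[int]:
    """Return the 0-based START positions of pattern in text via k-sampling."""
    m = len(pattern)
    sub_len = m // k                      # symbols kept from each subpattern
    if sub_len == 0:
        raise ValueError("k must not exceed the pattern length")
    # Subpattern j takes pattern[j], pattern[j+k], ... (truncated to sub_len).
    subs = [pattern[j::k][:sub_len] for j in range(k)]
    sampled = text[::k]                   # T' = t0, tk, t2k, ...
    hits = set()
    for j, sub in enumerate(subs):
        # Filtering phase (simplified): find the subpattern in the sampled text.
        start = sampled.find(sub)
        while start != -1:
            cand = start * k - j          # candidate start of P in the full text
            # Verification phase: compare the whole pattern against T.
            if cand >= 0 and text.startswith(pattern, cand):
                hits.add(cand)
            start = sampled.find(sub, start + 1)
    return sorted(hits)


print(aoso_sketch("gcatcgcagagat", "gcagag", 2))  # the slides' example, k = 2
```

Only every k-th text symbol is inspected during filtering; the full pattern is touched only when a subpattern matches, which is why verification is needed but rarely dominates.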

  16. Multi-pattern AOSO (MAOSO) • Idea: • Merge the r input patterns into one superimposed pattern • Search for the superimposed pattern only, then exclude false matches • Example (for r = 2): • P0 = ATGG, P1 = ACTA • Merging: P* = [A][TC][GT][GA]

  17. MAOSO – some details • Set a bit in the bit-vectors (in the manner of Shift-Or) whenever any of the patterns has the given symbol at the given position of the superimposed pattern • Run AOSO on this superimposed pattern • Problem: if r is large and (especially) σ is small, many verifications are triggered
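A minimal sketch of pattern superimposition with verification is given below. To keep it short it runs plain Shift-Or (without the k-sampling of AOSO) on the superimposed pattern, so it illustrates the merging and verification steps rather than MAOSO itself; the function name `superimposed_search` is hypothetical and all patterns are assumed to have the same length.

```python
def superimposed_search(text: str, patterns: list[str]) -> list[tuple[int, int]]:
    """Return (start, pattern_index) pairs for equal-length patterns."""
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("all patterns must have the same length")
    mask = (1 << m) - 1
    # B[c] gets a 0 at bit j if ANY pattern has symbol c at position j.
    B = {}
    for p in patterns:
        for j, c in enumerate(p):
            B[c] = B.get(c, mask) & ~(1 << j)
    V = mask
    matches = []
    for i, c in enumerate(text):
        V = ((V << 1) | B.get(c, mask)) & mask
        if V & (1 << (m - 1)) == 0:
            start = i - m + 1
            window = text[start:i + 1]
            # Verification: the superimposed filter can report false matches.
            matches.extend(
                (start, idx) for idx, p in enumerate(patterns) if p == window
            )
    return matches


# Patterns from the slide; "ATTA" passes the filter [A][TC][GT][GA] but is rejected.
print(superimposed_search("ATTAACTAATGG", ["ATGG", "ACTA"]))
```

The window "ATTA" in the usage line shows the problem named on the slide: it fits the superimposed pattern although it equals neither input pattern, so it costs one verification.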

  18. Q-grams Idea: group q successive text characters into supersymbols. New alphabet size: σ^q. Enlarging the alphabet may reduce the number of comparisons between the text and the pattern.
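The grouping into supersymbols can be illustrated as follows. This is a sketch that assumes non-overlapping q-grams and a base-σ encoding; the name `qgram_encode` and the parameters `alphabet` and `offset` are hypothetical and do not come from the slides.

```python
def qgram_encode(text: str, q: int, alphabet: str, offset: int = 0) -> list[int]:
    """Map non-overlapping q-grams of text[offset:] to integers in [0, sigma**q)."""
    sigma = len(alphabet)
    rank = {c: r for r, c in enumerate(alphabet)}   # symbol -> 0..sigma-1
    codes = []
    for i in range(offset, len(text) - q + 1, q):
        code = 0
        for c in text[i:i + q]:
            code = code * sigma + rank[c]           # read the q-gram as a base-sigma number
        codes.append(code)
    return codes


# With sigma = 4 and q = 2 there are 4**2 = 16 possible supersymbols.
print(qgram_encode("gcatcg", 2, "acgt"))
```

A trailing fragment shorter than q symbols is dropped in this sketch; shifting `offset` from 0 to q – 1 yields the other alignments of the q-gram grid.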

  19. Alphabet mapping Map the large alphabet of σ symbols to a smaller alphabet of σ’ symbols. We achieve this using a bin-packing method. [Figure: new alphabet (σ’ = 4)]
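One plausible reading of the bin-packing step is a greedy assignment that processes symbols by decreasing frequency and places each one in the currently lightest bin. The sketch below follows that assumption; the slides do not give the exact procedure, and the name `map_alphabet` is hypothetical.

```python
from collections import Counter


def map_alphabet(text: str, new_sigma: int) -> dict[str, int]:
    """Greedily map each symbol of text to one of new_sigma bins."""
    freq = Counter(text)
    load = [0] * new_sigma                 # total frequency assigned to each bin
    mapping = {}
    # most_common() yields symbols in order of decreasing frequency.
    for sym, count in freq.most_common():
        lightest = min(range(new_sigma), key=load.__getitem__)
        mapping[sym] = lightest
        load[lightest] += count
    return mapping


print(map_alphabet("aaaabbbccd", 2))
```

Balancing the bin loads keeps the reduced symbols roughly equally likely, which limits how often unrelated text positions match by chance.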

  20. Multi AOSO on q-Grams (MAG) • The super-alphabet reduces the number of verifications. The probability of a match is p = O((qr)/σ^q), so the verification probability is O(p^⌊m/(kq)⌋) and the cost is O(rqm) • q-gram based search makes the steps bigger (equal to q); in other words, the text becomes shorter (n/q) • FAOSO runs in O(n/k · ⌈(m/q)/w⌉) time in our case, where w is the number of bits in a computer word (typically 64).

  21. Simple Multi AOSO on q-Grams (SMAG) • A simpler version of MAG: the whole text is encoded before the actual search starts, which makes the search algorithm more streamlined. • Total complexity is Ω(n), the time to encode the text. • A slightly faster search, but a much longer preprocessing phase. • May be useful if the text is searched multiple times within a short period and there is space to store it in encoded form.

  22. Experimental results • Hardware: Intel Core i3 2100 3.1 GHz CPU with 128 KB L1, 512 KB L2 and 3 MB L3 cache, 4 GB of 1333 MHz DDR3 RAM • Compiler: g++ version 4.8.1 with -O3 optimization • OS: 64-bit Ubuntu with kernel 3.11.0-17 • Text: taken from the Pizza & Chili Corpus (http://pizzachili.dcc.uchile.cl), 200 MB each • Tests: all source codes were obtained from their authors and compiled on the same test machine (some of them cannot handle long patterns, i.e. m = 64).

  23. Experimental results, varying r

  24. Experimental results, varying m

  25. Experimental results, varying q

  26. Conclusions • Our work can be seen as a new and quite successful combination of known building blocks. • The presented algorithm, MAG, usually outperforms its competitors on the three test datasets (English, proteins and DNA). • One of the key successful ideas was alphabet quantization (binning), which is performed in a greedy manner after sorting the original alphabet by frequency.

  27. Future work • Different alphabet mapping techniques could improve efficiency. • Is it possible to choose the algorithm’s parameters so as to reach average optimality (for m = O(w))? • SSE instructions seem to offer great opportunities, especially for bit-parallel algorithms. • Dense codes (e.g., ETDC) for words or q-grams not only serve for compressing data (texts), but also enable faster pattern searches (our preliminary results are rather promising).
