
Construction of Aho Corasick automaton in Linear time for Integer Alphabets

Shiri Dori & Gad M. Landau, University of Haifa.


Presentation Transcript


  1. Construction of Aho Corasick automaton in Linear time for Integer Alphabets • Shiri Dori & Gad M. Landau • University of Haifa

  2. Overview • Classic Aho Corasick • Our algorithm • Goto Function • Failure Function • Combining the two • Queries in O(m log|Σ|)

  3. Set Pattern Matching Problem • Find patterns in text • P={P1, P2, ... Pq}, in T • Aho and Corasick solved it in ’75 • Generalized version of KMP • Uses a state machine

  4. Aho Corasick - Example • P = {her, iris, he, is} • [Figure: trie of the patterns with nodes h, i, he, ir, is, her, iri, iris]

  5. Aho Corasick - Example • P = {her, iris, he, is} • Travel along the Goto function, which is a trie of all patterns • If stuck, travel along a KMP-style Failure link • [Figure: the same trie with nodes h, i, he, ir, is, her, iri, iris]

  6. Aho Corasick - Example • P = {her, iris, he, is} • Travel along the Goto function, which is a trie of all patterns • When a pattern is found, output it • If stuck, travel along a KMP-style Failure link • [Figure: the same trie with nodes h, i, he, ir, is, her, iri, iris]

  7. Aho Corasick Definitions • Goto function: a trie of the patterns • Failure function: for each label, the largest (longest proper) suffix of the label which is a prefix of a pattern • Like KMP, but a prefix of any pattern qualifies • Output function: the patterns ending at this label
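The three functions defined on this slide can be sketched as follows. This is a minimal illustration of the classic construction (trie plus breadth-first failure links), not the linear-time integer-alphabet construction the talk proposes; the names `build_automaton` and `search` and the dictionary-based branching are assumptions made for the example.

```python
from collections import deque

def build_automaton(patterns):
    """Classic Aho-Corasick: trie (goto), BFS failure links, output sets."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fail[state] -> state to fall back to when stuck
    out = [set()]    # out[state] -> patterns ending at this state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]    # inherit patterns ending at the failure state
            queue.append(t)
    return goto, fail, out

def search(text, automaton):
    """Return sorted (start_index, pattern) pairs for every occurrence."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]               # stuck: follow failure links
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)

automaton = build_automaton(["her", "iris", "he", "is"])
print(search("heiris", automaton))
```

On the text "heiris" the sketch reports "he" at index 0, "iris" at index 2 and "is" at index 4; the last hit comes from the failure link from "iris" to "is".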

  8. Classic Aho Corasick – Analysis • Constructed in O(n), where n is the cumulative pattern length • Answered queries in O(m + k) • ... For constant alphabets only! • For integer alphabets, |Σ| = O(n^c), the algorithm changes depending on the branching method • List, Array or Search Tree • Recent developments inspire a better solution: • Farach-97; Kärkkäinen & Sanders-03; • Ko & Aluru-03; Kim, Sim, Park & Park-03

  9. Our Results • Our algorithm achieves better results: • Construction in O(n) time, O(n) space • Query in O(m log|Σ|) • Works for integer alphabets, |Σ| = O(n^c)

  10. Algorithm: Goto Function • Sort the patterns in time linear in their total length • By building the suffix array of Sp=$P1$P2$...$Pq$ and simply ignoring non-pattern suffixes • Or by a two-pass radix sort, O(D + |Σ|) = O(n) • Paige & Tarjan, ’87; Andersson & Nilsson, ’94 • Now create the trie in lexicographic order • Each node holds a list of sons; insert each new node at the end of the list
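The trie-building step can be sketched as below. Because the patterns arrive in lexicographic order, each character either matches the tail of the current node's child list or becomes a new tail, so no search inside a node is needed. The function name `build_goto` is hypothetical, and Python's comparison-based `sorted` is only a stand-in for the linear-time sort the slide describes.

```python
def build_goto(sorted_patterns):
    """Build a trie from lexicographically sorted patterns.

    children[v] is a list of (char, node) pairs kept in sorted order.
    Only the tail of each list is ever inspected or extended.
    """
    children = [[]]        # node 0 is the root
    is_pattern = [False]
    for p in sorted_patterns:
        v = 0
        for ch in p:
            kids = children[v]
            if kids and kids[-1][0] == ch:
                v = kids[-1][1]            # shared prefix: follow the tail child
            else:
                children.append([])
                is_pattern.append(False)
                kids.append((ch, len(children) - 1))   # append a new tail
                v = len(children) - 1
        is_pattern[v] = True
    return children, is_pattern

# Stand-in for the linear-time sort (assumption: any correct sort works here).
patterns = sorted(["the", "than", "this", "then"])
children, is_pattern = build_goto(patterns)
print(patterns)
print([ch for ch, _ in children[2]])   # children of the node labelled "th"
```

With the patterns from the talk's example, the node labelled "th" ends up with the sorted child list a, e, i.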

  11. Example – Goto Function P = {the, than, this, then} Sorting Patterns P’ = {than, the, then, this}

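  12. Example – Goto Function • P’ = {than, the, then, this} • The patterns are inserted in sorted order: than, the, then, this • [Figure: the trie grows to contain nodes t, th, tha, the, thi, than, then, this]

  13. Example – Goto Function • P’ = {than, the, then, this} • Sorted List, keep the tail • [Figure: node th holds its children a, e, i as a sorted list; trie nodes t, th, tha, the, thi, than, then, this]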

  14. Algorithm: Failure Function • We need to construct Failure links on trie • Original algorithm included traversing trie • We found a deep connection between: • Failure function of the patterns, and • Suffix Tree of the reversed patterns • Or Enhanced Suffix Array • Abouelhoda, Kurtz & Ohlebusch-04; Kim, Jeon & Park-04 • We’ll “learn by example”...

  15. Example – Failure Function • P = {he, her, iris, is} • PR = {eh, reh, siri, si} • Failure function: “iris” → “is” • The reverses: “siri”, “si” • “si” is a prefix of “siri” • (marked with $, so “is” is a prefix of a pattern) • [Figure: the trie of P beside the suffix tree of the reversed patterns; tree nodes are written as reversed label (original label), e.g. si (is), siri (iris), with $ marks on nodes]

  16. Understanding Failure Function • Failure function is defined as: the largest suffix which is a prefix of any pattern • Reverse: “largest suffix” → “largest prefix” • Any prefix of a label will be its ancestor in the ST • Largest means nearest • “prefix of pattern” → “suffix of pattern” • It will be a node in the ST, marked by a $ • So: the closest ancestor which is marked by $
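The reversal argument can be checked with a small, deliberately naive sketch. It uses plain string sets instead of a suffix tree, so it runs in quadratic time and only illustrates the correspondence: the proper prefixes of a reversed label, taken longest first, play the role of its ancestors from nearest to farthest, and membership in `marked` plays the role of the $ mark. The function name `failure_by_reversal` is an assumption of this example.

```python
def failure_by_reversal(patterns):
    """Naive illustration: read failure links off the reversed labels."""
    # Every non-empty prefix of a pattern is the label of a trie node.
    labels = {p[:i] for p in patterns for i in range(1, len(p) + 1)}
    # Reversed labels stand in for the $-marked nodes of the reversed tree.
    marked = {lab[::-1] for lab in labels}
    failure = {}
    for lab in labels:
        rev = lab[::-1]
        # Proper prefixes of rev, longest first = ancestors, nearest first.
        for k in range(len(rev) - 1, 0, -1):
            if rev[:k] in marked:
                failure[lab] = rev[:k][::-1]
                break
        else:
            failure[lab] = ""   # no marked ancestor: fail to the root
    return failure

failure = failure_by_reversal(["her", "their", "eye", "iris", "he", "is"])
for label in ["iris", "their", "the", "eye", "her"]:
    print(label, "->", repr(failure[label]))
```

For the six-pattern example used later in the talk, this gives iris → is, their → ir, the → he and eye → e, while "her" falls back to the root.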

  17. Algorithm: Failure Function • We found a deep connection between: • Failure function of the patterns, and • Suffix Tree of the reversed patterns • We define Sp=$P1$P2$...$Pq$ • We define TR to be the suffix tree of (Sp)R • TR can be built in linear time • Can use Enhanced Suffix Array, ER, instead • Note: TR is a Generalized Suffix Tree • How will we link the trie and TR?

  18. Example – 1-to-1 Mapping • Note: “r” doesn’t get a link since it’s not marked by a $ • [Figure: the trie (h, i, he, ir, is, her, iri, iris) beside TR (h, i, r, si, eh, iri, reh, ri, siri), with links between corresponding nodes]

  19. Example – 1-to-1 Mapping • [Figure: the same trie and TR, highlighting the nodes i (i), si (is), eh (he), iri (iri) and siri (iris)]

  20. Algorithm: Review • Build Goto function (trie) • Sort patterns • Construct trie • Build Failure function • Construct TR • Compute proper ancestor for $-marked nodes • Combine information • Through mapping, create Failure links on trie

  21. Adjustment for Integer Alphabet • We used recent developments (SA, ST) • Constructed Goto: using suffix array • Found a connection between Failure function and suffix trees • Thus, reduced the construction to O(n) • Yet we manage to keep queries at O(m log|Σ|) • Again - how?

  22. Queries in O(m log|Σ|) • We’ve built the trie in O(n) • But we have a sorted list • Search is compromised • Our simple solution…

  23. Example – Goto Function • P’ = {than, the, then, this} • Array can be searched in log(#children) • [Figure: the sorted child list a, e, i of node th converted to an array; trie nodes t, th, tha, the, thi, than, then, this]

  24. Queries in O(m log|Σ|) • Once the trie is complete • Convert lists in each node to arrays • Array’s size is known; O(n) space overall • Binary search can now be employed • Reduce the time spent in each node to log(# children) = O(log|Σ|) • Can be applied to Suffix Tree built from Suffix Array + LCP
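A minimal sketch of the list-to-array conversion and the binary search it enables is shown below. The `children` table is hand-written for P’ = {than, the, then, this}, and the names `freeze` and `goto` are hypothetical. Python lists are already arrays, so `freeze` only illustrates the conversion step; the point is the `bisect`-based lookup.

```python
from bisect import bisect_left

# Hand-built trie for {than, the, then, this}: node -> sorted (char, child) list.
children = [
    [("t", 1)],                        # 0: root
    [("h", 2)],                        # 1: t
    [("a", 3), ("e", 5), ("i", 7)],    # 2: th
    [("n", 4)],                        # 3: tha
    [],                                # 4: than
    [("n", 6)],                        # 5: the
    [],                                # 6: then
    [("s", 8)],                        # 7: thi
    [],                                # 8: this
]

def freeze(children):
    """Convert each node's sorted child list into parallel key/target arrays."""
    return [([c for c, _ in kids], [v for _, v in kids]) for kids in children]

def goto(frozen, v, ch):
    """Return the child of v on character ch, or None, in O(log #children)."""
    keys, targets = frozen[v]
    i = bisect_left(keys, ch)
    if i < len(keys) and keys[i] == ch:
        return targets[i]
    return None

frozen = freeze(children)
print(goto(frozen, 2, "e"))   # child of "th" on 'e'
print(goto(frozen, 2, "b"))   # no such child
```

The total size of all arrays equals the number of trie edges, which is why the slide can claim O(n) space overall.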

  25. The End Thanks!

  26. Algorithm: Combining the two • Build a 1-to-1 mapping between $-marked nodes in TR and trie nodes • We compute mapping through the string: • For each char in Sp, we keep its Goto node • For each suffix tree node, we know what indices it represents (in (Sp)R, and so in Sp) • Now, build Failure links atop the trie • Like we saw in the example

  27. Algorithm: Failure Function • For each node, find its “proper ancestor” • Closest ancestor marked with a $ • Found with a simple preorder traversal • The properties of TR ensure that... • For each failure link v1 → v2 • And their corresponding nodes, u1 and u2 • u2 = proper ancestor of u1 • If we link trie and TR, we find the Failure! • How will we link them?
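The preorder pass can be sketched on a small hand-built tree. The tree below is a toy loosely modelled on the example with P = {her, iris, he, is}; it is an assumption made for illustration, not a claim about the exact shape of the suffix tree in the slides. Nodes are named by their reversed labels, `marked` lists the nodes whose original label is a prefix of a pattern, and `proper_ancestors` is a hypothetical name.

```python
def proper_ancestors(children, marked, root=""):
    """For every node, find the closest proper ancestor that is marked.

    A parent is always handled before its children (preorder), so each
    child can inherit the answer in constant time.
    """
    proper = {root: None}
    stack = [root]
    while stack:
        u = stack.pop()
        # What u's children inherit: u itself if marked, else u's own answer.
        nearest = u if u in marked else proper[u]
        for c in children.get(u, []):
            proper[c] = nearest
            stack.append(c)
    return proper

# Toy tree over reversed labels (assumed shape, for illustration only).
children = {
    "": ["e", "h", "i", "r", "si"],
    "e": ["eh"],
    "i": ["iri"],
    "r": ["reh", "ri"],
    "si": ["siri"],
}
marked = {"h", "i", "si", "eh", "iri", "reh", "ri", "siri"}

proper = proper_ancestors(children, marked)
print(proper["siri"])   # reversed "is": the failure of "iris" is "is"
print(proper["eh"])     # None: the failure of "he" is the root
```

Reversing each answer gives the failure link in the trie: "siri" maps to "si", so "iris" fails to "is", while "eh" has no marked proper ancestor, so "he" fails to the root.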

  28. Example - automaton - Goto • P = {her, their, eye, iris, he, is} • Travel along the Goto function, which is a trie of all patterns • [Figure: trie with nodes e, h, i, t, ey, he, ir, is, th, eye, her, iri, the, iris, thei, their]

  29. Example - TR • P = {her, their, eye, iris, he, is} • [Figure: TR with nodes written as reversed label (original label): e (e), h (h), i (i), r (r), si (is), t (t), ye (ey), eh (he), eye (eye), ht (th), ieht (thei), iri (iri), reh (her), ri (ir), siri (iris), eht (the), rieht (their); $ marks on nodes]

  30. Example - TR and Failure • P = {her, their, eye, iris, he, is} • iris → is • the → he → e • eye → e • their → ir → (root) • [Figure: the trie and TR from the two previous slides, side by side]

  31. TR - Reversed Suffix Tree • We defined Sp=$P1$P2$...$Pq$ • We define TR to be the suffix tree of (Sp)R • This tree has interesting properties: • Each trie node v is represented by exactly one TR node u, so that Label(v) = Label(u)R • In TR, a node’s label is a prefix of its child’s label; in the trie, it is a suffix of the original • A $-marked node in TR means that the original label is a prefix of a pattern

  32. Example - TR • P = {her, iris, he, is} • [Figure: TR with nodes written as reversed label (original label): e (e), h (h), i (i), r (r), si (is), eh (he), iri (iri), reh (her), ri (ir), siri (iris); $ marks on nodes]

  33. Example - TR and Failure • P = {her, their, eye, iris, he, is} • We took: • Failure of “their” → “ir” (from “iris”) • Largest suffix which is a prefix of a pattern • Their reverse strings are “rieht”, “ri” • Now it is a prefix... so it is an ancestor in a suffix tree! • To be a prefix of a pattern px, it should be a suffix of the reverse pattern (px)R • So it will be in the suffix tree, and end with a $
