1 / 24

An Improved Algorithm to Accelerate Regular Expression Evaluation

An Improved Algorithm to Accelerate Regular Expression Evaluation. Author : Michela Becchi , Patrick Crowley Publisher: 3rd ACM/IEEE Symposium on Architecture for networking and communications systems ,  2007 Presenter: Ching Hsuan Shih Date: 2014/02/26. Outline.

adora
Download Presentation

An Improved Algorithm to Accelerate Regular Expression Evaluation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An Improved Algorithm to Accelerate Regular Expression Evaluation Author: MichelaBecchi, Patrick Crowley Publisher: 3rd ACM/IEEE Symposium on Architecture for networking and communications systems, 2007 Presenter: ChingHsuan Shih Date: 2014/02/26

  2. Outline Introduction Motivation The Proposal Reducing the Alphabet Encoding Experimental Evaluation

  3. I. Introduction • Signature-based deep packet inspection has taken root as a dominant security mechanism in networking devices and computer systems. • Regular expressions are more expressive than simple patterns of strings and therefore able to describe a wider variety of payload signatures. • There has been a amount of recent work on implementing regular expressions, particularly with representations based on deterministic finite automata (DFA).

  4. I. Introduction (Cont.) • DFAs have attractive properties that explain the attention they have received. • They have predictable and acceptable memory bandwidth requirements. • For any given regular expression, a DFA with a minimum number of states can be found [3].

  5. I. Introduction (Cont.) • DFAs corresponding to large sets of regular expressions containing complex patterns can be prohibitively large in terms of numbers of states and transitions. • Yu et al. [15] have proposed segregating rules into multiple groups and evaluating the corresponging DFAs concurrently. • Delayed Input DFA (D2FA) [9] redundant transitions common to a pair of states with a single default transition.

  6. I. Introduction (Cont.) • D2FA has three weaknesses • It requires a user-provided parameter value which can only be determined experimentally for a given rule-set. • It creates a data-structure whose worst-case paths may be traversed for each input character processed. • It requires multiple passes over large support data structures during the construction phase. • We propose an improved simplified algorithm for building default transitions that addresses the problems above.

  7. II. Motivation In this section, we describe the D2FA approach [9]. • The basic goal of the D2FA is to reduce the amount of memory needed to represent all the state transitions in a DFA.

  8. II. Motivation (Cont.) • During the string matching operation, the traversal of D2FA will be performed according to the Aho-Corasick algorithm [1], treating default transitions as failure pointers. • The heuristic proposed in [9] to build a D2FA can be explored systematically as a maximum spanning tree problem on an undirected graph. • This maximum spanning tree problem can be solved with Kruskal’s algorithm [5].

  9. II. Motivation (Cont.) • After the operation of Kruskal’s algorithm, the root of each tree can be selected. • The node having the smallest maximum distance from any vertices within the same tree is chosen. • Direct all default transitions towards the root of the default transition tree. • In order to limit the maximum default path length, a heuristic is proposed to address this problem by determining a maximum spanning tree forest with bounded diameter.

  10. II. Motivation (Cont.)

  11. III. The Proposal • We now take advantage of a simple fact: • DFA traversal always starts at a single initial state S0 • We propose a more general compression algorithm which leads to a traversal time bound independent of the maximum default transition path length.

  12. III. The Proposal (Cont.) Definition: For each state s, we define its depth as the minimum number of states visited when moving from s0 to s in the DFA.

  13. III. The Proposal (Cont.) • Lemma: With any string of length N, a 2N time bound is guaranteed on all D2FA having only “backwards” transitions. • A string of length N implies N labeled transitions to be followed and the number of default transitions is always at least one less than the number of labeled transitions taken. • For a string of length N, the total number of state traversals cannot be higher than 2N-1.

  14. III. The Proposal (Cont.) 3.1 Problem Formulation The problem can be now formulated as an instance of maximum spanning tree on a directed graph.

  15. III. The Proposal (Cont.) 3.2 An example

  16. III. The Proposal (Cont.) 3.3 Algorithm The whole problem is reduced to having each state select the state with lower depth having the most number of outgoing transitions in common with it.

  17. III. The Proposal (Cont.)

  18. IV. Reducing the Alphabet The basic idea is the following: In an alphabet ∑, two symbols ci and cjwill fall into the same class if they are treated the same way in all DFA states. In other words, given the transition function δ(states, Σ)→states, δ(s,ci)= δ(s,cj) for each state s belonging to the DFA. In practical scenarios (ASCII alphabet) this table will contain 256 entries, with a maximum width of 1 byte (for heavily compressed alphabets 5-6 bits per character may suffice).

  19. V. Encoding 5.1 Bitmaps A scheme [18] consists of associating a bitmap as large as the alphabet size to each DFA state. Bits corresponding to uncompressed labeled transitions present in the current state can be set to 1; the remaining bits are set to 0. State identifiers can be simply represented through their base address in memory. The length of the necessary bitmaps can substantially decrease after alphabet reduction.

  20. V. Encoding (Cont.) 5.2 Content addressing • A technique [16] consists in representing state identifiers with content labels, which are stored in memory as next state transitions. • A state content label contains several fields: • A state discriminator • The list of characters for which a labeled transition is defined • An identifier for the default transition state • The size of a content label depends on the number of labeled transitions defined for the corresponging state.

  21. VI. Experimental Evaluation

  22. VI. Experimental Evaluation (Cont.)

  23. VI. Experimental Evaluation (Cont.)

  24. VI. Experimental Evaluation (Cont.)

More Related