Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection

Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection. Fang Yu, Microsoft Research, Silicon Valley. Work was done at UC Berkeley, jointly with Zhifeng Chen (Google Inc.); Yanlei Diao (UMass Amherst); T. V. Lakshman (Bell Labs); Randy H. Katz (UC Berkeley).

Presentation Transcript


  1. Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection. Fang Yu, Microsoft Research, Silicon Valley. Work was done at UC Berkeley, jointly with Zhifeng Chen (Google Inc.); Yanlei Diao (UMass Amherst); T. V. Lakshman (Bell Labs); Randy H. Katz (UC Berkeley)

  2. Regular Expressions
     • Flexible way to describe patterns
     • Example, for detecting Yahoo Messenger traffic: ^(ymsg|ypns|yhoo).?.?.?.?.?.?.?[lwt].*\xc0\x80
     • Used in many payload scanning applications
       • L7-filter: protocol identifiers
       • Bro: intrusion patterns
       • SNORT: no regular expressions in April 2003; 1131 out of 4867 intrusion rules contain regular expressions as of Jan 2006
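
As a minimal sketch of how such a payload pattern can be applied, the snippet below compiles the Yahoo Messenger expression with Python's `re` module over raw bytes. The pattern is written without whitespace between the optional wildcards and `[lwt]`, which is an assumption about how the slide text should be read. The helper name `looks_like_yahoo`, the `re.DOTALL` flag, and both payloads are illustrative assumptions, not part of the talk.

```python
import re

# Pattern from the slide, compiled over bytes. re.DOTALL lets "." match any
# byte, including newline; that choice is an assumption about the intended
# matching semantics.
YAHOO = re.compile(rb"^(ymsg|ypns|yhoo).?.?.?.?.?.?.?[lwt].*\xc0\x80", re.DOTALL)


def looks_like_yahoo(payload: bytes) -> bool:
    """Return True if the payload matches the protocol-identifier pattern."""
    return YAHOO.search(payload) is not None


# Made-up payloads for illustration only; these are not real captures.
sample = b"ymsg\x00\x0b\x00\x00w\x00\x01\xc0\x80"
other = b"GET / HTTP/1.1\r\n"

print(looks_like_yahoo(sample), looks_like_yahoo(other))
```

Python's `re` engine is a backtracking matcher, so this shows only what the pattern accepts, not the DFA-based scanning speed discussed in the talk.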

  3. Challenges • Features specific to packet scanning applications • Large sets of patterns, on the order of hundreds or thousands

  4. Design Space
     Automata-based approaches:
     • NFA-based
       • A group of states can be activated simultaneously
       • NFA-based approaches can be slow, sometimes less than 1 Mb/s
     • DFA-based
       • Only one state is activated
       • Repeated scan
         • Start scanning from one position; if no match, start again at the next position
         • Good for parsers
         • Packets may not contain any patterns
         • No guarantee of high speed
       • One-pass scan (space problem)
         • Scan the input only once
         • Fast and deterministic throughput
         • Add .* before patterns
         • High percentage of wildcards
         • Some patterns generate very large DFAs
         • m individual DFAs for m patterns: O(m) processing complexity for each input character
         • One composite DFA for m patterns: O(1) processing complexity for each input character
     Example patterns shown on the slide: (A|B)C and (A|D)E
     Contributions:
     • Selectively group patterns into k groups (e.g., k=3)
       • Avoid exponential memory growth
       • Further speed up matching process
     • Rewrite techniques to reduce memory usage
       • Make DFA-based approach feasible
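
The difference between repeated scanning and one-pass scanning can be sketched with two tiny hand-built DFAs for the literal pattern `ab`. The tables, function names, and the pattern itself are assumptions chosen for illustration; the talk's scanners are generated from real rule sets. The one-pass table corresponds to `.*ab`, that is, the pattern with `.*` added in front.

```python
ACCEPT = 2

# Anchored DFA for "ab": a missing entry means the dead state.
ANCHORED = {(0, "a"): 1, (1, "b"): 2}

# One-pass DFA for ".*ab": a missing entry falls back to state 0.
ONE_PASS = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2}


def repeated_scan(data):
    """Restart the anchored DFA at every input position; count reads."""
    reads = 0
    for start in range(len(data)):
        state = 0
        for ch in data[start:]:
            reads += 1
            state = ANCHORED.get((state, ch))
            if state is None:       # dead state: give up on this start
                break
            if state == ACCEPT:
                return True, reads
    return False, reads


def one_pass_scan(data):
    """Run the '.*'-prefixed DFA once over the input; count reads."""
    state, reads = 0, 0
    for ch in data:
        reads += 1
        state = ONE_PASS.get((state, ch), 0)
        if state == ACCEPT:
            return True, reads
    return False, reads


text = "a" * 10                     # contains no match
print(repeated_scan(text), one_pass_scan(text))
```

On input without a match, the one-pass scanner reads each character exactly once, while the repeated scanner re-reads characters after every failed start position. With longer patterns the gap widens, which is the "no guarantee of high speed" point on the slide.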

  5. DFA Sizes of Regular Expressions
     • Typical patterns in network payload scanning applications
     [Table not preserved in the transcript. Surviving annotations: "Rewrite Rule 1", "Rewrite Rule 2", "Focus of this talk".]

  6. Design Considerations
     • Completeness of matching results for one pattern
       • Complete matching
         • Report all the possible substrings
         • E.g., a pattern ab* and an input abbb
         • Four possible matches, i.e., a, ab, abb, and abbb
       • Non-overlapping matching
         • Common practice: left-most longest match, shortest match results
     • In most payload scanning applications, for one pattern, reporting non-overlapping matching results is sufficient
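
The two notions of completeness can be illustrated on the slide's own example, pattern `ab*` and input `abbb`. This is a small sketch using Python's `re` module; the function names are assumptions. Python's greedy matching is leftmost but not longest in general, although for `ab*` the greedy result is also the longest one.

```python
import re

PATTERN = re.compile(r"ab*")


def complete_matches(text):
    """Every substring of text that the whole pattern matches."""
    return [
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
        if PATTERN.fullmatch(text, i, j)
    ]


def non_overlapping_matches(text):
    """Left-to-right, non-overlapping matches (greedy, so longest here)."""
    return [m.group() for m in PATTERN.finditer(text)]


print(complete_matches("abbb"))
print(non_overlapping_matches("abbb"))
```

Complete matching reports all four substrings, while non-overlapping matching reports a single result, which is usually enough to decide whether a packet triggers a rule.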

  7. Patterns with Exponential DFA Sizes
     • Often for detecting buffer overflow attempts, e.g., .*AUTH\s[^\n]{100}
     • DFA needs to remember all the possible AUTH\s
       • A second AUTH\s can either match [^\n]{100} or be counted as a new match of the start of the pattern AUTH\s
     • Generates a DFA of >100,000 states
     • Can’t be efficiently processed by an NFA-based approach either
     [Figure: NFA for .*AUTH\s[^\n]{100}, with edge labels ε, A, U, T, H, \s, followed by a chain of 100 states on [^\n]. Example input: AUTH\sAUTH\sAUTH\s\s AUTH\s\s\s …]
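
To see why a DFA has to "remember all the possible AUTH\s", it helps to simulate the NFA and look at which states are active at once. The sketch below hand-codes an NFA for `.*AUTH\s[^\n]{100}` under an assumed state numbering: state 0 is the start with the `.*` self-loop, states 1 to 5 track `AUTH\s`, and states 6 to 105 count the bytes that are not newlines. The numbering and function names are assumptions for illustration.

```python
N = 100  # length restriction from the slide's pattern


def step(active, byte):
    """Advance every active NFA state by one input byte."""
    nxt = {0}                         # state 0 (leading .*) is always active
    ch = bytes([byte])
    for s in active:
        if s < 4 and ch == b"AUTH"[s:s + 1]:
            nxt.add(s + 1)            # next letter of the AUTH prefix
        elif s == 4 and ch.isspace():
            nxt.add(5)                # \s completes the prefix
        elif 5 <= s < 5 + N and ch != b"\n":
            nxt.add(s + 1)            # count one more non-newline byte
    return nxt


def active_states(data):
    """Set of NFA states active after consuming data."""
    active = {0}
    for byte in data:
        active = step(active, byte)
    return active


# Three AUTH\s prefixes followed by one extra space.
print(sorted(active_states(b"AUTH AUTH AUTH  ")))
```

Each earlier `AUTH\s` keeps its own counter state alive, so several counters run in parallel. A DFA must encode every such combination as a distinct state, which is where the blow-up comes from.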

  8. Rewriting Intuition
     • Only the first AUTH\s matters
       • If there is a ‘\n’ within the next 100 bytes, none of the AUTH\s matches the pattern
       • Otherwise, the first AUTH\s and the following characters have already matched the pattern
     • Rewrite the pattern to: ([^A]|A[^U]|AU[^T]|AUT[^H]|AUTH[^\s]|AUTH\s[^\n]{0,99}\n)*AUTH\s[^\n]{100}
     • This rewritten pattern
       • Generates a DFA of only 106 states
       • Reports different numbers of matches from the original pattern in identifying complete matches
       • Is equivalent to the original in identifying non-overlapping matches
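
The "only the first AUTH\s matters" intuition can be written as a small procedural scanner that tracks a single candidate at a time. This is a sketch of the idea, not the authors' rewrite scripts; the function names and the parameter `n` (100 on the slide) are assumptions. It is compared against a plain `re.search` for `AUTH\s[^\n]{n}` only on whether some match exists.

```python
import re

AUTH = re.compile(rb"AUTH\s")


def matches_first_auth(data: bytes, n: int = 100) -> bool:
    """Decide whether AUTH\\s is followed by n non-newline bytes anywhere."""
    pos = 0
    while True:
        m = AUTH.search(data, pos)
        if m is None:
            return False
        end = m.end()
        nl = data.find(b"\n", end, end + n)
        if nl == -1:
            # No newline in the window: match if enough bytes remain.
            # Later candidates would have even fewer bytes left.
            return len(data) - end >= n
        # The newline kills this candidate and every candidate that ends
        # before it. Resume where an AUTH\s ending at this newline starts.
        pos = nl - 4


def matches_original(data: bytes, n: int = 100) -> bool:
    """Reference check with the backtracking engine."""
    return re.search(rb"AUTH\s[^\n]{%d}" % n, data) is not None


probe = b"AUTH AUTH AUTH " + b"x" * 90
print(matches_first_auth(probe), matches_original(probe))
```

The scanner never needs more than one counter, which mirrors why the rewritten pattern compiles to a small DFA.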

  9. Rewriting Effect on the SNORT Rule Set

  10. Rewriting Effect on the SNORT Rule Set • Created scripts to automatically rewrite patterns • After rewriting, patterns in SNORT and Bro can be compiled into DFAs

  11. Design Choices
      Automata-based approaches:
      • NFA-based
      • DFA-based
        • Repeated scan
        • One-pass scan
          • m individual DFAs for m patterns: O(m) processing complexity for each input character
          • One composite DFA for m patterns: O(1) processing complexity for each input character
      Contributions:
      • Selectively group patterns into k groups (e.g., k=3)
        • Further speed up matching process
        • Avoid exponential memory growth
      • Rewrite techniques to reduce memory usage
        • Make DFA-based approach feasible

  12. State Explosion Problem • Randomly adding patterns from the L7-filters into one DFA

  13. Interactions of Regular Expressions • Some patterns generate DFAs of exponential size when combined • E.g., a composite DFA for patterns .*AB.*CD and .*EF.*GH
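
The interaction effect can be reproduced on a small scale with a textbook subset construction. The sketch below builds NFAs for patterns of the form `.*XY.*ZW`, merges them, and counts the reachable DFA states. The first two patterns are the ones on the slide; the other two (`.*IJ.*KL`, `.*MN.*OP`) are hypothetical additions to show the trend, and the helper names, the "#" placeholder symbol, and the state-counting convention are assumptions, so the absolute counts need not equal the figures in the talk.

```python
def nfa_edges(lit1, lit2, base):
    """Edges (src, symbol or None for 'any', dst) of an NFA for .*lit1.*lit2."""
    edges = [(base, None, base)]          # leading .*
    s = base
    for ch in lit1:
        edges.append((s, ch, s + 1))
        s += 1
    edges.append((s, None, s))            # middle .*
    for ch in lit2:
        edges.append((s, ch, s + 1))
        s += 1
    return edges, s + 1                   # edges and the next free state id


def dfa_size(patterns):
    """Number of reachable subset-construction states for the merged NFA."""
    edges, starts, base = [], [], 0
    for lit1, lit2 in patterns:
        starts.append(base)
        new_edges, base = nfa_edges(lit1, lit2, base)
        edges += new_edges
    # "#" stands for any byte that appears in none of the patterns.
    alphabet = {ch for lit1, lit2 in patterns for ch in lit1 + lit2} | {"#"}
    start = frozenset(starts)
    seen, todo = {start}, [start]
    while todo:
        cur = todo.pop()
        for ch in alphabet:
            nxt = frozenset(d for s, c, d in edges if s in cur and c in (None, ch))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


PATTERNS = [("AB", "CD"), ("EF", "GH"), ("IJ", "KL"), ("MN", "OP")]
for m in range(1, len(PATTERNS) + 1):
    composite = dfa_size(PATTERNS[:m])
    separate = sum(dfa_size([p]) for p in PATTERNS[:m])
    print(m, "patterns: composite", composite, "vs sum of individual", separate)
```

The sum of the individual DFAs grows linearly with the number of patterns, while the composite DFA has to track which prefixes have already been seen for every pattern at once, so it grows much faster.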

  14. Grouping Algorithms
      • Fixed local memory limitation (NPU or multi-core architectures)
        • Compute pair-wise interaction results, form a graph
        • Keep adding patterns until reaching the limit
          • Pick a pattern with the fewest interactions to the new group
      • Fixed total memory limitation (general single-core CPU architecture)
        • First compute the DFAs of individual patterns and compute the leftover memory size
        • Distribute the leftover memory evenly among ungrouped expressions
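
A greedy reading of the fixed-local-memory variant might look like the sketch below. It is an interpretation of the bullet points, not the authors' code: the function names, the way ties are broken, and especially the toy cost model `toy_size` (which doubles the size for every interacting pair in a group) are assumptions made so the example is self-contained. A real implementation would compile the composite DFA to measure its size.

```python
from itertools import combinations


def group_patterns(patterns, interacts, size_of, limit):
    """Greedily pack patterns into groups whose composite size stays <= limit.

    interacts: set of frozenset({p, q}) pairs that blow up when combined.
    size_of:   callable giving the composite DFA size of a list of patterns.
    """
    def degree(p, others):
        return sum(frozenset((p, q)) in interacts for q in others if q != p)

    ungrouped, groups = list(patterns), []
    while ungrouped:
        # Seed a new group with the pattern that interacts the least.
        seed = min(ungrouped, key=lambda p: degree(p, ungrouped))
        ungrouped.remove(seed)
        group = [seed]
        while ungrouped:
            # Candidate with the fewest interactions to the current group.
            cand = min(ungrouped, key=lambda p: degree(p, group))
            if size_of(group + [cand]) > limit:
                break                     # limit reached: close this group
            group.append(cand)
            ungrouped.remove(cand)
        groups.append(group)
    return groups


# Hypothetical inputs: four patterns of size 10, two interacting pairs.
SIZES = {"a": 10, "b": 10, "c": 10, "d": 10}
INTERACTS = {frozenset(("a", "b")), frozenset(("c", "d"))}


def toy_size(group):
    """Toy cost model (assumption): size doubles per interacting pair."""
    pairs = sum(frozenset(pq) in INTERACTS for pq in combinations(group, 2))
    return sum(SIZES[p] for p in group) * 2 ** pairs


print(group_patterns(list(SIZES), INTERACTS, toy_size, limit=25))
```

With these made-up inputs the interacting pairs end up in different groups. The sketch does not handle a single pattern that already exceeds the limit on its own.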

  15. Experimental Setup
      • Regular expression pattern sets
        • Linux application layer filter (L7-filter): 70 regular expressions
        • Pattern sets from the Bro intrusion detection system
          • HTTP-related patterns: 648 patterns
          • Payload-related patterns: 223 patterns
      • Packet traces
        • MIT dump: with viruses and worms
        • Berkeley dump: normal traffic
      • Scanners
        • Generated one-pass scanning DFA scanner
        • An NFA-based scanner, Pcregrep
        • A repeated-scanning DFA parser generated by flex

  16. Grouping Results for Patterns in L7-filter (70 patterns)
      • Results of grouping algorithms for fixed total memory
      [Table not preserved in the transcript. Surviving annotations, in slide order: "No grouping"; "Sum of individual DFAs"; "No extra memory cost"; "70/12 = 5.83 times less processing per character"; "70/3 = 23.3 times less processing per character"; "6.83 MB of memory".]

  17. Throughput Analysis • For Linux L7-filter (70 patterns) • Using PCs with a 3 GHz single-core CPU and 4 GB of memory

  18. Comparisons to Other Approaches
      • Engines compared
        • NFA: Pcregrep
        • DFA RP: Flex-generated DFA-based repeated scan engine
        • DFA OP: our DFA one-pass scanning engine
      • DFA OP
        • Is 48 to 704 times faster than the NFA implementation
        • Is 12 to 42 times faster than the commonly used DFA-based parser
        • Uses 2.6 to 8.4 times the memory

  19. Conclusions • High speed regular expression matching scheme • Proposed two rewrite rules • DFA-based approach is possible with our rewriting rules • Can rewrite complicated patterns from our pattern sets • In other pattern sets, there may be patterns not covered by our rewriting rules. • Developed grouping algorithm to selectively group patterns together • Orders of magnitude faster than existing solutions • Can be applied to FPGA or ASIC based approaches as well
