
Efficient Regular Expression Evaluation: Theory to Practice Michela Becchi and Patrick Crowley


Presentation Transcript


  1. Efficient Regular Expression Evaluation: Theory to Practice. Michela Becchi and Patrick Crowley. ANCS'08

  2. Motivation • Size and complexity of rule-sets have increased in recent years • Snort, as of November 2007: • 8,536 rules, 5,549 Perl Compatible Regular Expressions • 99% with character ranges ([c1-ck], \s, \w, …) • 16.3% with dot-star terms (.*, [^c1..ck]*) • 44% with counting constraints (.{n,m}, [^c1..ck]{n,m}) • Several proposals to accelerate regular expression matching • FPGA • Memory-centric architectures

  3. Objectives • Can distinct algorithmic techniques be converged into a single proposal that also works on large data-sets? • Can techniques intended for memory-centric architectures also be applied on FPGAs? • Goal: provide a tool that allows anybody to implement a high-throughput DPI system on the architecture of their choice

  4. Target Architectures • Regex-matching engine mapped onto two families of targets (figure axis: available parallelism): • FPGA logic • Memory-centric architectures: general-purpose processors, network processors, FPGA/ASIC + memory

  5. Challenges • NFAs on FPGA logic: • Logic cell utilization • Clock frequency • DFAs on memory-centric architectures (general-purpose processors, network processors, FPGA/ASIC + memory): • Memory space • Memory bandwidth

  6. D2FA: default transition compression • Observations: • A DFA state stores a set of |∑| next-state pointers • There is large transition redundancy across states • Idea: differential state representation through the use of non-consuming default transitions • (Figure: s1 and s2 both have a→s3 and b→s4; after compression s2 keeps only c→s6 plus a default transition to s1, and chained default transitions form a default path)
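A minimal sketch of how a D2FA consumes one input character (a hypothetical Python representation, not the authors' implementation): labeled transitions consume the character, while default transitions are followed without consuming input.

```python
# Hypothetical D2FA representation: each state keeps a sparse table of
# labeled transitions plus at most one non-consuming default transition.
class D2faState:
    def __init__(self, labeled, default=None):
        self.labeled = labeled    # char -> next state id (sparse)
        self.default = default    # state id reached without consuming input

def next_state(states, s, ch):
    """Consume one input character: follow default transitions (which
    consume no input) until a state with a labeled transition on ch
    is found, then take that labeled transition."""
    while ch not in states[s].labeled:
        s = states[s].default     # hop along the default path
    return states[s].labeled[ch]
```

In the spirit of the figure, s2 would keep only its c-transition plus a default transition to s1, which stores the a- and b-transitions the two states share.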

  7. D2FA algorithms • Problem: set default transitions so as to • Maximize memory compression • Minimize memory bandwidth overhead • [Kumar et al., SIGCOMM'06] • Bound dpMAX on max default path length • O(dpMAX+1) memory accesses per input char • Better compression for higher dpMAX • Memory bandwidth = O((dpMAX+1)N) on N input chars • Time complexity = O(n² log n), space complexity = O(n²) • [Becchi et al., ANCS'07] • Only backward-directed default transitions (skipping k levels) • Amortized memory bandwidth O(((k+1)/k)N) on N input chars • Depth-first traversal → compression at DFA creation • Time complexity = O(n²), space complexity = O(n) • Compression w/ k=1 ≈ compression w/ dpMAX=∞

  8. DFA alphabet reduction • Idea: map symbols that behave identically in every state to a single symbol class, and translate input through an alphabet translation table • Effective for: • Ignore-case regex • Character ranges • Never-used characters
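The translation table can be built by grouping symbols whose transition columns are identical across all states; a hypothetical sketch (simpler than the paper's clustering algorithm, but the same equivalence idea):

```python
def build_translation_table(delta, states, alphabet):
    """Map each symbol to a class id; two symbols share a class iff they
    lead to the same next state from every state (identical columns in
    the transition table)."""
    groups = {}
    for c in alphabet:
        column = tuple(delta[s][c] for s in states)
        groups.setdefault(column, []).append(c)
    table = {}
    for class_id, chars in enumerate(groups.values()):
        for c in chars:
            table[c] = class_id
    return table
```

For an ignore-case regex, 'a' and 'A' land in the same class, so each state's transition table shrinks from |∑| entries to one per class.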

  9. Multiple-stride DFAs • [Brodie et al., ISCA 2006] • Idea: process stride input chars at a time (e.g., a stride-2 DFA consumes a pair of characters per transition) • Observations: • Mechanism previously used only on small DFAs (1-2 regex) • No distinct accepting-state handling
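A sketch of stride doubling (hypothetical helper, ignoring the accepting-state duplication discussed on later slides): the stride-2 transition on a symbol pair is just the composition of two stride-1 transitions.

```python
from itertools import product

def double_stride(delta, states, alphabet):
    """Build a stride-2 transition table: delta2[s][(c1, c2)] is the
    state reached from s after consuming c1 then c2 in the original DFA."""
    delta2 = {s: {} for s in states}
    for s in states:
        for c1, c2 in product(alphabet, repeat=2):
            delta2[s][(c1, c2)] = delta[delta[s][c1]][c2]
    return delta2
```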

  10. Multiple stride + alphabet reduction • Stride s → alphabet ∑^s • ∑ = ASCII alphabet → |∑²| = 256² = 65,536; |∑⁴| = 256⁴ ≈ 4,294M • The effective alphabet is much smaller • Char-pair grouping, e.g.: [a-cef]a, [b-f]b • Alphabet reduction may be necessary to make stride doubling feasible on large DFAs
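Combining the two ideas, the reduced stride-2 alphabet groups symbol pairs with identical transition columns. A toy sketch (hypothetical) showing that the 9 raw pairs over a 3-symbol alphabet collapse to far fewer classes when two symbols behave identically:

```python
from itertools import product

def stride2_symbol_classes(delta, states, alphabet):
    """Group stride-2 symbol pairs (c1, c2) whose two-step transition
    columns are identical across all states; each group is one effective
    symbol of the reduced stride-2 alphabet."""
    groups = {}
    for c1, c2 in product(alphabet, repeat=2):
        column = tuple(delta[delta[s][c1]][c2] for s in states)
        groups.setdefault(column, []).append((c1, c2))
    return list(groups.values())
```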

  11. Multiple stride + default transitions • Compression: • Default transitions eliminate transition redundancy • In multiple-stride DFAs: • # of states does not substantially change • # of transitions per state increases exponentially with the stride • The fraction of distinct/total transitions decreases • Increased potential for compression! • Accepting-state handling: • Duplicated states have the same outgoing transitions as the original states, but different depth • A default transition will remove all outgoing transitions from the new accepting states

  12. Multiple stride + default transitions (cont'd) • Problem: • For large ∑ and stride, building the uncompressed DFA may be infeasible • Out of memory when generating a 2K-state, stride-4 DFA on a Linux machine w/ 4GB memory • Solution: • Perform default transition compression during DFA creation • Use the [Becchi et al., ANCS'07] compression algorithm • In the situation above, only 10% of the memory is used

  13. Putting everything together… • Input: 1-22 regex → DFA with 48-1,940 states • Stride-1 path: DFA → alphabet reduction (|∑|=25-44) → default transition compression (avg 1-2 labeled tx/state, 96.3-98.5% of transitions removed) → compressed DFA • Stride-2 path: DFA → stride-2 transformation → stride-2 DFA → alphabet reduction (|∑|=53-470) → default transition compression (avg 3-5 labeled tx/state, 97.9-99.5% of transitions removed) → compressed stride-2 DFA • Same memory bandwidth requirement • Initial size = 40X-80X final size

  14. NFA • Example rule-set: • ab+cd • ab+ce • ab+c.*f • b[d-f]a • bdc
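The five example patterns share prefixes (ab+c…) and must be matched simultaneously against the input. A quick sketch using Python's re module, just to illustrate the matching semantics of this rule-set (not the paper's NFA construction):

```python
import re

# The example rule-set from the slide.
PATTERNS = ["ab+cd", "ab+ce", "ab+c.*f", "b[d-f]a", "bdc"]
COMPILED = [re.compile(p) for p in PATTERNS]

def matching_rules(text):
    """Return the indices of all rules whose pattern occurs in text."""
    return [i for i, rx in enumerate(COMPILED) if rx.search(text)]
```

A combined NFA exploits the shared ab+c prefix once, instead of scanning with each pattern separately as this sketch does.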

  15. NFA: multiple stride + alphabet reduction • Stride doubling: • Avoid new state creation • Keep multiple transitions on the same symbol separated • Alphabet reduction: • Clustering-based algorithm as for DFAs, but sets of target states are compared

  16. FPGA implementation • One-hot encoding [Sidhu & Prasanna] • Quine-McCluskey-like minimization scheme + logic reduction schemes • Example of a reducible character check: (c1=b OR c1=B) AND NOT (c2=a OR c2=A)
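One-hot encoding keeps one flip-flop per NFA state and updates all of them in parallel each cycle. A bit-parallel Python sketch of the same update rule (hypothetical software analogue of the hardware scheme, with the start state kept always active for search semantics):

```python
# Software analogue of one-hot NFA evaluation: one bit per NFA state;
# each input character updates all state bits in parallel.
def step(active, ch, transitions):
    """transitions: list of (src_bit, char_set, dst_bit) triples."""
    nxt = 0
    for src, chars, dst in transitions:
        if active & (1 << src) and ch in chars:
            nxt |= 1 << dst
    return nxt

def search(text, transitions, accept_bit, start_bit=0):
    """Report whether the accepting state ever becomes active."""
    active = 1 << start_bit
    for ch in text:
        # Re-assert the start bit each cycle so matches can begin anywhere.
        active = step(active, ch, transitions) | (1 << start_bit)
        if active & (1 << accept_bit):
            return True
    return False
```

For b[d-f]a, the per-state next-state logic reduces to character checks like (c=d OR c=e OR c=f), which is exactly what the Quine-McCluskey-style minimization shrinks.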

  17. FPGA Results - throughput

  18. FPGA Results – logic utilization • Data-sets (#states, |∑1|, |∑2|): • 7,864 states, |∑1|=64, |∑2|=2,206 • 2,086 states, |∑1|=78, |∑2|=1,969 • 2,147 states, |∑1|=68, |∑2|=1,640 • Utilization: • 8-46% on a XC5VLX50 device (7,400 slices) • A XC5VLX330 device has 51,840 slices

  19. ASIC – projected results • Regex partitioning into multiple DFAs • Alternative representation: decoders in ASIC or instruction memory • Content addressing w/ 64-bit words: • 98% of states compressed w/ stride 1 • 82% of states compressed w/ stride 2 • Throughput (SRAM @ 500 MHz): • 2-4 Gbps for stride 1 • 4-8 Gbps for stride 2

  20. Conclusion • Algorithm: • Combination of default transition compression, alphabet reduction and stride multiplying on potentially large DFAs • Extension of alphabet reduction and stride multiplying to NFAs • FPGA implementation: • Use of one-hot encoding w/ incremental improvement schemes • Logic minimization scheme for alphabet reduction & decoding • Additional aspects: • Multiple-flow handling: FPGA vs. memory-centric architectures • Design improvements tailored to specific architectures and data-sets: • Clustering into smaller NFAs and DFAs to allow smaller alphabets w/ larger strides

  21. Thank you! • Questions? http://regex.wustl.edu
