
Fast Submatch Extraction using OBDDs


Liu Yang¹, Pratyusa Manadhata², William Horne², Prasad Rao², Vinod Ganapathy¹

¹ Rutgers University
² HP Laboratories

[Figure: NIDS diagram. Labels: Signatures, Network traffic, NIDS, Alerts]

Network intrusion detection systems (NIDS) employ regular expressions to represent attack signatures.

[Figure: SIEM diagram. Labels: Connectors (rule set), SIEM, Web security compliance, Email security compliance]

Security information and event management (SIEM) systems employ regular expressions to normalize event logs generated by hardware connectors and software systems.

Rule set:
…
username=(.*), hostname=(.*)
…

Input: username=Bob, hostname=Foo

Submatch extraction: $1 = Bob, $2 = Foo
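The rule above can be tried directly with Python's standard `re` module. This is only an illustration of what submatch extraction returns; `re` is a backtracking engine, not the approach presented in these slides.

```python
import re

# The rule from the slide, with two capturing groups.
rule = re.compile(r"username=(.*), hostname=(.*)")

event = "username=Bob, hostname=Foo"
match = rule.fullmatch(event)

# groups() returns the submatches that the slide writes as $1 and $2.
submatches = match.groups() if match else ()
print(submatches)  # ('Bob', 'Foo')
```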

- Non-deterministic finite automata (NFAs)
  - Space efficient, time inefficient
- Deterministic finite automata (DFAs)
  - Time efficient, but subject to state blow-up
- Recursive backtracking
  - Fast in general
  - Vulnerable to algorithmic complexity attacks

[Figure: time vs. space plot. Points: NFA (non-deterministic finite automaton), Backtracking, DFA (deterministic finite automaton), Our approach, Ideal]

- A novel way of annotating capturing groups: tagged NFAs
- Design of a novel technique for submatch extraction (called Submatch-OBDD)
  - Extending Thompson’s algorithm
  - Using Boolean functions to represent tagged NFAs
  - Using ordered binary decision diagrams (OBDDs) to improve time efficiency
- Evaluation and comparison with RE2 and PCRE

Note: RE2 is a hybrid approach, using a mix of DFA/NFA, while PCRE uses recursive backtracking.

RegExps with capturing groups → Tagged NFAs → Boolean representations → OBDD representations

E = a*aa

[Figure: NFA of the regexp “a*aa”, with its transition table T(x,i,y)]

E = (a*)aa

Tag(E) = (a*)_t1 aa

[Figure: tagged NFA of “(a*)aa” with submatch tag t1, with its extended transition table T(x,i,y,t)]
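As a minimal sketch, the extended transition table can be written down as plain data. The three rows are the ones the deck lists for “(a*)aa”; the Python names (`TaggedTransition`, `T_TAGGED`, `untagged`) are hypothetical and chosen only for this illustration.

```python
from typing import FrozenSet, List, NamedTuple, Set, Tuple


class TaggedTransition(NamedTuple):
    """One row of the extended transition table T(x, i, y, t)."""
    src: int              # x: current state
    symbol: str           # i: input symbol
    dst: int              # y: next state
    tags: FrozenSet[str]  # t: submatch tags emitted on this transition


# Tagged NFA for (a*)aa: state 1 is the start state, state 3 accepts.
T_TAGGED: List[TaggedTransition] = [
    TaggedTransition(1, "a", 1, frozenset({"t1"})),  # loop inside the group
    TaggedTransition(1, "a", 2, frozenset()),        # first literal 'a'
    TaggedTransition(2, "a", 3, frozenset()),        # second literal 'a'
]
START: Set[int] = {1}
ACCEPT: Set[int] = {3}


def untagged(transitions: List[TaggedTransition]) -> Set[Tuple[int, str, int]]:
    """Drop the tag column to recover the plain table T(x, i, y)."""
    return {(t.src, t.symbol, t.dst) for t in transitions}
```

Dropping the tag column yields the table of the untagged NFA for “a*aa”, which is why tagging does not change which strings are accepted.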

RegExp = (a*)aa; Input: aaaa

[Figure: tagged NFA with states 1, 2, 3 (accept) processing the input symbols a, a, a, a; tag sets labeled {t1}]

Frontier after each input symbol: {1} → {1,2} → {1,2,3} → {1,2,3} → {1,2,3}

Result: $1 = aa

Any path from an accept state to a start state generates a valid assignment of submatches.
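The frontier computation and the backward walk can be sketched with ordinary Python sets. This is a set-based stand-in written for illustration, not the OBDD implementation; the names `forward` and `backward` are hypothetical, and the transition rows are those of the “(a*)aa” example.

```python
# Rows of T(x, i, y, t) for (a*)aa; tags are tuples so rows stay hashable.
TRANSITIONS = [(1, "a", 1, ("t1",)), (1, "a", 2, ()), (2, "a", 3, ())]
START, ACCEPT = {1}, {3}


def forward(text):
    """Return the list of frontiers: frontiers[k] holds the states after k symbols."""
    frontiers = [set(START)]
    for ch in text:
        current = frontiers[-1]
        frontiers.append({y for (x, i, y, _) in TRANSITIONS if x in current and i == ch})
    return frontiers


def backward(text, frontiers):
    """Walk from an accept state back to a start state.

    Returns the tag tuple attached to each input position, or None if the
    input is not accepted.
    """
    finals = frontiers[-1] & ACCEPT
    if not finals:
        return None
    state = min(finals)
    tags = [()] * len(text)
    for pos in range(len(text) - 1, -1, -1):
        # Any transition into `state` whose source was in the earlier frontier
        # lies on a valid path back to a start state.
        x, _, _, t = next(
            tr for tr in TRANSITIONS
            if tr[2] == state and tr[1] == text[pos] and tr[0] in frontiers[pos]
        )
        tags[pos] = t
        state = x
    return tags


frontiers = forward("aaaa")
tag_path = backward("aaaa", frontiers)
print(frontiers)  # [{1}, {1, 2}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3}]
print(tag_path)   # [('t1',), ('t1',), (), ()]
```

The first two positions carry t1, which corresponds to $1 = aa.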

Match test:

Submatch extraction:

n – size of tagged NFA

l – length of input string

Can we make the operations faster?

- Representing tagged NFAs using Boolean functions
  - Updating frontiers in one step using a single Boolean formula
- Using OBDDs to manipulate Boolean functions

RegExp: (a*)aa

T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})

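Match test on input aaaa: the current states and the input symbol are conjoined with the transition table; the result is the set of intermediate transitions, whose targets are the next states.

Symbol 1 (start states {1}, input symbol a):
{1} ∧ a ∧ T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {})
Next states: {1,2}

Symbol 2 (current states {1,2}, input symbol a):
{1,2} ∧ a ∧ T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
Next states: {1,2,3}

Symbol 3 (current states {1,2,3}, input symbol a):
{1,2,3} ∧ a ∧ T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
Next states: {1,2,3}

…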
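To make the single-formula frontier update concrete, here is a small sketch that encodes states, symbols, and tags as bits and evaluates the conjunction by brute-force enumeration. The bit encoding is an assumption made for this example (state s is the two-bit binary value of s, symbol a is bit 0, tag set {t1} is bit 1), and the enumeration stands in for what an OBDD package would do symbolically.

```python
from itertools import product

# Hypothetical encoding: state s -> bits (x1, x0) with s == 2*x1 + x0,
# symbol 'a' -> i0 = 0, tag set {} -> t0 = 0, tag set {t1} -> t0 = 1.
SYMBOL_BIT = {"a": 0}


def T(x1, x0, i0, y1, y0, t0):
    """Boolean characteristic function of the transition table for (a*)aa."""
    return bool(
        (not x1 and x0 and not i0 and not y1 and y0 and t0)          # 1 -a-> 1 / t1
        or (not x1 and x0 and not i0 and y1 and not y0 and not t0)   # 1 -a-> 2 / {}
        or (x1 and not x0 and not i0 and y1 and y0 and not t0)       # 2 -a-> 3 / {}
    )


def next_frontier(frontier, symbol):
    """Conjoin frontier, symbol, and T; then keep only the y bits.

    Quantifying out x and t and renaming y to x is done here by collecting
    the decoded target states into a plain set.
    """
    i0 = SYMBOL_BIT[symbol]
    result = set()
    for x1, x0, y1, y0, t0 in product((0, 1), repeat=5):
        in_frontier = (2 * x1 + x0) in frontier
        if in_frontier and T(x1, x0, i0, y1, y0, t0):
            result.add(2 * y1 + y0)
    return result


frontier = {1}
for ch in "aaaa":
    frontier = next_frontier(frontier, ch)
print(frontier)  # {1, 2, 3}
```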

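Submatch extraction on input aaaa: start from the last symbol, going backwards. At each step, the input symbol and the current state are conjoined with the intermediate transitions recorded for that symbol; then rename the previous state as the current state and continue.

Symbol 4 (the last input symbol; current state 3, the accept state):
Intermediate transitions [4] = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
a ∧ 3 ∧ Intermediate transitions [4] = 2 ∧ a ∧ 3 ∧ {}
Previous state of 3: 2. No output submatch tag.

Symbol 3 (current state 2):
Intermediate transitions [3] = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
a ∧ 2 ∧ Intermediate transitions [3] = 1 ∧ a ∧ 2 ∧ {}
Previous state of 2: 1. No output submatch tag.

Symbol 2 (current state 1):
Intermediate transitions [2] = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
a ∧ 1 ∧ Intermediate transitions [2] = 1 ∧ a ∧ 1 ∧ t1
Previous state of 1: 1. Output submatch tag t1.

Symbol 1 (current state 1):
Intermediate transitions [1] = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {})
a ∧ 1 ∧ Intermediate transitions [1] = 1 ∧ a ∧ 1 ∧ t1
Previous state of 1: 1. Output submatch tag t1.

Result: the first two symbols of aaaa are assigned t1, so $1 = aa.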

Finding new frontiers after processing an input symbol:

Next frontiers =

Checking acceptance:

A back traversal approach: starting from the last input symbol.

Submatch extraction: the submatch for tag ti is the last consecutive sequence of characters that are assigned ti
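The "last consecutive sequence" rule can be sketched as a small helper that takes the per-character tag sets produced by the backward traversal. The function name and the list-of-sets input format are assumptions for this example.

```python
def extract_submatch(text, tags_per_char, tag):
    """Return the last consecutive run of characters labeled with `tag`.

    tags_per_char[k] is the set of tags assigned to text[k]. An empty string
    is returned when no character carries the tag.
    """
    end = len(text)
    # Skip trailing characters that do not carry the tag.
    while end > 0 and tag not in tags_per_char[end - 1]:
        end -= 1
    start = end
    # Extend the run to the left while the tag is still present.
    while start > 0 and tag in tags_per_char[start - 1]:
        start -= 1
    return text[start:end]


# Tag assignment from the (a*)aa example on input aaaa.
example_tags = [{"t1"}, {"t1"}, set(), set()]
print(extract_submatch("aaaa", example_tags, "t1"))  # aa
```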

- Representation of tagged NFAs, match test, and submatch extraction using OBDDs
- OBDD representations for
  - Transitions with submatch tags
  - Intermediate transitions
  - Submatch tags
  - Set of start states
  - Set of accept states
  - Set of frontiers
  - Input symbols
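To show what an OBDD representation of the transitions looks like, here is a toy reduced OBDD written from scratch. It is a teaching sketch only and does not reflect the CUDD-based toolchain; the variable order (x1, x0, i0, y1, y0, t0) and the bit encoding of states, symbols, and tags are assumptions made for this example.

```python
from itertools import product
from operator import and_, or_


class BDD:
    """Toy reduced ordered BDD. Node ids are ints; 0 and 1 are the terminals."""

    def __init__(self, num_vars):
        # Terminals get variable index num_vars so they sort after real variables.
        self.nodes = {0: (num_vars, None, None), 1: (num_vars, None, None)}
        self.unique = {}  # (var, low, high) -> node id, enforces sharing
        self.cache = {}   # memo table for apply

    def mk(self, var, low, high):
        if low == high:  # both branches agree: the test is redundant
            return low
        key = (var, low, high)
        if key not in self.unique:
            node_id = len(self.nodes)
            self.nodes[node_id] = key
            self.unique[key] = node_id
        return self.unique[key]

    def literal(self, var, positive):
        return self.mk(var, 0, 1) if positive else self.mk(var, 1, 0)

    def apply(self, op, u, v):
        key = (op, u, v)
        if key in self.cache:
            return self.cache[key]
        if u in (0, 1) and v in (0, 1):
            result = int(op(bool(u), bool(v)))
        else:
            uvar, ulow, uhigh = self.nodes[u]
            vvar, vlow, vhigh = self.nodes[v]
            top = min(uvar, vvar)
            ul, uh = (ulow, uhigh) if uvar == top else (u, u)
            vl, vh = (vlow, vhigh) if vvar == top else (v, v)
            result = self.mk(top, self.apply(op, ul, vl), self.apply(op, uh, vh))
        self.cache[key] = result
        return result

    def evaluate(self, u, assignment):
        while u not in (0, 1):
            var, low, high = self.nodes[u]
            u = high if assignment[var] else low
        return bool(u)


def cube(bdd, bits):
    """Conjunction of literals: bits[k] == 1 selects variable k, 0 its negation."""
    result = 1
    for var, bit in enumerate(bits):
        result = bdd.apply(and_, result, bdd.literal(var, bool(bit)))
    return result


# Rows of T(x, i, y, t) for (a*)aa as bit vectors (x1, x0, i0, y1, y0, t0).
ROWS = [(0, 1, 0, 0, 1, 1), (0, 1, 0, 1, 0, 0), (1, 0, 0, 1, 1, 0)]

bdd = BDD(6)
T = 0
for row in ROWS:
    T = bdd.apply(or_, T, cube(bdd, row))

# The OBDD must agree with the table on all 64 assignments.
truth_table_ok = all(
    bdd.evaluate(T, bits) == (bits in ROWS) for bits in product((0, 1), repeat=6)
)
print(truth_table_ok)
```

The `unique` table is what makes the diagram reduced: structurally identical subgraphs are stored once, which is the source of the compactness that OBDD-based matching relies on.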

Toolchain in C++, interfacing with CUDD*

[Figure: toolchain. RegExps → RE2TNFA → Tagged NFAs → TNFA2OBDD → OBDDs → PATTERNMATCH, which also takes input strings / network traffic and outputs either “No match” or “Matched at reg#” with submatches $1 = …, $2 = …]

*CUDD is a package for the manipulation of binary decision diagrams.

- Data sets
  - Snort-2009
    - RegExps: 115 regexps with capturing groups from HTTP rules
    - Traces
      - 1.2GB department network traffic (average packet size 126 bytes)
      - 1.3GB Twitter traffic (average packet size 1202 bytes)
      - 1MB synthetic trace (average string length 311 bytes)
  - Snort-2012
    - RegExps: 403 regexps with capturing groups from HTTP rules
    - Traces
      - 1.2GB department network traffic (average packet size 126 bytes)
      - 1.3GB Twitter traffic (average packet size 1202 bytes)
      - 1MB synthetic trace (average string length 689 bytes)
  - Firewall-504
    - RegExps: 504 patterns from a commercial firewall F
    - Trace: 87MB of firewall logs (average line size 87 bytes)


- Platform: Intel Core2 Duo E7500, Linux-2.6.3, 2GB RAM
- Two configurations for pattern matching
  - Conf. S
    - Patterns compiled individually
    - Compiled patterns matched sequentially against input traces
  - Conf. C
    - Patterns combined with UNION and compiled
    - Combined pattern matched against input traces
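The difference between the two configurations can be sketched with Python's `re` module standing in for the engines under test. The pattern list and function names are hypothetical; only the compile-individually versus compile-the-union structure is taken from the slide.

```python
import re

# Hypothetical pattern set, used only to illustrate the two configurations.
PATTERNS = [r"username=(\w+)", r"hostname=(\w+)", r"port=(\d+)"]


def match_sequential(patterns, line):
    """Conf. S: compile each pattern on its own and try them one after another."""
    compiled = [re.compile(p) for p in patterns]
    for index, regex in enumerate(compiled):
        m = regex.match(line)
        if m:
            return index, m.group(0)
    return None


def match_combined(patterns, line):
    """Conf. C: combine all patterns with union (|), compile once, match once."""
    union = re.compile("|".join(f"(?P<p{k}>{p})" for k, p in enumerate(patterns)))
    m = union.match(line)
    if not m:
        return None
    # Exactly one of the named wrapper groups took part in the match.
    for k in range(len(patterns)):
        if m.group(f"p{k}") is not None:
            return k, m.group(f"p{k}")
    return None


print(match_sequential(PATTERNS, "hostname=Foo"))  # (1, 'hostname=Foo')
print(match_combined(PATTERNS, "hostname=Foo"))    # (1, 'hostname=Foo')
```

In Conf. S the input is scanned once per pattern, whereas in Conf. C it is scanned once in total, which is where a combined automaton can pay off.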


Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2009 data set

Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2012 data set

Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Firewall-504 data set

- NFA-OBDD [Yang et al., RAID’10; Chasaki and Wolf, ANCS’10]
- RE2 [Cox, code.google.com/p/re2]
- PCRE [www.pcre.org]
- TNFA [Laurikari et al., SPIRE’00]
- MDFA [Yu et al., ANCS’06]
- Hybrid FA [Becchi and Crowley, CoNEXT’07]
- XFA [Smith et al., Oakland’08]
- More – see paper for details

- A novel way of annotating capturing groups
- Submatch-OBDD: a novel technique for submatch extraction using OBDDs
- Feasibility study
  - Submatch-OBDD achieves ideal performance when patterns are combined
  - Faster than RE2 and PCRE when patterns are combined