Loading in 2 Seconds...

Efficient String Matching : An Aid to Bibliographic Search

Loading in 2 Seconds...

- By
**mairi** - Follow User

- 115 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Efficient String Matching : An Aid to Bibliographic Search' - mairi

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Efficient String Matching : An Aid to Bibliographic Search

Alfred V. Aho and Margaret J. Corasick

Bell Laboratories

Virus Definition

- Each virus has its peculiar signature
- Example in ClamAV
- _0017_0001_000=21b8004233c999cd218bd6b90300b440cd218b4c198b541bb80157cd21b43ecd2132ed
- _0017_0001_000 virus index
- Hex(21)=Dec(33)=‘!’
- Match the signature for detecting virus

Regular Expression

- Use RE to describe the signature
- ? can be any one char
- W32.Hybris.C (Clam)=4000?????????????83??????75f2e9????ffff00000000
- * can be any chars (including no char)
- Oror-fam (Clam)=495243*56697275*53455859330f5455*4b617a61*536e617073686f
- {n1-n2}, there are n1~n2 chars between two parts
- Worm.Bagle.AG-empty (Clam)=6e74656e742d547970653a206170706c69636174696f6e2f6f637465742d73747265616d3b{40-130}2d2d2d2d2d2d2d2d

Introduction

- Locate all occurrences of any of a finite number of keywords in a string of text.
- Consists of two parts :
- constructing a finite state pattern matching machine from the keywords
- using the pattern matching machine to process the text string in a single pass.

Pattern Matching Machine(1)

- Our problem is to locate and identify all substrings of x which are keywords in K.
- K : K={y1,y2,…,yk} be a finite set of strings which we shall call keywords
- x : x is an arbitrary string which we shall call the text string.
- The behavior of the pattern matching machine is dictated by three functions: a goto function g, a failure function f, and an output function output.

Pattern Matching Machine(2)

- g (s,a) = s’ or fail：maps a pair consisting of a state and an input symbol into a state or the message fail.
- f (s) = s’：maps a state into a state, and is consulted whenever the goto function reports fail.
- output (s) = keywords：associating a set of keyword (possibly empty) with every state.

Pattern Matching Machine Example with keywords {he,she,his,hers}

- Let s be the current state and a the current symbol of the input string x.
- Operating cycle
- If g(s,a)=s’, makes a goto transition, and enters state s’ and the next symbol of x becomes the current input symbol.
- If g(s,a)=fail, make a failure transition f. If f(s)=s’, the machine repeats the cycle with s’ as the current state and a as the current input symbol.

Example

- Text: u s h e r s

State: 0 0 3 4 5 8 9

2

- In state 4, since g(4,e)=5, and the machine enters state 5, and finds keywords “she” and “he” at the end of position four in text string, emits output(5)

Example Cont’d

- In state 5 on input symbol r, the machine makes two state transitions in its operating cycle.
- Since g(5,r)=fail, M enters state 2=f(5) . Then since g(2,r)=8, M enters state 8 and advances to the next input symbol.
- No output is generated in this operating cycle.

Algorithm 1. Pattern matching machine.

Input. A text string x = a1 a2 … a nwhere each a iis an input symbol

and a pattern matching machine M with goto function g, failure

function f, and output function output, as described above.

Output. Locations at which keywords occur in x.

Method.

begin

state ←0

fori ← 1 until n do

begin

whileg (state, a i ) = fail dostate ← f(state)

state ← g (state, a i )

if output (state)≠ empty then

begin

printi

print output (state)

end

end

end

Construction the functions

- Two part to the construction
- First：Determine the states and the goto function.
- Second：Compute the failure function.
- Output function start at first, complete at second.

Construction of Goto function

- Construct a goto graph like next page.
- New vertices and edges to the graph, starting at the start state.
- Add new edges only when necessary.
- Add a loop from state 0 to state 0 on all input symbols other than the first one in each keyword.

Algorithm 2

Algorithm 2. Construction of the goto function.

Input. Set of keywords K = {yl, y2, . . . . . yk}.

Output. Goto function g and a partially computed output function

output.

Method. We assume output(s) is empty when state s is first created,

and g(s, a) = fail if a is undefined or if g(s, a) has not yet

been defined. The procedure enter(y) inserts into the goto

graph a path that spells out y.

begin

newstate ← 0

for i ← 1 until k doenter(y i )

for all a such that g(0, a) = fail do g(0, a) ← 0

end

procedureenter(a 1 a 2 … a m):

begin

state ←0; j ← 1

whileg (state, aj )≠ fail do

begin

state ← g (state, aj)

j ← j + l

end

for p ← j until m do

begin

newstate ← newstate + 1

g (state, ap ) ← newstate

state ← newstate

end

output(state) ← { a 1 a 2 … a m}

end

Construction of Failure function

- Depth of s：the length of the shortest path from the start state to state s.
- The states of depth d can be determined from the states of depth d-1.
- Make f(s)=0 for all states s of depth 1.

Construction of Failure function Cont’d

- Compute failure function for the state of depth d, each state r of depth d-1：
- 1. If g(r,a)=fail for all a, do nothing.
- 2. Otherwise, for each a such that g(r,a)=s, do the following：
- a. Set state=f(r) .
- b. Execute state ←f(state) zero or more times, until a value for state is obtained such that g(state,a)≠fail .
- c. Set f(s)=g(state,a) .

Algorithm 3

Algorithm 3. Construction of the failure function.

Input. Goto function g and output function output from Algorithm 2.

Output. Failure function fand output function output.

Method.

begin

queue ← empty

for each a such that g(0, a) = s≠0 do

begin

queue ← queue ∪ {s}

f(s) ← 0

end

begin

let r be the next state in queue

queue ← queue - {r}

for each asuch that g(r, a) = s≠fail do

begin

queue ← queue ∪ {s}

state ← f(r)

whileg (state, a) = fail dostate ← f(state)

f(s) ← g(state, a)

output(s) ←output(s) ∪ output(f(s))

end

end

end

About construction

- When we determine f(s)=s’, we merge the outputs of state s with the output of state s’.
- In fact, if the keyword “his” were not present, then could go directly from state 4 to state 0, skipping an unnecessary intermediate transition to state 1.
- To avoid above, we can use the deterministic finite automaton, which discuss later.

Properties of Algorithms 1,2,3

- Lemma 1: Suppose that in the goto graph state s is represented by the string u and state t is represented by the string v. Then f(s)=t iff v is the longest proper suffix of u that is also a prefix of some keyword.
- Proof :
- Suppose u=a1a2…aj, and a1a2…aj-1 represents state r, let r1,r2,…,rn be the sequence of states : 1. r1=f(r) ; 2. ri+1=f(ri) ; 3.g(ri,aj)=fail for 1≦i＜n ; 4.g(rn,aj)=t
- Suppose vi represents state ri, v1 is the longest proper suffix of a1a2…aj-1 that is a prefix of some keyword; v2 is the longest proper suffix of v1 that is a prefix of some keyword, and so on.
- Thus vn is the longest suffix of a1a2…aj-1 such that vnaj is a prefix of some keyword.

Properties of Algorithms 1,2,3

- Lemma 2 : The set output(s) contains y if and only if y is a keyword that is a suffix of the string representing state s.
- Proof :
- Consider a string y in output(s).
- If y is added to output(s) by algorithm 2, then y=u and y is a keyword.
- If y is added to output(s) by algorithm 3, then y is in output(f(s)). If y is a proper suffix of u, then from the inductive hypothesis and Lemma 1 we know output(f(s)) contains y.

Properties of Algorithms 1,2,3

- Lemma 3 : After the jth operating cycle, Algorithm 1 will be in state s iff s is represented by the longest suffix of a1a2…aj that is a prefix of some keyword.
- Proof : Similar to Lemma 1.
- THEOREM 1 :Algorithms 2 and 3 produce valid goto,failure, and output functions.
- Proof : By Lemmas 2 and 3.

Time Complexity of Algorithms 1, 2, and 3

- THEOREM 2 :Using the goto, failure and output functions created by Algorithms 2 and 3, Algorithm 1 makes fewer than 2n state transitions in processing a text string of length n.
- From state s of depth d Algorithm 1 make d failure transitions at most in one operating cycle.
- Number of failure transitions must be at least one less than number of goto transitions.
- processing an input of length n Algorithm 1 makes exactly n goto transitions. Therefore the total number of state transitions is less than 2n.

Time Complexity of Algorithms 1, 2, and 3

- THEOREM 3 : Algorithms 2 requires time linearly proportional to the sum of the lengths of the keywords.
- Proof :
- Straightforward
- THEOREM 4 : Algorithms 3 can be implemented to run in time proportional to the sum of the lengths of the keywords.
- Proof :
- Total number of executions of state← f(state) is bounded by the sum of the lengths of the keywords.
- Using linked lists to represent the output set of a state, we can execute the statement output(s) ← output(s)∪ output(f(s)) in constant time.

procedureenter(a 1 a 2 … a m):

begin

state ←0; j ← 1

whileg (state, aj )≠ fail do

begin

state ← g (state, aj)

j ← j + l

end

for p ← j until m do

begin

newstate ← newstate + 1

g (state, ap ) ← newstate

state ← newstate

end

output(state) ← { a 1 a 2 … a m}

end

begin

let r be the next state in queue

queue ← queue - {r}

for each asuch that g(r, a) = s≠fail do

begin

queue ← queue ∪ {s}

state ← f(r)

whileg (state, a) = fail dostate ← f(state)

f(s) ← g(state, a)

output(s) ←output(s) ∪ output(f(s))

end

end

end

Eliminating Failure Transitions

- Using in algorithm 1
- δ(s, a), a next move function δ such that for each state s and input symbol a.
- By using the next move function δ, we can dispense with all failure transitions, and make exactly one state transition per input character.

Algorithm 4. Construction of a deterministic finite automaton.

Input. Goto function g from Algorithm 2 and failure function f from Algorithm 3.

Output. Next move function 8.

Method.

begin

queue ← empty

for each symbol ado

begin

δ(0, a) ← g(0, a)

ifg (0, a) ≠ 0thenqueue ← queue∪ {g (0, a) }

end

whilequeue ≠ emptydo

begin

let r be the next state in queue

queue ← queue - {r}

for each symbol a do

if g(r, a) = s ≠ faildo

begin

queue ← queue ∪ {s}

δ(r, a) ← s

end

elseδ(r, a) ←δ(f(r), a)

end

end

input symbol next state

state 0: h 1

s 3

. 0

state 1 : e 2

i 6

h 1

s 3

. 0

state 9:state7: state3 : h 4

s 3

. 0

state 5:state2 : r 8

h 1

s 3

. 0

state 6 : s 7

h 1

. 0

state 4 : e 5

i 6

h 1

s 3

. 0

state 8 : s 9

h 1

. 0

Conclusion

- Attractive in large numbers of keywords, since all keywords can be simultaneously matched in one pass.
- Using Next move function
- can potentially reduce state transitions by 50%, but more memory.
- Spend most time in state 0 from which there are no failure transitions.

Download Presentation

Connecting to Server..