Efficient String Matching : An Aid to Bibliographic Search



Alfred V. Aho and Margaret J. Corasick

Bell Laboratories

Virus Definition
  • Each virus has its own characteristic signature
    • Example in ClamAV
      • _0017_0001_000=21b8004233c999cd218bd6b90300b440cd218b4c198b541bb80157cd21b43ecd2132ed
      • _0017_0001_000 is the virus index
      • Each pair of hex digits encodes one byte, e.g. Hex(21)=Dec(33)=‘!’
    • A virus is detected by matching its signature
Regular Expression
  • Use a regular expression (RE) to describe the signature
  • ? matches any single character
    • W32.Hybris.C (Clam)=4000?????????????83??????75f2e9????ffff00000000
  • * matches any sequence of characters (including the empty sequence)
    • Oror-fam (Clam)=495243*56697275*53455859330f5455*4b617a61*536e617073686f
  • {n1-n2} means that between n1 and n2 characters separate the two parts
    • Worm.Bagle.AG-empty (Clam)=6e74656e742d547970653a206170706c69636174696f6e2f6f637465742d73747265616d3b{40-130}2d2d2d2d2d2d2d2d
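To make the link between wildcard signatures and keyword matching concrete, here is a minimal sketch that splits the Oror-fam signature above on `*` and decodes each fixed part from hex. The helper name `fixed_parts` is hypothetical, and treating the fixed parts as separate keywords is an assumption made for illustration; it is not a description of how ClamAV itself processes signatures.

```python
# Sketch only: split a "*"-wildcard signature into its fixed hex parts.
# `fixed_parts` is a hypothetical helper, not part of ClamAV.

def fixed_parts(signature):
    """Return the fixed parts of a hex signature as a list of bytes objects."""
    return [bytes.fromhex(part) for part in signature.split("*") if part]


# Signature copied from the slide above.
OROR_FAM = "495243*56697275*53455859330f5455*4b617a61*536e617073686f"

for part in fixed_parts(OROR_FAM):
    print(part)
```

Each fixed part could then serve as one keyword for a multi-keyword matcher such as the one described in the following slides.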
Introduction
  • Goal: locate all occurrences of any of a finite number of keywords in a string of text.
  • The algorithm consists of two parts:
    • constructing a finite state pattern matching machine from the keywords;
    • using the pattern matching machine to process the text string in a single pass.
Pattern Matching Machine (1)
  • Our problem is to locate and identify all substrings of x that are keywords in K.
    • K = {y_1, y_2, …, y_k} is a finite set of strings, which we call keywords.
    • x is an arbitrary string, which we call the text string.
  • The behavior of the pattern matching machine is dictated by three functions: a goto function g, a failure function f, and an output function output.
Pattern Matching Machine (2)
  • g(s, a) = s’ or fail: maps a pair consisting of a state and an input symbol into a state or the message fail.
  • f(s) = s’: maps a state into a state; it is consulted whenever the goto function reports fail.
  • output(s) = keywords: associates a set of keywords (possibly empty) with every state.
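As an illustration of the three functions, the sketch below writes them out by hand as Python dictionaries. It assumes the keyword set {he, she, his, hers} and the state numbering 0 to 9 of the running example in these slides; the names `GOTO`, `FAILURE`, `OUTPUT` and `FAIL` are illustrative choices, not notation from the presentation.

```python
# Hand-written tables for the assumed keyword set {he, she, his, hers}.
FAIL = None  # stands for the message "fail"

GOTO = {
    0: {"h": 1, "s": 3},
    1: {"e": 2, "i": 6},
    2: {"r": 8},
    3: {"h": 4},
    4: {"e": 5},
    6: {"s": 7},
    8: {"s": 9},
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, symbol):
    """Goto function: a state, or FAIL. State 0 loops on unmatched symbols."""
    default = 0 if state == 0 else FAIL
    return GOTO.get(state, {}).get(symbol, default)


def f(state):
    """Failure function, defined for every state except the start state."""
    return FAILURE[state]


def output(state):
    """Output function: the (possibly empty) set of keywords for a state."""
    return OUTPUT.get(state, set())
```

With these tables, `g(5, "r")` is `FAIL`, `f(5)` is 2 and `g(2, "r")` is 8, which matches the transitions used in the example slides.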
The start state is state 0.

  • Let s be the current state and a the current symbol of the input string x.
  • Operating cycle
    • If g(s,a)=s’, the machine makes a goto transition: it enters state s’, and the next symbol of x becomes the current input symbol.
    • If g(s,a)=fail, the machine makes a failure transition: if f(s)=s’, it repeats the cycle with s’ as the current state and a as the current input symbol.
Example
  • Text: u s h e r s
  • State sequence: 0 0 3 4 5 (2) 8 9, where (2) is the state entered by the failure transition from state 5 on input r.
  • In state 4, since g(4,e)=5, the machine enters state 5, finds the keywords “she” and “he” ending at position four of the text string, and emits output(5).
Example Cont’d
  • In state 5 on input symbol r, the machine makes two state transitions in its operating cycle.
  • Since g(5,r)=fail, the machine M enters state 2=f(5). Then, since g(2,r)=8, M enters state 8 and advances to the next input symbol.
  • No output is generated in this operating cycle.
Algorithm 1. Pattern matching machine.

Input. A text string x = a_1 a_2 … a_n, where each a_i is an input symbol, and a pattern matching machine M with goto function g, failure function f, and output function output, as described above.

Output. Locations at which keywords occur in x.

Method.

begin
    state ← 0
    for i ← 1 until n do
        begin
            while g(state, a_i) = fail do state ← f(state)
            state ← g(state, a_i)
            if output(state) ≠ empty then
                begin
                    print i
                    print output(state)
                end
        end
end
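A minimal Python sketch of Algorithm 1 follows. It assumes the machine is given as plain dictionaries (a missing goto entry means fail, and state 0 loops on unmatched symbols) and uses hand-written tables for the assumed keyword set {he, she, his, hers} so that the block runs on its own. The function name `find_keywords` is hypothetical.

```python
def find_keywords(text, goto, failure, output):
    """Algorithm 1: return (end_position, keywords) pairs, positions 1-based."""
    matches = []
    state = 0
    for i, symbol in enumerate(text, start=1):
        # Follow failure transitions until a goto transition exists.
        # State 0 never fails: unmatched symbols loop back to state 0.
        while state != 0 and symbol not in goto.get(state, {}):
            state = failure[state]
        state = goto.get(state, {}).get(symbol, 0)
        if output.get(state):
            matches.append((i, sorted(output[state])))
    return matches


# Tables for the assumed keyword set {he, she, his, hers}.
GOTO = {0: {"h": 1, "s": 3}, 1: {"e": 2, "i": 6}, 2: {"r": 8},
        3: {"h": 4}, 4: {"e": 5}, 6: {"s": 7}, 8: {"s": 9}}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}

print(find_keywords("ushers", GOTO, FAILURE, OUTPUT))
# [(4, ['he', 'she']), (6, ['hers'])]
```

The reported positions are where a keyword ends, as in the example slide: “she” and “he” end at position four, and “hers” ends at position six.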

Construction of the functions
  • The construction has two parts
    • First: determine the states and the goto function.
    • Second: compute the failure function.
    • Computation of the output function begins in the first part and is completed in the second.
Construction of Goto function
  • Construct a goto graph like the one on the next page.
  • Add new vertices and edges to the graph, starting at the start state.
  • Add new edges only when necessary.
  • Add a loop from state 0 to state 0 on all input symbols other than those that begin a keyword.
Algorithm 2

Algorithm 2. Construction of the goto function.

Input. Set of keywords K = {y_1, y_2, …, y_k}.

Output. Goto function g and a partially computed output function output.

Method. We assume output(s) is empty when state s is first created, and g(s, a) = fail if a is undefined or if g(s, a) has not yet been defined. The procedure enter(y) inserts into the goto graph a path that spells out y.

begin
    newstate ← 0
    for i ← 1 until k do enter(y_i)
    for all a such that g(0, a) = fail do g(0, a) ← 0
end

procedure enter(a_1 a_2 … a_m):
begin
    state ← 0; j ← 1
    while g(state, a_j) ≠ fail do
        begin
            state ← g(state, a_j)
            j ← j + 1
        end
    for p ← j until m do
        begin
            newstate ← newstate + 1
            g(state, a_p) ← newstate
            state ← newstate
        end
    output(state) ← { a_1 a_2 … a_m }
end
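Algorithm 2 can be sketched in Python as below. This is one possible rendering, not the presentation's own code: states are list indices, a missing dictionary entry plays the role of fail, and the loop on state 0 is left implicit (callers treat a missing entry in `goto[0]` as a transition back to state 0). The keyword set {he, she, his, hers} is assumed for the usage example.

```python
def build_goto(keywords):
    """Algorithm 2: return (goto, output) for a list of keywords.

    goto[s] maps a symbol to the next state; a missing entry means "fail".
    """
    goto = [{}]       # goto[0] belongs to the start state
    output = [set()]  # output(s) is empty when state s is first created
    for keyword in keywords:
        state = 0
        for symbol in keyword:               # procedure enter(y)
            if symbol not in goto[state]:
                goto.append({})              # create newstate
                output.append(set())
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        output[state].add(keyword)           # partially computed output
    return goto, output


goto, output = build_goto(["he", "she", "his", "hers"])
print(len(goto))   # 10 states, numbered 0 to 9
print(goto[0])     # {'h': 1, 's': 3}
print(output[5])   # {'she'}; "he" is only added by Algorithm 3
```

Following an existing path and then extending it, as `enter` does, is merged here into a single loop; the resulting graph is the same.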

Construction of Failure function
  • Depth of s: the length of the shortest path from the start state to state s.
  • The states of depth d can be determined from the states of depth d-1.
  • Set f(s)=0 for all states s of depth 1.
Construction of Failure function Cont’d
  • To compute the failure function for the states of depth d, consider each state r of depth d-1:
    • 1. If g(r,a)=fail for all a, do nothing.
    • 2. Otherwise, for each a such that g(r,a)=s, do the following:
      • a. Set state=f(r).
      • b. Execute state ← f(state) zero or more times, until a value for state is obtained such that g(state,a)≠fail.
      • c. Set f(s)=g(state,a).
Algorithm 3

Algorithm 3. Construction of the failure function.

Input. Goto function g and output function output from Algorithm 2.

Output. Failure function f and output function output.

Method.

begin
    queue ← empty
    for each a such that g(0, a) = s ≠ 0 do
        begin
            queue ← queue ∪ {s}
            f(s) ← 0
        end
    while queue ≠ empty do
        begin
            let r be the next state in queue
            queue ← queue - {r}
            for each a such that g(r, a) = s ≠ fail do
                begin
                    queue ← queue ∪ {s}
                    state ← f(r)
                    while g(state, a) = fail do state ← f(state)
                    f(s) ← g(state, a)
                    output(s) ← output(s) ∪ output(f(s))
                end
        end
end
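The breadth-first construction of Algorithm 3 might look as follows in Python. The sketch includes a compact version of Algorithm 2 so that it runs on its own; the list-of-dictionaries representation, the function names, and the keyword set {he, she, his, hers} are assumptions made for illustration.

```python
from collections import deque


def build_goto(keywords):
    """Algorithm 2 (compact form): goto graph with per-state output sets."""
    goto, output = [{}], [set()]
    for keyword in keywords:
        state = 0
        for symbol in keyword:
            if symbol not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        output[state].add(keyword)
    return goto, output


def build_failure(goto, output):
    """Algorithm 3: compute the failure function and complete the outputs."""
    failure = [0] * len(goto)        # failure[0] is a placeholder; f(0) is never used
    queue = deque(goto[0].values())  # states of depth 1 keep f(s) = 0
    while queue:
        r = queue.popleft()
        for symbol, s in goto[r].items():
            queue.append(s)
            state = failure[r]
            # g(0, a) never fails, so the walk stops at state 0 at the latest.
            while state != 0 and symbol not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(symbol, 0)
            output[s] |= output[failure[s]]
    return failure


goto, output = build_goto(["he", "she", "his", "hers"])
failure = build_failure(goto, output)
print(failure[1:])        # [0, 0, 0, 1, 2, 0, 3, 0, 3]
print(sorted(output[5]))  # ['he', 'she']
```

Because the queue visits states in order of depth, f(r) is always known before any state reached from r is processed, which is what the pseudocode relies on.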

About construction
  • When we determine f(s)=s’, we merge the outputs of state s’ into the outputs of state s.
  • The failure function produced by Algorithm 3 can cause unnecessary transitions: if the keyword “his” were not present, the machine could go directly from state 4 to state 0, skipping an unnecessary intermediate transition to state 1.
  • To avoid such unnecessary transitions, we can use a deterministic finite automaton, which is discussed later.
Properties of Algorithms 1, 2, 3
  • Lemma 1: Suppose that in the goto graph state s is represented by the string u and state t is represented by the string v. Then f(s)=t iff v is the longest proper suffix of u that is also a prefix of some keyword.
  • Proof:
    • Suppose u = a_1 a_2 … a_j, and a_1 a_2 … a_{j-1} represents state r. Let r_1, r_2, …, r_n be the sequence of states such that: 1. r_1 = f(r); 2. r_{i+1} = f(r_i) for 1 ≤ i < n; 3. g(r_i, a_j) = fail for 1 ≤ i < n; 4. g(r_n, a_j) = t.
    • Suppose v_i represents state r_i. By the inductive hypothesis, v_1 is the longest proper suffix of a_1 a_2 … a_{j-1} that is a prefix of some keyword; v_2 is the longest proper suffix of v_1 that is a prefix of some keyword, and so on.
    • Thus v_n is the longest proper suffix of a_1 a_2 … a_{j-1} such that v_n a_j is a prefix of some keyword, so v = v_n a_j is the longest proper suffix of u that is a prefix of some keyword.
Properties of Algorithms 1, 2, 3
  • Lemma 2: The set output(s) contains y if and only if y is a keyword that is a suffix of the string representing state s.
  • Proof:
    • Let u be the string representing state s, and consider a string y in output(s).
    • If y is added to output(s) by Algorithm 2, then y=u and y is a keyword.
    • If y is added to output(s) by Algorithm 3, then y is in output(f(s)), so by the inductive hypothesis and Lemma 1 y is a keyword that is a proper suffix of u.
    • Conversely, if y is a keyword that is a proper suffix of u, then from the inductive hypothesis and Lemma 1 we know output(f(s)) contains y, so Algorithm 3 adds y to output(s).
Properties of Algorithms 1, 2, 3
  • Lemma 3: After the jth operating cycle, Algorithm 1 will be in state s iff s is represented by the longest suffix of a_1 a_2 … a_j that is a prefix of some keyword.
    • Proof: Similar to Lemma 1.
  • THEOREM 1: Algorithms 2 and 3 produce valid goto, failure, and output functions.
    • Proof: By Lemmas 2 and 3.
Time Complexity of Algorithms 1, 2, and 3
  • THEOREM 2: Using the goto, failure, and output functions created by Algorithms 2 and 3, Algorithm 1 makes fewer than 2n state transitions in processing a text string of length n.
    • From a state s of depth d, Algorithm 1 makes at most d failure transitions in one operating cycle.
    • Each failure transition decreases the depth of the current state and each goto transition increases it by at most one, so the total number of failure transitions is at least one less than the total number of goto transitions.
    • In processing an input of length n, Algorithm 1 makes exactly n goto transitions. Therefore the total number of state transitions is less than 2n.
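The bound of Theorem 2 can be checked empirically with a small counter. The sketch below assumes hand-written tables for the keyword set {he, she, his, hers} (repeated so the block runs on its own) and a hypothetical helper `count_transitions`; the sample texts are arbitrary.

```python
def count_transitions(text, goto, failure):
    """Run the operating cycle of Algorithm 1, counting both kinds of transition."""
    state = 0
    gotos = failures = 0
    for symbol in text:
        while state != 0 and symbol not in goto.get(state, {}):
            state = failure[state]
            failures += 1
        state = goto.get(state, {}).get(symbol, 0)  # state 0 loops on other symbols
        gotos += 1
    return gotos, failures


# Tables for the assumed keyword set {he, she, his, hers}.
GOTO = {0: {"h": 1, "s": 3}, 1: {"e": 2, "i": 6}, 2: {"r": 8},
        3: {"h": 4}, 4: {"e": 5}, 6: {"s": 7}, 8: {"s": 9}}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}

for text in ("ushers", "shishishe", "hehehers"):
    gotos, failures = count_transitions(text, GOTO, FAILURE)
    # Exactly one goto transition per symbol, and fewer than 2n in total.
    assert gotos == len(text)
    assert gotos + failures < 2 * len(text)
    print(text, gotos, failures)
```

For “ushers” the counter reports six goto transitions and one failure transition, the one from state 5 to state 2 on input r.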
Time Complexity of Algorithms 1, 2, and 3
  • THEOREM 3: Algorithm 2 requires time linearly proportional to the sum of the lengths of the keywords.
  • Proof:
    • Straightforward.
  • THEOREM 4: Algorithm 3 can be implemented to run in time proportional to the sum of the lengths of the keywords.
  • Proof:
    • The total number of executions of state ← f(state) is bounded by the sum of the lengths of the keywords.
    • Using linked lists to represent the output set of a state, we can execute the statement output(s) ← output(s) ∪ output(f(s)) in constant time.

Eliminating Failure Transitions
  • A deterministic finite automaton can be used in Algorithm 1 in place of the goto and failure functions.
  • It has a next move function δ such that, for each state s and input symbol a, δ(s, a) is a state.
  • By using the next move function δ, we can dispense with all failure transitions and make exactly one state transition per input character.
Algorithm 4. Construction of a deterministic finite automaton.

Input. Goto function g from Algorithm 2 and failure function f from Algorithm 3.

Output. Next move function δ.

Method.

begin
    queue ← empty
    for each symbol a do
        begin
            δ(0, a) ← g(0, a)
            if g(0, a) ≠ 0 then queue ← queue ∪ {g(0, a)}
        end
    while queue ≠ empty do
        begin
            let r be the next state in queue
            queue ← queue - {r}
            for each symbol a do
                if g(r, a) = s ≠ fail then
                    begin
                        queue ← queue ∪ {s}
                        δ(r, a) ← s
                    end
                else δ(r, a) ← δ(f(r), a)
        end
end
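A possible Python rendering of Algorithm 4 is sketched below. It assumes hand-written goto and failure tables for the keyword set {he, she, his, hers} and a small explicit alphabet in which `u` stands in for “any other symbol”; the function name `build_dfa` and the nested-dictionary layout of δ are illustrative choices.

```python
from collections import deque


def build_dfa(goto, failure, alphabet):
    """Algorithm 4: compute the next move function delta[state][symbol]."""
    delta = {0: {}}
    queue = deque()
    for a in alphabet:
        delta[0][a] = goto[0].get(a, 0)  # g(0, a) never fails
        if delta[0][a] != 0:
            queue.append(delta[0][a])
    while queue:
        r = queue.popleft()
        delta[r] = {}
        for a in alphabet:
            s = goto.get(r, {}).get(a)
            if s is not None:
                queue.append(s)
                delta[r][a] = s
            else:
                # f(r) is shallower than r, so delta[f(r)] is already complete.
                delta[r][a] = delta[failure[r]][a]
    return delta


# Tables for the assumed keyword set {he, she, his, hers}.
GOTO = {0: {"h": 1, "s": 3}, 1: {"e": 2, "i": 6}, 2: {"r": 8},
        3: {"h": 4}, 4: {"e": 5}, 6: {"s": 7}, 8: {"s": 9}}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}

delta = build_dfa(GOTO, FAILURE, "ehirsu")

state = 0
for symbol in "ushers":          # exactly one transition per input symbol
    state = delta[state][symbol]
print(state)                     # 9
```

A table like this only covers symbols in the chosen alphabet, so a practical version would need a default entry for all other symbols; that trade-off is the extra memory mentioned in the conclusion.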

Fig. 3. Next move function. The symbol “.” stands for any input symbol not listed above it for that state.

state(s)      input symbol    next state
0             h               1
              s               3
              .               0
1             e               2
              i               6
              h               1
              s               3
              .               0
9, 7, 3       h               4
              s               3
              .               0
5, 2          r               8
              h               1
              s               3
              .               0
6             s               7
              h               1
              .               0
4             e               5
              i               6
              h               1
              s               3
              .               0
8             s               9
              h               1
              .               0

Conclusion
  • The approach is attractive for large numbers of keywords, since all keywords can be matched simultaneously in one pass.
  • Using the next move function
    • can potentially reduce the number of state transitions by 50%, at the cost of more memory.
    • In practice the saving is smaller, because the machine spends most of its time in state 0, from which there are no failure transitions.
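[Figure: goto graph of the example machine, states 0 to 9. Edges: 0 -h-> 1 -e-> 2 -r-> 8 -s-> 9; 1 -i-> 6 -s-> 7; 0 -s-> 3 -h-> 4 -e-> 5; loop on state 0 for all symbols other than h and s.]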