Pattern Matching Using n-grams With Algebraic Signatures

Witold Litwin [1], Riad Mokadem [1], Philippe Rigaux [1] & Thomas Schwarz [2]
[1] Université Paris Dauphine
[2] Santa Clara University

n-gram Search
  • New pattern matching idea
  • Matches algebraic signatures
  • Preprocesses both the pattern and the string (record)
    • String preprocessing is a new idea
      • To the best of our knowledge
  • Provides incidental protection of stored data
    • Important for P2P & grid systems
  • Fast processing
  • Especially useful for DBs & longer patterns
    • ASCII, Unicode, DNA…
    • Should then often be faster than Boyer-Moore
    • Possibly the fastest known in this context
Algebraic Signature
  • Symbols of the alphabet are elements of a Galois Field
    • GF(256) usually
  • We choose one primitive element α there
    • Usually α = 2
  • The algebraic signature of a string of i symbols $p_1 \dots p_i$ is the sum:

$p'_i = p_1 \alpha + p_2 \alpha^2 + \dots + p_i \alpha^i$

  • Here the addition and the multiplication are the operations in the GF.
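
As a minimal sketch of the sum above, the following Python computes the algebraic signature of a byte string in GF(256) with α = 2. The reduction polynomial 0x11D (for which 2 is a primitive element) and the function names are assumptions made for illustration; the slides do not name a polynomial.

```python
POLY = 0x11D   # assumed reduction polynomial x^8 + x^4 + x^3 + x^2 + 1
ALPHA = 2      # primitive element, as on the slide


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(256) elements by shift-and-add (no tables needed)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^f) is XOR
        a <<= 1
        if a & 0x100:
            a ^= POLY        # reduce back to 8 bits
        b >>= 1
    return result


def algebraic_signature(symbols: bytes) -> int:
    """Return p_1*alpha + p_2*alpha^2 + ... + p_i*alpha^i for the given symbols."""
    sig, power = 0, 1
    for p in symbols:
        power = gf_mul(power, ALPHA)   # alpha^j for the j-th symbol (j starts at 1)
        sig ^= gf_mul(p, power)
    return sig
```

The signature is a single GF element, so with GF(256) it fits in one byte whatever the length of the string.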
Algebraic Signature
  • In our GF($2^f$), where f = 8 or 16:

$p + q = p - q = p \text{ XOR } q$

  • One method for multiplying nonzero p and q is (for f = 8, where $2^f - 1 = 255$):

$p \cdot q = \mathrm{antilog}((\log p + \log q) \bmod 255)$

  • The division is then:

$p / q = \mathrm{antilog}((\log p - \log q) \bmod 255)$

  • The log and antilog are encoded in log and antilog tables with $2^f$ elements each.
    • Entry 0 of the log table is for element 0 of the GF and is by convention set to $2^f - 1$.
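
The table-based arithmetic can be sketched as follows for f = 8. This is an illustrative sketch, not the authors' code: the polynomial 0x11D and the names `LOG`, `ANTILOG`, `gf_mul` and `gf_div` are assumptions.

```python
POLY = 0x11D    # assumed reduction polynomial; alpha = 2 is primitive for it
ORDER = 255     # 2^f - 1 for f = 8

# ANTILOG[e] = alpha^e; LOG is its inverse on the nonzero elements.
# 255 antilog entries suffice here because exponents are reduced mod 255.
ANTILOG = [0] * ORDER
LOG = [ORDER] * 256          # LOG[0] = 2^f - 1 by the convention above
x = 1
for e in range(ORDER):
    ANTILOG[e] = x
    LOG[x] = e
    x <<= 1                  # multiply by alpha = 2
    if x & 0x100:
        x ^= POLY


def gf_mul(p: int, q: int) -> int:
    """p * q = antilog((log p + log q) mod 255); zero is handled separately."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % ORDER]


def gf_div(p: int, q: int) -> int:
    """p / q = antilog((log p - log q) mod 255) for q != 0."""
    if q == 0:
        raise ZeroDivisionError("division by the zero element of GF(256)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % ORDER]
```

Each multiplication or division then costs two table lookups, one integer addition or subtraction, and one more lookup.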
Cumulative Algebraic Signature
  • We encode every symbol $p_i$ in a string into the signature $p'_i$ of the prefix $p_1 \dots p_i$
  • The value of a CAS symbol now also encodes the knowledge of the values of all the previous ones
  • Matching a single symbol means prefix matching
Application of CASs
  • Protection against involuntary data disclosure
    • On P2P & Grid servers especially
  • Numerous CAS-encoded string matching algorithms
    • Prefix match with O(1) complexity
    • Pattern match by signature only
      • Karp-Rabin-like, linear O(L) complexity
    • Longest common string search
    • Longest common prefix search
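
Two of the listed operations can be sketched on CAS-encoded data: the O(1) prefix test and the linear, signature-only pattern match. This is a hedged illustration that assumes GF(256), α = 2 and the polynomial 0x11D; all function names are hypothetical. Both tests compare signatures only, so they can report false positives when signatures collide.

```python
POLY = 0x11D                      # assumed reduction polynomial
ANTILOG, LOG = [0] * 255, [255] * 256
x = 1
for e in range(255):
    ANTILOG[e], LOG[x] = x, e
    x <<= 1
    if x & 0x100:
        x ^= POLY


def cas_encode(s: bytes) -> list:
    """Replace each symbol p_i by the signature p'_i of the prefix p_1..p_i."""
    out, acc = [], 0
    for i, p in enumerate(s, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % 255]   # p_i * alpha^i
        out.append(acc)
    return out


def las(cas: list, k: int, l: int) -> int:
    """Logarithmic signature of symbols k..l (1-based), from CAS symbols only."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return 255 if d == 0 else (LOG[d] - (k - 1)) % 255


def same_prefix(cas_a: list, cas_b: list, i: int) -> bool:
    """O(1): the first i symbols (probably) agree iff the i-th CAS symbols agree."""
    return cas_a[i - 1] == cas_b[i - 1]


def signature_only_search(record_cas: list, pattern_las: int, K: int) -> list:
    """O(L) scan: 0-based starts of every length-K window whose LAS equals the pattern's."""
    L = len(record_cas)
    return [end - K for end in range(K, L + 1)
            if las(record_cas, end - K + 1, end) == pattern_las]
```

A caller would pass `las(cas_encode(pattern), 1, len(pattern))` as `pattern_las`, so the server only ever sees one signature value and the pattern length.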
CAS Properties
  • O(K) encoding and decoding speed
  • For encoding, for instance:

$p'_i = p'_{i-1} + p_i \alpha^i = \mathrm{CAS}(p_{i-1}) + p_i \alpha^i$

  • Fast n-gram signature calculus
    • For $S_{k,l} = p_k \dots p_l$ with k > 1 and l − k + 1 = n, i.e., the n-gram taken as a string of l − k + 1 symbols:

$\mathrm{AS}(S_{k,l}) = (p'_l \text{ XOR } p'_{k-1}) / \alpha^{k-1}$

  • Logarithmic Algebraic Signature (LAS)

$\mathrm{LAS}(S_{k,l}) = \log \mathrm{AS}(S_{k,l}) = (\log(p'_l \text{ XOR } p'_{k-1}) - (k-1)) \bmod (2^f - 1)$

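
The encoding recurrence and the n-gram formulas can be checked with a short sketch. It assumes GF(256), α = 2 and the polynomial 0x11D; the names are illustrative and do not come from the paper. Positions k and l are 1-based, as on the slide, and the k = 1 case uses an empty-prefix signature of 0.

```python
POLY = 0x11D                      # assumed reduction polynomial
ANTILOG, LOG = [0] * 255, [255] * 256
x = 1
for e in range(255):
    ANTILOG[e], LOG[x] = x, e
    x <<= 1
    if x & 0x100:
        x ^= POLY


def cas_encode(s: bytes) -> list:
    """p'_i = p'_{i-1} XOR p_i * alpha^i."""
    out, acc = [], 0
    for i, p in enumerate(s, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % 255]
        out.append(acc)
    return out


def cas_decode(cas: list) -> bytes:
    """Invert the encoding: p_i = (p'_i XOR p'_{i-1}) / alpha^i."""
    out, prev = [], 0
    for i, c in enumerate(cas, start=1):
        d = c ^ prev
        out.append(0 if d == 0 else ANTILOG[(LOG[d] - i) % 255])
        prev = c
    return bytes(out)


def ngram_as(cas: list, k: int, l: int) -> int:
    """AS(S_{k,l}) = (p'_l XOR p'_{k-1}) / alpha^(k-1)."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return 0 if d == 0 else ANTILOG[(LOG[d] - (k - 1)) % 255]


def ngram_las(cas: list, k: int, l: int) -> int:
    """LAS(S_{k,l}) = (log(p'_l XOR p'_{k-1}) - (k-1)) mod 255, with LAS(0) = 255."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return 255 if d == 0 else (LOG[d] - (k - 1)) % 255
```

The LAS form avoids the antilog lookup and the division: one XOR, one log lookup and one subtraction per n-gram.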

The n-gram Search: Key Ideas
  • Design a sublinear pattern match search
    • With speed about L / K
  • Apply to CAS encoded DB
    • New idea for string search algorithm with preprocessing
    • Justified for a DB
      • Store once, search many times
The n-gram Search: Key Ideas
  • Preprocess the pattern to create a jump table
    • As in Boyer-Moore
  • Use n-grams with n > 1 to increase the discriminative power of an attempt
    • Comparison of a sample from the pattern
      • a single symbol for BM
      • an LAS of an n-gram for a CAS-encoded string
The n-gram Search: Key Ideas
  • If the alphabet uses m symbols, the probability that a symbol matches is 1/m
    • Assuming all symbols equally likely
  • For usual ASCII pattern matching, m = 20-25
  • For DNA, m = 4
  • A single symbol may often match without the whole pattern matching
    • E.g., one time in four for DNA, on average
  • Leading to small jumps
    • By m symbols on average
The n-gram Search: Key Ideas
  • The probability of an n-gram matching may be as low as:

$\max(1 / 2^f,\ 1 / m^n) = 1 / \min(2^f, m^n)$


  • In our examples it can reach 1 / 256
    • More discriminative sampling
    • Longer jumps
      • By almost K or 256 symbols in general
  • Useful for longer strings
    • DNA, text, images…
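
A worked instance of these probabilities, assuming equally likely symbols and GF(256), so that an LAS can take at most $2^f = 256$ values. The choice m = 25 is the upper end of the 20-25 range quoted earlier for ASCII; the 4-gram case is an illustration not taken from the slides.

```latex
\begin{align*}
\text{DNA, } m = 4,\ n = 1: &\quad 4^1 = 4 \text{ possible 1-grams} \Rightarrow \text{match probability } 1/4 \\
\text{DNA, } m = 4,\ n = 4: &\quad 4^4 = 256 \text{ possible 4-grams} \Rightarrow \text{match probability } 1/256 \\
\text{ASCII, } m = 25,\ n = 2: &\quad 25^2 = 625 > 256 \text{ signature values} \Rightarrow \text{match probability about } 1/256
\end{align*}
```

In the last case several 2-grams must share a signature, so lengthening the n-gram further cannot push the match probability below about 1/256 in GF(256).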
ASCII Example: Usual Alphabet

[Figure: search trace over an ASCII string. 2-grams => 5 jumps; 1-gram => 6 jumps.]

DNA Example: 4-letter Alphabet

[Figure: search traces over a DNA string. The figure labels read: 3 jumps, 4 jumps, 4 jumps, 11 jumps.]

The n-gram Search Preprocessing
  • Encode every record (string) into its CAS
    • Done anyhow for incidental protection in SDDS-2006
  • Encode the terminal n-gram of the searched pattern $S_K$ into its LAS, kept in variable V
  • Fill up the jump table T for every other n-gram in $S_K$
    • Calculate every LAS
    • For each LAS, store in T its rightmost offset with respect to the end of $S_K$
The n-gram Search Jump Table
  • For GF(256), T has one entry per LAS value. For every n-gram $S_{i,i+n-1}$ in the pattern, with $j = \mathrm{LAS}(S_{i,i+n-1})$:
    • T(j) = the offset
    • T(j) = K − n + 1 otherwise, i.e., for every LAS value j that no such n-gram produces
  • Reminder: LAS(0) = 255
  • T can also be a hash table
    • See the paper
    • Slower to use but possibly more memory efficient
      • Probably more useful for a larger GF
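
The preprocessing of the pattern can be sketched as below. The sketch assumes GF(256), α = 2 and the polynomial 0x11D, uses a plain 256-entry list for T, and computes each n-gram's LAS directly from the raw pattern symbols; `ngram_las` and `build_jump_table` are illustrative names, not the paper's.

```python
POLY = 0x11D                      # assumed reduction polynomial
ANTILOG, LOG = [0] * 255, [255] * 256
x = 1
for e in range(255):
    ANTILOG[e], LOG[x] = x, e
    x <<= 1
    if x & 0x100:
        x ^= POLY


def ngram_las(gram: bytes) -> int:
    """LAS of an n-gram given as raw symbols: log(g_1*alpha + ... + g_n*alpha^n)."""
    sig = 0
    for j, p in enumerate(gram, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + j) % 255]
    return LOG[sig]               # LOG[0] == 255, matching LAS(0) = 255


def build_jump_table(pattern: bytes, n: int):
    """Return (T, V): the jump table and the LAS of the terminal n-gram."""
    K = len(pattern)
    T = [K - n + 1] * 256         # default jump for LAS values absent from the pattern
    for start in range(K - n):    # every n-gram except the terminal one
        end = start + n           # 1-based position of the n-gram's last symbol
        # Scanning left to right lets the rightmost occurrence overwrite earlier ones.
        T[ngram_las(pattern[start:end])] = K - end
    V = ngram_las(pattern[K - n:])
    return T, V
```

For the pattern "Dauphine" with n = 2 this gives a default jump of 7 and offsets 1, 3 and 5 for the 2-grams "in", "ph" and "au".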
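ASCII Example

[Figure: jump table T for the pattern "Dauphine" with 2-grams. V = ne’’. T(in’’) = 1, T(ph’’) = 3, T(au’’) = 5; the other entries shown, at indexes 0, 1 and 255, hold 7.]

Notation: xy’’ = LAS(xy)
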

the n gram search processing
Calculate LAS of the current n-gram in the string

Start with the n-gram SK-n+1,K

Continue depending on jump calculus

Attempt to match V

If .true then calculate LAS of the entire current possibly matching substring

of length K and ending with the current n-gram

If .true, then resolve the possible collision

Either attempt to match all the K symbols

Or match enough of terminal n-grams or symbols to decrease the probability of collision to a very small value

The n-gram Search Processing
The n-gram Search Processing
  • Otherwise
    • Go to T using the LAS of the n-gram
    • Jump by the number of symbols found in T
    • Update the “current” position of the n-gram to attempt the match
  • Re-attempt the match as above
    • Unless the n-gram to attempt is beyond the end of the string
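
The processing steps can be put together in a compact sketch. It is an illustration under stated assumptions, not the authors' implementation: GF(256), α = 2, polynomial 0x11D, a list-based jump table, collision resolution by comparing all K decoded symbols, and a jump taken from T after every attempt, including attempts that matched V. All names are hypothetical.

```python
POLY = 0x11D                      # assumed reduction polynomial
ANTILOG, LOG = [0] * 255, [255] * 256
x = 1
for e in range(255):
    ANTILOG[e], LOG[x] = x, e
    x <<= 1
    if x & 0x100:
        x ^= POLY


def cas_encode(s: bytes) -> list:
    out, acc = [], 0
    for i, p in enumerate(s, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % 255]   # p_i * alpha^i
        out.append(acc)
    return out


def las(cas: list, k: int, l: int) -> int:
    """LAS of symbols k..l (1-based), computed from CAS symbols only."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return 255 if d == 0 else (LOG[d] - (k - 1)) % 255


def symbol_at(cas: list, i: int) -> int:
    """Decode the i-th (1-based) original symbol from the CAS."""
    d = cas[i - 1] ^ (cas[i - 2] if i > 1 else 0)
    return 0 if d == 0 else ANTILOG[(LOG[d] - i) % 255]


def ngram_search(record_cas: list, pattern: bytes, n: int = 2) -> list:
    """Return the 0-based start of every occurrence of pattern in the encoded record."""
    K, L = len(pattern), len(record_cas)
    pc = cas_encode(pattern)
    T = [K - n + 1] * 256                        # jump table, default jump
    for end in range(n, K):                      # every n-gram but the terminal one
        T[las(pc, end - n + 1, end)] = K - end
    V, whole = las(pc, K - n + 1, K), las(pc, 1, K)
    hits, end = [], K                            # end = 1-based end of current n-gram
    while end <= L:
        g = las(record_cas, end - n + 1, end)
        if g == V and las(record_cas, end - K + 1, end) == whole:
            # Resolve a possible collision by matching all K symbols.
            if all(symbol_at(record_cas, end - K + 1 + j) == pattern[j]
                   for j in range(K)):
                hits.append(end - K)
        end += T[g]                              # every entry of T is at least 1
    return hits
```

Because T stores the rightmost offset for each LAS value, a jump never skips a window in which the current n-gram could line up with an n-gram of the pattern, so overlapping occurrences are found as well.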
ASCII Example Again

[Figure: search trace over an ASCII string. 2-grams => 5 jumps; 1-gram => 6 jumps.]

DNA Example Again

[Figure: search traces over a DNA string. The figure labels read: 3 jumps, 4 jumps, 4 jumps, 11 jumps.]

n-grams / BM
  • Average shifts with n-grams can typically be longer
  • Calculating an attempt and a jump may be more expensive as well
    • About twice as long, as a first approximation
    • The precise analysis remains to be done
  • Rule of thumb: if shifts are more than 2 times longer, n-grams with n > 1 should be faster than BM.
Experimental Results
  • Searching large data of:
    • DNA
    • Typical ASCII
    • XML Documents
  • Patterns of 6 to 500 symbols (bytes)
  • 1.8 GHz P3 and 2.4 GHz dual-core AMD Turion 64 processors
Results Compared to BM
  • DNA
    • Up to 72 times faster
  • Typical ASCII
    • Up to about 11 times faster
  • XML Documents
    • Up to more than 5 times faster
  • Search is faster for longer patterns
    • Average shifts are longer
Related Work
  • Implemented in SDDS-2006
  • Applies best to
    • longer patterns
      • where many jumps occur
    • alphabets much smaller than the size of GF used
  • Instead of shifts of size m on average, one reaches almost $\min(K, 2^f)$ per shift
    • up to almost 256 for DNA or ASCII with GF (256)
    • up to almost 64K for DNA or Unicode with GF (64K)
      • instead of 4 or 25 respectively
    • For Boyer-Moore especially
Related Work
  • In SDDS-2006 & P2P or Grid systems in general
  • Wish to hide what is searched for?
    • Use the signature-only search
      • Usually slower, since only linear
Conclusion
  • A new pattern matching algorithm
  • Uses algebraic signatures
  • Preprocesses both the pattern and the string
  • Appears particularly efficient
    • For databases
    • For longer patterns
  • Possibly faster in this context than any other algorithm known now
  • But all these are only preliminary results
Future Work
  • Performance Analysis
    • Theoretical
      • Jump Length
        • Median, Average…
    • Experimental
      • Actual text
        • Non uniform symbol distribution
      • DNA
        • Actual DNA strings
Future Work
  • Variants
    • Jump Table
    • Partial signatures of n-grams
      • Symbol $p_i$ encodes the signature of the n-gram $p_{i-n+1} \dots p_i$
        • No more XORing & division to find this signature
        • Faster unsuccessful attempt to match
    • Approximate match
      • Tolerating match errors
        • E.g., at most 1 symbol

Thank You for Your Attention

witold.litwin@dauphine.fr