Pattern Matching Using n-grams With Algebraic Signatures



Witold Litwin [1], Riad Mokadem [1], Philippe Rigaux [1] & Thomas Schwarz [2]

[1] Université Paris Dauphine
[2] Santa Clara University

n-gram Search

- New pattern matching idea
- Matches algebraic signatures
- Preprocesses both: pattern & string (record)
- String preprocessing is a new idea
- To the best of our knowledge
- Provides incidental protection of stored data
- Important for P2P & grid systems
- Fast processing
- Especially useful for DBs & longer patterns
- ASCII, Unicode, DNA…
- Should then often be faster than Boyer-Moore
- Possibly the fastest known in this context

Algebraic Signature

- Symbols of the alphabet are elements of a Galois Field
- GF (256) usually
- We choose one primitive element α there
- Usually α = 2
- The algebraic signature of the string of i symbols $p_1 \dots p_i$ is the sum:

$$p'_i = p_1\alpha + \dots + p_i\alpha^i$$

- Here the addition and the multiplication are the operations in GF.

Algebraic Signature

- In our GF($2^f$) where f = 8, 16:

$$p + q = p - q = p \text{ XOR } q$$

- One method for multiplying is:

$$p \cdot q = \operatorname{antilog}\big((\log p + \log q) \bmod 255\big)$$

- The division is then:

$$p / q = \operatorname{antilog}\big((\log p - \log q) \bmod 255\big)$$

- The log and antilog are encoded in log and antilog tables with $2^f$ elements each.
- Entry 0 is for element 0 of the GF and is by convention set to $2^f - 1$.
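The table-based arithmetic can be sketched as follows for GF(256). This is a minimal illustration, not the authors' implementation: the primitive polynomial `0x11D` and the names `build_tables`, `gf_mul`, `gf_div` and `algebraic_signature` are assumptions, and the signature is assumed to weight symbol p_j by α^j with α = 2.

```python
PRIM_POLY = 0x11D  # assumed primitive polynomial; alpha = 2 generates GF(256) under it
ORDER = 255        # 2**8 - 1 nonzero field elements

def build_tables():
    """Build the log and antilog tables for alpha = 2."""
    log = [ORDER] * 256      # log[0] = 255 by the convention for element 0
    antilog = [0] * ORDER
    x = 1
    for e in range(ORDER):
        antilog[e] = x       # alpha**e
        log[x] = e
        x <<= 1              # multiply by alpha
        if x & 0x100:
            x ^= PRIM_POLY   # reduce modulo the polynomial
    return log, antilog

LOG, ANTILOG = build_tables()

def gf_mul(p, q):
    """p * q via the tables; 0 is handled separately since it has no true log."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % ORDER]

def gf_div(p, q):
    """p / q via the tables."""
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(256)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % ORDER]

def algebraic_signature(symbols):
    """XOR-sum of p_j * alpha**j for j = 1..len(symbols)."""
    sig = 0
    for j, p in enumerate(symbols, start=1):
        sig ^= gf_mul(p, ANTILOG[j % ORDER])
    return sig
```

Addition and subtraction need no helper: both are the XOR operator `^` on the byte values.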

Cumulative Algebraic Signature

- We encode every symbol $p_i$ in a string into the signature of the prefix $p_1 \dots p_i$
- The value of a CAS symbol now also encodes the knowledge of the values of all the previous ones
- Matching a single symbol means prefix matching
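A small sketch of CAS encoding and decoding may make the prefix property concrete. It assumes GF(256) with α = 2, the primitive polynomial `0x11D`, and a signature that weights symbol p_i by α^i; the function names are hypothetical.

```python
def build_tables(poly=0x11D):
    """Log/antilog tables for GF(256); poly is an assumed primitive polynomial."""
    log, antilog = [255] * 256, [0] * 255   # log[0] = 255 by convention
    x = 1
    for e in range(255):
        antilog[e], log[x] = x, e
        x <<= 1
        if x & 0x100:
            x ^= poly
    return log, antilog

LOG, ANTILOG = build_tables()

def cas_encode(symbols):
    """Replace each symbol p_i by the signature of the prefix p_1..p_i."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % 255]   # add p_i * alpha**i
        cas.append(acc)
    return cas

def cas_decode(cas):
    """Recover the symbols: p_i = (p'_i XOR p'_(i-1)) / alpha**i."""
    symbols, prev = [], 0
    for i, c in enumerate(cas, start=1):
        term = c ^ prev                          # equals p_i * alpha**i
        symbols.append(ANTILOG[(LOG[term] - i) % 255] if term else 0)
        prev = c
    return symbols
```

With this encoding, checking whether a record starts with a given prefix reduces to comparing one CAS symbol: the i-th CAS symbol of the record against the last CAS symbol of the i-symbol prefix. As with any signature, equal values can in principle also arise from a collision.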

Application of CASs

- Protection against involuntary data disclosure
- On P2P & Grid Servers especially
- Numerous CAS encoded string matching algorithms
- Prefix match with O (1) complexity
- Pattern match by signature only
- Karp-Rabin-like, linear O(L) complexity
- Longest common string search
- Longest common prefix search
- …

CAS Properties

- O(K) encoding and decoding speed
- For encoding, for instance:

$$p'_i = p'_{i-1} + p_i\alpha^i = \mathrm{CAS}(p_{i-1}) + p_i\alpha^i$$

- Fast n-gram signature calculus
- For $S_{k,l} = p_k \dots p_l$ with k > 1 and l - k + 1 = n:

$$AS(S_{k,l}) = AS(S_{l-k+1}) = (p'_l \text{ XOR } p'_{k-1}) / \alpha^{k-1}$$

- Logarithmic Algebraic Signature (LAS)

$$LAS(S_{k,l}) = \log AS(S_{k,l}) = \big(\log(p'_l \text{ XOR } p'_{k-1}) - (k-1)\big) \bmod (2^f - 1)$$
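The n-gram signature calculus can be checked with a short sketch: the LAS of any substring is obtained from two CAS symbols, one XOR, one table lookup and one subtraction. The code assumes GF(256), α = 2, the primitive polynomial `0x11D`, and symbol p_j weighted by α^j; all names are hypothetical.

```python
def build_tables(poly=0x11D):
    """Log/antilog tables for GF(256); poly is an assumed primitive polynomial."""
    log, antilog = [255] * 256, [0] * 255   # log[0] = 255 by convention
    x = 1
    for e in range(255):
        antilog[e], log[x] = x, e
        x <<= 1
        if x & 0x100:
            x ^= poly
    return log, antilog

LOG, ANTILOG = build_tables()

def signature(symbols):
    """Direct algebraic signature: XOR of p_j * alpha**j for j = 1..len."""
    sig = 0
    for j, p in enumerate(symbols, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + j) % 255]
    return sig

def cas_encode(symbols):
    """cas[i-1] holds the signature of the prefix p_1..p_i."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % 255]
        cas.append(acc)
    return cas

def las_from_cas(cas, k, l):
    """LAS of p_k..p_l (1-based, inclusive), using only CAS symbols."""
    x = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    # Dividing by alpha**(k-1) is a subtraction of k-1 in the log domain.
    return 255 if x == 0 else (LOG[x] - (k - 1)) % 255

record = b"GATTACAGATTACA"
cas = cas_encode(record)
# The same 7-gram at two different positions yields the same LAS.
same = las_from_cas(cas, 1, 7) == las_from_cas(cas, 8, 14)
```

Working in the log domain avoids the antilog lookup that a full division would need, which is presumably why the LAS rather than the AS is compared during the search.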

The n-gram Search: Key ideas

- Design a sublinear pattern match search
- With speed about L / K
- Apply to CAS encoded DB
- New idea for string search algorithm with preprocessing
- Justified for a DB
- Store once, search many times

The n-gram Search: Key ideas

- Preprocess the pattern to create a jump table
- As in Boyer-Moore
- Use n-grams with n > 1 to increase the discriminative power of an attempt
- Comparison of a sample from the pattern
- a single symbol for BM
- an LAS of an n-gram for a CAS-encoded string

The n-gram Search: Key ideas

- If the alphabet uses m symbols, the probability that a symbol matches is 1/m
- Assuming all symbols equally likely
- For usual ASCII pattern matching m = 20-25
- For DNA m = 4
- A single symbol may often match without the whole pattern matching
- e.g., 1/4 of the time for DNA on average
- Leading to small jumps,
- by m symbols on average

The n-gram Search: Key ideas

- The probability of an n-gram matching may be:

$$\min\left(1/2^f,\ 1/m^n\right)$$

- In our examples it can reach 1/256
- More discriminative sampling
- Longer jumps
- By almost K or 256 symbols in general
- Useful for longer strings
- DNA, text, images…

The n-gram Search Preprocessing

- Encode every record (string) into its CAS
- Done for incidental protection anyhow for SDDS-2006
- Encode the terminal n-gram of the searched pattern $S_K$ into its LAS in variable V
- Fill up the jump table T for every other n-gram in $S_K$
- calculate every LAS
- for each LAS, store in T its rightmost offset with respect to the end of SK

The n-gram Search Jump Table

- For GF(256), for every n-gram $S_{i,i+n-1}$ in the pattern, with $j = LAS(S_{i,i+n-1})$:
- T(j) = the offset
- T(j) = K - n + 1 otherwise
- Reminder: LAS(0) = 255
- T can also be a hash table
- See the paper
- Slower to use but possibly more memory efficient
- Probably more useful for a larger GF
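The preprocessing of the pattern can be sketched as follows, using a plain 256-entry array for T. This is an illustration under assumptions, not the authors' code: GF(256) with α = 2 and primitive polynomial `0x11D`, symbol p_j weighted by α^j, and the hypothetical names `ngram_las` and `build_jump_table`. The offset of an n-gram is taken as the distance from its last symbol to the last symbol of the pattern.

```python
def build_tables(poly=0x11D):
    """Log/antilog tables for GF(256); poly is an assumed primitive polynomial."""
    log, antilog = [255] * 256, [0] * 255   # log[0] = 255 by convention
    x = 1
    for e in range(255):
        antilog[e], log[x] = x, e
        x <<= 1
        if x & 0x100:
            x ^= poly
    return log, antilog

LOG, ANTILOG = build_tables()

def ngram_las(symbols):
    """LAS of a short string taken on its own (positions 1..n)."""
    sig = 0
    for j, p in enumerate(symbols, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + j) % 255]
    return LOG[sig]              # LOG[0] = 255 covers a zero signature

def build_jump_table(pattern, n):
    """Return (V, T): the LAS of the terminal n-gram and the jump table."""
    K = len(pattern)
    table = [K - n + 1] * 256    # default jump for an LAS absent from the pattern
    # Scan left to right so that the rightmost occurrence of an LAS wins.
    for start in range(K - n):   # every n-gram except the terminal one
        offset = K - n - start   # distance from the n-gram end to the pattern end
        table[ngram_las(pattern[start:start + n])] = offset
    return ngram_las(pattern[K - n:]), table

V, T = build_jump_table(b"ACGTAC", 2)
```

For the pattern `ACGTAC` with n = 2, the non-terminal 2-grams are `AC`, `CG`, `GT` and `TA`, with offsets 4, 3, 2 and 1; every LAS not produced by one of them keeps the default jump K - n + 1 = 5.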

The n-gram Search Processing

- Calculate the LAS of the current n-gram in the string
- Start with the n-gram $S_{K-n+1,K}$
- Continue depending on the jump calculus
- Attempt to match V
- If it matches, calculate the LAS of the entire current, possibly matching, substring of length K ending with the current n-gram
- If that matches too, resolve the possible collision
- Either attempt to match all the K symbols
- Or match enough terminal n-grams or symbols to decrease the probability of collision to a very small value

The n-gram Search Processing

- Otherwise
- Go to T using the LAS of the n-gram
- Jump by the number of symbols found in T
- Update the "current" position for the n-gram to attempt the match
- Re-attempt the match as above
- Unless the n-gram to attempt is beyond the end of the string
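The processing steps can be put together in a compact sketch that scans a CAS-encoded record. It is one possible reading of the slides rather than the authors' implementation: it assumes GF(256) with α = 2 and primitive polynomial `0x11D`, symbol p_i weighted by α^i, a Horspool-style jump by T after every attempt (successful or not), and collision resolution by decoding and comparing all K symbols. All function names are hypothetical.

```python
def build_tables(poly=0x11D):
    """Log/antilog tables for GF(256); poly is an assumed primitive polynomial."""
    log, antilog = [255] * 256, [0] * 255   # log[0] = 255 by convention
    x = 1
    for e in range(255):
        antilog[e], log[x] = x, e
        x <<= 1
        if x & 0x100:
            x ^= poly
    return log, antilog

LOG, ANTILOG = build_tables()

def cas_encode(symbols):
    """cas[i-1] holds the signature of the prefix p_1..p_i."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % 255]
        cas.append(acc)
    return cas

def las(cas, k, l):
    """LAS of p_k..p_l (1-based, inclusive) from CAS symbols."""
    x = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return 255 if x == 0 else (LOG[x] - (k - 1)) % 255

def decode_symbol(cas, i):
    """Original symbol p_i (1-based) recovered from the CAS."""
    x = cas[i - 1] ^ (cas[i - 2] if i > 1 else 0)
    return ANTILOG[(LOG[x] - i) % 255] if x else 0

def ngram_search(record_cas, pattern, n):
    """Return the 0-based start offsets of pattern in the encoded record.

    Requires n <= len(pattern).
    """
    K, L = len(pattern), len(record_cas)
    pat_cas = cas_encode(pattern)
    V = las(pat_cas, K - n + 1, K)        # LAS of the terminal n-gram
    whole = las(pat_cas, 1, K)            # LAS of the entire pattern
    T = [K - n + 1] * 256                 # jump table, default jump
    for k in range(1, K - n + 1):         # non-terminal n-grams, left to right
        T[las(pat_cas, k, k + n - 1)] = K - (k + n - 1)
    hits, end = [], K                     # end: 1-based end of the current n-gram
    while end <= L:
        cur = las(record_cas, end - n + 1, end)
        if cur == V and las(record_cas, end - K + 1, end) == whole:
            # Resolve a possible collision by comparing all K symbols.
            start = end - K
            if all(decode_symbol(record_cas, start + j + 1) == pattern[j]
                   for j in range(K)):
                hits.append(start)
        end += T[cur]                     # jump; T entries are always >= 1
    return hits

record = b"ACGTTGCAACGTACGTTGCA" * 3
hits = ngram_search(cas_encode(record), b"ACGTACGT", 2)
```

The jump is safe because T stores, for each LAS, the smallest offset at which an n-gram with that LAS occurs in the pattern, so no alignment that could still match is skipped.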

The n-gram Search Processing: n-grams / BM

- Average shifts with n-grams can typically be longer
- Calculating an attempt & jump may be more expensive as well
- About twice as long at first approach
- The precise analysis remains to be done
- Rule of thumb: If shifts are more than 2 times longer, n-grams with n > 1 should be faster than BM.

Experimental Results

- Searching large data of:
- DNA
- Typical ASCII
- XML Documents
- Patterns of 6 to 500 symbols (bytes)
- 1.8 GHz P3 and 2.4 GHz DualCore AMD Turion 64 Processors

Results Compared to BM

- DNA
- Up to 72 times faster
- Typical ASCII
- Up to about 11 times faster
- XML Documents
- Up to more than 5 times faster
- Search is faster for longer patterns
- Average shifts are longer

Related Work

- Implemented in SDDS-2006
- Applies best to
- longer patterns
- where many jumps occur
- alphabets much smaller than the size of GF used
- Instead of shifts of size m on average, one reaches almost min(K, $2^f$) per shift
- up to almost 256 for DNA or ASCII with GF (256)
- up to almost 64K for DNA or Unicode with GF (64K)
- instead of 4 or 25 respectively
- For Boyer-Moore especially

Related Work

- In SDDS 2006 & P2P or Grid System in general
- Wish to hide what is searched for?
- Use the signature-only search
- Usually slower since linear only

Conclusion

- A new pattern matching algorithm
- Uses algebraic signatures
- Preprocesses both the pattern and the string
- Appears particularly efficient
- For databases
- For longer patterns
- Possibly faster in this context than any other known algorithm
- But all these are only preliminary results

Future Work

- Performance Analysis
- Theoretical
- Jump Length
- Median, Average…
- Experimental
- Actual text
- Non uniform symbol distribution
- DNA
- Actual DNA strings

Future Work

- Variants
- Jump Table
- Partial Signatures of n-grams
- Symbol $p_i$ encodes the signature of the n-gram $p_{i-n+1} \dots p_i$
- No more XORing & Division to find this signature
- Faster unsuccessful attempt to match
- Approximate Match
- Tolerating match errors
- E.g., at most 1 symbol

witold.litwin@dauphine.fr
