Exact String Search. Lecture 7: September 22, 2005 Algorithms in Biosequence Analysis Nathan Edwards - Fall, 2005. Boyer-Moore. Method of choice for exact string search, for a single pattern Typically, examines fewer than m characters of the text (sublinear time)

Exact String Search

Lecture 7: September 22, 2005

Algorithms in Biosequence Analysis

Nathan Edwards - Fall, 2005

Boyer-Moore
  • Method of choice for exact string search, for a single pattern
    • Typically, examines fewer than m characters of the text (sublinear time)
    • Linear worst case running time
    • Conceptually very similar to K-M-P, but with a more complicated running time proof
    • Empirically, better for English text than for DNA sequence
Boyer-Moore
  • Three key ideas
    • Right to left scan
    • Bad character rule
    • (Strong) good suffix rule
  • The combination of these ideas can produce large pattern shifts.
  • Provable O(n+m) running time when the pattern is not in the text
    • Needs an extension to achieve linear running time when the pattern is in the text
Right to left scan / bad character rule

   0        1
   123456789012345678
T: xpbctbxabpqxctbpqz
P:   tpabxab
       *^^^^
P:     tpabxab
             *
P:            tpabxab

(* marks a mismatch, ^ marks a match.)

Bad character rule

Comparing right to left, mismatch at position i of P and position k of T:
  • If T(k) is absent from P, shift the left end of P to position k+1 of T.
  • If the right-most T(k) in P is to the left of i, shift the pattern to align the T(k) characters.
  • Otherwise, shift the pattern 1 position.
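
The three cases above collapse into a single table lookup. The following Python sketch is one possible way to write it; the function names are illustrative and not from the lecture, and positions are 1-based to match the slides.

```python
def rightmost_positions(pattern):
    """R(x): 1-based position of the right-most occurrence of x in the pattern."""
    # Later occurrences overwrite earlier ones, so each entry ends up right-most.
    return {ch: pos for pos, ch in enumerate(pattern, start=1)}


def bad_character_shift(pattern, i, text_char):
    """Shift after a mismatch at 1-based pattern position i against text_char."""
    r = rightmost_positions(pattern).get(text_char, 0)  # 0 if absent from P
    # Absent: shift i positions (left end of P moves to k+1).
    # Right-most occurrence left of i: shift i - r to align the characters.
    # Right-most occurrence right of i: i - r is negative, so shift 1.
    return max(1, i - r)


# Mismatch of P(3) = a against T(5) = t in the example text: shift by 2.
print(bad_character_shift("tpabxab", 3, "t"))
```

In a real search the table would be built once per pattern rather than on every call; it is rebuilt here only to keep the sketch short.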

Right to left scan / bad character rule

   0        1
   12345678901234567
T: xpbctbaabpqxctbpq
P:   tpabxab
         *^^

Extended bad character rule

Comparing right to left, mismatch at position i of P and position k of T:
  • If T(k) is absent from P[1…i-1], shift the left end of P to position k+1 of T.
  • For the right-most T(k) in P to the left of i, shift the pattern to align the T(k) characters.
  • Otherwise, shift the pattern 1 position.

Right to left scan / extended bad character rule

   0        1
   12345678901234567
T: xpbctbaabpqxctbpq
P:   tpabxab
         *^^
P:     tpabxab

(Extended) bad character rule
  • For all x in Σ, R(x) is the position of the right-most occurrence of x in P. R(x) is zero if x is absent from P.
  • Comparing right to left, mismatch at position i of P and position k of T: shift P right max[1, i – R(T(k))] positions.
  • For the extended bad character rule, look up R(x, i), the right-most occurrence of x in P to the left of position i, instead of R(x).
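
One way to support the R(x, i) lookup is to keep, for each character, the sorted list of positions where it occurs in P. The sketch below assumes that representation and uses a binary search; the helper names are hypothetical and positions are 1-based as on the slides.

```python
from bisect import bisect_left
from collections import defaultdict


def occurrence_lists(pattern):
    """Map each character to the ascending list of its 1-based positions in P."""
    occ = defaultdict(list)
    for pos, ch in enumerate(pattern, start=1):
        occ[ch].append(pos)
    return occ


def extended_R(occ, x, i):
    """R(x, i): right-most position of x strictly left of i, or 0 if none."""
    positions = occ.get(x, [])
    idx = bisect_left(positions, i)  # number of occurrences at positions < i
    return positions[idx - 1] if idx > 0 else 0


def extended_bad_character_shift(pattern, i, text_char):
    """Shift for a mismatch at 1-based position i; always at least 1."""
    occ = occurrence_lists(pattern)
    return i - extended_R(occ, text_char, i)


# P(5) = x mismatches T(7) = a; the right-most a left of position 5 is P(3).
print(extended_bad_character_shift("tpabxab", 5, "a"))
```

For this mismatch the simple rule can only shift by 1, because the right-most a in P sits at position 6, to the right of the mismatch.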
(Strong) good suffix rule

   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:         qcabdabdab

(Strong) good suffix rule

   0        1
   123456789012345678
T: prstabstudabvqxrst
P:     abdubdab
           *^^^
P:           abdabdab

(Strong) good suffix rule

Substring t of T matches a suffix of P:
  • Find the right-most copy t’ of t in P such that t’ is not a suffix of P and the character to the left of t’ in P ≠ the character to the left of t in P. Shift P to align t’ in P with t in T.
  • If no such t’ exists, shift P by the least amount so that a prefix of the shifted P matches a suffix of t in T (the longest prefix of P that is also a suffix of t).
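
To make the two cases concrete, here is a brute-force Python sketch that applies the rule directly from its definition, with no preprocessing. It is meant only to check small examples by hand; the function name is illustrative, and returning a shift of 1 when nothing has matched is an assumption taken from the usual textbook convention.

```python
def strong_good_suffix_shift(pattern, i):
    """Shift after a mismatch at 1-based position i, with P[i+1..n] matched."""
    n = len(pattern)
    t = pattern[i:]  # the matched suffix P[i+1..n]
    if not t:
        return 1  # mismatch on the first comparison: nothing to reuse
    # Case 1: right-most copy t' of t ending at 1-based position j < n whose
    # preceding character differs from P(i) (or that has no preceding character).
    for j in range(n - 1, len(t) - 1, -1):
        start = j - len(t)  # 0-based start index of the candidate copy
        if pattern[start:j] == t and (start == 0 or pattern[start - 1] != pattern[i - 1]):
            return n - j
    # Case 2: longest prefix of P that is also a (shorter) suffix of t.
    for length in range(len(t) - 1, 0, -1):
        if pattern[:length] == t[-length:]:
            return n - length
    return n  # nothing reusable: shift P past the matched region


# Slide example: P = qcabdabdab, mismatch at P(8), t = ab. The copy at
# positions 6-7 is preceded by d (same as P(8)), so the copy at 3-4 is used.
print(strong_good_suffix_shift("qcabdabdab", 8))
```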
(Strong) good suffix rule

Definitions:

L(i) – max j < n such that P[i…n] matches a suffix of P[1…j]; 0 if no such j.

L’(i) – max j < n such that P[i…n] matches a suffix of P[1…j] and the character before that suffix ≠ P(i-1); 0 if no such j.

L(i) and L’(i) give the weak and strong shifts, respectively, for the first part of the good suffix rule.


Computing L’(i)

Definition:

N_j(P) is the length of the longest suffix of P[1…j] that is also a suffix of P.

Compare with:

Z_i(S) is the length of the longest prefix of S[i…|S|] that is also a prefix of S.

Compute N_j(P) as Z_{n-j+1}(reverse(P)).
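
The reversal identity can be sketched in Python as follows. The Z computation is the standard linear-time Z algorithm; the function names and the convention of setting Z_1 to |S| (so that N_n(P) = n) are assumptions made for this illustration.

```python
def z_array(s):
    """z[i] (0-based) = length of the longest prefix of s[i:] that is a prefix of s."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])  # reuse what the Z-box tells us
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1  # extend by explicit comparison
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_array(pattern):
    """N[j] for 1-based j = 1..n (index 0 is unused padding)."""
    n = len(pattern)
    z = z_array(pattern[::-1])
    # N_j(P) = Z_{n-j+1}(reverse(P)); 1-based index n-j+1 is 0-based index n-j.
    return [0] + [z[n - j] for j in range(1, n + 1)]


N = n_array("cabdabdab")
print(N[3], N[6])  # suffixes "ab" and "abdab" are also suffixes of P
```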

Computing L’(i)
  • L’(i) – max j < n s.t. N_j(P) = |P[i…n]| = n – i + 1
(Strong) good suffix rule

Definition:

l’(i) – length of the longest prefix of P that is also a suffix of P[i…n]; 0 if no such prefix exists.

l’(i) – max j ≤ |P[i…n]| = n – i + 1 s.t. N_j(P) = j
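
Both tables can be filled in a single pass each once the N values are known. The sketch below assumes the characterisations above; to stay self-contained it computes N by direct comparison (quadratic in the worst case) rather than through the Z algorithm, and the function names are illustrative.

```python
def n_values(p):
    """N[j] for 1-based j, by direct comparison (index 0 unused)."""
    n = len(p)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and p[j - 1 - k] == p[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def good_suffix_tables(p):
    """Return (L', l') as 1-based lists; entries default to 0."""
    n = len(p)
    N = n_values(p)
    L = [0] * (n + 2)
    for j in range(1, n):  # ascending j, so the largest j wins
        if N[j] > 0:
            L[n - N[j] + 1] = j
    l = [0] * (n + 2)
    best = 0
    for i in range(n, 0, -1):  # allowed prefix length n - i + 1 grows as i falls
        j = n - i + 1
        if N[j] == j:  # P[1..j] is both a prefix and a suffix of P
            best = j
        l[i] = best
    return L, l


L, l = good_suffix_tables("qcabdabdab")
print(len("qcabdabdab") - L[9])  # shift for a mismatch at P(8)
```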

Boyer-Moore pseudo code

Compute L’(i), l’(i), and R(x) for x in Σ. (n = |P|, m = |T|)

k = n
while k ≤ m
    i = n, h = k
    while i > 0 and P(i) = T(h)
        i--; h--
    if i = 0
        report occurrence of P in T ending at position k
        k = k + n – l’(2)
    else
        if i = n, λ = n – 1    (nothing matched: good suffix shift of 1)
        else if L’(i+1) > 0, λ = L’(i+1), else λ = l’(i+1)
        k = k + max{ 1, i – R(T(h)), n – λ }
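
The pseudo code translates fairly directly into Python. The following is a sketch rather than a tuned implementation: it uses the simple bad character rule, pads both strings so that indices are 1-based like the slides, and computes N by direct comparison instead of the Z algorithm to keep the block short. The function name and the choice to return 0-based start offsets are assumptions.

```python
def boyer_moore(pattern, text):
    """Return the 0-based start offsets of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    P, T = " " + pattern, " " + text  # pad so P[1..n], T[1..m] are 1-based

    R = {P[pos]: pos for pos in range(1, n + 1)}  # right-most positions

    N = [0] * (n + 1)  # N[j]: longest suffix of P[1..j] that is a suffix of P
    for j in range(1, n + 1):
        length = 0
        while length < j and P[j - length] == P[n - length]:
            length += 1
        N[j] = length

    L = [0] * (n + 2)  # L'(i)
    for j in range(1, n):
        if N[j] > 0:
            L[n - N[j] + 1] = j
    l = [0] * (n + 2)  # l'(i)
    best = 0
    for i in range(n, 0, -1):
        if N[n - i + 1] == n - i + 1:
            best = n - i + 1
        l[i] = best

    occurrences = []
    k = n  # text position aligned with P(n)
    while k <= m:
        i, h = n, k
        while i > 0 and P[i] == T[h]:
            i -= 1
            h -= 1
        if i == 0:
            occurrences.append(k - n)
            k += n - l[2]
        else:
            if i == n:
                good = 1  # nothing matched
            elif L[i + 1] > 0:
                good = n - L[i + 1]
            else:
                good = n - l[i + 1]
            bad = i - R.get(T[h], 0)
            k += max(1, bad, good)
    return occurrences


print(boyer_moore("aba", "abababa"))
```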

Running time analysis
  • Notice that unlike K-M-P, we might re-compare text characters that matched in a previous iteration.
  • Worst instance does Θ(nm) total comparisons, but only if P is in T
  • If P is not in T, O(n+m) running time
    • complicated proof!
  • What goes wrong when P is in T?
Worst case instance, P in T

   0        1
   12345678901234567
T: aaaaaaaaaaaaaaaaa
P: aaaaaaa
   ^^^^^^^
P:  aaaaaaa
    ^^^^^^^

Galil’s Extension
  • Comparing right to left with position n of P aligned to position k of T, and matches down to character s of T: if position 1 of P shifts past s, then a prefix of the shifted P is already known to match T up to position k.
    • Skip these comparisons
  • Sufficient for a linear time bound, whether or not P is in T.
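
The effect on the worst case instance can be checked with a small experiment. The sketch below is a simplified variant, not the full algorithm from the lecture: it shifts by the strong good suffix rule only, computes those shifts by brute force from the definition, and counts character comparisons with and without the skip. All names are illustrative, and indices are 0-based here.

```python
def good_suffix_shifts(p):
    """shifts[i]: smallest safe shift after a mismatch at 0-based index i
    (i = -1 means the whole pattern matched)."""
    n = len(p)
    shifts = {}
    for i in range(-1, n):
        for s in range(1, n + 1):
            # The shifted copy must agree with the matched suffix p[i+1:] ...
            agrees = all(p[j - s] == p[j] for j in range(max(i + 1, s), n))
            # ... and (strong rule) differ at the mismatched position.
            differs = i < 0 or i - s < 0 or p[i - s] != p[i]
            if agrees and differs:
                shifts[i] = s
                break
    return shifts


def search(p, t, galil=True):
    """Return (0-based occurrence offsets, number of character comparisons)."""
    n, m = len(p), len(t)
    shifts = good_suffix_shifts(p)
    found, comparisons = [], 0
    k = n - 1    # text index aligned with the last pattern character
    known = -1   # text up to this index is known to match a prefix of p
    while k < m:
        i, h, mismatch = n - 1, k, False
        while i >= 0 and h > known:
            comparisons += 1
            if p[i] != t[h]:
                mismatch = True
                break
            i -= 1
            h -= 1
        if not mismatch:
            found.append(k - n + 1)
            s = shifts[-1]
            known = k if galil else -1
        else:
            s = shifts[i]
            # Skip next time only if the left end of P moves past the mismatch.
            known = k if galil and s > i else -1
        k += s
    return found, comparisons


print(search("aaaaaaa", "a" * 17, galil=False)[1])
print(search("aaaaaaa", "a" * 17, galil=True)[1])
```

On this instance the count without the skip is n(m – n + 1), while with the skip each alignment after the first costs a single comparison.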
Worst case instance, P in T

   0        1
   12345678901234567
T: aaaaaaaaaaaaaaaaa
P: aaaaaaa
   ^^^^^^^
P:  aaaaaaa
          ^

Galil’s Extension

   0        1
   123456789012345678
T: prstabstudabvqxrst
P:     abdubdab
           *^^^
P:           abdabdab

Lessons From B-M
  • Sub-linear time is possible
    • But we still need to read T from disk!
  • Bad cases require periodicity in P or T
    • matching random P with T is easy!
  • Large alphabets mean large shifts
  • Small alphabets make complicated shift data-structures possible
  • B-M is better for “English” and amino acids than for DNA.