Overlap Matching
By Itamar Nabriski
A. Amir, R. Cole, G. Landau, R. Hariharan, M. Lewenstein, E. Porat, Overlap Matching, Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (2001) 279-288.
Lecture Structure: Discrete Convolutions; Overlap Matching Problem Definition
Let T be a function whose domain is {0,…,n−1}
Let P be a function whose domain is {0,…,m−1}
(we’ll view them as arrays of numbers of length n and m respectively)
The convolution of T and P at index j is defined as follows:
(T × P)[j] = T[j]·P[0] + T[j+1]·P[1] + … + T[j+m−1]·P[m−1], for 0 ≤ j ≤ n−m
T = Ronaldinho (n = 10)
P = Deco (m = 4)
(assume, for now, each letter represents a number)
(T × P)[2] = n·D + a·e + l·c + d·o
Thus the naïve computation time for a single index is O(m)
Since the number of possible convolution indexes is O(n):
Naïve Approach:
For each j pay O(m) time, for a total of O(nm)
Devious Approach:
Using the “Fast Fourier Transform” (FFT), pay an amortized O(log m) time for each j, for a total of O(n log m)
(machine word size must be O(log m) though)
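The two approaches can be sketched in Python. This is only an illustration under stated assumptions: the names `naive_convolution` and `fft_convolution` are hypothetical, the arrays hold small integers, and the FFT version transforms the whole text at once, which costs O(n log n). Reaching O(n log m) additionally requires cutting T into overlapping blocks of length 2m, which is omitted here.

```python
import cmath


def naive_convolution(T, P):
    """(T x P)[j] = sum over i of T[j + i] * P[i], in O(nm) time."""
    n, m = len(T), len(P)
    return [sum(T[j + i] * P[i] for i in range(m)) for j in range(n - m + 1)]


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_convolution(T, P):
    """Same result as naive_convolution for integer arrays, via FFT."""
    n, m = len(T), len(P)
    size = 1
    while size < n + m:
        size *= 2
    a = fft([complex(x) for x in T] + [0j] * (size - n))
    # Reversing P turns polynomial multiplication into a sliding dot product.
    b = fft([complex(x) for x in reversed(P)] + [0j] * (size - m))
    product = fft([x * y for x, y in zip(a, b)], invert=True)
    # Coefficient (m - 1) + j of the product holds (T x P)[j];
    # the inverse transform above is unnormalized, so divide by size.
    return [round(product[m - 1 + j].real / size) for j in range(n - m + 1)]
```

Both functions return one value per alignment j in 0..n−m, so their outputs can be compared directly.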
Preprocessing
Before we perform convolutions on T and P we preprocess each letter using a constant number of constant-time functions (total O(n)), retaining the running time of O(n log m)
Using several Convolutions
For each index j we can perform a constant number of convolutions, retaining the running time of O(n log m)
Postprocessing
For each index j we can use a constant-time function f to format the output of the constant number of convolutions, retaining a running time of O(n log m)
Testing For Exact Matching at index j
∑ = {a,b}
T = ababbaaa
P = abba
Thus, for example, T and P exactly match at index j = 2 (T[2..5] = abba)
Preprocessing functions (x is a letter):
prep_a(x) = 1 if x = a, 0 otherwise
prep_¬a(x) = 1 if x ≠ a, 0 otherwise
prep_b(x) = 1 if x = b, 0 otherwise
prep_¬b(x) = 1 if x ≠ b, 0 otherwise
When we write prep(S), S being a string, we mean we apply prep to all characters of S
Convolutions we will use:
prep_a(T) × prep_¬a(P) and prep_b(T) × prep_¬b(P)
Postprocessing function f:
f(x, y) = 1 if x = y = 0, 0 otherwise
For every index j, f((prep_a(T) × prep_¬a(P))[j], (prep_b(T) × prep_¬b(P))[j]) = 1 iff there is an exact matching of P and T at index j.
Example:
P = abba
prep_¬a(P) = 0110
prep_¬b(P) = 1001
T = ababbaaa
prep_a(T) = 10100111
prep_b(T) = 01011000
j = 2
(prep_a(T) × prep_¬a(P))[2] = 0*1+1*0+1*0+0*1 = 0
(prep_b(T) × prep_¬b(P))[2] = 1*0+0*1+0*1+1*0 = 0
f(0,0) = 1 = exact match at 2
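The exact matching test can be sketched in Python as follows. This is a hedged illustration: `convolve` and `exact_matches` are hypothetical names, the convolution is computed naively rather than by FFT, and the preprocessing is assumed to mark each letter in the text and everything except that letter in the pattern, which is what the 0/1 strings in the example correspond to.

```python
def convolve(T, P):
    """(T x P)[j] = sum over i of T[j + i] * P[i], computed naively."""
    n, m = len(T), len(P)
    return [sum(T[j + i] * P[i] for i in range(m)) for j in range(n - m + 1)]


def exact_matches(text, pattern, alphabet="ab"):
    """Return a 0/1 list with 1 at every index j where pattern occurs in text."""
    n, m = len(text), len(pattern)
    mismatch_counts = []
    for letter in alphabet:
        # Preprocessing: 1 where the text has the letter,
        # 1 where the pattern does NOT have the letter.
        text_is = [1 if x == letter else 0 for x in text]
        pattern_is_not = [1 if x != letter else 0 for x in pattern]
        # Each convolution counts mismatches caused by this letter.
        mismatch_counts.append(convolve(text_is, pattern_is_not))
    # Postprocessing f: 1 iff every convolution is zero at j.
    return [
        1 if all(counts[j] == 0 for counts in mismatch_counts) else 0
        for j in range(n - m + 1)
    ]


print(exact_matches("ababbaaa", "abba"))  # [0, 0, 1, 0, 0]
```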
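Input: a text T of length n and a pattern P of length m, each divided into marked segments
Output: every index j at which all overlaps between segments of T and segments of P (with P aligned at j) have even length
Thus, the actual characters are not important
Exempli Gratia
T = Franz_Beckenbauer
P = The_Kaiser
j = 3
Overlaps (j = 3 is not a valid overlap match because it has an odd overlap):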
Each segment can start at either an even or odd index and end at either an even or odd index
We will produce from T four new strings Too, Tee, Toe and Teo
Too will have 1’s in the place of all characters belonging to segments that start and end at an odd index and 0’s otherwise. For example, for a single segment spanning indexes 3 to 7:
Too = 0001111100
And analogously for the other segment types …
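A small Python sketch of this split. The representation is an assumption made for illustration: segments are given as inclusive (start, end) pairs, `split_by_parity` is a hypothetical name, and the `offset` parameter shifts the indexes before taking parities so the same helper can also classify pattern segments relative to an alignment.

```python
def split_by_parity(length, segments, offset=0):
    """Build the four 0/1 arrays 'oo', 'ee', 'oe', 'eo' from (start, end) pairs.

    The first letter is the parity of the segment start, the second the
    parity of the segment end; offset is added to both before testing.
    """
    names = "eo"  # names[0] is used for even, names[1] for odd
    arrays = {key: [0] * length for key in ("oo", "ee", "oe", "eo")}
    for start, end in segments:
        key = names[(start + offset) % 2] + names[(end + offset) % 2]
        for i in range(start, end + 1):
            arrays[key][i] = 1
    return arrays


text_arrays = split_by_parity(10, [(3, 7)])
print("".join(map(str, text_arrays["oo"])))  # 0001111100
```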
Since the pattern P tends to move around we will need to treat its segment indexes a bit differently
We will produce from P eight new strings POoo, POee, POoe, POeo, PEoo, PEee, PEoe and PEeo
The big ‘O’ in POee means: all segments in P that start and end at an even location relative to T’s index, when P is aligned to an odd index of T
(don’t worry, there is an example in the next slide …)
And analogously for the other segment types …
Since P is always aligned to T at some index j, we treat P’s indexes relative to T.
Assume j is now odd; then for that location we will use the four PO’s: POoo, POee, POeo and POoe.
[Figure: P with its segments marked, and the resulting POoo, POee, POeo and POoe strings]
Thus for every location j we have 16 (a constant number of) possible Text-Pattern pairings:
{Too,Tee,Toe,Teo} × {PXoo,PXee,PXoe,PXeo}
X = parity(j)
If we can determine, using convolutions, for each pairing whether it contains only even overlaps, we can solve the Overlap Matching problem in O(n log m) time
Case 1 occurs when for Tab, Pcd either a=c or b=d
This covers 12 of the 16 cases.
We now show a solution for when a=c. This covers 8 cases; we use the solution on the reversed* strings of T and P (thus ‘a’ becomes ‘c’ and ‘b’ becomes ‘d’) to solve the 4 remaining cases.
* Computing the reversed strings does not alter the run time (do it during general preprocessing)
For every two marked segments St in Tab starting at index x and Sp in Pcd starting at index y:
x − y is always even
(since even − even = even and odd − odd = even)
We now create a convolution that will return 0 for index j iff there is no odd overlap at j
For every segment in Tab we replace the 1’s by an alternating series of 1’s and −1’s beginning with 1.
In case where we have only even (and/or no) overlaps:
= 1 − 1 = 0
In case where we have at least one odd overlap:
= 1 − 1 + 1 = 1 > 0
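The Case 1 test can be sketched in Python. This is an illustration under assumptions: segments are inclusive (start, end) pairs, the names `alternating_marks`, `ones_marks` and `odd_overlap_signal` are hypothetical, and the value is only meaningful at alignments j where the text and pattern segment starts have the same parity (the a=c situation).

```python
def alternating_marks(length, segments):
    """Replace each segment by 1, -1, 1, -1, ... beginning with 1."""
    arr = [0] * length
    for start, end in segments:
        for i in range(start, end + 1):
            arr[i] = 1 if (i - start) % 2 == 0 else -1
    return arr


def ones_marks(length, segments):
    """Plain 0/1 marking of the segments."""
    arr = [0] * length
    for start, end in segments:
        for i in range(start, end + 1):
            arr[i] = 1
    return arr


def odd_overlap_signal(n, text_segments, m, pattern_segments, j):
    """Convolution value at j: 0 iff every overlap at j has even length."""
    T = alternating_marks(n, text_segments)
    P = ones_marks(m, pattern_segments)
    return sum(T[j + i] * P[i] for i in range(m))


# Text segments start at odd indexes; the pattern segment starts at
# pattern index 0, so it starts at an odd text index when j is odd.
text_segments = [(1, 4), (7, 9)]
print([odd_overlap_signal(12, text_segments, 3, [(0, 2)], j) for j in (1, 3, 5, 7, 9)])
# [1, 0, 1, 1, 1]
```

Each overlap contributes an alternating sum that starts with +1, so an even overlap cancels to 0 and an odd overlap leaves exactly 1.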
Case 2 occurs when Toe, Peo (Teo, Poe is symmetric)
If a segment in Toe is contained in a segment in Peo or vice versa then the overlap is even, otherwise the overlap is odd.
Containment Elimination Property
The convolution at index j gives zero if all overlaps are containments, otherwise it gives a positive result.
To achieve this we will actually use 3 convolutions; a combination of their outputs will give us the desired answer.
Fleshing Out The Solution
For each segment St in Toe that starts at index st, replace the segment’s 1’s by st, 1, …, 1, −st
For each segment Sp in Peo that starts at index sp, replace the segment’s 1’s by sp, 1, …, 1, −sp
Containment:
Sp contained in St: sp + (len(Sp) − 2) − sp = len(Sp) − 2
St contained in Sp: st + (len(St) − 2) − st = len(St) − 2
No Containment (overlap of length k):
St starts inside Sp: st + (k − 2) − sp
Sp starts inside St: sp + (k − 2) − st
Problem 1
The indexes of the pattern Peo change for each index j, raising the preprocessing time to O(m) for each convolution!
Problem 2
We need to find a way to remove “the size of the overlap − 2” from the resulting convolution.
                   Convolution value        After removing “overlap − 2”
Containment        len(Sp) − 2              0
Containment        len(St) − 2              0
No Containment     sp + (k − 2) − st        sp − st > 0
No Containment     st + (k − 2) − sp        st − sp > 0
Solving Problem 2
Perform another convolution, the “Overlap Length Convolution”, and subtract its value from the main convolution.
Every segment in both Toe and Peo is replaced by 0,1,1,…,1,0, giving us “size of overlap − 2” for each overlap.
Overlap of length 4:
= 0+1+1+0 = 2 = “overlap − 2”
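The main convolution minus the Overlap Length Convolution can be sketched in Python as below. Several things here are assumptions made for illustration: segments are inclusive (start, end) pairs, endpoint markers have the form s, 1, …, 1, −s (positive at the segment start, negative at its end), every non-containment overlap has length at least 2 so that two markers never coincide, and the pattern is re-marked with text-relative start indexes for each j, which is exactly the O(m) cost that Problem 1 complains about. All function names are hypothetical.

```python
def endpoint_marks(length, segments, shift=0):
    """Replace each segment by s, 1, ..., 1, -s where s = start + shift."""
    arr = [0] * length
    for start, end in segments:
        for i in range(start, end + 1):
            arr[i] = 1
        arr[start] = start + shift
        arr[end] = -(start + shift)
    return arr


def interior_marks(length, segments):
    """Replace each segment by 0, 1, ..., 1, 0."""
    arr = [0] * length
    for start, end in segments:
        for i in range(start + 1, end):
            arr[i] = 1
    return arr


def dot_at(T, P, j):
    """Single convolution value (T x P)[j]."""
    return sum(T[j + i] * P[i] for i in range(len(P)))


def containment_signal(n, text_segments, m, pattern_segments, j):
    """0 if all overlaps at j are containments, positive otherwise."""
    main = dot_at(
        endpoint_marks(n, text_segments),
        endpoint_marks(m, pattern_segments, shift=j),  # text-relative indexes
        j,
    )
    overlap_length = dot_at(
        interior_marks(n, text_segments),
        interior_marks(m, pattern_segments),
        j,
    )
    return main - overlap_length


# Text segment 3..8 (odd start, even end); pattern segment 1..6 in pattern
# indexes, which has an even start and odd end relative to T when j is odd.
print(containment_signal(16, [(3, 8)], 8, [(1, 6)], 1))   # 1 (odd overlap 3..7)
print(containment_signal(16, [(3, 6)], 8, [(1, 6)], 1))   # 0 (containment)
```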
Solving Problem 1
The trouble is with the pattern Peo segments, whose indexes change at each index j. Instead, index the pattern segments relative to Peo itself, that is, by P’s own indexes. (“Zero Containment Convolution”)
[Figure: T and P segments labeled with text-relative and pattern-relative start indexes]
We created a new problem: overlap convolutions can be negative, and thus the overall convolution at index j can turn out to be zero when there is an odd overlap.
For example, with pattern-relative start index 2, text start index 7 and an overlap of length 3:
= 2 + 1 − 7 = −4
We want to get the benefits of both worlds. Towards that end we’ll add to the result a third convolution, “The Shifting Convolution”. This simply corrects the problem caused by using the pattern indexes.
Every segment in T is replaced by −1,0,…,0,1 and every segment in P is replaced by 0,1,…,1,0, and the result is multiplied by index j.
Example: the Shifting Convolution gives 2 and j = 2, so 2 * j = 2 * 2 = 4
This replenishes our “losses”
Thus, the convolution gives 0 for each containment overlap and ±1 for each non-containment overlap.
[Figure: a non-containment overlap of T and P, giving 1, and a containment, giving 0]
Thus multiplying by j we return “one j” to each non-containment overlap
Final Algorithm
Thus we implement the “Containment Elimination Property” by:
Zero Containment Convolution + Shifting Convolution − Overlap Length Convolution = Containment Elimination Property
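A Python sketch of the three-convolution combination, with the pattern marked once using its own indexes. The sign conventions are assumptions chosen so that the identity works out: endpoint markers are s, 1, …, 1, −s, the shifting markers on T are −1, 0, …, 0, 1, and non-containment overlaps have length at least 2. Segments are inclusive (start, end) pairs and all names are hypothetical.

```python
def dot_at(T, P, j):
    """Single convolution value (T x P)[j]."""
    return sum(T[j + i] * P[i] for i in range(len(P)))


def endpoints(length, segments):
    """s, 1, ..., 1, -s with s the segment's own start index."""
    arr = [0] * length
    for start, end in segments:
        for i in range(start, end + 1):
            arr[i] = 1
        arr[start], arr[end] = start, -start
    return arr


def interior(length, segments):
    """0, 1, ..., 1, 0 for each segment."""
    arr = [0] * length
    for start, end in segments:
        for i in range(start + 1, end):
            arr[i] = 1
    return arr


def signed_ends(length, segments):
    """-1, 0, ..., 0, 1 for each segment (assumed sign convention)."""
    arr = [0] * length
    for start, end in segments:
        arr[start], arr[end] = -1, 1
    return arr


def containment_elimination(n, text_segments, m, pattern_segments, j):
    """0 if all overlaps at j are containments, positive otherwise."""
    zero_containment = dot_at(endpoints(n, text_segments),
                              endpoints(m, pattern_segments), j)
    shifting = dot_at(signed_ends(n, text_segments),
                      interior(m, pattern_segments), j)
    overlap_length = dot_at(interior(n, text_segments),
                            interior(m, pattern_segments), j)
    return zero_containment + j * shifting - overlap_length


# Text segment 3..8, pattern segment 1..6 (pattern indexes), odd alignments.
print([containment_elimination(16, [(3, 8)], 8, [(1, 6)], j) for j in (1, 3, 5)])
# [1, 1, 3]
```

None of the six marked arrays depends on j, so in a full implementation each of the three convolutions could be computed once for all alignments by FFT.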
Amazing! He’s a master of the “Shifting Convolution”
Very Powerful Technique!
Case 3 occurs when Too, Pee (Tee, Poo is symmetric)
If a segment in Too is contained in a segment in Pee or vice versa then the overlap is odd, otherwise the overlap is even.
Containment:
Sp contained in St: sp + (len(Sp) − 2) − sp = len(Sp) − 2
St contained in Sp: st + (len(St) − 2) − st = len(St) − 2
No Containment (overlap of length k):
St starts inside Sp: st + (k − 2) − sp
Sp starts inside St: sp + (k − 2) − st
We’ll use the same convolution as in Case 2 and two additional ones:
Conv1: Replace every segment in Too of length len by 0,1,2,…,len−1. Replace Pee segments by 1,0,…,0.
Conv2: (Opposite of Conv1) Replace every segment in Pee of length len by 0,1,2,…,len−1. Replace Too segments by 1,0,…,0.
The first convolution gives us the length of all areas like the one marked in green (the part of St that precedes the start of Sp):
= 3
It tells us, for every two overlapping segments, by how much St is “ahead” of Sp
If, for some overlap, the first convolution is positive the second will be zero, and vice versa.
= 0
This is true also for containments:
= 3
The convolution from Case 2 gives the same value for non-containments and zero for containments.
Thus:
Conv1 + Conv2 − ConvCase2 = positive
= containments = odd overlap
Conv1 + Conv2 − ConvCase2 = 0
= no containments = only even overlaps
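The Case 3 combination can be sketched in Python. As before this is an illustration under assumptions: segments are inclusive (start, end) pairs of length at least 2, the Case 2 value is computed with endpoint markers s, 1, …, 1, −s and shifting markers −1, 0, …, 0, 1 on T, and every function name is hypothetical.

```python
def dot_at(T, P, j):
    """Single convolution value (T x P)[j]."""
    return sum(T[j + i] * P[i] for i in range(len(P)))


def fill(length, segments, value_at):
    """Mark each segment position i with value_at(i, start, end)."""
    arr = [0] * length
    for start, end in segments:
        for i in range(start, end + 1):
            arr[i] = value_at(i, start, end)
    return arr


def endpoints(length, segs):   # s, 1, ..., 1, -s
    return fill(length, segs, lambda i, s, e: s if i == s else -s if i == e else 1)


def interior(length, segs):    # 0, 1, ..., 1, 0
    return fill(length, segs, lambda i, s, e: 0 if i in (s, e) else 1)


def signed_ends(length, segs): # -1, 0, ..., 0, 1
    return fill(length, segs, lambda i, s, e: -1 if i == s else 1 if i == e else 0)


def ramp(length, segs):        # 0, 1, 2, ..., len - 1
    return fill(length, segs, lambda i, s, e: i - s)


def starts(length, segs):      # 1, 0, ..., 0
    return fill(length, segs, lambda i, s, e: 1 if i == s else 0)


def case2_value(n, t_segs, m, p_segs, j):
    return (dot_at(endpoints(n, t_segs), endpoints(m, p_segs), j)
            + j * dot_at(signed_ends(n, t_segs), interior(m, p_segs), j)
            - dot_at(interior(n, t_segs), interior(m, p_segs), j))


def case3_value(n, t_segs, m, p_segs, j):
    """Positive iff some overlap at j is a containment (an odd overlap)."""
    conv1 = dot_at(ramp(n, t_segs), starts(m, p_segs), j)
    conv2 = dot_at(starts(n, t_segs), ramp(m, p_segs), j)
    return conv1 + conv2 - case2_value(n, t_segs, m, p_segs, j)


# Text segment 3..9 (odd start and end); pattern segment 1..5 in pattern
# indexes, which has even start and end relative to T when j is odd.
print([case3_value(20, [(3, 9)], 9, [(1, 5)], j) for j in (1, 3, 5)])
# [0, 1, 0]
```

At j = 3 the pattern segment lies inside the text segment, which is the only alignment of the three with an odd overlap.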
Algorithm Final Outcome
Each Case (1,2,3) takes O(n log m):
1. A constant number of preprocessing functions: O(n)
2. A constant number of convolutions: O(n log m)
3. A constant-time computable function: O(1) for each index
for a total runtime of O(n log m)
Formal Definition
Let S = s1,…,sn be a string over alphabet ∑
A swap permutation for S is a permutation
π : {1,…,n} → {1,…,n} such that:
1. if π(i) = j then π(j) = i (characters are swapped)
2. for all i, π(i) ∈ {i−1, i, i+1} (only adjacent characters are swapped)
3. if π(i) ≠ i then s_π(i) ≠ s_i (identical characters are not swapped)
Lemma (will not be proven):
A solution to swap matching over alphabet {a,b} of time O(f(n,m)) implies a solution of time O(log|∑| · f(n,m)) over alphabet ∑.
And there exists an algorithm to do so.
A. Amir, Y. Aumann, G. Landau, M. Lewenstein, N. Lewenstein, Pattern matching with swaps, J. Algorithms 37 (2) (2000) 247-266.
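For reference, swap matching at a single index can be checked directly in O(m) time, which makes a handy baseline when testing the convolution-based method. The sketch below assumes the usual swap rules (only adjacent, distinct characters are exchanged, and each character takes part in at most one swap); `swap_matches_at` is a hypothetical name.

```python
def swap_matches_at(text, pattern, j):
    """True iff some swapped version of pattern equals text[j : j + m]."""
    m = len(pattern)
    if j < 0 or j + m > len(text):
        return False
    i = 0
    while i < m:
        if pattern[i] == text[j + i]:
            # A matching character never needs to be swapped: swapping it
            # with a different neighbor would create a mismatch here.
            i += 1
        elif (
            i + 1 < m
            and pattern[i] != pattern[i + 1]
            and pattern[i] == text[j + i + 1]
            and pattern[i + 1] == text[j + i]
        ):
            # The only way to repair a mismatch is to swap with the next one.
            i += 2
        else:
            return False
    return True


print([swap_matches_at("ababbaaa", "abba", j) for j in range(5)])
# [True, False, True, False, False]
```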
Maximal Alternating Segment (MAS)
Proof → (by contradiction):
Assume P is aligned to T at index j, we can’t swap match, and the two MAS A, B do not exist:
w.l.o.g. overlapA = (ab)*, overlapB = (ba)*
we can swap within the overlap boundaries and get the desired result: contradiction
Thus, there must be MAS A, B that have a misaligned odd overlap
Proof ← (by contradiction):
Assume there exist MAS A, B that misalign in an odd overlap, and P and T swap match at index j:
w.l.o.g. overlapA = (ab)*a, overlapB = (ba)*b
Then we must swap with letters outside of the overlap, but by definition of MAS this will not help and we can’t swap match. Contradiction.
Algorithm
[Figure: T with its maximal alternating segments marked, and the derived strings Tevena and Todda]
We provide a similar construction for P:
Pevena, Podda, using P’s index!
When matching, if the index j of T is odd we will use one for the other (Pevena becomes Podda and vice versa)
For example, a Pevena segment spanning P’s indexes 0 to 4, when aligned at T’s index 3, spans T’s indexes 3 to 7 and becomes Podda.
If index j is even, T swap matches P iff Tevena overlap matches Podda at j and Todda overlap matches Pevena at j.
If index j is odd, T swap matches P iff Tevena overlap matches Pevena at j and Todda overlap matches Podda at j.
Algorithm: Why does it work?
An even-a MAS and an odd-a MAS will never exactly match.
By the lemma, if their overlap is odd then swap matching is not possible, and this is exactly what we examine using the Overlap Matching method
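The decomposition into even-a and odd-a segments can be sketched in Python. This assumes a string over {a, b}, segments reported as inclusive (start, end) pairs, and that a run of length 1 also counts as a maximal alternating segment; the function names and the `shift` parameter (used to re-index P relative to T) are hypothetical.

```python
def maximal_alternating_segments(s):
    """Split s into maximal runs in which adjacent characters differ."""
    segments = []
    start = 0
    for i in range(1, len(s) + 1):
        # A run ends at the end of the string or where a letter repeats.
        if i == len(s) or s[i] == s[i - 1]:
            segments.append((start, i - 1))
            start = i
    return segments


def split_even_odd_a(s, shift=0):
    """Classify each segment by the parity of the indexes holding 'a'."""
    even_a, odd_a = [], []
    for start, end in maximal_alternating_segments(s):
        # Index where an 'a' sits (or would sit) in this alternating run.
        a_index = start if s[start] == "a" else start + 1
        if (a_index + shift) % 2 == 0:
            even_a.append((start, end))
        else:
            odd_a.append((start, end))
    return even_a, odd_a


print(split_even_odd_a("ababbaaa"))
# ([(0, 3), (6, 6)], [(4, 5), (7, 7)])
```

Passing `shift=j` when splitting P gives the classification relative to T, so for odd j the two lists trade places.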
Thus, for an alphabet ∑ we can swap match in O(n log m log|∑|) time
Improvement over the previous deterministic upper bound of O(n m^(1/3) log m log|∑|)
A. Amir, Y. Aumann, G. Landau, M. Lewenstein, N. Lewenstein, Pattern matching with swaps, J. Algorithms 37 (2) (2000) 247-266.