Vakhitov Alexander Approximate Text Indexing.

Using simple mathematical arguments the matching probabilities in the suffix tree are bound and by a clever division of the search pattern sub-linear time is achieved by “A Hybrid Method for Approximate String Matching” G. Navarro, R. Baeza-Yates Vakhitov AlexanderApproximate Text Indexing.

The task is to find substrings from the long text T, approximately matching to our pattern P. For example, we have text T='adbc' and P='abc' (s-starting position of a substring)adbc=a+d+b+c – insertion of 'd' (s=1)dbc=(a->d) +b+c – replacement 'a' with 'd' (s=2)bc=(a)+b+c – deletion of 'a' (s=3) The Task

Errors There are 3 kinds of transformations, which make errors in initial string: insertion, replacement and deletion. If we transform S with such chnges to S', then we can transform S' to S with the same number of changes. The minimal number of deletions, insertions and replacements, needed to transform string A to string B is called edit distance between A and B (ed(A,B)). Example: ed('abc','adbc')=ed('dbc','abc')=ed('bc','abc')=1;ed('survey','surgery')=2 replace 'v' with 'g' insert 'r' surgery surgey survey

The resulting algorithm The algorithm solves the approximate string matching problem in O(nlog n) time (n is the size of text T, (0,1)), if the error level , where  is the size of the alphabet, e=2.718.., =k/m, k is the number of errors, m is the size of the pattern P.

Plan of the report • Some useful ideas & basic algorithms • The main algorithm • Analysis of the complexity of the algorithm in different cases

Dividing the pattern Lemma There are strings A and B, ed(A,B)k, and we divide A into j substrings (Ai). Then at least one of the (Ai) appear in B with at most ⌊k/j⌋ errors. We need k changes to transform A to B. Each change transforms one of the Ai, so k changes are distributed between j substrings => the average number of changes is k/j. Example: ed('he likes','they like')=3=k;A1='he ',A2='likes' => j=2;ed('he ','they ')=2; ed('likes','like')=2=⌊k/j⌋

Computing edit distance There are strings x,y. ed(x,y)=? x=x1x2...xm,y=y1y2..yn, xp,yqΣ Cij=ed(x1..xi,y1..yj) C0..|x|,0..|y| is a matrix, filled with Cij. ComputingCij: C0,j=j; Ci,0=i; Ci,j = if (xi=yj) then Ci-1,j-1 else 1+min{Ci-1,j-1,Ci-1,j,Ci,j-1} Example: x='survey',y='surgery' ... u r v ey ... u r g ery 1 2 0 1 0 1 1 1 2 green means xi<>yj red means xi=yj arrow shows the element used to sompute Cij

Edit distance in the case of text T and pattern P We need to find a substring in text T which matches P with minimal number of errors. Let x be the pattern, and y will be a text. The matching text substring can begin in every text position – so, we have to initialize C0,j with 0. The rest is left from the previous task. The algorithm can store only the last column and analyze the text incrementally from the beginning. It goes left and down through the matrix, filling it with Cij.

Examples of the matrices

Construction of the NFA Nondeterministic Finite Automaton, which is searching the text substrings, approximately matching to the pattern with k errors. It consists of k rows and m columns. Transitions: Pattern and text characters are the same (horizontal) Insert characters into pattern (vertical) Replace the pattern character with the text one (solid diagonal) Delete the pattern character (dashed diagonal)

Nondeterministic Finite Automaton This automaton is for approximate string matching for the pattern 'survey' with 2 errors

Depth-first search DEF “k-neighborhood” is the set of strings that match P with at most k errors: Uk(P)={x * : ed(x,P)k} Searching this strings in the text (without errors) can solve the problem, but |Uk(P)|=O(mkk) is quite large. We can determine which strings form Uk(P) appear in the text by traversing the text suffixtree. Here we can use the Ukt(P) set. Ukt(P) is a set of neighborhood elements which are not prefixes of others.

Algorithm for searching on the suffix tree • Starts from the root • Considers the string x incrementally • Determines when ed(P,x)k • Determines when ed(P,xy)>k for any y

Algorithm for searching on the suffix tree • Each new character of x corresponds to a new column in the matrix (adding s to x <=> updating column in O(m) time). • A match is detected when the last element of the column is  k • x cannot be extended to match P when all the values of the last column are > k

Algorithm for searching on the suffix tree (illustration)

Partitioning the pattern The cost of the suffix tree search is exponential in m and k, so it's better to perform j searches of patterns of length m/j and k/j errors – that's why we divide patterns. So, we divide our pattern into j pieces and search them using the above algorithm. Then, for each match found ending at text position i we check the text area [i-m-k..i+m+k] But the larger j, the more text positions need to be verified, and the optimal j will be found soon.

Searching pieces of the pattern Let's use NFA with depth-first search (DFS) technique (the suffixes from the suffix tree will be the input of the automaton) At first, we'll transform our NFA • Initial self-loop isn't needed (it allowed us earlier to start matching from every position of the text); • We remove the low-left triangle of our automaton, because we avoid initial insertions to the pattern • We can start matching only with k+1 first pattern characters

The changes to NFA

Using suffix array instead of suffix tree The suffix array can replace the suffix tree in our algorithm. It has less space requirements, but the time complexity should be multiplied by log n. Suffix array replaces nodes with intervals and traversing to the node is going to the interval. If there is a node and it's children, then the node interval contains children intervals.

Analysis for the algorithm: the average number of nodes at level l • For a small l, all the text suffixes (except the last l) are longer than l, so nearly n suffixes reach level l; • The maximum number of nodes in the level l is l, where=||; • We use the model of n balls randomly thrown into lurns. The average number of filled urns is l(1-(1-1/l)n)=l(1-e-(n/l))=(min{n,l})

Probability of processing a given node at depth l in the suffix tree. If lm', at least l-ktext characters must match the pattern (m’ is the pattern size), and if lm', at least m'-kpattern characters must match the text. We sum all the probabilities for different pattern prefixes: The largest term of the 1st sum is the first one:and by using Stirling's approximation we have:

Probability of processing a given node at depth l in the suffix tree. ..which is: = ()lO(1/l), where , =k/l The whole first summation is bounded by l-k times the last term, so we get (l-k)()lO(1/l)=O(()l).The first summation exponentially decreases if ()<1. It means that: >e2/(1- )2 (because e-1< /(1- )if [0,1])

Probability of processing a given node at depth l in the suffix tree. ..or, equivalently, The second summation can be also bounded by this O(()l). So the upper bound for the probability of processing a givennode at depth l in the suffix tree is O(()l). In practice, e should be replaced by c=1.09 (it was defined experimentally), because we have only founded the upper bound of the probability.

Analysis of the single pattern search in the suffix tree Using the formulas bounding the probability of matching, let's consider that in levels l: all the nodes are visited, while nodes at level l>L(k) are visited with probability O((k/l)l). Remember that the average number of visited nodes at the level l (for small l) is(min{n,l}).

Three cases of analysis

The cases of analysis (a) L(k) logn, n L(k) “small n” online search preferable, no index needed (since the total work is n); (b) m+k < logn, n > m+k“large n” the total cost independent on n;(=k/l) (c) L(k) logn m+k “intermediate n”,sublinear of n time.

Analysis of pattern partitioning We need to perform j searches and then verify all the possible matches. We also determine three cases according to previous slide: (a) j logn, n L(k/j), complexity O(n) (b) m+k < j logn, n > (m+k)/j , if error level the complexity is O(n1-log) - sublinear of n (using j=(m+k) / logn) (c) with the same as in (b) error level, using the same j, we also get sublinear complexity.

Other types of algorithms • Limited depth-first search technique determines viable prefixes (the prefixes of the possible pattern matches) and searches for them in the suffix tree (it is expensive and it cannot be implemented on the suffix array) • Filtering discard large parts of the text checking for a necessary condition (simpler than the matching condition). Most existing filters are based on finding substrings of the pattern without errors, and with big error level they can't work.

Summary & Conclusions • The splitting technique balances between traversing too many nodes of the suffix tree and verifying too many text positions • The resulting index has sublinear retrieval time (O(n)), 0<<1) if the error level is moderate • In future there can appear more exact algorithms to determine the correct number of pieces in which the pattern is divided and there are (and may appear in future) some better algorithms for verifying after matching a piece of pattern.

Vakhitov Alexander Approximate Text Indexing.