On Minimizing Pattern Splitting in Multi-track String Matching

Kjell Lemström and Veli Mäkinen, Department of Computer Science, University of Helsinki.

Presentation Transcript


  1. On Minimizing Pattern Splitting in Multi-track String Matching. Kjell Lemström and Veli Mäkinen, Department of Computer Science, University of Helsinki

  2. Minimum splitting problem • We study the following problem. Given a pattern string $P$ and $K$ parallel text strings $T^k$, $1 \le k \le K$, find the smallest integer $\kappa > 0$ such that $P$ can be split into $\kappa$ pieces $P = P_1 \cdots P_\kappa$, where each $P_i$ has an occurrence in some text track and these partial occurrences retain the order. [Figure: pattern $P$ and text tracks $T^1$, $T^2$, $T^3$]

  3. Motivation • Music information retrieval. • Text tracks represent different instruments. • Finding split pattern occurrences allows the query melody to jump between instruments. • Useful in query-by-humming applications, where the pattern is monophonic and the music in the database is polyphonic.

  5. Minimum splitting problem... • We study different versions of the problem: - The gap between the occurrences of two consecutive pattern pieces is limited by $\alpha$. - The length of each piece must be $\ge \gamma$. - Transposition-invariant occurrences: there is an occurrence if the pattern is found with a constant $c$ added to each character.

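  6. A splitting with $\kappa = 4$ and transposition $c = 2$: $P_4 = 4\ 7\ 8\ 5$, $P_4 + c = (4+c)\ (7+c)\ (8+c)\ (5+c) = 6\ 9\ 10\ 7$. [Figure: pattern $P$ split into pieces $P_1, P_2, P_3, P_4$ ($\kappa = 4$) matched on tracks $T^1$, $T^2$, $T^3$, with a gap $\alpha$ and a piece length $\gamma$ marked]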

  7. Parallel texts assumption • To represent the different tracks as parallel strings, we need to add empty characters to align the tracks. • Therefore it makes more sense to consider splittings where the jumps over empty characters are not counted. Example, with "-" as the empty character:
           4-6--7---3--9
     T = ...-5--784--2-8-...
           3-3-453-8--8-
     P = 464538289

  8. Related work • All related work assumes that the texts are parallel. • The exact search ($\alpha = 0$), when the number of splits is not minimized, can be cast as a subset matching problem. - Running time $O(Kn \log^2(Kn))$ can be achieved using an algorithm of Cole and Hariharan, 2002. - $O((Kn + mn) \lceil |\Sigma| / w \rceil)$ can be achieved using bit-parallelism, see Iliopoulos and Kurokawa, 2002.

  9. Related work... • Lemström and Tarhio, 2003, have developed an efficient filter and a checking algorithm for the transposition-invariant version of the exact search problem on multi-track texts.

  10. Summary of results • Let $M = \{(i,j,k) \mid p_i = t^k_j\}$ be the set of matching character pairs, where $1 \le i \le m$, $1 \le j \le n$, and $1 \le k \le K$. For simplicity, let us assume that the alphabet is $\Sigma = \{1, 2, \dots, Kn+m\}$, and $m, K < n$. • The minimum splitting problem with $\alpha > 0$, and with or without the parallel text assumption, can be solved in $O(m + Kn + |M|)$ time. - Corollary: the transposition-invariant splitting problem can be solved in $O(mKn)$ time.

  11. Summary of results... • Let $(i,j,k), (i+1,j+1,k), \dots, (i+\ell-1, j+\ell-1, k)$ be a maximal sequence of points in $M$, i.e. a maximal (diagonal) line segment of $M$. Let $\mathcal{S}$ be the set of all maximal line segments of $M$. • The minimum splitting problem can be solved in $O(m^2 + Kn + |\mathcal{S}| \log n)$ time. • The minimum splitting problem with $\alpha > 0$ can be solved in $O(m^2 + Kn + |R| \log n)$ time, where $|R| \le \min(|\mathcal{S}|^2, |M|)$.

  12. Summary of results... • The minimum splitting problem with $\alpha = 0$ can be solved in $O(m^2 + \kappa Kn)$ time, where $\kappa$ is a given threshold.

  13. $O(|M|)$ algorithm • The idea is to compute an $m \times n \times K$ matrix sparsely, so that each computed cell $d_{i,j,k}$ stores the minimum splitting needed between $P_{1 \dots i}$ and the text tracks up to $t^k_j$. The recurrence itself is missing from this transcript.
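
A possible reconstruction of the missing recurrence, offered only as an assumption: it is inferred from the description of its two lines on slide 15 (a diagonal continuation within the same track, and a jump from the minimum on the previous pattern row), for parallel texts and without a gap limit. The strict condition $j' < j$ is part of that assumption.

```latex
d_{i,j,k} = \min
\begin{cases}
  d_{i-1,j-1,k} & \text{if } (i-1,j-1,k) \in M, \qquad (1)\\[2pt]
  1 + \min \{\, d_{i-1,j',k'} \mid j' < j,\ (i-1,j',k') \in M \,\} & \qquad (2)
\end{cases}
\qquad \text{for } (i,j,k) \in M,\ i > 1 .
```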

  15. $O(|M|)$ algorithm... • Initializing $d_{1,j,k} = 0$ for each $(1,j,k) \in M$, we have that $\kappa = 1 + \min\{d_{m,j,k} \mid (m,j,k) \in M\}$. • It is easy to construct $M$ so that diagonally consecutive elements are linked to enable constant-time evaluation of line (1) of the recurrence. • Evaluating $M$ column by column, rows from bottom to top, we can maintain the minimum value at each row to enable constant-time evaluation of line (2) of the recurrence.
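
A minimal Python sketch of this sparse evaluation, under stated assumptions: equal-length parallel tracks, no gap limit, no skipping of empty characters, and a recurrence with the two cases described above (continue diagonally in the same track, or pay one split and jump from the best cell on the previous pattern row in an earlier column). The name `min_splitting` and the dictionary bookkeeping are illustrative, not the authors' implementation. Row minima are updated only after a whole column is finished, which replaces the bottom-to-top row order.

```python
from collections import defaultdict

INF = float("inf")


def min_splitting(pattern, tracks):
    """Minimum number of pieces kappa, or None if the pattern cannot be covered.

    Assumes equal-length parallel tracks and no gap limit between pieces.
    """
    m, n = len(pattern), len(tracks[0])
    rows_of = defaultdict(list)      # character -> pattern rows i (1-based)
    for i, ch in enumerate(pattern, start=1):
        rows_of[ch].append(i)

    row_min = [INF] * (m + 1)        # row_min[i]: min d[i, j', k'] over earlier columns j'
    prev_col = {}                    # (i, k) -> d value in the previous column
    best = INF
    for j in range(n):
        cur_col = {}
        for k, track in enumerate(tracks):
            for i in rows_of.get(track[j], ()):
                if i == 1:
                    d = 0            # a first piece can start anywhere
                else:
                    diagonal = prev_col.get((i - 1, k), INF)   # case (1)
                    jump = 1 + row_min[i - 1]                  # case (2)
                    d = min(diagonal, jump)
                if d < INF:
                    cur_col[(i, k)] = d
        # Publish this column's values only now, so jumps always come from
        # strictly earlier columns.
        for (i, _k), d in cur_col.items():
            row_min[i] = min(row_min[i], d)
            if i == m:
                best = min(best, d)
        prev_col = cur_col
    return None if best == INF else best + 1
```

For example, `min_splitting([1, 2, 3, 4, 5, 6], [[1, 2, 3, 9, 9, 9], [9, 9, 9, 4, 5, 6]])` covers the pattern with one piece per track. Each match is touched a constant number of times, so the work beyond reading the input is proportional to $|M|$ (expected, because of hashing).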

  16. $O(|M|)$ algorithm... • To solve the case $\alpha > 0$, we use a technique from Crochemore et al., 2002: keep sliding window minima at each row during column-by-column evaluation. • Min-deques (Gajewska and Tarjan, 1986) support constant-time access to the minimum value in a list as well as insertion at the tail and deletion from the head of the list.
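
The sliding-window minimum can be sketched in Python with a monotonic queue. This is a simpler stand-in that supports only the three operations named above (insert at the tail, delete from the head, read the minimum), not the full structure of Gajewska and Tarjan; the class and function names are illustrative.

```python
from collections import deque


class MinDeque:
    """Queue with amortized O(1) push, pop and minimum."""

    def __init__(self):
        self._items = deque()   # all elements, in queue order
        self._mins = deque()    # non-decreasing candidates for the minimum

    def push(self, value):
        """Insert at the tail."""
        self._items.append(value)
        # Larger candidates can never be the minimum again.
        while self._mins and self._mins[-1] > value:
            self._mins.pop()
        self._mins.append(value)

    def pop(self):
        """Delete from the head."""
        value = self._items.popleft()
        if value == self._mins[0]:
            self._mins.popleft()
        return value

    def minimum(self):
        return self._mins[0]

    def __len__(self):
        return len(self._items)


def sliding_minima(values, width):
    """Minimum of every full window of the given width, left to right."""
    window = MinDeque()
    result = []
    for value in values:
        window.push(value)
        if len(window) > width:
            window.pop()
        if len(window) == width:
            result.append(window.minimum())
    return result
```

In the splitting algorithm, one such window per pattern row would hold the $d$ values of the columns still within gap $\alpha$ of the current column.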

  17. $O(|M|)$ algorithm... • Each step of the algorithm takes constant amortized time. Thus the overall running time is $O(|M|)$.

  18. Transposition-invariance • In Navarro et al., 2003, the following connection between sparse dynamic programming and transposition-invariance was given. • Lemma: Let $d(P,T)$ be a distance between strings $P$ and $T$ such that its value is determined by the set $M = \{(i,j) \mid p_i = t_j\}$. If an algorithm computes $d(P,T)$ in $O(|M| f(m,n))$ time, then the transposition-invariant distance can be computed in $O(mn\, f(m,n))$ time.

  19. Transposition-invariance... • In our problem, the relevant match sets for transposition-invariant computation are the non-empty $M_c = \{(i,j,k) \mid p_i + c = t^k_j\}$ for $c \in [-\infty, \infty]$. • We can construct them all in $O(mKn)$ time with pointers between diagonally consecutive elements in each set. • For each set we need $O(|M_c|)$ time computation, which is $O(mKn)$ overall.
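
A short Python sketch of the match-set construction, assuming integer-valued characters (for example pitch numbers). Every pair of a pattern character and a text character belongs to exactly one set, the one with $c = t^k_j - p_i$, so the sets together hold $mKn$ elements. The function names and the dictionary representation of the diagonal pointers are illustrative choices.

```python
from collections import defaultdict


def transposition_match_sets(pattern, tracks):
    """Map each transposition c to its match set M_c of 1-based (i, j, k) triples."""
    sets = defaultdict(list)
    for k, track in enumerate(tracks, start=1):
        for j, t in enumerate(track, start=1):
            for i, p in enumerate(pattern, start=1):
                sets[t - p].append((i, j, k))   # p + c == t  <=>  c == t - p
    return dict(sets)


def diagonal_links(match_set):
    """Link each match to its diagonal predecessor (i-1, j-1, k), or None."""
    present = set(match_set)
    links = {}
    for (i, j, k) in match_set:
        pred = (i - 1, j - 1, k)
        links[(i, j, k)] = pred if pred in present else None
    return links
```

With the values from slide 6, `transposition_match_sets([4, 7, 8, 5], [[6, 9, 10, 7]])[2]` is the full diagonal, since the track is the pattern transposed by $c = 2$.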

  20. Line segment algorithms • We will now show how to solve the minimum splitting problem doing computation only at the endpoints of the line segments of $\mathcal{S}$. • After that, the construction of $\mathcal{S}$ is given, and the solution to the case $\alpha = 0$. • In the sequel, we assume a single-track text for simplicity.

  21. Interpretation as a minimum jump distance [figure]

  22. Interpretation as a minimum jump distance • We denote the two endpoints of a line segment $S$ by $\mathrm{start}(S), \mathrm{end}(S) \in M$. • Let the minimum jump distance $d((i,j))$ to a point $(i,j)$ on a segment of $\mathcal{S}$ be the number of horizontal jumps (from $(i'-1,j'') \in S''$ to $(i',j') \in S'$, $j' < j$, $S'', S' \in \mathcal{S}$) needed for traversing through line segments of $\mathcal{S}$ from row 1 to $(i,j)$. • Then $d_{i,j} = 1 + d((i,j))$, where $d_{i,j}$ denotes the minimum splitting up to $(i,j)$.

  23. Interpretation as a minimum jump distance... • Lemma: The minimum jump distance $d(\mathrm{end}(S))$ equals $d(\mathrm{start}(S))$. Let us denote this value $d(S)$. • Idea of the algorithm: Traverse the endpoints of the line segments row by row. Keep the active segments (those intersecting the previous row) in a balanced binary search tree with the diagonal numbers as the keys. Maintain subtree minima of $d(S)$ values to answer range minimum queries $[-\infty, j-i)$.

  24. Interpretation as a minimum jump distance... • The required operations on the binary search tree can be supported in $O(\log n)$ time. • Thus, the algorithm works in $O(|\mathcal{S}| \log n)$ time.
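
The row-by-row traversal can be sketched in Python for a single track without a gap limit. Two simplifications are assumptions of this sketch rather than part of the slides: an array-based segment tree indexed by diagonal number stands in for the balanced binary search tree (it offers the same logarithmic update and range-minimum operations), and the segments are extracted naively instead of with the construction described later.

```python
from collections import defaultdict

INF = float("inf")


def maximal_segments(pattern, text):
    """Naively list maximal diagonal match runs as 1-based (row, col, length)."""
    m, n = len(pattern), len(text)
    segments = []
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pattern[i - 1] != text[j - 1]:
                continue
            if i > 1 and j > 1 and pattern[i - 2] == text[j - 2]:
                continue                      # not the start of a maximal run
            length = 0
            while (i + length <= m and j + length <= n
                   and pattern[i + length - 1] == text[j + length - 1]):
                length += 1
            segments.append((i, j, length))
    return segments


def min_splitting_segments(pattern, text):
    """Minimum number of pieces, computed only at segment endpoints."""
    m, n = len(pattern), len(text)
    size = m + n                              # diagonal j - i + m lies in 1 .. m+n-1
    tree = [INF] * (2 * size)

    def update(pos, value):
        pos += size
        tree[pos] = value
        pos //= 2
        while pos >= 1:
            tree[pos] = min(tree[2 * pos], tree[2 * pos + 1])
            pos //= 2

    def query(lo, hi):                        # minimum over diagonals in [lo, hi)
        result = INF
        lo += size
        hi += size
        while lo < hi:
            if lo & 1:
                result = min(result, tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                result = min(result, tree[hi])
            lo //= 2
            hi //= 2
        return result

    starts, ends = defaultdict(list), defaultdict(list)
    for (i, j, length) in maximal_segments(pattern, text):
        starts[i].append(j - i + m)
        ends[i + length - 1].append(j - i + m)

    for i in range(1, m + 1):
        # The tree currently holds d(S) for segments intersecting row i - 1.
        new = [(diag, 0 if i == 1 else 1 + query(0, diag)) for diag in starts[i]]
        for diag in ends[i - 1]:
            update(diag, INF)                 # segment no longer active
        for diag, d in new:
            update(diag, d)
    best = min((tree[size + diag] for diag in ends[m]), default=INF)
    return None if best == INF else best + 1
```

A query over diagonals strictly below $j - i$ corresponds to jumping from a strictly earlier column on the previous row, because an active segment on diagonal $j' - (i-1)$ sits in column $j'$ of row $i-1$.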

  25. Minimum splitting with $\alpha > 0$ • One can prove that it is enough to recompute the values of line segments only in their intersections with the so-called $\alpha$-greedy paths. • With some care in the implementation, one gets the time bound $O(m^2 + Kn + |R| \log n)$, where $|R| \le \min(|\mathcal{S}|^2, |M|)$.

  26. Minimum splitting with $\alpha > 0$... [figure]

  27. Constructing $\mathcal{S}$ • We will give a more general algorithm that constructs the set $\mathcal{S}_\gamma$, i.e., the set of maximal line segments of length at least $\gamma$. • Let $\mathrm{Prefix}(A,B)$ denote the length of the longest common prefix of strings $A$ and $B$. • Let $\mathrm{MaxPrefix}(j)$ be $\max\{\mathrm{Prefix}(P_{i \dots m}, T_{j \dots n}) \mid 1 \le i \le m\}$ and $H(j)$ some index $i$ giving the maximum. Example: for $A$ = aaabbbb and $B$ = aaabcbb the longest common prefix is aaab, so $\mathrm{Prefix}(A,B) = 4$.

  28. Constructing $\mathcal{S}$... • Let $\mathrm{Jump}(i,j)$ denote $\mathrm{Prefix}(P_{i \dots m}, T_{j \dots n})$. • Lemma (Ukkonen and Wood, 1993): $\mathrm{Jump}(i,j) = \min(\mathrm{MaxPrefix}(j), \mathrm{Prefix}(P_{i \dots m}, P_{H(j) \dots m}))$. • Ukkonen and Wood show how to allow constant-time access to any $\mathrm{Jump}(i,j)$ value after $O(m^2 + n)$ time preprocessing. • Observation: If we manage to call $\mathrm{Jump}(i,j)$ only at points $(i,j) = \mathrm{start}(S)$, $S \in \mathcal{S}_\gamma$, we have an $O(|\mathcal{S}_\gamma|)$ construction algorithm for $\mathcal{S}_\gamma$.
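
The lemma can be tried out with a small Python sketch. The pattern-against-pattern prefix table is filled in $O(m^2)$ time; $\mathrm{MaxPrefix}(j)$ and $H(j)$ are computed naively here, which is an assumption of the sketch and not the $O(m^2 + n)$ preprocessing of Ukkonen and Wood. The names `prefix` and `build_jump` are illustrative.

```python
def prefix(a, b):
    """Length of the longest common prefix of sequences a and b."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def build_jump(pattern, text):
    """Return jump(i, j) = Prefix(P[i..m], T[j..n]) for 1-based i and j."""
    m, n = len(pattern), len(text)

    # lcp[i][h] = Prefix(P[i..m], P[h..m]), filled from the end of the pattern.
    lcp = [[0] * (m + 2) for _ in range(m + 2)]
    for i in range(m, 0, -1):
        for h in range(m, 0, -1):
            if pattern[i - 1] == pattern[h - 1]:
                lcp[i][h] = lcp[i + 1][h + 1] + 1

    # MaxPrefix(j) and one maximizing index H(j), computed naively.
    max_prefix = [0] * (n + 1)
    best_row = [1] * (n + 1)
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            length = prefix(pattern[i - 1:], text[j - 1:])
            if length > max_prefix[j]:
                max_prefix[j], best_row[j] = length, i

    def jump(i, j):
        # The lemma: a constant-time combination of the two tables.
        return min(max_prefix[j], lcp[i][best_row[j]])

    return jump
```

After the tables are built, each `jump(i, j)` call is a single `min` of two lookups, which is what makes it affordable to call it once per segment start.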

  29. Constructing $\mathcal{S}$... • To find the points $(i,j) = \mathrm{start}(S)$, $S \in \mathcal{S}_\gamma$, we construct the suffix array $A$ of $P$. • We make a copy $A_s$ of $A$ for each distinct character $s$ of $P$. Then we remove from each $A_s$ the suffixes $i$ such that $p_{i-1} = s$. • Now, if we query $T_{j \dots j+\gamma-1}$ from the suffix array $A_s$ where $s = t_{j-1}$ (or from $A$ if $A_s$ does not exist), the resulting positions of $P$ give the line segments that start at column $j$.
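
A Python sketch of this filtering idea, assuming a single track given as a string and a naively sorted suffix array (a stand-in for a proper construction algorithm). Each query is a plain binary search costing $O(\gamma \log m)$, without the LCP refinement. All function names are illustrative.

```python
def suffix_array(pattern):
    """0-based suffix start positions in lexicographic order (naive sort)."""
    return sorted(range(len(pattern)), key=lambda i: pattern[i:])


def filtered_arrays(pattern, sa):
    """For each character s, the suffix array without suffixes preceded by s."""
    return {s: [i for i in sa if i == 0 or pattern[i - 1] != s]
            for s in set(pattern)}


def matching_suffixes(pattern, array, query):
    """Suffixes in the array that start with the query, by binary search."""
    g = len(query)
    lo, hi = 0, len(array)
    while lo < hi:                       # first suffix whose g-prefix >= query
        mid = (lo + hi) // 2
        if pattern[array[mid]:array[mid] + g] < query:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(array)
    while lo < hi:                       # first suffix whose g-prefix > query
        mid = (lo + hi) // 2
        if pattern[array[mid]:array[mid] + g] <= query:
            lo = mid + 1
        else:
            hi = mid
    return array[first:lo]


def segment_starts(pattern, text, gamma):
    """1-based (i, j) start points of maximal segments of length >= gamma."""
    sa = suffix_array(pattern)
    arrays = filtered_arrays(pattern, sa)
    found = []
    for j in range(len(text) - gamma + 1):
        # Use the copy filtered by the preceding text character, if any.
        array = arrays.get(text[j - 1], sa) if j > 0 else sa
        for i in matching_suffixes(pattern, array, text[j:j + gamma]):
            found.append((i + 1, j + 1))
    return found
```

Filtering by the preceding character guarantees that a reported position is not the continuation of a match one step up the diagonal, so it is a genuine segment start.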

  30. Constructing $\mathcal{S}$... • If we associate all suffix arrays with LCP values, the overall complexity of constructing $\mathcal{S}_\gamma$ is $O(m^2 + Kn(\gamma + \log m) + |\mathcal{S}_\gamma|)$. • Using suffix trees instead gives a bound $O(m^2|P| + Kn\gamma + |\mathcal{S}_\gamma|)$. • A more direct approach gives $O(m^2 + Kn + |\mathcal{S}|)$ for the case $\gamma = 1$.

  31. Minimum splitting with $\alpha = 0$ • Fact: Let there be a splitting of the pattern into $\kappa$ pieces, starting at position $j$ of the multi-track text, without gaps between the partial occurrences. Then there is an equally good occurrence that can be found as follows: Select the track $T^k$ whose $j$th suffix has the longest common prefix, say of length $\ell$, with the pattern. Iterate the same algorithm from position $j + \ell$ with the pattern suffix $P_{\ell+1 \dots m}$, until a splitting into $\kappa$ pieces is found.

  32. Minimum splitting with $\alpha = 0$... • In the above algorithm, we need $\kappa$ queries to $\mathrm{Jump}(i,j)$ for each track at each position $j$. • Thus, after $O(m^2 + Kn)$ preprocessing for $\mathrm{Jump}(i,j)$ queries, the problem can be solved in $O(\kappa Kn)$ time.
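
The greedy procedure can be sketched in Python as follows, assuming equal-length parallel tracks. For brevity the sketch compares characters directly where the slides use constant-time Jump queries, so it does not reach the stated time bound; the function names and the `kappa` threshold parameter are illustrative.

```python
def prefix(a, b):
    """Length of the longest common prefix of sequences a and b."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def greedy_pieces(pattern, tracks, j, kappa):
    """Pieces used when starting at 0-based column j, or None.

    None means the pattern cannot be covered without gaps from column j
    using at most kappa pieces.
    """
    i, pieces = 0, 0
    while i < len(pattern):
        if pieces == kappa:
            return None                       # threshold exceeded
        # Take the track that matches the longest stretch from here.
        length = max(prefix(pattern[i:], track[j:]) for track in tracks)
        if length == 0:
            return None                       # no track continues the match
        i += length
        j += length
        pieces += 1
    return pieces


def min_splitting_no_gaps(pattern, tracks, kappa):
    """Best greedy result over all start columns, or None if none qualifies."""
    n = len(tracks[0])
    results = (greedy_pieces(pattern, tracks, j, kappa) for j in range(n))
    found = [r for r in results if r is not None]
    return min(found) if found else None
```

Each start column performs at most `kappa` rounds, and each round asks one prefix question per track, which mirrors the $\kappa$ Jump queries per track and position counted above.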

  33. Implementation • We implemented the $O(|M|)$ time algorithm, with the aforementioned skipping of empty characters. • Instead of using min-deques to support sliding window minima computation, we used a modification of the linear-time construction of Cartesian trees (simple to implement and fast in practice). • The algorithm is plugged into the C-BRAHMS music search engine.

  34. http://www.cs.helsinki.fi/group/cbrahms/demoengine/

  35. Extension and open problems • The $O(|M|)$ and $O(|\mathcal{S}| \log n)$ algorithms can be extended to the case where the cost is the sum of the lengths of the gaps between the partial occurrences. • Open: Computation in the case of the $\gamma$ restriction on the lengths of the partial occurrences. • Open: Can one achieve $O(m + Kn + |\mathcal{S}_\gamma|)$ time for constructing $\mathcal{S}_\gamma$?
