
Languages with mismatches and an application to approximate indexing

Chiara Epifanio, Alessandra Gabriele, and Filippo Mignosi.


Presentation Transcript


  1. Languages with mismatches and an application to approximate indexing Chiara Epifanio, Alessandra Gabriele, and Filippo Mignosi

  2. Outline • Motivations and basic definitions • The languages L(S,k,r) • The repetition index R(S,k,r) • Some combinatorial properties of the repetition index • A trie-based approach for approximate indexing data structures • Conclusions and related works

  3. Main motivation: Approximate String Matching. It concerns finding strings in texts in the presence of “errors” or “mismatches”. It has several applications in data analysis and data retrieval, such as: • Recovering the original signals after their transmission over noisy channels; • Finding DNA subsequences after possible mutations; • Text searching in the presence of typing or spelling errors; • Retrieving musical passages.

  4. Each application uses a different error model, which defines how different two strings are. Some of the best-studied error models are: • Levenshtein or edit distance [Levenshtein, 1965]: it allows inserting, deleting and substituting single characters (with a different one) in both strings; • Hamming distance [Sankoff and Kruskal, 1983]: it allows only substitutions; • Scoring functions: they are not distances in the mathematical sense; they measure the degree of similarity between two words.

  5. The distance d(x,y) between two strings x and y is the minimal cost of a sequence of operations that transforms x into y (and ∞ if no such sequence exists). The possible operations are: 1) Insertion, 2) Deletion, 3) Substitution, 4) Transposition. We consider the Hamming distance, which allows only substitutions, each of cost 1 (simplified definition). It is finite whenever |x|=|y|, and in that case 0 ≤ d(x,y) ≤ |x|. Ex.: x=acgtatct, y=aggttact, d(x,y)=3 (in the simplified definition).
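
The simplified Hamming distance above can be sketched in a few lines of Python. The function name `hamming` is an illustrative choice, not something taken from the slides.

```python
import math


def hamming(x: str, y: str) -> float:
    """Simplified Hamming distance: the number of mismatching positions.

    Returns infinity when the lengths differ, since no sequence of
    substitutions can then transform x into y.
    """
    if len(x) != len(y):
        return math.inf
    return sum(a != b for a, b in zip(x, y))


# The example from the slide: mismatches at positions 2, 5 and 6.
print(hamming("acgtatct", "aggttact"))  # prints 3
```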

  6. Typical approaches for finding a string x in a text S: to consider a percentage D of errors, or to fix the number k of them. Hybrid approach: to introduce a new parameter r and to allow at most k errors in any window (or factor) of length r. • Let S be a string over the alphabet Σ, and let k, r be non-negative integers such that k ≤ r. A string u occurs in S at position l up to k errors in a window of size r, or simply kr-occurs in S at position l, if one of the following two conditions holds: • if |u| < r, then d(u, S(l, l+|u|-1)) ≤ k; • if |u| ≥ r, then ∀ i, 1 ≤ i ≤ |u|-r+1, d(u(i,i+r-1), S(l+i-1, l+i+r-2)) ≤ k. • A string u satisfying the above property is a kr-occurrence of S. Let L(S,k,r) be the set of words that kr-occur in S at position l, for some l, 1 ≤ l ≤ |S|-|u|+1. The parameter r introduced in the previous definition can be fixed or can vary as a function of the text.
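
As a minimal sketch of the definition of a kr-occurrence, the check below follows the two cases literally. The name `kr_occurs_at` and the 1-based position argument `l` are illustrative choices made here to match the notation S(l, l+|u|-1); the sketch assumes u is non-empty and r ≥ 1.

```python
def hamming(x: str, y: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))


def kr_occurs_at(u: str, S: str, l: int, k: int, r: int) -> bool:
    """True if u kr-occurs in S at 1-based position l."""
    n = len(u)
    if l < 1 or l + n - 1 > len(S):
        return False                      # u does not fit in S at position l
    segment = S[l - 1:l - 1 + n]          # S(l, l+|u|-1) in the slide's notation
    if n < r:
        # Case |u| < r: the whole word is compared at once.
        return hamming(u, segment) <= k
    # Case |u| >= r: every window of size r may contain at most k mismatches.
    return all(hamming(u[i:i + r], segment[i:i + r]) <= k
               for i in range(n - r + 1))


print(kr_occurs_at("bbba", "abaa", 1, 1, 2))  # prints True
print(kr_occurs_at("bbba", "abaa", 1, 1, 4))  # prints False
```

With r = 2 the two mismatches of bbba against abaa (positions 1 and 3) never fall in the same window; with r = 4 they do, so the second call fails.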

  7. Example of L(S,k,r): S=abaa, k=1, r=2. • L(S,1,2)={a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,aaaa,aaab,abaa,abab,abba,bbaa,bbab,bbba} • bbba ∈ L(S,1,2), but bbba ∉ L(S,1,4)
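
The example can be reproduced by brute force. The sketch below assumes the alphabet is Σ = {a, b} (the slide does not state it explicitly) and enumerates every word up to length |S|; the helper names `fits` and `language` are hypothetical. It is exponential in |S| and only meant for tiny inputs.

```python
from itertools import product


def fits(u: str, v: str, k: int, r: int) -> bool:
    """True if u has at most k mismatches w.r.t. v in every window of size r.

    When |u| < r there is a single window covering the whole word.
    """
    w = min(r, len(u))
    return all(sum(a != b for a, b in zip(u[i:i + w], v[i:i + w])) <= k
               for i in range(len(u) - w + 1))


def language(S: str, k: int, r: int, alphabet: str) -> set:
    """All non-empty words over `alphabet` that kr-occur somewhere in S."""
    words = set()
    for n in range(1, len(S) + 1):
        factors = [S[i:i + n] for i in range(len(S) - n + 1)]
        for letters in product(alphabet, repeat=n):
            u = "".join(letters)
            if any(fits(u, f, k, r) for f in factors):
                words.add(u)
    return words


L12 = language("abaa", 1, 2, "ab")
print(len(L12))                                  # prints 22
print("bbba" in L12)                             # prints True
print("bbba" in language("abaa", 1, 4, "ab"))    # prints False
```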

  8. The Repetition Index R(S,k,r) of S is the smallest integer h such that every string of length h kr-occurs in at most one position of the text: R(S,k,r) = min{h ≥ 1 s.t. ∀ i, j, 1 ≤ i, j ≤ |S|-h+1, V(S(i,i+h-1),k,r) ∩ V(S(j,j+h-1),k,r) ≠ ∅ ⇒ i=j}, where V(u,k,r) is the set of all words of length |u| that have at most k errors in every window of size r with respect to u. • Remarks: • R(S,k,r) is well defined because the integer h=|S| is an element of the set described above; • If k/r ≥ 1/2 then R(S,k,r)=|S|.
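
A brute-force sketch of the repetition index, for very small texts only: for each h it enumerates all words of length h and counts the positions at which each one kr-occurs. It restricts candidate words to the letters that appear in S, which is an assumption made here for brevity (a letter foreign to S mismatches every factor, so it cannot create a new intersection). The function names are hypothetical, and this is not the way the authors compute R(S,k,r).

```python
from itertools import product


def fits(u: str, v: str, k: int, r: int) -> bool:
    """True if u is in V(v,k,r): at most k mismatches in every window of size r."""
    w = min(r, len(u))
    return all(sum(a != b for a, b in zip(u[i:i + w], v[i:i + w])) <= k
               for i in range(len(u) - w + 1))


def repetition_index(S: str, k: int, r: int) -> int:
    """Smallest h such that every word of length h kr-occurs at most once in S."""
    alphabet = sorted(set(S))
    for h in range(1, len(S) + 1):
        factors = [S[i:i + h] for i in range(len(S) - h + 1)]
        unique = True
        for letters in product(alphabet, repeat=h):
            word = "".join(letters)
            # Count the positions (not the distinct factors) where word kr-occurs.
            if sum(fits(word, f, k, r) for f in factors) > 1:
                unique = False
                break
        if unique:
            return h
    return len(S)  # only reached for the empty string


print(repetition_index("abaa", 1, 2))      # prints 4, consistent with k/r >= 1/2
print(repetition_index("abaa", 0, 2))      # prints 2, the exact-matching case
print(repetition_index("abcdefgh", 1, 4))  # prints 3
```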

  9. Example. Let us consider the string S = a b c d e f g h i j k l m n o a b z d e z g h z j k z m n z with k = 1 and r = 2. • k/r = 1/2 ⇒ R(S,1,2)=|S|=30. • A word w of length R(S,1,2)-1=29 that kr-occurs (with k=1, r=2) at positions 1 and 2 is w = a c c e e g g i i k k m m o o b b d d z z h h j j z z n n
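
The claim about w can be checked mechanically. The short sketch below writes S and w without spaces and lists every position where w kr-occurs for k = 1, r = 2; the helper name `kr_positions` is hypothetical.

```python
S = "abcdefghijklmnoabzdezghzjkzmnz"
w = "acceeggiikkmmoobbddzzhhjjzznn"


def kr_positions(u: str, S: str, k: int, r: int) -> list:
    """1-based positions of S where u has at most k mismatches per window of size r."""
    size = min(r, len(u))
    positions = []
    for start in range(len(S) - len(u) + 1):
        segment = S[start:start + len(u)]
        if all(sum(a != b for a, b in zip(u[i:i + size], segment[i:i + size])) <= k
               for i in range(len(u) - size + 1)):
            positions.append(start + 1)
    return positions


print(len(S), len(w))               # prints 30 29
print(kr_positions(w, S, 1, 2))     # prints [1, 2]
```

At position 1 the mismatches fall on the even positions of w, at position 2 on the odd ones, so no window of size 2 ever contains two of them.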

  10. Some combinatorial properties of R(S,k,r). Lemma 1: If k and S are fixed, R(S,k,r) is a non-increasing function of r; Lemma 2: If r and S are fixed, R(S,k,r) is a non-decreasing function of k; Lemma 3: If k and S are fixed and r ≥ R(S,k,r), the repetition index becomes constant. Theorem: If k and S are fixed, there exists only one solution to the equation r = R(S,k,r).

  11. An Index over a fixed text S is an abstract data type whose base set is Fact(S), the set of factors of S, and that contains operations giving access to factors of S. The principal operations are: 1) Membership: given a word x, say if x ∈ Fact(S); 2) Position: given x ∈ Fact(S), find the left position of its first (resp. last) occurrence in S; 3) Number of occurrences: given x ∈ Fact(S), find the number of occurrences of x in S; 4) List of positions: given x ∈ Fact(S), produce the list occ(x) of the occurrences of x in S. All these operations can easily be extended to the case of approximate string matching.

  12. We give the following results. • The size of this indexing data structure is, on average, linear times a polylog of the size of the text S, i.e. O(|S|·log^k|S|). • For each word x, the time spent by our algorithms to find the list occ(x) of all kr-occurrences of the word x in the text S is proportional to |x|+|occ(x)| on average.

  13. Description of the indexing data structure • Build the trie T(I,k,r) that represents the set of all possible strings of length R(S,k,r) that kr-occur in the string S; • Add to each leaf of the trie T(I,k,r) an integer i, the starting position of the kr-occurrence of S represented by the concatenation of the labels from the root to that leaf.

  14. Finding all kr-occurrences of a string x in a text S • “Read” the string x in the trie for as long as possible and let q be the last visited node. i) If q is a leaf and |x|=R(S,k,r), return the position i stored at q; ii) If q is a leaf and |x|>R(S,k,r), then if x kr-occurs in S at position i return i, else “x is a false positive”; iii) If |x|<R(S,k,r), return occ(x). In i) and ii) the list of all kr-occurrences of x has at most one element; in iii) it can have more than one element. In iii) we use the Colored Range Query solution [Muthukrishnan, SODA’02].
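
The trie construction and the three search cases can be sketched as follows. Several things here are assumptions rather than the authors' implementation: the repetition index is passed in by the caller as `R`, the alphabet defaults to the letters of S, case iii) uses a plain subtree traversal with a set in place of the Colored Range Query solution (so the query time is not O(|x|+|occ(x)|)), and the neighbourhoods of the short suffixes of S are also inserted so that occurrences near the end of the text are not missed. All names (`Node`, `neighborhood`, `build_trie`, `search`) are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}    # letter -> Node
        self.positions = []   # 1-based start positions of words ending at this node


def neighborhood(f, k, r, alphabet):
    """All words of length |f| with at most k mismatches w.r.t. f in every window of size r."""
    words = []

    def extend(prefix, flags):
        p = len(prefix)
        if p == len(f):
            words.append(prefix)
            return
        for c in alphabet:
            flag = int(c != f[p])
            # Mismatches in the window of size r that ends at position p.
            if flag + sum(flags[max(0, p - r + 1):p]) <= k:
                extend(prefix + c, flags + [flag])

    extend("", [])
    return words


def build_trie(S, k, r, R, alphabet=None):
    """Trie of the words that kr-occur in S, cut at length R, labelled with positions."""
    alphabet = alphabet or sorted(set(S))
    root = Node()
    for i in range(len(S)):
        factor = S[i:i + R]                  # shorter than R near the end of S
        for word in neighborhood(factor, k, r, alphabet):
            node = root
            for c in word:
                node = node.children.setdefault(c, Node())
            node.positions.append(i + 1)
    return root


def kr_occurs_at(u, S, l, k, r):
    """Direct check that u kr-occurs in S at 1-based position l."""
    seg = S[l - 1:l - 1 + len(u)]
    if len(seg) != len(u):
        return False
    w = min(r, len(u))
    return all(sum(a != b for a, b in zip(u[i:i + w], seg[i:i + w])) <= k
               for i in range(len(u) - w + 1))


def search(root, S, x, k, r, R):
    """Sorted list of the 1-based positions where x kr-occurs in S."""
    node = root
    for c in x[:R]:                          # "read" x for as long as possible
        node = node.children.get(c)
        if node is None:
            return []
    if len(x) < R:                           # case iii): collect the whole subtree
        found, stack = set(), [node]
        while stack:
            current = stack.pop()
            found.update(current.positions)
            stack.extend(current.children.values())
        return sorted(found)
    if len(x) == R:                          # case i): a single stored position
        return sorted(node.positions)
    # Case ii): verify the candidate directly; a failure is a false positive.
    return [i for i in node.positions if kr_occurs_at(x, S, i, k, r)]


S, k, r, R = "abaa", 1, 2, 4                 # R(abaa,1,2) = |S| = 4 since k/r = 1/2
root = build_trie(S, k, r, R)
print(search(root, S, "bbba", k, r, R))      # prints [1]
print(search(root, S, "bb", k, r, R))        # prints [1, 2]
```

The set in case iii) is what removes duplicate positions: many leaves below q carry the same position i, which is the redundancy the Colored Range Query solution is cited for handling efficiently.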

  15. Results. Proposition: The overall time for finding all kr-occurrences of a string x in a text S is O(|x|+|occ(x)|). Theorem: The k-mismatch problem on a text S over a fixed alphabet can be settled by a data structure having average size O(|S|·log^k|S|) that answers queries in time O(|x|+|occ(x)|), for any query word x.

  16. Conclusions and related work • The results of this paper are in the PhD thesis of A. Gabriele [January, 2005]. • Independently, M. Maass and J. Nowak gave analogous results, using essentially the same data structure and the CRQ solution [Preprint March, 2005 – CPM June, 2005], but: - a window is not used; - the analysis of the size of the data structure is improved; - the technique is extended to edit distance. • It is still open to find an indexing data structure of size linear times a polylog and with searching time O(|x|+|occ(x)|).
