
Searching strings using the waves



  1. Searching strings using the waves An efficient index structure for string databases Ingmar Brouns, Jacob Kleerekoper

  2. Overview • Introduction • A lower bound on the edit distance • What is a wavelet? • A refinement of the lower bound • MRS index • Searching using the MRS index • Questions

  3. String searching • Searching in a string database S = {s1, …, sd} • Range search • Find all the substrings of S within edit distance r of a search string q • The error rate is denoted by ε = r / |q| • k-Nearest neighbor search • Find the k closest substrings of S to q

  4. Question from Adriano • What is the query range r in ε = r / |q|? • When performing a range search, r denotes the maximum edit distance a substring of the database may have to the query and still be a valid result

  5. f(s) is a frequency vector • Σ is an alphabet of σ characters • s is a string from that alphabet Σ • f(s) = [v1, …, vσ]where vi = frequency of i-th letter of Σ in s • Sum of v1, …, vσ = length of s

  6. Example of f(s) • Σ = {A, C, G, T} • s = CTACATCGATCGATCAG • #A = 5, #C = 5, #G = 3, #T = 4 • f(s) = [5, 5, 3, 4] • Sum of v1, …, vσ = length of s • 5 + 5 + 3 + 4 = 17
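As a small illustration of how f(s) could be computed (a minimal sketch of our own, assuming the DNA alphabet of the example):

```python
# Minimal sketch (not from the paper): compute the frequency vector f(s)
# for a fixed alphabet; here the DNA alphabet of the example is assumed.
ALPHABET = "ACGT"

def frequency_vector(s: str) -> list[int]:
    """Return [v1, ..., v_sigma], where v_i counts the i-th alphabet letter in s."""
    return [s.count(ch) for ch in ALPHABET]

assert frequency_vector("CTACATCGATCGATCAG") == [5, 5, 3, 4]  # sums to 17 = |s|
```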

  7. Question from Laurence • “As a result of lemma 1, the transformation of a string of length n lie on the σ-1 dimensional plane that passes through the point [n, 0, …, 0] and is perpendicular to the normal vector [1, …, 1]”Why is that the case?

  8. Answer to Laurence • Take a string s with |s| = 4 and an alphabet Σ = {A, B, C} • This string might be AAAA, BBBB, CCCC or any of the 78 other strings of length 4 • The corresponding f(s) are [4, 0, 0], [0, 4, 0] and [0, 0, 4]. All the f(s) lie on the same 2D plane in the 3D space with equation v1 + v2 + v3 = 4, which has normal vector [1, 1, 1] • This holds in general: for every string length n and alphabet size σ, the plane v1 + … + vσ = n has normal vector [1, …, 1]

  9. Edit operations on s • f(s) = [v1, ..., vσ] • Insert • vi := vi + 1 • Delete • vi := vi - 1 • Replace • vi := vi + 1 and vj := vj - 1with i ≠ j

  10. The σ-dimensional space • The space in which all the possible points [v1, ..., vσ] exist • Take two points u and v in the σ-dimensional space • We call u and v neighbors if u can be obtained from v using one edit operation

  11. Frequency distance FD1 • Take u and v frequency vectors of two strings of the same alphabet (points in the σ-dimensional space) • The Frequency distance FD1 (u, v) between u and v is the minimum number of steps to get from u to v by jumping each step to a neighbor point

  12. Frequency distance vs. edit distance • The edit distance ED(s1, s2) of strings s1 and s2 is the minimal number of edit operations to get from s1 to s2 • FD1(f(s1), f(s2)) ≤ ED (s1, s2) • s1 = AC, s2 = CA, f(s1) = [1,1,0,0], f(s2) = [1,1,0,0] • FD1 = 0 • ED = 2 (two replaces or one insert and one delete)

  13. Frequency distance vs. edit distance • FD1(f(s1), f(s2)) ≤ ED(s1, s2) • Proof sketch: • A single insert or delete in the edit script corresponds to rule 1 or 2 in FD1; both ED and FD1 grow by exactly 1 • A replace corresponds to rule 3 (vi := vi + 1 and vj := vj - 1) and again both grow by 1, while an insert plus a delete (cost 2 in ED) can be covered by a single application of rule 3 (cost 1 in FD1) • Every edit script therefore yields a frequency-vector path that is at most as long, so FD1 is a lower bound on ED

  14. The lower bound of ED • Take q and s strings from alphabet Σ; r is the range (maximum ED in a range search) • If r < FD1(f(q), f(s)) then r < ED(q, s) • Computing ED costs O(nm) time (with n and m the string lengths), while FD1 costs only O(σ); a dynamic-programming sketch for ED follows below
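To make the cost comparison concrete, here is a standard dynamic-programming sketch for ED (our own code, not from the paper); the point of FD1 is that it can be checked first, so this O(nm) computation only runs for candidates that survive the cheap O(σ) filter.

```python
# Standard O(n*m) edit-distance DP (our own sketch), shown only to contrast
# its cost with the O(sigma) frequency-distance filter: if r < FD1(f(q), f(s)),
# the DP for ED(q, s) never needs to run.
def edit_distance(a: str, b: str) -> int:
    n, m = len(a), len(b)
    prev = list(range(m + 1))                  # row for the empty prefix of a
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete a[i-1]
                         cur[j - 1] + 1,       # insert b[j-1]
                         prev[j - 1] + cost)   # match or replace
        prev = cur
    return prev[m]

assert edit_distance("AC", "CA") == 2          # the example from slide 12
```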

  15. Computing FD1 • Take frequency vectors u and v of two strings over the same alphabet Σ • We collect the total positive distance (pos) and the total negative distance (neg) • for every letter i of Σ • if ui > vi we add the difference ui - vi to pos • otherwise we add vi - ui to neg • return the maximum of pos and neg

  16. How does it work? • u = [2, 10], v = [8, 1] • u1 < v1, so add 8 - 2 = 6 to neg • u2 > v2, so add 10 - 1 = 9 to pos • Now pos = 9, neg = 6 • return max(pos, neg) = 9 (for example, 6 replaces and 3 inserts turn v into u)
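A minimal sketch of the FD1 procedure described on the previous slide (function and variable names are ours):

```python
# Sketch of FD1 as described on slide 15: collect the total positive and the
# total negative component differences and return the larger of the two.
def fd1(u: list[int], v: list[int]) -> int:
    pos = sum(ui - vi for ui, vi in zip(u, v) if ui > vi)
    neg = sum(vi - ui for ui, vi in zip(u, v) if ui < vi)
    return max(pos, neg)

assert fd1([2, 10], [8, 1]) == 9               # the worked example above
assert fd1([1, 1, 0, 0], [1, 1, 0, 0]) == 0    # f(AC) vs f(CA): FD1 = 0, ED = 2
```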

  17. Improving the lower bound • We’ve established a lower bound on the edit distance, namely the frequency distance • But we can improve this lower bound by incorporating more information than just how often letters occur; we would also like some information about when they occur

  18. Wavelets • Wavelet transform • Problem with the Fourier transform: it represents the frequencies in a signal, but not when those frequencies occur • The wavelet transform shows both time and frequency • Used in all sorts of signal processing, e.g. compression (JPEG2000)

  19. How does this affect us? • Suppose we have some signal • We can encode this signal by recursively taking the average of a part of the signal, and then the difference between the averages of its two halves (see the small numeric sketch below)
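A tiny numeric illustration of that averaging/differencing idea (our own example, not from the slides); the next slide applies the same idea to strings.

```python
# Illustrative numeric example (not from the slides): one level of the
# averaging/differencing idea. Each pair of samples is replaced by its average
# and by the difference between its two halves.
def one_level(signal):
    averages = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    details = [a - b for a, b in zip(signal[0::2], signal[1::2])]
    return averages, details

print(one_level([9, 7, 3, 5]))   # ([8.0, 4.0], [2, -2])
```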

  20. Wavelets • Now with strings • AT • frequency vector (average) = [1,0,0,1] • Detail = [1,0,0,0] – [0,0,0,1] = [1,0,0,-1] • Note that we know by the detail that the first half was an A and the second half was a T.

  21. Wavelets (Adriano) • Build the tree for TCACTTAG by splitting recursively into halves: TCACTTAG → TCAC, TTAG → TC, AC, TT, AG → T, C, A, C, T, T, A, G • The leaves get the frequency vectors of the single characters: T = [0, 0, 0, 1], C = [0, 1, 0, 0], A = [1, 0, 0, 0], C = [0, 1, 0, 0], T = [0, 0, 0, 1], T = [0, 0, 0, 1], A = [1, 0, 0, 0], G = [0, 0, 1, 0]

  22. Wavelets • One level up, each node holds the sum (the "average") of its two children: TC = [0, 1, 0, 1], AC = [1, 1, 0, 0], TT = [0, 0, 0, 2], AG = [1, 0, 1, 0]

  23. Wavelets • The next level gives TCAC = [1, 2, 0, 1] and TTAG = [1, 0, 1, 2]; the root TCACTTAG gets [2, 2, 1, 3]

  24. Wavelets • Each node also gets a detail coefficient: the frequency vector of its left half minus that of its right half, e.g. for TC: f(T) - f(C) = [0, -1, 0, 1]

  25. Wavelets • All detail coefficients: root TCACTTAG: [1, 2, 0, 1] - [1, 0, 1, 2] = [0, 2, -1, -1]; TCAC: [-1, 0, 0, 1]; TTAG: [-1, 0, -1, 2]; TC: [0, -1, 0, 1]; AC: [1, -1, 0, 0]; TT: [0, 0, 0, 0]; AG: [1, 0, -1, 0]

  26. Wavelets • The transform of TCACTTAG thus consists of the root average [2, 2, 1, 3] together with the detail coefficients [0, 2, -1, -1], [-1, 0, 0, 1], [-1, 0, -1, 2] and, one level lower, [0, -1, 0, 1], [1, -1, 0, 0], [0, 0, 0, 0], [1, 0, -1, 0]

  27. The kth wavelet transformation • Definition 4: Let s = c0, ..., cn-1 be a string over the alphabet {α1, ..., ασ}. The kth-level wavelet transformation ψk(s), 0 ≤ k ≤ log2 n, of s is defined as: • ψk(s) = [vk,0, ..., vk,(n/2^k)-1], where vk,i = [Ak,i, Bk,i]

  28. The 0th wavelet transformation • The 0th wavelet transformation is just the original string • ψ0(s) = [v0,0, ..., v0,n-1], where v0,i = [A0,i, B0,i] • For TCACTTAG these are v0,0, ..., v0,7 with A0,i = f(ci), the frequency vector of the single character ci

  29. The log2 n wavelet transformation • In the article only the first and second wavelet coefficients are used; this corresponds to the log2 n-level wavelet transformation • ψk(s) = [vk,0, ..., vk,(n/2^k)-1], so only vlog2 n,0 remains • For TCACTTAG that is v3,0 • A3,0 = A2,0 + A2,1 • A2,0 = A1,0 + A1,1, A2,1 = A1,2 + A1,3 • A1,0 = A0,0 + A0,1, A1,1 = A0,2 + A0,3, etc. • A3,0 = [2, 2, 1, 3] • B3,0 = A2,0 - A2,1 = [0, 2, -1, -1]
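Putting the tree together, here is a hedged sketch (helper names are ours) that computes exactly these two coefficients for a string whose length is a power of two: A is the frequency vector of the whole string, B is the frequency vector of the first half minus that of the second half.

```python
# Sketch (helper names ours): the first and second wavelet coefficients [A, B]
# used in the paper, for a string whose length is a power of two.
ALPHABET = "ACGT"

def freq(s: str) -> list[int]:
    return [s.count(ch) for ch in ALPHABET]

def first_two_coefficients(s: str):
    half = len(s) // 2
    A = freq(s)                                                  # A_{log n, 0}
    B = [l - r for l, r in zip(freq(s[:half]), freq(s[half:]))]  # B_{log n, 0}
    return A, B

assert first_two_coefficients("TCACTTAG") == ([2, 2, 1, 3], [0, 2, -1, -1])
```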

  30. Theorem 3 (Bogdan) • String s with coefficients [A, B], where A = [a1, ..., aσ] and B = [b1, ..., bσ] • How can an edit operation influence A and B? • Replace in the first half: ai := ai + 1, aj := aj - 1, bi := bi + 1, bj := bj - 1 • Replace in the second half: ai := ai + 1, aj := aj - 1, bi := bi - 1, bj := bj + 1 • Delete & insert: ai := ai ± 1, bj := bj ± 1

  31. Theorem 3 • Delete on a string of even length: ai := ai - 1, bi := bi - 1, bj := bj + 2 (here αj is the character that shifts across the half boundary) • Example: AABA with A = [3, 1], B = [1, -1] becomes ABA with A = [2, 1], B = [0, 1] • Insert on a string of odd length: ai := ai + 1, bi := bi + 1, bj := bj - 2 • Example: ABA with A = [2, 1], B = [0, 1] becomes AABA with A = [3, 1], B = [1, -1]

  32. The lower bound • If we have ψ(si) and ψ(sj), the five kinds of steps listed in Theorem 3 can be used to walk from ψ(si) to ψ(sj) (these are two points in 2σ-dimensional space) • FD2(ψ(si), ψ(sj)) is then the length of the shortest legal path from ψ(si) to ψ(sj) using these steps

  33. The MRS index structure • A table of trees Ti,j forms the index structure • A column stands for string sj of the database S = {s1, ..., sd} with 1 ≤ j ≤ d • A row stands for a resolution (or window size) 2^i with a ≤ i ≤ a + l - 1, where l is the number of resolution levels in the index • Each tree Ti,j consists of several Minimum Bounding Rectangles (MBRs), each containing several wavelet coefficients, depending on the given capacity c of an MBR

  34. Building the MRS index (1) • Take string s1 = CTAGTCGA • Let's build the tree T2,1, given c = 3 • window size w = 4 (= 2^i), string number j = 1 • Take a window of size w and slide it along s1 • The first MBR contains the 1st and 2nd coefficients of the first c substrings seen in the window: {φ(CTAG), φ(TAGT), φ(AGTC)}; the next MBR: {φ(GTCG), φ(TCGA)}

  35. Building the MRS index (2) • s1 = CTAGTCGA, c = 3 • So T2,1 = {φ(CTAG), φ(TAGT), φ(AGTC)}, {φ(GTCG), φ(TCGA)} • Next, T3,1 = {φ(CTAGTCGA)} consists of only 1 MBR • Normally sj is much longer than 2^(a + l - 1) (the largest window size), so each row contains many more MBRs than in this toy example • Et cetera (a rough construction sketch follows below)
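A rough sketch of how one tree/row Ti,j could be built (the data layout is our own invention for illustration; the paper stores only the MBR corner points plus the start position of the first substring in each box):

```python
# Rough sketch of building one row T_{i,j} of the MRS index. The dict layout
# ("start"/"low"/"high") is our own; the paper keeps only the MBR corner points
# and the start position of the first substring in each box.
ALPHABET = "ACGT"

def wavelet_point(s):
    """First and second wavelet coefficients of s, concatenated into one point."""
    f = lambda t: [t.count(ch) for ch in ALPHABET]
    half = len(s) // 2
    return f(s) + [l - r for l, r in zip(f(s[:half]), f(s[half:]))]

def build_row(s, i, c):
    """Slide a window of size 2^i over s; pack every c transforms into one MBR."""
    w = 2 ** i
    points = [(start, wavelet_point(s[start:start + w]))
              for start in range(len(s) - w + 1)]
    mbrs = []
    for k in range(0, len(points), c):
        group = points[k:k + c]
        dims = range(len(group[0][1]))
        mbrs.append({"start": group[0][0],
                     "low": [min(p[d] for _, p in group) for d in dims],
                     "high": [max(p[d] for _, p in group) for d in dims]})
    return mbrs

# s1 = CTAGTCGA with resolution 2^2 = 4 and capacity c = 3 gives the two MBRs
# from the slide: one over {CTAG, TAGT, AGTC}, one over {GTCG, TCGA}.
assert len(build_row("CTAGTCGA", i=2, c=3)) == 2
```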

  36. Some remarks on the index structure • Searching for a subquery of length 2^i? Just take row Ri = {Ti,1, ..., Ti,d} of the table • Take a query string q and an MBR B: FD(q, B) is the minimum of all the FD(q, s) with s ∈ B, so if r ≤ FD(q, B) then r ≤ FD(q, s) for all s ∈ B • Wavelet coefficients of substrings obtained by sliding the window are very close to each other, so the set of coefficients in an MBR is highly clustered

  37. Range Queries • We are searching for all sequences that are within an edit distance of r from the query string • Easy case: the index contains a resolution that exactly fits the size of the query string

  38. Range queries • The query string has length 2^a • For the corresponding row in the index we compute, for every database sequence, the FD of the query string to the MBRs • If r ≤ FD(q, B) then r ≤ FD(q, s) for all s ∈ B • If r < FD(q, B) then r < ED(q, s) for every s ∈ B • So if FD(q, B) > r, then we drop B

  39. Range queries • However, we may still have some false positives • r > FD(q, s) does not guarantee that r > ED(q, s) • That's why we have to post-process all strings in the boxes with FD(q, B) ≤ r, e.g. by computing the real edit distance with dynamic programming
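One plausible way to get such a box lower bound from only the MBR corner points is to clamp the query point into the box and take the frequency distance to that clamped point. This is our own illustration in FD1 style, not the paper's definition; the paper works with the full wavelet points [A, B].

```python
# Our illustration (FD1 style) of a box lower bound computed from corner points
# only: clamping each coordinate of q into [low, high] yields the point of the
# box that minimises both difference sums, hence the smallest FD1 over the box.
def fd1(u, v):
    pos = sum(ui - vi for ui, vi in zip(u, v) if ui > vi)
    neg = sum(vi - ui for ui, vi in zip(u, v) if ui < vi)
    return max(pos, neg)

def fd_to_box(q, low, high):
    clamped = [min(max(qi, lo), hi) for qi, lo, hi in zip(q, low, high)]
    return fd1(q, clamped)

# Any box with fd_to_box(q, low, high) > r can be dropped; the strings in the
# surviving boxes are then re-checked with the real edit distance.
```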

  40. Range queries • Now what if there is no row with a resolution corresponding to the query string? • We partition the query string • We take the longest possible suffix such that the resolution exists in the index • We continue doing this iteratively, so we get q1, q2, ..., qt (see the sketch below)
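A hedged sketch of that partitioning step (the greedy suffix choice is our reading of the slide, not code from the paper):

```python
# Sketch of the query partitioning described above (greedy suffix choice is our
# reading of the slide): repeatedly cut off the longest suffix whose length is
# one of the resolutions stored in the index.
def partition_query(q: str, resolutions: list[int]) -> list[str]:
    parts, rest = [], q
    while rest:
        w = max((r for r in resolutions if r <= len(rest)), default=None)
        if w is None:
            raise ValueError("query shorter than the smallest resolution")
        parts.append(rest[-w:])        # longest possible suffix
        rest = rest[:-w]
    return list(reversed(parts))       # q1, q2, ..., qt in left-to-right order

assert partition_query("ACGTACGTACGT", [4, 8]) == ["ACGT", "ACGTACGT"]
```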

  41. Range queries

  42. Nearest neighbour queries • Given some query, we search for the k closest substrings in the database • Phase 1 • Look up the set of k closest MBRs to the query • r1 is the kth smallest edit distance to the strings in that set • Phase 2 • RangeSearch(q, r1) • Return the k closest strings • Why phase 2? • FD(q, box10) ≤ FD(q, box11), FD10 ≤ ED10 and FD11 ≤ ED11 • However, this does not guarantee that ED10 ≤ ED11, so a string outside the k closest boxes may still be closer in edit distance

  43. Questions (Peter) • It is nice that they can prove that the MRS index does not incur any false drops (Theorem 4), but is this also true in a practical sense? • If r ≤ FD(q, B) then r ≤ FD(q, s) for all s ∈ B • If r < FD(q, B) then r < ED(q, s) for every s ∈ B • So if FD(q, B) > r, then we drop B

  44. Questions (Lee) • The article focuses on substring matching. What adaptations would we need for whole matching? • Determine [A,B] for every entire string in the DB • Determine FD of q to each string

  45. Questions (Bogdan) • The definition of FD(q, B) in section 3.3 says that the distance between a query transformation and a box is the minimum of the distances between the query transformation and the transformations in that box. It is also mentioned in the same section that for each box (MBR) only the lower/higher end points and the starting location of the first substring contained in that MBR are stored as part of the index • Further on, in section 4.4, part of the range query algorithm implies the computation of FD(q, B) for various (query, MBR) pairs. However, since we only have the lower/higher end points for each MBR, how is it possible to compute FD(q, B) without retrieving all the substrings s_i that are in the box B from disk? • I could think of alternatively defining FD(q, B) with a formula involving only the lower/higher end points of the box B, but this is not what the authors are suggesting/using.

  46. Questions (Bogdan)
