1 / 45

An Efficient Index Structure for String Databases

An Efficient Index Structure for String Databases. Tamer Kahveci Ambuj K. Singh. Department of Computer Science University of California Santa Barbara. http://www.cs.ucsb.edu/~tamer. Whole/Substring Matching Problem.

kamea
Download Presentation

An Efficient Index Structure for String Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An Efficient Index Structure for String Databases Tamer Kahveci Ambuj K. Singh Department of Computer Science University of California Santa Barbara http://www.cs.ucsb.edu/~tamer

  2. Whole/Substring Matching Problem • Find similar substrings in a database, that are similar to a given query string quickly, using a small index structure (1-2 % of database size). database string query string

  3. String Similarity • Motivation: • Applications • Genetic sequence databases, NCBI • Text databases, spell checkers, web search. • Video databases (e.g. VIRAGE, MEDIA360) • Database size is too large. Most of the techniques available are in-memory. • Space requirement of current indexes is too large. Base Pairs (millions) Year

  4. Outline • Motivation & background • Our contribution • Frequency vector, frequency distance & wavelet transform • Multi-resolution index structure • k-NN & range queries • Experimental results • Conclusion

  5. Notation • q : query string. • m,n : length of strings. • r : range query radius. •  = r/|q|: error rate.

  6. String Similarity: an example • A C T - - T A G C R I I D • A A T G A T A G -

  7. Background • Edit operations: • Insert • Delete • Replace • Edit distance (ED) between s1 and s2 = minimum number of edit operations to transform s1 to s2. • Finding the edit distance is costly. • O(mn) time and space if m and n are lengths of s1 and s2 if dynamic programming is used [NW70, SW81].

  8. Related Work • Lossless search • Online • [Mye86] (Myers) reduce space requirement to O(rn), where r is query radius. • [WM92] (Wu, Manber) binary masks, O(rn). • [BYN99] (Beaze-Yates, Navarro) NFA • Offline (index based) • [Mye94] (Myers) condensed r-neighborhood. • [BYN97] (Beaze-Yates, Navarro) dictionary. • Lossy search • [AG90] (Altschul, Gish) BLAST. • FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, MumMER. • [GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree

  9. Outline • Motivation & background • Our contribution • Frequency vector, frequency distance & wavelet transform • Multi-resolution index structure • k-NN & range queries • Experimental results • Conclusion

  10. Frequency Vector • Let s be a string from the alphabet ={1, ..., }. Let ni be the number of occurrences of the character i in s for 1i, then frequency vector: f(s) =[n1, ..., n]. • Example: • s = AATGATAG • f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]

  11. Effect of Edit Operations on Frequency Vector • Delete : decreases an entry by 1. • Insert : increases an entry by 1. • Replace : Insert + Delete • Example: • s = AATGATAG => f(s) = [4, 0, 2, 2] • (del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2] • (ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2] • (AC), s = ACCTATAG => f(s) = [3, 2, 1, 2]

  12. f(s) FD1(f(q),f(s)) f(q) An Approximation to ED:Frequency Distance (FD1) • s = AATGATAG => f(s)=[4, 0, 2, 2] • q = ACTTAGC => f(q)=[2, 2, 1, 2] • pos = (4-2) + (2-1) = 3 • neg = (2-0) = 2 • FD1(f(s),f(q)) = 3 • ED(q,s) = 4 • FD1(f(s1),f(s2))=max{pos,neg}. • FD1(f(s1),f(s2)) ED(s1,s2).

  13. Frequency Set of strings 1 Set of strings 2 Distance Edit Distance An Illustration of Frequency Distance & Edit Distance v1 v2

  14. Using Local Information: Wavelet Decomposition of Strings • s = AATGATAC => f(s)=[4, 1, 1, 2] • s = AATG + ATAC = s1 + s2 • f(s1) = [2, 0, 1, 1] • f(s2) = [2, 1, 0, 1] • 1(s)= f(s1)+f(s2) = [4, 1, 1, 2] • 2(s)= f(s1)-f(s2) = [0, -1, 1, 0]

  15. Wavelet Decomposition of a String: General Idea • Ai,j = f(s(j2i : (j+1)2i-1)) • Bi,j = Ai-1,2j - Ai-1,2j+1 First wavelet coefficient Second wavelet coefficient (s)=

  16. Wavelet Decomposition & ED • Define FD(s1,s2)=max{FD1, FD2}.

  17. Outline • Motivation & background • Our contribution • Frequency vector, frequency distance & wavelet transform • Multi-resolution index structure • k-NN and range queries • Experimental results • Conclusion

  18. transform MRS-Index Structure Creation s1 w=2a

  19. MRS-Index Structure Creation s1

  20. MRS-Index Structure Creation s1

  21. MRS-Index Structure Creation s1 ... slide c times c=box capacity

  22. MRS-Index Structure Creation s1 ...

  23. MRS-Index Structure Creation s1 Ta,1 ... W=2a

  24. Using Different Resolutions s1 Ta,1 ... W=2a Ta+1,1 ... W=2a+1

  25. MRS-Index Structure

  26. MRS-index properties • Relative MBR volume (Precision) decreases when • c increases. • w decreases. • MBRs are highly clustered. Box volume Box Capacity

  27. Outline • Motivation & background • Our contribution • Frequency vector, frequency distance & wavelet transform • Multi-resolution index structure • k-NN & range queries • Experimental results • Conclusion

  28. 1= 2 1 3 2 208 16 64 128 Range Queries [KS01] s1 s2 sd ... ... ... ... w=24 ... ... ... ... w=25 ... ... ... ... w=26 ... ... ... ... w=27

  29. k-Nearest Neighbor Query [KSF+96, SK98] k = 3

  30. k-Nearest Neighbor Query k = 3 r = Edit distance to 3rd closest substring

  31. k-Nearest Neighbor Query r k = 3

  32. k-Nearest Neighbor Query k = 3

  33. Outline • Motivation & background • Our contribution • Experimental results • Conclusion

  34. Experimental Settings • w={128, 256, 512, 1024}. • Human chromosomes from (www.ncbi.nlm.nih.gov) • chr02, chr18, chr21, chr22 • Plotted results are from chr18 dataset. • Queries are selected from data set randomly for 512  |q|  10000. • An NFA based technique [BYN99] is implemented for comparison.

  35. Experimental Results 1:Effect of Box Capacity (10-NN)

  36. Experimental Results 2:Effect of Window Size (10-NN)

  37. Experimental Results 3:k-NN queries

  38. Experimental Results 4:Range Queries

  39. Outline • Motivation & background • Our Contribution • Experimental results • Discussion & conclusion

  40. Discussion • In-memory (index size is 1-2% of the database size). • Lossless search. • 3 to 45 times faster than NFA technique for k-NN queries. • 2 to 12 times faster than NFA technique for range queries. • Can be used to speedup any previously defined technique.

  41. Future Work • Extend to weighted edit distance and affine gaps. • Extend to local similarity (substring/substring) search. • Compare the quality of answers and speed to BLAST (lossy search). • Use as a preprocessing step to BLAST. • Apply the MRS index structure for larger alphabet size (e.g. protein sequences.).

  42. Related Work • Lossless search • Online • [Mye86] (Myers) reduce space requirement to O(rn), where r is query radius. • [WM92] (Wu, Manber) binary masks, O(rn). • [BYN99] (Beaze-Yates, Navarro) NFA • Offline (index based) • [Mye94] (Myers) condensed r-neighborhood. • [BYN97] (Beaze-Yates, Navarro) dictionary. • Lossy search • [AG90] (Altschul, Gish) BLAST. • FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, MumMER. • [GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree

  43. Related Work (Similar problems) • [BYP92] (Beaze-Yates, Perleberg) only replace is allowed. • [Gus97] (Gusfield) exact matching, suffix trees. • [JKS00] (Jagadish, Koudas, Srivastava) exact matching with wild-cards for multidimensional strings, elided trees and R-tree.

  44. THANK YOU

  45. B f(s) FD(f(q),f(s)) FD(f(q),B) f(q) f(q) Frequency Distance to an MBR

More Related