Efficient Merging and Filtering Algorithms for Approximate String Searches

Jiaheng Lu, University of California, Irvine. Joint work with Chen Li and Yiming Lu.


Presentation Transcript


  1. Efficient Merging and Filtering Algorithms for Approximate String Searches • Jiaheng Lu, University of California, Irvine • Joint work with Chen Li, Yiming Lu

  2. Example: a movie database • Find movies starring Schwarrzenger.

  3. In general: Gap between Queries and Data • Errors in the query • The user doesn’t remember a string exactly • The user unintentionally types a wrong string • Query: Schwarrzenger • Data: Schwarzenegger

  4. Data may not be clean • Errors in the database: • Data is often not clean by itself; this is especially true in data integration and cleansing • Relation R, Relation S

  5. The query may include errors

  6. Problem definition: approximate string searches Collection of strings s Star Search Keanu Reeves Samuel Jackson Query q Schwarzenegger Samuel Jackson … Output: strings s that satisfy Sim(q,s)≤δ

  7. Example Similarity Function: Edit Distance • A widely used metric to define string similarity • Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • Example: s1: Tom Hanks, s2: Ton Hank, ed(s1,s2) = 2
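The edit distance defined on this slide can be computed with the classic dynamic-programming recurrence. The following is a minimal sketch, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty prefix of s2
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The slide's example needs one substitution (m to n) and one deletion (the trailing s), which gives the distance of 2.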

  8. Example: approximate string searches Collection of strings s Star Search Tom Hank Thomas Hanks Query q Ton Hank Tom J. Hanks Tom Hanks … Output: strings s that satisfy ed(q,s)≤2

  9. Outline • Problem motivation • Preliminary • Grams • Inverted lists • Merge algorithms • Filtering technique • Conclusion

  10. String → Grams • q-grams • For example, the 2-grams of "universal": (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
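Extracting q-grams amounts to sliding a window of width q over the string. A small sketch follows; the function name `qgrams` is an assumption of this example, and it does not pad the string with special start or end characters.

```python
def qgrams(s: str, q: int = 2) -> list:
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```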

  11. Inverted lists • Convert strings to gram inverted lists • Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static • 2-gram inverted lists: at → 4; ch → 0,2; ck → 1,3; ic → 0,1,2,4; ri → 0; st → 1,2,3,4; ta → 4; ti → 1,2,4; tu → 3; uc → 3
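The conversion from strings to gram inverted lists can be sketched as below. This is an illustrative version only: the name `build_inverted_lists` is hypothetical, and recording each string id at most once per gram is a simplifying assumption.

```python
from collections import defaultdict


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the increasing list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps an id from appearing twice in one list (assumption of this sketch).
        grams = {s[i:i + q] for i in range(len(s) - q + 1)}
        for gram in grams:
            index[gram].append(sid)  # ids arrive in increasing order
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_lists(strings)
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are appended while scanning the strings in id order, every list comes out sorted, which the merge algorithms later in the talk rely on.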

  12. Main Example • Query: stick, with ed(s,q) ≤ 1 • Query grams: (st,ti,ic,ck) • Inverted lists: st → 1,2,3,4; ti → 1,2,4; ic → 0,1,2,4; ck → 1,3 • Merge (count ≥ 2): candidate string ids {1,2,3,4}. Performance bottleneck! • Double check for the real edit distance: final answers {1,2,3}
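The two phases of this example (merge the lists to get candidates, then verify with the real edit distance) can be sketched as follows. The threshold is computed with the standard count-filter bound, (number of query grams) minus k·q, which is not spelled out on the slide but yields its value of 2. The variable names are assumptions of this sketch, and the merge here is a plain counter rather than any of the algorithms discussed later.

```python
from collections import Counter


def edit_distance(s1, s2):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


strings = ["rich", "stick", "stich", "stuck", "static"]
# Inverted lists of the query grams, copied from the slide.
lists = {"st": [1, 2, 3, 4], "ti": [1, 2, 4], "ic": [0, 1, 2, 4], "ck": [1, 3]}
query, k, q = "stick", 1, 2

# Count filter: one edit operation destroys at most q grams.
threshold = (len(query) - q + 1) - k * q  # 4 - 2 = 2

# Phase 1: merge the lists and keep ids that occur at least `threshold` times.
counts = Counter(sid for ids in lists.values() for sid in ids)
candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

# Phase 2: verify each candidate against the real edit distance.
answers = [sid for sid in candidates if edit_distance(query, strings[sid]) <= k]

print(candidates)  # [1, 2, 3, 4]
print(answers)     # [1, 2, 3]
```

String 4 (static) shares three grams with the query, so it survives the merge but is rejected by the verification step.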

  13. Sub-problem definition: Given multiple inverted lists of integers in increasing order and a threshold T, find all values that occur at least T times across the lists.

  14. Example • Count threshold: 4 • Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15) • Result: 13

  15. Outline • Problem motivation • Preliminary • Merge algorithms • Two previous algorithms • Our proposed three algorithms • Filtering technique • Conclusion

  16. Five Merge Algorithms • Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004] • New: ScanCount, MergeSkip, DivideSkip

  17. Two previous algorithms (1) • Heap-based Algorithm • Push the head of each list to a min-heap • Count the # of occurrences of each element by a heap

  18. Example of HeapMerger [Sarawagi et al. 2004] • Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15) • minHeap (head of each list): 1, 10, 5, 13, 15 • Count threshold ≥ 4
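A heap-based merge along these lines might look like the sketch below. It illustrates the idea on the slide rather than reproducing the code of [Sarawagi et al. 2004]; the name `heap_merge` is hypothetical, and each list is assumed to be strictly increasing.

```python
import heapq


def heap_merge(lists, threshold):
    """Return values occurring in at least `threshold` of the sorted lists."""
    # Heap entries are (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every entry equal to the current minimum and advance its list.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.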

  19. Five Merge Algorithms • Previous: HeapMerger [Sarawagi 2004], MergeOpt [Sarawagi 2004] • New: ScanCount, MergeSkip, DivideSkip

  20. Two previous algorithms (2) • MergeOpt Algorithm • Long Lists: T-1 (probed by binary search) • Short Lists

  21. Example of MergeOpt [Sarawagi et al. 2004] • Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15) • Long Lists: 3 • Short Lists: 2 (min-heap) • Count threshold ≥ 4
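The long/short split can be sketched as follows. A value that occurs at least T times must appear in at least one short list, because only T-1 lists are treated as long; so candidates come from the short lists and are completed by binary search on the long ones. This is a simplified illustration, not the original MergeOpt code: `merge_opt` and `contains` are names chosen here, `heapq.merge` stands in for the heap over the short lists, and the early-termination tricks of the original are omitted.

```python
import heapq
from bisect import bisect_left


def contains(sorted_list, value):
    """Binary search for `value` in an increasing list."""
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value


def merge_opt(lists, threshold):
    by_length = sorted(lists, key=len, reverse=True)
    long_lists = by_length[:threshold - 1]   # the T-1 longest lists
    short_lists = by_length[threshold - 1:]  # everything else

    # Count occurrences within the short lists (merged in increasing order).
    short_counts = {}
    for value in heapq.merge(*short_lists):
        short_counts[value] = short_counts.get(value, 0) + 1

    result = []
    for value, count in short_counts.items():
        # Complete the count by probing each long list.
        count += sum(contains(lst, value) for lst in long_lists)
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

With threshold 4, the three longest lists are long and the two single-element lists are short, matching the slide.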

  22. Can we run faster?

  23. Five Merge Algorithms • Previous: HeapMerger, MergeOpt • New: ScanCount, MergeSkip, DivideSkip

  24. Our new algorithms (1) • ScanCount Algorithm • Use an array to record the # of occurrences of each element

  25. ScanCount Example • Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15) • Count array indexed by ids 1 to 15 • Count threshold ≥ 4 • Result: 13
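ScanCount needs no heap at all: it scans every list once and increments a counter per id. A minimal sketch, assuming ids are integers in the range 0 to `num_ids - 1` (the parameter name is hypothetical):

```python
def scan_count(lists, threshold, num_ids):
    """Return ids occurring at least `threshold` times, using one counter per id."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            # Report an id exactly once, at the moment it reaches the threshold.
            if counts[sid] == threshold:
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_ids=16))  # [13]
```

The lists do not even need to be sorted for this approach; the price is an array as large as the id space.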

  26. Five Merge Algorithms • Previous: HeapMerger, MergeOpt • New: ScanCount, MergeSkip, DivideSkip

  27. Our new algorithms (2) • MergeSkip algorithm • Pop T-1 elements from the min-heap • Jump in the T-1 popped lists

  28. Example of MergeSkip • Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15) • minHeap: empty • Count threshold ≥ 4

  29. Example of MergeSkip • minHeap: 1, 5, 10, 13, 15 • Count threshold ≥ 4

  30. Example of MergeSkip • Pop 1, 5, 10 • minHeap: 13, 15 • Count threshold ≥ 4

  31. Example of MergeSkip • Pop 1, 5, 10 • minHeap: 13, 15 • Jump ≥ 13 in the popped lists • Count threshold ≥ 4

  32. Example of MergeSkip • minHeap: 13, 13, 13, 13, 15 • Result: 13 • Count threshold ≥ 4
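The pop-and-jump steps of this example can be sketched as follows. This is an illustrative reading of the slides, not the authors' code: the name `merge_skip` is hypothetical, lists are assumed strictly increasing, and the jump is done with `bisect_left`.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, threshold):
    """Return values occurring in at least `threshold` of the sorted lists."""
    # Heap entries are (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    # With fewer than `threshold` lists left, no value can reach the threshold.
    while heap and len(heap) >= threshold:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= threshold:
            result.append(value)
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            # Pop more entries until T-1 lists have been popped in total.
            while len(popped) < threshold - 1:
                popped.append(heapq.heappop(heap))
            # Anything smaller than the new top can occur in at most T-1 lists,
            # so each popped list jumps straight to its first value >= target.
            target = heap[0][0]
            for _, i, pos in popped:
                new_pos = bisect_left(lists[i], target, pos)
                if new_pos < len(lists[i]):
                    heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

On the slide's input the first round pops 1, 5, and 10, jumps those three lists to 13, and the next round finds 13 four times; the values 3 and 7 are never read.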

  33. Five Merge Algorithms • Previous: HeapMerger, MergeOpt • New: ScanCount, MergeSkip, DivideSkip

  34. Our new algorithms (3) • DivideSkip Algorithm • Long Lists (dynamic size): binary search • Short Lists: MergeSkip

  35. Size of long lists • How many lists are treated as long lists? • Cost: MergeOpt • Binary search • Long Lists • Short Lists

  36. Size of long lists • How many lists are treated as long lists? • Cost: MergeSkip • Binary search • Long Lists • Short Lists

  37. Decide the L value • A good balance in the tradeoff: L = # of long lists = T / (μ log M + 1)
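Putting the pieces together, DivideSkip can be sketched as MergeSkip on the short lists with threshold T - L, followed by binary-search probes of the L long lists. Several points here are assumptions of this sketch rather than statements from the slides: M is taken to be the length of the longest list, the logarithm is base 2, L is rounded down and capped at T-1, μ is a tunable parameter whose default below is arbitrary, and all function names are hypothetical.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, threshold):
    """MergeSkip returning {value: count} for values with count >= threshold."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    found = {}
    while heap and len(heap) >= threshold:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= threshold:
            found[value] = len(popped)
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            while len(popped) < threshold - 1:
                popped.append(heapq.heappop(heap))
            target = heap[0][0]
            for _, i, pos in popped:
                new_pos = bisect_left(lists[i], target, pos)
                if new_pos < len(lists[i]):
                    heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return found


def num_long_lists(threshold, longest_len, mu):
    """L = T / (mu * log M + 1); base-2 log and rounding down are assumptions."""
    return int(threshold / (mu * math.log2(longest_len) + 1))


def divide_skip(lists, threshold, mu=0.01):
    by_length = sorted(lists, key=len, reverse=True)
    longest = max(len(by_length[0]), 1) if by_length else 1
    # Keep at least one list short so the short-list threshold stays >= 1.
    L = min(num_long_lists(threshold, longest, mu), threshold - 1, len(lists))
    long_lists, short_lists = by_length[:L], by_length[L:]
    result = []
    candidates = merge_skip_counts(short_lists, threshold - L)
    for value, count in sorted(candidates.items()):
        for lst in long_lists:
            i = bisect_left(lst, value)
            if i < len(lst) and lst[i] == value:
                count += 1
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, 4))  # [13]
```

A larger μ makes binary-search probes look more expensive and so shrinks L; with `mu=1.0` on this input only the longest list is treated as long, and the answer is unchanged.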

  38. Empirical verification • Our formula for “L” achieves the best result compared with the other options.

  39. Experimental data sets • Three real data sets with various string lengths and data sizes • DBLP data • IMDB data • Google Web corpus

  40. Performance (DBLP data) • Running time per query with various algorithms • DivideSkip is the best one

  41. # of elements read (DBLP data) • DivideSkip is the best one • DivideSkip skips reading the most elements

  42. Outline • Problem motivation • Preliminary • Merge algorithms • Filtering technique • Length, positional filter [Gravano et al. VLDB 2001] • Filter tree • Conclusion and future work

  43. Length Filtering • s: length 10 • t: length 19 • Ed(s,t) ≤ 2? The lengths differ by 9, so the pair can be pruned by length only!
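The length filter follows from the fact that each edit operation changes the length by at most one, so two strings within edit distance k differ in length by at most k. A one-line sketch (the function name is hypothetical):

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """False means ed(s, t) > k is certain; True means the pair still needs checking."""
    return abs(len(s) - len(t)) <= k


s = "a" * 10  # any string of length 10
t = "b" * 19  # any string of length 19
print(passes_length_filter(s, t, 2))  # False
```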

  44. Positional Filtering • Positional Gram • For example: string abcd: • {(ab,1),(bc,2),(cd,3)} • Ed(s,t) ≤ 2: gram (ab,1) in s cannot match gram (ab,12) in t, because the positions differ by more than 2
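A positional gram pairs each gram with its 1-based start position, and two equal grams are allowed to match only if their positions differ by at most the edit-distance threshold k. A short sketch with hypothetical function names:

```python
def positional_qgrams(s: str, q: int = 2) -> list:
    """Return (gram, 1-based position) pairs, as in the slide's example."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_can_match(g1, g2, k: int) -> bool:
    """Equal grams match only if their positions are within k of each other."""
    (gram1, pos1), (gram2, pos2) = g1, g2
    return gram1 == gram2 and abs(pos1 - pos2) <= k


print(positional_qgrams("abcd"))                   # [('ab', 1), ('bc', 2), ('cd', 3)]
print(grams_can_match(("ab", 1), ("ab", 12), 2))   # False
```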

  45. Filter tree • root • Length level: 1, 2, 3, …, n • Gram level: aa, ab, …, zy, zz • Position level: 1, 2, …, m • Inverted list: 5, 12, 17, 28, 44

  46. Surprising experimental results (DBLP) • Use filters wisely: more filters may be bad!

  47. Conclusion • Three new merge algorithms • We run faster • Surprising experimental results • Use filters wisely: more filters may be bad!

  48. Thank you!

  49. Backup: related work • Approximate string matching [Navarro 2001] • Fuzzy lookup in Varied length Grams [Li et al. 2007]

  50. References • [Arasu 2006] A. Arasu, V. Ganti, and R. Kaushik, “Efficient Exact Set-Similarity Joins,” in VLDB 2006 • [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning,” in SIGMOD 2003 • [Gravano 2001] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, “Approximate String Joins in a Database (Almost) for Free,” in VLDB 2001
