1 / 53

Efficient Approximate Search on String Collections Part I

Efficient Approximate Search on String Collections Part I. Marios Hadjieleftheriou. Chen Li. DBLP Author Search. http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html. Try their names (good luck!). http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html. UCSD.

clover
Download Presentation

Efficient Approximate Search on String Collections Part I

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient Approximate Search on String CollectionsPart I Marios Hadjieleftheriou Chen Li

  2. DBLP Author Search http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

  3. Try their names (good luck!) http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html UCSD Case Western AT&T--Research Yannis Papakonstantinou Meral Ozsoyoglu Marios Hadjieleftheriou

  4. Better system? http://dblp.ics.uci.edu/authors/

  5. People Search at UC Irvine http://psearch.ics.uci.edu/

  6. Web Search • Errors in queries • Errors in data • Bring query and meaningful results closer together Actual queries gathered by Google http://www.google.com/jobs/britney.html

  7. Data Cleaning R S

  8. Problem Formulation Find strings similar to a given string: dist(Q,D) <= δ Example: find strings similar to “hadjeleftheriou” • Performance is important! • 10 ms: 100 queries per second (QPS) • 5 ms: 200 QPS

  9. Outline • Motivation • Preliminaries • Trie-based approach • Gram-based algorithms • Sketch-based algorithms • Compression • Selectivity estimation • Transformations/Synonyms • Conclusion Part I Part II

  10. Next… Preliminaries

  11. Similarity Functions • Similar to: • a domain-specific function • returns a similarity value between two strings • Examples: • Edit distance • Hamming distance • Jaccard similarity • Soundex • TF/IDF, BM25, DICE • See [KSS06] for an excellent survey

  12. A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Edit Distance 13

  13. Next… Gram-based algorithms • List-merging algorithms [LLL08] • Variable-length grams (VGRAM) [LWY07,YWL08]

  14. “q-grams” of strings u n i v e r s a l 2-grams

  15. Edit operation’s effect on grams k operations could affect k * q grams Fixed length: q u n i v e r s a l If ed(s1,s2) <= k, then their # of common grams >= (|s1|- q + 1) –k *q

  16. id strings at ch ck ic ri st ta ti tu uc 0 1 2 3 4 rich stick stich stuck static 2-grams 4 2 3 0 1 4 3 0 3 1 0 1 2 4 4 1 2 4 2 3 q-gram inverted lists

  17. # of common grams >= 3 id strings at ch ck ic ri st ta ti tu uc 4 0 1 2 3 4 rich stick stich stuck static 2 0 1 3 0 1 2 4 0 4 2 3 1 4 1 2 4 3 3 Searching using inverted lists • Query: “shtick”, ED(shtick, ?)≤1 sh ht ti ic ck ic ck ti 2-grams

  18. T-occurrence Problem Merge Ascending order Find elements whose occurrences ≥ T

  19. Example T = 4 1 3 5 10 13 10 13 15 5 7 13 13 15 Result: 13

  20. List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip

  21. Heap-based Algorithm Push to heap …… Min-heap Count # of occurrences of each element using a heap

  22. MergeOpt Algorithm [SK04] Binary search Long Lists: T-1 Short Lists

  23. Example of MergeOpt 1 3 5 10 13 10 13 15 5 7 13 13 15 Long Lists: 3 Short Lists: 2 Count threshold T≥ 4

  24. ScanCount String ids # of occurrences Increment by 1 1 2 3 … 1 0 1 3 5 10 13 10 13 15 5 7 13 13 15 0 1 0 13 4 0 Result! 14 0 15 2 0 Count threshold T≥ 4 25

  25. List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip

  26. MergeSkip algorithm [BK02, LLL08] Pop T-1 …… Min-heap Jump Greater or equals T-1

  27. Example of MergeSkip 1 minHeap 5 10 13 15 1 3 5 10 10 15 5 7 13 15 13 13 Jump 17 17 15 15 Count threshold T≥ 4

  28. DivideSkip Algorithm [LLL08] Binary search MergeSkip Long Lists Short Lists

  29. How many lists are treated as long lists?

  30. Length Filtering Length: 10 s: By length only! Ed(s,t) ≤ 2 t: Length: 19

  31. Positional Filtering Ed(s,t) ≤ 2 s (ab,1) t (ab,12)

  32. A filter tree Combine filters with list-merging algorithms [LLL08]

  33. Next… Variable-length grams (VGRAM) [LWY07,YWL08]

  34. ati ich ick ric sta sti stu tat tic tuc uck 4 id strings id strings id strings 2 0 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 rich stick stich stuck static rich stick stich stuck static rich stick stich stuck static 1 0 4 2 1 3 4 2 1 4 3 3 2-grams -> 3-grams? • Query: “shtick”, ED(shtick, ?)≤1 sht hti tic ick ick tic # of common grams >= 1 3-grams

  35. id strings at ch ck ic ri st ta ti tu uc 0 1 2 3 4 rich stick stich stuck static 2-grams 1 4 2 3 0 1 4 3 0 3 0 1 2 4 4 1 2 4 2 3 Observation 1: dilemma of choosing “q” • Increasing “q” causing: • Longer grams  Shorter lists • Smaller # of common grams of similar strings

  36. Observation 2: skew distributions of gram frequencies • DBLP: 276,699 article titles • Popular 5-grams: ation (>114K times), tions, ystem, catio

  37. VGRAM: Main idea • Grams with variable lengths (between qmin and qmax) • zebra • ze(123) • corrasion • co(5213), cor(859), corr(171) • Advantages • Reduce index size  • Reducing running time  • Adoptable by many algorithms 

  38. Challenges • Generatingvariable-length grams? • Constructing a high-quality gram dictionary? • Relationship between string similarity and their gram-set similarity? • Adopting VGRAM in existing algorithms?

  39. [2,4]-gram dictionary ni ivr sal uni vers Challenge 1: String  Variable-length grams? • Fixed-length 2-grams u n i v e r s a l • Variable-length grams u n i v e r s a l

  40. Representing gram dictionary as a trie ni ivr sal uni vers

  41. Step 2: Constructing a gram dictionary qmin=2 qmax=4 • Frequency-based [LYW07] • Cost-based [YLW08]

  42. Challenge 3: Edit operation’s effect on grams k operations could affect k * q grams Fixed length: q u n i v e r s a l

  43. Deletion affects variable-length grams Not affected Not affected Affected i i-qmax+1 i+qmax- 1 Deletion

  44. Main idea • For a string, for each position, compute the number of grams that could be destroyed by an operation at this position • Compute number of grams possibly destroyed by k operations • Store these numbers (for all data strings) as part of the index Vector of s = <2,4,6,8,9> • Use this number to do count filtering With 2 edit operations, at most 4 grams can be affected

  45. Summary of VGRAM index

  46. Challenge 4: adopting VGRAM Easily adoptable by many algorithms Basic interfaces: • String s  grams • String s1, s2 such that ed(s1,s2) <= k  min # of their common grams

  47. Lower bound on # of common grams Fixed length (q) If ed(s1,s2) <= k, then their # of common grams >=: (|s1|- q + 1) –k *q u n i v e r s a l Variable lengths:# of grams of s1 – NAG(s1,k)

  48. ck ic ich … tic tick … id strings id strings id strings … ck ic … ti … 1 3 1 3 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 rich stick stich stuck static rich stick stich stuck static rich stick stich stuck static 4 1 4 1 2 0 2 0 1 2 4 2 4 1 Example: algorithm using inverted lists • Query: “shtick”, ED(shtick, ?)≤1 tick sh ht tick 2-grams 2-4 grams Lower bound = 3 Lower bound = 1

  49. End of part I • Motivation • Preliminaries • Trie-based approach • Gram-based algorithms • Sketch-based algorithms • Compression • Selectivity estimation • Transformations/Synonyms • Conclusion Part I Part II

More Related