
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search








1. Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)
(1) University of California, Irvine (2) Renmin University of China

2. Overview
• Motivation & Preliminaries
• Approach 1: Discarding Lists
• Approach 2: Combining Lists
• Experiments & Conclusion

3. Motivation: Data Cleaning
• Real-world data is dirty
• Typos
• Inconsistent representations (PO Box vs. P.O. Box)
• Approximately check against a clean dictionary
[Image: a misspelled name that should clearly be “Niels Bohr”. Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008]

4. Motivation: Record Linkage
• We want to link records belonging to the same entity
• No exact match! The same entity may have similar representations:
• Arnold Schwarzeneger versus Arnold Schwarzenegger
• Forrest Whittaker versus Forest Whittacker

5. Motivation: Query Relaxation
• Errors in queries
• Errors in data
• Bring queries and meaningful results closer together
[Image: actual misspelled queries gathered by Google, http://www.google.com/jobs/britney.html]

6. What is Approximate String Search?
String collection (people): Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzeneger, …
Queries against the collection:
• Find all entries similar to “Forrest Whitaker”
• Find all entries similar to “Arnold Schwarzenegger”
• Find all entries similar to “Brittany Spears”
What do we mean by similar to?
• Edit Distance
• Jaccard Similarity
• Cosine Similarity
• Dice
• Etc.
The similar to predicate can help the applications described above. How can we support these types of queries efficiently?
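Edit distance is the similarity measure used in the paper's running examples. For reference, a minimal sketch of the standard dynamic-programming computation (illustrative; not code from the paper):

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the classic O(|s|*|t|) dynamic program."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

# edit_distance("shanghai", "shangai") == 1 (one deleted "h")
```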

7. Approximate Query Answering
Main idea: use q-grams as signatures for a string. A sliding window over “irvine” gives the 2-grams {ir, rv, vi, in, ne}.
Intuition: similar strings share a certain number of grams.
An inverted index on grams supports finding all data strings sharing enough grams with a query.
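A minimal sketch of the sliding-window gram extraction and the inverted index it feeds (the function names are my own):

```python
from collections import defaultdict

def qgrams(s: str, q: int = 2) -> list[str]:
    """All overlapping q-grams of s via a sliding window."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_index(strings: list[str], q: int = 2) -> dict[str, list[int]]:
    """Map each gram to the list of stringIDs containing it, one
    posting per gram occurrence (matching the size estimate on slide 9)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].append(sid)
    return index

# qgrams("irvine") == ["ir", "rv", "vi", "in", "ne"]
```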

8. Approximate Query Example
Query: “irvine”, edit distance 1. 2-grams: {ir, rv, vi, in, ne}.
Look up the grams' inverted lists (stringIDs) in the index.
[Figure: inverted lists for the query grams; merging them yields candidates = {1, 5, 9}]
The candidate set may contain false positives, so we need to compute the real similarity for each candidate.
Each edit operation can “destroy” at most q grams, so answers must share at least T = 5 - 1 * 2 = 3 grams with the query.
T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list-merging, and T is called the merging threshold.
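One way to solve the T-occurrence problem is the count-based ScanCount algorithm; a minimal sketch reusing the qgrams helper above (the returned candidates still need verification with the real similarity, e.g. edit_distance):

```python
from collections import Counter

def scan_count(index: dict[str, list[int]], query: str, q: int, ed: int) -> list[int]:
    """Candidates sharing at least T grams with the query."""
    grams = qgrams(query, q)
    T = len(grams) - ed * q  # merging threshold
    if T <= 0:
        raise ValueError("panic: threshold <= 0, must scan the whole collection")
    counts = Counter()
    for g in grams:
        for sid in index.get(g, []):  # traverse each gram's inverted list
            counts[sid] += 1
    return [sid for sid, c in counts.items() if c >= T]
```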

9. Motivation: Compression
• The inverted index can be very large compared to the source data
• It may need to fit in memory for fast query processing
• Can we compress the index to fit into a space budget?
Index-size estimation:
• Each string produces |s| - q + 1 grams
• For each gram we add one element (a 4-byte uint) to its inverted list
• So the index stores roughly 4 bytes per character of data; with ASCII encoding (1 byte per character) the index is ~4x as large as the original data!
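The estimate in code form (same assumptions as the slide: one 4-byte stringID per gram occurrence):

```python
def estimated_index_bytes(strings: list[str], q: int) -> int:
    """Each string contributes |s| - q + 1 postings of 4 bytes each."""
    return 4 * sum(len(s) - q + 1 for s in strings if len(s) >= q)

# For long strings |s| - q + 1 is close to |s|, so the index costs
# ~4 bytes per character versus 1 byte per character of ASCII source: ~4x.
```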

10. Motivation: Related Work
• The IR community has developed many lossless compression algorithms for inverted lists (mostly in a disk-based setting)
• They mainly use a delta representation plus packing
• If the inverted lists are in memory, these techniques always impose decompression overhead
• The compression ratio is difficult to tune
How can we overcome these limitations in our setting?
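To make “delta representation + packing” concrete, a tiny illustration of the generic IR idea (this is the related work, not the paper's technique):

```python
def delta_encode(sorted_ids: list[int]) -> list[int]:
    """Store gaps between consecutive stringIDs instead of the IDs."""
    if not sorted_ids:
        return []
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]

def delta_decode(gaps: list[int]) -> list[int]:
    """Prefix-sum the gaps to recover the original IDs."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

# delta_encode([1, 3, 4, 5, 7, 9]) == [1, 2, 1, 1, 2, 2]; small gaps pack
# into few bits, but every query must decode the list first -- the
# decompression overhead the slide mentions.
```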

11. This Paper
• We developed two lossy compression techniques
• We answer queries exactly
• The index can fit into a space budget (space constraint)
• Queries can become faster on the compressed indexes
• Flexibility to choose a space/time tradeoff
• Existing list-merging algorithms can be re-used (even with compression-specific optimizations)

12. Overview
• Motivation & Preliminaries
• Approach 1: Discarding Lists
• Approach 2: Combining Lists
• Experiments & Conclusion

13. Approach 1: Discarding Lists
[Figure: the inverted lists (stringIDs) for the 2-grams before and after; some lists are discarded entirely, leaving “holes” in the index]

14. Effects on Queries
• Need to decrease the merging threshold T
• Lower T → more false positives to post-process
• If T <= 0 we “panic”: we must scan the entire collection and compute the true similarities
• Surprisingly, query processing time can decrease because there are fewer lists to consider

15. Query “shanghai”, Edit Distance 1
3-grams: {sha, han, ang, ngh, gha, hai}
Grams whose lists were discarded are hole grams; the rest are regular grams. Here han, ngh, and hai are holes, leaving the regular grams sha, ang, and gha.
Merging threshold without holes: T = #grams - ed * q = 6 - 1 * 3 = 3. Basis: each edit operation can “destroy” at most q = 3 grams.
Naive new merging threshold: T' = T - #holes = 0 → panic!
But can we really destroy q = 3 non-hole grams with each edit operation? Deleting “a” destroys sha, han, ang, and deleting “g” destroys ang, ngh, gha; either way, one edit operation destroys at most 2 non-hole grams.
New merging threshold: T' = 3 - 2 = 1.
We use dynamic programming to compute this tighter T'.
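The slides compute the tighter T' with dynamic programming. As an illustration only, here is a brute-force sketch of the same bound for small edit distances, under the simplified model that an edit at character position p destroys exactly the grams whose windows overlap p:

```python
from itertools import combinations

def tighter_threshold(s: str, q: int, ed: int, hole_grams: set[str]) -> int:
    """T' = (#non-hole grams) - (max non-hole grams destroyable by ed edits).

    Brute force over edit positions; the paper uses dynamic
    programming instead."""
    # start positions of the query's non-hole grams
    starts = [i for i in range(len(s) - q + 1) if s[i:i + q] not in hole_grams]
    worst = 0
    for edits in combinations(range(len(s)), ed):
        # an edit at position p destroys grams starting in [p - q + 1, p]
        destroyed = {i for i in starts for p in edits if p - q + 1 <= i <= p}
        worst = max(worst, len(destroyed))
    return len(starts) - worst

# Slide 15: tighter_threshold("shanghai", 3, 1, {"han", "ngh", "hai"}) == 1
```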

16. Choosing Lists to Discard
• One extreme: the query is entirely unaffected
• Other extreme: the query becomes a panic
• A good choice of lists depends on the query workload
• Many combinations of lists to discard satisfy the memory constraint; checking all of them is infeasible
• How can we make a “reasonable” choice efficiently?

17. Choosing Lists to Discard
Input: memory constraint, inverted lists L, query workload W
Output: lists to discard D

DiscardLists {
  while (memory constraint not satisfied) {
    for each list in L {
      Δt = estimateImpact(list, W)
      benefit = list.size()
    }
    discard = use the Δt's and benefits to choose a list
    add discard to D
    remove discard from L
  }
}

How can we do this efficiently, perhaps incrementally? The times needed are the list-merging time, the post-processing time, and the panic time. What exactly should we minimize: benefit/cost, or cost only? We could ignore the benefit…
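A minimal Python sketch of this greedy loop, assuming a hypothetical estimate_workload_time(index, workload) helper that models the list-merging, post-processing, and panic times; unlike the paper, this version re-estimates from scratch instead of incrementally:

```python
def discard_lists(index: dict[str, list[int]], workload: list[str],
                  budget_bytes: int, estimate_workload_time) -> set[str]:
    """Greedily discard the list with the best benefit/cost ratio
    until the index fits the budget. Mutates index in place."""
    discarded = set()
    index_bytes = lambda: 4 * sum(len(l) for l in index.values())
    base_time = estimate_workload_time(index, workload)
    while index_bytes() > budget_bytes and index:
        best_gram, best_ratio = None, None
        for gram in index:
            trial = {g: l for g, l in index.items() if g != gram}
            dt = estimate_workload_time(trial, workload) - base_time  # cost
            benefit = 4 * len(index[gram])  # bytes reclaimed
            ratio = dt / benefit
            if best_ratio is None or ratio < best_ratio:
                best_gram, best_ratio = gram, ratio
        discarded.add(best_gram)
        del index[best_gram]
        base_time = estimate_workload_time(index, workload)
    return discarded
```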

18. Choosing Lists to Discard: Estimating Query Times With Holes
• List-merging time: a cost function whose parameters are decided offline with linear regression
• Post-processing time: #candidates * average similarity-computation time
• Panic time: #strings * average similarity-computation time
• #candidates depends on T, the data distribution, and the number of holes

Incremental ScanCount algorithm:
Before discarding the list, with T = 3:
stringIDs: 0 1 2 3 4 5 6 7 8 9
counts:    2 0 3 3 2 4 0 0 1 0   → #candidates = 3
The list to discard contains stringIDs {2, 3, 4, 8}; decrement their counts.
After discarding the list, with T' = T - 1 = 2:
counts:    2 0 2 2 1 4 0 0 0 0   → #candidates = 4
There are many more ways to improve the speed of DiscardLists; this is just one example…
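A sketch of the incremental update matching the slide's numbers (the naive T' = T - 1 is used here; slide 15's dynamic program can yield a tighter threshold):

```python
def apply_discard(counts: list[int], discarded_list: list[int], T: int) -> tuple[int, int]:
    """counts[sid] = number of the query's gram lists containing sid.
    Decrement counts for IDs on the discarded list, lower the
    threshold, and recount the candidates."""
    for sid in discarded_list:
        counts[sid] -= 1
    new_T = T - 1
    num_candidates = sum(1 for c in counts if c >= new_T)
    return new_T, num_candidates

# Slide 18: counts = [2,0,3,3,2,4,0,0,1,0], discarded list = [2,3,4,8], T = 3
# -> counts become [2,0,2,2,1,4,0,0,0,0], T' = 2, #candidates = 4
```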

19. Overview
• Motivation & Preliminaries
• Approach 1: Discarding Lists
• Approach 2: Combining Lists
• Experiments & Conclusion

20. Approach 2: Combining Lists
[Figure: the inverted lists (stringIDs) for the 2-grams before and after; correlated lists are combined into shared physical lists]
Intuition: combine correlated lists.

21. Effects on Queries
• The merging threshold T is unchanged (no new panics)
• Lists become longer:
• More time to traverse the lists
• More false positives
List-merging optimization: several of the query's grams may map to the same combined physical list. For the 3-grams {sha, han, ang, ngh, gha, hai}, one combined list might be referenced by 2 of the grams (refcount = 2) and another by 3 (refcount = 3). Traverse each physical list once and increase the count of its stringIDs by the refcount instead of by 1.
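A sketch of the refcount optimization, assuming a mapping from grams to combined physical lists (the names are my own):

```python
from collections import Counter

def scan_count_combined(gram_to_list: dict[str, int],
                        physical_lists: dict[int, list[int]],
                        query_grams: list[str], T: int) -> list[int]:
    """ScanCount over combined lists: traverse each physical list
    once, adding its refcount (how many query grams map to it)
    rather than 1 per gram."""
    refcounts = Counter(gram_to_list[g] for g in query_grams if g in gram_to_list)
    counts = Counter()
    for list_id, ref in refcounts.items():
        for sid in physical_lists[list_id]:
            counts[sid] += ref
    return [sid for sid, c in counts.items() if c >= T]
```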

22. Choosing Lists to Combine
• Discovering candidate gram pairs:
• Frequent (q+1)-grams → correlated adjacent q-grams
• Using Locality-Sensitive Hashing (LSH)
• Selecting candidate pairs to combine:
• Based on the estimated cost on the query workload
• Similar to DiscardLists
• Uses a different incremental ScanCount algorithm
(A small sketch of measuring list correlation follows below.)
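As an illustration of what “correlated” means here, a direct Jaccard computation over two inverted lists; the paper avoids comparing all pairs by generating candidates from frequent (q+1)-grams and LSH instead:

```python
def list_correlation(list_a: list[int], list_b: list[int]) -> float:
    """Jaccard similarity of two inverted lists. Lists with
    correlation near 1 are good candidates for combining, since
    merging them adds few new (false-positive) stringIDs."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# e.g. list_correlation([1, 3, 4, 5], [1, 3, 4, 5, 7]) == 0.8
```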

23. Overview
• Motivation & Preliminaries
• Approach 1: Discarding Lists
• Approach 2: Combining Lists
• Experiments & Conclusion

24. Experiments
• Datasets:
• Google WebCorpus (word grams)
• IMDB Actors
• Queries: picked from the dataset, Zipf-distributed
• q = 3, edit distance = 2
• Overview:
• Performance of flavors of DiscardLists & CombineLists
• Scalability with increasing index size
• Comparison with an IR compression technique
• Comparison with VGRAM
• What happens if the workload changes from the training workload

25. Experiments
[Figures: results for DiscardLists and CombineLists; in both cases the query runtime decreases]

26. Experiments: Comparison with an IR Compression Technique
[Figures: compressed vs. uncompressed results]

27. Experiments: Comparison with the Variable-Length Gram Technique, VGRAM
[Figures: compressed vs. uncompressed results]

28. Future Work
• DiscardLists, CombineLists, and IR compression could be combined
• When considering a filter tree: global vs. local decisions
• How to minimize the impact on performance if the workload changes

29. Conclusion
• We developed two lossy compression techniques
• We answer queries exactly
• The index can fit into a space budget (space constraint)
• Queries can become faster on the compressed indexes
• Flexibility to choose a space/time tradeoff
• Existing list-merging algorithms can be re-used (even with compression-specific optimizations)

30. More Experiments
What if the workload changes from the training workload?

31. More Experiments
What if the workload changes from the training workload?
