Efficient Merging and Filtering Algorithms for Approximate String Searches

Jiaheng Lu, University of California, Irvine • Joint work with Chen Li and Yiming Lu

Example: a movie database • Find movies starring Schwarrzenger.

In general: Gap between Queries and Data • Errors in the query • The user doesn’t remember a string exactly • The user unintentionally types a wrong string • Query: Schwarrzenger • Data: Schwarzenegger, …

Data may not be clean • Errors in the database: data is often not clean by itself, which is especially true in data integration and cleansing • (Figure: Relation R and Relation S)

Queries may include errors

Problem definition: approximate string searches • Collection of strings s • Query q • Output: strings s that satisfy Sim(q,s) ≤ δ • Example collection (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson, …

Example Similarity Function: Edit Distance • A widely used metric to define string similarity • ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • Example: s1: Tom Hanks, s2: Ton Hank, ed(s1,s2) = 2
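The definition above can be checked with the classic dynamic program for edit distance. This is a minimal sketch, not the authors' code; the function name `edit_distance` is a hypothetical helper introduced for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))  # distances from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m -> n, delete s
```

Only one row of the table is kept at a time, so memory is proportional to the length of `s2`.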

Example: approximate string searches • Query q: Ton Hank • Collection of strings s (Star): Tom Hank, Thomas Hanks, Tom J. Hanks, Tom Hanks, … • Output: strings s that satisfy ed(q,s) ≤ 2

Outline • Problem motivation • Preliminary • Grams • Inverted lists • Merge algorithms • Filtering technique • Conclusion

String → Grams • q-grams • For example, the 2-grams of "universal": (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
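Extracting q-grams is a sliding window over the string. The sketch below assumes the slide's eight 2-grams come from the string "universal" (the slide lists only the grams); `qgrams` is a hypothetical helper name.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return every substring of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams, and a string shorter than q yields none.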

Inverted lists • Convert strings to gram inverted lists
id  string
0   rich
1   stick
2   stich
3   stuck
4   static
2-gram  inverted list (string ids)
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
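Building the gram inverted lists for the five strings on this slide can be sketched as follows. The helper name `build_inverted_index` is hypothetical, and recording each string id at most once per gram is a simplifying assumption that matches the lists shown here.

```python
from collections import defaultdict


def build_inverted_index(strings, q=2):
    """Map each q-gram to the increasing list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        seen = set()  # assumption: record an id once per gram
        for i in range(len(s) - q + 1):
            gram = s[i:i + q]
            if gram not in seen:
                seen.add(gram)
                index[gram].append(sid)
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_index(strings)
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are assigned in insertion order, every list comes out sorted, which the merge algorithms later rely on.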

Main Example
Query: stick, ed(s,q) ≤ 1 • Grams of the query: (st,ti,ic,ck)
Inverted lists of the query grams:
st   1, 2, 3, 4
ti   1, 2, 4
ic   0, 1, 2, 4
ck   1, 3
Merge the lists (count >= 2): candidate string ids {1,2,3,4} • Performance bottleneck!
Double check for the real edit distance: final answers {1,2,3}
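The two phases of the main example (merge the lists by count, then verify with the real edit distance) can be sketched as below. The strings and the four inverted lists are copied from the slides; the helper names are hypothetical. The threshold 2 agrees with the standard count-filter bound (|q| - 2 + 1) - 1·2 = 2 for 2-grams, a query of length 5, and edit distance 1.

```python
from collections import Counter


def edit_distance(a, b):
    """Classic dynamic program for edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def candidates_by_count(gram_lists, threshold):
    """Ids that occur on at least `threshold` of the given inverted lists."""
    counts = Counter(sid for ids in gram_lists for sid in ids)
    return sorted(sid for sid, c in counts.items() if c >= threshold)


strings = ["rich", "stick", "stich", "stuck", "static"]
lists = {"st": [1, 2, 3, 4], "ti": [1, 2, 4], "ic": [0, 1, 2, 4], "ck": [1, 3]}

query, k, T = "stick", 1, 2
query_grams = [query[i:i + 2] for i in range(len(query) - 1)]
candidates = candidates_by_count([lists[g] for g in query_grams], T)
answers = [sid for sid in candidates if edit_distance(strings[sid], query) <= k]

print(candidates)  # [1, 2, 3, 4]
print(answers)     # [1, 2, 3]
```

String 4 (static) shares three grams with the query but fails verification, which is why the final check is needed.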

Sub-problem definition: Given multiple inverted lists of integer values in increasing order and a threshold T, find all values whose number of occurrences is ≥ T.

Example • Count threshold: 4 • Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15) • Result: 13

Outline • Problem motivation • Preliminary • Merge algorithms • Two previous algorithms • Our proposed three algorithms • Filtering technique • Conclusion

Five Merge Algorithms • Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004] • New: ScanCount, MergeSkip, DivideSkip

Two previous algorithms (1): Heap-based algorithm • Push the head of each list onto a min-heap • Count the # of occurrences of each element using the heap

Example of HeapMerger [Sarawagi et al 2004] • Count threshold ≥ 4 • Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15) • Initial min-heap (head of each list): 1, 5, 10, 13, 15
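A heap-based merge in the spirit of HeapMerger can be sketched as follows. This is an illustrative reading of the slide, not the published implementation; the five lists are my reading of the slide's number row, and `heap_merge` is a hypothetical name.

```python
import heapq


def heap_merge(lists, T):
    """Return values occurring on at least T of the sorted lists."""
    # Each heap entry is (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every copy of the smallest value and advance those lists by one.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.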


Two previous algorithms (2): MergeOpt algorithm • Long lists: the T-1 longest lists, probed by binary search • Short lists: the remaining lists

Example of MergeOpt [Sarawagi et al 2004] • Count threshold ≥ 4 • Long lists (3): (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13) • Short lists (2), merged with a min-heap: (13), (15)
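The long/short split can be sketched as below. Since only T-1 lists are long, any value with at least T occurrences must appear on some short list, so the short lists supply the candidates and the long lists are only probed by binary search. This sketch counts the short lists with a `Counter` for brevity where the slide shows a min-heap; the names `merge_opt` and `contains` are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def contains(sorted_list, value):
    """Binary search for value in a sorted list."""
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value


def merge_opt(lists, T):
    """Return values occurring on at least T of the sorted lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    # Candidates: every value on a short list, with its count there.
    short_counts = Counter(v for lst in short_lists for v in lst)
    result = []
    for value in sorted(short_counts):
        count = short_counts[value] + sum(contains(lst, value) for lst in long_lists)
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

In the slide's example the candidates are 13 and 15; 13 is found on all three long lists (1 + 3 = 4), while 15 is found on only one (1 + 1 = 2).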

Can we run faster?


Our new algorithms (1): ScanCount algorithm • Use an array to record the # of occurrences of each element

ScanCount Example • Count threshold ≥ 4 • Count array indexed by value: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 • Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15) • Result: 13
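ScanCount needs no heap at all: it scans each list once and bumps a counter per value. A minimal sketch, assuming the largest possible value `max_id` is known in advance (the function and parameter names are hypothetical):

```python
def scan_count(lists, T, max_id):
    """Return values occurring on at least T lists, using one counter per value."""
    counts = [0] * (max_id + 1)
    result = []
    for lst in lists:
        for value in lst:
            counts[value] += 1
            if counts[value] == T:  # report a value once, when it first reaches T
                result.append(value)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, 15))  # [13]
```

The trade-off is memory proportional to the number of possible values in exchange for a scan with no comparisons between lists.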

Our new algorithms (2): MergeSkip algorithm • Pop T-1 elements from the min-heap • Jump the T-1 popped lists ahead

Example of MergeSkip • Count threshold ≥ 4
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Step 1: push the head of each list; minHeap: 1, 5, 10, 13, 15
Step 2: the top element 1 occurs fewer than 4 times, so pop T-1 = 3 elements: 1, 5, 10; minHeap: 13, 15
Step 3: the new top is 13, so jump each popped list to its smallest element ≥ 13
Step 4: minHeap: 13, 13, 13, 13, 15; 13 occurs 4 times • Result: 13
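The pop-and-jump steps of MergeSkip can be sketched as follows. This is an illustrative reading of the slides rather than the authors' implementation; `merge_skip` is a hypothetical name, and the jump is done with a binary search inside each popped list.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return values occurring on at least T of the sorted lists."""
    # Each heap entry is (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(value)
            # Advance each popped list by one element.
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # Too few copies: keep popping until T-1 elements are out in total.
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break  # fewer than T lists still have elements, so no more answers
        target = heap[0][0]
        # Jump each popped list to its smallest element >= target.
        for _, i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

The jump is safe because any value smaller than the new heap top can only occur on the T-1 popped lists, which is too few to reach the threshold. In the example, elements 3 and 7 are never read.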

Our new algorithms (3): DivideSkip algorithm • Long lists (dynamic size): binary search • Short lists: MergeSkip

Size of long lists • How many lists are treated as long lists? • Cost: MergeOpt • Binary search • Long lists • Short lists

Size of long lists • How many lists are treated as long lists? • Cost: MergeSkip • Binary search • Long lists • Short lists

Decide L value • A good balance in the tradeoff: # of long lists L = T / (μ log M + 1)
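Putting the formula and the long/short split together, DivideSkip can be sketched as below. Several points are assumptions, not taken from the slides: M is read as the length of the longest list, the logarithm is taken base 2, L is rounded down and clamped to at most T-1, and μ is a tunable coefficient whose value here (0.5) is arbitrary. All function names are hypothetical.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip sketch returning (value, count) pairs with count >= T."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            out.append((value, len(popped)))
            target = value + 1  # integer ids: step just past the reported value
        else:
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap))
            if not heap:
                break
            target = heap[0][0]
        for _, i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return out


def num_long_lists(T, M, mu):
    """L = T / (mu * log M + 1), rounded down and clamped to 0..T-1 (assumptions)."""
    L = int(T / (mu * math.log2(M) + 1))
    return max(0, min(L, T - 1))


def contains(sorted_list, value):
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value


def divide_skip(lists, T, mu):
    by_length = sorted(lists, key=len, reverse=True)
    if not by_length or not by_length[0]:
        return []
    L = min(num_long_lists(T, len(by_length[0]), mu), len(by_length))
    long_lists, short_lists = by_length[:L], by_length[L:]
    result = []
    # A value with >= T occurrences overall has >= T - L of them on short lists.
    for value, count in merge_skip_counts(short_lists, T - L):
        count += sum(contains(lst, value) for lst in long_lists)
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(num_long_lists(4, 5, 0.5))  # 1
print(divide_skip(lists, 4, 0.5))  # [13]
```

With L = 0 this degenerates to MergeSkip, and with L = T-1 it resembles MergeOpt with a skipping merge on the short lists, so the formula picks a point between the two.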

Empirical verification • Our formula for “L” achieves the best result among the options tested.

Experimental data sets • Three real data sets with various string lengths and data sizes • DBLP data • IMDB data • Google Web corpus

Performance (DBLP data) • DivideSkip is the best one • (Figure: running time per query with various algorithms)

# of elements read (DBLP data) • DivideSkip is the best one • DivideSkip skips reading the most elements

Outline • Problem motivation • Preliminary • Merge algorithms • Filtering technique • Length, positional filter [Gravano et al. VLDB 2001] • Filter tree • Conclusion and future work

Length Filtering • Ed(s,t) ≤ 2 • s: length 10 • t: length 19 • The lengths differ by 9 > 2, so the pair is pruned by length only!
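The length filter follows from the fact that each edit operation changes the length by at most one. A minimal sketch, with a hypothetical helper name and made-up placeholder strings of lengths 10 and 19:

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """Two strings within edit distance k differ in length by at most k."""
    return abs(len(s) - len(t)) <= k


s = "a" * 10  # placeholder string of length 10
t = "b" * 19  # placeholder string of length 19
print(passes_length_filter(s, t, 2))  # False: pruned without computing ed(s,t)
```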

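Positional Filtering • Positional gram: a gram together with its position in the string • For example, string abcd: {(ab,1),(bc,2),(cd,3)} • Ed(s,t) ≤ 2 • (ab,1) in s and (ab,12) in t cannot be matched: their positions differ by 11 > 2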
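Positional grams and the position check can be sketched as follows, using 1-based positions as on the slide. The helper names are hypothetical.

```python
def positional_qgrams(s: str, q: int = 2):
    """Pair each q-gram with its 1-based starting position."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_can_match(g1, g2, k: int) -> bool:
    """Equal grams count as a match only if their positions differ by at most k."""
    (gram1, pos1), (gram2, pos2) = g1, g2
    return gram1 == gram2 and abs(pos1 - pos2) <= k


print(positional_qgrams("abcd"))                     # [('ab', 1), ('bc', 2), ('cd', 3)]
print(grams_can_match(("ab", 1), ("ab", 12), k=2))   # False
```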

Filter tree • root • Length level: 1, 2, 3, …, n • Gram level: aa, ab, …, zy, zz • Position level: 1, 2, …, m • Leaf: inverted list, e.g. 5, 12, 17, 28, 44
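The three levels of the filter tree can be sketched with nested dictionaries: length, then gram, then position, with an inverted list at each leaf. This is an illustrative structure under my reading of the slide, not the authors' data structure; the function names and the example strings (taken from the inverted-list slide) are assumptions.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> increasing list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)  # 1-based positions
    return tree


def lookup(tree, gram, pos, query_len, k):
    """Ids whose string passes the length and positional filters for this gram."""
    ids = set()
    for length in range(query_len - k, query_len + k + 1):  # length level
        grams = tree.get(length)
        if not grams or gram not in grams:                  # gram level
            continue
        for p, lst in grams[gram].items():                  # position level
            if abs(p - pos) <= k:
                ids.update(lst)
    return sorted(ids)


strings = ["rich", "stick", "stich", "stuck", "static"]
tree = build_filter_tree(strings)
print(lookup(tree, "st", 1, query_len=5, k=1))  # [1, 2, 3, 4]
print(lookup(tree, "ic", 3, query_len=5, k=0))  # [1, 2]
```

Each extra level splits the inverted lists into more, shorter lists, which is one plausible reason that adding more filters does not always pay off.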

Surprising experimental results (DBLP) • Use filters wisely: more filters may be bad!

Conclusion • Three new merge algorithms • We run faster • Surprising experimental results: use filters wisely, more filters may be bad!

Thank you!

Backup: related work • Approximate string matching [Navarro 2001] • Fuzzy lookup in Varied length Grams [Li et al 2007]

References
• [Arasu 2006] A. Arasu, V. Ganti, and R. Kaushik, “Efficient Exact Set-Similarity Joins,” in VLDB 2006
• [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning,” in SIGMOD 2003
• [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, “Approximate String Joins in a Database (Almost) for Free,” in VLDB 2001
