VLDB 2007 Chen Li (UC, Irvine) Bin Wang (Northeastern University)

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
VLDB 2007
Chen Li (UC, Irvine), Bin Wang (Northeastern University), Xiaochun Yang (Northeastern University)
Presented by Jae-won Lee

Introduction
• Many applications have an increasing need to support approximate string queries on data collections
• Examples of approximate string queries
  • Data Cleaning: the same entity can be represented in slightly different forms
    • “PO BOX 23” and “P.O. Box 23”
  • Query Relaxation: errors in the query, inconsistencies in the data, limited knowledge about the data
    • “Steven Spielburg” and “Steve Spielberg”
  • Spellchecking: find potential candidates for a possibly mistyped word
Center for E-Business Technology

Introduction
• Dilemma of Choosing Gram Length
  • The gram length can greatly affect the performance of string matching
  • Increasing the gram length
    • Makes the inverted lists shorter
      • This may decrease the time to merge the inverted lists
    • Lowers the threshold on the number of common grams
      • This makes the filter less selective
[Figure: inverted lists of 2-grams (at, ch, ck, ic, ri, st, ta, ti, tu, uc) and 3-grams (ati, ich, ick, ric, sta, sti, stu, tat, tic, tuc, uck) over the strings rich (0), stick (1), stich (2), stuck (3), static (4); thresholds shown: # of common grams ≥ 3 with 2-grams, ≥ 1 with 3-grams]

VGRAM: Main Idea
• We analyze the frequencies of variable-length grams in the strings, and select a set of grams, called the gram dictionary
• For a string, we generate a set of grams of variable lengths using the gram dictionary
• Challenges
  • How to generate variable-length grams?
  • How to construct a high-quality gram dictionary?
  • What is the relationship between the similarity of two strings and the similarity of their gram sets?
  • How to adopt VGRAM in existing algorithms?

Challenge 1: Generating Variable-Length Grams
• Example
  • String s = universal
  • Gram dictionary D = {ni, ivr, sal, uni, vers}
  • qmin = 2, qmax = 4
  • Start at position p = 1 with VG = {}
  • The longest substring starting at u that appears in D is uni → insert (1, uni)
  • Move to the next character n; the longest substring is ni
    • However, this candidate (2, ni) is subsumed by the previous gram, so the algorithm does not insert it into VG
  • Move to the next character i; no substring starting at this character matches a gram in D, so the algorithm produces (3, iv) of length qmin = 2
  • Final set VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
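The walk-through above can be sketched in a few lines of Python. This is only an illustration of the steps on the slide, not the authors' implementation: the function name `vgen` and the way subsumption is tracked (by remembering the furthest end position covered so far) are assumptions made for this sketch.

```python
def vgen(s, gram_dict, qmin, qmax):
    """Generate positional variable-length grams (1-based positions).

    Sketch only: subsumption is checked by tracking the furthest end
    position covered so far, which works because start positions only
    increase (an assumption of this sketch, not taken from the slide).
    """
    grams = []
    max_end = 0  # furthest 1-based end position covered by an inserted gram
    for p in range(1, len(s) - qmin + 2):
        gram = None
        # Try the longest candidate first, down to length qmin.
        for length in range(min(qmax, len(s) - p + 1), qmin - 1, -1):
            candidate = s[p - 1:p - 1 + length]
            if candidate in gram_dict:
                gram = candidate
                break
        if gram is None:
            # No dictionary gram starts here: fall back to a qmin-gram.
            gram = s[p - 1:p - 1 + qmin]
        end = p + len(gram) - 1
        if end > max_end:  # otherwise an earlier gram subsumes this one
            grams.append((p, gram))
            max_end = end
    return grams


D = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", D, qmin=2, qmax=4))
```

Run on the slide's example, the sketch skips (2, ni) as subsumed and falls back to the qmin-gram iv at position 3, as described above.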

Challenge 2: Constructing Gram Dictionary
• Step 1: Collecting gram frequencies for grams with length in [qmin = 2, qmax = 4]
[Figure: gram-frequency trie with leaf nodes; visible entries: st → 0, 1, 3; sti → 0, 1; stu → 3; stic → 0, 1; stuc → 3]
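As a rough illustration of Step 1, the sketch below counts every gram whose length lies in [qmin, qmax]. It uses a flat counter in place of the trie shown on the slide, and it counts each occurrence of a gram; both choices, as well as the function name and the sample strings, are assumptions made for this example.

```python
from collections import Counter


def collect_gram_frequencies(strings, qmin, qmax):
    """Count occurrences of all grams with length in [qmin, qmax].

    A flat Counter stands in for the trie on the slide (assumption).
    """
    freq = Counter()
    for s in strings:
        for q in range(qmin, qmax + 1):
            for start in range(len(s) - q + 1):
                freq[s[start:start + q]] += 1
    return freq


# Sample strings chosen for illustration only.
strings = ["rich", "stick", "stich", "stuck", "static"]
freq = collect_gram_frequencies(strings, qmin=2, qmax=4)
print(freq["st"], freq["tic"], freq["stuc"])
```

A trie would store the same counts while sharing prefixes, so that the frequency of a gram and of its extensions can be read along one root-to-leaf path.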

Challenge 2: Constructing Gram Dictionary
• Step 2: Selecting High-Quality Grams
  • If a gram g has a low frequency, we eliminate from the tree all the extended grams of g
  • If a gram is very frequent, we keep some of its extended grams

Challenge 2: Constructing Gram Dictionary
• Pruning the tree using a frequency threshold T = 2
  • If the frequency of a node (one that has a leaf node) is ≤ T, its extended grams are removed

Challenge 2: Constructing Gram Dictionary
• Pruning the tree using a frequency threshold T = 2
  • If the frequency of a node (one that has a leaf node) is > T
    • A pruning policy selects a maximal subset of children to remove, so that L.freq (the frequency of the leaf node L) is not greater than T
      • SmallFirst: choose children with the smallest frequencies
      • LargeFirst: choose children with the largest frequencies
      • Random: choose children at random
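The three policies can be sketched as a greedy selection. The slide does not spell out how L.freq changes when children are removed; this sketch assumes that the frequency of each removed child is added to L.freq, and the function name, argument names and sample frequencies are hypothetical.

```python
import random


def select_children_to_remove(child_freqs, leaf_freq, threshold,
                              policy="SmallFirst", rng=None):
    """Greedily pick children to remove while L.freq stays <= threshold.

    Assumption (not stated on the slide): removing a child adds its
    frequency to the leaf frequency L.freq.
    """
    items = list(child_freqs.items())
    if policy == "SmallFirst":
        items.sort(key=lambda kv: kv[1])
    elif policy == "LargeFirst":
        items.sort(key=lambda kv: kv[1], reverse=True)
    elif policy == "Random":
        (rng or random).shuffle(items)
    else:
        raise ValueError(f"unknown policy: {policy}")

    removed, total = [], leaf_freq
    for name, freq in items:
        if total + freq <= threshold:  # skip children that would exceed T
            removed.append(name)
            total += freq
    return removed, total


children = {"a": 1, "b": 2, "c": 5}  # hypothetical child frequencies
print(select_children_to_remove(children, leaf_freq=0, threshold=3))
print(select_children_to_remove(children, 0, 3, policy="LargeFirst"))
```

Because a skipped child is never reconsidered and the running total only grows, no further child could be added at the end, so the selected subset is maximal under the stated assumption.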

Challenge 3: Similarity of Gram Sets
• Analyzing the effect of an edit operation at the i-th character on the positional grams
• These effects are stored in the NAG vector (the vector of the number of affected grams)
• Category 1: positional gram (p, g) with p < i-qmax+1 or p+|g|-1 > i+qmax-1
• Category 2: p ≤ i ≤ p+|g|-1
• Category 3: positional gram (p, g) to the left of the i-th character
• Category 4: positional gram (p, g) to the right of the i-th character
[Figure: string s with a deletion at position i and the window from i-qmax+1 to i+qmax-1, showing Categories 1 to 4]

Challenge 3: Similarity of Gram Sets
• Example
  • s = universal, D = {ni, ivr, sal, uni, vers}, qmin = 2, qmax = 4
  • VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
  • A deletion of the 5-th character e in the string s
    • i-qmax+1 = 2, i+qmax-1 = 8
  • Positional grams (1, uni) and (7, sal) are in Category 1
    • The starting position is before 2 / the ending position is after 8
    • These grams are not affected by the deletion
  • (4, vers) is in Category 2
  • (3, iv) is in Category 3
    • Since there is an extension of iv in D (ivr), (3, iv) could be affected by the deletion (potentially affected)
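The category tests in this example can be written as a small function. This is a sketch of the classification only (it does not decide whether a Category 3 or 4 gram is actually affected), and the function name `classify_gram` is an assumption.

```python
def classify_gram(p, gram, i, qmax):
    """Return the category (1 to 4) of positional gram (p, gram)
    for an edit at the i-th character (1-based positions)."""
    end = p + len(gram) - 1
    if p < i - qmax + 1 or end > i + qmax - 1:
        return 1  # reaches outside the window: not affected
    if p <= i <= end:
        return 2  # the gram contains the i-th character
    return 3 if end < i else 4  # left or right of the i-th character


vg = [(1, "uni"), (3, "iv"), (4, "vers"), (7, "sal")]
for p, g in vg:
    print((p, g), classify_gram(p, g, i=5, qmax=4))
```

For the deletion at i = 5 with qmax = 4, the window is [2, 8], so (1, uni) and (7, sal) fall in Category 1, (4, vers) in Category 2 and (3, iv) in Category 3, matching the example.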

Challenge 3: Similarity of Gram Sets
• Number of grams affected by each operation
• We want to transform string s to string s’ with 2 edit operations
  • At most 4 grams can be affected
[Figure: for the string _u_n_i_v_e_r_s_a_l_ (each _ is a gap where an insertion can occur), the number of grams affected by a deletion/substitution at each character and by an insertion at each gap; table of # of edit operations vs. # of grams for string s’]
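One simple way to arrive at the bound of 4 for two edit operations is to add up the two largest per-position counts. The sketch below assumes exactly that rule (sum of the k largest counts), which agrees with the slide's number but is not spelled out there; the counts are the 19 values printed in the slide's figure for universal, taken in the order they appear, and the function name is hypothetical.

```python
def nag_upper_bound(per_position_counts, k):
    """Upper bound on the number of grams affected by k edit operations,
    assuming it is the sum of the k largest per-position counts."""
    return sum(sorted(per_position_counts, reverse=True)[:k])


# Counts as printed in the slide's figure (characters and gaps combined).
counts = [1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 1, 1, 2, 1, 1, 2, 1]
print(nag_upper_bound(counts, k=2))
```

Since the rule only looks at the largest values, it does not depend on which count belongs to which character or gap.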

Challenge 4: Adopting VGRAM Technique
• Example of an algorithm based on inverted lists
  • Query: Edit Distance (shtick, ?) ≤ 1
  • With variable-length grams (2-4 grams)
    • VG(q) = {(1, sh), (2, ht), (3, tick)}, extracted using the gram dictionary
    • # of common grams = |VG(q)| - NAG(q, k) = 3 - 2 = 1
  • With fixed-length grams (2-grams)
    • # of common grams = (|s1| - q + 1) - k * q = (6 - 2 + 1) - 1 * 2 = 3, where s1 is the query string shtick
[Figure: inverted lists for the fixed-length grams (… ck, ic, … ti …) and the variable-length grams (… ck, ic, ich, … tic, tick …) over the strings rich (0), stick (1), stich (2), stuck (3), static (4)]
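The two lower bounds on the number of common grams in this example can be checked with a short sketch. The function names are hypothetical, and the value NAG(q, 1) = 2 is taken from the slide rather than computed.

```python
def fixed_gram_threshold(length, q, k):
    """Lower bound on common q-grams for edit distance <= k."""
    return (length - q + 1) - k * q


def vgram_threshold(num_grams, nag):
    """Lower bound on common variable-length grams: |VG(q)| - NAG(q, k)."""
    return num_grams - nag


query = "shtick"
vg_q = [(1, "sh"), (2, "ht"), (3, "tick")]  # VG(q) as given on the slide

print(fixed_gram_threshold(len(query), q=2, k=1))
print(vgram_threshold(len(vg_q), nag=2))  # NAG(q, 1) = 2 from the slide
```

A candidate string must share at least this many grams with the query to pass the count filter, so the bound decides how many inverted lists a string id must appear on.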

Experiments
• Data Sets
  • Data set 1: Texas Real Estate Commission
    • 151K person names, average length = 33
  • Data set 2: English dictionary from the Aspell spellchecker for Cygwin
    • 149,165 words, average length = 8
  • Data set 3: DBLP Bibliography
    • 277K titles, average length = 62

VGRAM Overhead
• Data set 3
[Figures: index size; construction time]

Benefits of Using Variable-Length Grams
• Data set 1
[Figures: construction time/size; query time]

Effect of qmax
• Data set 1
[Figures: construction time / query time; query performance]

Effect of Frequency Threshold
• Data set 1
[Figures: index size; query time; construction time]

Conclusion
• We developed VGRAM to improve the performance of approximate string queries
  • Variable-length grams, high-quality grams
• We gave a full specification of the technique
  • Index structure
  • How to generate grams for a string using the index structure
  • Relationship between the similarity of two strings and the similarity of their grams
• We showed how to adopt this technique in a variety of existing algorithms
