Efficient Approximate Search on String Collections Part II


Outline

- Motivation and preliminaries
- Inverted list based algorithms
- Gram Signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions

N-Gram Signatures

- Use string signatures that upper bound similarity
- Use signatures as filtering step
- Properties:
- Signature has to have small size
- Signature verification must be fast
- False positives/False negatives

- Signatures have to be “indexable”

Known signatures

- Minhash
- Jaccard, Edit distance

- Prefix filter (CGK06)
- Jaccard, Edit distance

- PartEnum (AGK06)
- Hamming, Jaccard, Edit distance

- LSH (GIM99)
- Jaccard, Edit distance

- Mismatch filter (XWL08)
- Edit distance

Prefix Filter

- Bit vectors:
[Figure: bit vectors of q and s over a universe of 14 n-grams, numbered 1 to 14]
- Mismatch vector:
s: matches 6, missing 2, extra 2

- If |s ∩ q| ≥ 6 then ∀ s’ ⊆ s s.t. |s’| ≥ 3, |s’ ∩ q| ≥ 1
- For at least k matches, |s’| = l - k + 1, where l = |s|

Using Prefixes

- Take a random permutation of n-gram universe:
[Figure: permuted universe 11, 14, 8, 2, 3, 4, 5, 10, 12, 6, 9, 1, 7, 13, with the bit vectors of q and s]
- Take prefixes from both sets:
- |s’| = |q’| = 3; if |s ∩ q| ≥ 6 then s’ ∩ q’ ≠ ∅
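The prefix argument can be checked mechanically. The sketch below is a minimal illustration, assuming n-grams are represented by integer ids; the names `make_order`, `prefix`, and `may_share_k`, the seed, and the example sets are hypothetical and not taken from the slides.

```python
import random

def make_order(universe, seed=0):
    """Random permutation of the n-gram universe, returned as gram -> rank."""
    perm = list(universe)
    random.Random(seed).shuffle(perm)   # any fixed permutation works
    return {g: rank for rank, g in enumerate(perm)}

def prefix(grams, order, k):
    """First |grams| - k + 1 n-grams of the set under the global order."""
    size = max(len(grams) - k + 1, 0)
    return set(sorted(grams, key=order.__getitem__)[:size])

def may_share_k(q, s, order, k):
    """Filter step: False proves |q ∩ s| < k; True means the pair must be verified."""
    return bool(prefix(q, order, k) & prefix(s, order, k))

order = make_order(range(1, 15))        # 14 n-gram ids, as in the slides
q = {4, 5, 7, 8, 9, 10, 12, 13}         # hypothetical query set
s = {4, 5, 7, 8, 9, 10, 1, 2}           # 6 matches, 2 missing, 2 extra
print(may_share_k(q, s, order, k=6))    # True: the two prefixes of size 3 must intersect
```

The filter can return True for pairs that share fewer than k n-grams, so a verification step on the full sets is still needed.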

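Prefix Filter for Weighted Sets

- For example:
[Figure: weight vectors of q and s over the n-grams t1, t2, t4, t6, t8, t11, t14, with entries w1, w2, w4 or 0; s is split into a prefix s’ of weight w(s) - α and a suffix s/s’ of weight α]
- Order n-grams by weight (new coordinate space): w1, w2, …, w14
- Query: w(q ∩ s) = Σ_{i ∈ q ∩ s} wi ≥ τ
- Keep prefix s’ s.t. w(s’) ≥ w(s) - α
- Best case: w(q/q’ ∩ s/s’) = α
- Hence, we need w(q’ ∩ s’) ≥ τ - α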

Prefix Filter Properties

- The larger we make α, the smaller the prefix
- The larger we make α, the smaller the range of thresholds we can support:
- Because τ ≥ α; otherwise τ - α is negative.

- We need to pre-specify minimum τ
- Can apply to Jaccard, Edit Distance, IDF

Other Signatures

- Minhash (still to come)
- PartEnum:
- Upper bounds Hamming
- Select multiple subsets instead of one prefix
- Larger signature, but stronger guarantee

- LSH:
- Probabilistic with guarantees
- Based on hashing

- Mismatch filter:
- Use positionally mismatching n-grams within the prefix to obtain a lower bound on edit distance

Signature Indexing

- Straightforward solution:
- Create an inverted index on signature n-grams
- Merge inverted lists to compute signature intersections
- For a given string q:
- Access only lists in q’
- Find strings s with w(q’ ∩ s’) ≥ τ - α
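The straightforward solution above can be sketched as follows. This is a minimal illustration under stated assumptions: strings are given as sets of n-grams, `w` maps every n-gram to a positive weight, ties in weight are broken by the n-gram itself so that all sets share one global order, and τ > α. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def weighted_prefix(grams, w, alpha):
    """Prefix s' under a fixed global order with w(s') >= w(s) - alpha."""
    ordered = sorted(grams, key=lambda g: (-w[g], g))   # same order for all sets
    remaining = sum(w[g] for g in ordered)              # weight not yet in the prefix
    kept = []
    for g in ordered:
        if remaining <= alpha:                          # suffix is light enough
            break
        kept.append(g)
        remaining -= w[g]
    return kept

def build_index(db, w, alpha):
    """Inverted index over prefix signatures only: n-gram -> string ids."""
    index = defaultdict(list)
    for sid, grams in db.items():
        for g in weighted_prefix(grams, w, alpha):
            index[g].append(sid)
    return index

def search(q, db, index, w, alpha, tau):
    """Ids of strings s with w(q ∩ s) >= tau (assumes tau > alpha)."""
    overlap = defaultdict(int)
    for g in weighted_prefix(q, w, alpha):              # access only lists in q'
        for sid in index.get(g, ()):
            overlap[sid] += w[g]                        # accumulates w(q' ∩ s')
    candidates = [sid for sid, o in overlap.items() if o >= tau - alpha]
    return sorted(sid for sid in candidates
                  if sum(w[g] for g in q & db[sid]) >= tau)   # verification

w = {"at": 4, "t&": 5, "&t": 5, "tt": 3, "la": 2, "ab": 2, "bs": 1}
db = {0: {"at", "t&", "&t", "la", "ab", "bs"},
      1: {"at", "tt", "la", "ab", "bs"},
      2: {"la", "ab", "bs"}}
index = build_index(db, w, alpha=2)
print(search({"at", "t&", "&t", "tt"}, db, index, w, alpha=2, tau=9))   # [0]
```

String 1 survives the filter (its prefix overlap is exactly τ - α) but is rejected at verification, which shows why the signature step only upper bounds the similarity.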

The Inverted Signature Hashtable (CCGX08)

- Maintain a signature vector for every n-gram
- Consider prefix signatures for simplicity:
- s’1={ ‘tt ’, ‘t L’}, s’2={‘t&t’, ‘t L’}, s’3=…
- co-occurrence lists: ‘t L’: ‘tt ’ ‘t&t’ …
‘&tt’: ‘t L’ …

- Hash all n-grams (h: n-gram → [0, m])
- Convert co-occurrence lists to bit-vectors of size m

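Signatures

| Signature | n-grams |
|---|---|
| s’1 | at&, la |
| s’2 | t&t, at& |
| s’3 | t L, at& |
| s’4 | abo, t&t |
| s’5 | t&t, la |
| … | … |

Hash values

| n-gram | h |
|---|---|
| lab | 5 |
| at& | 4 |
| t&t | 5 |
| t L | 1 |
| la | 0 |

Hashtable

| n-gram | Bit-vector |
|---|---|
| at& | 100011 |
| t&t | 010101 |
| lab | … |
| t L | … |
| la | … |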
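The bit-vectors in the hashtable can be reproduced from the signatures and hash values on this slide. The sketch below is an illustration only: the hash values for lab, at&, t&t, t L and la are read from the figure, while `abo` → 2 is an inferred value (it is the only choice that yields 010101 for t&t). Bit i counts from the right, and the function name is hypothetical.

```python
def cooccurrence_bitvectors(signatures, h, m):
    """For each n-gram, set bit h[g2] for every g2 that shares a signature with it."""
    vectors = {}
    for sig in signatures:
        for g in sig:
            for other in sig:
                if other != g:
                    vectors[g] = vectors.get(g, 0) | (1 << h[other])
    return {g: format(v, f"0{m}b") for g, v in vectors.items()}

# Hash values from the slide; "abo" -> 2 is an assumption inferred from the figure.
h = {"lab": 5, "at&": 4, "t&t": 5, "t L": 1, " la": 0, "abo": 2}
signatures = [{"at&", " la"}, {"t&t", "at&"}, {"t L", "at&"},
              {"abo", "t&t"}, {"t&t", " la"}]

table = cooccurrence_bitvectors(signatures, h, m=6)
print(table["at&"], table["t&t"])   # 100011 010101
```

The vector for ‘at&’ has bits 0, 1 and 5 set because ‘at&’ co-occurs with ‘ la’, ‘t L’ and ‘t&t’ in some signature.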

Using the Hashtable?

- Let list ‘at&’ correspond to bit-vector 100011
- There exists string s s.t. ‘at&’ ∈ s’ and s’ also contains some n-grams that hash to 0, 1, or 5

- Given query q:
- Construct query signature matrix:
[Figure: example 0/1 query signature matrix over the n-grams at&, lab, t&t, res, …, with the row set r and the column set p marked]
- Consider only solid sub-matrices P: r ⊆ q’, p ⊆ q
- We need to look only at r ⊆ q’ such that w(r) ≥ τ - α and w(p) ≥ τ

Verification

- How do we find which strings correspond to a given sub-matrix?
- Create an inverted index on string n-grams
- Examine only lists in r and strings with w(s) ≥ τ
- Remember that r ⊆ q’

- Can be used with other signatures as well

Outline

- Motivation and preliminaries
- Inverted list based algorithms
- Gram Signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions

Length Normalized Measures

- What is normalization?
- Normalize similarity scores by the length of the strings.
- Can result in more meaningful matches.

- Can use L0 (i.e., the length of the string), L1, L2, etc.
- For example L2:
- Let w2(s) = Σ_{t ∈ s} w(t)^2
- Weight can be IDF, unary, language model, etc.
- ||s||2 = w2(s)^(1/2)

The L2-Length Filter (HCKS08)

- Why L2?
- For almost exact matches.
- Two strings match only if:
- They have very similar n-gram sets, and hence L2 lengths
- The “extra” n-grams have truly insignificant weights in aggregate (hence, resulting in similar L2 lengths).

Example

- “AT&T Labs – Research” L2=100
- “ATT Labs – Research” L2=95
- “AT&T Labs” L2=70
- If “Research” happened to be very popular and had small weight?

- “The Dark Knight” L2=75
- “Dark Night” L2=72

Why L2 (continued)

- Tight L2-based length filtering will result in very efficient pruning.
- L2 yields scores bounded within [0, 1]:
- 1 means a truly perfect match.
- Easier to interpret scores.
- L0 and L1 do not have the same properties
- Scores are bounded only by the largest string length in the database.
- For L0 an exact match can have score smaller than a non-exact match!

Example

- q={‘ATT’, ‘TT ’, ‘T L’, ‘LAB’, ‘ABS’} L0=5
- s1={‘ATT’} L0=1
- s2=q L0=5
- S(q, s1) = Σ w(q ∩ s1) / (||q||0 ||s1||0) = 10/5 = 2
- S(q, s2) = Σ w(q ∩ s2) / (||q||0 ||s2||0) = 40/25 < 2

Problems

- L2 normalization poses challenges.
- For example:
- S(q, s) = w2(q ∩ s) / (||q||2 ||s||2)
- Prefix filter cannot be applied.
- Minimum prefix weight α?
- Value depends both on ||s||2 and ||q||2.
- But ||q||2 is unknown at index construction time


Important L2 Properties

- Length filtering:
- For S(q, s) ≥ τ
- τ ||q||2 ≤ ||s||2 ≤ ||q||2 / τ
- We are only looking for strings within these lengths.
- Proof in paper

- Monotonicity …
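The length-filtering property can be illustrated with a few lines of code. This is a sketch under assumptions: strings are sets of 2-grams, the weight is an IDF-style value log(1 + N/df) chosen only for the example, and the helper names are hypothetical. The bound follows because w2(q ∩ s) is at most min(||q||2, ||s||2) squared.

```python
import math

def grams(s, n=2):
    """Set of n-grams of a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def l2(s, w):
    """||s||_2: square root of the sum of squared n-gram weights."""
    return math.sqrt(sum(w[t] ** 2 for t in s))

def sim(q, s, w):
    """S(q, s) = w2(q ∩ s) / (||q||_2 ||s||_2)."""
    return sum(w[t] ** 2 for t in q & s) / (l2(q, w) * l2(s, w))

def length_bounds(q, w, tau):
    """Only strings with tau*||q|| <= ||s|| <= ||q||/tau can reach S(q, s) >= tau."""
    nq = l2(q, w)
    return tau * nq, nq / tau

words = ["rich", "stick", "stich", "stuck", "static"]
db = {i: grams(x) for i, x in enumerate(words)}
df = {}
for g in db.values():
    for t in g:
        df[t] = df.get(t, 0) + 1
w = {t: math.log(1 + len(db) / c) for t, c in df.items()}   # assumed IDF-style weight

q = grams("stick")
lo, hi = length_bounds(q, w, tau=0.8)
survivors = [i for i, g in db.items() if lo <= l2(g, w) <= hi]
print(survivors, [i for i in survivors if sim(q, db[i], w) >= 0.8])
```

Strings outside the length window are never scored, so the tighter the threshold τ, the fewer candidates remain.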

Monotonicity

- Let s={t1, t2, …, tm}.
- Let pw(s, t) = w(t) / ||s||2 (partial weight of s)
- Then: S(q, s) = Σ_{t ∈ q ∩ s} w(t)^2 / (||q||2 ||s||2) = Σ_{t ∈ q ∩ s} pw(s, t) pw(q, t)

- If pw(s, t) > pw(r, t):
- w(t)/||s||2 > w(t)/||r||2 ⇒ ||s||2 < ||r||2

- Hence, for any t’ ≠ t:
- w(t’)/||s||2 > w(t’)/||r||2 ⇒ pw(s, t’) > pw(r, t’)

Indexing

- Use inverted lists sorted by pw():
[Figure: inverted lists over the 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc for the strings 0: rich, 1: stick, 2: stich, 3: stuck, 4: static]

- pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic)
- ||0||2 < ||4||2 < ||1||2 < ||2||2

L2 Length Filter

- Given q and τ, and using length filtering:
[Figure: the sorted inverted lists over the 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc]

- We examine only a small fraction of the lists
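One way to realize "examine only a small fraction of the lists" is to keep each inverted list sorted by L2 length (equivalently, by decreasing partial weight) and binary-search the length window. The sketch below assumes unit weights and queries that contain only n-grams present in the collection; `build` and `query` are hypothetical names, not the authors' implementation.

```python
import bisect
import math
from collections import defaultdict

def build(db, w):
    """Inverted lists of (||s||_2, id), sorted by length, i.e. by decreasing pw(s, t)."""
    norm = {sid: math.sqrt(sum(w[t] ** 2 for t in g)) for sid, g in db.items()}
    lists = defaultdict(list)
    for sid, g in db.items():
        for t in g:
            lists[t].append((norm[sid], sid))
    for lst in lists.values():
        lst.sort()
    return lists, norm

def query(q, lists, norm, w, tau):
    """Ids with S(q, s) >= tau, scanning only the length window of each list."""
    nq = math.sqrt(sum(w[t] ** 2 for t in q))
    lo, hi = tau * nq, nq / tau
    acc = defaultdict(float)
    for t in q:
        lst = lists.get(t, [])
        start = bisect.bisect_left(lst, (lo,))             # skip strings that are too short
        stop = bisect.bisect_right(lst, (hi, math.inf))    # stop before strings that are too long
        for _, sid in lst[start:stop]:
            acc[sid] += w[t] ** 2
    return sorted(sid for sid, a in acc.items() if a / (nq * norm[sid]) >= tau)

words = ["rich", "stick", "stich", "stuck", "static"]
db = {i: {x[j:j + 2] for j in range(len(x) - 1)} for i, x in enumerate(words)}
w = {t: 1.0 for g in db.values() for t in g}    # unit weights, an assumption
lists, norm = build(db, w)
print(query(db[1], lists, norm, w, tau=0.7))    # [1, 2]: stick and stich
```

With real weights the windows are usually much narrower than in this toy collection, where all five strings have similar lengths.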

Monotonicity

- If I have seen 1 already, then 4 is not in the list:
[Figure: sorted inverted lists over the 2-grams ch, ck, ic, ri, st, ta, ti, tu, uc]

Other Improvements

- Use properties of weighting scheme
- Scan high weight lists first
- Prune according to string length and maximum potential score
- Ignore low weight lists altogether

Conclusion

- Concepts can be extended easily for:
- BM25
- Weighted Jaccard
- DICE
- IDF

- Take away message:
- Properties of similarity/distance function can play big role in designing very fast indexes.
- L2 super fast for almost exact matches

Outline

- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length-normalized measures
- Selectivity estimation
- Conclusion and future directions

The Problem

- Estimate the number of strings with:
- Edit distance smaller than k from query q
- Cosine similarity higher than τ to query q
- Jaccard, Hamming, etc…

- Issues:
- Estimation accuracy
- Size of estimator
- Cost of estimation

Motivation

- Query optimization:
- Selectivity of query predicates
- Need to support selectivity of approximate string predicates

- Visualization/Querying:
- Expected result set size helps with visualization
- Result set size important for remote query processing

Flavors

- Edit distance:
- Based on clustering (JL05)
- Based on min-hash (MBKS07)
- Based on wild-card n-grams (LNS07)

- Cosine similarity:
- Based on sampling (HYKS08)

Selectivity Estimation for Edit Distance

- Problem:
- Given query string q
- Estimate number of strings s ∈ D
- Such that ed(q, s) ≤ δ

Sepia - Clustering (JL05, JLV08)

- Partition strings using clustering:
- Enables pruning of whole clusters

- Store per cluster histograms:
- Number of strings within edit distance 0,1,…,δ from the cluster center

- Compute global dataset statistics:
- Use a training query set to compute frequency of strings within edit distance 0,1,…,δ from each query

Edit Vectors

- Edit distance is not discriminative:
- Use Edit Vectors
- 3D space vs 1D space

[Figure: cluster Ci with pivot pi and query q; the strings Luciano, Lukas, Lucia and Lucas are annotated with edit vectors and distances: <2,0,0> (2), <1,1,1> (3), <1,1,0> (2)]

Visually...

[Figure: clusters C1, C2, …, Cn with pivots p1, p2, …, pn. Each cluster Ci keeps a frequency table Fi mapping an edit vector to a count (#), e.g. F1: <0,0,0> 4, <0,0,1> 12, <1,0,2> 7; F2: <0,0,0> 3, <0,1,0> 40, <1,0,1> 6; Fn: <0,0,0> 2, <1,0,2> 84, <1,1,1> 1. A single global table maps (v(q,pi), v(pi,s), ed(q,s)) to a count (#) and a cumulative percentage (%)]

Selectivity Estimation

- Use triangle inequality:
- Compute edit vector v(q,pi) for all clusters i
- If |v(q,pi)| > ri+δ disregard cluster Ci
[Figure: cluster with pivot pi and radius ri, and query q with radius δ]
- For all entries in frequency table:
- If |v(q,pi)| + |v(pi,s)| ≤ δ then ed(q,s) ≤ δ for all s
- If ||v(q,pi)| - |v(pi,s)|| > δ ignore these strings
- Else use global table:
- Lookup entry <v(q,pi), v(pi,s), δ> in global table
- Use the estimated fraction of strings

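Example

Frequency table F1:

| Edit Vector | # |
|---|---|
| <0, 0, 0> | 4 |
| <0, 0, 1> | 12 |
| <1, 0, 2> | 7 |
| … | … |

Global Table:

| v(q,pi) | v(pi,s) | ed(q,s) | # | % |
|---|---|---|---|---|
| <1, 0, 1> | <0, 0, 1> | 1 | 1 | 14 |
| <1, 0, 1> | <0, 0, 1> | 2 | 4 | 57 |
| <1, 0, 1> | <0, 0, 1> | 3 | 7 | 100 |
| … | … | … | … | … |
| <1, 1, 0> | <1, 0, 2> | 3 | 21 | 25 |
| <1, 1, 0> | <1, 0, 2> | 4 | 63 | 75 |
| <1, 1, 0> | <1, 0, 2> | 5 | 84 | 100 |
| … | … | … | … | … |

- δ = 3
- v(q,p1) = <1,1,0>, v(p1,s) = <1,0,2>
- Global lookup: [<1,1,0>, <1,0,2>, 3]
- Fraction is 25%, so the estimated contribution is 25% x 7 = 1.75 strings
- Iterate through F1, and add up contributions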
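The per-cluster estimation loop can be sketched as below. It is an illustration, not the Sepia implementation: edit vectors are tuples whose component sum is taken as the edit distance, only the visible rows of F1 and one global-table row from the example are used, and a missing global-table entry is treated as a fraction of 0 (an assumption). The function name is hypothetical.

```python
def estimate_cluster(vq, freq_table, global_table, delta):
    """Estimated number of strings in one cluster within edit distance delta of q.

    vq           edit vector v(q, p_i) from the query to the cluster pivot
    freq_table   {v(p_i, s): number of strings in the cluster with that vector}
    global_table {(v(q, p), v(p, s), delta): fraction of strings with ed(q, s) <= delta}
    """
    dq = sum(vq)                       # |v(q, p_i)| = ed(q, p_i)
    total = 0.0
    for vs, count in freq_table.items():
        ds = sum(vs)
        if dq + ds <= delta:           # triangle inequality: all of them qualify
            total += count
        elif abs(dq - ds) > delta:     # triangle inequality: none can qualify
            continue
        else:                          # undecided: fall back to the trained statistics
            total += global_table.get((vq, vs, delta), 0.0) * count
    return total

F1 = {(0, 0, 0): 4, (0, 0, 1): 12, (1, 0, 2): 7}       # visible rows of F1 only
G = {((1, 1, 0), (1, 0, 2), 3): 0.25}                  # the row used in the example
print(estimate_cluster((1, 1, 0), F1, G, delta=3))     # 4 + 12 + 0.25 * 7 = 17.75
```

The final estimate is the sum of this quantity over all clusters that survive the radius test |v(q,pi)| ≤ ri + δ.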

Cons

- Hard to maintain if clusters start drifting
- Hard to find good number of clusters
- Space/Time tradeoffs

- Needs training to construct good dataset statistics table

VSol – minhash (MBKS07)

- Solution based on minhash
- minhash is used to:
- Estimate the size of a set |s|
- Estimate resemblance of two sets
- I.e., estimating J = |s1 ∩ s2| / |s1 ∪ s2|

- Estimate the size of the union |s1 ∪ s2|
- Hence, estimating the size of the intersection
- |s1 ∩ s2| ≈ J~(s1, s2) · ∪~(s1, s2), where ~ marks a minhash estimate of the resemblance and of the union size

Minhash

- Given a set s = {t1, …, tm}
- Use independent hash functions h1, …, hk:
- hi: n-gram → [0, 1]

- Hash elements of s, k times
- Keep the k elements that hashed to the smallest value each time
- We reduced set s, from m to k elements
- Denote minhash signature with s’

How to use minhash

- Given two signatures q’, s’:
- J(q, s) ≈ Σ_{1≤i≤k} I{q’[i] = s’[i]} / k
- |s| ≈ (k / Σ_{1≤i≤k} s’[i]) - 1
- (q ∪ s)’ = q’ ∪ s’ = min_{1≤i≤k}(q’[i], s’[i])
- Hence:
- |q ∪ s| ≈ (k / Σ_{1≤i≤k} (q ∪ s)’[i]) - 1
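The three estimators above can be tried out with a short sketch. Assumptions: the signature stores the minimum hash value per hash function (rather than the element itself), and salted BLAKE2 digests stand in for k independent hash functions into [0, 1). All names and the example sets are hypothetical.

```python
import hashlib

def h(i, x):
    """i-th hash function, mapping element x to [0, 1) (salted BLAKE2 as a stand-in)."""
    d = hashlib.blake2b(repr(x).encode(), digest_size=8,
                        salt=i.to_bytes(8, "little")).digest()
    return int.from_bytes(d, "big") / 2 ** 64

def minhash(s, k):
    """Signature s': for each of the k hash functions, the minimum hash value over s."""
    return [min(h(i, x) for x in s) for i in range(k)]

def jaccard(a, b):
    """Fraction of positions on which two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def size(sig):
    """|s| ≈ k / (sum of the minimum hash values) - 1."""
    return len(sig) / sum(sig) - 1

def union(a, b):
    """Signature of the union: position-wise minimum."""
    return [min(x, y) for x, y in zip(a, b)]

k = 256
a, b = set(range(300)), set(range(100, 400))     # true J = 0.5, |a ∪ b| = 400
sa, sb = minhash(a, k), minhash(b, k)
j_est = jaccard(sa, sb)
u_est = size(union(sa, sb))
print(round(j_est, 2), round(u_est), round(j_est * u_est))   # estimates of J, |a ∪ b|, |a ∩ b|
```

The estimates fluctuate around the true values with a relative error on the order of 1/√k, so larger signatures trade space for accuracy.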

VSol Estimator

- Construct one inverted list per n-gram in D
- The lists are our sets

- Compute a minhash signature for each list
[Figure: inverted lists for the n-grams t1, t2, …, t10, each summarized by a minhash signature]

Selectivity Estimation

- Use edit distance length filter:
- If ed(q, s) ≤ δ, then q and s share at least L = |s| - 1 - n (δ-1) n-grams
- Given query q = {t1, …, tm}:
- Answer is the size of the union of all non-empty L-intersections (binomial coefficient: m choose L)
- We can estimate sizes of L-intersections using minhash signatures

Example

[Figure: inverted lists for the n-grams t1, t2, …, t10]
- δ = 2, n = 3 ⇒ L = 6
- Look at all 6-intersections of inverted lists
- A = |∪_{i1, ..., i6 ∈ [1,10]} (ti1 ∩ ti2 ∩ … ∩ ti6)|
- There are (10 choose 6) such terms

The m-L Similarity

- Can be done efficiently using minhashes
- Answer:
- ρ = Σ_{1≤j≤k} I{∃ i1, …, iL: ti1’[j] = … = tiL’[j]}
- A ≈ ρ · |t1 ∪ … ∪ tm|

- Proof very similar to the proof for minhashes

Cons

- Will overestimate results
- Many L-intersections will share strings
- Edit distance length filter is loose

OptEQ – wild-card n-grams (LNS07)

- Use extended n-grams:
- Introduce wild-card symbol ‘?’
- E.g., “ab?” can be:
- “aba”, “abb”, “abc”, …

- Build an extended n-gram table:
- Extract all 1-grams, 2-grams, …, n-grams
- Generalize to extended 2-grams, …, n-grams
- Maintain an extended n-grams/frequency hashtable

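Example

Dataset strings: abc, def, ghi, …

| n-gram | Frequency |
|---|---|
| ab | 10 |
| bc | 15 |
| de | 4 |
| ef | 1 |
| gh | 21 |
| hi | 2 |
| … | … |
| ?b | 13 |
| a? | 17 |
| ?c | 23 |
| … | … |
| abc | 5 |
| def | 2 |
| … | … |

Query Expansion (Replacements only)

- Given query q=“abcd”
- δ=2
- And replacements only:
- Base strings:
- “??cd”, “?b?d”, “?bc?”, “a??d”, “a?c?”, “ab??”

- Query answer:
- S1={s ∈ D: s matches “??cd”}, S2=…
- A = |S1 ∪ S2 ∪ S3 ∪ S4 ∪ S5 ∪ S6| = Σ_{1≤n≤6} (-1)^(n-1) |S1 ∩ … ∩ Sn|, where |S1 ∩ … ∩ Sn| stands for the total size of all n-way intersections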
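The expansion and the inclusion-exclusion sum can be checked on a toy dataset. This sketch scans the data directly to obtain each Si, whereas the slides obtain the sizes from the extended n-gram table; the function names and the dataset are hypothetical.

```python
from itertools import combinations

def base_strings(q, delta):
    """All patterns obtained by replacing exactly delta positions of q with '?'."""
    return ["".join("?" if i in pos else c for i, c in enumerate(q))
            for pos in combinations(range(len(q)), delta)]

def matches(s, pattern):
    """True if s has the pattern's length and agrees with it outside the wild-cards."""
    return len(s) == len(pattern) and all(p in ("?", c) for c, p in zip(s, pattern))

def answer_size(q, delta, data):
    """|S1 ∪ ... ∪ Sm| computed by inclusion-exclusion over all n-way intersections."""
    sets = [{s for s in data if matches(s, p)} for p in base_strings(q, delta)]
    total = 0
    for n in range(1, len(sets) + 1):
        for combo in combinations(sets, n):
            total += (-1) ** (n - 1) * len(set.intersection(*combo))
    return total

data = ["abcd", "abcf", "xbcd", "xycd", "xyzd", "abxy", "wxyz", "abc"]
print(base_strings("abcd", 2))      # ['??cd', '?b?d', '?bc?', 'a??d', 'a?c?', 'ab??']
print(answer_size("abcd", 2, data)) # 5: abcd, abcf, xbcd, xycd, abxy
```

The loop visits every subset of the base strings, which is the exponential blow-up that the replacement lattice is designed to avoid.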

Replacement Intersection Lattice

A = Σ_{1≤n≤6} (-1)^(n-1) |S1 ∩ … ∩ Sn|

- Need to evaluate size of all 2-intersections, 3-intersections, …, 6-intersections
- Then, use n-gram table to compute sum A
- Exponential number of intersections
- But ... there is well-defined structure

Replacement Lattice

- Build replacement lattice:
[Figure: lattice levels by number of wild-cards. 2 ‘?’: ?b?d, a??d, ab??, ??cd, ?bc?, a?c?; 1 ‘?’: ?bcd, a?cd, abc?; 0 ‘?’: abcd]
- Many intersections are empty
- Others produce the same results
- We need to count everything only once

General Formulas

- Similar reasoning for:
- r replacements
- d deletions

- Other combinations difficult:
- Multiple insertions
- Combinations of insertions/replacements

- But … we can generate the corresponding lattice algorithmically!
- Expensive but possible

BasicEQ

- Partition strings by length:
- Query q with length l
- Possible matching strings with lengths:
- [l-δ, l+δ]

- For k = l-δ to l+δ
- Find all combinations of i+d+r = δ and l+i-d=k
- If (i,d,r) is a special case use formula
- Else generate lattice incrementally:
- Start from query base strings (easy to generate)
- Begin with 2-intersections and build from there

OptEQ

- Details are cumbersome
- Left for homework

- Various optimizations possible to reduce complexity

Cons

- Fairly complicated implementation
- Expensive
- Works for small edit distance only

Hashed Sampling (HYKS08)

- Used to estimate selectivity of TF/IDF, BM25, DICE (vector space model)
- Main idea:
- Take a sample of the inverted index
- But do it intelligently to improve variance

Example

- Take a sample of the inverted index
[Figure: inverted lists over the 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc, with a sample of their entries]

Example (Cont.)

- But do it intelligently to improve variance
[Figure: the same inverted lists with a hash-based sample, in which the sampled ids appear in every list that contains them]

Construction

- Draw samples deterministically:
- Use a hash function h: N → [0, 100]
- Keep ids that hash to values smaller than σ

- Invariant:
- If a given id is sampled in one list, it will always be sampled in all other lists that contain it:
- S(q, s) can be computed directly from the sample
- No need to store complete sets in the sample
- No need for extra I/O to compute scores
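The invariant can be demonstrated with a small sketch. Assumptions: a BLAKE2 digest scaled to [0, 100) plays the role of h, the score is a plain count of shared n-grams rather than TF/IDF or BM25, and the inverted lists are a hypothetical fragment of the rich/stick/stich/stuck/static example.

```python
import hashlib

def h(sid):
    """Deterministic hash of a string id into [0, 100) (BLAKE2 as a stand-in)."""
    d = hashlib.blake2b(str(sid).encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") * 100 / 2 ** 64

def sample_index(lists, sigma):
    """Keep an id, in every list that contains it, iff h(id) < sigma."""
    return {t: [sid for sid in ids if h(sid) < sigma] for t, ids in lists.items()}

def overlap(q, lists):
    """Number of query n-grams each string shares with q (an unnormalized score)."""
    score = {}
    for t in q:
        for sid in lists.get(t, []):
            score[sid] = score.get(sid, 0) + 1
    return score

lists = {"ic": [0, 1, 2, 4], "st": [1, 2, 3, 4], "ti": [1, 2, 4],
         "ch": [0, 2], "ck": [1, 3]}
sigma = 50
sampled = sample_index(lists, sigma)
q = ["st", "ti", "ic", "ck"]                      # the 2-grams of "stick"
full, part = overlap(q, lists), overlap(q, sampled)
print(all(part[sid] == full[sid] for sid in part))   # True: sampled scores are exact
```

Because an id is either kept in all of its lists or dropped from all of them, every score computed on the sample equals the score on the full index.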


Selectivity Estimation

- The union of arbitrary list samples is a σ% sample
- Given query q = {t1, …, tm}:
- A = |Aσ| · |t1 ∪ … ∪ tm| / |tσ1 ∪ … ∪ tσm|:
- Aσ is the query answer size from the sample
- The fraction is the actual scale-up factor
- But there are duplicates in these unions!

- We need to know:
- The distinct number of ids in t1 ∪ … ∪ tm
- The distinct number of ids in tσ1 ∪ … ∪ tσm

Count Distinct

- Distinct |tσ1 ∪ … ∪ tσm| is easy:
- Scan the sampled lists

- Distinct |t1 ∪ … ∪ tm| is hard:
- Scanning the lists is the same as computing the exact answer to the query … naively
- We are lucky:
- Each list sample doubles up as a k-minimum value estimator by construction!
- We can use the list samples to estimate the distinct |t1 ∪ … ∪ tm|

The k-Minimum Value Synopsis

- It is used to estimate the distinct size of arbitrary set unions (the same as FM sketch):
- Take hash function h: N → [0, 100]
- Hash each element of the set
- The r-th smallest hash value hr yields an unbiased estimator of count distinct:
[Figure: the interval [0, 100] with the r-th smallest hash value hr marked; hr corresponds to r elements, 100 corresponds to ? elements]
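A k-minimum value synopsis can be sketched in a few lines. The slide's own formula did not survive extraction, so the estimator below, (r - 1) · 100 / hr, is the commonly used unbiased form and should be read as an assumption. BLAKE2 stands in for h, and all names are hypothetical.

```python
import hashlib

def h(x):
    """Deterministic hash of an element into [0, 100) (BLAKE2 as a stand-in)."""
    d = hashlib.blake2b(repr(x).encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") * 100 / 2 ** 64

def kmv(items, r):
    """Synopsis: the r smallest distinct hash values of the set."""
    return sorted({h(x) for x in items})[:r]

def merge(a, b, r):
    """Synopsis of the union, built from the two synopses alone."""
    return sorted(set(a) | set(b))[:r]

def estimate(syn, r):
    """Estimated number of distinct elements behind a synopsis."""
    if len(syn) < r:
        return float(len(syn))           # fewer than r distinct values: exact count
    return (r - 1) * 100 / syn[r - 1]    # assumed standard KMV estimator

r = 256
A, B = range(0, 3000), range(2000, 5000)             # 5000 distinct ids in the union
est = estimate(merge(kmv(A, r), kmv(B, r), r), r)
print(round(est))                                    # an estimate of 5000
```

A hashed sample that keeps every id with h(id) < σ contains exactly the smallest hash values of its list, which is why each list sample can double as such a synopsis.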

Outline

- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions

Future Directions

- Result ranking
- In practice need to run multiple types of searches
- Need to identify the “best” results

- Diversity of query results
- Some queries have multiple meanings
- E.g., “Jaguar”

- Updates
- Incremental maintenance

References

- [AGK06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006
- [BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu, ICDE 2009
- [HCK+08] Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava: Fast Indexes and Algorithms for Set Similarity Selection Queries. ICDE 2008
- [HYK+08] Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava: Hashed samples: selectivity estimators for set similarity selection queries. PVLDB 2008.
- [JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets, Liang Jin, and Chen Li. VLDB 2005.
- [KSS06] Record linkage: Similarity measures and algorithms. Nick Koudas, Sunita Sarawagi, and Divesh Srivastava. SIGMOD 2006.
- [LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches, Chen Li, Jiaheng Lu, and Yiming Lu. ICDE 2008.
- [LNS07] Hongrae Lee, Raymond T. Ng, Kyuseok Shim: Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. VLDB 2007
- [LWY07] VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams, Chen Li, Bin Wang, and Xiaochun Yang. VLDB 2007
- [MBK+07] Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava: Estimating the selectivity of approximate string queries. ACM TODS 2007
- [XWL08] Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 2008

References

- [XWL+08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient similarity joins for near duplicate detection. WWW 2008.
- [YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently, Xiaochun Yang, Bin Wang, and Chen Li, SIGMOD 2008
- [JLV08] L. Jin, C. Li, R. Vernica: SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases, VLDBJ08
- [CGK06] S. Chaudhuri, V. Ganti, R. Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning, ICDE06
- [CCGX08] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin: An Efficient Filter for Approximate Membership Checking, SIGMOD08
- [SK04] Sunita Sarawagi, Alok Kirpal: Efficient set joins on similarity predicates. SIGMOD Conference 2004: 743-754
- [BK02] Jérémy Barbay, Claire Kenyon: Adaptive intersection and t-threshold problems. SODA 2002: 390-399
- [CGG+05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis: Data cleaning in microsoft SQL server 2005. SIGMOD Conference 2005: 918-920
