Efficient Approximate Search on String Collections, Part II

Efficient Approximate Search on String Collections, Part II

Marios Hadjieleftheriou

Chen Li


Outline

  • Motivation and preliminaries

  • Inverted list based algorithms

  • Gram Signature algorithms

  • Length normalized algorithms

  • Selectivity estimation

  • Conclusion and future directions


N-Gram Signatures

  • Use string signatures that upper bound similarity

  • Use signatures as filtering step

  • Properties:

    • Signature has to have small size

    • Signature verification must be fast

    • False positive/false negative guarantees

  • Signatures have to be “indexable”


Known signatures

  • Minhash

    • Jaccard, Edit distance

  • Prefix filter (CGK06)

    • Jaccard, Edit distance

  • PartEnum (AGK06)

    • Hamming, Jaccard, Edit distance

  • LSH (GIM99)

    • Jaccard, Edit distance

  • Mismatch filter (XWL08)

    • Edit distance


[Figure: the n-gram ids of query q and string s]

Prefix Filter

  • Bit vectors:

  • Mismatch vector:

    s: matches 6, missing 2, extra 2

  • If |sq|6 then s’s s.t. |s’|3, |s’q|

  • For at least k matches, |s’| = l - k + 1


Using Prefixes

  • Take a random permutation of n-gram universe:

  • Take prefixes from both sets:

    • |s’| = |q’| = 3: if |s ∩ q| ≥ 6 then s’ ∩ q’ ≠ ∅

[Figure: the n-gram ids of q and s reordered by the random permutation, with prefixes q’ and s’ highlighted]
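The prefix guarantee above is easy to state in code. A minimal Python sketch (the helper names, the fixed seed, and the toy universe are mine, not from the slides): every set keeps only its l − k + 1 smallest elements under one shared random order, and any two sets sharing at least k elements are guaranteed to have intersecting prefixes.

```python
import random

def make_permutation(universe, seed=42):
    """Fix one random total order over the n-gram universe, shared by all sets."""
    order = list(universe)
    random.Random(seed).shuffle(order)
    return {g: i for i, g in enumerate(order)}

def prefix(s, rank, k):
    """Keep the l - k + 1 smallest elements of s under the shared order (l = |s|)."""
    return set(sorted(s, key=rank.__getitem__)[: len(s) - k + 1])

# If |s ∩ q| >= k, the two prefixes must share at least one element, so an
# empty prefix intersection safely prunes the pair (no false negatives).
universe = [f"g{i}" for i in range(1, 15)]
rank = make_permutation(universe)
q, s, k = set(universe[:9]), set(universe[3:11]), 6
if prefix(q, rank, k) & prefix(s, rank, k):
    pass  # candidate pair: verify |q & s| >= k exactly
```

Survivors of the filter are only candidates; the exact intersection size is verified afterwards.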


[Figure: n-grams t1…t14 of q and s ordered by weight; the prefix s’ carries weight at least w(s) − α and the suffix s∖s’ at most α]

Prefix Filter for Weighted Sets

  • For example:

    • Order n-grams by weight (new coordinate space)

    • Query: w(qs)=Σiqswi  τ

    • Keep prefix s’ s.t. w(s’)  w(s) - α

    • Best case: w(q/q’s/s’) = α

    • Hence, we need w(q’s’) τ-α

w1 w2  …  w14


Prefix Filter Properties

  • The larger we make α, the smaller the prefix

  • The larger we make α, the smaller the range of thresholds we can support:

    • Because τα, otherwise τ-α is negative.

  • We need to pre-specify minimum τ

  • Can apply to Jaccard, Edit Distance, IDF


Other Signatures

  • Minhash (still to come)

  • PartEnum:

    • Upper bounds Hamming

    • Select multiple subsets instead of one prefix

    • Larger signature, but stronger guarantee

  • LSH:

    • Probabilistic with guarantees

    • Based on hashing

  • Mismatch filter:

    • Use positional mismatching n-grams within the prefix to attain lower bound of Edit Distance


Signature Indexing

  • Straightforward solution:

    • Create an inverted index on signature n-grams

    • Merge inverted lists to compute signature intersections

    • For a given string q:

      • Access only lists in q’

      • Find strings s with w(q’ ∩ s’) ≥ τ - α


The Inverted Signature Hashtable (CCGX08)

  • Maintain a signature vector for every n-gram

  • Consider prefix signatures for simplicity:

    • s’1={‘tt ’, ‘t L’}, s’2={‘t&t’, ‘t L’}, s’3=…

    • co-occurrence lists: ‘t L’: ‘tt ’ → ‘t&t’ → …

      ‘&tt’: ‘t L’ → …

  • Hash all n-grams (h: n-gram → [0, m])

  • Convert co-occurrence lists to bit-vectors of size m
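The construction of the bit-vectors can be sketched in a few lines of Python (the function name is mine, and Python's built-in `hash` stands in for the slide's hash function h):

```python
def build_isg(signatures, m):
    """For every signature n-gram, OR the hashed images of its co-occurring
    signature n-grams into one m-bit vector (the inverted signature hashtable)."""
    table = {}
    for sig in signatures:            # sig is one prefix signature (a set of n-grams)
        for g in sig:
            bits = table.get(g, 0)
            for other in sig:
                if other != g:
                    bits |= 1 << (hash(other) % m)
            table[g] = bits
    return table
```

A set bit only says "some co-occurring n-gram hashed here", so the table admits false positives but never misses a true co-occurrence.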


Example

[Figure: prefix signatures s’1…s’5, their n-grams hashed into [0, 5], and the resulting hashtable of co-occurrence bit-vectors, e.g. ‘at&’ → 100011 and ‘t&t’ → 010101]


[Figure: query signature matrix for q: one row per n-gram of q’ (its bit-vector from the hashtable), one column per n-gram of q; r and p mark a candidate sub-matrix]

Using the Hashtable?

  • Let list ‘at&’ correspond to bit-vector 100011

    • There exists a string s s.t. ‘at&’ ∈ s’ and s’ also contains some n-grams that hash to 0, 1, or 5

  • Given query q:

    • Construct the query signature matrix

    • Consider only solid sub-matrices P: r ⊆ q’, p ⊆ q

    • We need to look only at r ⊆ q’ such that w(r) ≥ τ − α and w(p) ≥ τ


Verification

  • How do we find which strings correspond to a given sub-matrix?

    • Create an inverted index on string n-grams

    • Examine only lists in r and strings with w(s) ≥ τ

      • Remember that r ⊆ q’

  • Can be used with other signatures as well


Outline

  • Motivation and preliminaries

  • Inverted list based algorithms

  • Gram Signature algorithms

  • Length normalized algorithms

  • Selectivity estimation

  • Conclusion and future directions


Length Normalized Measures

  • What is normalization?

    • Normalize similarity scores by the length of the strings.

      • Can result in more meaningful matches.

    • Can use L0 (i.e., the length of the string), L1, L2, etc.

    • For example L2:

      • Let w2(s) = Σt∈s w(t)²

      • Weight can be IDF, unary, language model, etc.

      • ||s||2 = w2(s)^(1/2)
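In code, the L2 norm and the normalized score look like this (a minimal Python sketch; the function names and toy weights are mine, and S(q, s) follows the definition w2(q ∩ s) / (||q||2 ||s||2) used later in this part):

```python
import math

def l2_norm(grams, w):
    """||s||_2 = sqrt(w2(s)), with w2(s) the sum of squared n-gram weights."""
    return math.sqrt(sum(w[t] ** 2 for t in grams))

def score(q, s, w):
    """S(q, s) = w2(q ∩ s) / (||q||_2 * ||s||_2), bounded in [0, 1]."""
    inter = sum(w[t] ** 2 for t in q & s)
    return inter / (l2_norm(q, w) * l2_norm(s, w))
```

Identical n-gram sets score exactly 1, which is the interpretability property the next slides rely on.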


The L2-Length Filter (HCKS08)

  • Why L2?

    • For almost exact matches.

    • Two strings match only if:

      • They have very similar n-gram sets, and hence L2 lengths

      • The “extra” n-grams have truly insignificant weights in aggregate (hence, resulting in similar L2 lengths).


Example

  • “AT&T Labs – Research”  L2=100

  • “ATT Labs – Research”  L2=95

  • “AT&T Labs”  L2=70

    • What if “Research” happened to be very popular and hence had a small weight?

  • “The Dark Knight”  L2=75

  • “Dark Night”  L2=72


Why L2 (continued)

  • Tight L2-based length filtering will result in very efficient pruning.

  • L2 yields scores bounded within [0, 1]:

    • 1 means a truly perfect match.

    • Easier to interpret scores.

    • L0 and L1 do not have the same properties

      • Scores are bounded only by the largest string length in the database.

      • For L0 an exact match can have score smaller than a non-exact match!


Example

  • q={‘ATT’, ‘TT ’, ‘T L’, ‘LAB’, ‘ABS’} → L0=5

  • s1={‘ATT’} → L0=1

  • s2=q → L0=5

  • S(q, s1) = Σw(q ∩ s1)/(||q||0 ||s1||0) = 10/5 = 2

  • S(q, s2) = Σw(q ∩ s2)/(||q||0 ||s2||0) = 40/25 < 2


Problems

  • L2 normalization poses challenges.

    • For example:

      • S(q, s) = w2(q ∩ s)/(||q||2 ||s||2)

      • Prefix filter cannot be applied.

      • Minimum prefix weight α?

        • Value depends both on ||s||2 and ||q||2.

        • But ||q||2 is unknown at index construction time


Important L2 Properties

  • Length filtering:

    • For S(q, s) ≥ τ

    • τ·||q||2 ≤ ||s||2 ≤ ||q||2 / τ

    • We are only looking for strings within these lengths.

    • Proof in paper

  • Monotonicity …


Monotonicity

  • Let s={t1, t2, …, tm}.

  • Let pw(s, t) = w(t) / ||s||2 (partial weight of s)

  • Then: S(q, s) = Σt∈q∩s w(t)² / (||q||2 ||s||2) =

    Σt∈q∩s pw(s, t) · pw(q, t)

  • If pw(s, t) > pw(r, t):

    • w(t)/||s||2 > w(t)/||r||2 ⇒ ||s||2 < ||r||2

  • Hence, for any t’ ≠ t:

    • w(t’)/||s||2 > w(t’)/||r||2 ⇒ pw(s, t’) > pw(r, t’)


[Figure: strings 0 “rich”, 1 “stick”, 2 “stich”, 3 “stuck”, 4 “static” and one inverted list per 2-gram (at, ch, ck, ic, ri, st, ta, ti, tu, uc)]

Indexing

  • Use inverted lists sorted by pw():

  • pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic), i.e.,

  • ||0||2 < ||4||2 < ||1||2 < ||2||2
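Building such an index is short in Python. A sketch (function names are mine; unary weights in the test stand in for IDF, so the exact list order on the slide, which uses non-unary weights, may differ): sorting a list by decreasing pw(s, t) is, for a fixed t, the same as sorting by increasing L2 length ||s||2.

```python
import math
from collections import defaultdict

def build_index(strings, w, n=2):
    """One inverted list per n-gram; each list sorted by pw(s, t) = w(t)/||s||_2
    descending, i.e. by increasing L2 length of the string."""
    grams = {i: {s[j:j + n] for j in range(len(s) - n + 1)}
             for i, s in enumerate(strings)}
    norm = {i: math.sqrt(sum(w[t] ** 2 for t in g)) for i, g in grams.items()}
    index = defaultdict(list)
    for i, g in grams.items():
        for t in g:
            index[t].append(i)
    for t in index:
        index[t].sort(key=norm.__getitem__)  # pw descending == ||s||_2 ascending
    return index, norm
```

With the norms stored alongside, length filtering reduces each list scan to the ids whose ||s||2 lies in [τ·||q||2, ||q||2/τ].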


[Figure: the pw()-sorted inverted lists; for a given q and τ, only the portion of each list within the L2 length bounds is examined]

L2 Length Filter

  • Given q and τ, and using length filtering:

  • We examine only a small fraction of the lists


[Figure: the pw()-sorted inverted lists again]

Monotonicity

  • If I have seen 1 already, then 4 is not in the list:



Other Improvements

  • Use properties of weighting scheme

    • Scan high weight lists first

    • Prune according to string length and maximum potential score

    • Ignore low weight lists altogether


Conclusion

  • Concepts can be extended easily for:

    • BM25

    • Weighted Jaccard

    • DICE

    • IDF

  • Take away message:

    • Properties of the similarity/distance function can play a big role in designing very fast indexes.

    • L2 super fast for almost exact matches


Outline

  • Motivation and preliminaries

  • Inverted list based algorithms

  • Gram signature algorithms

  • Length-normalized measures

  • Selectivity estimation

  • Conclusion and future directions


The Problem

  • Estimate the number of strings with:

    • Edit distance smaller than k from query q

    • Cosine similarity higher than τ to query q

    • Jaccard, Hamming, etc…

  • Issues:

    • Estimation accuracy

    • Size of estimator

    • Cost of estimation


Motivation

  • Query optimization:

    • Selectivity of query predicates

    • Need to support selectivity of approximate string predicates

  • Visualization/Querying:

    • Expected result set size helps with visualization

    • Result set size important for remote query processing


Flavors

  • Edit distance:

    • Based on clustering (JL05)

    • Based on min-hash (MBKS07)

    • Based on wild-card n-grams (LNS07)

  • Cosine similarity:

    • Based on sampling (HYKS08)


Selectivity Estimation for Edit Distance

  • Problem:

    • Given query string q

    • Estimate the number of strings s ∈ D

    • Such that ed(q, s) ≤ δ


Sepia - Clustering (JL05, JLV08)

  • Partition strings using clustering:

    • Enables pruning of whole clusters

  • Store per cluster histograms:

    • Number of strings within edit distance 0,1,…,δ from the cluster center

  • Compute global dataset statistics:

    • Use a training query set to compute frequency of strings within edit distance 0,1,…,δ from each query


Edit Vectors

  • Edit distance is not discriminative:

    • Use Edit Vectors

    • 3D space vs 1D space

[Figure: cluster Ci with pivot pi = “Lucia”: edit vectors v(pi, s) are <2,0,0> for “Luciano” (ed 2), <1,1,1> for “Lukas” (ed 3), and <1,1,0> for “Lucas” (ed 2); query q is compared against pi]


Visually

[Figure: each cluster Ci keeps a frequency table Fi mapping edit vectors v(pi, s) to counts (e.g. F1: <0,0,0> → 4, <0,0,1> → 12, <1,0,2> → 7), and a global table maps triples <v(q,pi), v(pi,s), ed(q,s)> to counts and fractions, e.g. (<1,1,0>, <1,0,2>, 3) → 21 strings, 25%]


Selectivity Estimation

  • Use triangle inequality:

    • Compute edit vector v(q,pi) for all clusters i

    • If |v(q,pi)| > ri + δ, disregard cluster Ci

[Figure: query q, cluster pivot pi, cluster radius ri, and query threshold δ]


Selectivity Estimation

  • Use triangle inequality:

    • Compute edit vector v(q,pi) for all clusters i

    • If |v(q,pi)| > ri + δ, disregard cluster Ci

    • For all entries in the frequency table:

      • If |v(q,pi)| + |v(pi,s)| ≤ δ, then ed(q,s) ≤ δ for all such s

      • If ||v(q,pi)| − |v(pi,s)|| > δ, ignore these strings

      • Else use global table:

        • Lookup entry <v(q,pi), v(pi,s), δ> in global table

        • Use the estimated fraction of strings
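The per-cluster loop can be sketched as follows (an illustrative Python sketch: the data layout, `edit_vector`, and `ed_norm` are hypothetical stand-ins for Sepia's structures, with `ed_norm(v)` the scalar edit distance |v| of an edit vector):

```python
def estimate(q, clusters, global_table, delta, edit_vector, ed_norm):
    """Sum per-cluster contributions, pruning with the triangle inequality.
    clusters: (pivot, radius, freq_table) triples, freq_table: v(p,s) -> count;
    global_table: (v(q,p), v(p,s), d) -> fraction of such strings with ed <= d."""
    total = 0.0
    for pivot, radius, freq in clusters:
        vqp = edit_vector(q, pivot)
        if ed_norm(vqp) > radius + delta:
            continue                         # no string in this cluster qualifies
        for vps, count in freq.items():
            if ed_norm(vqp) + ed_norm(vps) <= delta:
                total += count               # every such string qualifies
            elif abs(ed_norm(vqp) - ed_norm(vps)) > delta:
                continue                     # no such string can qualify
            else:
                total += count * global_table.get((vqp, vps, delta), 0.0)
    return total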


[Figure: frequency table F1 (<0,0,0> → 4, <0,0,1> → 12, <1,0,2> → 7) and the global table entries for v(q,pi)=<1,1,0>, v(pi,s)=<1,0,2>: ed 3 → 21 (25%), ed 4 → 63 (75%), ed 5 → 84 (100%)]

Example

  • δ =3

  • v(q,p1) = <1,1,0> v(p1,s) = <1,0,2>

  • Global lookup:

    [<1,1,0>,<1,0,2>, 3]

  • Fraction is 25%, so the contribution is 25% × 7 = 1.75

  • Iterate through F1, and add up contributions



Cons

  • Hard to maintain if clusters start drifting

  • Hard to find a good number of clusters

    • Space/Time tradeoffs

  • Needs training to construct good dataset statistics table


VSol – minhash (MBKS07)

  • Solution based on minhash

  • minhash is used for:

    • Estimate the size of a set |s|

    • Estimate resemblance of two sets

      • I.e., estimating the size of J = |s1 ∩ s2| / |s1 ∪ s2|

    • Estimate the size of the union |s1 ∪ s2|

    • Hence, estimating the size of the intersection:

      • |s1 ∩ s2| ≈ J̃(s1, s2) × the estimated |s1 ∪ s2|


Minhash

  • Given a set s = {t1, …, tm}

  • Use independent hash functions h1, …, hk:

    • hi: n-gram → [0, 1]

  • Hash elements of s, k times

  • Keep the k elements that hashed to the smallest value each time

  • We reduced set s, from m to k elements

  • Denote minhash signature with s’


How to use minhash

  • Given two signatures q’, s’:

    • J(q, s) ≈ Σ1≤i≤k I{q’[i] = s’[i]} / k

    • |s| ≈ (k / Σ1≤i≤k s’[i]) − 1

    • (q ∪ s)’ = element-wise minimum: (q ∪ s)’[i] = min(q’[i], s’[i])

    • Hence:

      • |q ∪ s| ≈ (k / Σ1≤i≤k (q ∪ s)’[i]) − 1


[Figure: one inverted list per n-gram t1, t2, …, t10, each with its minhash signature]

VSol Estimator

  • Construct one inverted list per n-gram in D

    • The lists are our sets

  • Compute a minhash signature for each list


Selectivity Estimation

  • Use edit distance length filter:

    • If ed(q, s) ≤ δ, then q and s share at least L = |s| − 1 − n(δ − 1) n-grams

  • Given query q = {t1, …, tm}:

    • Answer is the size of the union of all non-empty L-intersections (binomial coefficient: m choose L)

    • We can estimate sizes of L-intersections using minhash signatures


[Figure: the inverted lists for q’s n-grams t1, t2, …, t10 and their minhash signatures]

Example

  • δ = 2, n = 3 ⇒ L = 6

    • Look at all 6-intersections of the inverted lists

    • A = |∪ i1, …, i6 ∈ [1,10] (ti1 ∩ ti2 ∩ … ∩ ti6)|

    • There are (10 choose 6) such terms


The m-L Similarity

  • Can be done efficiently using minhashes

  • Answer:

    • ρ = Σ1jk I{ i1, …, iL: ti1’[j] = … = tiL’[j] }

    • A  ρ  |t1… tm|

  • Proof very similar to the proof for minhashes
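One reading of ρ in code (an illustrative Python sketch, names mine; it assumes the indicator tests, per signature position j, whether at least L of the m list signatures agree, normalized by k so that ρ is a fraction):

```python
from collections import Counter

def mL_rho(signatures, L, k):
    """Fraction of signature positions j where at least L of the m minhash
    signatures share the same value; scales the estimated union size."""
    rho = 0
    for j in range(k):
        counts = Counter(sig[j] for sig in signatures)
        if max(counts.values()) >= L:
            rho += 1
    return rho / k
```

The answer estimate is then ρ times the estimated distinct size of the union of the lists.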


Cons

  • Will overestimate results

    • Many L-intersections will share strings

    • Edit distance length filter is loose


OptEQ – wild-card n-grams (LNS07)

  • Use extended n-grams:

    • Introduce wild-card symbol ‘?’

    • E.g., “ab?” can be:

      • “aba”, “abb”, “abc”, …

  • Build an extended n-gram table:

    • Extract all 1-grams, 2-grams, …, n-grams

    • Generalize to extended 2-grams, …, n-grams

    • Maintain an extended n-grams/frequency hashtable


Example

[Figure: dataset strings “abc”, “def”, “ghi” and the extended n-gram table mapping n-grams to frequencies, e.g. ab → 10, bc → 15, ?b → 13, a? → 17, ?c → 23, abc → 5, def → 2]


Query Expansion (Replacements only)

  • Given query q=“abcd”

  • δ=2

  • And replacements only:

    • Base strings:

      • “??cd”, “?b?d”, “?bc?”, “a??d”, “a?c?”, “ab??”

    • Query answer:

      • S1 = {s ∈ D: s matches ”??cd”}, S2 = …

      • A = |S1 ∪ S2 ∪ S3 ∪ S4 ∪ S5 ∪ S6| =

        Σ1≤n≤6 (−1)^(n−1) Σi1<…<in |Si1 ∩ … ∩ Sin|


Replacement Intersection Lattice

A = Σ1n6 (-1)n-1 |S1  …  Sn|

  • Need to evaluate size of all 2-intersections, 3-intersections, …, 6-intersections

  • Then, use n-gram table to compute sum A

  • Exponential number of intersections

  • But ... there is a well-defined structure


Replacement Lattice

  • Build the replacement lattice, level by number of ‘?’s:

    • 2 ‘?’: ??cd, ?b?d, ?bc?, a??d, a?c?, ab??

    • 1 ‘?’: ?bcd, a?cd, ab?d, abc?

    • 0 ‘?’: abcd

  • Many intersections are empty

  • Others produce the same results

    • We need to count everything only once


General Formulas

  • Similar reasoning for:

    • r replacements

    • d deletions

  • Other combinations difficult:

    • Multiple insertions

    • Combinations of insertions/replacements

  • But … we can generate the corresponding lattice algorithmically!

    • Expensive but possible


BasicEQ

  • Partition strings by length:

    • Query q with length l

    • Possible matching strings with lengths:

      • [l-δ, l+δ]

    • For k = l-δ to l+δ

      • Find all combinations of i+d+r = δ and l+i-d=k

      • If (i,d,r) is a special case use formula

      • Else generate lattice incrementally:

        • Start from query base strings (easy to generate)

        • Begin with 2-intersections and build from there


OptEq

  • Details are cumbersome

    • Left for homework

  • Various optimizations possible to reduce complexity


Cons

  • Fairly complicated implementation

  • Expensive

  • Works for small edit distance only


Hashed Sampling (HYKS08)

  • Used to estimate selectivity of TF/IDF, BM25, DICE (vector space model)

  • Main idea:

    • Take a sample of the inverted index

    • But do it intelligently to improve variance


Example

  • Take a sample of the inverted index

[Figure: the full inverted lists for the 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc]


Example (Cont.)

  • But do it intelligently to improve variance

[Figure: the sampled inverted lists; every sampled id is kept consistently across all lists that contain it]

Construction

  • Draw samples deterministically:

    • Use a hash function h: N → [0, 100]

    • Keep ids that hash to values smaller than σ

  • Invariant:

    • If a given id is sampled in one list, it will always be sampled in all other lists that contain it:

      • S(q, s) can be computed directly from the sample

      • No need to store complete sets in the sample

      • No need for extra I/O to compute scores
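The invariant falls out of hashing the id rather than flipping a coin per list entry. A minimal Python sketch (function names are mine; SHA-1 stands in for the hash function h):

```python
import hashlib

def keep(string_id, sigma):
    """Deterministic membership test: hash the id into [0, 100) and keep it
    iff the value is below sigma."""
    d = hashlib.sha1(str(string_id).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2 ** 64 * 100 < sigma

def sample_index(index, sigma):
    """Because keep() depends only on the id, a sampled id survives on every
    list that contains it, so scores are computable from the sample alone."""
    return {t: [i for i in ids if keep(i, sigma)] for t, ids in index.items()}
```

A coin-flip per list entry would break the invariant: the same id could survive on one list and vanish from another, making sampled scores incomputable.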


Selectivity Estimation

  • The union of arbitrary list samples is a σ% sample

  • Given query q = {t1, …, tm}:

    • A = |Aσ| × |t1 ∪ … ∪ tm| / |tσ1 ∪ … ∪ tσm|:

      • Aσ is the query answer size from the sample

      • The fraction is the actual scale-up factor

      • But there are duplicates in these unions!

    • We need to know:

      • The distinct number of ids in t1 ∪ … ∪ tm

      • The distinct number of ids in tσ1 ∪ … ∪ tσm


Count Distinct

  • Distinct |tσ1 ∪ … ∪ tσm| is easy:

    • Scan the sampled lists

  • Distinct |t1 ∪ … ∪ tm| is hard:

    • Scanning the full lists is the same as computing the exact answer to the query … naively

    • We are lucky:

      • Each list sample doubles up as a k-minimum value estimator by construction!

      • We can use the list samples to estimate the distinct |t1 ∪ … ∪ tm|


[Figure: hash values on [0, 100]; for D distinct elements the r-th smallest hash value hr satisfies hr/100 ≈ r/D]

The k-Minimum Value Synopsis

  • It is used to estimate the distinct size of arbitrary set unions (same idea as the FM sketch):

    • Take hash function h: N  [0, 100]

    • Hash each element of the set

    • The r-th smallest hash value is an unbiased estimator of count distinct:
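One standard form of the estimator in code (an illustrative Python sketch, names mine: with hashes uniform on [0, 100], D̂ = 100·(r−1)/hr is the unbiased variant of the r-th-smallest-value estimator):

```python
import hashlib

def kmv_distinct(ids, r):
    """Estimate the number of distinct ids from the r-th smallest hash value."""
    def h(x):
        d = hashlib.sha1(str(x).encode()).digest()
        return int.from_bytes(d[:8], "big") / 2 ** 64 * 100  # uniform on [0, 100)
    hr = sorted(h(x) for x in set(ids))[r - 1]
    return 100 * (r - 1) / hr
```

Because the hashed-sample lists keep exactly the ids with the smallest hash values, they double as this synopsis with no extra state, which is the "lucky" observation on the previous slide.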


Outline

  • Motivation and preliminaries

  • Inverted list based algorithms

  • Gram signature algorithms

  • Length normalized algorithms

  • Selectivity estimation

  • Conclusion and future directions


Future Directions

  • Result ranking

    • In practice need to run multiple types of searches

    • Need to identify the “best” results

  • Diversity of query results

    • Some queries have multiple meanings

    • E.g., “Jaguar”

  • Updates

    • Incremental maintenance


References

  • [AGK06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006

  • [BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu, ICDE 2009

  • [HCK+08] Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava: Fast Indexes and Algorithms for Set Similarity Selection Queries. ICDE 2008

  • [HYK+08] Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava: Hashed samples: selectivity estimators for set similarity selection queries. PVLDB 2008.

  • [JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets, Liang Jin, and Chen Li. VLDB 2005.

  • [KSS06] Record linkage: Similarity measures and algorithms. Nick Koudas, Sunita Sarawagi, and Divesh Srivastava. SIGMOD 2006.

  • [LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches, Chen Li, Jiaheng Lu, and Yiming Lu. ICDE 2008.

  • [LNS07] Hongrae Lee, Raymond T. Ng, Kyuseok Shim: Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. VLDB 2007

  • [LWY07] VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams, Chen Li, Bin Wang, and Xiaochun Yang. VLDB 2007

  • [MBK+07] Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava: Estimating the selectivity of approximate string queries. ACM TODS 2007

  • [XWL08] Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 2008


References

  • [XWL+08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient similarity joins for near duplicate detection. WWW 2008.

  • [YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently, Xiaochun Yang, Bin Wang, and Chen Li, SIGMOD 2008

  • [JLV08] L. Jin, C. Li, R. Vernica: SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. VLDBJ 2008

  • [CGK06] S. Chaudhuri, V. Ganti, R. Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006

  • [CCGX08] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin: An Efficient Filter for Approximate Membership Checking. SIGMOD 2008

  • [SK04] Sunita Sarawagi, Alok Kirpal: Efficient set joins on similarity predicates. SIGMOD Conference 2004: 743-754

  • [BK02] Jérémy Barbay, Claire Kenyon: Adaptive intersection and t-threshold problems. SODA 2002: 390-399

  • [CGG+05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis: Data cleaning in microsoft SQL server 2005. SIGMOD Conference 2005: 918-920

