
Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search



Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search. Jiannan Wang (Tsinghua University), Guoliang Li (Tsinghua University), Jianhua Feng (Tsinghua University). Data Integration. Data Cleaning. Similarity Join. Jaccard. Threshold: 0.6. Challenge.


Presentation Transcript


Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search

Jiannan Wang (Tsinghua University)

Guoliang Li (Tsinghua University)

Jianhua Feng (Tsinghua University)


Data Integration


Data Cleaning


Similarity Join

Jaccard: J(r, s) = |r ∩ s| / |r ∪ s|

Threshold: 0.6
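To make the join condition concrete, here is a minimal Python sketch of the Jaccard check with the slide's threshold of 0.6. The token sets and the function name `jaccard` are illustrative assumptions, not taken from the slides.

```python
def jaccard(r, s):
    """Jaccard similarity |r ∩ s| / |r ∪ s| of two token collections."""
    r, s = set(r), set(s)
    if not r and not s:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(r & s) / len(r | s)

THRESHOLD = 0.6  # threshold shown on the slide

# Hypothetical token sets, for illustration only.
a = {"similarity", "join", "prefix", "filter", "search"}
b = {"similarity", "join", "prefix", "filter", "index"}
c = {"data", "cleaning", "prefix", "filter", "index"}

similar_ab = jaccard(a, b) >= THRESHOLD  # 4 shared tokens out of 6 distinct
similar_ac = jaccard(a, c) >= THRESHOLD  # 2 shared tokens out of 8 distinct
```

Under these assumed sets, only the pair (a, b) passes the threshold.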



Challenge

Naïve Method

How to address?

Filtering and Verification
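A rough sketch of the naïve method, assuming a Jaccard join: every pair in R × S is compared, so the number of similarity computations is |R| · |S|. The data and the names `naive_join` and `jaccard` are hypothetical.

```python
from itertools import product

def jaccard(r, s):
    r, s = set(r), set(s)
    return len(r & s) / len(r | s) if (r or s) else 1.0

def naive_join(R, S, threshold):
    """Compare every pair in R x S and count the similarity computations."""
    comparisons = 0
    results = []
    for (i, r), (j, s) in product(enumerate(R), enumerate(S)):
        comparisons += 1
        if jaccard(r, s) >= threshold:
            results.append((i, j))
    return results, comparisons

# Hypothetical collections, for illustration only.
R = [{"a", "b", "c", "d", "e"}, {"f", "g", "h", "i", "j"}]
S = [{"a", "b", "c", "d", "x"}, {"f", "g", "y", "z", "w"}, {"k", "l", "m", "n", "o"}]

pairs, comparisons = naive_join(R, S, 0.6)
```

Here six comparisons are spent to find a single similar pair, which is the cost that filtering and verification tries to avoid.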



Prefix Filtering

[1] Chaudhuri et al. A primitive operator for similarity joins in data cleaning. ICDE 2006.

[2] Bayardo et al. Scaling up all pairs similarity search. WWW 2007.

[3] Xiao et al. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 2008.

[4] Xiao et al. Efficient similarity joins for near duplicate detection. WWW 2008.

[5] Xiao et al. Top-k set similarity joins. ICDE 2009.

[6] Vernica et al. Efficient parallel set-similarity joins using MapReduce. SIGMOD 2010.

[7] Qin et al. Efficient exact edit similarity query processing with the asymmetric signature scheme. SIGMOD 2011.



Overlap Similarity

Edit Distance

Cosine

Jaccard

Given two collections of objects, R and S,

how to find all pairs ⟨r, s⟩ ∈ R × S such that |r ∩ s| is at least a given threshold?

Edit Similarity

Dice

Overlap
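As one example of how a similarity threshold can be turned into an overlap threshold, the sketch below converts a Jaccard threshold. It relies on the standard identity that J(r, s) ≥ τ holds exactly when |r ∩ s| ≥ τ(|r| + |s|) / (1 + τ). The function name `overlap_threshold` is an assumption for illustration.

```python
import math

def overlap_threshold(tau, size_r, size_s):
    """Smallest overlap |r ∩ s| for which Jaccard(r, s) >= tau.

    From o / (size_r + size_s - o) >= tau it follows that
    o >= tau * (size_r + size_s) / (1 + tau).
    """
    bound = tau * (size_r + size_s) / (1 + tau)
    # The small epsilon guards against floating-point noise before rounding up.
    return math.ceil(bound - 1e-9)

required = overlap_threshold(0.6, 5, 5)
```

For two 5-element sets and a Jaccard threshold of 0.6, the required overlap is 4.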


Prefix Filtering

Find ⟨r, s⟩ such that |r ∩ s| ≥ 4?

Sort the elements of each set based on a global ordering

Remove the last 3 elements in each set
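The two steps above can be sketched in Python as follows. Alphabetical order stands in for the global ordering, and the sets and function names are assumptions for illustration.

```python
def prefix(tokens, c, order_key=None):
    """Sort by the global ordering and drop the last c - 1 elements."""
    ordered = sorted(tokens, key=order_key)
    return ordered[:max(len(ordered) - (c - 1), 0)]

def may_match(r, s, c, order_key=None):
    """Prefix filter: a pair with |r ∩ s| >= c must share a prefix element."""
    return bool(set(prefix(r, c, order_key)) & set(prefix(s, c, order_key)))

c = 4  # overlap threshold, so the last 3 elements of each set are removed

# Hypothetical sets, for illustration only.
r = {"a", "b", "c", "d", "e"}
s1 = {"b", "c", "d", "e", "f"}  # overlap with r is 4
s2 = {"c", "d", "e", "f", "g"}  # overlap with r is 3

keep_s1 = may_match(r, s1, c)  # prefixes [a, b] and [b, c] share "b"
keep_s2 = may_match(r, s2, c)  # prefixes [a, b] and [c, d] are disjoint
```

A pair that survives the filter is only a candidate and still has to be verified. A pair whose prefixes are disjoint can be dropped without computing its overlap.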


Inverted Index

Find all pairs ⟨r, s⟩ whose overlap meets the threshold?

Build an inverted index on the prefixes

Candidates
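A minimal sketch of candidate generation with an inverted index over prefixes, followed by verification. It assumes an overlap threshold c, alphabetical order as the global ordering, and hypothetical data and function names.

```python
from collections import defaultdict

def prefix(tokens, c):
    ordered = sorted(tokens)  # alphabetical order stands in for the global ordering
    return ordered[:max(len(ordered) - c + 1, 0)]

def build_index(S, c):
    """Map each prefix element to the ids of the sets whose prefix contains it."""
    index = defaultdict(list)
    for sid, s in enumerate(S):
        for e in prefix(s, c):
            index[e].append(sid)
    return index

def candidates(index, r, c):
    """Union of the inverted lists of the elements in r's prefix."""
    found = set()
    for e in prefix(r, c):
        found.update(index.get(e, ()))
    return found

def join(R, S, c):
    index = build_index(S, c)
    return [(rid, sid)
            for rid, r in enumerate(R)
            for sid in sorted(candidates(index, r, c))
            if len(set(r) & set(S[sid])) >= c]  # verification step

# Hypothetical collections, for illustration only.
R = [{"a", "b", "c", "d", "e"}]
S = [{"a", "b", "c", "d", "f"}, {"a", "g", "h", "i", "j"}, {"c", "d", "e", "f", "g"}]

cands = candidates(build_index(S, 4), R[0], 4)
result = join(R, S, 4)
```

In this example the index returns two candidates for r, and verification keeps one of them.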



Can we beat the prefix filtering?


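Prefix Scheme

2-prefix scheme

If |r ∩ s| meets the overlap threshold, then the 2-prefixes of r and s share at least 2 elements

⟨r, s⟩ can be filtered

1-prefix scheme

If |r ∩ s| meets the overlap threshold, then the 1-prefixes of r and s share at least 1 element

⟨r, s⟩ cannot be filtered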
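The difference between the two schemes can be sketched as below, assuming an overlap threshold c and an ℓ-prefix of length |s| − c + ℓ. The sets, the alphabetical ordering and the function names are illustrative assumptions.

```python
def l_prefix(tokens, c, l):
    """First |tokens| - c + l elements under the global ordering."""
    ordered = sorted(tokens)  # alphabetical order as a stand-in global ordering
    return ordered[:max(len(ordered) - c + l, 0)]

def survives(r, s, c, l):
    """A pair with |r ∩ s| >= c shares at least l elements in its l-prefixes."""
    return len(set(l_prefix(r, c, l)) & set(l_prefix(s, c, l))) >= l

c = 4

# Hypothetical sets, for illustration only.
r = {"a", "b", "c", "d", "e"}
s = {"a", "f", "g", "h", "i"}  # true overlap with r is 1
t = {"a", "b", "c", "d", "f"}  # true overlap with r is 4

s_under_1 = survives(r, s, c, 1)  # [a, b] vs [a, f] share 1 element
s_under_2 = survives(r, s, c, 2)  # [a, b, c] vs [a, f, g] share only 1 element
t_under_2 = survives(r, t, c, 2)  # [a, b, c] vs [a, b, c] share 3 elements
```

The dissimilar pair (r, s) survives the 1-prefix scheme but is pruned by the 2-prefix scheme, while the truly similar pair (r, t) survives both.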


Cost Analysis

2-prefix scheme

Filtering: 2+2+2

Verification: 1*10

Total: 16

1-prefix scheme

Filtering: 2+2

Verification: 4*10

Total: 44
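The arithmetic on this slide can be reproduced with a small cost model. Reading "Filtering" as the total length of the inverted lists scanned and "Verification" as the number of candidates times a per-candidate cost is an interpretation; the function name `scheme_cost` is hypothetical.

```python
def scheme_cost(list_lengths, num_candidates, verify_cost):
    """Filtering cost (lists scanned) plus verification cost (candidates checked)."""
    filtering = sum(list_lengths)
    verification = num_candidates * verify_cost
    return filtering + verification

# Numbers taken from the slide.
cost_2_prefix = scheme_cost([2, 2, 2], num_candidates=1, verify_cost=10)
cost_1_prefix = scheme_cost([2, 2], num_candidates=4, verify_cost=10)
```

The 2-prefix scheme scans one more list but verifies three fewer candidates, giving 16 against 44.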



Experimental Analysis

  • DBLP


An Adaptive Framework for Similarity Join and Search


Variable-Length Prefix Scheme

Find ⟨r, s⟩ such that |r ∩ s| ≥ 4?

Cost analysis


Adaptive Framework

  • Step 1: Build an inverted index I to support the variable-length prefix scheme

  • Step 2: For each r ∈ R

    • Step 2.1: Adaptively select an ℓ-prefix scheme for r

    • Step 2.2: Use the ℓ-prefix scheme to find the objects in S that are similar to r

Challenge 1

Challenge 2
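The steps of the framework can be sketched as a skeleton in Python. This is a simplification under stated assumptions: the index stores every element with its position rather than a dedicated prefix index, the scheme selector `select_l` is a hypothetical hook supplied by the caller, and alphabetical order stands in for the global ordering.

```python
from collections import Counter, defaultdict

def build_index(S):
    """Step 1: element -> [(set id, 1-based position in sorted order)]."""
    index = defaultdict(list)
    for sid, s in enumerate(S):
        for pos, e in enumerate(sorted(s), start=1):
            index[e].append((sid, pos))
    return index

def adaptive_join(R, S, c, select_l):
    index = build_index(S)
    sizes = [len(s) for s in S]
    results = []
    for rid, r in enumerate(R):                     # Step 2
        l = select_l(r)                             # Step 2.1 (hypothetical hook)
        r_prefix = sorted(r)[:max(len(r) - c + l, 0)]
        hits = Counter()
        for e in r_prefix:                          # Step 2.2: probe the l-prefix lists
            for sid, pos in index.get(e, ()):
                if pos <= sizes[sid] - c + l:       # e lies inside s's l-prefix
                    hits[sid] += 1
        for sid, n in hits.items():
            # Keep candidates sharing at least l prefix elements, then verify.
            if n >= l and len(set(r) & set(S[sid])) >= c:
                results.append((rid, sid))
    return results

# Hypothetical collections, for illustration only.
R = [{"a", "b", "c", "d", "e"}]
S = [{"a", "b", "c", "d", "f"}, {"a", "g", "h", "i", "j"}, {"c", "d", "e", "f", "g"}]

with_1_prefix = adaptive_join(R, S, 4, select_l=lambda r: 1)
with_2_prefix = adaptive_join(R, S, 4, select_l=lambda r: 2)
```

Both schemes return the same answers; they differ only in how many lists are scanned and how many candidates reach verification.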



Challenge 1: Delta Inverted Index

1-prefix scheme

2-prefix scheme

. . .
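One way to read the delta inverted index is sketched below: the index for the 1-prefix scheme is stored in full, and each further level stores only the one element that the ℓ-prefix adds to the (ℓ−1)-prefix. The layout, the names `build_delta_index` and `inverted_list`, and the data are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict

def build_delta_index(S, c, max_l):
    """delta[l][e] lists the sets whose l-prefix gains element e at level l."""
    delta = {l: defaultdict(list) for l in range(1, max_l + 1)}
    for sid, s in enumerate(S):
        ordered = sorted(s)  # alphabetical order as a stand-in global ordering
        base = len(ordered) - c + 1       # length of the 1-prefix
        if base <= 0:
            continue                       # set is too small to reach overlap c
        for e in ordered[:base]:
            delta[1][e].append(sid)
        for l in range(2, max_l + 1):
            pos = base + l - 2            # 0-based index of the element added at level l
            if pos < len(ordered):
                delta[l][ordered[pos]].append(sid)
    return delta

def inverted_list(delta, e, l):
    """Inverted list of e under the l-prefix scheme: delta lists 1..l combined."""
    return [sid for k in range(1, l + 1) for sid in delta[k].get(e, ())]

# Hypothetical collection, for illustration only.
S = [{"a", "b", "c", "d", "f"}, {"a", "g", "h", "i", "j"}]
delta = build_delta_index(S, c=4, max_l=2)
```

With this layout, switching from the 1-prefix scheme to the 2-prefix scheme only requires reading the extra delta lists, so one index serves every scheme.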


Challenge 2: Adaptively Selecting Prefix Scheme

① ℓ = 1;

② Compare the ℓ-prefix scheme with the (ℓ+1)-prefix scheme;

  • If the ℓ-prefix scheme is better then

    Choose the ℓ-prefix scheme;

  • Else

    ℓ++;

    Goto ②;

How
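The selection loop can be sketched as follows, assuming that "better" means a lower estimated cost and that some upper bound `max_l` stops the loop. Both are assumptions, as are the function names and the third cost value.

```python
def select_prefix_scheme(estimate_cost, max_l):
    """Return the first l whose estimated cost is no worse than that of l + 1."""
    l = 1                                             # step ①
    while l < max_l:
        if estimate_cost(l) <= estimate_cost(l + 1):  # step ②
            return l
        l += 1
    return l

# 44 and 16 come from the cost-analysis slide; 20 is a made-up value.
costs = {1: 44, 2: 16, 3: 20}
chosen = select_prefix_scheme(costs.__getitem__, max_l=3)
```

With these costs the loop moves past the 1-prefix scheme and stops at the 2-prefix scheme. The open question is how to obtain `estimate_cost` cheaply.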




Estimate

  • The number of candidates for the 1-prefix scheme

  • The number of candidates for the 2-prefix scheme

We merge the blue lists in advance to obtain the number of candidates for the 1-prefix scheme


Estimate

Occur at least twice in blue lists and green lists

= Occur at least twice in blue lists

+ Occur only once in blue lists and at least once in green lists

  • Random sampling

  • Let P be the probability that s occurs only once in blue lists

  • The value is

  • Estimate P by random sampling

The value is already known when estimating
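A generic sampling estimator that follows the decomposition above is sketched here. It is not necessarily the authors' estimator: the first term is counted exactly from the blue lists, and the second term is estimated by sampling the objects that occur once in the blue lists and checking them against the green lists. All names and data are hypothetical.

```python
import random
from collections import Counter

def estimate_two_prefix_candidates(blue_lists, green_lists, sample_size, rng):
    """Estimate how many objects occur at least twice across blue and green lists."""
    blue_counts = Counter(obj for lst in blue_lists for obj in lst)
    at_least_twice_blue = sum(1 for n in blue_counts.values() if n >= 2)
    once_blue = [obj for obj, n in blue_counts.items() if n == 1]
    if not once_blue:
        return float(at_least_twice_blue)
    green = set().union(*green_lists)
    # Sample objects that occur once in the blue lists and test them against green.
    sample = rng.sample(once_blue, min(sample_size, len(once_blue)))
    hit_rate = sum(1 for obj in sample if obj in green) / len(sample)
    return at_least_twice_blue + hit_rate * len(once_blue)

# Hypothetical inverted lists of object ids, for illustration only.
blue_lists = [[1, 2, 3], [2, 4]]
green_lists = [[3, 5]]

estimate = estimate_two_prefix_candidates(
    blue_lists, green_lists, sample_size=3, rng=random.Random(0))
```

When the sample covers every object that occurs once in the blue lists, as it does here, the estimate equals the exact count. Smaller samples trade accuracy for speed.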


Similarity Search

  • Different from Similarity Join

    • A threshold is not specified when building an index from data


Experiment Setup

  • Dataset statistics

  • Existing techniques


Similarity Join


Similarity Search


Conclusion

  • Different prefix schemes lead to significantly different performance, and prefix filtering (the 1-prefix scheme) does not always achieve high performance

  • An adaptive framework for similarity join and similarity search

  • Experimental results show that our adaptive framework outperforms the prefix-filtering framework and achieves higher performance than the state-of-the-art methods

  • Future Work



Thanks!

Q&A

http://dbgroup.cs.tsinghua.edu.cn/wangjn/projects/adapt/

