
An Overview of Similarity Query Processing

An Overview of Similarity Query Processing. 2014. 2. 26. Jongik Kim, Division of Computer Science and Engineering, Chonbuk National University. Table of Contents. 01. Applications of similarity query processing 02. Problem Formulation 03. String Decomposition 04. Similarity Function 05. A naïve approach 06. Overlap Similarity


Presentation Transcript


  1. An Overview of Similarity Query Processing 2014. 2. 26. Jongik Kim, Division of Computer Science and Engineering, Chonbuk National University

  2. Table of Contents • 01. Applications of similarity query processing • 02. Problem Formulation • 03. String Decomposition • 04. Similarity Function • 05. A naïve approach • 06. Overlap Similarity • 07. Similarity Query Processing with Inverted lists • 08. Similarity Function Revisited • 09. Filter and Verification Framework • 10. Prefix Filtering based Approach • 11. Exploiting Document Frequency Ordering

  3. Some examples and figures in this presentation are taken from the following materials: • Marios Hadjieleftheriou and Chen Li, Efficient Approximate Search on String Collections (tutorial), ICDE 2009 and VLDB 2009 • Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Efficient Similarity Joins for Near Duplicate Detection, WWW 2008 (slide) • Jongik Kim and Hongrae Lee, Efficient Exact Similarity Searches using Multiple Token Orderings, ICDE 2012 (slide)

  4. Applications of similarity query processing (1/8) Web Search: actual queries gathered by Google

  5. Applications of similarity query processing (2/8) Data integration and data cleaning (figure: two relations, R and S; one entry is annotated: should be “Niels Bohr”)

  6. Applications of similarity query processing (3/8) Duplicate (Web) Documents Detection

  7. Applications of similarity query processing (4/8) Identify Spams SPAM TEMPLATE Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours, <NAME> <AFFILIATION>

  8. Applications of similarity query processing (5/8) Detect Plagiarism Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read. Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.

  9. Applications of similarity query processing (6/8) Recommendation of friends in an SNS service • The friends of a person can be represented as a binary vector • Friends vector: 1 0 0 1 1 0 0 1 • Friends vector: 1 0 0 1 1 1 0 1

  10. Applications of similarity query processing (7/8) Read (a fragment of a genome sequence) alignment • Reference sequence: GCTGATGTGCCGCCTCACTCCGGTGG … • Short reads: CACTCCTGTGG, CTCACTCCTGTGG, GCTGATGTGCCACCTCA, GATGTGCCACCTCACTC, GTGCCGCCTCACTCCTG, CTCCTGTGG

  11. Applications of similarity query processing (8/8) Query Relaxation • Supported by Oracle Text • CREATE TABLE engdict(word VARCHAR(20), len INT); • Create preferences for text indexing: begin ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH'); end; / • CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF'); • Usage: SELECT * FROM engdict WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0; • Limitation: cannot handle errors in the first letters, e.g., Katherine versus Catherine

  12. Problem Formulation (1/2) Find strings similar to a given string

  13. Problem Formulation (2/2) • “Similar to” is defined by a domain-specific function that returns a similarity value between two strings • Common similarity functions: • Jaccard coefficient • Cosine similarity • Dice similarity • Edit distance • The set-based functions (Jaccard, cosine, Dice) require set data, so strings must first be decomposed into tokens

  14. String Decomposition • Word tokens for a long string (e.g., a web page) • x = “yes as soon as possible” • y = “as soon as possible please” • x = {A, B, C, D, E} • y = {B, C, D, E, F} • q-gram tokens for a short string (e.g., a keyword query) • x = “universal” • G(x, 2) = {un, ni, iv, ve, er, rs, sa, al}
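The two decompositions on this slide can be sketched as follows. The function names are hypothetical, and numbering repeated words (so that the second "as" becomes its own token) is an assumption made here to reproduce the five-token sets x and y shown above.

```python
from collections import Counter

def word_tokens(text):
    """Split on whitespace; the k-th occurrence of a word becomes the token (word, k).

    Numbering repeats is an assumption: it keeps the result a set while
    still distinguishing the first "as" from the second one.
    """
    seen = Counter()
    tokens = set()
    for word in text.split():
        seen[word] += 1
        tokens.add((word, seen[word]))
    return tokens

def qgrams(s, q):
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

x = word_tokens("yes as soon as possible")
y = word_tokens("as soon as possible please")
print(len(x), len(y), len(x & y))   # sizes of x, y, and their overlap
print(qgrams("universal", 2))       # the 2-grams G(x, 2)
```

A string of length n yields n - q + 1 q-grams, which is why "universal" (9 characters) has 8 2-grams.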

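  15. Similarity Function x = {A, B, C, D, E} y = {B, C, D, E, F} • Jaccard similarity: J(x, y) = |x ∩ y| / |x ∪ y| = 4/6 • Cosine similarity: C(x, y) = |x ∩ y| / √(|x| · |y|) = 4/5 • Dice similarity: D(x, y) = 2 · |x ∩ y| / (|x| + |y|) = 8/10 • Edit distance: ED(x, y) = minimum number of edit operations (insertion, deletion, substitution) needed to change x into y • x: Tom Hanks • y: Ton Hank • ED(x, y) = 2 (substitute m with n, delete s)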
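A minimal sketch of the four functions, using the standard set-based definitions and a textbook dynamic program for edit distance. The function names are chosen here for illustration, and the set functions assume non-empty inputs.

```python
import math

def jaccard(x, y):
    return len(x & y) / len(x | y)

def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def edit_distance(a, b):
    """Levenshtein distance, keeping only one row of the DP table at a time."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}
print(jaccard(x, y), cosine(x, y), dice(x, y))
print(edit_distance("Tom Hanks", "Ton Hank"))
```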

  16. A naïve approach Given a collection of strings C, a query string x, and a threshold t of a similarity function sim, 1. decompose each string in C and the query string into tokens. 2. output those strings y ∈ C such that sim(x, y) ≥ t. Every query compares x against every string in C, so the cost grows linearly with the size of the collection; this is inefficient when C is large.
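The two steps above amount to a full scan. A minimal sketch, assuming 2-gram tokens and Jaccard similarity; the example collection is hypothetical.

```python
def qgram_set(s, q=2):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(x, y):
    union = len(x | y)
    return len(x & y) / union if union else 0.0

def naive_search(collection, query, t, q=2):
    """Compare the query against every string in the collection."""
    qx = qgram_set(query, q)
    return [y for y in collection if jaccard(qx, qgram_set(y, q)) >= t]

# Hypothetical collection, used only for illustration.
collection = ["area", "artisan", "artist", "tisk"]
print(naive_search(collection, "artist", 0.5))
```

Each query costs one similarity computation per string in the collection, which is the inefficiency the following slides address.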

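  17. Overlap Similarity (1/2) Overlap similarity: O(x, y) = |x ∩ y| Given a similarity threshold t, a constraint on a set-based similarity function can be converted into an equivalent constraint on the overlap.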
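The conversions themselves are not present in this transcript. The following is a standard derivation, assuming O(x, y) = |x ∩ y| and the usual definitions of Jaccard (J), cosine (C), and Dice (D) similarity; it may differ in notation from the original slide.

```latex
\begin{align*}
J(x, y) = \frac{O(x, y)}{|x| + |y| - O(x, y)} \ge t
  &\iff O(x, y) \ge \frac{t}{1 + t}\,\bigl(|x| + |y|\bigr) \\
C(x, y) = \frac{O(x, y)}{\sqrt{|x|\,|y|}} \ge t
  &\iff O(x, y) \ge t\,\sqrt{|x|\,|y|} \\
D(x, y) = \frac{2\,O(x, y)}{|x| + |y|} \ge t
  &\iff O(x, y) \ge \frac{t}{2}\,\bigl(|x| + |y|\bigr)
\end{align*}
```

For Jaccard, the step is O ≥ t(|x| + |y| − O), which rearranges to (1 + t)·O ≥ t(|x| + |y|). Every bound depends on |y| as well as |x|.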

  18. Overlap Similarity (2/2) Given an edit distance threshold d: • d edit operations can affect at most d × q q-grams • or, equivalently, d edit operations on x can mutate at most d × q q-grams of x • Example: x = “universal” and G(x, 2) = {un, ni, iv, ve, er, rs, sa, al} • 2 edit operations on x mutate at most 2 × 2 = 4 q-grams • Hence, y should contain at least |G(x, 2)| − 2 × 2 = 4 q-grams in G(x, 2)
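The bound on this slide can be checked with a short sketch. The function names are hypothetical, and the bound is computed from the query side only, |G(x, q)| − d × q, as on the slide.

```python
from collections import Counter

def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def min_shared_qgrams(x, q, d):
    """Lower bound on the q-grams of x that survive d edit operations."""
    return max(0, len(qgrams(x, q)) - d * q)

def shared_qgrams(x, y, q):
    """Number of q-grams common to x and y, counted as multisets."""
    return sum((Counter(qgrams(x, q)) & Counter(qgrams(y, q))).values())

print(min_shared_qgrams("universal", 2, 2))     # the slide's example
# One deletion ("e") turns "universal" into "univrsal":
print(shared_qgrams("universal", "univrsal", 2), min_shared_qgrams("universal", 2, 1))
```

A string that shares fewer q-grams with x than the bound cannot be within edit distance d, so it can be discarded without computing the edit distance.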

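  19. Similarity Query Processing with Inverted lists • Make inverted lists over the 2-grams of the collection (tokens: an, ar, ea, is, re, rt, sa, sk, st, ti) • Query: “artist” → {ar, rt, ti, is, st}, overlap threshold: 4 • Inverted lists for the query tokens: ar → 1, 2, 3; rt → 2, 3; ti → 2, 3, 4; is → 2, 3, 4; st → 3 • Merge the lists to count occurrences: string 1 → 1, string 2 → 4, string 3 → 5, string 4 → 2 • Answers of the query (count ≥ 4): 2: “artisan”, 3: “artist”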
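A minimal sketch of building the inverted lists and counting occurrences. Strings 2 and 3 come from the slide; strings 1 ("area") and 4 ("tisk") are assumptions chosen here because their 2-grams reproduce the lists shown, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def qgram_set(s, q=2):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_index(strings, q=2):
    """Map each q-gram to the sorted list of IDs of the strings containing it."""
    index = defaultdict(list)
    for sid in sorted(strings):
        for gram in qgram_set(strings[sid], q):
            index[gram].append(sid)
    return index

def count_overlaps(index, query, q=2):
    """Count, for every string ID, how many query q-grams it contains."""
    counts = Counter()
    for gram in qgram_set(query, q):
        for sid in index.get(gram, ()):
            counts[sid] += 1
    return counts

# IDs 1 and 4 are assumed strings; 2 and 3 are the answers named on the slide.
strings = {1: "area", 2: "artisan", 3: "artist", 4: "tisk"}
index = build_index(strings)
counts = count_overlaps(index, "artist")
answers = sorted(sid for sid, c in counts.items() if c >= 4)
print(dict(counts), answers)
```

Only the lists of the query's own tokens are touched, so strings that share no token with the query are never examined.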

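  20. Merge Algorithm – HeapMerge • A min-heap holds the current head of each inverted list; IDs are popped in increasing order and the occurrences of each ID are counted • Count threshold: t = 3 • ID 1: count 2 < t (rejected) • ID 2: count 3 = t (accepted)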
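A heap-based merge can be sketched with Python's `heapq`. This is one straightforward variant, not necessarily the exact algorithm of the cited tutorial; it assumes every list is sorted in ascending order and contains no duplicate IDs.

```python
import heapq

def heap_merge(lists, t):
    """Return the IDs that occur in at least t of the sorted lists."""
    # Each heap entry is (string ID, list number, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        sid, i, pos = heapq.heappop(heap)
        if sid != current:
            # All occurrences of the previous ID have been seen.
            if current is not None and count >= t:
                result.append(current)
            current, count = sid, 0
        count += 1
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= t:
        result.append(current)
    return result

# Inverted lists of the query "artist" from the previous slide (ar, rt, ti, is, st).
lists = [[1, 2, 3], [2, 3], [2, 3, 4], [2, 3, 4], [3]]
print(heap_merge(lists, 4))
```

The heap never holds more than one entry per list, so each pop or push costs O(log k) for k lists.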

  21. Similarity Function Revisited To determine the overlap threshold, we need to know the size of y, which varies from string to string in a collection. Given a query x with a similarity threshold t, we therefore use a lower bound on the overlap that holds for all y.
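The bounds on this slide are not present in the transcript. One way to obtain a bound that is independent of |y| uses the fact that the overlap can never exceed |y|: for Jaccard, O ≥ t·|x ∪ y| ≥ t·|x|; for cosine, |y| ≥ t²·|x| and hence O ≥ t²·|x|; for Dice, |y| ≥ t·|x| / (2 − t) and hence O ≥ t·|x| / (2 − t). The sketch below encodes these derived bounds under those assumptions; the function name and the `sim` parameter are hypothetical.

```python
import math

def overlap_lower_bound(size_x, t, sim="jaccard"):
    """Smallest overlap that any y with sim(x, y) >= t must have, for 0 < t <= 1."""
    if sim == "jaccard":
        bound = t * size_x
    elif sim == "cosine":
        bound = t * t * size_x
    elif sim == "dice":
        bound = t * size_x / (2 - t)
    else:
        raise ValueError(f"unknown similarity function: {sim}")
    # Subtract a small epsilon so that floating-point noise such as
    # 4.000000000000001 is not rounded up to 5.
    return max(1, math.ceil(bound - 1e-9))

for sim in ("jaccard", "cosine", "dice"):
    print(sim, overlap_lower_bound(5, 0.8, sim))
```

With a 5-token query and a Jaccard threshold of 0.8, the bound is 4.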

  22. Filter and Verification Framework • FILTER: find those strings that share at least α tokens with the query string, where α is an overlap lower bound. The filter has two steps: • FILTER: quickly generate initial candidates using a minimum constraint • REFINEMENT: refine the candidates using α • VERIFICATION: verify each string found in the filtering stage by directly applying the similarity function
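A minimal end-to-end sketch of the framework, assuming 2-gram tokens, Jaccard similarity, and the overlap lower bound α = ⌈t · |x|⌉ (valid because J(x, y) ≥ t implies |x ∩ y| ≥ t · |x ∪ y| ≥ t · |x|). The filter here is a plain occurrence count rather than the two-step filter described on the slide, and the collection is hypothetical.

```python
import math
from collections import Counter, defaultdict

def qgram_set(s, q=2):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(x, y):
    union = len(x | y)
    return len(x & y) / union if union else 0.0

def filter_and_verify(strings, query, t, q=2):
    # Index construction is done inline only to keep the sketch self-contained.
    index = defaultdict(list)
    for sid, s in strings.items():
        for gram in qgram_set(s, q):
            index[gram].append(sid)

    x = qgram_set(query, q)
    alpha = max(1, math.ceil(t * len(x) - 1e-9))    # overlap lower bound

    # Filter: keep the strings sharing at least alpha tokens with the query.
    counts = Counter(sid for gram in x for sid in index.get(gram, ()))
    candidates = sorted(sid for sid, c in counts.items() if c >= alpha)

    # Verification: apply the similarity function to each candidate.
    answers = [sid for sid in candidates
               if jaccard(x, qgram_set(strings[sid], q)) >= t]
    return candidates, answers

strings = {1: "area", 2: "artisan", 3: "artist", 4: "tisk"}   # hypothetical data
print(filter_and_verify(strings, "artist", 0.8))
```

With t = 0.8, "artisan" passes the filter (it shares 4 tokens) but fails verification, since its Jaccard similarity to "artist" is 4/7. The filter may admit false positives, but it never drops a true answer.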

  23. Prefix Filtering based Approach Query x = “artist” → {ar, rt, ti, is, st} and overlap threshold α = 4 • Inverted lists for the query: ar → 1, 2, 3; is → 2, 3, 4; rt → 2, 3; st → 3; ti → 2, 3, 4 • Sort the tokens by their document frequencies, i.e., sort the lists by their sizes (document frequency ordering): st → 3; rt → 2, 3; ar → 1, 2, 3; is → 2, 3, 4; ti → 2, 3, 4 • Prefix lists: the first |G(x, 2)| – α + 1 = 2 lists (st, rt) • Suffix lists: the remaining α – 1 = 3 lists (ar, is, ti) • Filtering Phase (the prefix filtering) • Merge the prefix lists to generate candidates: 2 and 3 • Refinement Phase • Search the suffix lists for each candidate • Each candidate is looked up in every suffix list to check whether the list contains it • Binary search is used because suffix lists are usually very long • Resulting counts: 2 → 4, 3 → 5
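The two phases can be sketched as follows. Prefix filtering is safe because a string that appears in none of the |G(x, 2)| – α + 1 prefix lists can appear in at most α – 1 lists in total. The function names and the tie-breaking rule (by token name) are assumptions; every list must be sorted for the binary search to work.

```python
from bisect import bisect_left

def contains(sorted_list, sid):
    """Binary search for sid in a sorted list."""
    i = bisect_left(sorted_list, sid)
    return i < len(sorted_list) and sorted_list[i] == sid

def prefix_filter(lists, alpha):
    """lists maps each query token to its sorted inverted list."""
    # Document frequency ordering: shortest list first, ties broken by token.
    ordered = sorted(lists.items(), key=lambda kv: (len(kv[1]), kv[0]))
    n_prefix = max(0, len(ordered) - alpha + 1)
    prefix, suffix = ordered[:n_prefix], ordered[n_prefix:]

    # Filtering phase: merge the prefix lists to generate candidates.
    counts = {}
    for _, lst in prefix:
        for sid in lst:
            counts[sid] = counts.get(sid, 0) + 1
    candidates = sorted(counts)

    # Refinement phase: look each candidate up in every suffix list.
    answers = []
    for sid in candidates:
        total = counts[sid] + sum(contains(lst, sid) for _, lst in suffix)
        if total >= alpha:
            answers.append(sid)
    return candidates, answers

lists = {"ar": [1, 2, 3], "rt": [2, 3], "ti": [2, 3, 4], "is": [2, 3, 4], "st": [3]}
print(prefix_filter(lists, 4))
```

Only the short prefix lists are scanned in full; the long suffix lists are probed once per candidate.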

  24. Exploiting Document Frequency Ordering (1/2) • General goal: minimize the number of candidates initially generated by making use of the document frequency ordering • Query x = “artist” → {ar, rt, ti, is, st} and overlap threshold α = 4 • Prefix lists: the first |G(x, 2)| – α + 1 lists; suffix lists: the remaining α – 1 lists • Without the ordering: prefix lists ar → 1, 2, 3 and is → 2, 3, 4; suffix lists rt, st, ti; candidates: 1, 2, 3, 4 • With the tokens sorted by their document frequencies: prefix lists st → 3 and rt → 2, 3; suffix lists ar, is, ti; candidates: 2, 3 • We can reduce • the time for merging, because the prefix lists are short • the number of candidates, and hence the time for verification
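The effect of the token ordering on the candidate set can be reproduced with a short sketch. The function name is hypothetical, and the alphabetical order stands in for "no document frequency ordering".

```python
def prefix_candidates(lists, order, alpha):
    """Union of the first len(order) - alpha + 1 lists under the given token order."""
    n_prefix = max(0, len(order) - alpha + 1)
    candidates = set()
    for token in order[:n_prefix]:
        candidates.update(lists[token])
    return sorted(candidates)

lists = {"ar": [1, 2, 3], "rt": [2, 3], "ti": [2, 3, 4], "is": [2, 3, 4], "st": [3]}

alphabetical = sorted(lists)                                    # ar, is, rt, st, ti
by_df = sorted(lists, key=lambda tok: (len(lists[tok]), tok))   # st, rt, ar, is, ti

print(prefix_candidates(lists, alphabetical, 4))
print(prefix_candidates(lists, by_df, 4))
```

Both orderings are correct filters; the document frequency ordering simply puts the shortest lists into the prefix, which yields fewer candidates to refine.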

  25. Exploiting Document Frequency Ordering (2/2) • Observation • By partitioning a data set, we can artificially modify the document frequencies of tokens in each partition. • We evaluate a query in each partition and take the union of the results. • We can reduce the number of candidates by utilizing different token orderings among partitions. • Because partitions have different token orderings, we need to sort the tokens of a query record separately for each partition. • Example: query x = {w1, w2} and overlap threshold α = 2 • Without partitioning: w2 is the prefix list; # of candidates is 5 • With two partitions: in one partition, w2 is the prefix list and # of candidates is 0; in the other, w1 is the prefix list and # of candidates is 0 • Total number of candidates is 0
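The example can be reproduced with made-up inverted lists. The IDs and the partition boundaries below are assumptions chosen so that the candidate counts match the slide (5 without partitioning, 0 with two partitions); the function names are hypothetical.

```python
def prefix_candidates(lists, alpha):
    """Candidates from the prefix lists under this data set's own DF ordering."""
    order = sorted(lists, key=lambda tok: (len(lists[tok]), tok))
    n_prefix = max(0, len(order) - alpha + 1)
    candidates = set()
    for token in order[:n_prefix]:
        candidates.update(lists[token])
    return candidates

def restrict(lists, ids):
    """Inverted lists of the partition that contains only the given string IDs."""
    return {tok: [sid for sid in lst if sid in ids] for tok, lst in lists.items()}

# Hypothetical data: w1 occurs in strings 1-6, w2 in strings 7-11, never together.
whole = {"w1": [1, 2, 3, 4, 5, 6], "w2": [7, 8, 9, 10, 11]}
partition_a = restrict(whole, set(range(1, 7)))    # here w2 is the rarest token
partition_b = restrict(whole, set(range(7, 12)))   # here w1 is the rarest token

print(len(prefix_candidates(whole, 2)))
print(len(prefix_candidates(partition_a, 2)) + len(prefix_candidates(partition_b, 2)))
```

Because the query tokens are re-sorted inside each partition, each partition uses an empty list as its prefix, so no candidates are generated at all.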

  26. Q&A Thank you!
