
Answering Approximate Queries Efficiently





  1. Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica Answering Approximate Queries Efficiently

  2. 30,000-Foot View of Info Systems Data Repository (RDBMS, Search Engines, etc.) Answers matching conditions Query

  3. Example: a movie database. User Tom asks: find movies starring Samuel Jackson

  4. How about our governor: Schwarrzenger? The user doesn’t know the exact spelling!

  5. Relaxing Conditions Find movies with a star “similar to” Schwarrzenger.

  6. In general: gap between queries and facts • Errors in the database: • Data often is not clean by itself • Especially true in data integration and cleansing • Example: Relation R (Star: Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson, …) vs. Relation S (Star: Keanu Reeves, Samuel L. Jackson, Schwarzenegger, Samuel L. Jackson, …) • Errors in the query: • The user doesn’t remember a string exactly • The user unintentionally types a wrong string

  7. “Did you mean…?” features in Search Engines

  8. What if we don’t want the user to change the query? Answering Queries Approximately Data Repository (RDBMS, Search Engines, etc.) Answers matching conditions approximately Query

  9. Technical Challenges • How to relax conditions? • Name: “Schwarzenegger” vs “Schwarrzenger” • Salary: “in [50K,60K]” vs “in [49K,63K]” • How to answer queries efficiently? • Index structures • Selectivity estimation See our three recent VLDB papers

  10. Rest of the talk • Selectivity estimation of fuzzy predicates • Our approach: SEPIA • Construction and maintenance of SEPIA • Experiments • Other works

  11. Queries with Fuzzy String Predicates • Stars: name similar to “Schwarrzenger” • Employees: SSN similar to “430-87-7294” • Customers: telephone number similar to “412-0964” Database • Similar to: • a domain-specific function • returns a similarity value between two strings • Examples: • Edit distance: ed(Schwarrzenger, Schwarzenegger) = 3 • Cosine similarity • Jaccard coefficient • Soundex • …

  12. Example Similarity Function: Edit Distance • A widely used metric to define string similarity • Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2
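As an illustration (not part of the original talk), the edit distance above can be computed with the standard Wagner–Fischer dynamic program; the function name is ours:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum # of insertions, deletions, substitutions turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[m][n]

print(edit_distance("Tom Hanks", "Ton Hank"))  # 2 (substitute m→n, delete s)
```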

  13. Selectivity of Fuzzy Predicates star SIMILARTO ’Schwarrzenger’ • Selectivity: # of records satisfying the predicate

  14. Selectivity Estimation: Problem Formulation star SIMILARTO ’Schwarrzenger’ • Input: a fuzzy string predicate P(q, δ) and a bag of strings • Output: # of strings s that satisfy dist(s, q) <= δ
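For concreteness, the exact answer to this formulation is a full scan; SEPIA's goal is to estimate that count without scanning. A brute-force baseline sketch (ours), with an ordinary DP edit distance standing in for dist:

```python
def ed(a: str, b: str) -> int:
    # compact row-by-row DP edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def selectivity(strings, q, delta):
    """Exact selectivity of P(q, delta): # of strings within distance delta of q."""
    return sum(1 for s in strings if ed(s, q) <= delta)

print(selectivity(["cat", "cart", "dog"], "cat", 1))  # 2
```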

  15. Why Selectivity Estimation? SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND year BETWEEN [1970,1971]; SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND year BETWEEN [1980,1989]; Movies The optimizer needs to know the selectivity of a predicate to decide a good plan.

  16. Using traditional histograms? • No “nice” order for strings • Lexicographical order? • Similar strings could be far from each other: Kammy/Cammy • Adjacent strings have different selectivities: Cathy/Catherine
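The Kammy/Cammy point can be seen directly: sorting puts a string's near-duplicates far away, so the bucket boundaries of a traditional histogram cut across similar strings (example data ours; `ed` is a standard DP edit distance):

```python
def ed(a: str, b: str) -> int:
    # compact row-by-row DP edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

names = sorted(["Cammy", "Catherine", "Cathy", "Kammy"])
print(names)                     # Kammy lands far from Cammy in sorted order
print(ed("Kammy", "Cammy"))      # 1: very similar, yet at opposite ends
print(ed("Cathy", "Catherine"))  # 5: lexicographically adjacent, yet dissimilar
```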

  17. Outline • Selectivity estimation of fuzzy predicates • Our approach: SEPIA • Overview • Proximity between strings • Estimation algorithm • Construction and maintenance of SEPIA • Experiments • Other works

  18. Our approach: SEPIA Selectivity Estimation of Approximate Predicates Intuition

  19. Proximity between Strings Edit Distance? Not discriminative enough

  20. Edit Vector from s1 to s2 • A vector <I, D, S> • I: # of insertions • D: # of deletions • S: # of substitutions in an optimal sequence of edit operations (one achieving the edit distance) • Easily computable • Not symmetric • Not unique, but tends to be (ed <= 3 → 91% unique)
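One way to obtain an edit vector (our sketch, not the paper's code) is to backtrace one optimal path through the edit-distance DP table, counting each operation type:

```python
def edit_vector(s1: str, s2: str):
    """Return (I, D, S) for one optimal op sequence turning s1 into s2."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    # Backtrace one optimal path, counting insertions/deletions/substitutions.
    I = D = S = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            if s1[i - 1] != s2[j - 1]:
                S += 1                     # substitution (else: match)
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            I += 1                         # insertion into s1
            j -= 1
        else:
            D += 1                         # deletion from s1
            i -= 1
    return (I, D, S)

print(edit_vector("Tom Hanks", "Ton Hank"))  # (0, 1, 1)
```

Note the asymmetry the slide mentions: `edit_vector("ab", "abc")` is an insertion, while `edit_vector("abc", "ab")` is a deletion.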

  21. Why Edit Vector? More discriminative

  22. SEPIA histograms: Overview

  23. Frequency table for each cluster

  24. Global PPD Table (Proximity Pair Distribution table)

  25. SEPIA histograms: summary

  26. Selectivity Estimation: ed(lukas, 2) • Do it for all v2 vectors in each cluster, for all clusters • Take the sum of these contributions

  27. Selectivity Estimation for ed(q,d) • For each cluster Ci • For each v2 in frequency table of Ci • Use (v1,v2,d) to lookup PPD • Take the sum of these f * N • Pruning possible (triangle inequality)
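The loop above can be written schematically. The data-structure shapes here are our assumptions (each cluster carries a pivot and a frequency table keyed by edit vector; the PPD maps (v1, v2, d) to the fraction of pairs with those vectors within distance d), not the paper's exact interfaces:

```python
def estimate(clusters, ppd, edit_vector, q, d):
    """Estimated # of strings within distance d of query q."""
    total = 0.0
    for pivot, freq_table in clusters:
        # (triangle-inequality pruning of whole clusters omitted for brevity)
        v1 = edit_vector(q, pivot)              # query -> pivot
        for v2, count in freq_table.items():    # pivot -> member strings
            f = ppd.get((v1, v2, d), 0.0)       # P(dist <= d | v1, v2)
            total += f * count
    return total

# Toy demo with a stub edit_vector: one cluster, one edit-vector group.
ev = lambda a, b: (0, 0, 1)
clusters = [("pivot", {(0, 0, 1): 10})]
ppd = {((0, 0, 1), (0, 0, 1), 2): 0.4}
print(estimate(clusters, ppd, ev, "query", 2))  # 0.4 * 10 = 4.0
```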

  28. Outline • Selectivity estimation of fuzzy predicates • Our approach: SEPIA • Overview • Proximity between strings • Estimation algorithm • Construction and maintenance of SEPIA • Experiments • Other works

  29. Clustering Strings Two example algorithms: • Lexicographic-order based • K-Medoids: • Choose initial pivots • Assign each string to its closest pivot • Swap a pivot with another string • Reassign the strings
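A simplified K-Medoids over strings under edit distance, following the steps above (our sketch; production implementations use smarter swap-based updates and sampling):

```python
import random

def ed(a: str, b: str) -> int:
    # compact row-by-row DP edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def k_medoids(strings, k, iters=10, seed=0):
    """Cluster strings around k pivot strings (medoids)."""
    rng = random.Random(seed)
    pivots = rng.sample(strings, k)              # choose initial pivots
    for _ in range(iters):
        # assign each string to its closest pivot
        clusters = {p: [] for p in pivots}
        for s in strings:
            clusters[min(pivots, key=lambda p: ed(s, p))].append(s)
        # swap each pivot for the member minimizing total in-cluster distance
        new_pivots = [min(members, key=lambda c: sum(ed(c, s) for s in members))
                      if members else p
                      for p, members in clusters.items()]
        if set(new_pivots) == set(pivots):       # converged: pivots unchanged
            return clusters
        pivots = new_pivots
    return clusters

print(k_medoids(["cat", "cats", "dog", "dogs"], 2))
```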

  30. Number of Clusters It affects: • Cluster quality • Similarity of strings within each cluster • Costs: • Space • Estimation time

  31. Constructing Frequency Tables • For each cluster, group strings based on their edit vector from the pivot • Count the frequency for each group

  32. Constructing PPD Table • Get enough samples of string triplets (q,p,s) • Propose a few heuristics • ALL_RAND • CLOSE_RAND • CLOSE_LEX • CLOSE_UNIQUE

  33. Dynamic Maintenance: Frequency Table Take insertion as an example

  34. Dynamic Maintenance: PPD

  35. Improving Estimation Accuracy • Reasons for estimation errors: • Missed hits in the PPD • Inaccurate percentage entries in the PPD • Improvement: use sample fuzzy predicates to analyze their estimation errors

  36. Relative-Error Model • Use the errors to build a model • Use the model to adjust initial estimation

  37. Outline • Motivation: selectivity estimation of fuzzy predicates • Our approach: SEPIA • Overview • Proximity between strings • Estimation algorithm • Construction and maintenance of SEPIA • Experiments • Other works

  38. Data • Citeseer: • 71K author names • Length: [2,20], avg = 12 • Movie records from UCI KDD repository: • 11K movie titles. • Length: [3,80], avg = 35 • Introduced duplicates: • 10% of records • # of duplicates: [1,20], uniform • Final results: • Citeseer: 142K author names • UCI KDD: 23K movie titles

  39. Setting • Test bed: • PC: 2.4GHz P4, 1.2GB RAM, Windows XP • Visual C++ compiler • Query workload: • Strings from the data • Strings not in the data • Results were similar • Quality measurements: • Relative error: (fest – freal) / freal • Absolute relative error: |fest – freal| / freal
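The two quality measures written out (fest = estimated count, freal = true count; assumes freal > 0):

```python
def relative_error(f_est: float, f_real: float) -> float:
    return (f_est - f_real) / f_real          # signed: negative = underestimate

def abs_relative_error(f_est: float, f_real: float) -> float:
    return abs(f_est - f_real) / f_real

print(relative_error(80, 100))       # -0.2
print(abs_relative_error(80, 100))   # 0.2
```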

  40. Clustering Algorithms K-Medoids is better

  41. Quartile distribution of relative errors Data set 1. CLOSE_RAND; 1000 clusters

  42. Number of Clusters

  43. Effectiveness of Applying Relative-Error Model

  44. Dynamic Maintenance

  45. Other work 1: Relaxing SQL queries with Selections/Joins SELECT * FROM Jobs J, Candidate C WHERE J.Salary <= 95 AND J.Zipcode = C.Zipcode AND C.WorkYear >= 5

  46. Query Relaxation: Skyline!

  47. Other work 2: Fuzzy predicates on attributes of mixed types SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1977| <= 3; Movies

  48. Mixed-Typed Predicates • String attributes: edit distance • Numeric attributes: absolute numeric difference SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1977| <= 3;
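A full-scan evaluation of this mixed-type predicate is the baseline a combined index such as the MAT-tree aims to beat. A sketch (ours), assuming the table is a list of (star, year) pairs:

```python
def ed(a: str, b: str) -> int:
    # compact row-by-row DP edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def scan(movies, star_q, delta, year0, window):
    """Mixed predicate: edit distance on the string attribute,
    absolute difference on the numeric attribute."""
    return [(s, y) for s, y in movies
            if ed(s, star_q) <= delta and abs(y - year0) <= window]

movies = [("Schwarzenegger", 1977), ("Schwarzenegger", 1990), ("Stallone", 1976)]
print(scan(movies, "Schwarrzenger", 3, 1977, 3))  # [("Schwarzenegger", 1977)]
```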

  49. MAT-tree: Intuition • Indexing on two attributes is more effective than two separate indexing structures • Numeric attribute: B-tree • String attribute: tree-based index structure?

  50. MAT-tree: Overview • Tree-based indexing structure: • Each node has MBR for both numeric attribute and string attribute • Compressing strings as a “compressed trie” that fits into a limited space • An edit distance between a string and compressed trie can be computed • Experiments show that MAT-tree is very efficient
