slide1 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Answering Approximate Queries Efficiently PowerPoint Presentation
Download Presentation
Answering Approximate Queries Efficiently

Loading in 2 Seconds...

play fullscreen
1 / 51

Answering Approximate Queries Efficiently - PowerPoint PPT Presentation


  • 109 Views
  • Uploaded on

Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica. Answering Approximate Queries Efficiently. 30,000-Foot View of Info Systems. Data Repository (RDBMS, Search Engines, etc.). Answers matching conditions. Query. Tom.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Answering Approximate Queries Efficiently' - poppy


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1
Chen Li

Department of Computer Science

Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica

Answering Approximate Queries Efficiently

30 000 foot view of info systems
30,000-Foot View of Info Systems

Data Repository (RDBMS, Search Engines, etc.)

Answers matching conditions

Query

example a movie database

Tom

Find movies starred Samuel Jackson

Example: a movie database
how about our governor schwarrzenger
How about our governor: Schwarrzenger?

The user doesn’t know the exact spelling!

relaxing conditions
Relaxing Conditions

Find movies with a star “similar to” Schwarrzenger.

in general gap between queries and facts

Errors in the database:

    • Data often is not clean by itself
    • Especially true in data integration and cleansing

Relation S

Relation R

Star

Star

Keanu Reeves

Keanu Reeves

Samuel Jackson

Samuel L. Jackson

In general: Gap between Queries and Facts

Schwarzenegger

Schwarzenegger

Samuel Jackson

Samuel L. Jackson

  • Errors in the query
    • The user doesn’t remember a string exactly
    • The user unintentionally types a wrong string
what if we don t want the user to change the query answering queries approximately
What if we don’t want the user to change the query?Answering Queries Approximately

Data Repository (RDBMS, Search Engines, etc.)

Answers matching conditions approximately

Query

technical challenges
Technical Challenges
  • How to relax conditions?
    • Name: “Schwarzenegger” vs “Schwarrzenger”
    • Salary: “in [50K,60K]” vs “in [49K,63K]”
  • How to answer queries efficiently?
    • Index structures
    • Selectivity estimation

See our three recent VLDB papers

rest of the talk
Rest of the talk
  • Selectivity estimation of fuzzy predicates
  • Our approach: SEPIA
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works
queries with fuzzy string predicates
Queries with Fuzzy String Predicates
  • Stars: name similar to “Schwarrzenger”
  • Employees: SSN similar to “430-87-7294”
  • Customers: telephone number similar to “412-0964”

Database

  • Similar to:
    • a domain-specific function
    • returns a similarity value between two strings
  • Examples:
    • Edit distance: ed(Schwarrzenger, Schwarzenegger)=2
    • Cosine similarity
    • Jaccard coefficient distance
    • Soundex
example similarity function edit distance
Example Similarity Function: Edit Distance
  • A widely used metric to define string similarity
  • Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2
  • Example:

s1: Tom Hanks

s2: Ton Hank

ed(s1,s2) = 2

selectivity of fuzzy predicates
Selectivity of Fuzzy Predicates

star SIMILARTO ’Schwarrzenger’

  • Selectivity: # of records satisfying the predicate
selectivity estimation problem formulation
Selectivity Estimation: Problem Formulation

star SIMILARTO ’Schwarrzenger’

Input: fuzzy string predicate P(q, δ)

A bag of strings

Output: # of strings s that satisfy dist(s,q) <= δ

why selectivity estimation
Why Selectivity Estimation?

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND year BETWEEN [1970,1971];

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND year BETWEEN [1980,1989];

Movies

The optimizer needs to know the selectivity of a predicate to decide a good plan.

using traditional histograms
Using traditional histograms?
  • No “nice” order for strings
  • Lexicographical order?
    • Similar strings could be far from each other: Kammy/Cammy
    • Adjacent strings have different selectivities: Cathy/Catherine
outline
Outline
  • Selectivity estimation of fuzzy predicates
  • Our approach: SEPIA
    • Overview
    • Proximity between strings
    • Estimation algorithm
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works
our approach sepia
Our approach: SEPIA

Selectivity Estimation of Approximate Predicates

Intuition

proximity between strings
Proximity between Strings

Edit Distance? Not discriminative enough

edit vector from s1 to s2
Edit Vector from s1 to s2
  • A vector <I, D, S>
    • I: # of insertions
    • D: # of deletions
    • S: # of substitutions

in a sequence of edit operations with their edit distance

  • Easily computable
  • Not symmetric
  • Not unique, but tend to be (ed <= 3  91% unique)
why edit vector
Why Edit Vector?

More discriminative

global ppd table
Global PPD Table

Proximity Pair Distribution table

selectivity estimation ed lukas 2
Selectivity Estimation: ed(lukas, 2)
  • Do it for all v2 vectors in each cluster, for all clusters
  • Take the sum of these contributions
selectivity estimation for ed q d
Selectivity Estimation for ed(q,d)
  • For each cluster Ci
  • For each v2 in frequency table of Ci
  • Use (v1,v2,d) to lookup PPD
  • Take the sum of these f * N
  • Pruning possible (triangle inequality)
outline1
Outline
  • Selectivity estimation of fuzzy predicates
  • Our approach: SEPIA
    • Overview
    • Proximity between strings
    • Estimation algorithm
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works
clustering strings
Clustering Strings

Two example algorithms

  • Lexicographic order based.
  • K-Medoids
    • Choose initial pivots
    • Assign strings to its closest pivot
    • Swap a pivot with another string
    • Reassign the strings
number of clusters
Number of Clusters

It affects:

  • Cluster quality
    • Similarity of strings within each cluster
  • Costs:
    • Space
    • Estimation time
constructing frequency tables
Constructing Frequency Tables
  • For each cluster, group strings based on their edit vector from the pivot
  • Count the frequency for each group
constructing ppd table
Constructing PPD Table
  • Get enough samples of string triplets (q,p,s)
  • Propose a few heuristics
    • ALL_RAND
    • CLOSE_RAND
    • CLOSE_LEX
    • CLOSE_UNIQUE
dynamic maintenance frequency table
Dynamic Maintenance: Frequency Table

Take insertion as an example

improving estimation accuracy
Improving Estimation Accuracy
  • Reasons of estimate errors
    • Miss hits in PPD.
    • Inaccurate percentage entries in PPD.
  • Improvement: use sample fuzzy predicates to analyze their estimation errors
relative error model
Relative-Error Model
  • Use the errors to build a model
  • Use the model to adjust initial estimation
outline2
Outline
  • Motivation: selectivity estimation of fuzzy predicates
  • Our approach: SEPIA
    • Overview
    • Proximity between strings
    • Estimation algorithm
  • Construction and maintenance of SEPIA
  • Experiments
  • Other works
slide38
Data
  • Citeseer:
    • 71K author names
    • Length: [2,20], avg = 12
  • Movie records from UCI KDD repository:
    • 11K movie titles.
    • Length: [3,80], avg = 35
  • Introduced duplicates:
    • 10% of records
    • # of duplicates: [1,20], uniform
  • Final results:
    • Citeseer: 142K author names
    • UCI KDD: 23K movie titles
setting
Setting
  • Test bed
    • PC: 2.4G P4, 1.2GB RAM, Windows XP
    • Visual C++ compiler
  • Query workload:
    • Strings from the data
    • String not in the data
    • Results similar
  • Quality measurements
    • Relative error: (fest – freal) / freal
    • Absolute relative error : |fest – freal | / freal
clustering algorithms
Clustering Algorithms

K-Metoids is better

quartile distribution of relative errors
Quartile distribution of relative errors

Data set 1. CLOSE_RAND; 1000 clusters

other work 1 relaxing sql queries with selections joins
Other work 1: Relaxing SQL queries with Selections/Joins

SELECT * FROM Jobs J, Candidate CWHERE J.Salary <= 95 AND J.Zipcode = C.Zipcode AND C.WorkYear >= 5

other work 2 fuzzy predicates on attributes of mixed types
Other work 2: Fuzzy predicates on attributes of mixed types

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND |year – 1977| <= 3;

Movies

mixed typed predicates
Mixed-Typed Predicates
  • String attributes: edit distance
  • Numeric attributes: absolute numeric difference

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND |year – 1977| <= 3;

mat tree intuition
MAT-tree: Intuition
  • Indexing on two attributes is more effective than two separate indexing structures
  • Numeric attribute: B-tree
  • String attribute: tree-based index structure?
mat tree overview
MAT-tree: Overview
  • Tree-based indexing structure:
    • Each node has MBR for both numeric attribute and string attribute
  • Compressing strings as a “compressed trie” that fits into a limited space
  • An edit distance between a string and compressed trie can be computed
  • Experiments show that MAT-tree is very efficient
conclusion
Conclusion
  • It’s important to support answering approximate queries efficiently
  • Our results so far:
    • SEPIA: provides accurate selectivity estimation for fuzzy string predicates
    • Relaxing SQL queries with selections and joins
    • MAT-tree: indexing structure supporting fuzzy queries with mixed-types predicates