
Query Processing over Incomplete Autonomous Databases

Garrett Wolf (Arizona State University)

Hemal Khatri (MSN Live Search)

Bhaumik Chokshi (Arizona State University)

Jianchun Fan (Amazon)

Yi Chen (Arizona State University)

Subbarao Kambhampati (Arizona State University)

Introduction
  • More and more data is becoming accessible via web servers backed by databases
    • E.g. Cars.com, Realtor.com, Google Base, etc.

Incompleteness in Web Databases
  • Inaccurate Extraction / Recognition
  • Incomplete Entry
  • Heterogeneous Schemas
  • User-defined Schemas


Want a ‘Honda Accord’ with a ‘sedan’ body style for under ‘$12,000’

Problem
  • Current autonomous database systems only return certain answers, namely those which exactly satisfy all the user query constraints.

High Precision

Low Recall

Many entities corresponding to tuples with missing values might be relevant to the user query

How can these relevant uncertain results be retrieved in a ranked fashion?

Possible Naïve Approaches

Query Q: (Body Style = Convt)

1. CERTAINONLY: Return certain answers only as in traditional databases, namely those having Body Style = Convt

Low Recall

2. ALLRETURNED: Null matches any concrete value, hence return all answers having Body Style = Convt along with answers having body style as null

Low Precision, Infeasible

3. ALLRANKED: Return all answers having Body Style = Convt. Additionally, rank all answers having body style as null by predicting the missing values and return them to the user

Costly, Infeasible
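A minimal sketch of the first two naive strategies on a handful of invented car tuples (the data, field names, and function names here are illustrative, not from the paper):

```python
# Toy tuples; None marks a missing Body value.
cars = [
    {"model": "Z4",      "body": "Convt"},
    {"model": "A4",      "body": "Convt"},
    {"model": "Civic",   "body": "Coupe"},
    {"model": "Boxster", "body": None},
]

def certain_only(tuples, body):
    """CERTAINONLY: return only tuples that exactly satisfy Body=body."""
    return [t for t in tuples if t["body"] == body]

def all_returned(tuples, body):
    """ALLRETURNED: certain answers plus every tuple with a null body."""
    return [t for t in tuples if t["body"] == body or t["body"] is None]

print(len(certain_only(cars, "Convt")))  # 2 certain answers
print(len(all_returned(cars, "Convt")))  # 3: the null tuple comes back unranked
```

ALLRANKED would additionally score each null tuple by its predicted probability of being a convertible, which is what makes it costly over an autonomous source.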

Outline
  • Core Techniques
  • Peripheral Techniques
  • Implementation & Evaluation
  • Conclusion & Future Work

The QPIAD Solution

Given a query Q:(Body=Convt), retrieve all relevant tuples.

Base Result Set

LEARN

AFD: Model ~> Body

REWRITE

  • Select Top-K Rewritten Queries
    • Q1’: Model=A4
    • Q2’: Model=Z4
    • Q3’: Model=Boxster

RANK

Re-order queries based on Estimated Precision

Ranked Relevant Uncertain Answers

EXPLAIN


LEARN – REWRITE – RANK – EXPLAIN

{Ai,…,Ak} ~> Am, with 0 < conf ≤ 1

Learning Statistics to Support Ranking & Rewriting
  • Value Distributions - Naïve Bayes Classifiers (NBC)
  • What is hard?
    • Learning correlations useful for rewriting
    • Efficiently assessing the probability distribution
    • Cannot modify the underlying autonomous sources
  • Attribute Correlations - Approximate Functional Dependencies (AFDs) & Approximate Keys (AKeys)

Make, Body ~> Model

EstPrec(Q′) = P(Am=vm | dtrSet(Am))

P(Model=Accord | Make=Honda, Body=Coupe)
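A toy sketch of the statistics learned from a sampled portion of the source: the AFD {Make, Body} ~> Model gives the determining set, and a conditional distribution over the determined attribute is estimated from the sample (plain counting here; the system proper uses a Naive Bayes classifier with AFD-selected features). The sample rows and the `value_distribution` helper are invented for illustration:

```python
from collections import Counter

# Invented sample of (Make, Body, Model) rows drawn from the source.
sample = [
    ("Honda", "Coupe", "Accord"),
    ("Honda", "Coupe", "Accord"),
    ("Honda", "Coupe", "Civic"),
    ("Honda", "Sedan", "Accord"),
]

def value_distribution(rows, make, body):
    """Estimate P(Model = m | Make = make, Body = body) from observed counts."""
    counts = Counter(model for mk, bd, model in rows
                     if mk == make and bd == body)
    total = sum(counts.values())
    return {model: c / total for model, c in counts.items()}

dist = value_distribution(sample, "Honda", "Coupe")
print(dist["Accord"])  # 2 of the 3 matching sample rows are Accords
```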


LEARN – REWRITE – RANK – EXPLAIN

Rewriting to Retrieve Relevant Uncertain Results
  • What is hard?
    • Retrieving relevant uncertain tuples with missing values
    • Rewriting queries given the limited query access patterns

Base Set for Q:(Body=Convt)

AFD: Model ~> Body

  • Given the AFD and the Base Set, tuples with
      • Model of A4, Z4, or Boxster, and
      • Body of NULL
    are likely to actually be convertibles.

  • Generate rewritten queries for each distinct Model:
    • Q1’: Model=A4
    • Q2’: Model=Z4
    • Q3’: Model=Boxster
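The rewriting step above can be sketched as follows: run the original query, then issue one rewritten query per distinct value of the determining attribute seen in the base set. The tuple data and function name are illustrative assumptions:

```python
# Invented base set returned for Q:(Body=Convt).
base_set = [
    {"model": "A4",      "body": "Convt"},
    {"model": "Z4",      "body": "Convt"},
    {"model": "A4",      "body": "Convt"},
    {"model": "Boxster", "body": "Convt"},
]

def rewritten_queries(base, determining_attr="model"):
    """One rewritten query per distinct determining-attribute value,
    preserving first-seen order."""
    seen = []
    for t in base:
        v = t[determining_attr]
        if v not in seen:
            seen.append(v)
    return [{determining_attr: v} for v in seen]

print(rewritten_queries(base_set))
# [{'model': 'A4'}, {'model': 'Z4'}, {'model': 'Boxster'}]
```

Each rewritten query is then pushed to the autonomous source through its normal (form-based) query interface, which is why nulls never have to be queried directly.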


LEARN – REWRITE – RANK – EXPLAIN

Selecting/Ordering Top-K Rewritten Queries
  • What is hard?
    • Retrieving precise, non-empty result sets
    • Working under source-imposed resource limitations
  • Select top-k queries based on F-Measure

P – Estimated Precision

R – Estimated Recall

  • Reorder queries based on Estimated precision

All tuples returned for a single query are ranked equally

  • Retrieves tuples in order of their final ranks
  • No need to re-rank tuples after retrieving them!
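The select-then-order step can be sketched as follows. The α-weighted F-measure form F = (1+α)·P·R / (α·P + R) and all numbers below are illustrative assumptions, not values from the paper:

```python
def f_measure(p, r, alpha=0.1):
    """Weighted F-measure favoring precision for small alpha (assumed form)."""
    return (1 + alpha) * p * r / (alpha * p + r)

# (rewritten query, estimated precision, estimated recall) -- invented numbers
candidates = [
    ("Model=A4",      0.97, 0.20),
    ("Model=Z4",      0.95, 0.40),
    ("Model=Boxster", 0.99, 0.05),
    ("Model=Miata",   0.30, 0.50),
]

def top_k_ordered(cands, k=2):
    # SELECT the top-k queries by F-measure...
    by_f = sorted(cands, key=lambda c: f_measure(c[1], c[2]), reverse=True)
    top = by_f[:k]
    # ...then ORDER them by estimated precision, so tuples arrive pre-ranked.
    return sorted(top, key=lambda c: c[1], reverse=True)

print([q for q, p, r in top_k_ordered(candidates)])
# ['Model=A4', 'Model=Z4']
```

Because each rewritten query's results are ranked equally, issuing the queries in precision order retrieves tuples already in final rank order.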


LEARN – REWRITE – RANK – EXPLAIN

Explaining Results to the Users
  • What is hard?
    • Gaining the user’s trust
    • Generating meaningful explanations

Explanations are based on AFDs: the AFD

Make, Body ~> Model

yields

“This car is 83% likely to have Model=Accord given that its Make=Honda and Body=Sedan”

Provide to the user:

  • Certain Answers
  • Relevant Uncertain Answers
  • Explanations
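A toy sketch of generating such an explanation string by combining the AFD's determining set with the classifier's probability estimate. The 83% figure mirrors the slide's example; the `explain` helper and its formatting are our own illustration:

```python
def explain(attr, value, prob, evidence):
    """Render a user-facing explanation for one relevant uncertain answer."""
    given = " and ".join(f"{a}={v}" for a, v in evidence.items())
    return (f"This car is {prob:.0%} likely to have {attr}={value} "
            f"given that its {given}")

msg = explain("Model", "Accord", 0.83, {"Make": "Honda", "Body": "Sedan"})
print(msg)
# This car is 83% likely to have Model=Accord given that its Make=Honda and Body=Sedan
```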

Outline
  • Core Techniques
  • Peripheral Techniques
  • Implementation & Evaluation
  • Conclusion & Future Work


Leveraging Correlation between Data Sources

Mediator: GS(Body, Make, Model, Year, Price, Mileage)

Local sources:

  • LS(Body, Make, Model, Year, Price, Mileage)
  • LS(Make, Model, Year, Price, Mileage)

Using AFDs learned from Cars.com (Base Set + AFD: Model ~> Body), the mediator rewrites

Q:(Body=Coupe)

as

Q:(Model=Civic) UNION Q:(Model=Corvette)

for the source that does not support the Body attribute.

Two main uses:

  • Source doesn’t support all the attributes in GS
  • Sample/statistics aren’t available
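A sketch of answering a Body query against a source whose schema lacks Body, using model correlations learned from another source. The data, the `coupe_models` set, and the helper name are invented for illustration:

```python
# LS(Make, Model, ...) -- this source has no Body attribute at all.
source_without_body = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Chevy", "model": "Corvette"},
    {"make": "Honda", "model": "Accord"},
]

# Learned elsewhere (e.g. from Cars.com): models whose Body is likely Coupe.
coupe_models = {"Civic", "Corvette"}

def answer_body_query(source, likely_models):
    """UNION of one rewritten query Q':(Model=m) per correlated model m."""
    results = []
    for m in likely_models:
        results.extend(t for t in source if t["model"] == m)
    return results

ans = answer_body_query(source_without_body, sorted(coupe_models))
print([t["model"] for t in ans])  # the Civic and Corvette tuples
```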


Handling Aggregate and Join Queries

  • Aggregate Queries, e.g. Q:(Count(*) Where Body=Convt)
    • Weighted: include a portion of each tuple relative to the probability its missing value matches the query constraint
      • t1 + t3 + .9(t2) + .4(t4) = 3.3, so Count(*) = 3.3
    • Most-likely value: only include tuples whose most likely missing value matches the query constraint
      • t1 + t2 + t3 = 3, so Count(*) = 3
  • Join Queries, e.g. joining on Make=Honda
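The two counting strategies can be sketched as follows; the per-tuple probabilities are the slide's illustrative numbers (0.9 and 0.4), and the 0.5 cutoff for "most likely" is our assumption:

```python
# Toy tuples for Q:(Count(*) Where Body=Convt); p_convt is the
# (predicted) probability that the tuple's Body is Convt.
tuples = [
    {"id": "t1", "body": "Convt", "p_convt": 1.0},
    {"id": "t2", "body": None,    "p_convt": 0.9},
    {"id": "t3", "body": "Convt", "p_convt": 1.0},
    {"id": "t4", "body": None,    "p_convt": 0.4},
]

def weighted_count(ts):
    """Count a fraction of each tuple equal to P(missing value matches)."""
    return sum(t["p_convt"] for t in ts)

def most_likely_count(ts, threshold=0.5):
    """Count a tuple only if its most likely value matches the constraint."""
    return sum(1 for t in ts if t["p_convt"] > threshold)

print(weighted_count(tuples))     # 3.3 = 1 + 0.9 + 1 + 0.4
print(most_likely_count(tuples))  # 3   (t1, t2, t3)
```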

Outline
  • Core Techniques
  • Peripheral Techniques
  • Implementation & Evaluation
  • Conclusion & Future Work

QPIAD Web Interface

http://rakaposhi.eas.asu.edu/qpiad

Empirical Evaluation
  • Datasets:
    • Cars
      • Cars.com
      • 7 attributes, 55,000 tuples
    • Complaints
      • NHTSA Office of Defects Investigation
      • 11 attributes, 200,000 tuples
    • Census
      • US Census dataset, UCI repository
      • 12 attributes, 45,000 tuples
  • Sample Size:
    • 3-15% of full database
  • Incompleteness:
    • 10% of tuples contain missing values
    • Artificially introduced null values in order to compare with the ground truth
  • Evaluation:
    • Ranking and Rewriting Methods (e.g. quality, efficiency, etc.)
    • General Queries (e.g. correlated sources, aggregates, joins, etc.)
    • Learning Methods (e.g. accuracy, sample size, etc.)
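The null-introduction step of the evaluation can be sketched as follows: hide values in a fraction of complete tuples so predictions can later be checked against the known ground truth. The 10% rate matches the slide; the data, attribute, and seed are invented:

```python
import random

def introduce_nulls(tuples, attr, rate=0.10, seed=42):
    """Blank out `attr` in ~rate of the tuples, remembering the hidden values."""
    rng = random.Random(seed)
    ground_truth = {}
    incomplete = []
    for i, t in enumerate(tuples):
        t = dict(t)                      # don't mutate the caller's tuples
        if rng.random() < rate:
            ground_truth[i] = t[attr]    # remember the true value
            t[attr] = None
        incomplete.append(t)
    return incomplete, ground_truth

data = [{"body": "Convt"} for _ in range(1000)]
inc, gt = introduce_nulls(data, "body")
print(len(gt))  # roughly 100 of the 1000 tuples now have a null body
```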

Experimental Results – Ranking & Rewriting
  • QPIAD vs. ALLRETURNED - Quality

ALLRETURNED – all certain answers + all answers with nulls on constrained attributes

Experimental Results – Ranking & Rewriting
  • QPIAD vs. ALLRANKED - Efficiency

ALLRANKED – all certain answers + all answers with predicted missing value probability above a threshold

Experimental Results – General Queries
  • Aggregates
  • Joins

Experimental Results – Learning Methods
  • Accuracy of Classifiers

Experimental Summary
  • Rewriting / Ranking
    • Quality – QPIAD achieves higher precision than ALLRETURNED by only retrieving the relevant tuples
    • Efficiency – QPIAD requires fewer tuples to be retrieved to obtain the same level of recall as ALLRANKED
  • Learning Methods
    • AFDs for feature selection improved accuracy
  • General Queries
    • Aggregate queries achieve higher accuracy when missing value prediction is used
    • QPIAD achieves higher recall for join queries while trading off only a small amount of precision
  • Additional Experiments
    • Robustness of learning methods w.r.t. sample size
    • Effect of alpha value on F-measure
    • Effectiveness of using correlation between sources

Outline
  • Core Techniques
  • Peripheral Techniques
  • Implementation & Evaluation
  • Conclusion & Future Work

Related Work

All citations can be found in the paper.

  • Querying Incomplete Databases
    • Possible-world approaches track the completions of incomplete tuples (Codd tables, V-tables, conditional tables)
    • Probabilistic approaches quantify a distribution over completions to distinguish the likelihood of the various possible answers
  • Probabilistic Databases
    • Tuples are associated with an attribute describing the probability of their existence
    • However, in our work, the mediator does not have the capability to modify the underlying autonomous databases
  • Query Reformulation / Relaxation
    • Aims to return similar or approximate answers to the user after returning or in the absence of exact answers
    • Our focus is on retrieving tuples with missing values on constrained attributes
  • Learning Missing Values
    • Common imputation approaches replace missing values by substituting the mean, most common value, default value, or using kNN, association rules, etc.
    • Our work requires schema level dependencies between attributes as well as distribution information over missing values
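A sketch of two of the imputation baselines mentioned above (mean and most-common-value substitution), to contrast with keeping a full distribution over completions. The data is invented:

```python
from collections import Counter
from statistics import mean

prices = [12000, 14000, None, 10000]
bodies = ["Sedan", "Coupe", "Sedan", None]

def impute_mean(xs):
    """Replace each None with the mean of the observed numeric values."""
    m = mean(x for x in xs if x is not None)
    return [m if x is None else x for x in xs]

def impute_mode(xs):
    """Replace each None with the most common observed value."""
    m = Counter(x for x in xs if x is not None).most_common(1)[0][0]
    return [m if x is None else x for x in xs]

print(impute_mean(prices))  # the None becomes 12000 (the mean)
print(impute_mode(bodies))  # the None becomes 'Sedan' (the mode)
```

Both baselines commit to a single completion per tuple, which is exactly the information loss a distribution-based approach avoids.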


Contributions
  • Efficiently retrieve relevant uncertain answers from autonomous sources given only limited query access patterns
    • Query Rewriting
  • Retrieve answers with missing values on constrained attributes without modifying the underlying databases
    • AFD-Enhanced Classifiers
  • Rewriting & ranking considers the natural tension between precision and recall
    • F-Measure based ranking
  • AFDs play a major role in:
    • Query Rewriting
    • Feature Selection
    • Explanations


Current Directions – QUIC (CIDR ’07 Demo)

http://rakaposhi.eas.asu.edu/quic

Relevance Function / Density Function

Incomplete Data

Databases are often populated by:

  • Lay users entering data
  • Automated extraction

Imprecise Queries

User’s needs are not clearly defined:

  • Queries may be too general
  • Queries may be too specific

General Solution: “Expected Relevance Ranking”

Challenge: Automated & Non-intrusive assessment of Relevance and Density functions

Estimating Relevance (R):

Learn relevance for user population as a whole in terms of value similarity

  • Sum of weighted similarity for each constrained attribute
    • Content Based Similarity
    • Co-click Based Similarity
    • Co-occurrence Based Similarity
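A toy sketch of the expected-relevance idea: score each answer by its relevance (a weighted sum of per-attribute similarities to the query) scaled by its density. The weights, similarity values, and combination by product are illustrative assumptions:

```python
def expected_relevance(similarities, weights, density):
    """Weighted sum of per-attribute similarities, scaled by density."""
    rel = sum(w * s for w, s in zip(weights, similarities))
    return rel * density

# Two hypothetical answers to an imprecise query over (Make, Price).
a1 = expected_relevance([1.0, 0.8], [0.6, 0.4], density=0.9)
a2 = expected_relevance([0.9, 0.9], [0.6, 0.4], density=0.5)
print(a1 > a2)  # the denser, similarly relevant answer ranks first
```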
