
An Investigation of the cost and accuracy tradeoffs of Supplanting AFDs with Bayes Network in Query Processing in the Presence of Incompleteness in Autonomous Databases

MS Thesis Defense

Rohit Raghunathan

August 19th, 2011

Committee Members

Dr. Subbarao Kambhampati (Chair)

Dr. Joohyung Lee

Dr. Huan Liu


Overview of the talk

  • Introduction to Incomplete Autonomous Databases

  • Overview of QPIAD and shortcomings of AFD-based approaches

  • Our approach: Bayes network based imputation and query rewriting


Introduction to Web databases

  • Many websites let users issue queries through a form-based interface backed by a database

  • Examples include used-car websites such as Cars.com and Yahoo! Autos


Incompleteness in Web databases

  • Web databases are often populated by lay individuals without any curation, e.g., Cars.com and Yahoo! Autos

  • Web databases are also populated using automated information extraction techniques, which are inherently imperfect

  • Incomplete/uncertain tuple: a tuple in which one or more attributes have a missing value


Problem Statement

  • Many entities corresponding to tuples with missing values might be relevant to the user query

  • Traditional query processing does not retrieve such tuples

Q: Make = Honda
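A minimal sketch of why traditional query processing misses such tuples. The three-row relation and the `select` helper are hypothetical and only illustrate that an equality test never matches a missing value:

```python
# Toy relation; None marks a missing value. All rows are illustrative.
cars = [
    {"id": "T1", "make": "Honda", "model": "Civic"},
    {"id": "T2", "make": None, "model": "Accord"},
    {"id": "T3", "make": "Audi", "model": "A8"},
]


def select(rows, attr, value):
    """Certain answers only: a missing value never satisfies an equality test."""
    return [r["id"] for r in rows if r[attr] == value]


# Q: Make = Honda returns T1 but not T2, although T2 is likely relevant.
certain = select(cars, "make", "Honda")
```

Retrieving T2 needs either a filled-in Make value (imputation) or a different query that reaches it through its other attributes (query rewriting).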


Dimensions of the problem

  • Single vs. multiple missing values

    • Multiple missing values require capturing the correlations between them

  • Imputation vs. query rewriting

    • Imputation can look at all available evidence

    • Query rewriting requires finding the smallest number of evidence attributes

      • Constraining on all the evidence -> reduces throughput

      • Constraining on very little evidence -> reduces precision

      • Need to find a middle ground

User query Q: Model = A8

Rewritten query: Make = Audi ∧ Body = Sedan


Approximate Functional Dependencies (AFDs)

  • AFDs are functional dependencies that hold on all but a small fraction of the database

Model → Body : 0.75

Make → Body : 0.75

Model → Make : 1.0

  • An AFD is of the form X → A, where X is a set of attributes and A is a single attribute

  • An attribute can have multiple rules
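A minimal sketch of how an AFD confidence could be computed from data. It assumes confidence is defined as the largest fraction of tuples that can be kept so that X → A holds exactly (one common definition; the slides do not state which one is used). The four-row table and the function name `afd_confidence` are hypothetical, chosen so that the results match the three confidences listed above:

```python
from collections import Counter, defaultdict


def afd_confidence(rows, lhs, rhs):
    """Fraction of tuples that can be kept so that lhs -> rhs holds exactly."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key][row[rhs]] += 1
    # Within each lhs group, keep only the most frequent rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)


rows = [
    {"model": "A8", "make": "Audi", "body": "Sedan"},
    {"model": "A8", "make": "Audi", "body": "Sedan"},
    {"model": "Civic", "make": "Honda", "body": "Sedan"},
    {"model": "Civic", "make": "Honda", "body": "Coupe"},
]

model_to_make = afd_confidence(rows, ["model"], "make")
model_to_body = afd_confidence(rows, ["model"], "body")
make_to_body = afd_confidence(rows, ["make"], "body")
```

Because every attribute can appear on the right-hand side of several such rules, a method that uses them has to decide which rule to trust.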


Overview of QPIAD

Q: Body = Sedan

  • QPIAD uses AFDs and Naïve Bayes classifiers to retrieve relevant uncertain answers

  • When the mediator has access privileges to modify the underlying data source

    • Missing values can be completed by a simple classification task (imputation)

    • After that, traditional query processing suffices

  • When the mediator does not have such privileges

    • Generate a set of rewritten queries and issue them to the autonomous database (query rewriting)

      Using the AFD Model → Body : 0.75, issuing

      Q1: Model = Tl

      Q2: Model = 745

      retrieves the relevant incomplete answers T2 and T4.

  • QPIAD uses only the highest-confidence AFD of each attribute for imputation and query rewriting

  • Techniques for combining multiple AFDs have been shown to be ineffective
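The rewriting step can be sketched as follows, assuming the base result set for Q: Body = Sedan is already available. The tuples and the helper name `rewrite_queries` are hypothetical (the slide's example table is not reproduced in the transcript); the idea is one rewritten query per distinct determining-set value seen in the certain answers:

```python
def rewrite_queries(base_result_set, determining_set):
    """One rewritten query per distinct determining-set value in the base set."""
    seen = []
    for row in base_result_set:
        query = {a: row[a] for a in determining_set}
        if query not in seen:
            seen.append(query)
    return seen


# Hypothetical certain answers to Q: Body = Sedan.
base = [
    {"id": "T1", "model": "Tl", "body": "Sedan"},
    {"id": "T3", "model": "745", "body": "Sedan"},
    {"id": "T5", "model": "Tl", "body": "Sedan"},
]

# With the AFD Model -> Body, the determining set is just Model.
queries = rewrite_queries(base, ["model"])
```

Each rewritten query is then issued to the source to fetch tuples whose Body value is missing.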


Shortcomings of AFD-based approaches

  • The principles of locality and detachment do not hold for uncertain reasoning

  • Model → Body (0.7)

  • Intuitively, this means that the model of a car determines its body style with probability 0.7 when no other evidence is available.

  • When other evidence is present, there is no easy way to combine the probabilities


Shortcomings of AFD-based approaches

  • Imputing the missing values in T2 using a single AFD ignores the influence of other attributes

  • Imputing the missing values in T1 ignores the correlation between the attributes Model and Year

  • Imputing the missing values in T6 sends the AFDs into a cycle

  • Model → Make, Make → Model


Overview of the talk

  • Introduction to Incomplete Autonomous Databases

  • Overview of QPIAD and shortcomings of AFD-based approaches

  • Our approach: Bayes network based imputation and query rewriting

    • Introduction

    • Learning Bayes network models from data

    • Imputation

      • Single and multiple missing values

      • Varying levels of incompleteness in test data

    • Query Rewriting

      • Bayes network based rewriting

      • Comparison of Bayes network based rewriting and AFDs


Bayes network

  • A Bayes network is a DAG representing the probabilistic dependencies between attributes

  • It is a compact representation of the full joint distribution

    • Therefore, the influence of all variables is accounted for

  • It represents the generative model of the autonomous database

  • CPDs (conditional probability distributions) model the strength of the probabilistic dependencies

[Figure: Bayes network over the attributes Year, Model, Mileage, Make, and Body]
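The "compact representation" claim rests on the standard factorization of a Bayes network. Written generically, with Pa(X_i) denoting the parents of attribute X_i in the DAG (the parent sets of the learned network are in the figure and are not assumed here):

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

Each factor is one CPD, so the network stores one small table per attribute instead of a single table over all value combinations.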


Challenges in using Bayes networks for handling incompleteness in autonomous databases

  • Learning and inference with Bayes networks are computationally harder than with AFDs

    • Learning the topology and parameters from data involves searching over the space of topologies

      • But can be done offline

    • Inference in a general Bayes network is intractable.

      • But can use approximate inference

Question: Can we get benefits of exact inference while containing costs?


Learning a Bayes network model

  • Structure and parameter learning from data

    • Challenge: involves searching over topologies

    • Use the Banjo software package as a black box

    • Experiments show the learned topology is robust w.r.t.

      • Sample size (5-20%): same topology

      • Search time (5-30 minutes): same topology

      • Max parent count (2-4): same topology; significantly more networks are examined when the limit is 2


Inference in Bayes networks

  • Exact techniques

    • NP-hard in the general case, so they do not scale well as incompleteness increases

    • Junction Tree (fastest, but inapplicable when the query variables do not form a clique)

    • Variable Elimination

  • Approximate techniques (scale well while retaining the accuracy of exact methods)

    • Gibbs Sampling

    • The Infer.NET package also allows us to use Expectation Propagation inference


Imputation

  • Experimental Setup

    • Test Databases: Cars.com database containing 8K tuples and Adult Database from UCI repository containing 15K tuples

    • Bayes net inference

      • Exact inference: Junction Tree, Variable Elimination

      • Approximate inference: Gibbs Sampling


Imputation

  • Remove all the values for the attribute being predicted

  • Substitute missing value with most likely value

  • AFD-approach

    • Use only highest confidence AFD (Use all attributes if confidence is low, e.g., mileage(Cars)). Called Hybrid-one by authors of QPIAD.

  • Bayes net

    • Infer the posterior distribution of missing attribute, given evidences of the other attributes in the tuple
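The Bayes net step can be sketched on a tiny example. The chain Make → Model → Body, the CPD numbers, and the function name `posterior_model` are all assumptions made for illustration; the thesis network and its inference engines (Junction Tree, Variable Elimination, Gibbs Sampling) are not reproduced here. The posterior over the missing attribute is computed by plain enumeration and the most likely value is substituted:

```python
# Hypothetical CPDs for a chain Make -> Model -> Body (numbers are made up).
p_model_given_make = {
    "BMW": {"745": 0.4, "Z4": 0.6},
}
p_body_given_model = {
    "745": {"Sedan": 0.9, "Convertible": 0.1},
    "Z4": {"Sedan": 0.05, "Convertible": 0.95},
}


def posterior_model(make, body):
    """P(Model | Make=make, Body=body) by enumerating the values of Model."""
    scores = {
        model: p_model * p_body_given_model[model][body]
        for model, p_model in p_model_given_make[make].items()
    }
    total = sum(scores.values())
    return {model: score / total for model, score in scores.items()}


# Tuple with Make = BMW, Body = Sedan, and Model missing.
post = posterior_model("BMW", "Sedan")
imputed = max(post, key=post.get)
```

In this toy setup, Make alone would favor "Z4", while adding the Body evidence favors "745"; that is the kind of evidence combination a single AFD cannot express.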


Imputation - single missing attribute

  • Significant difference for the attributes Model and Year.

  • AFDs use only the highest-confidence rule and ignore the others.

    • Attempts at combining evidence from multiple rules have been ineffective.

  • Bayes nets systematically combine all the evidence.


Imputation - multiple missing attributes

  • AFD approach

    • Predicts each missing value independently

    • Can get into cycles, e.g., Make → Model and Model → Make

  • Bayes net

    • Computes the joint distribution over the missing attributes.


Imputation - multiple missing attributes

  • When missing attributes are correlated, AFDs often get into cycles

    • Only 9 out of 20 combinations could be predicted when 3 attributes are missing

  • AFD accuracies are lower because each rule is used independently for prediction

    • BNs systematically combine evidence from multiple sources and capture correlations by finding the joint distribution

  • When attributes are d-separated and involve attributes that have similar prediction accuracies under both methods, there is no difference in accuracy

[Figure: Bayes network over the attributes Year, Mileage, Model, Price, Make, and Body]


Imputation - increase in incompleteness in test data

  • The evidence available for predicting missing values shrinks as incompleteness increases

  • AFD approach

    • When a value in the determining set of an AFD is itself missing, chain to another AFD to predict it first

  • Bayes net

    • No change: just compute the posterior distribution of the attributes to be imputed given the evidence.

Q: Model = 745

AFDs: Make, Body → Model

Year → Body


Time Taken For Imputation

BN-Gibbs retains the accuracy edge of BN-Exact while containing costs


Query Rewriting

  • When mediators do not have access privileges, missing values cannot be substituted as in the case of imputation.

  • Need to generate and send “rewritten” queries to retrieve relevant uncertain answers.


Query Rewriting - single-attribute queries

Q: Body = Sedan

  • The certain answers form the base result set

  • Relevant incomplete answers can be retrieved with rewritten queries:

    • T2 with Q’1: Model = Tl

    • T4 with Q’2: Model = 745


Generating Rewritten Queries

Q: Body = Sedan

Rewritten queries are generated from the certain answers (base result set).

  • Bayes networks (BN-All-MB)

    • Attributes: all attributes in the Markov blanket

    • Q’1: Model = 745

    • Q’2: Model = Tl

  • AFDs

    • Attributes: determining set of the AFD, here Model → Body : 0.9

    • Q’1: Model = 745

    • Q’2: Model = Tl

  • Given evidence for all attributes in its Markov blanket, an attribute is independent of all other attributes

[Figure: Bayes network over the attributes Year, Model, Mileage, Body, and Make]
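A minimal sketch of computing a Markov blanket (parents, children, and the children's other parents) from a DAG. The edge set below is hypothetical, since the network figure is not reproduced in the transcript; it was chosen only so that the blankets agree with the attribute sets used in the examples:

```python
def markov_blanket(parents, node):
    """Parents, children, and the children's other parents of `node`."""
    children = [child for child, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket


# Hypothetical DAG, given as child -> list of parents.
parents = {
    "Year": [],
    "Make": [],
    "Model": ["Make", "Year"],
    "Body": ["Model"],
    "Mileage": ["Year"],
}

mb_body = markov_blanket(parents, "Body")
mb_model = markov_blanket(parents, "Model")
```

For a query on Body, BN-All-MB would constrain every attribute in `mb_body`; a larger blanket means more constrained attributes per rewritten query.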


Ranking Rewritten Queries

  • Not all queries are equally good at retrieving relevant answers

    • Cars with model “tl” are more likely to be sedans than cars with model “745”

  • Rank queries based on their expected precision (ExpPrec)

ExpPrec(Q) = P(A_m = v_m | t_i)

where t_i ∈ Π_{MB(A_m)}(RS(Q)) for Bayes nets

where t_i ∈ Π_{dtrSet(A_m)}(RS(Q)) for AFDs

(Π denotes projection and RS(Q) the result set of Q.)

Q’1: Model = ‘tl’

ExpPrec(Q’1) = P(Body = Sedan | Model = tl) = 1

Q’2: Model = ‘745’

ExpPrec(Q’2) = P(Body = Sedan | Model = 745) = 0.6

  • AFDs: probabilities are estimated with Naïve Bayes classifiers

  • Bayes networks: probabilities are obtained by inference in the Bayes network
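A minimal sketch of ranking rewritten queries by expected precision. Relative frequencies in a small hypothetical sample stand in for the Naïve Bayes or Bayes network estimate; the sample was constructed so that the two values match the example above:

```python
def expected_precision(sample, evidence, attr, value):
    """Estimate P(attr = value | evidence) from complete sample tuples."""
    matching = [
        row for row in sample
        if all(row[a] == v for a, v in evidence.items())
    ]
    if not matching:
        return 0.0
    return sum(row[attr] == value for row in matching) / len(matching)


# Hypothetical training sample: 4 "tl" sedans, 3 "745" sedans, 2 "745" coupes.
sample = (
    [{"model": "tl", "body": "Sedan"}] * 4
    + [{"model": "745", "body": "Sedan"}] * 3
    + [{"model": "745", "body": "Coupe"}] * 2
)

queries = [{"model": "745"}, {"model": "tl"}]
ranked = sorted(
    queries,
    key=lambda q: expected_precision(sample, q, "body", "Sedan"),
    reverse=True,
)
```

Here the "tl" query is ranked ahead of the "745" query because every "tl" tuple in the sample is a sedan.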


Ranking Rewritten Queries - only K queries

  • When database or network resources are limited, the mediator can choose to issue only the top-K queries to get the most relevant uncertain answers

    • It is important to carefully trade precision against throughput

  • Use the F-measure metric (idea borrowed from QPIAD)

    • P: expected precision (e.g., P(Model = 745 | Make = BMW))

    • R: expected recall = expected precision * expected selectivity

    • Expected selectivity = sample selectivity * sample ratio

    • The sample ratio is estimated from the cardinalities of the result sets from the sample and the original database

    • α = 0: only precision is considered
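The F-measure formula itself did not survive in the transcript. The sketch below assumes the weighted form F = (1 + α)·P·R / (α·P + R), which reduces to P when α = 0 as stated above; treat both the formula and the function names as assumptions rather than the thesis's exact definition:

```python
def expected_recall(precision, sample_selectivity, sample_ratio):
    """R = expected precision * (sample selectivity * sample ratio)."""
    return precision * sample_selectivity * sample_ratio


def f_measure(precision, recall, alpha):
    """Assumed weighted F-measure; alpha = 0 ranks by precision alone."""
    if precision == 0 or recall == 0:
        return 0.0
    return (1 + alpha) * precision * recall / (alpha * precision + recall)


# Illustrative numbers only.
recall = expected_recall(precision=0.6, sample_selectivity=5, sample_ratio=4)
precision_only = f_measure(0.6, 0.2, alpha=0)
balanced = f_measure(0.6, 0.2, alpha=1)
```

Raising α gives more weight to recall, which favors queries that are expected to return more tuples.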


Experimental Setup

  • Test databases: Cars database consisting of 55K tuples and Adult database consisting of 15K tuples

  • Training set 15% of the database.

  • Test data is split into two halves:

    • One half contains no incompleteness and is used to return the base result set

    • In the other half all query-constrained attributes are made null

    • A copy of test data is used as the ground truth to compute precision and recall

    • This is an aggressive setup since most databases have <50% incompleteness


BN-All-MB vs AFD

Q: Make

BN-All-MB: P(Make = bmw | Model = 330)

AFD: P(Make = bmw | Model = 330)

  • When the size of the determining set is greater than 1, the expected precision values of AFDs (estimated by NBCs) are inaccurate

  • The actual precision is lower for AFDs because their expected precisions are inaccurate


Shortcoming of BN-All-MB

  • Throughput of queries reduces drastically as the Markov blanket size increases

    • F-measure based ranking can be used to increase recall

    • But when almost all queries have very low throughput, there is simply no way to increase recall

Q: Model = 745

Rewritten queries Q’1, Q’2, and Q’3 each constrain Make ∧ Body ∧ Year

[Figure: Bayes network over the attributes Year, Model, Mileage, Body, and Make]


BN-Beam (single-attribute queries)

Q: Model = 745

Candidate attribute set = {Year, Make, Body}

[Figure: Bayes network over the attributes Year, Mileage, Model, Body, and Make]


BN-Beam

[Figure: best rewritten queries of size 1; labels Year and Body]

  • At level L, all (partial) queries have ≤ L attributes constrained

  • Pick the top-K queries at each level based on the F-measure metric

  • Issue them to the database in increasing order of expected precision

  • P: expected precision (e.g., P(Model = 745 | Make = BMW))

  • R: expected recall = expected precision * expected selectivity

  • Expected selectivity = sample selectivity * sample ratio

  • The sample ratio is estimated from the cardinalities of the result sets from the sample and the original database
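The level-wise search can be sketched as a small beam search. This is an illustration under assumptions, not the thesis implementation: `candidates`, `toy_score`, and its weights are hypothetical, and in the actual method the score would be the F-measure estimated from the Bayes network and the sample:

```python
def beam_search(candidates, score, k, max_level):
    """Grow partial queries one attribute at a time, keeping the top-k per level.

    candidates: dict mapping attribute -> list of candidate values
    score: callable taking a query dict and returning a number (higher is better)
    """
    beam = [{}]
    for _ in range(max_level):
        # Queries kept from the previous level stay eligible (<= L attributes).
        pool = {frozenset(q.items()): q for q in beam if q}
        for query in beam:
            for attr, values in candidates.items():
                if attr in query:
                    continue
                for value in values:
                    extended = {**query, attr: value}
                    pool[frozenset(extended.items())] = extended
        beam = sorted(pool.values(), key=score, reverse=True)[:k]
    return beam


# Hypothetical scoring: per-constraint weights minus a penalty per constraint.
weights = {("make", "BMW"): 0.5, ("body", "Sedan"): 0.3, ("year", 2002): 0.1}


def toy_score(query):
    return sum(weights.get(item, 0.0) for item in query.items()) - 0.2 * len(query)


candidates = {"make": ["BMW", "Audi"], "body": ["Sedan"], "year": [2002]}
top = beam_search(candidates, toy_score, k=2, max_level=2)
```

Unlike BN-All-MB, the search may stop at queries that constrain only part of the candidate attribute set, which is what keeps throughput from collapsing.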


BN-Beam vs BN-All-MB

Results for the top-10 queries for the user query Year = 2002

[Figures: recall plot and precision plot]

  • Increasing α does not increase the recall of BN-All-MB

  • BN-Beam increases recall without a catastrophic reduction in precision


Multi-attribute queries

  • Contribution to QPIAD

  • Aim: to retrieve relevant uncertain answers with multiple missing values on query-constrained attributes.


Multi-attribute queries

  • Q: Make = BMW ∧ Mileage = 40000

  • Base result set = T5, T6

  • QPIAD retrieves T1 and T2.

  • BN-Beam can also retrieve T3 and T4.

  • Candidate attribute set: union of the attributes in the Markov blankets of all constrained attributes

  • All other steps are the same as in the single-attribute query case

[Figure: table with regions labeled QPIAD, BN-Beam, and Base result set]


Comparison over multi-attribute queries

  • Two AFD approaches

    • AFD-All-Attributes: creates a conjunctive query by joining all attributes in the determining sets of the AFDs of the constrained attributes.

      Consider the AFDs Model → Make and Year → Mileage

      Q: Make = BMW ∧ Mileage = 40000

      Q’1: Model = 745 ∧ Year = 2001

      Q’2: Model = 645 ∧ Year = 2001

      Q’3: Model = 745 ∧ Year = 2002

      Q’4: Model = 645 ∧ Year = 2002

      Expected precision = product of the individual queries’ expected precisions
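A minimal sketch of how AFD-All-Attributes could combine per-attribute rewrites: take the cross product of the rewrites for each constrained attribute and multiply their expected precisions. The precision numbers and the helper name `combine_rewrites` are hypothetical:

```python
from itertools import product


def combine_rewrites(per_attribute):
    """Cross the rewrites of each constrained attribute; multiply precisions."""
    combined = []
    for combo in product(*per_attribute):
        query, precision = {}, 1.0
        for constraints, p in combo:
            query.update(constraints)
            precision *= p
        combined.append((query, precision))
    return sorted(combined, key=lambda pair: pair[1], reverse=True)


# Rewrites for Make (via Model -> Make) and Mileage (via Year -> Mileage).
make_rewrites = [({"Model": "745"}, 0.9), ({"Model": "645"}, 0.8)]
mileage_rewrites = [({"Year": 2001}, 0.5), ({"Year": 2002}, 0.4)]

ranked = combine_rewrites([make_rewrites, mileage_rewrites])
```

Multiplying the precisions treats the constrained attributes as independent, and nothing guarantees that a given Model and Year combination actually occurs in the database.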


BN-Beam vs AFD-All-Attributes

Results for the top-10 queries

Q: Make ∧ Mileage

Precision of BN-Beam is competitive with AFD-All-Attributes

Recall of BN-Beam is higher

  • AFD-All-Attributes does not consider the joint distribution between the query-constrained attributes.

  • This leads to low-throughput or even empty queries


Comparison of multi-attribute queries

  • AFD-Highest-Confidence: uses only the AFD of the highest-confidence constrained attribute for rewriting

    Q: Make = Dodge ∧ Year = 2004

    Ignore all constrained attributes other than Make

    AFD: Model → Make

    Q’1: Model = ram

    Q’2: Model = intrepid


BN-Beam vs AFD-Highest-Confidence

Results for the top-10 queries

Q: Make ∧ Year (Cars database)

AFD-Highest-Confidence increases recall, but only at the cost of a catastrophic drop in precision


Summary

  • A comparison of the cost and accuracy tradeoffs of using Bayes network models and AFDs for handling incompleteness in autonomous databases

  • Bayes nets have a significant edge over AFDs when missing values are on highly correlated attributes and at higher levels of incompleteness in test data.

  • Presented two approaches, BN-All-MB and BN-Beam, for generating rewritten queries using Bayes networks.

    • BN-Beam retrieves tuples with higher recall than BN-All-MB.

    • Bayes network based rewriting retrieves results with higher precision and recall than AFD-based rewriting.


Deviations From the Thesis Draft

  • CAVEAT: I found two bugs in my code (Query Rewriting section)

  • Corrected one bug (related to BN-based rewriting)

  • Will correct the other one (related to AFD-based rewriting) after the defense

THANK YOU

QUESTIONS?

