
Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research

Gautam Das, Microsoft Research

Vagelis Hristidis, Florida International University

Gerhard Weikum, MPI Informatik

Presented by Weimin He

[email protected]



Outline

Motivation

Problem Definition

System Architecture

Construction of Ranking Function

Implementation

Experiments

Conclusion and open problems


Motivating example

Realtor DB:

Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query:

Select * From D

Where City=Seattle AND View=Waterfront


Motivation

  • Many-answers problem

  • Two alternative solutions:

    Query reformulation

    Automatic ranking

  • Apply probabilistic model in IR to DB tuple ranking


Problem Definition

Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}

and a query Q: SELECT * FROM D

WHERE

X1=x1 AND … AND Xs=xs

where each Xi is an attribute from A and xi is a value in its domain.

The set of attributes X = {X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes.

Let S be the answer set of Q, i.e., the tuples of D that satisfy Q

How should the tuples in S be ranked so that only the top-k tuples are returned to the user?
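
For the motivating Realtor query above, these definitions instantiate concretely as follows (a small worked instance, treating TID as the tuple identifier rather than an attribute of A):

```latex
% Worked instance for: SELECT * FROM D WHERE City = Seattle AND View = Waterfront
X = \{\text{City}, \text{View}\}, \qquad
Y = A - X = \{\text{Price}, \text{Bedrooms}, \text{Bathrooms}, \text{LivingArea},
             \text{SchoolDistrict}, \text{Pool}, \text{Garage}, \text{BoatDock}\},
\qquad
S = \{\, t \in D : t.\text{City} = \text{Seattle} \;\wedge\; t.\text{View} = \text{Waterfront} \,\}
```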


System Architecture


Intuition for Ranking Function

  • Select * From D Where City=“Seattle” And View=“Waterfront”

    Score of a Result Tuple t depends on

  • Global Score: Global Importance of Unspecified Attribute Values

    • E.g., Homes with good school districts are globally desirable

  • Conditional Score: Correlations between Specified and Unspecified Attribute Values

    • E.g., Waterfront → BoatDock (waterfront homes tend to have boat docks); see the sketch below
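
Schematically, the intuition is a score that multiplies a global part and a conditional part (a rough sketch only; the precise form is derived on the following slides):

```latex
\mathit{Score}(t) \;\approx\;
\underbrace{\prod_{y \in t(Y)} \mathit{GlobalScore}(y)}_{\text{importance of unspecified values}}
\;\times\;
\underbrace{\prod_{y \in t(Y)} \prod_{x \in t(X)} \mathit{CondScore}(x, y)}_{\text{correlation with specified values}}
```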


Probabilistic Model in IR

  • Bayes’ Rule

  • Product Rule

  • Document t, query Q; R: relevant document set; D − R: irrelevant document set (a sketch of the resulting ranking follows below)
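
For reference, the classical probabilistic IR model ranks a document t by its odds of relevance, which Bayes' rule turns into a ratio of generation probabilities (a standard sketch, writing \bar{R} = D − R for the irrelevant set):

```latex
\mathit{Score}(t) \;=\; \frac{p(R \mid t)}{p(\bar{R} \mid t)}
\;=\; \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
\;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
```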


Adaptation of PIR to DB

  • Tuple t is considered as a document

  • Partition t into t(X) and t(Y)

  • t(X) and t(Y) are written as X and Y

  • Starting from an initial IR-style scoring function, derive step by step until the final ranking function is obtained (the partition of t is sketched below)
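
A minimal sketch of the tuple-as-document view, splitting a tuple into t(X) and t(Y); the dict-based tuple layout and the name partition_tuple are illustrative assumptions, not from the paper:

```python
# Sketch: partition a tuple into specified (X) and unspecified (Y) parts,
# mirroring the t(X) / t(Y) notation above. Names are illustrative.

def partition_tuple(t: dict, query_conditions: dict):
    """Split tuple t into t(X) (attributes the query specifies)
    and t(Y) (all remaining attributes)."""
    t_X = {a: v for a, v in t.items() if a in query_conditions}
    t_Y = {a: v for a, v in t.items() if a not in query_conditions}
    return t_X, t_Y

# Example with the Realtor query: City=Seattle AND View=Waterfront
home = {"City": "Seattle", "View": "Waterfront",
        "SchoolDistrict": "Excellent", "BoatDock": "Yes"}
t_X, t_Y = partition_tuple(home, {"City": "Seattle", "View": "Waterfront"})
# t_X -> {'City': 'Seattle', 'View': 'Waterfront'}
# t_Y -> {'SchoolDistrict': 'Excellent', 'BoatDock': 'Yes'}
```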


Preliminary Derivation


Limited Independence Assumptions

  • Given a query Q and a tuple t, the X values (and, separately, the Y values) are assumed to be independent among themselves, while dependencies between the X and Y values are allowed (see the factorization sketch below)
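
Under this assumption, the generation probabilities factor roughly as follows, and analogously with D (or \bar{R}) in place of R (a sketch of the factorization used in the derivation):

```latex
p(X, Y \mid R) \;=\; p(X \mid Y, R)\, p(Y \mid R)
\;\approx\; \prod_{x \in X} p(x \mid Y, R) \;\times\; \prod_{y \in Y} p(y \mid R)
```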


Continuing Derivation


Workload-Based Estimation of R

Assume a collection of “past” queries exists in the system

The workload W is represented as a set of query “tuples”

Given a query Q with specified attribute set X, approximate R by the query “tuples” in W that also request X

All properties of the relevant set R can then be estimated by examining only the subset of the workload whose queries also request X (see the sketch below)
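
A minimal sketch of this workload-based approximation, assuming the workload is stored as a list of per-query dicts of specified attribute values (an illustrative layout; relevant_workload_subset is a hypothetical helper):

```python
# Sketch: approximate the relevant set R by the workload queries that
# also specify all attributes in X. The list-of-dicts workload layout
# is an assumption for illustration.

def relevant_workload_subset(workload, X):
    """Return the workload 'tuples' whose specified attributes include X."""
    return [w for w in workload if set(X).issubset(w.keys())]

workload = [
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle", "Bedrooms": "4"},
    {"View": "Waterfront", "City": "Redmond"},
]
R_approx = relevant_workload_subset(workload, X={"City", "View"})
# -> the first and third queries, which both request City and View
```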


Final Ranking Function
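
The ranking function combines the atomic probabilities precomputed on the next slide. Roughly, it is a product of workload-to-data ratios of those quantities (a sketch of its shape, not a verbatim quote; see the paper for the exact derivation and notation):

```latex
\mathit{Score}(t) \;\propto\;
\prod_{y \in Y} \frac{\hat{p}(y \mid W)}{\hat{p}(y \mid D)}
\;\times\;
\prod_{y \in Y} \prod_{x \in X} \frac{\hat{p}(x \mid y, W)}{\hat{p}(x \mid y, D)}
```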


Pre-computing Atomic Probabilities in Ranking Function

P(y | W): relative frequency of y in the workload W

P(y | D): relative frequency of y in the data D

P(x | y, W): (# of query “tuples” in W that contain both x and y) / (total # of “tuples” in W)

P(x | y, D): (# of tuples in D that contain both x and y) / (total # of tuples in D)
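
A minimal sketch of these frequency estimates, assuming D and W are stored as lists of attribute–value dicts (illustrative layout and function names, not the paper's implementation):

```python
# Sketch: estimate the atomic probabilities as relative frequencies.
# 'table' stands for either the data D or the workload W, stored as a
# list of {attribute: value} dicts (illustrative layout).

def p_value(table, attr, val):
    """P(y | table): relative frequency of the attribute value y."""
    hits = sum(1 for row in table if row.get(attr) == val)
    return hits / len(table)

def p_pair(table, attr_x, val_x, attr_y, val_y):
    """P(x | y, table) as estimated on this slide:
    (# of rows containing both x and y) / (total # of rows)."""
    hits = sum(1 for row in table
               if row.get(attr_x) == val_x and row.get(attr_y) == val_y)
    return hits / len(table)
```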


Example for Computing Atomic Probabilities

  • Select * From D Where City=“Seattle” And View=“Waterfront”

  • Y={SchoolDistrict, BoatDock, …}

  • |D| = 10,000 tuples; |W| = 1,000 workload queries

  • W{excellent} = 10 (workload queries containing SchoolDistrict = Excellent)

  • W{waterfront & yes} = 5 (workload queries containing View = Waterfront and BoatDock = Yes)

  • p(excellent | W) = 10/1,000 = 0.01

  • p(excellent | D) = 10/10,000 = 0.001

  • p(waterfront | yes, W) = 5/1,000 = 0.005

  • p(waterfront | yes, D) = 5/10,000 = 0.0005


Indexing Atomic Probabilities

For P(y | W) and P(y | D): one table each with schema {AttName, AttVal, Prob}

B+ tree index on (AttName, AttVal)

For P(x | y, W) and P(x | y, D): one table each with schema {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)


Scan Algorithm

Preprocessing - Atomic Probabilities Module

  • Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y

    Execution

  • Select Tuples that Satisfy the Query

  • Scan and Compute Score for Each Result-Tuple

  • Return the Top-k Tuples (see the end-to-end sketch below)
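
Putting the pieces together, a minimal end-to-end sketch of the Scan algorithm under the same illustrative assumptions as the earlier snippets (dict-based tuples; probabilities recomputed inline rather than read from the precomputed indexes):

```python
import heapq

# Illustrative helpers (same estimates as in the earlier sketch).
def _p_value(table, attr, val):
    return sum(1 for r in table if r.get(attr) == val) / len(table)

def _p_pair(table, ax, x, ay, y):
    return sum(1 for r in table
               if r.get(ax) == x and r.get(ay) == y) / len(table)

def scan_rank(D, W, query_conditions, k, key_attrs=("TID",)):
    """Scan algorithm sketch: select the result tuples, score each one,
    return the top-k. The paper precomputes and indexes the atomic
    probabilities; here they are recomputed inline to keep the sketch
    self-contained."""
    # 1. Select the tuples that satisfy the query.
    S = [t for t in D
         if all(t.get(a) == v for a, v in query_conditions.items())]

    scored = []
    for t in S:
        # Unspecified attributes Y (key attributes such as TID left out).
        Y_part = {a: v for a, v in t.items()
                  if a not in query_conditions and a not in key_attrs}
        # 2. Product-of-ratios score, as sketched under "Final Ranking Function".
        score = 1.0
        for ay, y in Y_part.items():
            score *= _p_value(W, ay, y) / max(_p_value(D, ay, y), 1e-9)
            for ax, x in query_conditions.items():
                score *= _p_pair(W, ax, x, ay, y) / max(_p_pair(D, ax, x, ay, y), 1e-9)
        scored.append((score, t))

    # 3. Return the top-k tuples by score.
    return [t for _, t in heapq.nlargest(k, scored, key=lambda st: st[0])]
```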


Beyond Scan Algorithm

  • Scan algorithm is inefficient

    when the answer set contains many tuples

  • Another extreme

    Pre-compute top-K tuples for all possible queries

    Still infeasible in practice

  • Trade-off solution

    Pre-compute ranked lists of tuples for all possible atomic queries

    At query time, merge ranked lists to get top-K tuples


Two Kinds of Ranked Lists

  • CondList Cx

    {AttName, AttVal, TID, CondScore}

    B+ tree index on (AttName, AttVal, CondScore)

  • GlobList Gx

    {AttName, AttVal, TID, GlobScore}

    B+ tree index on (AttName, AttVal, GlobScore)


Index Module


List Merge Algorithm
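
A minimal threshold-style merge sketch over the precomputed ranked lists, assuming each list is a descending-sorted sequence of (TID, score) pairs, a random-access lookup into each list is available, and a tuple's overall score is the product of its per-list scores; these are illustrative assumptions, and the paper's List Merge algorithm should be consulted for the exact procedure:

```python
import heapq
import itertools

def list_merge(lists, random_access, k):
    """Threshold-style merge of ranked lists (a sketch).
    lists: the ranked lists relevant to the query, each [(tid, score), ...]
           sorted by descending score (e.g., the query's Cx and Gx lists).
    random_access(tid, list_index) -> that tuple's score in the given list.
    Returns the top-k (score, tid) pairs, best first."""
    seen = {}     # tid -> full score, computed once via random access
    topk = []     # min-heap holding the current best k (score, tid) pairs

    for depth in itertools.count():
        threshold = 1.0   # best score any still-unseen tuple could reach
        exhausted = True
        for li, lst in enumerate(lists):
            if depth >= len(lst):
                continue  # this list is exhausted; the sketch ignores its floor
            exhausted = False
            tid, score = lst[depth]
            threshold *= score
            if tid not in seen:
                # Complete the tuple's score by random access into every list.
                full = 1.0
                for lj in range(len(lists)):
                    full *= random_access(tid, lj)
                seen[tid] = full
                heapq.heappush(topk, (full, tid))
                if len(topk) > k:
                    heapq.heappop(topk)
        # Stop once k found tuples beat anything that could still be unseen.
        if exhausted or (len(topk) == k and topk[0][0] >= threshold):
            break

    return sorted(topk, reverse=True)
```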


Experimental Setup

  • Datasets:

    • MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)

    • Internet Movie Database (http://www.imdb.com)

  • Software and Hardware:

    • Microsoft SQL Server 2000 RDBMS

    • P4 2.8-GHz PC, 1 GB RAM

    • C#, connected to the RDBMS through DAO


Quality Experiments

  • Conducted on Seattle Homes and Movies tables

  • Collect a workload from users

  • Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]


Quality Experiment – Average Precision

  • For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples

  • Let each user mark 10 tuples in Hi as most relevant to Qi

  • Measure how closely the 10 tuples marked by the user match the top-10 tuples returned by each algorithm (a sketch of this overlap measure follows below)
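
One simple way to compute such an overlap-based precision, assuming the user's picks and an algorithm's top-10 are collections of tuple IDs (an illustrative metric; overlap_precision is a hypothetical helper, not necessarily the paper's exact definition):

```python
def overlap_precision(user_marked, algorithm_topk):
    """Fraction of the user's relevant picks that the algorithm also returned.
    Both arguments are collections of tuple IDs (illustrative)."""
    user_marked = set(user_marked)
    return len(user_marked & set(algorithm_topk)) / len(user_marked)

# Example: the user marked 10 tuples; the algorithm's top-10 recovered 7 of them.
print(overlap_precision(range(1, 11), [1, 2, 3, 4, 5, 6, 7, 20, 21, 22]))  # 0.7
```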


Quality Experiment – Fraction of Users Preferring Each Algorithm

  • 5 new queries

  • Users were given the top-5 results


Performance Experiments

  • Datasets

  • Compare 2 Algorithms:

    • Scan algorithm

    • List Merge algorithm


Performance Experiments – Pre-computation Time


Performance Experiments – Execution Time



Conclusion and Open Problems

  • Automatic ranking for the many-answers problem

  • Adaptation of PIR to DB

  • Multiple-table queries

  • Non-categorical attributes
