Probabilistic Ranking of Database Query Results

1 / 32

# Probabilistic Ranking of Database Query Results - PowerPoint PPT Presentation

Probabilistic Ranking of Database Query Results. Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik. Presented by Weimin He [email protected] Outline. Motivation Problem Definition

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' Probabilistic Ranking of Database Query Results' - kim-sherman

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research

Gautam Das, Microsoft Research

Vagelis Hristidis, Florida International University

Gerhard Weikum, MPI Informatik

Presented by Weimin He

[email protected]

Outline

Motivation

Problem Definition

System Architecture

Construction of Ranking Function

Implementation

Experiments

Conclusion and open problems

Weimin He [email protected]

Motivating example

Realtor DB:

Table D=(TID, Price , City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query:

Select * From D

Where City=Seattle AND View=Waterfront

Weimin He [email protected]

Motivation
• Two alternative solutions:

Query reformulation

Automatic ranking

• Apply probabilistic model in IR to DB tuple ranking

Weimin He [email protected]

Problem Definition

Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}

and a query Q: SELECT * FROM D

WHERE

X1=x1 AND … AND Xs=xs

where each Xi is an attribute from A and xi is a value in its domain.

The set of attributes X ={X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes

Let be the answer set of Q

How to rank tuples in S and return top-k tuples to the user ?

Weimin He [email protected]

Intuition for Ranking Function
• Select * From D Where City=“Seattle” And View=“Waterfront”

Score of a Result Tuple t depends on

• Global Score: Global Importance of Unspecified Attribute Values
• E.g., Homes with good school districts are globally desirable
• Conditional Score: Correlations between Specified and Unspecified Attribute Values
• E.g., Waterfront  BoatDock

Weimin He [email protected]

Probabilistic Model in IR
• Bayes’ Rule
• Product Rule
• Document t, Query QR: Relevant document setR = D - R: Irrelevant document set

Weimin He [email protected]

• Tuple t is considered as a document
• Partition t into t(X) and t(Y)
• t(X) and t(Y) are written as X and Y
• Derive from initial scoring function until final ranking function is obtained

Weimin He [email protected]

Limited Independence Assumptions
• Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

Weimin He [email protected]

Assume a collection of “past” queries existed in system

Workload W is represented as a set of “tuples”

Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request for X

All properties of the set of relevant tuple set R can be obtained by only examining the subset of the workload that caontains queries that also request for X

Weimin He [email protected]

Pre-computing Atomic Probabilities in Ranking Function

Relative frequency in W

Relative frequency in D

(#of tuples in W that conatains x, y)/total # of tuples in W

(#of tuples in D that conatains x, y)/total # of tuples in D

Weimin He [email protected]

Example for Computing Atomic Probabilities
• Select * From D Where City=“Seattle” And View=“Waterfront”
• Y={SchoolDistrict, BoatDock, …}
• D=10,000 W=1000
• W{excellent}=10
• W{waterfront &yes}=5
• p(excellent|W)=10/1000=0.1
• p(excellent|D)=10/10,000=0.01
• p(waterfront|yes,W)=5/1000=0.005
• p(waterfront|yes,D)=5/10,000=0.0005

Weimin He [email protected]

Indexing Atomic Probabilities

{AttName, AttVal, Prob}

B+ tree index on (AttName, AttVal)

{AttName, AttVal, Prob}

B+ tree index on (AttName, AttVal)

{AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

{AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

Weimin He [email protected]

Scan Algorithm

Preprocessing - Atomic Probabilities Module

• Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y

Execution

• Select Tuples that Satisfy the Query
• Scan and Compute Score for Each Result-Tuple

Weimin He [email protected]

Beyond Scan Algorithm
• Scan algorithm is Inefficient

Many tuples in the answer set

• Another extreme

Pre-compute top-K tuples for all possible queries

Still infeasible in practice

Pre-compute ranked lists of tuples for all possible atomic queries

At query time, merge ranked lists to get top-K tuples

Weimin He [email protected]

Two kinds of Ranked List
• CondList Cx

{AttName, AttVal, TID, CondScore}

B+ tree index on (AttName, AttVal, CondScore)

• GlobList Gx

{AttName, AttVal, TID, GlobScore}

B+ tree index on (AttName, AttVal, GlobScore)

Weimin He [email protected]

Experimental Setup
• Datasets:
• Internet Movie Database (http://www.imdb.com)
• Software and Hardware:
• Microsoft SQL Server2000 RDBMS
• P4 2.8-GHz PC, 1 GB RAM
• C#, Connected to RDBMS through DAO

Weimin He [email protected]

Quality Experiments
• Conducted on Seattle Homes and Movies tables
• Collect a workload from users
• Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]

Weimin He [email protected]

Quality Experiment-Average Precision
• For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
• Let each user mark 10 tuples in Hi as most relevant to Qi
• Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm

Weimin He [email protected]

• 5 new queries
• Users were given the top-5 results

Weimin He [email protected]

Performance Experiments
• Datasets
• Compare 2 Algorithms:
• Scan algorithm
• List Merge algorithm

Weimin He [email protected]

Conclusion and Open Problems