# Probabilistic Ranking of Database Query Results - PowerPoint PPT Presentation

1 / 32

Probabilistic Ranking of Database Query Results. Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik. Presented by Weimin He CSE@UTA. Outline. Motivation Problem Definition

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Probabilistic Ranking of Database Query Results

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

## Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research

Gautam Das, Microsoft Research

Vagelis Hristidis, Florida International University

Gerhard Weikum, MPI Informatik

Presented by Weimin He

CSE@UTA

### Outline

Motivation

Problem Definition

System Architecture

Construction of Ranking Function

Implementation

Experiments

Conclusion and open problems

Weimin He CSE@UTA

### Motivating example

Realtor DB:

Table D=(TID, Price , City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query:

Select * From D

Where City=Seattle AND View=Waterfront

Weimin He CSE@UTA

### Motivation

• Two alternative solutions:

Query reformulation

Automatic ranking

• Apply probabilistic model in IR to DB tuple ranking

Weimin He CSE@UTA

### Problem Definition

Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}

and a query Q: SELECT * FROM D

WHERE

X1=x1 AND … AND Xs=xs

where each Xi is an attribute from A and xi is a value in its domain.

The set of attributes X ={X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes

Let be the answer set of Q

How to rank tuples in S and return top-k tuples to the user ?

Weimin He CSE@UTA

### System Architecture

Weimin He CSE@UTA

### Intuition for Ranking Function

• Select * From D Where City=“Seattle” And View=“Waterfront”

Score of a Result Tuple t depends on

• Global Score: Global Importance of Unspecified Attribute Values

• E.g., Homes with good school districts are globally desirable

• Conditional Score: Correlations between Specified and Unspecified Attribute Values

• E.g., Waterfront  BoatDock

Weimin He CSE@UTA

### Probabilistic Model in IR

• Bayes’ Rule

• Product Rule

• Document t, Query QR: Relevant document setR = D - R: Irrelevant document set

Weimin He CSE@UTA

### Adaptation of PIR to DB

• Tuple t is considered as a document

• Partition t into t(X) and t(Y)

• t(X) and t(Y) are written as X and Y

• Derive from initial scoring function until final ranking function is obtained

Weimin He CSE@UTA

### Preliminary Derivation

Weimin He CSE@UTA

### Limited Independence Assumptions

• Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

Weimin He CSE@UTA

### Continuing Derivation

Weimin He CSE@UTA

Assume a collection of “past” queries existed in system

Workload W is represented as a set of “tuples”

Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request for X

All properties of the set of relevant tuple set R can be obtained by only examining the subset of the workload that caontains queries that also request for X

Weimin He CSE@UTA

### Final Ranking Function

Weimin He CSE@UTA

### Pre-computing Atomic Probabilities in Ranking Function

Relative frequency in W

Relative frequency in D

(#of tuples in W that conatains x, y)/total # of tuples in W

(#of tuples in D that conatains x, y)/total # of tuples in D

Weimin He CSE@UTA

### Example for Computing Atomic Probabilities

• Select * From D Where City=“Seattle” And View=“Waterfront”

• Y={SchoolDistrict, BoatDock, …}

• D=10,000 W=1000

• W{excellent}=10

• W{waterfront &yes}=5

• p(excellent|W)=10/1000=0.1

• p(excellent|D)=10/10,000=0.01

• p(waterfront|yes,W)=5/1000=0.005

• p(waterfront|yes,D)=5/10,000=0.0005

Weimin He CSE@UTA

### Indexing Atomic Probabilities

{AttName, AttVal, Prob}

B+ tree index on (AttName, AttVal)

{AttName, AttVal, Prob}

B+ tree index on (AttName, AttVal)

{AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

{AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

Weimin He CSE@UTA

### Scan Algorithm

Preprocessing - Atomic Probabilities Module

• Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y

Execution

• Select Tuples that Satisfy the Query

• Scan and Compute Score for Each Result-Tuple

Weimin He CSE@UTA

### Beyond Scan Algorithm

• Scan algorithm is Inefficient

Many tuples in the answer set

• Another extreme

Pre-compute top-K tuples for all possible queries

Still infeasible in practice

Pre-compute ranked lists of tuples for all possible atomic queries

At query time, merge ranked lists to get top-K tuples

Weimin He CSE@UTA

### Two kinds of Ranked List

• CondList Cx

{AttName, AttVal, TID, CondScore}

B+ tree index on (AttName, AttVal, CondScore)

• GlobList Gx

{AttName, AttVal, TID, GlobScore}

B+ tree index on (AttName, AttVal, GlobScore)

Weimin He CSE@UTA

### Index Module

Weimin He CSE@UTA

### List Merge Algorithm

Weimin He CSE@UTA

### Experimental Setup

• Datasets:

• Internet Movie Database (http://www.imdb.com)

• Software and Hardware:

• Microsoft SQL Server2000 RDBMS

• P4 2.8-GHz PC, 1 GB RAM

• C#, Connected to RDBMS through DAO

Weimin He CSE@UTA

### Quality Experiments

• Conducted on Seattle Homes and Movies tables

• Collect a workload from users

• Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]

Weimin He CSE@UTA

### Quality Experiment-Average Precision

• For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples

• Let each user mark 10 tuples in Hi as most relevant to Qi

• Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm

Weimin He CSE@UTA

### Quality Experiment- Fraction of Users Preferring Each Algorithm

• 5 new queries

• Users were given the top-5 results

Weimin He CSE@UTA

### Performance Experiments

• Datasets

• Compare 2 Algorithms:

• Scan algorithm

• List Merge algorithm

Weimin He CSE@UTA

### Performance Experiments – Pre-computation Time

Weimin He CSE@UTA

### Performance Experiments – Execution Time

Weimin He CSE@UTA

### Performance Experiments – Execution Time

Weimin He CSE@UTA

### Performance Experiments – Execution Time

Weimin He CSE@UTA