- 43 Views
- Uploaded on
- Presentation posted in: General

Probabilistic Ranking of Database Query Results

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research

Gautam Das, Microsoft Research

Vagelis Hristidis, Florida International University

Gerhard Weikum, MPI Informatik

Presented by Weimin He

CSE@UTA

Motivation

Problem Definition

System Architecture

Construction of Ranking Function

Implementation

Experiments

Conclusion and open problems

Weimin He CSE@UTA

Realtor DB:

Table D=(TID, Price , City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query:

Select * From D

Where City=Seattle AND View=Waterfront

Weimin He CSE@UTA

- Many-answers problem
- Two alternative solutions:
Query reformulation

Automatic ranking

- Apply probabilistic model in IR to DB tuple ranking

Weimin He CSE@UTA

Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}

and a query Q: SELECT * FROM D

WHERE

X1=x1 AND … AND Xs=xs

where each Xi is an attribute from A and xi is a value in its domain.

The set of attributes X ={X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes

Let be the answer set of Q

How to rank tuples in S and return top-k tuples to the user ?

Weimin He CSE@UTA

Weimin He CSE@UTA

- Select * From D Where City=“Seattle” And View=“Waterfront”
Score of a Result Tuple t depends on

- Global Score: Global Importance of Unspecified Attribute Values
- E.g., Homes with good school districts are globally desirable

- Conditional Score: Correlations between Specified and Unspecified Attribute Values
- E.g., Waterfront BoatDock

Weimin He CSE@UTA

- Bayes’ Rule
- Product Rule

- Document t, Query QR: Relevant document setR = D - R: Irrelevant document set

Weimin He CSE@UTA

- Tuple t is considered as a document
- Partition t into t(X) and t(Y)
- t(X) and t(Y) are written as X and Y
- Derive from initial scoring function until final ranking function is obtained

Weimin He CSE@UTA

Weimin He CSE@UTA

- Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

Weimin He CSE@UTA

Weimin He CSE@UTA

Assume a collection of “past” queries existed in system

Workload W is represented as a set of “tuples”

Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request for X

All properties of the set of relevant tuple set R can be obtained by only examining the subset of the workload that caontains queries that also request for X

Weimin He CSE@UTA

Weimin He CSE@UTA

Relative frequency in W

Relative frequency in D

(#of tuples in W that conatains x, y)/total # of tuples in W

(#of tuples in D that conatains x, y)/total # of tuples in D

Weimin He CSE@UTA

- Select * From D Where City=“Seattle” And View=“Waterfront”
- Y={SchoolDistrict, BoatDock, …}
- D=10,000 W=1000
- W{excellent}=10
- W{waterfront &yes}=5
- p(excellent|W)=10/1000=0.1
- p(excellent|D)=10/10,000=0.01
- p(waterfront|yes,W)=5/1000=0.005
- p(waterfront|yes,D)=5/10,000=0.0005

Weimin He CSE@UTA

{AttName, AttVal, Prob}

B+ tree index on (AttName, AttVal)

{AttName, AttVal, Prob}

B+ tree index on (AttName, AttVal)

{AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

{AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

Weimin He CSE@UTA

Preprocessing - Atomic Probabilities Module

- Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y
Execution

- Select Tuples that Satisfy the Query
- Scan and Compute Score for Each Result-Tuple
- Return Top-K Tuples

Weimin He CSE@UTA

- Scan algorithm is Inefficient
Many tuples in the answer set

- Another extreme
Pre-compute top-K tuples for all possible queries

Still infeasible in practice

- Trade-off solution
Pre-compute ranked lists of tuples for all possible atomic queries

At query time, merge ranked lists to get top-K tuples

Weimin He CSE@UTA

- CondList Cx
{AttName, AttVal, TID, CondScore}

B+ tree index on (AttName, AttVal, CondScore)

- GlobList Gx
{AttName, AttVal, TID, GlobScore}

B+ tree index on (AttName, AttVal, GlobScore)

Weimin He CSE@UTA

Weimin He CSE@UTA

Weimin He CSE@UTA

- Datasets:
- MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
- Internet Movie Database (http://www.imdb.com)

- Software and Hardware:
- Microsoft SQL Server2000 RDBMS
- P4 2.8-GHz PC, 1 GB RAM
- C#, Connected to RDBMS through DAO

Weimin He CSE@UTA

- Conducted on Seattle Homes and Movies tables
- Collect a workload from users
- Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]

Weimin He CSE@UTA

- For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
- Let each user mark 10 tuples in Hi as most relevant to Qi
- Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm

Weimin He CSE@UTA

- 5 new queries
- Users were given the top-5 results

Weimin He CSE@UTA

- Datasets

- Compare 2 Algorithms:
- Scan algorithm
- List Merge algorithm

Weimin He CSE@UTA

Weimin He CSE@UTA

Weimin He CSE@UTA

Weimin He CSE@UTA

Weimin He CSE@UTA

- Automatic ranking for many-answers
- Adaptation of PIR to DB
- Mutiple-table query
- Non-categorical attributes

Weimin He CSE@UTA