Probabilistic ranking of database query results
Sponsored Links
This presentation is the property of its rightful owner.
1 / 32

Probabilistic Ranking of Database Query Results PowerPoint PPT Presentation


  • 43 Views
  • Uploaded on
  • Presentation posted in: General

Probabilistic Ranking of Database Query Results. Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik. Presented by Weimin He CSE@UTA. Outline. Motivation Problem Definition

Download Presentation

Probabilistic Ranking of Database Query Results

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research

Gautam Das, Microsoft Research

Vagelis Hristidis, Florida International University

Gerhard Weikum, MPI Informatik

Presented by Weimin He

CSE@UTA


Outline

Motivation

Problem Definition

System Architecture

Construction of Ranking Function

Implementation

Experiments

Conclusion and open problems

Weimin He CSE@UTA


Motivating example

Realtor DB:

Table D=(TID, Price , City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query:

Select * From D

Where City=Seattle AND View=Waterfront

Weimin He CSE@UTA


Motivation

  • Many-answers problem

  • Two alternative solutions:

    Query reformulation

    Automatic ranking

  • Apply probabilistic model in IR to DB tuple ranking

Weimin He CSE@UTA


Problem Definition

Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}

and a query Q: SELECT * FROM D

WHERE

X1=x1 AND … AND Xs=xs

where each Xi is an attribute from A and xi is a value in its domain.

The set of attributes X ={X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes

Let be the answer set of Q

How to rank tuples in S and return top-k tuples to the user ?

Weimin He CSE@UTA


System Architecture

Weimin He CSE@UTA


Intuition for Ranking Function

  • Select * From D Where City=“Seattle” And View=“Waterfront”

    Score of a Result Tuple t depends on

  • Global Score: Global Importance of Unspecified Attribute Values

    • E.g., Homes with good school districts are globally desirable

  • Conditional Score: Correlations between Specified and Unspecified Attribute Values

    • E.g., Waterfront  BoatDock

Weimin He CSE@UTA


Probabilistic Model in IR

  • Bayes’ Rule

  • Product Rule

  • Document t, Query QR: Relevant document setR = D - R: Irrelevant document set

Weimin He CSE@UTA


Adaptation of PIR to DB

  • Tuple t is considered as a document

  • Partition t into t(X) and t(Y)

  • t(X) and t(Y) are written as X and Y

  • Derive from initial scoring function until final ranking function is obtained

Weimin He CSE@UTA


Preliminary Derivation

Weimin He CSE@UTA


Limited Independence Assumptions

  • Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

Weimin He CSE@UTA


Continuing Derivation

Weimin He CSE@UTA


Workload-based Estimation of

Assume a collection of “past” queries existed in system

Workload W is represented as a set of “tuples”

Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request for X

All properties of the set of relevant tuple set R can be obtained by only examining the subset of the workload that caontains queries that also request for X

Weimin He CSE@UTA


Final Ranking Function

Weimin He CSE@UTA


Pre-computing Atomic Probabilities in Ranking Function

Relative frequency in W

Relative frequency in D

(#of tuples in W that conatains x, y)/total # of tuples in W

(#of tuples in D that conatains x, y)/total # of tuples in D

Weimin He CSE@UTA


Example for Computing Atomic Probabilities

  • Select * From D Where City=“Seattle” And View=“Waterfront”

  • Y={SchoolDistrict, BoatDock, …}

  • D=10,000 W=1000

  • W{excellent}=10

  • W{waterfront &yes}=5

  • p(excellent|W)=10/1000=0.1

  • p(excellent|D)=10/10,000=0.01

  • p(waterfront|yes,W)=5/1000=0.005

  • p(waterfront|yes,D)=5/10,000=0.0005

Weimin He CSE@UTA


Indexing Atomic Probabilities

{AttName, AttVal, Prob}

B+ tree index on (AttName, AttVal)

{AttName, AttVal, Prob}

B+ tree index on (AttName, AttVal)

{AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

{AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

Weimin He CSE@UTA


Scan Algorithm

Preprocessing - Atomic Probabilities Module

  • Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y

    Execution

  • Select Tuples that Satisfy the Query

  • Scan and Compute Score for Each Result-Tuple

  • Return Top-K Tuples

Weimin He CSE@UTA


Beyond Scan Algorithm

  • Scan algorithm is Inefficient

    Many tuples in the answer set

  • Another extreme

    Pre-compute top-K tuples for all possible queries

    Still infeasible in practice

  • Trade-off solution

    Pre-compute ranked lists of tuples for all possible atomic queries

    At query time, merge ranked lists to get top-K tuples

Weimin He CSE@UTA


Two kinds of Ranked List

  • CondList Cx

    {AttName, AttVal, TID, CondScore}

    B+ tree index on (AttName, AttVal, CondScore)

  • GlobList Gx

    {AttName, AttVal, TID, GlobScore}

    B+ tree index on (AttName, AttVal, GlobScore)

Weimin He CSE@UTA


Index Module

Weimin He CSE@UTA


List Merge Algorithm

Weimin He CSE@UTA


Experimental Setup

  • Datasets:

    • MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)

    • Internet Movie Database (http://www.imdb.com)

  • Software and Hardware:

    • Microsoft SQL Server2000 RDBMS

    • P4 2.8-GHz PC, 1 GB RAM

    • C#, Connected to RDBMS through DAO

Weimin He CSE@UTA


Quality Experiments

  • Conducted on Seattle Homes and Movies tables

  • Collect a workload from users

  • Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]

Weimin He CSE@UTA


Quality Experiment-Average Precision

  • For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples

  • Let each user mark 10 tuples in Hi as most relevant to Qi

  • Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm

Weimin He CSE@UTA


Quality Experiment- Fraction of Users Preferring Each Algorithm

  • 5 new queries

  • Users were given the top-5 results

Weimin He CSE@UTA


Performance Experiments

  • Datasets

  • Compare 2 Algorithms:

    • Scan algorithm

    • List Merge algorithm

Weimin He CSE@UTA


Performance Experiments – Pre-computation Time

Weimin He CSE@UTA


Performance Experiments – Execution Time

Weimin He CSE@UTA


Performance Experiments – Execution Time

Weimin He CSE@UTA


Performance Experiments – Execution Time

Weimin He CSE@UTA


Conclusion and Open Problems

  • Automatic ranking for many-answers

  • Adaptation of PIR to DB

  • Mutiple-table query

  • Non-categorical attributes

Weimin He CSE@UTA


  • Login