
Probabilistic Ranking of Database Query Results



  1. Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik Presented by Weimin He CSE@UTA

  2. Outline • Motivation • Problem Definition • System Architecture • Construction of Ranking Function • Implementation • Experiments • Conclusion and Open Problems

  3. Motivating Example • Realtor DB: Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock) • SQL query: Select * From D Where City=“Seattle” AND View=“Waterfront”

  4. Motivation • The many-answers problem • Two alternative solutions: query reformulation, automatic ranking • Apply the probabilistic model from IR to DB tuple ranking

  5. Problem Definition • Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am} and a query Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs, where each Xi is an attribute from A and xi is a value in its domain • The set of attributes X = {X1, …, Xs} is known as the set of attributes specified by the query, while Y = A – X is known as the set of unspecified attributes • Let S be the answer set of Q • How do we rank the tuples in S and return the top-k tuples to the user?
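The split into specified and unspecified attributes can be sketched in a few lines. This is an illustrative sketch only: the attribute names come from the motivating example, while the function name and the dict representation of the query are invented here.

```python
# Schema of the Realtor table D from the motivating example.
ALL_ATTRIBUTES = {"Price", "City", "Bedrooms", "Bathrooms", "LivingArea",
                  "SchoolDistrict", "View", "Pool", "Garage", "BoatDock"}

def split_attributes(query):
    """Given the query's WHERE clause as a dict of attribute/value pairs,
    return (X, Y): the specified attributes X and the unspecified
    attributes Y = A - X."""
    x = set(query)
    return x, ALL_ATTRIBUTES - x

X, Y = split_attributes({"City": "Seattle", "View": "Waterfront"})
# X = {'City', 'View'}; Y contains the other 8 attributes
```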

  6. System Architecture

  7. Intuition for Ranking Function • Select * From D Where City=“Seattle” And View=“Waterfront” • Score of a result tuple t depends on: • Global Score: global importance of unspecified attribute values • E.g., homes with good school districts are globally desirable • Conditional Score: correlations between specified and unspecified attribute values • E.g., Waterfront ⇒ BoatDock

  8. Probabilistic Model in IR • Bayes’ Rule • Product Rule • Document t, Query Q • R: relevant document set • R̄ = D – R: irrelevant document set
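The formulas on this slide did not survive the transcript. A standard statement of the PIR relevance score, consistent with the R / R̄ notation above, is:

```latex
% Score of tuple t: odds of relevance, expanded via Bayes' rule.
% p(R) and p(\bar{R}) are constant for a given query, so they can be
% dropped without changing the ranking.
Score(t) \;\propto\; \frac{p(R \mid t)}{p(\bar{R} \mid t)}
         \;=\; \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
         \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
```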

  9. Adaptation of PIR to DB • Tuple t is considered as a document • Partition t into t(X) and t(Y) • t(X) and t(Y) are written as X and Y • Derive from the initial scoring function until the final ranking function is obtained

  10. Preliminary Derivation

  11. Limited Independence Assumptions • Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

  12. Continuing Derivation

  13. Workload-based Estimation of R • Assume a collection of “past” queries exists in the system • The workload W is represented as a set of “tuples” • Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request X • All properties of the relevant tuple set R can be obtained by examining only the subset of the workload that contains queries that also request X
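The workload filtering step above can be sketched as follows. The function name and the dict representation of workload queries are hypothetical; the point is only that R is approximated by the workload entries whose specified attributes include X.

```python
def workload_subset(workload, x_attrs):
    """Approximate the relevant set R: keep the workload query "tuples"
    (modeled here as dicts of specified attribute/value pairs) that also
    request all attributes in X."""
    x_attrs = set(x_attrs)
    return [q for q in workload if x_attrs <= set(q)]

W = [{"City": "Seattle", "View": "Waterfront"},
     {"City": "Seattle"},
     {"View": "Waterfront", "Pool": "Yes"}]
approx_R = workload_subset(W, {"View"})  # the 1st and 3rd queries request View
```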

  14. Final Ranking Function
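The formula on this slide was lost in the transcript. Judging from the four atomic quantities pre-computed on the next slide, the final ranking function combines a global part and a conditional part; the following is a reconstruction in that spirit, not a verbatim copy of the slide:

```latex
% Global part: importance of each unspecified value y in the workload W
% relative to the data D.  Conditional part: correlation of y with each
% specified value x, again in W relative to D.
Score(t) \;\propto\;
    \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
    \;\cdot\;
    \prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, W)}{p(x \mid y, D)}
```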

  15. Pre-computing Atomic Probabilities in Ranking Function • Relative frequency in W • Relative frequency in D • (# of tuples in W that contain x, y) / (total # of tuples in W) • (# of tuples in D that contain x, y) / (total # of tuples in D)

  16. Example for Computing Atomic Probabilities • Select * From D Where City=“Seattle” And View=“Waterfront” • Y = {SchoolDistrict, BoatDock, …} • |D| = 10,000, |W| = 1,000 • W{excellent} = 10 • W{waterfront & yes} = 5 • p(excellent | W) = 10/1,000 = 0.01 • p(excellent | D) = 10/10,000 = 0.001 • p(waterfront | yes, W) = 5/1,000 = 0.005 • p(waterfront | yes, D) = 5/10,000 = 0.0005
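The example's atomic probabilities are plain relative frequencies, so they are easy to recompute; the helper name below is invented, the counts are the slide's hypothetical numbers.

```python
# Sizes of the data table and the workload from the example.
W_SIZE, D_SIZE = 1_000, 10_000

def rel_freq(count, total):
    """Relative frequency: count of matching tuples over collection size."""
    return count / total

p_excellent_W = rel_freq(10, W_SIZE)       # 10/1,000  = 0.01
p_waterfront_yes_W = rel_freq(5, W_SIZE)   # 5/1,000   = 0.005
p_waterfront_yes_D = rel_freq(5, D_SIZE)   # 5/10,000  = 0.0005
```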

  17. Indexing Atomic Probabilities • {AttName, AttVal, Prob}; B+ tree index on (AttName, AttVal); one such table each for W and D • {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}; B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight); one such table each for W and D

  18. Scan Algorithm • Preprocessing (Atomic Probabilities Module): computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y • Execution: select the tuples that satisfy the query, scan and compute the score for each result tuple, return the top-k tuples
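The execution phase of the Scan algorithm can be sketched as below. The dict-based probability lookups stand in for the indexed tables, and the function name and smoothing constant are assumptions of this sketch; the score follows the global-times-conditional product structure of the ranking function.

```python
def scan_rank(tuples, query, p_yW, p_yD, p_xyW, p_xyD, k):
    """Select tuples satisfying the query, score each from the
    precomputed atomic probabilities, and return the top-k."""
    eps = 1e-9  # fallback for values never seen in W or D (an assumption)
    scored = []
    for t in tuples:
        if any(t.get(a) != v for a, v in query.items()):
            continue  # tuple does not satisfy the selection condition
        score = 1.0
        for y_attr, y_val in t.items():
            if y_attr in query:
                continue  # only unspecified attributes Y contribute
            score *= p_yW.get((y_attr, y_val), eps) / p_yD.get((y_attr, y_val), eps)
            for x_attr, x_val in query.items():
                score *= (p_xyW.get((x_attr, x_val, y_attr, y_val), eps)
                          / p_xyD.get((x_attr, x_val, y_attr, y_val), eps))
        scored.append((score, t))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]
```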

  19. Beyond the Scan Algorithm • The Scan algorithm is inefficient when there are many tuples in the answer set • The other extreme: pre-compute the top-k tuples for all possible queries; still infeasible in practice • Trade-off solution: pre-compute ranked lists of tuples for all possible atomic queries; at query time, merge the ranked lists to get the top-k tuples

  20. Two Kinds of Ranked Lists • CondList Cx: {AttName, AttVal, TID, CondScore}, B+ tree index on (AttName, AttVal, CondScore) • GlobList Gx: {AttName, AttVal, TID, GlobScore}, B+ tree index on (AttName, AttVal, GlobScore)
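How the per-condition scores from these lists combine can be sketched as follows. Note this is a deliberately simplified, exhaustive merge: the paper's List Merge algorithm walks the sorted B+-tree lists with early termination, whereas this sketch (with invented names) only illustrates that a tuple's total score is the product of its partial scores across all lists.

```python
from collections import defaultdict

def merge_lists(ranked_lists, k):
    """Each input maps TID -> the partial score contributed by one
    atomic condition; multiply partial scores per TID and return the
    top-k TIDs among tuples present in every list."""
    product = defaultdict(lambda: 1.0)
    hits = defaultdict(int)
    for lst in ranked_lists:
        for tid, partial in lst.items():
            product[tid] *= partial
            hits[tid] += 1
    n = len(ranked_lists)
    # a tuple must appear in every list to satisfy all query conditions
    ranked = sorted(((s, tid) for tid, s in product.items() if hits[tid] == n),
                    reverse=True)
    return [tid for _, tid in ranked[:k]]
```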

  21. Index Module

  22. List Merge Algorithm

  23. Experimental Setup • Datasets: MSR HomeAdvisor Seattle (http://houseandhome.msn.com/) and the Internet Movie Database (http://www.imdb.com) • Software and hardware: Microsoft SQL Server 2000 RDBMS; P4 2.8-GHz PC, 1 GB RAM; C#, connected to the RDBMS through DAO

  24. Quality Experiments • Conducted on the Seattle Homes and Movies tables • Collect a workload from users • Compare the Conditional ranking method of the paper with the Global method [CIDR03]

  25. Quality Experiment-Average Precision • For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples • Let each user mark 10 tuples in Hi as most relevant to Qi • Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
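One plausible reading of this match measure (the slide does not give the exact formula, so the function below is an assumption of this sketch) is the overlap between the user's 10 marked tuples and the algorithm's top 10:

```python
def precision_at_10(user_marked, algo_top10):
    """Fraction of the user's relevant tuples that also appear in the
    algorithm's returned top-10 list."""
    return len(set(user_marked) & set(algo_top10)) / len(user_marked)

# e.g. 5 of the user's 10 marked tuples appear in the algorithm's top 10
score = precision_at_10(list(range(1, 11)), list(range(6, 16)))  # 0.5
```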

  26. Quality Experiment- Fraction of Users Preferring Each Algorithm • 5 new queries • Users were given the top-5 results

  27. Performance Experiments • Datasets • Compare 2 algorithms: • Scan algorithm • List Merge algorithm

  28. Performance Experiments – Pre-computation Time

  29. Performance Experiments – Execution Time

  30. Performance Experiments – Execution Time

  31. Performance Experiments – Execution Time

  32. Conclusion and Open Problems • Automatic ranking for the many-answers problem • Adaptation of PIR to DB • Open problems: multiple-table queries • Non-categorical attributes
