1 / 22

Best Position Algorithms for Top-k Queries

Best Position Algorithms for Top-k Queries. Reza Akbarinia , Esther Pacitti , Patrick Valduriez Atlas Team, INRIA and LINA, Nantes University of Nantes, France VLDB’07, September 23-28, 2007, Vienna, Austria 2007. 12. 28

ivria
Download Presentation

Best Position Algorithms for Top-k Queries

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Best Position Algorithms for Top-k Queries Reza Akbarinia, Esther Pacitti, Patrick Valduriez Atlas Team, INRIA and LINA, Nantes University of Nantes, France VLDB’07, September 23-28, 2007, Vienna, Austria 2007. 12. 28 Summarized and Presented by Seungseok Kang, IDS Lab., Seoul National University

  2. Outline • Introduction • Problem Definition • Related Work • Fagin’s Alg. (FA) • Threshold Alg. (TA) • Best Position Algorithm (BPA) • Optimization (BPA2) • Performance Evaluation • Conclusion Center for E-Business Technology

  3. Top-k Query • Returns only the k most relevant answers • Scoring function (sf): determines the answers’ relevance (score) • Advantage • avoid overwhelming the user with large numbers of uninteresting answers • Useful in many areas • Network and system monitoring • Information retrieval • Multimedia databases • Sensor networks • Data stream systems • P2P systems • Etc. • Hard to support efficiently • Need to aggregate overall scores from local scores Center for E-Business Technology

  4. General Model for Top-k Queries • Suppose • n data items • m lists of the n data items such that • Each data item has a local score in each list • Each List is sorted in decreasing order of the local scores • Overall score of a data item • Computed based on its local scores using a given scoring function • Objective • Find the k data items whose overall scores are the highest w.r.t. a given scoring function Center for E-Business Technology

  5. General Model (contd.) • Top-k tuples in relational tables: • Have a sorted list (index) over each attribute • Then, find the k tuples whose overall scores in the lists are the highest • Top-k documents w.r.t. some given keywords: • Have for each keyword, a ranked list of documents • Then, find the k documents whose overall scores in the lists are the highest Center for E-Business Technology

  6. Execution Cost of Top-k Algorithm • # of accesses to the lists are key • Two types of access to the lists • Sorted Access (SA) • Reads next item in the list (starts with the first data item) • Random Access (RA) • Looks up a given data item in the list by its identifier • Execution cost of a top-k algorithm A over a database D Cost(A, D) = (num_SA * cost_SA) + (num_RA * cost * RA) Center for E-Business Technology

  7. Problem Definition • Assumption: • Scoring function is monotonic, i.e. sf(x) ≤sf(y) if x< y • Many of the popular aggregation functions are monotonic, • Sum, Min, Max, Avg, … • Given • m lists of n data items (also called a database) • A monotonic scoring function • An integer k such that k ≤ n • Objective • Find the k data items whose overall score is the highest, while minimizing execution cost Center for E-Business Technology

  8. Related Work • Fagin’s Algorithm (FA, 1999) • Simple Algorithm • Do sorted access in parallel to the lists until at least k data items have been seen in all lists • Do random access as needed to each of the lists to find the local score and compute the overall score of data item Center for E-Business Technology

  9. Related Work (contd.) • FA Example (k=3) Center for E-Business Technology

  10. Related Work (contd.) • Threshold Algorithm (TA, 1999~2001) • The most efficient algorithm (so far) over sorted lists • The basis for many TA-style distributed algorithms • Proposed independently by several groups • [Nepal and Ramakrishna, ICDE99] • [Fagin, Lotem and Naor, PODS01] • [Güntzer, Kießling and Balke, ITCC01] • Similar to FA in doing sorted access to the lists, but with a different stopping condition • After seeing each data item, TA does random access to other lists to read the data item’s score in all lists • It uses a threshold (T) to predict maximum possible score of unseen items • Based on the last scores seen in the lists under sorted access • It stops when there are at least k seen data items whose overall score ≥ T Center for E-Business Technology

  11. Related Work (contd.) • TA Example (k=3) Sortedaccess Phase 1 : d1‘s overall Score 30+21+14 < Position 1’s threshold 88 → continue, no replace Y = {d1, d2, d3} Phase 2 : d4’s overall Score 28+13+25 < Position 2’s threshold 84 → continue, replace d1 Y = {d3, d4, d5} … Phase 6 : d3, d5 and d8’s overall Score > Position 6’s threshold 63 → stop Y = {d3, d5, d8} Center for E-Business Technology

  12. BPA • Main idea: take into account the positions (and scores) of the seen items for stopping condition • Enables BPA to stop much sooner than TA • Best position • the greatest seen position in a list such that any position before it is also seen • Thus, we are sure that all positions between 1 and best position have been seen • Stopping condition • Based on best positions overall score, i.e. the overall score computed based on the best positions in all lists Center for E-Business Technology

  13. BPA Algorithm • Do sorted access in parallel to each list Li • For each data item seen in Li • Do random access to the other lists to retrieve the item’s score and position • Maintain the positions and scores of the seen data item • Compute best position in Li • Compute best positions overall score • Stop when there are at least k data items whose overall score ≥ best positions overall score Center for E-Business Technology

  14. BPA Example Sortedaccess • Phase 1 • Sorted access to Pos. 1 (d1, d2, d3) • Random access and obtain local score and position : P1={1,4,9}, P2={1,6,8}, P3={1,5,8}, bp1=bp2=bp3=1 • Best position score(30+28+30=88) > overall score of the highest data items {d1, d2, ,d3} (65,63,70) → continue • Phase 2 • Sorted access to Pos. 2 (d4, d5, d6) • Random access : P1={1.2,4,7,8,9}, P2 = {1,2,4,6,8,9}, P3={1,2,4,5,6,8}, bp1=bp2=bp3=2 • Best position score(28+27+29=84) > overall score of the highest data items {d3, d4, d5} (70,66,70) → continue • Phase 3 • Sorted access to Pos. 3 (d9, d7, d8) • Random access : P1=P2={1,2,3,4,5,6,7,8,9}, P3={1,2,3,4,5,6,8,9,10}, bp1=bp2=9, bp3=6 • Best position score(11+13+19=43) < overall score of the highest data items {d3, d5, d8} (70,70,71) → stop Center for E-Business Technology

  15. BPA Analysis • Lemma 1 • The number of sorted (random) accesses done by BPA is always less than or equal to that of TA. In other words, BPA stops always as early as TA • Theorem 1 • If the scoring function f is monotonic, the BPA correctly finds the top-k answers • Theorem 2 • The execution cost of BPA over any database is always less than or equal to that of TA • Theorem 3 • The execution cost of BPA can be (m-1) timeslower than that of TA, where m is the number of lists Center for E-Business Technology

  16. BPA Optimization : BPA2 • Issues • Redundancies with BPA and TA • It may access some data items several times under sorted access in different lists • Main optimization • Uses the direct access mode • Retrieves the data item which is at a given position in a list • Avoids re-accessing data via sorted or random access • In BPA, a data item may be accessed several times in different lists • In BPA2, no data item in a list is accessed more than once • Manages best positions of a list by • Bit array or B+-tree over the list Center for E-Business Technology

  17. BPA2 • For each list Li do in parallel • Let bpi be the best position in Li. Initially set bpi=0 • Continually do direct access to position (bpi + 1) • Do random access to the other lists to retrieve the scores of the seen data item in all lists • After each direct or random access to a list, update the best position of the list • Stop when there are at least k data items whose overall score ≥ best positions overall score Center for E-Business Technology

  18. BPA2 Analysis • Theorem 4 • No position in a list is accessed by BPA2 more than once • Theorem 5 • The number of accesses to the lists done by BPA2 can be approximately (m-1) times lower than that of BPA Center for E-Business Technology

  19. Performance Evaluation • Implementation of TA, BPA and BPA2 • To study the performance in the average case • Synthetic data sets • Uniform • Gaussian • Correlated • Metrics • Execution cost • Customized for centralized systems • Cost of a random access is (log n) times of a sorted access • Number of accesses • Useful in distributed systems • Response time • Over a machine with a 2.4 GHz Intel Pentium 4 Center for E-Business Technology

  20. Experiment Center for E-Business Technology

  21. Experiment (contd.) Center for E-Business Technology

  22. Conclusion • BPA • Over any database, it stops as early as TA • Its execution cost can be (m-1) times lower than that of TA • BPA2 • Avoids re-accessing data items via sorted and random access, without having to keep data at the query originator • The number of accesses to the lists done by BPA2 can be about (m-1) times lower than that of BPA • Validation and performance evaluation • BPA and BPA2 outperform TA by significant factors • Future Work • BPA-style algorithms for P2P systems Center for E-Business Technology

More Related