
Evaluating top-k Queries over Web-Accessible Databases

Evaluating top-k Queries over Web-Accessible Databases. Paper by: Amelie Marian, Nicolas Bruno, Luis Gravano. Presented by: Bhushan Chaudhari, University of Texas at Arlington.


Presentation Transcript


  1. Evaluating top-k Queries over Web-Accessible Databases • Paper by: Amelie Marian, Nicolas Bruno, Luis Gravano • Presented by: Bhushan Chaudhari, University of Texas at Arlington

  2. Overview • The top-k results are what matter most • Fagin’s algorithms (e.g. FA and TA) efficiently identify the top results among many candidates • Here we consider a broader scenario: web-accessible databases • Assumption: keywords typed into the search box are mapped to the appropriate related modules (web-accessible databases) • Probing web sources incurs long query response times • The approach tries to exploit the parallel access offered by the web

  3. Introduction • We do not expect exact answers from a search engine, only the closest matching tuples • Querying a general search engine differs from querying a dedicated one, e.g. Google vs. Amazon • The paper defines the problem with a restaurant example: “finding the nearest available restaurants given the current location, rating and price”

  4. Approach • Thinking beyond relational databases • Web-accessible sources store information such as restaurant ratings, maps, etc. • Rating => Zagat-Review website • Price => New York Times’s NYT-Review website • Address => MapQuest website • Scenario where the databases are geographically and functionally different but related “in some way” • Assumptions: 1. The interface required for accessing the web sources is in place 2. The dependency constraints are handled

  5. Approach (continued ..) • Comparable to a similar scenario with several multimedia systems, which are more closely connected • Here we try to use the intrinsically parallel nature of the web • We issue probes to various sources in parallel to reduce the overall query processing time • Assumption: keywords typed into the search box are routed to the appropriate related modules (web-accessible databases) • Probing web sources incurs long query response times • The approach tries to exploit the parallel access offered by the web

  6. Data and Query models • Tuples are ordered by how closely they match the given query • Different attributes can be assigned different weights • Sources • S-Source: provides a list of objects in order of their scores, e.g. the rating provider Zagat-Review • R-Source: provides the score of a given object via random access, e.g. MapQuest for the distance • SR-Source: provides both kinds of access • U(t): upper bound on the score of object t • U_unseen: upper bound on the score of any object not yet retrieved • E(t): expected score of object t
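
The bounds on this slide can be illustrated with a minimal Python sketch. It assumes (this is not stated on the slide) that attribute scores lie in [0, 1] and that the overall score is a weighted sum; the attribute names, weights, and expected values are hypothetical.

```python
from typing import Dict

def upper_bound(weights: Dict[str, float], known: Dict[str, float]) -> float:
    """U(t): assume every attribute not yet probed gets the maximum score 1.0."""
    return sum(w * known.get(attr, 1.0) for attr, w in weights.items())

def expected_score(weights: Dict[str, float], known: Dict[str, float],
                   expected: Dict[str, float]) -> float:
    """E(t): substitute a per-attribute expected score for each unprobed attribute."""
    return sum(w * known.get(attr, expected[attr]) for attr, w in weights.items())

# Hypothetical restaurant for which only the rating has been probed so far.
weights = {"rating": 0.5, "price": 0.25, "distance": 0.25}
known = {"rating": 0.8}
expected = {"rating": 0.5, "price": 0.5, "distance": 0.5}

u_t = upper_bound(weights, known)               # best case for t
e_t = expected_score(weights, known, expected)  # typical case for t
```

Each additional probe replaces one optimistic 1.0 with a real score, so U(t) can only stay the same or decrease as probing proceeds.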

  7. Query Model (continued ..) • Getting all the scores of the top-k objects from S-sources alone can be expensive • Therefore the availability of SR-sources is important for this approach • Initially we assume that every source knows about every object • If a score cannot be obtained, it can be replaced with a default value, e.g. a newly opened restaurant might not yet be ranked by the review websites

  8. Sequential Query Processing Strategy • At each step the strategy either returns, via sorted access, an unseen object that has not yet been probed on the other sources • Or it returns an already seen object together with a source that needs to be probed via random access to get the corresponding score

  9. TA strategy • Processes top-k queries over SR sources • The algorithm retrieves the next “best” object via sorted access • Probes all its unknown scores via random access • Computes the final score for the object • At any given time, keeps track of the top-k tuples seen so far • When no unretrieved object can have a score higher than the current top-k tuples, the solution is reached
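
The steps above can be sketched in Python as follows. This is a simplified illustration, not the paper's implementation: it assumes every source is an in-memory SR-source holding all objects, that the overall score is a weighted sum, and that a missing score defaults to 0.0. The function name and sample data are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

def threshold_algorithm(sources: List[List[Tuple[str, float]]],
                        weights: List[float], k: int) -> List[Tuple[str, float]]:
    """sources[i] lists (object, score) pairs sorted by descending score."""
    lookup = [dict(src) for src in sources]      # simulated random access
    final: Dict[str, float] = {}                 # object -> complete score
    for depth in range(len(sources[0])):
        last_seen = []
        for src in sources:
            obj, s = src[depth]                  # sorted access
            last_seen.append(s)
            if obj not in final:                 # probe all scores of obj
                final[obj] = sum(w * lookup[j].get(obj, 0.0)
                                 for j, w in enumerate(weights))
        # No unretrieved object can score above this threshold.
        threshold = sum(w * s for w, s in zip(weights, last_seen))
        top = heapq.nlargest(k, final.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, final.items(), key=lambda kv: kv[1])

rating = [("a", 0.9), ("b", 0.8), ("c", 0.3)]
price = [("b", 0.9), ("a", 0.6), ("c", 0.5)]
best = threshold_algorithm([rating, price], [0.5, 0.5], k=1)
```

In this example the algorithm stops after the second round of sorted accesses, so object "c" is never scored.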

  10. Improvements upon TA • The bounded-buffer assumption is removed: no object is discarded until the algorithm returns • Because the same object might be returned again by a different SR source • For selection queries of the form p1 ∧ p2 ∧ … ∧ pn • Each predicate pi can be expensive to evaluate • The key idea is to order the evaluation so as to minimize the expected execution time • The order is decided by Rank(pi) = (1 - selectivity(pi)) / cost-per-object(pi)
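
As an illustration of the rank formula, the following sketch orders a few predicates. It assumes that predicates are evaluated in decreasing order of rank (the slide does not state the direction); the predicate names, selectivities, and costs are hypothetical.

```python
from typing import List, NamedTuple

class Predicate(NamedTuple):
    name: str
    selectivity: float      # fraction of objects that satisfy the predicate
    cost_per_object: float  # cost of evaluating the predicate on one object

def rank(p: Predicate) -> float:
    """Rank(p) = (1 - selectivity(p)) / cost-per-object(p)."""
    return (1.0 - p.selectivity) / p.cost_per_object

def order_predicates(preds: List[Predicate]) -> List[Predicate]:
    # Assumption: cheap predicates that filter out many objects go first.
    return sorted(preds, key=rank, reverse=True)

preds = [
    Predicate("p1", selectivity=0.5, cost_per_object=1.0),
    Predicate("p2", selectivity=0.1, cost_per_object=3.0),
    Predicate("p3", selectivity=0.9, cost_per_object=0.1),
]
ordered_names = [p.name for p in order_predicates(preds)]
```

Here p3 comes first despite its high selectivity, because it is so cheap to evaluate.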

  11. Improvements upon TA (Continued..) • Let w1, w2, …, wn be the weights of sources D1, D2, …, Dn • Let e(Ri) be the expected score of a randomly picked object for source Ri • Then the expected decrease in U(t) after probing Ri for object t is di = wi * (1 - e(Ri)) • We sort the sources in decreasing order of their rank, where the rank of a source Ri is defined as Rank(Ri) = di / tR(Ri), with tR(Ri) the random-access time of Ri • Thus we favor fast sources that might have a large impact on the final score of an object
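
The source ranking above can be worked through in a short Python sketch. The source names, weights, expected scores, and access times are hypothetical values chosen only for illustration.

```python
from typing import List, NamedTuple

class Source(NamedTuple):
    name: str
    weight: float          # w_i
    expected_score: float  # e(R_i)
    access_time: float     # tR(R_i), time of one random access

def expected_decrease(s: Source) -> float:
    """d_i = w_i * (1 - e(R_i))."""
    return s.weight * (1.0 - s.expected_score)

def source_rank(s: Source) -> float:
    """Rank(R_i) = d_i / tR(R_i)."""
    return expected_decrease(s) / s.access_time

def order_sources(sources: List[Source]) -> List[Source]:
    return sorted(sources, key=source_rank, reverse=True)

sources = [
    Source("R1", weight=0.5, expected_score=0.5, access_time=2.0),  # rank 0.125
    Source("R2", weight=0.3, expected_score=0.2, access_time=1.0),  # rank 0.24
    Source("R3", weight=0.2, expected_score=0.5, access_time=0.5),  # rank 0.2
]
probe_order = [s.name for s in order_sources(sources)]
```

R1 has the largest expected decrease (0.25) but is probed last because its access time is the highest.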

  12. Upper Strategy • Upper allows more flexible probing, in which sorted and random accesses can be interleaved even when some objects have been only partially probed • When a probe completes, Upper decides whether • to perform a sorted-access probe on a source to get new objects, or • to perform the “most promising” random-access probe on some object

  13. Upper Strategy (Continued..)

  14. Upper Strategy (Continued..) Selection of further probes will again depend upon the weight for that source and our ranking function
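
One way to picture the decision Upper makes after each probe is the sketch below. It rests on assumptions not spelled out in these slides: that the “most promising” object is the partially probed one with the highest U(t), that a sorted access is chosen when U_unseen exceeds that bound, and that the source to probe is the highest-ranked one still missing for that object. All names and numbers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Probe = Tuple[str, Optional[str], Optional[str]]  # (kind, object, source)

def choose_next_probe(upper_bounds: Dict[str, float], u_unseen: float,
                      missing: Dict[str, List[str]],
                      source_rank: Dict[str, float]) -> Probe:
    # Only objects that still lack some score can be probed further.
    candidates = {t: u for t, u in upper_bounds.items() if missing.get(t)}
    if not candidates:
        return ("sorted", None, None)
    best = max(candidates, key=lambda t: candidates[t])
    if u_unseen > candidates[best]:
        # An unseen object could still beat every seen one: fetch a new object.
        return ("sorted", None, None)
    # Otherwise probe the best object on its highest-ranked missing source.
    source = max(missing[best], key=lambda s: source_rank[s])
    return ("random", best, source)

upper_bounds = {"a": 0.9, "b": 0.7}
missing = {"a": ["R1", "R2"], "b": ["R1"]}
source_rank = {"R1": 0.1, "R2": 0.3}

random_choice = choose_next_probe(upper_bounds, 0.8, missing, source_rank)
sorted_choice = choose_next_probe(upper_bounds, 0.95, missing, source_rank)
```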

  15. Parallel Query Processing Strategy • Query processing is bound to take a long time • Web databases exhibit high and variable latency • We attempt to maximize source-access parallelism to minimize query processing time • Source Access Constraints • Sources may impose access restrictions and vary in load and network capabilities • The number of parallel probes for source Di can be controlled
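
A per-source cap on parallel probes can be sketched with Python's asyncio, as below. This is only an illustration of the constraint, not the paper's scheduler: the probe is simulated with a sleep, and the source names, limits, and latencies are hypothetical.

```python
import asyncio
from typing import Dict, List, Tuple

async def run_probes(probes: List[Tuple[str, str]],
                     max_parallel: Dict[str, int],
                     latency: Dict[str, float]) -> Dict[str, int]:
    """Run all (source, object) probes; return the peak concurrency per source."""
    # One semaphore per source enforces its limit on simultaneous probes.
    limits = {s: asyncio.Semaphore(n) for s, n in max_parallel.items()}
    active = {s: 0 for s in max_parallel}
    peak = {s: 0 for s in max_parallel}

    async def probe(source: str, obj: str) -> None:
        async with limits[source]:
            active[source] += 1
            peak[source] = max(peak[source], active[source])
            await asyncio.sleep(latency[source])  # stand-in for a web request
            active[source] -= 1

    await asyncio.gather(*(probe(s, o) for s, o in probes))
    return peak

probes = [("D1", "a"), ("D1", "b"), ("D1", "c"), ("D1", "d"),
          ("D2", "a"), ("D2", "b")]
peak = asyncio.run(run_probes(probes,
                              max_parallel={"D1": 2, "D2": 1},
                              latency={"D1": 0.01, "D2": 0.01}))
```

Probes to different sources overlap freely, while probes to the same source queue up once that source's limit is reached.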

  16. Parallel Query Processing Strategy • pTA: adapting the TA strategy • When a source Di becomes available, pTA chooses which object to probe on that source • It can be optimized by not probing objects whose final score cannot exceed that of the top-k objects already seen • Such an object is put on the “discarded” objects list • pUpper Strategy • If t is expected to be one of the top-k objects, all random accesses on sources for which t’s attribute score is missing are considered • Otherwise only the fastest probes expected to discard t are considered
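
The pTA pruning rule can be sketched as follows, assuming that the comparison is made between an object's upper bound U(t) and the k-th best final score among fully probed objects. The function name and sample values are hypothetical.

```python
import heapq
from typing import Dict, Set, Tuple

def split_candidates(upper_bounds: Dict[str, float],
                     final_scores: Dict[str, float],
                     k: int) -> Tuple[Set[str], Set[str]]:
    """Return (objects still worth probing, objects to discard)."""
    if len(final_scores) < k:
        # Fewer than k complete objects: nothing can be ruled out yet.
        return set(upper_bounds), set()
    kth_best = heapq.nlargest(k, final_scores.values())[-1]
    # An object whose upper bound cannot exceed the k-th best score is dropped.
    keep = {t for t, u in upper_bounds.items() if u > kth_best}
    return keep, set(upper_bounds) - keep

final_scores = {"a": 0.9, "b": 0.8}   # fully probed objects
upper_bounds = {"c": 0.85, "d": 0.6}  # partially probed objects
keep, discarded = split_candidates(upper_bounds, final_scores, k=2)
```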

  17. Evaluation Settings • Local sources • Real Web Accessible sources • Mix of SR and R sources

  18. Evaluation Results • Sequential Algorithms – Local Database

  19. Evaluation Results • Sequential Algorithms – Web Database

  20. Evaluation Results • Parallel algorithms - Local Database

  21. Evaluation Results • Parallel algorithms - Web Database • pUpper is faster than pTA • pUpper carefully selects the probes for each object • It considers probing time and source congestion to make probing choices at the per-object level • This results in better use of parallelism and faster query processing

  22. Conclusion • Probe interleaving greatly improves query execution time • Upper is desirable when sources show moderate to high random-access times • The approach in this paper works well within the source-access constraints of the web • The model can be extended to capture more expressive web interfaces

  23. References • Ronald Fagin, Amnon Lotem, Moni Naor. Optimal Aggregation Algorithms for Middleware. PODS 2001 • Nicolas Bruno, Luis Gravano, Amelie Marian. Evaluating Top-k Queries over Web-Accessible Databases. ICDE 2002 (Compact Version) • Nicolas Bruno, Luis Gravano, Amelie Marian. Evaluating Top-k Queries over Web-Accessible Databases. ACM 2004 (Full Version)
