Supporting Top- k join Queries in Relational Databases

Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Z. Joseph, CSE-UT Arlington

Introduction • Often searches are done on multiple features • Each feature produces a different ranking for the query • Must thus join and aggregate rankings on different features

Example • Find location for a house such that the combination of the cost of the house and 5 years tuition at a nearby school is minimal. • Exact location is not predefined in query, per location the house and school features would have to be analyzed.

Motivation • Current techniques decouple join and sorting (ranking) of results. • Sorting is expensive and is a blocking operation. • More apparent if ranking and the joining features are different.

Rank-Join Algorithm • Generate new valid join combinations • Compute score for each combination • For each incoming input, calculate the total score of: • The last seen feature value and the top ranked feature value for all other features in the query. • Store the maximum of these as T (threshold) • Store top k in priority queue. • Halt when lowest value of queue ≥ T

Optimality • Is Instance Optimal over all correct top-K join algorithms. • Guarantees that cost of Rank-Join is O (cost of any other algorithm). • Mathematically: • Cost(Rank-Join) ≤ c*Cost(Any Other Algorithm) + c’ • c is the optimality ratio • c, c’ > 0

Rank-Join Continued … • Join strategy crucial • Recommended: Ripple Join • Alternates between tuples • Flexible in the way it sweeps out (rectangular, etc) • Retains ordering in considering samples • Variant of Rank-Join • Hash Rank Join (HRJN) • Block Ripple Join

Hash Rank Join (HRJN) Operator • Built on idea of hash ripple join • Inputs are as two hash tables • Maintains highest (first) and lowest (last selected) objects from each relation. • Results are added to a priority queue • Advantages: • Smaller space requirement • Can be pipelined

Hash Rank Join (HRJN) Operator: Problems • Local Ranking Problem • Results from three or more input streams • Larger queue sizes • More database accesses • Buffer Problem • Cannot predict how many partial joins will result

HRJN Solutions? • Block Ripple Joins • Do comparisons as blocks • Score-Guided Strategy • If thresholds are very different, then this may be because of the way one of the rankings is larger and descends at a slower rate • Can then take more inputs from the slower growing ranking so that the threshold goes closer to the other thresholds

Optimal Join-Order • Try to have the least number of input records in order to get a correct ranking • No clear way of estimating the order of joins • Have a heuristic – Footrule Distance • Simple measure of similarity among two rankings. • First join the most similar rankings • This would quickly yield a join by accessing fewer records

Rank-Join Algorithm: Benefits • What can it do? • Integrates well with query plans • Produces results as fast as possible • Provides performance guarantees • Minimizes space requirements • Offers a mechanism to determine the best order of joining to execute query optimally. • Can be improved further if random access is available • Can eliminate on-the-fly duplicate elimination

References • “Supporting top-k join queries in relational databases” - Ihab Ilyas, Walid Aref, Ahmed Elmagarmid (2004) • Jing Chen : DBIR Spring 2005, CSE-UT Arlington http://ranger.uta.edu/~gdas/Courses/ Spring2005/DBIR/slides/top-k_join.ppt

Supporting Top- k join Queries in Relational Databases