
PREFER: A System for the Efficient Execution of Multi-parametric Ranked Queries






Presentation Transcript


  1. PREFER: A System for the Efficient Execution of Multi-parametric Ranked Queries • Vagelis Hristidis University of California, San Diego • Nick Koudas AT&T Research • Yannis Papakonstantinou University of California, San Diego

  2. Example

  3. Example ORDER BY 0.01·Mileage + 0.6·Year + 0.03·Price

  4. Example ORDER BY 0.01·Mileage + 0.6·Year + 0.03·Price

  5. Example Problem: retrieve the WHOLE relation ORDER BY 0.01·Mileage + 0.6·Year + 0.03·Price

  6. Example PREFER retrieves only part of the relation Problem: retrieve the WHOLE relation ORDER BY 0.01·Mileage + 0.6·Year + 0.03·Price

  7. Applications Such preference queries are used on Web sites such as: • www.Zagat.com (restaurants) • www.personallogic.com (online retailer)

  8. Definitions - Problem statement • A preference query orders the tuples of a relation according to a function of their attribute values, e.g. 0.01·Mileage + 0.6·Year + 0.03·Price • The goal is to produce the top-K answers of a preference query while retrieving the minimum number of tuples
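As a concrete baseline for the problem statement, here is a minimal sketch in Python; the car tuples and weight values are made-up illustrative data, not from the paper. The naive evaluation scores and sorts the whole relation, which is exactly the cost PREFER is designed to avoid.

```python
def preference_score(t, weights):
    """Linear preference function, e.g. 0.01*Mileage + 0.6*Year + 0.03*Price."""
    return sum(w * t[attr] for attr, w in weights.items())

def naive_top_k(relation, weights, k):
    """Baseline: score every tuple, sort the WHOLE relation, keep the top K."""
    return sorted(relation, key=lambda t: preference_score(t, weights),
                  reverse=True)[:k]

# Made-up car tuples for illustration.
cars = [
    {"Mileage": 30000, "Year": 1999, "Price": 9000},
    {"Mileage": 80000, "Year": 1995, "Price": 4000},
    {"Mileage": 10000, "Year": 2001, "Price": 14000},
]
weights = {"Mileage": 0.01, "Year": 0.6, "Price": 0.03}
top = naive_top_k(cars, weights, 2)
```

The cost of this baseline is a full scan plus a full sort regardless of K, which motivates answering from a ranked materialized view instead.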

  9. Our Approach PREFER materializes a number of ranked views of the relation and uses them to efficiently answer preference queries.

  10. Our Approach [figure: the weight vectors of two materialized ranked views, 0.075*Price + 0.8*Year and 0.08*Price + 0.2*Year, plotted in (Price, Year) weight space]

  11. Our Approach [figure: the preference query 0.07*Price + 0.35*Year plotted in the same weight space, close to the ranked view 0.08*Price + 0.2*Year]

  12. PREFER Architecture [diagram] Preprocessing stage: the relation, space constraints, and a discretization of the ranked views’ vectors feed Views Creation, which decides which ranked views we should materialize.

  13. PREFER Architecture [diagram] Preprocessing stage: Views Creation (inputs: relation, space constraints, discretization of ranked views’ vectors; question: which ranked views should we materialize?) produces the materialized views and an index of them. Runtime process: a query goes to View Selection (which ranked view should we use to answer a specific preference query?), which returns a ranked view id; the query and the chosen view feed the Pipelining Algorithm (how do we use a ranked view to answer a preference query?), which outputs the results.

  14. PREFER Architecture [same diagram repeated]

  15. Watermark [figure: Ranked View, ordered by 0.02*Mileage+0.4*Year+0.04*Price, beside the Result, ordered by 0.01*Mileage+0.6*Year+0.03*Price; t1 is the last tuple output; Watermark = 14.26]

  16. Watermark Calculating the Watermark [formula figure lost in transcript]
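The watermark formula itself did not survive the transcript. A simple sufficient watermark can be derived from the ratio bound f_q(t) ≤ max_i(q_i/v_i) · f_v(t), which holds when attribute values are non-negative and view weights are positive; the paper derives a tighter bound, so treat this as an illustrative sketch only.

```python
def score(t, w):
    """f(t) = sum over attributes of weight * attribute value."""
    return sum(w[a] * t[a] for a in w)

def simple_watermark(t1, q_weights, v_weights):
    """Watermark for the tuple t1 about to be output.

    Assumes non-negative attribute values and positive view weights.
    Since f_q(t) <= max_i(q_i / v_i) * f_v(t), any view tuple t with
    f_v(t) below the returned value T satisfies f_q(t) < f_q(t1), so
    the scan of the view can safely stop once f_v drops below T."""
    ratio = max(q_weights[a] / v_weights[a] for a in q_weights)
    return score(t1, q_weights) / ratio
```

For the slide's vectors (view 0.02*Mileage+0.4*Year+0.04*Price, query 0.01*Mileage+0.6*Year+0.03*Price) the ratio is max(0.5, 1.5, 0.75) = 1.5, so T = f_q(t1)/1.5; whether this reproduces the slide's 14.26 depends on t1's attribute values, which the transcript does not give.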

  17. How to use a ranked view to answer a preference query (cont’d) PipelineResults Algorithm Result, ordered by 0.01*Mileage+0.6*Year+0.03*Price Ranked View, ordered by 0.02*Mileage+0.4*Year+0.04*Price • Calculate Watermark for t1, which is 14.26 t1

  18. How to use a ranked view to answer a preference query (cont’d) PipelineResults Algorithm Result, ordered by 0.01*Mileage+0.6*Year+0.03*Price Ranked View, ordered by 0.02*Mileage+0.4*Year+0.04*Price • Calculate Watermark for t1, which is 14.26 • Find prefix of view with fv greater than watermark value and sort them by fq t1

  19. How to use a ranked view to answer a preference query (cont’d) PipelineResults Algorithm Result, ordered by 0.01*Mileage+0.6*Year+0.03*Price Ranked View, ordered by 0.02*Mileage+0.4*Year+0.04*Price • Calculate Watermark for t1, which is 14.26 • Find prefix of view with fv greater than watermark value and sort them by fq t1

  20. How to use a ranked view to answer a preference query (cont’d) PipelineResults Algorithm Result, ordered by 0.01*Mileage+0.6*Year+0.03*Price Ranked View, ordered by 0.02*Mileage+0.4*Year+0.04*Price • Calculate Watermark for t1, which is 14.26 • Find prefix of view with fv greater than watermark value and sort them by fq • Output tuples up to t1 t1

  21. How to use a ranked view to answer a preference query (cont’d) PipelineResults Algorithm Result, ordered by 0.01*Mileage+0.6*Year+0.03*Price Ranked View, ordered by 0.02*Mileage+0.4*Year+0.04*Price • Calculate Watermark for t1, which is 14.26 • Find prefix of view with fv greater than watermark value and sort them by fq • Output tuples up to t1 • Repeat using first unprocessed as t1 t1

  22. How to use a ranked view to answer a preference query (cont’d) PipelineResults Algorithm Result, ordered by 0.01*Mileage+0.6*Year+0.03*Price Ranked View, ordered by 0.02*Mileage+0.4*Year+0.04*Price • Calculate Watermark for t1, which is 13.1 • Find prefix of view with fv greater than watermark value and sort them by fq • Output tuples up to t1 • Repeat using first unprocessed as t1 t1

  23. How to use a ranked view to answer a preference query (cont’d) PipelineResults Algorithm Result, ordered by 0.01*Mileage+0.6*Year+0.03*Price Ranked View, ordered by 0.02*Mileage+0.4*Year+0.04*Price • Calculate Watermark for t1, which is 13.1 • Find prefix of view with fv greater than watermark value and sort them by fq • Output tuples up to t1 • Repeat using first unprocessed as t1 t1

  24. How to use a ranked view to answer a preference query (cont’d) PipelineResults Algorithm • Calculate Watermark for t1, which is 8.3 • Find prefix of view with fv greater than watermark value and sort them by fq • Output tuples up to t1 • Repeat using first unprocessed as t1 t1

  25. How to use a ranked view to answer a preference query (cont’d) PipelineResults Algorithm Result, ordered by 0.01*Mileage+0.6*Year+0.03*Price Ranked View, ordered by 0.02*Mileage+0.4*Year+0.04*Price • Calculate Watermark for t1, which is 8.3 • Find prefix of view with fv greater than watermark value and sort them by fq • Output tuples up to t1 • Repeat using first unprocessed as t1 t1
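The loop walked through on slides 17-25 can be sketched as runnable code, under simplifying assumptions: the view fits in memory as a list of distinct tuples sorted by descending fv, and a simple ratio watermark (f_q(t1) divided by max_i(q_i/v_i), valid for non-negative attributes) stands in for the paper's tighter formula.

```python
def score(t, w):
    return sum(w[a] * t[a] for a in w)

def pipeline_results(view, q_w, v_w, k):
    """view: distinct tuples sorted by descending f_v = score(., v_w).
    Returns the top-k tuples under f_q = score(., q_w)."""
    out = []
    remaining = list(view)                 # still in descending-f_v order
    while remaining and len(out) < k:
        t1 = remaining[0]
        # Ratio watermark: no tuple with f_v below wm can beat t1 on f_q.
        wm = score(t1, q_w) / max(q_w[a] / v_w[a] for a in q_w)
        # Prefix of the view still above the watermark, sorted by f_q.
        prefix = [t for t in remaining if score(t, v_w) >= wm]
        prefix.sort(key=lambda t: score(t, q_w), reverse=True)
        # Output tuples up to and including t1, then repeat with the
        # first unprocessed view tuple as the new t1.
        batch = prefix[:prefix.index(t1) + 1]
        out.extend(batch)
        remaining = [t for t in remaining if t not in batch]
    return out[:k]

q = {"Mileage": 0.01, "Year": 0.6, "Price": 0.03}
v = {"Mileage": 0.02, "Year": 0.4, "Price": 0.04}
cars = [
    {"Mileage": 200, "Year": 5,  "Price": 100},
    {"Mileage": 100, "Year": 10, "Price": 50},
    {"Mileage": 10,  "Year": 15, "Price": 10},
    {"Mileage": 0,   "Year": 12, "Price": 0},
]
view = sorted(cars, key=lambda t: score(t, v), reverse=True)
top = pipeline_results(view, q, v, 4)
```

The point of the algorithm is that each batch is provably correct to emit before the rest of the view has been read, which is what makes the output pipelined.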

  26. PREFER Architecture [architecture diagram repeated: preprocessing (Views Creation → materialized views + index) and runtime (View Selection → Pipelining Algorithm → output results)]

  27. Define coverage V1 covers q1: at most k tuples are retrieved from V1 in order to output the first result of q1. [figure: query q1 = 0.7*Price + 0.35*Year and view V1 = 0.8*Price + 0.2*Year in (Price, Year) weight space]
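One way to make this coverage definition operational is an empirical check: run the first round of the pipelining scan and count how many view tuples must be read before the first query result can be output. The sketch below is illustrative, not the paper's analytic coverage condition, and it reuses the simple ratio watermark (an assumption).

```python
def score(t, w):
    return sum(w[a] * t[a] for a in w)

def covers(view, q_w, v_w, k):
    """Does the view (sorted by descending f_v) yield query q_w's first
    result after retrieving at most k tuples?"""
    t1 = view[0]
    # Ratio watermark: tuples with f_v below wm cannot beat t1 under f_q.
    wm = score(t1, q_w) / max(q_w[a] / v_w[a] for a in q_w)
    retrieved = 0
    for t in view:
        if score(t, v_w) < wm:      # the scan can stop here
            break
        retrieved += 1
    return retrieved <= k
```

Intuitively, the closer the query vector is to the view vector, the shorter the watermark prefix, which is why View Selection picks the nearest materialized view in weight space.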

  28. Which ranked view should we use to answer a specific preference query? [figure: ranked views 0.75*Price + 0.8*Year and 0.8*Price + 0.2*Year in (Price, Year) weight space]

  29. Which ranked view should we use to answer a specific preference query? [same figure]

  30. Which ranked view should we use to answer a specific preference query? [figure: the query q1 = 0.7*Price + 0.35*Year falls in the region of view V1 = 0.8*Price + 0.2*Year] V1 covers q1

  31. PREFER Architecture [architecture diagram repeated]

  32. Which ranked views should we materialize? ViewSelection Algorithm:
      while (not all preference vectors in [0,1]^n covered)
          randomly pick v ∈ [0,1]^n and add it to the list of views L
      VIEWS ← ∅
      for i = 1 to C do
          select v ∈ L that covers the maximum number of uncovered vectors in [0,1]^n
          VIEWS ← VIEWS ∪ {v}

  33. Which ranked views should we materialize? (cont’d) ViewSelection Algorithm [same pseudocode; figure: random candidate views in L progressively covering the preference-vector space]

  34. Which ranked views should we materialize? (cont’d) ViewSelection Algorithm [same pseudocode; figure: the greedily selected views for C = 3]
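The ViewSelection pseudocode above can be sketched as runnable code. The coverage test is left as a parameter: in PREFER it asks whether a view answers a query within the tuple-retrieval constraint, while the demo below substitutes a toy closeness test in weight space, purely for illustration.

```python
import itertools
import random

def view_selection(vectors, covers, C):
    """Two-phase heuristic from the ViewSelection slides.
    Phase 1: randomly add candidate views to L until every preference
    vector is covered by some candidate.
    Phase 2: greedy max coverage -- keep the C candidates from L that
    cover the most still-uncovered vectors."""
    L, uncovered = [], set(vectors)
    while uncovered:
        v = random.choice(vectors)
        L.append(v)
        uncovered = {q for q in uncovered if not covers(v, q)}
    VIEWS, uncovered = [], set(vectors)
    for _ in range(C):
        if not L or not uncovered:
            break
        best = max(L, key=lambda v: sum(covers(v, q) for q in uncovered))
        VIEWS.append(best)
        L.remove(best)
        uncovered = {q for q in uncovered if not covers(best, q)}
    return VIEWS

# Toy stand-in for PREFER's coverage test (an assumption for this demo):
# a view covers a query vector if they are close in L-infinity distance.
def close(v, q, eps=0.25):
    return max(abs(a - b) for a, b in zip(v, q)) <= eps

# Discretized grid of 2-attribute preference vectors over [0,1]^2.
grid = [(round(x, 1), round(y, 1))
        for x, y in itertools.product([i / 10 for i in range(11)], repeat=2)]
```

Phase 1 always terminates because every vector covers at least itself; phase 2 is the standard greedy heuristic for maximum coverage.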

  35. Constraint on # of views Covering all preference vectors with the minimum number of materialized views is NP-hard. The greedy heuristic is an approximation algorithm for maximum coverage (it achieves a (1 − 1/e) factor).

  36. Related Work • Preference Query Framework [AW00] • Top-k queries • Joins • Fagin [F99,F96,F01], equijoins of ordered data • Selections [reduce top-k selection to range query] • Histograms to estimate cutoff [Chaudhuri&Gravano 99] • Probabilistic model [Donjerkovic&Ramakrishnan 99] • Partitioning [Carey & Kossman 97,98]

  37. Related Work The Onion Technique (SIGMOD 2000). Main observation: the points of interest lie on the convex hull of the tuple space. Drawbacks of Onion: • Does not scale • Computing the convex hull is computationally intensive • Not efficient if the domain of an attribute has a small cardinality • Not efficient for more than the top-1 result

  38. Experiments Measured parameters • # attributes • size of relation • # views • constraint on max # tuples retrieved

  39. Parameters of Experiments • synthetic datasets • 3 to 5 attributes • 10,000 to 500,000 tuples • random & correlated data • discretization of 0.1 or 0.05

  40. Experiments (cont’d) Dual PII CPU, 512MB RAM, 4 attr, 50,000 tuples, 34 Views

  41. Experiments (cont’d) 4 attr, constraint = 500 tuples, discretization = 0.1

  42. Experiments (cont’d) 500,000 tuples, constraint = 500 tuples, discretization = 0.05...0.1

  43. Experiments (cont’d) 4 attr, discretization = 0.1

  44. Experiments (cont’d) 4 attr, discretization = 0.1

  45. Experiments (cont’d) 50,000 tuples, 3 attr, discretization = 0.05

  46. More Resources www.db.ucsd.edu/PREFER • PREFER demo • PREFER application • Construct materialized views • Issue preference queries Runs on MS Windows, on top of an Oracle DBMS

  47. Conclusions • Methodology to efficiently answer top-K linearly weighted queries • Algorithm that uses a ranked view to answer a preference query • Ranked materialized views were used • Experimental evaluation
