Answering Top-k Queries Using Views



**Answering Top-k Queries Using Views**

By Gautam Das, Dimitrios Gunopulos, Nick Koudas, Dimitris Tsirogiannis

Presented by Raju Buchi, Poornima Ancha

**Agenda**

• Introduction
• Views
• Related Work
• Preliminaries
• Problems Discussed
• Algorithm LPTA
• View Selection Problem
• Experimental Results

**Introduction**

• Answering top-k queries
  • Active research topic
  • Quickly retrieve the k highest-ranking tuples in the presence of monotone ranking functions defined on attributes of the underlying relations
• Algorithms
  • Threshold Algorithm (TA) by Fagin et al.
  • Independently by Guntzer et al.
  • Nepal et al.

**Introduction: Views**

• Materialized views
  • A database table that contains the results of a previously asked query; it is actually constructed and stored.
• Problem discussed
  • Find efficient methods of answering a query using a set of previously defined materialized views over the database.
• Why views?
  • Relevance to a variety of data management problems.
  • Promise of increased performance.
  • Views are materialized (incurring a space overhead) in the hope of gaining performance for some queries.

**Introduction: Views (contd.)**

• Views do not specify any selection conditions on the attributes they aim to rank.
• Example (top-k) over a relation R:
  • View1 (V1): top-5 query with f1 = 2x1 + 5x2
  • View2 (V2): top-3 query with f2 = x2 + 2x3

**Introduction: Views, Example (contd.)**

• Given a top-2 query defined using the function f3 = 3x1 + 10x2 + 5x3, we can apply a standard top-k algorithm (e.g., TA) to the data in R and obtain the answer to the query.
• Using views?
  • Feasibility
  • Guarantee of an answer
  • Speed of using R directly vs. using views

**Related Work**

• Multimedia context: uses ordered lists
• Threshold Algorithm (TA):
  • TA requires the scoring function to be monotonic,
  • i.e., for tuples t and u, if t[i] ≤ u[i] for every attribute i (1 ≤ i ≤ m, where m is the number of attributes), then ScoreQ(t) ≤ ScoreQ(u).
  • TA requires that each attribute has an index mechanism that allows all tids to be accessed in sorted order.
  • A single random access is required to resolve all attributes of a tid.
  • The paper focuses on additive (and therefore monotonic) scoring functions, where ScoreQ(t) = w1·t[1] + w2·t[2] + … + wm·t[m].

**Related Work (contd.)**

• Variants:
  • TA-Sorted: lists are always accessed sequentially and no random accesses are performed.
• PREFER [Hristidis et al.]:
  • Stores multiple copies of R.
  • Answers a new query using only the one copy of the relation that is closest to that query.

**Preliminaries: Ranking Queries**

• Consider a relation R with m numeric attributes (X1, X2, …, Xm).
• Domi = [lbi, ubi] is the domain of the i-th attribute.
• A tuple t is viewed as a numeric vector t = (t[1], t[2], …, t[m]).
• Top-k ranking queries in SQL-like syntax:
  • SELECT TOP[k] FROM R WHERE RangeQ ORDER BY ScoreQ
• Expressed as a triple Q = (ScoreQ, k, RangeQ)
  • ScoreQ: function that assigns a numeric score to any tuple t.
  • RangeQ: Boolean function that defines a selection condition for the tuples of R.
• The semantics require that the system retrieve the k tuples with the top scores that satisfy the selection condition.

**Preliminaries: Ranking Views**

• Materialized ranking view (V'):
  • The materialized result of a previously executed top-k query Q', with its tuples ordered according to the scoring function ScoreQ'.
  • Q' = (ScoreQ', k', RangeQ')
  • The corresponding materialized ranking view V' is a set of k' pairs (tid, ScoreQ'(tid)), ordered by decreasing value of ScoreQ'(tid).

**Problems Discussed**

• Problem 1: Top-k query answering using views
  • Given a set U of views and a query Q, obtain an answer to Q by combining all the information conveyed by the views in U.
  • Solution: an algorithm named LPTA.
• Problem 2: View selection
  • Given a collection of views V = {V1, V2, …, Vr} that includes the base views (thus r ≥ m) and a query Q, determine the most efficient subset U ⊆ V to execute Q on.
  • Such a subset U will be provided as input to LPTA.
  • The selection should identify a set of views that can answer the query and, if possible, answer it faster than running TA on the base views.

**LPTA: Linear Programming Adaptation of the Threshold Algorithm**

• An adaptation of TA, in the sense that it answers top-k queries using multiple ranking views
• Requires the scoring functions of the query and the views to be linear and additive
• Sorted access on pairs (tid, ScoreQ(tid))
• Views and queries are of the form V' = (ScoreV', n, *) and Q = (ScoreQ, k, *), respectively.
• Pseudo code
• Example
• General approach

**LPTA: Pseudo Code**

• Initialize the top-k buffer to empty.
• Retrieve the tids from the views V1 and V2 in lock-step, in order of decreasing score.
• Retrieve the corresponding tuple by random access on R.
• Compute its score according to f3 and update the top-k buffer so that it contains the largest scores.
• Check the stopping condition.
• Once the stopping condition is satisfied, the results are in the top-k buffer.

**LPTA: Stopping Condition**

• After the d-th iteration,
  • let the entries read from V1 and V2 be (tid1d, s1d) and (tid2d, s2d), respectively,
  • and let the minimum score in the top-k buffer be top-kmin.
• At this point the unseen tuples have to satisfy the following inequalities (domain of each attribute of R = [1, 100]):
  • 0 ≤ x1, x2, x3 ≤ 100
  • 2x1 + 5x2 ≤ s1d
  • x2 + 2x3 ≤ s2d
• These inequalities describe a convex region in 3-d space.
• unseenmax is the solution to the linear program that maximizes f3 = 3x1 + 10x2 + 5x3 over this region.

**LPTA: Example (Top-k Query Answering Using Views), First Iteration**

• Query = (f3, k, *) with f3 = 3x1 + 10x2 + 5x3
• View1 (V1): top-5 query with f1 = 2x1 + 5x2; View2 (V2): top-3 query with f2 = x2 + 2x3
• Tuples of R shown (tid: x1, x2, x3): 6: (12, 55, 82); 7: (16, 99, 42)
• Entries read: (7, 527) from V1 and (6, 219) from V2
• Scores under f3, {(tid, score)} = {(7, 1248), (6, 996)}
• The linear program with s1d = 527 and s2d = 219 gives unseenmax = 1338. Because f3 = 1.5·f1 + 2.5·f2, the bound is 1.5·527 + 2.5·219, which is still above top-kmin = 996.

**LPTA: Example (Top-k Query Answering Using Views), Second Iteration**

• Tuples of R shown (tid: x1, x2, x3): 4: (80, 22, 90); 6: (12, 55, 82)
• Entries read: (6, 299) from V1 and (4, 202) from V2
• Scores under f3, {(tid, score)} = {(6, 996), (4, 910)}
• The linear program with s1d = 299 and s2d = 202 gives unseenmax = 953.5 ≤ top-kmin, so the stopping condition holds.

**LPTA: Stopping Condition (Figure)**

[Figure: top-1 query over a two-dimensional relation R(X1, X2) in the unit square O = (0,0), P = (1,0), R = (1,1), T = (0,1), showing the query vector Q, the view vectors V1 and V2, and the first entries tid11 and tid21 read from the views.]

**LPTA: d-th Iteration**

• Query Q: fQ = 3x1 + 10x2 + 5x3
• View1 (V1): fV1 = 2x1 + 5x2; View2 (V2): fV2 = x2 + 2x3
• Constraints on unseen tuples:
  • 0 ≤ x1, x2, x3 ≤ 100
  • 2x1 + 5x2 ≤ s1d
  • x2 + 2x3 ≤ s2d
• Stop when unseenmax ≤ top-kmin.

**LPTA: Stopping Condition (Figure, contd.)**

[Figure: the same top-1 setting over R(X1, X2) after a second read from each view, with entries tid11, tid12 on V1 and tid21, tid22 on V2.]

**TA vs. LPTA**

• LPTA essentially becomes TA when the set of views U is equal to the set of base views.
• In terms of execution cost, both perform sequential as well as random accesses.
• Execution efficiency: I/O operations play a significant role; they overshadow the costs of CPU operations such as updating the top-k buffer, testing the stopping condition, and so on.
• Highly correlated: every sequential access incurs a random access.
• Determining factor:
  • if d is the number of lock-step iterations and
  • r is the number of views,
  • then the running cost is O(dr).

**View Selection: Conceptual Discussion**

• Given a collection of views Ѵ = {V1, V2, …, Vr} that includes the base views, determine the most efficient subset U ⊆ Ѵ to execute the query Q on.
• Conceptual discussion
  • View selection in two dimensions
  • View selection in higher dimensions

**View Selection: Conceptual Discussion (2D)**

[Figure: unit square with corners O, P, R, T and axes X and Y, showing the query vector Q, the view vectors V1 and V2, the minimum top-k tuple M, and the points A, A1, A'1, A'2 and B, B2, B'1, B'2.]

**View Selection: Conceptual Discussion (Higher Dimensions)**

• Let Ѵ = {V1, V2, …, Vr} be a set of views for an m-dimensional dataset and let Q be a query. The optimal execution of LPTA requires the use of a subset of the views U ⊆ Ѵ such that |U| ≤ m.

**View Selection Problem: Cost Estimation**

• Compute histograms representing the distribution of scores along each view in U.
• Estimate top-kmin from HQ by determining the bucket that contains the k-th highest tuple.
• "Walk down" these histograms until the stopping condition is reached.
• Check the stopping condition by linear programming.
• When unseenmax < top-kmin, perform a logarithmic search within the last bucket.
• Number of sorted accesses: ((d-1)n/b + n')r'.
• Running time of the algorithm: O((d-1) + log n').

**Select Views(Q, V)**

• Initialize MinCost = ∞, MinCurCost = ∞, and U = { }; consider each view V ∈ Ѵ − U.
• Compare the cost estimate for V with MinCurCost:
  • if EstimateCost < MinCurCost, record V as MinV
  • and set MinCurCost to the EstimateCost of V.
• Repeat the steps above for every V.
• When MinCurCost < MinCost, add the recorded view MinV to U.
• Repeat this for each of the m attributes considered.

**View Selection Algorithms**

• Select Views(Q, V) / Exhaustive: estimates the cost of all possible (r choose p) subsets of Ѵ and selects the one with minimum cost.
• Simple greedy heuristic: iterates over the set of views and selects the one that reduces the total cost by the greatest amount.

**View Selection Algorithms (contd.)**

• Select Views Spherical(Q, V): has to solve the linear program just once and is very effective for highly restrictive data sets.
• Select Views By Angle: sorts the view vectors by increasing angle with the query vector and returns the top m views.

**More General Queries & Views**

• Views that only materialize their top-k tuples
  • Truncate the histograms.
• Accommodating range conditions
  • Select the views that cover the range conditions.
  • Truncate each attribute's histogram.

**Experimental Results: Performance Evaluation**

[Figure: performance comparison of PREFER, LPTA, and TA on real data; panels (2d) and (3d).]

**References**

• Answering Top-k Queries Using Views: Gautam Das, Dimitrios Gunopulos, Nick Koudas
• aitrc.kaist.ac.kr/~vldb06/slides/R13-1.ppt

**Thank You**

Questions?
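
**Supplement: LPTA Sketch**

The LPTA pseudo code and stopping condition from the example slides can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes R contains only the three tuples legible on the slides (tids 4, 6, 7) and builds both views from them. Instead of a real LP solver it uses brute-force vertex enumeration, which is exact for this bounded problem but only handles three attributes. The names `det3`, `unseen_max`, `make_view`, and `lpta` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def unseen_max(query_w, view_ws, last_scores, lo=0, hi=100):
    """Maximize query_w . x subject to lo <= x_j <= hi and view_w . x <= last score.

    Stand-in for an LP solver (assumption): enumerate every point where three
    constraints meet and keep the best feasible one. Three attributes only.
    """
    cons = []  # each entry (a, b) encodes the constraint a . x <= b
    for j in range(3):
        unit = [Fraction(int(i == j)) for i in range(3)]
        cons.append((unit, Fraction(hi)))
        cons.append(([-u for u in unit], Fraction(-lo)))
    for w, s in zip(view_ws, last_scores):
        cons.append(([Fraction(v) for v in w], Fraction(s)))
    best = None
    for trio in combinations(cons, 3):
        rows = [a for a, _ in trio]
        rhs = [b for _, b in trio]
        d = det3(rows)
        if d == 0:
            continue  # the three planes do not meet in a single point
        # Cramer's rule: replace column j by the right-hand side.
        x = [det3([row[:j] + [rhs[r]] + row[j + 1:] for r, row in enumerate(rows)]) / d
             for j in range(3)]
        if all(sum(ai * xi for ai, xi in zip(a, x)) <= b for a, b in cons):
            value = sum(w * xi for w, xi in zip(query_w, x))
            if best is None or value > best:
                best = value
    return best

def make_view(relation, weights):
    """Materialize a ranking view: (tid, score) pairs by decreasing score."""
    scored = [(tid, sum(w * v for w, v in zip(weights, t))) for tid, t in relation.items()]
    return sorted(scored, key=lambda p: -p[1])

def lpta(relation, views, view_ws, query_w, k):
    """Read the views in lock-step until the LP bound drops to top-k-min."""
    seen, bounds, top = {}, [], []
    for d in range(min(len(v) for v in views)):
        last = []
        for view in views:
            tid, s = view[d]  # sorted access on the view
            last.append(s)
            # random access on R, then score under the query function
            seen[tid] = sum(w * v for w, v in zip(query_w, relation[tid]))
        top = sorted(seen.items(), key=lambda p: -p[1])[:k]
        bounds.append(unseen_max(query_w, view_ws, last))
        if len(top) == k and bounds[-1] <= top[-1][1]:
            break  # no unseen tuple can enter the top-k
    return top, bounds

R = {4: (80, 22, 90), 6: (12, 55, 82), 7: (16, 99, 42)}  # only the tuples legible on the slides
f1, f2, f3 = (2, 5, 0), (0, 1, 2), (3, 10, 5)
V1, V2 = make_view(R, f1), make_view(R, f2)
top, bounds = lpta(R, [V1, V2], [f1, f2], f3, k=2)
print(top, [float(b) for b in bounds])
```

On this toy data the loop stops after two lock-step iterations with tids 7 and 6 in the buffer. Exact fractions are used so that the comparison against top-kmin is not affected by floating-point rounding.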
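
**Supplement: Selecting Views by Angle**

The angle-based selection described on the View Selection Algorithms slide (sort the view vectors by increasing angle with the query vector and return the top m) might look like the sketch below. It covers only that sorting step and does no cost estimation. Two assumptions are made for illustration: each view is represented by the weight vector of its scoring function, and the base views X1, X2, X3 are modelled as unit vectors. The names `angle` and `select_views_by_angle` are hypothetical.

```python
import math

def angle(u, v):
    """Angle in radians between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def select_views_by_angle(query_w, view_ws, m):
    """Return the names of the m views whose vectors are closest in angle to the query."""
    return sorted(view_ws, key=lambda name: angle(query_w, view_ws[name]))[:m]

views = {
    "V1": (2, 5, 0),  # f1 = 2x1 + 5x2
    "V2": (0, 1, 2),  # f2 = x2 + 2x3
    "X1": (1, 0, 0),  # base views, assumed to be unit vectors
    "X2": (0, 1, 0),
    "X3": (0, 0, 1),
}
chosen = select_views_by_angle((3, 10, 5), views, m=3)  # query f3 = 3x1 + 10x2 + 5x3
print(chosen)
```

For the query f3 this picks V1, the base view on x2, and V2, in that order, since their vectors make the smallest angles with (3, 10, 5).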