1 / 30

Boolean + Ranking: Querying a Database by K-Constrained Optimization

Boolean + Ranking: Querying a Database by K-Constrained Optimization. Zhen Zhang Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang. Information retrieval. Traditional databases. Ranking query: Top 5 ranked by gpa. Boolean query:

lev-levine
Download Presentation

Boolean + Ranking: Querying a Database by K-Constrained Optimization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Boolean + Ranking: Querying a Database by K-Constrained Optimization Zhen Zhang Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang

  2. Information retrieval Traditional databases Ranking query: Top 5 ranked by gpa Boolean query: dept = CS and year = 2 Many queries naturally combine Boolean and ranking Find top answers + B: dept = CS and year = 2 Qualifying constraint R: gpa Quantifying function Database applications on Web

  3. Motivating scenarios • Data retrieval: • Find houses in certain price range with good price/sqrft ratio • Data analysis: • Find products with highest sale increase in consecutive years Selecth.addressfrom House h Whereh.price ≤ 200k ν h.price ≥ 400k Order byh.size/|h.price-300k|Limit1 Selecth.addressfrom House h, CrimeRate c Whereh.price ≤ 200k ν h.price ≥ 400k andh.zipcode = c.zipcode Order byh.size/|h.price-300k| *c.crimerate-1Limit10 Selectitemidfrom Sales s1, Sales s2 Wheres1.itemid = s2.itemid and s2.year – s1.year = 1 Order by s2.sale – s1.sale Limit10

  4. Boolean + Ranking form a coherent goal function • Boolean B + Ranking R = Goal function G For a tuple t R(t) if B(t) is true 0if B(t) isfalse G(t) = B(t)*R(t) = (ie, lowest score)

  5. G D The nature of Boolean + Ranking is K-constrained optimization query • Optimize goal function G over database D h.size/|h.price-300k| [h.price ≤ 200k ν h.price ≥ 400k ] Goal function G Database D

  6. What is the query evaluation mechanism? Boolean query Ranking query + How to answer?

  7. … … D D B R Goal function G Current techniques lack of global search mechanism • If evaluated as separate operators • If search by an overall goal function G as a ranking function Boolean query B Ranking query R Boolean query B Ranking query R • Current techniques optimize only condition-by-condition • Current techniques restrict G to be monotonic

  8. G OPT* D D Our thesis:Evaluate query as its nature suggests! Function optimization of G Optimize G over D Discrete state search over D

  9. We view compound index as discrete space

  10. b1 0-250 250-600 b2 b3 0-100 100-250 250-350 350-600 b6 b7 ……… 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1 We view compound index as discrete space Price (k) 600 1 350 2 5 250 4 3 100 6 size 1500 3000 4000 4500

  11. We view compound index as discrete space Price (k) Mij =(ai, bj) b1 0-250 250-600 600 b2 b3 M11 1 350 0-100 100-250 250-350 350-600 b6 2 b7 5 M22 M32 M23 M33 ……… 250 2 5 1 4 3 … 100 … M55 M75 M56 M66 M77 M67 M76 6 size 1500 3000 4000 4500 4 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1

  12. M22 M32 M23 M33 We view compound index as discrete space conceptually, combined space Price (k) Mij =(ai, bj) b1 0-250 250-600 600 b2 b3 M11 1 350 0-100 100-250 250-350 350-600 b6 2 b7 5 ……… 250 2 5 1 4 3 100 … M55 M75 M56 M66 M77 M67 M76 6 size 1500 3000 4000 4500 4 2 5 1 a1 0-3000 3000-4500 a2 a3 0-1500 1500-3000 3000-4000 4000-6000 a6 a7 ……… 5 1

  13. How to perform the search in the space? • What is the search mechanism? • How to conceptually view the index space of D for search • How to guide the search? • How to use function G to focus the search

  14. Challenge 1: What is the search mechanism?

  15. We encode as A* because it’s optimal • What A* is: Finding the shortest path • Why we choose: Completeness and optimality with proper heuristics • Complete: guarantee to find shortest path • Optimal: visit least number of nodes origin 3 5 1 5 2 1 7 9 6 destination

  16. Encoding our problem into shortest path is challenging • How to encode: • a tuple  a path? • score of tuple distance of path?

  17. Therefore, we encode K-constrained opt. as: • How to encode a tuple to a path? • Adding a virtual target t* only reachable through tuples • How to encode maximal tuple with minimal path? • Quality of path depends solely on the tuple it passes by • For tuple state t D(t, t*) = - G(t) • For two states r, u D(r, u) = 0 M11 0 0 M22 M32 M23 M33 0 0 … M66 M67 M76 M75 M56 M77 M55 0 0 4 2 5 1 - G(1) - G(4) t*

  18. Challenge 2: How to guide the search?

  19. We use function opt. to sketch the landscape of G • Function optimization measures quality of states • Function optimization enables: • 1. How to define heuristics? • 2. How to configure space? • 3. Where to start the search?

  20. 1. Define admissible heuristics: Measure tightest upper bound • To guarantee completeness • A* requires admissible heuristics, ie, estimate optimistically • To ensure admissible heuristics • Function optimization gives tightest upper bound • Analytical approaches • Numeric analysis package H(region) = OPTMAX(G, region) ie, maximal value of G in the region

  21. M22 M32 M23 M33 2. Configure descending space: disconnect uphills • To guarantee optimality • A* requires descending heuristics • To ensure descending heuristics • Remove uphill links M11 … M55 M75 M56 M66 M77 M67 M76 4 2 5 1

  22. M22 M32 M23 M33 Find right start point: Start from local optima • To guarantee correctness • Every tuple state must be reachable from start states • Taking only downhills requires start with high points • To ensure reachability • Initial states should contain all local optima M11 … M55 M75 M56 M66 M77 M67 M76 4 2 5 1

  23. Putting together: Executing A* on the configured space top-down M11 M22 M32 M23 M33 … M67 M76 M57 M55 M75 M56 M66 M77 4 2 5 1 • Search is implemented as priority queue driven traversal

  24. Putting together: Executing A* on the configured space top-down M11 • Bottom-up approach is always better than top-down M22 M32 M23 M33 … M57 M67 M76 M55 M75 M56 M66 M77 4 2 5 1 bottom-up M11 M22 M32 M23 M33 … M57 M67 M76 M55 M75 M56 M66 M77 2 5 1 4

  25. Experiments • Comparison vs. • Boolean then ranking • Ranking then boolean • Metrics: node accessed = Nl + Nt • Settings: • Benchmark queries over real dataset • Controlled queries over synthetic dataset

  26. BR_clustered BR_unclustered OPT* Benchmark queries • Datasets: • 19,706 real estate listing crawled online • Queries • Q1: size * bedrms/| price-450k| : [40k<=price<=50k] • Q2: size * ebedrms / |price-350k| : [price<400k^size>4000] • Q3: size/price : [bedrms=3 ν bedrms=4] Q1 Q2 Q3

  27. Controlled queries • Datasets • Three randomly generated datasets of 100k points • Uniform, gaussian, logvariatenormal • Queries • Linear average queries: (eg, 0.4*a + 0.6*b) • Nearest neighbor queries: (eg, (x-3)^2 + (y-4)^2) • Join queries: (0.4*R.a + 0.6*S.b: R.c=R.d)

  28. Conclusion • Problem • Study K-constrained optimization queries as boolean + ranking • Abstraction • Encode K-constrained optimization into shortest path problem • Framework • Develop OPT* to process K-constrained optimization

  29. Thank you! Questions?

  30. How to implement function optimization? • How do we compare with RankSQL? • If bottom-up is always better, why consider top-down • Computing upper bound for each region is costly • Random vs. sequential I/O • Assuming indices on every attribute? • Materialize state space for every query? • Exponential number of states when attribute grows • Not every attribute has index on it • Selective choose the right index (attribute) to use • We do perform experiment to study how the system scale with #attr • Your algorithm is not optimal because you change the space

More Related