1 / 31

Query Optimization for Semistructured Data

Query Optimization for Semistructured Data. Jason McHug, Jennifer Widom Stanford University. - Rajendra S. Thapa. ………..Road Map. Lore System Query Execution Engine Statistic and cost model Performance Results. Lore Data Model - OEM. Data Guide. Path Expression. Simple Path Expression

Download Presentation

Query Optimization for Semistructured Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Query Optimization for Semistructured Data Jason McHug, Jennifer Widom Stanford University - Rajendra S. Thapa

  2. ………..Road Map Lore System Query Execution Engine Statistic and cost model Performance Results

  3. Lore Data Model - OEM

  4. Data Guide

  5. Path Expression Simple Path Expression • specifies a single-step navigating in the database DBGroup.member y • denotes variable y ranges all member-labeled sub-objects of the object assigned to x Path Expression • ordered list of simple path expressions DBGroup.Member x, x.Age y -variable y ranges over all objects that can be reached by starting with the DBGroup object, following an edge labeled Member, then following an edge labeled Age.

  6. Query: SELECT x FROM DBGroup.Member x WHERE exists y in x.Age: y<30 <Member> <Name>Smith</Name> <Age>28</Age> <Office>Gates 252 </Office> <Office> <Building> CIS </Building> <Room>411 </Room> </Office> </Member> Query language Result:

  7. Lore architecture

  8. Lore architecture Textual Interface Query Processing Parsing Preprocessor Logical Query Plan Generation Query Optimization Physical Query Plan Generation Execution of Physical Query Plan Data Engine

  9. SELECT x FROM DBGroup.Member x WHERE exists y in x.Age: y<30 Queries can be executed in many ways Top down Hybrid Bottom Up

  10. Query Select x from A.B x where exists y in x.C: y = 5 A • top down would explore only this path • - only one path A.B.C • bottom-up would visit all leaf objects • with value 5 and their parents D D B C C C 5 5 5 Top-down preferred

  11. Query Select x from A.B x where exists y in x.C: y = 5 A • Many A.B.C paths • But only a leaf satisfying the predicate • bottom-up is a good candidate B B B C C C 4 4 5 Bottom-up preferred

  12. Query Select x from A.B x where exists y in x.C: y = 5 A D B B B B D B C C C 4 4 5 Hybrid preferred

  13. Query Execution Engine • Logical Query Plans -logical query plan operators - structure of the plan • Physical Query Plans -operators - some physical plans • Statistics and Cost Model • Plan Enumeration

  14. Query Execution Engine Logical Query plans Logical operators Discover Chain Glue Create Temp Project --- --- --- • Variable binding • a variable x in the query is said to be bound if object o has been assigned to x • Evaluation • an evaluation of a query plan (or sub-plan) is a list of all variables appearing in the plan along with the object(if any) bound to each variable. • Rotation

  15. Representation of a Path expression in the logical query plan x.B y, y.C z, z.D v Chain Discover(z,”D”,v) Chain Discover(x,”B”,y) Discover(y,”C”,z)

  16. Complete logical query plan SELECT x FROM DBGroup.Member x WHERE exists y in x.Age: y<30 Project(t2) CreatTemp(x,t2) Glue Chain Glue Name(“DBGroup”,t1) Discover(t1,”Member”,x) Exists(y) Select(y,<30) Discover(x,”Age”,y)

  17. Query Execution Engine Physical Query plans x Operators Scan(x, l, y) Lindex(x, l, y) Pindex(Path Expression, x) Bindex(l, x, y) Name(x, n) Vindex(Op, Value, l, x) --- --- --- l l l a c b y = {a, b, c}

  18. Some physical plans for a simple logical Query Plan A.B x, x.C y Chain Discover (A,”B”,x) Discover (x,”C”,y) Logical Query Plan

  19. physical plans A.B x, x.C y NLJ NLJ Lindex (x,”C”,y) Name (t, A) Scan (A,”B”,x) Scan (x,”C”,y) Scan Plan Lindex (t,”B”,x) Lindex Plan

  20. more physical plans... A.B x, x.C y NLJ Name (t, A) Scan (x,”C”,y) Pindex(“A.B x, x.C y”, y) Bindex (t,”B”,x) Pindex Plan Bindex Plan

  21. how physical plans are produced. • Each logical plan node creates an optimal physical plan given a set of bound variable. • During plan enumeration we track 1. Whether the variable is bound or not 2. Which plan operator has bound the variable 3. All other plan operators that use the variable 4. Whether the variable is stored within a temporary result.

  22. how physical plans are produced. SELECT x FROM DBGroup.Member x WHERE exists y in x.Age: y<30 Logical plan

  23. possible physical plans Fig. (a) Logical plan

  24. possible physical plans Physical plans fig. (c) Logical plan

  25. more physical plan…. Fig. (d) Logical plan

  26. Statistic and Cost Model • Each physical plan is assigned a cost based on the estimated I/O and CPU time required to execute a plan. • The costing procedure is recursive. • I/O first then CPU time to decide the cheaper plan.

  27. Performance Result Experiment 1 A simple query SELECT DBGroup.Movie.Title -11 different query plans - * the best plan uses Lore’s path index to quickly locate all the movie titles - second plan is top-down strategy - the worst plan uses Bindex operators and hash joins

  28. Performance Result Experiment 2 Same query with a Genere subobject having value ‘Comedy’ - point query

  29. Performance Result Experiment 3 - Same point query - all possible plans are not executed - different plans were generated or disallowing the use of particular operator or indexes.

  30. Performance Result Experiment 4 Query selects movies with certain quality rating.

  31. …….future Work • Optimization techniques for branching path expression • a query rewrite that moves Where clause predicates into the From clause and a transformation that introduces a Group-by clause when a large number of paths pass through a small number of objects. • Partially correlated sub-plans • similar to correlated subqueries but rely on the bindings passed between portions of the physical query plan rather than on the query itself. • In the area of statistic • efficient statistics-gathering algorithms • statistic about the location of objects on disk • modification to the cost formulas to generate more accurate cost estimates

More Related