Lore a database management system for semistructured data
Download
1 / 46

Lore: A Database Management System for Semistructured Data - PowerPoint PPT Presentation


  • 95 Views
  • Uploaded on

Lore: A Database Management System for Semistructured Data. Why?. Although data may exhibit some structure it may be too varied or irregular to map to a fixed schema. Relational DBMS might use null values in this case. May be difficult to decide in advance on a specific schema.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Lore: A Database Management System for Semistructured Data' - tavia


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Why?

  • Although data may exhibit some structure it may be too varied or irregular to map to a fixed schema.

    • Relational DBMS might use null values in this case.

  • May be difficult to decide in advance on a specific schema.

    • Data elements may change types.

    • Structure changes a lot (lots of schema modifications).


Semistructured data
Semistructured Data

  • Examples:

    • Data from the web

      • Overall site structure may change often.

      • It would be nice to be able to query a web site.

    • Data integrated from multiple, heterogeneous data sources.

      • Information sources change, or new sources added.


Object exchange model oem
Object Exchange Model (OEM)

  • Data in this model can be thought of as a labeled directed graph.

    • Schema-less and self-describing.

  • Vertices in graph are objects.

    • Each object has a unique object identifier (oid), such as &5.

    • Atomic objects have no outgoing edges and are types such as int, real, string, gif, java, etc.

    • All other objects that have outgoing edges are called complex objects.


Oem cont
OEM (Cont.)

  • Examples:

    • Object &3 is complex, and its subobjects are &8, &9, &10, and &11.

    • Object &7 is atomic and has value “Clark”.

  • DBGroup is a name that denotes object &1.(Names are entry points into the database).


Oem to xml
OEM to XML

  • Example:

    • <Member project=“&5 &6”> <name>Jones</name> <age>46</age> <office> <building>gates</building> <room>252</room> </office></member>

  • This corresponds to rightmost member in the example OEM, where project is an attribute.


Lorel query language
Lorel Query Language

  • Need query language that supports path expressions for traversing graph data and handling of ‘typeless’ data.

  • A simple path expression is a name followed by a sequence of labels.

    • DBGroup.Member.Office.

    • Set of objects that can be reached starting with the DBGroup object, following edges labels member and then office.


Lorel cont
Lorel (cont.)

  • Example:

    • select DBGroup.Member.Officewhere DBGroup.Member.Age < 30

  • Result:

    • Office “Gates 252”

    • Office

      Building “CIS”

      Room “411”


Lorel query rewrite
Lorel Query Rewrite

  • Previous query rewritten to:

    • select Ofrom DBGroup.Member M, M.Office Owhere exists y in M.Age : y < 30

  • Comparison on age transformed to existential condition.

    • Since all properties are set-valued in OEM.

    • A user can ask DBGroup.Member.Age < 30 regardless of whether Age is single valued, set valued, or unknown.


Lorel query rewrite1
Lorel Query Rewrite

  • Why?

    • Breaking query into simple path expressions necessary for query optimization.

    • Need to explicitly handle coercion.

      • Atomic objects and values. 0.5 < “0.9” should return true

      • Comparing objects and sets of objects. DBGroup.Member.Age is a set of objects.


Lorel cont1
Lorel (cont.)

  • General path expressions are loosely specified patterns for labels in the database.(‘|’ disjunction, ‘?’ label pattern optional)

  • Example:

    • select DBGroup.Member.Namewhere DBGroup.Member.Office(.Room%|.Cubicle)? like “%252”

  • Result:

    • Name “Jones”Name “Smith”


Query and update processing
Query and Update Processing

  • Query is parsed

  • Parse tree is preprocessed and translated to new OQL-like query.

  • Query plan constructed.

  • Query optimization.

  • Opt. query plan executed.



Iterators and object assignments
Iterators and Object Assignments

  • Use recursive iterator approach:

    • execution begins at top of query plan

    • each node in the plan requests a tuple at a time from its children and performs some operation on the tuple(s).

    • pass result tuples up to parent.


Object assignments oas
Object Assignments (OAs)

  • OA is a data structure containing slots for range variables with additional slots depending on the query.

  • Each slot within an OA will holds the oid of a vertex on a path being considered by the query engine.

  • Example: if OA1 holds oid for “Smith” then OA2 and OA3 can hold the oids for one of Smiths Office objects and Age objects.


Query operators
Query Operators

  • For example, the Scan operator returns all oids that are subobjects of a given object following a specified path expression.

    • Scan (StartingOASlot, Path_expression, TargetOASlot)

  • For each oid in StartingOASlot, check to see if object satisfies path_expression and place oid into TargetOASlot.

  • Other operators include Join, Project, Select, Aggregation, etc.

  • Join node like nested-loop join in relational DBMS.


Query optimization
Query Optimization

  • Does only a few optimizations:

    • Push selection ops down query tree.

    • Eliminate/combine redundant query operators.

  • Explores query plans that use indexes where possible.

    • Two kinds of indexes:

    • Lindex (link index) provide parent pointers impl. as hashing.

    • Vindex (value index) impl. as B+-trees


Indexes
Indexes

  • Because of non-strict typing system, have String Vindex, Real Vindex, and String-coerced-to-real Vindex.

  • Separate B-Trees for each type are constructed.

  • Using Vindex for comparison (e.g. Age < 30) consider the following:

    • If type is string, do lookup in String Vindex

    • If can convert to real the do lookup in String-coerced-to-real Vindex.

    • If type is real?


Other issues
Other issues

  • Update query operator example:

    • Update(Create_Edge, OA1, OA5, “Member”)

    • Create edge from results in OA1 to OA5 labeled “Member”.

  • Lore arranges objects in physical disk pages, each page with a number of slots with a single object in each slot.

    • Objects placed according to first-fit algorithm.

    • Supports large objects spanning multiple pages.

    • Objects clustered in depth-first manner (since Scan traverses depth-first).

    • Garbage collector removes unreachable objects.


External data manager
External Data Manager

  • Enables retrieval of information from other data sources, transparent to the user.

  • An external object in Lore is a “placeholder” for the external data and specifies how lore interacts with an external data source.

  • The spec for an external object includes:

    • Location of a wrapper program to fetch and convert data to OEM, time interval until fetched information becomes stale, and a set of arguments used to limit info fetched from external source.


Data guides
Data Guides

  • A DataGuide is a concise and accurate summary of the structure of an OEM database (stored as OEM database itself, kind of like the system catalog).

  • Why?

    • No explicit schema, how do we formulate meaningful queries?

    • Large databases (can’t just view graph structure).

    • What if a path expression doesn’t exist (waste).

  • Each possible path expression is encoded once.



Dataguides as histograms
DataGuides As Histograms

  • Each object in the dataguide can have a link to its corresponding target set.

    • A target set is a set of oids reachable by that path.

      • TS of DBGroup.Member.Age is {9, 13}.

    • This is a path index. Can find set of objects reachable by a particular path.

    • Can store statistics in DataGuide (more in next paper).

      • For example, the # of atomic objects of each type reachable by p.


Conclusions
Conclusions

  • Takes advantage of the structure where it exists.

  • Handles lack of structure well (data type coercion, general path expressions).

  • Query language allows users to get and update data from semistructured sources.

    • DataGuide allows users to determine what paths exist, and gives useful statistical information



Oem vs xml
OEM vs. XML

  • OEM’s objects correspond to elements in XML

  • Sub-elements in XML are inherently ordered.

  • XML elements may optionally include a list of attribute value pairs.

  • Graph structure for multiple incoming edges specified in XML with references (ID, IDREF attributes). i.e. the Project attribute.


Indexes1
Indexes

  • Vindex(op, value, l, x) places into x all atomic objects that satisfy the “op value” condition with an incoming edge labeled l.

    • Vindex(“Age”, <, 30,y) places into y objects with age < 30.

  • Lindex(x, l, y) places into x all objects that are parents of y via edge labeled l.

    • Lindex(x, “Age”, y) places into x all parents of y via label “Age”.


Indexes cont
Indexes (cont.)

  • Bindex(l, x, y) finds all parent-child object pairs connected by a label l.

    • Bindex(“Age”, x, y) locates all parent-child pairs with label Age.

  • Pindex(PathExpression, x) placed into x all objects reachable via the path expression.

    • Pindex(“A.B x, x.C y”, y) places into y all objects reachable by going from A to B to C.

    • Uses DataGuide.


Simple query
Simple Query

  • select Ofrom DBGroup.Member M, M.Office Owhere exists y in M.Age : y < 30

  • Possible plans:

    • Top-down (similar to pointer-chasing, nested-loops join)

    • Use Vindex to check y < 30, traverse backwards from child to parent using Lindex(bottom-up).

    • Hybrid, both top down and bottom up. Meet in middle.


Select xFrom A.B xWhere exists y in x.C: y = 5


Query plan generation overview
Query Plan Generation (Overview)

  • Logical query plan generator creates high-level execution strategy.

  • Physical query plan enumerator uses statistics and a cost model to transform logical query plan into an estimated best physical plan that lies within their search space.


Logical query plans cont
Logical Query Plans (cont.)

  • Glue node represents a ‘rotation point’ that has as its children two independent subplans.

    • Rotating the order between independent components yields different plans.

    • Marks place where execution order is not fixed.

  • Discover node chooses best way to bind variables x and y.

  • Chain node chooses best evaluation of a path expression.


Logical query plan for:Select xFrom DBGroup.Member xWhere exists y in x.Age: y<30

from clause

where clause



Physical query plans cont
Physical Query Plans (cont.)

  • Scan(x, l, y) places into y all objects that are subojects of x via edge labeled l.

    • Top-down (pointer chasing).

  • Lindex plan is bottom-up approach.

  • Bindex: Locate edges whose label appears infrequently in database.

  • NLJ: left subplan passes variables to right subplan.


Statistics
Statistics

  • I/O metric uses estimated # of objects fetched.

  • For every label subpath p of length <= k:

    • # Of atomic objects of each type reachable by p

    • Min, and max values of all atomic objects of each type reachable by p

    • # Of instances of path p, denoted |p|

    • # Of distinct objects reachable by p, denoted |p|d

    • # Of l-labeled subobjects of all objects reachable by p

    • # Of incoming l-labeled edges to any instance of p, denoted |pl|


Plan enumeration
Plan Enumeration

  • Doesn’t consider joining two simple path expressions together unless they share a common variable.

  • Pindex is used only when path expression begins with a name and no variable except the last is used in the query.

  • Select clause always executes last.

  • Doesn’t try to reorder multiple independent path expressions.


Results
Results

  • Used XML database about movies. Database graph contained 62,256 nodes and 130,402 edges.

  • Experiment 1: Select DB.Movie.Title

    • Best plan is Pindex, followed by top-down

    • Worst plan is Bindex, with hash joins.


Results cont
Results (cont.)

  • Experiment 2: All Movies with a Genre of “Comedy”

    • Where clause is very selective, bottom-up does a Vindex for “Comedy” with incoming edge Genre


Results cont1
Results (cont.)

  • Experiment 3: Query with two existentially quantified variables in the where clause.

  • Errors due to bad estimates of atomic value distributions and set operation costs.


Results cont2
Results (cont.)

  • Experiment 4: Select movies with certain quality rating.

  • Quality ratings uncommon in database so optimizer chooses to find all ratings via Bindex, and then work bottom-up.


Conclusions1
Conclusions

  • Cost estimates are accurate and select the best plan most of the time

  • Execution times of best and worst plans for a given query can differ by many orders of magnitude.

  • Best strategy is highly dependent upon the query and database (Query optimization is good for XML data).


ad