lore a database management system for semistructured data
Download
Skip this Video
Download Presentation
Lore: A Database Management System for Semistructured Data

Loading in 2 Seconds...

play fullscreen
1 / 46

Lore: A Database Management System for Semistructured Data - PowerPoint PPT Presentation


  • 96 Views
  • Uploaded on

Lore: A Database Management System for Semistructured Data. Why?. Although data may exhibit some structure it may be too varied or irregular to map to a fixed schema. Relational DBMS might use null values in this case. May be difficult to decide in advance on a specific schema.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Lore: A Database Management System for Semistructured Data' - tavia


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide2
Why?
  • Although data may exhibit some structure it may be too varied or irregular to map to a fixed schema.
    • Relational DBMS might use null values in this case.
  • May be difficult to decide in advance on a specific schema.
    • Data elements may change types.
    • Structure changes a lot (lots of schema modifications).
semistructured data
Semistructured Data
  • Examples:
    • Data from the web
      • Overall site structure may change often.
      • It would be nice to be able to query a web site.
    • Data integrated from multiple, heterogeneous data sources.
      • Information sources change, or new sources added.
object exchange model oem
Object Exchange Model (OEM)
  • Data in this model can be thought of as a labeled directed graph.
    • Schema-less and self-describing.
  • Vertices in graph are objects.
    • Each object has a unique object identifier (oid), such as &5.
    • Atomic objects have no outgoing edges and are types such as int, real, string, gif, java, etc.
    • All other objects that have outgoing edges are called complex objects.
oem cont
OEM (Cont.)
  • Examples:
    • Object &3 is complex, and its subobjects are &8, &9, &10, and &11.
    • Object &7 is atomic and has value “Clark”.
  • DBGroup is a name that denotes object &1.(Names are entry points into the database).
oem to xml
OEM to XML
  • Example:
    • Jones 46 gates 252
  • This corresponds to rightmost member in the example OEM, where project is an attribute.
lorel query language
Lorel Query Language
  • Need query language that supports path expressions for traversing graph data and handling of ‘typeless’ data.
  • A simple path expression is a name followed by a sequence of labels.
    • DBGroup.Member.Office.
    • Set of objects that can be reached starting with the DBGroup object, following edges labels member and then office.
lorel cont
Lorel (cont.)
  • Example:
    • select DBGroup.Member.Officewhere DBGroup.Member.Age < 30
  • Result:
    • Office “Gates 252”
    • Office

Building “CIS”

Room “411”

lorel query rewrite
Lorel Query Rewrite
  • Previous query rewritten to:
    • select Ofrom DBGroup.Member M, M.Office Owhere exists y in M.Age : y < 30
  • Comparison on age transformed to existential condition.
    • Since all properties are set-valued in OEM.
    • A user can ask DBGroup.Member.Age < 30 regardless of whether Age is single valued, set valued, or unknown.
lorel query rewrite1
Lorel Query Rewrite
  • Why?
    • Breaking query into simple path expressions necessary for query optimization.
    • Need to explicitly handle coercion.
      • Atomic objects and values. 0.5 < “0.9” should return true
      • Comparing objects and sets of objects. DBGroup.Member.Age is a set of objects.
lorel cont1
Lorel (cont.)
  • General path expressions are loosely specified patterns for labels in the database.(‘|’ disjunction, ‘?’ label pattern optional)
  • Example:
    • select DBGroup.Member.Namewhere DBGroup.Member.Office(.Room%|.Cubicle)? like “%252”
  • Result:
    • Name “Jones”Name “Smith”
query and update processing
Query and Update Processing
  • Query is parsed
  • Parse tree is preprocessed and translated to new OQL-like query.
  • Query plan constructed.
  • Query optimization.
  • Opt. query plan executed.
iterators and object assignments
Iterators and Object Assignments
  • Use recursive iterator approach:
    • execution begins at top of query plan
    • each node in the plan requests a tuple at a time from its children and performs some operation on the tuple(s).
    • pass result tuples up to parent.
object assignments oas
Object Assignments (OAs)
  • OA is a data structure containing slots for range variables with additional slots depending on the query.
  • Each slot within an OA will holds the oid of a vertex on a path being considered by the query engine.
  • Example: if OA1 holds oid for “Smith” then OA2 and OA3 can hold the oids for one of Smiths Office objects and Age objects.
query operators
Query Operators
  • For example, the Scan operator returns all oids that are subobjects of a given object following a specified path expression.
    • Scan (StartingOASlot, Path_expression, TargetOASlot)
  • For each oid in StartingOASlot, check to see if object satisfies path_expression and place oid into TargetOASlot.
  • Other operators include Join, Project, Select, Aggregation, etc.
  • Join node like nested-loop join in relational DBMS.
query optimization
Query Optimization
  • Does only a few optimizations:
    • Push selection ops down query tree.
    • Eliminate/combine redundant query operators.
  • Explores query plans that use indexes where possible.
    • Two kinds of indexes:
    • Lindex (link index) provide parent pointers impl. as hashing.
    • Vindex (value index) impl. as B+-trees
indexes
Indexes
  • Because of non-strict typing system, have String Vindex, Real Vindex, and String-coerced-to-real Vindex.
  • Separate B-Trees for each type are constructed.
  • Using Vindex for comparison (e.g. Age < 30) consider the following:
    • If type is string, do lookup in String Vindex
    • If can convert to real the do lookup in String-coerced-to-real Vindex.
    • If type is real?
other issues
Other issues
  • Update query operator example:
    • Update(Create_Edge, OA1, OA5, “Member”)
    • Create edge from results in OA1 to OA5 labeled “Member”.
  • Lore arranges objects in physical disk pages, each page with a number of slots with a single object in each slot.
    • Objects placed according to first-fit algorithm.
    • Supports large objects spanning multiple pages.
    • Objects clustered in depth-first manner (since Scan traverses depth-first).
    • Garbage collector removes unreachable objects.
external data manager
External Data Manager
  • Enables retrieval of information from other data sources, transparent to the user.
  • An external object in Lore is a “placeholder” for the external data and specifies how lore interacts with an external data source.
  • The spec for an external object includes:
    • Location of a wrapper program to fetch and convert data to OEM, time interval until fetched information becomes stale, and a set of arguments used to limit info fetched from external source.
data guides
Data Guides
  • A DataGuide is a concise and accurate summary of the structure of an OEM database (stored as OEM database itself, kind of like the system catalog).
  • Why?
    • No explicit schema, how do we formulate meaningful queries?
    • Large databases (can’t just view graph structure).
    • What if a path expression doesn’t exist (waste).
  • Each possible path expression is encoded once.
dataguides as histograms
DataGuides As Histograms
  • Each object in the dataguide can have a link to its corresponding target set.
    • A target set is a set of oids reachable by that path.
      • TS of DBGroup.Member.Age is {9, 13}.
    • This is a path index. Can find set of objects reachable by a particular path.
    • Can store statistics in DataGuide (more in next paper).
      • For example, the # of atomic objects of each type reachable by p.
conclusions
Conclusions
  • Takes advantage of the structure where it exists.
  • Handles lack of structure well (data type coercion, general path expressions).
  • Query language allows users to get and update data from semistructured sources.
    • DataGuide allows users to determine what paths exist, and gives useful statistical information
oem vs xml
OEM vs. XML
  • OEM’s objects correspond to elements in XML
  • Sub-elements in XML are inherently ordered.
  • XML elements may optionally include a list of attribute value pairs.
  • Graph structure for multiple incoming edges specified in XML with references (ID, IDREF attributes). i.e. the Project attribute.
indexes1
Indexes
  • Vindex(op, value, l, x) places into x all atomic objects that satisfy the “op value” condition with an incoming edge labeled l.
    • Vindex(“Age”, <, 30,y) places into y objects with age < 30.
  • Lindex(x, l, y) places into x all objects that are parents of y via edge labeled l.
    • Lindex(x, “Age”, y) places into x all parents of y via label “Age”.
indexes cont
Indexes (cont.)
  • Bindex(l, x, y) finds all parent-child object pairs connected by a label l.
    • Bindex(“Age”, x, y) locates all parent-child pairs with label Age.
  • Pindex(PathExpression, x) placed into x all objects reachable via the path expression.
    • Pindex(“A.B x, x.C y”, y) places into y all objects reachable by going from A to B to C.
    • Uses DataGuide.
simple query
Simple Query
  • select Ofrom DBGroup.Member M, M.Office Owhere exists y in M.Age : y < 30
  • Possible plans:
    • Top-down (similar to pointer-chasing, nested-loops join)
    • Use Vindex to check y < 30, traverse backwards from child to parent using Lindex(bottom-up).
    • Hybrid, both top down and bottom up. Meet in middle.
query plan generation overview
Query Plan Generation (Overview)
  • Logical query plan generator creates high-level execution strategy.
  • Physical query plan enumerator uses statistics and a cost model to transform logical query plan into an estimated best physical plan that lies within their search space.
logical query plans cont
Logical Query Plans (cont.)
  • Glue node represents a ‘rotation point’ that has as its children two independent subplans.
    • Rotating the order between independent components yields different plans.
    • Marks place where execution order is not fixed.
  • Discover node chooses best way to bind variables x and y.
  • Chain node chooses best evaluation of a path expression.
slide36
Logical query plan for:Select xFrom DBGroup.Member xWhere exists y in x.Age: y<30

from clause

where clause

physical query plans cont
Physical Query Plans (cont.)
  • Scan(x, l, y) places into y all objects that are subojects of x via edge labeled l.
    • Top-down (pointer chasing).
  • Lindex plan is bottom-up approach.
  • Bindex: Locate edges whose label appears infrequently in database.
  • NLJ: left subplan passes variables to right subplan.
statistics
Statistics
  • I/O metric uses estimated # of objects fetched.
  • For every label subpath p of length <= k:
    • # Of atomic objects of each type reachable by p
    • Min, and max values of all atomic objects of each type reachable by p
    • # Of instances of path p, denoted |p|
    • # Of distinct objects reachable by p, denoted |p|d
    • # Of l-labeled subobjects of all objects reachable by p
    • # Of incoming l-labeled edges to any instance of p, denoted |pl|
plan enumeration
Plan Enumeration
  • Doesn’t consider joining two simple path expressions together unless they share a common variable.
  • Pindex is used only when path expression begins with a name and no variable except the last is used in the query.
  • Select clause always executes last.
  • Doesn’t try to reorder multiple independent path expressions.
results
Results
  • Used XML database about movies. Database graph contained 62,256 nodes and 130,402 edges.
  • Experiment 1: Select DB.Movie.Title
    • Best plan is Pindex, followed by top-down
    • Worst plan is Bindex, with hash joins.
results cont
Results (cont.)
  • Experiment 2: All Movies with a Genre of “Comedy”
    • Where clause is very selective, bottom-up does a Vindex for “Comedy” with incoming edge Genre
results cont1
Results (cont.)
  • Experiment 3: Query with two existentially quantified variables in the where clause.
  • Errors due to bad estimates of atomic value distributions and set operation costs.
results cont2
Results (cont.)
  • Experiment 4: Select movies with certain quality rating.
  • Quality ratings uncommon in database so optimizer chooses to find all ratings via Bindex, and then work bottom-up.
conclusions1
Conclusions
  • Cost estimates are accurate and select the best plan most of the time
  • Execution times of best and worst plans for a given query can differ by many orders of magnitude.
  • Best strategy is highly dependent upon the query and database (Query optimization is good for XML data).
ad