Native XML Databases

Native XML Databases • Lior Schejter, Erez Hadad

Outline • Introduction: What? Why? How? • An example XML DBMS: Natix • Query processing • Natix query engine • XML DBMS Storage • What is stored in an XML-DBMS? • Storage Management: Natix Storage • Storing documents and Indexes • Transaction Management • Logging & Recovery • Locking

Native XML DBMS – What? • A DBMS native to XML • Providing programming interfaces to manage and query XML data (1 or more documents) • Using a full-blown, consistent API • XML is the natural way of accessing data in this kind of DBMS • Defining, querying etc. • Providing all the familiar features of a DBMS • Transactions • Recovery • Multithreading

Native XML Databases – Why? • XML structure provides information • Complex data model • Ever tried representing an organizational hierarchy in a DBMS? • Flexible data model • Both structured and semi-structured data • Inherent XML “behavior” → the DBMS will be optimized for XML • Think about querying / storing / updating thousands of XML documents

Native XML Databases – How? • XML can be thought of as a: • Text document (tags, simple text etc.) • A data model (nodes, children, siblings) • Document centric – Storing and retrieving the entire document (or large parts of it) • Fast document construction and storage • Slow on queries, and retrieving data • Data centric – Expressing the entire data in an internal data structure • Fast queries • Slow document retrieval • A question of granularity

Natix • Natix – An XML DBMS • Developed at the University of Mannheim, Germany • Designed from scratch for storing and accessing XML data • Supports XPath, XQuery • Not tied to any specific language and / or environment


Natix - Architecture • Storage layer – Manage all persistent data storage • Service Layer – Provides DBMS functionality • Binding Layer – Modules that map data and requests from other APIs to the Natix engine interface

Natix Query Execution Engine • Able to execute all queries in a typical XML query language (e.g. XQuery). • Expressive: Small number of powerful parameterized operators • 2 main components: • Natix Physical Algebra (NPA) • Algebraic operators, their composition etc. • Natix Virtual Machine (NVM) • Plans used in the algebraic operators

NQE – The Rough Guide • NPA works on sequences of tuples • Each tuple holds values which can be a number, a string or a node handle • XML node handles can point to any node type • NPA operators are implemented as iterators • NPA operators take programs for the NVM as parameters • Usually passed at construction

Natix Virtual Machine • NVM commands operate on register sets • An NVM command may access several register sets • Global (X) register set for global (but specific to current plan of execution) data. • Z, Y register sets for arguments (pass between operators) • Y register sets are used only for binary operators • A reference to register sets is passed between operators to avoid unnecessary copying

NVM program examples
  CMP_LEQ_SI4_ZCX 1 55 2
  EXIT_F 2                      if (Z1 > 55) then exit
  ARITH_ADD_A_SI4_ZZX 1 2 3     X3 = Z1 + Z2
  PRINT_SI4 3                   PRINT X3
  STOP
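To make the register-set idea concrete, here is a minimal Python sketch of an interpreter for just the five commands listed on this slide. It treats the commands as one program, and the operand order (source registers first, destination last) is inferred from the slide's annotations; both are assumptions, and `run_nvm` is a hypothetical name, not part of Natix.

```python
def run_nvm(program, z, x=None):
    """Interpret a tiny, assumed subset of NVM-like commands.

    z is the Z register set (the input tuple), x the X register set
    (data global to the current plan). Returns (printed values, x).
    """
    x = {} if x is None else x
    printed = []
    for op, *args in program:
        if op == "CMP_LEQ_SI4_ZCX":        # X[dst] = (Z[src] <= constant)
            src, const, dst = args
            x[dst] = z[src] <= const
        elif op == "EXIT_F":               # leave the program if X[reg] is false
            if not x[args[0]]:
                break
        elif op == "ARITH_ADD_A_SI4_ZZX":  # X[dst] = Z[a] + Z[b]
            a, b, dst = args
            x[dst] = z[a] + z[b]
        elif op == "PRINT_SI4":            # collect X[reg] instead of printing
            printed.append(x[args[0]])
        elif op == "STOP":
            break
        else:
            raise ValueError(f"unknown command {op}")
    return printed, x


PROGRAM = [
    ("CMP_LEQ_SI4_ZCX", 1, 55, 2),
    ("EXIT_F", 2),
    ("ARITH_ADD_A_SI4_ZZX", 1, 2, 3),
    ("PRINT_SI4", 3),
    ("STOP",),
]
```

With `z = {1: 10, 2: 5}` the comparison succeeds and the sum is collected; with `z = {1: 60, 2: 5}` the program exits at `EXIT_F` before anything is added.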

XML NVM Commands • NVM has about 150 XML specific commands • Copying documents handles, comparing fragments, traversing the XML document tree, printing etc. • Mainly commands which correspond to XPath axes (child, sibling, descendant etc.)

Example: UnnestMap operator • UnnestMap • Logically: takes a set-valued expression and returns a single tuple for each element in the result set, flattening the hierarchy 1 level deep • Physically: takes 3 programs and uses them as an iterator: • init – initialize the first tuple to be returned • step – compute the next tuple • fin – finalize, cleanup

UnnestMap Operator
init:
  XML_CHILD_ZZ 1 2
  XML_VALID_ZX 2 3
  EXIT_F 3
  MV_XML_ZX 2 4
step:
  XML_SIBLING_NEXT_XX 4 4
  XML_VALID_XX 4 3
  EXIT_F 3
  MV_XML_XZ 4 2
fin: no finish program
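The init / step pair can be read as a first-child, next-sibling iteration. The Python sketch below mirrors that reading: each NVM command is paraphrased as one Python statement, with the command name in a comment. The `Node` class, the dictionary-based register sets and the exact command semantics are assumptions made for illustration.

```python
class Node:
    """Minimal XML-like tree node (hypothetical, for illustration only)."""

    def __init__(self, name, *children):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def first_child(self):
        return self.children[0] if self.children else None

    def next_sibling(self):
        if self.parent is None:
            return None
        sibs = self.parent.children
        i = next(k for k, s in enumerate(sibs) if s is self)
        return sibs[i + 1] if i + 1 < len(sibs) else None


def init(z, x):
    z[2] = z[1].first_child()       # XML_CHILD_ZZ 1 2
    x[3] = z[2] is not None         # XML_VALID_ZX 2 3
    if not x[3]:                    # EXIT_F 3
        return False
    x[4] = z[2]                     # MV_XML_ZX 2 4
    return True


def step(z, x):
    x[4] = x[4].next_sibling()      # XML_SIBLING_NEXT_XX 4 4
    x[3] = x[4] is not None         # XML_VALID_XX 4 3
    if not x[3]:                    # EXIT_F 3
        return False
    z[2] = x[4]                     # MV_XML_XZ 4 2
    return True


def unnest_map(input_tuples, init, step):
    """Iterator-style operator: one output tuple per child of Z1."""
    for z in input_tuples:
        x = {}                      # X registers, private to this operator
        ok = init(z, x)
        while ok:
            yield dict(z)           # copy, so later steps do not alter it
            ok = step(z, x)
```

For a root with three children, `unnest_map([{1: root}], init, step)` yields three tuples whose register 2 holds the children in document order.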

Natix Physical Algebra • Operators for selection and binding combination are borrowed from the relational and object databases contexts • select, join, map, group etc. • The main concern: variable binding and result construction operators for XML

NPA – Query plans • Every plan of execution has a scan operation at the bottom of it: scanning a document and retrieving its root in a tuple • e.g. Expression Scan • UnnestMap and PathScan are used for variable bindings as well • An XPath expression can be translated into a sequence of UnnestMap operations • Or a single PathScan operation, which also eliminates duplicates, like in XPath

NPA Operators • Figure: an operator (e.g. SELECT) placed on top of any subplan

NPA Examples • Example DTD:

NPA Example 1 • Query:
<result> {
  FOR $c IN document("bib.xml")/bib/conference
  WHERE $c/year > 1996
  RETURN
    <conference>
      <title>{$c/title}</title>
      <year>{$c/year}</year>
    </conference>
} </result>
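As a reading aid, the query's meaning can be restated in plain Python with the standard `xml.etree.ElementTree` module. This is not the Natix plan, only a sketch of what the query computes. The sample document `BIB` is invented, and the sketch copies the text of `title` and `year` rather than the elements themselves, which is a simplification.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real bib.xml is not part of the slides.
BIB = """<bib>
  <conference><title>VLDB</title><year>1995</year></conference>
  <conference><title>SIGMOD</title><year>2000</year></conference>
</bib>"""


def recent_conferences(xml_text, after=1996):
    bib = ET.fromstring(xml_text)                  # document root: <bib>
    result = ET.Element("result")
    for c in bib.findall("conference"):            # FOR $c IN /bib/conference
        year = c.findtext("year")
        if year is not None and int(year) > after:  # WHERE $c/year > 1996
            conf = ET.SubElement(result, "conference")   # RETURN <conference>
            ET.SubElement(conf, "title").text = c.findtext("title")
            ET.SubElement(conf, "year").text = year
    return result
```

Calling `ET.tostring(recent_conferences(BIB), encoding="unicode")` serializes the constructed `<result>` element.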

NPA Example 1 • The query plan: (figure; its legend distinguishes function calls from tuples)

NPA Example 2 • Query:
<bib> {
  FOR $a IN document("bib.xml")//conference/article/author
  RETURN
    <author>
      <name first={$a/@first} last={$a/@last} />
      <articles> {
        FOR $b IN document("bib.xml")//conference/article,
            $c IN $b/author
        WHERE $c/@first=$a/@first AND $c/@last=$a/@last
        RETURN <article>{$b/title}</article>
      } </articles>
    </author>
} </bib>

NPA Example 2 • The query plan:

Part II: Storage & Transactions in Native XML DBMS

What Is Stored In A Native XML-DB? • XML Documents • The data itself • The DBMS tries to maintain imported documents as close to their original form as possible • Data Definition Schemas • XML schemas, RelaxNG schemas, DTDs • Used for: • Validating documents • Organizing data on disk • Validating and optimizing queries, constructing result sets • Semantics: types, operations

What Is Stored In A Native XML-DB? • Collections / Roots: • Bindings of XML documents into sets • According to type or relevance • A document may belong to more than one set • A collection may be related to a schema • Collections are valid instances of the data model and can be processed through queries, e.g.:
for $d in collection("foo")
where $d/Book/Author/Lastname = "Dante"
return $d

What Is Stored In A Native XML-DB? • “Standard DB” components: • Indexes • Speed up query execution • Stored functions / procedures / triggers • Embed business logic in storage • Server-local processing – reduce network traffic • Create views – abstractions of data • Access control data • Users, resources, groups, permissions

Storage Management • Usually, it is impossible to hold all the DB components in main memory • Main memory is several orders of magnitude smaller than secondary storage (disk) • A common technique is to keep only a few fixed objects in memory and load the rest on demand • Analogous to the virtual memory / disk caching mechanisms of operating systems
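Loading on demand is usually done by a buffer manager that keeps a bounded number of pages in memory. The sketch below uses a least-recently-used eviction policy. The slides do not name a policy, so LRU is an assumption, and `BufferManager` and `read_page` are hypothetical names.

```python
from collections import OrderedDict


class BufferManager:
    """Keeps at most `capacity` pages in memory, evicting the LRU page."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page    # callable: page number -> bytes (the "disk")
        self.frames = OrderedDict()   # page number -> page data, oldest first
        self.misses = 0

    def get(self, page_no):
        if page_no in self.frames:
            self.frames.move_to_end(page_no)   # mark as most recently used
            return self.frames[page_no]
        self.misses += 1
        data = self.read_page(page_no)         # load on demand
        self.frames[page_no] = data
        if len(self.frames) > self.capacity:
            self.frames.popitem(last=False)    # evict least recently used
        return data
```

A real buffer manager would also pin pages that are in use and write dirty pages back before eviction; both are left out here.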

Storage Management In Natix • Figure: storage layers, labelled Records; Segments (slotted page segments, internal database structure); Page Interpreters; Pages; Buffer Manager; Partitions; Physical Storage

How To Store XML Documents in a DB? • Flat Stream: • Each XML document is a byte stream (e.g. a file or DBMS BLOB) • Fast handling of large sequential chunks or whole documents [document-centric] • Poor random access • Requires parsing of XML • Example: Web server’s HTML file tree

How To Store XML Documents in a DB? • Meta Modeling: • Separately store each element of the data model of an XML document (e.g. in a DBMS) • Analogy to RDBMS: Entities and relations (of an ERD model) stored in separate records in tables • Fast random access [data-centric] • Slow processing of whole documents out of (possibly) thousands of separate records • Mechanism required for “translating” between data models, e.g. XML <=> Relational

How To Store XML Documents in a DB? • Mix of FS and MM: • Redundant: Store each document both as a byte stream and as a collection of records • Read access is optimal: use whichever representation matches the case • Write access has high overhead: update both types of storage in each operation • Hybrid: Define a “granularity threshold” • A “small structure” object is stored as a flat stream inside a single record of a database • A “large structure” object is divided into several records • Balances cost between whole-document and per-node operations

Natix XML Storage • A hybrid approach: • Each database record contains a single subtree of an XML document • A dynamic granularity threshold, adapting to size and structure of documents at runtime • A subtree can grow and split into several records • Small subtrees can be merged into a larger subtree in a single record

Types of Stored Nodes in Natix • Aggregate nodes: inner nodes of the tree, containing their respective child nodes • Helper aggregate nodes: “virtual” aggregate nodes used for grouping subsets of children of an actual aggregate node into subtrees in records • Literal nodes: leaf nodes each containing an unparsed stream of bytes • Proxy nodes: “virtual” nodes that point to subtrees contained in other records

Natix Storage Example
XML File:
<f1>
  <f2>..</f2>
  <f3>..</f3>
  <f4>
    <f7/>
    <f8>..</f8>
  </f4>
  <f5/>
  <f6>..</f6>
</f1>
Logical Tree:
f1
  f2
  f3
  f4
    f7
    f8
  f5
  f6

Natix Storage Example
Physical Tree:
r1: f1
      p1 (proxy to r2)
      p2 (proxy to r3)
      p3 (proxy to r4)
r2: h1
      f2
      f3
r3: f4
      f7
      f8
r4: h2
      f5
      f6
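The relation between the physical and the logical tree can be sketched in a few lines of Python: following proxies and dissolving helper aggregates turns the four records back into the logical tree. The tuple encoding and the pairing of proxies p1, p2, p3 with records r2, r3, r4 are assumptions read off the flattened figure.

```python
# Physical nodes are tuples (hypothetical encoding):
#   ("elem", name, children)    aggregate node from the document
#   ("helper", name, children)  helper aggregate, invisible logically
#   ("proxy", record_id)        pointer to a subtree in another record
RECORDS = {
    "r1": ("elem", "f1", [("proxy", "r2"), ("proxy", "r3"), ("proxy", "r4")]),
    "r2": ("helper", "h1", [("elem", "f2", []), ("elem", "f3", [])]),
    "r3": ("elem", "f4", [("elem", "f7", []), ("elem", "f8", [])]),
    "r4": ("helper", "h2", [("elem", "f5", []), ("elem", "f6", [])]),
}


def logical(node, records):
    """Return the list of logical (name, children) nodes a physical node stands for."""
    kind = node[0]
    if kind == "proxy":
        return logical(records[node[1]], records)   # follow the proxy
    children = [n for child in node[2] for n in logical(child, records)]
    if kind == "helper":
        return children                             # helper aggregates dissolve
    return [(node[1], children)]
```

`logical(RECORDS["r1"], RECORDS)` returns a single `f1` node whose children are f2, f3, f4 (with f7 and f8), f5 and f6, matching the logical tree of the example.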

Modifying Documents In Natix • The physical tree is treated as a B-tree of records and kept balanced • If, when inserting a subtree, a record becomes too big • Split the record into a separator part, a left part and a right part • Insert the separator into the parent record • The algorithm may repeat in the parent • Similarly, a delete operation can result in a record merge

Modifying Documents In Natix
1. Add node f10 to a record containing the subtree shown in the figure (nodes f1 to f14, with separator S, left forest L and right forest R marked)
2. f7 is the Split Node
3. Separator S is the path from the root up to but not including f7
4. Right Forest R is induced by the subtree of f7 and all the descendants of S located right of f7
5. Left Forest L is induced by the rest of the nodes
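The partition into S, L and R can be sketched as follows. The function only computes the three parts for a given split node; choosing the split node and writing the parts into records are not shown. The example tree in the usage note is invented, since the slide's figure is not recoverable, and `Node`, `find_path` and `split` are hypothetical names.

```python
class Node:
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)


def find_path(root, target):
    """Nodes from root down to the node named `target`, inclusive ([] if absent)."""
    if root.name == target:
        return [root]
    for child in root.children:
        sub = find_path(child, target)
        if sub:
            return [root] + sub
    return []


def split(root, split_name):
    """Return (separator, left forest roots, right forest roots) as name lists."""
    path = find_path(root, split_name)
    if len(path) < 2:
        raise ValueError("split node must exist and must not be the root")
    separator = [n.name for n in path[:-1]]   # root .. parent of the split node
    left, right_levels = [], []
    for parent, on_path in zip(path, path[1:]):
        i = next(k for k, c in enumerate(parent.children) if c is on_path)
        left.extend(c.name for c in parent.children[:i])               # left of path
        right_levels.append([c.name for c in parent.children[i + 1:]])  # right of path
    right = [path[-1].name]                   # subtree of the split node itself
    for level in reversed(right_levels):      # deepest level first: document order
        right.extend(level)
    return separator, left, right
```

For the invented tree `f1(f2, f6(f5, f7(f8, f9), f10), f11)` and split node f7, the separator is f1, f6; the left forest has roots f2 and f5; the right forest has roots f7, f10 and f11.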

Modifying Documents In Natix
1. In L & R, subtrees with sibling roots are grouped using helper aggregates
2. Each subtree is put into a separate new partition record
3. The separator connects to the partition records through proxy nodes
4. The separator either replaces the proxy in the parent or forms a new root
(Figure: the resulting records r1 to r4 below the parent record, with helper aggregates h1, h2 and proxy nodes)

Indexes In Native XML-DBs • Indexes accelerate the evaluation of queries by quickly locating elements / values / text in the DB • Indexes may be created • as fixed parts of the storage system, or • upon user request, or • automatically due to repeated use of certain queries • Index granularity may vary: • Point to each node that contains a specific key, or just to the containing document • A trade-off between size & construction speed of the index vs. its effectiveness

Indexes In Native XML-DBs • Common types of indexes: • Value indexes: list the locations of each typed value of a node • E.g., locations of the integer value 1492 • Element indexes: list the locations of elements in documents, preserving hierarchy • Locate an element of a specific type (//footnote) or in a specific context (/appendix/footnote) • E.g., Tamino index structure, Natix XASR

Indexes In Native XML-DBs • Full Text Indexes: list the location of text within the content of elements • Common technique: inverted files • “Location of word”: offset in file and / or in document hierarchy • Becomes more useful when the document is less structured • In the future, text index mechanisms are expected to resemble a modern search engine: • Handle word equivalence (singular/plural, synonyms) • Ranked matching (degree of proximity instead of true/false)
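An inverted file maps each word to the list of places where it occurs. The sketch below records, for every word, the document, the element path and the character offset inside that element's text. The input layout (a dictionary of element paths per document) and the function name are assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: {element_path: text}} -> {word: [(doc_id, path, offset)]}"""
    index = defaultdict(list)
    for doc_id, elements in docs.items():
        for path, text in elements.items():
            for match in re.finditer(r"\w+", text.lower()):
                # location = document + position in hierarchy + offset in text
                index[match.group()].append((doc_id, path, match.start()))
    return index


DOCS = {
    "d1": {"/book/title": "Divine Comedy"},
    "d2": {"/book/note": "a comedy of errors"},
}
```

Looking up `build_inverted_index(DOCS)["comedy"]` returns one posting per occurrence; a coarser index could store only the document id, trading precision for size.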

Natix eXtended Access Support Relations (XASR)
1. The document is traversed in DFS order. Each node is assigned a value dmin upon entry and dmax upon exit. dmin is also the unique node id.
   (Figure: example document tree with nodes bioml, organism, organelle, organelle, label, label and text values “cytoskeleton”, “mitochondrion”, annotated with dmin / dmax values 1 to 12)
2. An XASR table is constructed as follows: [table figure]
3. During query evaluation, path connectors (‘/’ or ‘//’) are resolved through join operations on the XASR table:
   The join predicate for ‘/’ is: x_i.docID = x_{i+1}.docID and x_i.dmin = x_{i+1}.parent
   The join predicate for ‘//’ is: x_i.docID = x_{i+1}.docID and x_i.dmin < x_{i+1}.dmin and x_i.dmax > x_{i+1}.dmax
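The numbering scheme and the two join predicates can be tried out with a short Python sketch. The row layout (docID, dmin, dmax, parent, name) follows the fields used in the predicates. The shape of the example tree is an assumption reconstructed from the node labels, and the nested-loop joins are only the simplest way to evaluate the predicates.

```python
class Node:
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)


def build_xasr(root, doc_id):
    """DFS numbering: dmin on entry, dmax on exit; parent holds the parent's dmin."""
    rows = []
    counter = 0

    def visit(node, parent_dmin):
        nonlocal counter
        counter += 1
        row = {"docID": doc_id, "dmin": counter, "dmax": None,
               "parent": parent_dmin, "name": node.name}
        rows.append(row)
        for child in node.children:
            visit(child, row["dmin"])
        counter += 1
        row["dmax"] = counter

    visit(root, None)
    return rows


def child_join(left, right):
    """Join predicate for '/': same document, right's parent is left."""
    return [(a, b) for a in left for b in right
            if a["docID"] == b["docID"] and a["dmin"] == b["parent"]]


def descendant_join(left, right):
    """Join predicate for '//': right's interval lies strictly inside left's."""
    return [(a, b) for a in left for b in right
            if a["docID"] == b["docID"]
            and a["dmin"] < b["dmin"] and a["dmax"] > b["dmax"]]


# Assumed shape of the slide's example document.
TREE = Node("bioml", Node("organism",
                          Node("organelle", Node("label")),
                          Node("organelle", Node("label"))))
```

A path such as `//organelle/label` then becomes `child_join` over the rows named organelle and the rows named label.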

Transaction Management • An XML DBMS provides support for transactions • A sequence of operations on XML items that can be either committed or rolled back altogether • Transaction execution follows the ACID properties: • Atomicity: Each transaction should either complete or have no effect at all • Consistency: Each transaction should transform the DB from one consistent state to another • Isolation (serialization): Concurrently executing transactions should behave as if they’re executing in some sequential order • Durability: Once committed, a transaction’s effect on the DB is permanent

Transaction Management • Consistency is achieved by properly defining the transaction boundaries • Example: when moving money between bank accounts, update both accounts in one transaction • Responsibility of the application programmer • The other properties are provided by the DBMS • Atomicity and durability are provided through logging and recovery • Isolation is maintained through locking

Logging And Recovery • Logging and recovery provide two important functions: undo and redo • Undo of transactions that are aborted (will not complete) • Enables atomicity • Redo of committed transactions • in case of DBMS failure before transaction results were completely written to disk • Enables durability

Logging And Recovery • The XML DBMS keeps a log of all operations affecting the DB • Each transaction operation that writes to a DB item generates a log record • Write-ahead logging: log then write • All log records of the same transaction are linked chronologically forward (redo) and backward (undo)
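A toy model helps to see how one log serves both undo and redo. In the sketch below every write appends a log record with before and after values before the data is changed; rollback walks the log backward, and recovery replays the updates of committed transactions forward. The class, the record layout, and the assumption that recovery starts from an empty database with the full log available are all simplifications for illustration.

```python
class Database:
    def __init__(self):
        self.log = []    # log records, assumed to survive a crash
        self.disk = {}   # data items, keyed by name

    def write(self, txn, key, value):
        # Write-ahead logging: append the log record first, then write.
        self.log.append(("update", txn, key, self.disk.get(key), value))
        self.disk[key] = value

    def commit(self, txn):
        self.log.append(("commit", txn))

    def rollback(self, txn):
        # Undo: restore before-values, newest update first.
        for rec in reversed(self.log):
            if rec[0] == "update" and rec[1] == txn:
                _, _, key, before, _ = rec
                if before is None:
                    self.disk.pop(key, None)   # item did not exist before
                else:
                    self.disk[key] = before
        self.log.append(("abort", txn))


def recover(log):
    """Redo: rebuild the state from updates of committed transactions only."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    state = {}
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _, key, _, after = rec
            state[key] = after
    return state
```

In this model `None` doubles as "item absent", so stored values must not be `None`.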

Logging And Recovery In Natix • Optimizing L&R for XML hierarchies: • Subsidiary Logging: log records are cached and unified into more compact records before entering the log • One log record of adding a subtree instead of many added-single-node log records • Annihilator Undo: No need to perform undo operations that are covered by later undo operations • Skip remove-node / modify-node operations that are followed by matching remove-subtree operations

Locking Mechanism • In order to ensure that transactions are isolated from each other, each transaction locks its DB resources before operating on them • Locking a resource prevents other transactions from modifying/accessing it until unlocked • Most common locking protocol: S2PL (Strict 2-Phase Locking) • Every resource is locked before its first access • All resources are atomically unlocked together with abort/commit
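The two S2PL rules (lock before first access, release everything at commit or abort) fit into a very small lock manager. The sketch below supports exclusive locks only and reports a conflict instead of blocking; shared locks, waiting and deadlock handling are left out, and all names are hypothetical.

```python
class LockManager:
    """Exclusive locks only; a single-threaded model of strict 2PL."""

    def __init__(self):
        self.owner = {}   # resource -> transaction holding the lock
        self.held = {}    # transaction -> set of locked resources

    def acquire(self, txn, resource):
        holder = self.owner.get(resource)
        if holder is not None and holder != txn:
            return False                  # conflict: caller must wait or abort
        self.owner[resource] = txn
        self.held.setdefault(txn, set()).add(resource)
        return True

    def release_all(self, txn):
        # Called only at commit or abort: all locks are dropped together.
        for resource in self.held.pop(txn, set()):
            del self.owner[resource]
```

Because a transaction never releases a lock before it ends, no other transaction can observe its uncommitted changes.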
