1.08k likes | 1.32k Views
XML in RDBMSs. Why Bother?. Most (structured) data stored in RDBMS Seamless integration of XML and rel. Data Exploit capabilities of RDBMS Scalability, Availability, Performance(?), ... Many people with knowledge around XML-RDBMS Applications Legal documents, decision support on XML
E N D
Why Bother? • Most (structured) data stored in RDBMS • Seamless integration of XML and rel. Data • Exploit capabilities of RDBMS • Scalability, Availability, Performance(?), ... • Many people with knowledge around • XML-RDBMS Applications • Legal documents, decision support on XML • When not to use RDBMS • Streaming data (RSS, SOAP), IR (Google)
Approaches • Store XML as a CLOB/BLOB in relational database • Shredding into relations • Covered in the previous lecture • Manual/middleware based approaches • SQL/XML (now part of commercial products) • Extend SQL with XML data type • Plus: integrates well with relational data • Minus: not clear how it integrates with application, odd „marriage“
Overview of SQL/XML • Goal: standardization of interaction/integration of SQL and XML • Store XML Data in Relational Databases • Publish relational data as XML • Query XML data in RDBMS using XQuery • Part 14 of the SQL standard • First edition in SQL:2003 • Limited functionality: storage and publishing • Second edition in SQL:2006 • More complete integration of XQuery + XQuery Data Model • Transforming XML data into relational (table) format • Validating an XML value according to some XML Schema
XML Data Type in SQL • A new data type (like varchar, date, numeric) • No comparison operators defined • Can have optimized internal representation (different from character string) • In SQL:2003, XML values can be • An XML document • The content of an XML element (“well-formed external parsed entity”) • Null value • In SQL:2006, XML values can be • Any legal value of the XQuery Data Model • Null value • XQuery empty sequence is different from null value • Proper superset of XML values in SQL/XML:2003 • Additional, optional type modifiers can restrict possible XML values • XML(SEQUENCE), XML(CONTENT(ANY)), XML(DOCUMENT(UNTYPED)), …
XML Data Type CREATE TABLE employees ( id CHAR(6), lastname VARCHAR (30), ..., resume XML );
XMLPARSE & XMLSERIALIZE • SQL functions to convert from/to character strings and BLOBs • Examples: INSERT INTO employees VALUES (‘123456’, ‘Smith’, …, XMLPARSE(DOCUMENT ‘<?xml version="1.0"?> <resume xmlns="http://www.res.com/resume"><name> … </name><address> …</address>...</resume>’ PRESERVE WHITESPACE) ); SELECT e.id, XMLSERIALIZE(DOCUMENT e.resume AS VARCHAR (2000)) AS resume FROM employees AS e WHERE e.id = ‘123456’;
Publishing Functions • SQL functions/operators for generating XML constructs (elements, attributes, ...) within a query • XMLELEMENT generates an XQuery element node • XMLCONCAT concatenates XML values • XMLFOREST generates multiple XQuery element nodes • XMLAGG aggregates XML values across multiple relational tuples • XMLCOMMENT generates an XQuery comment node • XMLPI generates an XQuery processing instruction node • XMLCAST converts SQL to XML values and vice versa • XMLQUERY evaluates an XQuery expression • XMLVALIDATE validates a given XML value according to some XML Schema • XMLDOCUMENT wraps an XQuery document node around an XML value • XMLTEXT generates an XQuery text node
XMLELEMENT • Produces an XML value that corresponds to an XML element, given: • An SQL identifier that acts as its name • An optional list of namespace declarations • An optional list of named expressions that provides names and values of its attributes • An optional list of expressions that provides its content • An option how to handle NULL content: • EMPTY ON NULL (default since SQL/XML:2003 behavior) • NULL ON NULL • ABSENT ON NULL • NIL ON NULL • NIL ON NO CONTENT
XMLELEMENT • XMLELEMENT can produce simple element structures: SELECT e.id, XMLELEMENT (NAME "Emp", e.fname || ' ' || e.lname) AS "result“ FROM employees e WHERE ... ;
XMLELEMENT • XMLELEMENT can produce nested elements (with mixed content): SELECT e.id, XMLELEMENT (NAME "Emp", 'Employee ', XMLELEMENT (NAME "name", e.lname ), ' was hired on ', XMLELEMENT (NAME "hiredate", e.hire ) ) AS "result“ FROM employees e;
XMLELEMENT • XMLELEMENT can take subqueries as arguments: SELECT e.id, XMLELEMENT (NAME "Emp", XMLELEMENT (NAME "name", e.lname ), XMLELEMENT (NAME "dependants", (SELECT COUNT (*) FROM dependants d WHERE d.parent = e.id)) ) AS "result“ FROM employees e WHERE … ;
XMLATTRIBUTES (within XMLELEMENT) • Attribute specifications must be bracketed by XMLATTRIBUTES keyword and must appear directly after element name and optional namespace declaration. • Each attribute can be named implicitly or explicitly. SELECT e.id, XMLELEMENT (NAME "Emp", XMLATTRIBUTES (e.id, e.lname AS "name") ) AS "result" FROM employees e WHERE … ;
XMLNAMESPACES (within XMLELEMENT) • Namespace declarations are bracketed by XMLNAMESPACES keyword and must appear directly after element name. SELECT empno, XMLELEMENT(NAME "admi:employee", XMLNAMESPACES(’http://www.admi.com’ AS "admi"), XMLATTRIBUTES(e.workdept AS "admi:department"), e.lastname ) AS "result" FROM employees e WHERE e. job = ’ANALYST’;
XMLCONCAT • Produces an XML value given two or more expressions of XML type. • If any of the arguments evaluate to the null value, it is ignored. SELECT e.id, XMLCONCAT (XMLELEMENT (NAME "first", e.fname), XMLELEMENT ( NAME "last", e.lname) ) AS "result" FROM employees e;
XMLFOREST • Produces a sequence of XML elements given named expressions as arguments. Arguments can also contain a list of namespace declarations and an option how to handle NULL content: • NULL ON NULL (default since SQL/XML:2003 behavior) • EMPTY ON NULL • ABSENT ON NULL • NIL ON NULL • NIL ON NO CONTENT • Element can have an explicit name: • e.salary AS "empSalary" • Element can have an implicit name, if the expression is a column reference: • e.salary
XMLFOREST - Example SELECT e.id, XMLFOREST (e.hire, e.dept AS "department") AS "result" FROM employees e WHERE ... ;
XMLAGG • An aggregate function, similar to SUM, AVG, etc. • The argument for XMLAGG must be an expression of XML type. • Semantics • For each row in a group G, the expression is evaluated and the resulting XML values are concatenated to produce a single XML value as the result for G. • An ORDER BY clause can be specified to order the results of the argument expression before concatenating. • All null values are dropped before concatenating. • If all inputs to concatenation are null or if the group is empty, the result is the null value.
XMLAGG - Example SELECT XMLELEMENT ( NAME "Department", XMLATTRIBUTES ( e.dept AS "name" ), XMLAGG (XMLELEMENT (NAME "emp", e.lname)) ) AS "dept_list", COUNT(*) AS "dept_count" FROM employees e GROUP BY dept ;
XMLAGG and ORDER BY SELECT XMLELEMENT ( NAME "Department", XMLATTRIBUTES ( e.dept AS "name" ), XMLAGG (XMLELEMENT (NAME "emp", e.lname) ORDER BY e.lname) ) AS "dept_list", COUNT(*) AS "dept_count" FROM employees e GROUP BY dept ;
XMLCOMMENT & XMLPI • XMLCOMMENT • Creates an XQuery comment node (XML comment) • One mandatory character string input argument XMLCOMMENT (‘This is a comment’) <!--This is a comment--> • XMLPI • Creates an XQuery processing instruction node • Two input arguments: • One mandatory name for the PI (an SQL identifier); a.k.a. the PI target • One optional character string input argument; defines the content of the PI target XMLPI (NAME “includeFile”, ‘/POs/template.hls’) <?includeFile /POs/template.hls?>
XMLTEXT • Creates an XQuery text node • One mandatory character string input argument • Example: SELECT XMLELEMENT (NAME "p", XMLAGG (XMLCONCAT (XMLText (T.Text), XMLELEMENT (NAME "em", T.Emphasized_text) ) ORDER BY T.Seqno ) ) AS “result” FROM T
XMLDOCUMENT • Wraps an XQuery document node around an XML value • According to the rules of XQuery’s document constructor • Useful for converting an arbitrary XQuery sequence into an XML document or a “well-formed external parsed entity”
XMLQUERY • Evaluates an XQuery or XPath expression • Provided as a character string literal • Allows for optional arguments to be passed in • Zero or more named arguments • At most one unnamed argument can be passed in as the XQuery context item • Arguments can be of any predefined SQL data type incl. XML • Non-XML arguments will be implicitly converted using XMLCAST • Returns a sequence of XQuery nodes
XMLQUERY - Examples SELECT XMLQUERY(‘<e>hi</e>’ RETURNING SEQUENCE BY REF) AS “result” FROM T SELECT XMLQUERY(‘for $i in $po/purchaseOrder/customer where $i/zip[@type=“US”]=“95141” return $i/name’ PASSING BY REF po AS “po” RETURNING SEQUENCE) AS “Name_elements” FROM POrders
XMLCAST • Convert an SQL value into an XML value • Values of SQL predefined types are cast to XQuery atomic types using • The defined mapping of SQL types/values to XML Schema types/values • The semantics of XQuery’s cast expression • XMLCAST(NULL AS XML) returns the SQL null value typed as XML • Convert an XML value into an SQL value • XML values are converted to values of SQL predefined types using a combination of • The defined mapping of SQL types to XML Schema types • XQuery’s fn:data function • XQuery’s cast expression • SQL’s CAST specification • An XML value that is the empty sequence is converted to a NULL value of the specified SQL data type • Note: XMLCAST to/from character strings is different from XMLSERIALIZE and XMLPARSE
XMLCAST - Example SELECT XMLCAST ( XMLQUERY (‘<e>hi</e>’ RETURNING SEQUENCE BY REF) AS VARCHAR(20) ) AS “result” FROM T
XMLTABLE • Transforming XML data into table format • Allows for optional arguments to be passed in • Just like XMLQUERY • Element/attribute values are mapped to column values using path expressions (PATH) – the “column pattern” • Provided as a character string literal • Names and SQL data types for extracted values/columns need to be specified • Default values for “missing” columns can be provided • ORDINALITY column can be generated • Contains a sequential number of the corresponding XQuery item in the XQuery sequence (result of the row pattern)
XMLTABLE - Example SELECT X.* FROM POrders P, XMLTABLE (‘$po//customer’ PASSING P.po AS “po” COLUMNS “#num” FOR ORDINALITY, “CID” INTEGER PATH ‘@id’, “Name” VARCHAR(30) PATH ‘name’, “ZipType” CHAR(2) PATH ‘zip/@type’, “Zip” VARCHAR(6) PATH ‘zip’ ) AS “X”
SQL Predicates on XML Type • IS DOCUMENT • Checks whether an XML value conforms to the definition of a well-formed XML document • IS CONTENT • Checks whether an XML value conforms to the definition of either a well-formed XML document or a well-formed external parsed entity • IS VALID • Checks whether an XML value is valid according to a given XML Schema • Does not validate/modify the XML value; i.e., no default values are supplied. • XMLEXISTS • Checks whether the result of an XQuery expression (an XQuery sequence) contains at least one XQuery item
XMLEXISTS - Example SELECT id, bib FROM bibliographies WHERE XMLEXISTS(‘$doc/bib/book/@price’ PASSING bib AS “doc”);
Relation XML XML view relational view XMLELEMENT, … XMLTABLE, … XMLQUERY SQL XML data relational data default view
System Architecture K. Beyer, et al., “System RX: One Part Relational, One Part XML”, SIGMOD 2005
Native XML Storage Requirements • Must support direct access and subdocument updates • Documents need to span pages • Must support XQuery’s node reference semantics • Must support rollback and recovery • Need to reuse of existing relational infrastructure • Table spaces, buffer pools, lock manager and log manager used without modification • Reuse of transactions, concurrency, scalability and recoverability simplified implementation and is essential for coexistence with relational data
Native XML Store Highlights • XML is stored as instances of XQuery Data Model in a structured and type annotated tree • Every node contains pointers to its children and parent • Path expressions are directly executed over the native store • Each node has a unique identifier • A persistent dictionary maps strings to identifiers • Regions of nodes for a given XML document are grouped together on pages and linked by a logical regions index • Direct access to nodes using node identifiers and the regions index
Information for Every Node • Tag name (encoded as unique StringID) • A nodeID (Dewey node identifier) • Node kind (e.g. element, attribute, etc.) • Namespace / Namespace prefix • Type annotation • Pointer to parent • Array of child pointers • Hints to the kind & name of child nodes (for early-out navigation) • For text/attribute nodes: the data itself
XML Node Storage Layout • Node hierarchy of an XML document stored on DB2 pages • Documents that don’t fit on 1 page: split into regions / pages • Docs < 1 page: 1 region, multiple docs/regions per page
XML Storage: “Regions Index” • not user defined, default component of XML storage layer • maps nodeIDs to regions & pages • allows to fetch required regions instead of full documents • allows intelligent prefetching
XML Indexes • Define 0, 1 or multiple XML Value Indexes per XML column • Can index all elements/attributes, but not forced to do so • Index definition uses an XML pattern to specify which elements/attributes to index (and which not to) • XML pattern = XPath without predicates, only child axis, attribute axis, self axis, descendent axis and descendent-or-self axis • XML index maps: (pathID, value) → (nodeID, rowID) • Index any elements or attributes, incl. mixed content • E.g. • CREATE UNIQUE INDEX idx2 ON dept(deptdoc) USING XMLPATTERN '/dept/employee/@id' AS double; • CREATE INDEX idx3 ON dept(deptdoc) USING XMPPATTERN '/dept/employee/name' AS varchar(35);
XML Index Selection • Given an XPath query Q and an XPath Index I • Determine if the index is applicable • Build index pushdown QI : Q = QC (QI) • Index I can be used to answer query Q only if Q I, i.e. every node returned by Q is also returned by V. • XPath containment problem • XPath{[ ],*,//] subset is NP Complete • A. Balmin, F. Özcan, K. Beyer, R. Cochrane, H. Pirahesh, “A Framework for Using Materialized XPath Views in XML Query Processing”, VLDB 2004
When Indexes cannot be Used • When path expressions are in let bindings or return clause, indexes cannot be used for $i in db2-fn:xmlcolumn('T.XMLCOL')/a/b return <Customer> {$i[@c = 10]} </Customer> • Outerjoin semantics prevents index usage • Need to return correct number of empty <Customer> elements • Indexes will skip nodes • A. Balmin, K. Beyer, F. Özcan, “On the Path to Efficient XML Queries”, VLDB 2006
Querying XML Data in DB2 • XQuery/XPath as a stand-alone language • for $d in db2-fn:xmlcolumn(‘dept.deptdoc’)/dept let $emp := $d//employee/name return <EmpList> {$d/@bldg, $emp} </EmpList> • SQL embedded in XQuery • for $d in db2-fn:sqlquery('select deptdoc from dept where deptID = “PR27” ')… • for $d in db2-fn:xmlcolumn(‘dept.deptdoc’)/dept, $e in db2-fn:sqlquery('select xmlforest(name, desc) from unit u’)… • XQuery/XPath embedded in SQL/XML • xmlexists, xmlquery, xmltable • Plain SQL for full-document retrieval
Hybrid Query Compiler • Two major design decisions • No static typing to support dynamic, changing schemas • No XPath normalization into explicit FLWR blocks • Steps are not broken into selections, as in XPERANTO • Path expressions consisting of solely navigational steps are expressed as an atomic unit • Built on extensibility features of Starburst
Semantic Modeling • XQuery is represented via an internal Query Graph Model (QGM) • QGM • Is a semantic network used to represent the data flow in a query • Consists of operations and arcs, representing data flow between operations • QGM is augmented with native constructs which are specific to XML, such as complex navigation
Query Rewrite Transformations • QGM generated by the parser captures the most general semantics, and is not efficient in most cases • Two goals • Optimize the data flow by consolidating operations in the graph • Normalize QGM representation for cost-based optimizer • Rewrite transformations use heuristics, such as applying selections as early as possible • Rewrite transformations are part of a rule-based engine • Rules organized into classes