280 likes | 397 Views
This document presents advancements in XML querying through the introduction of XQuery and SQL/XML, designed specifically for efficiently managing XML data. It explores the need for schema flexibility, techniques for indexing XML documents in IBM's DB2, and the implications for large databases with varying schemas. The work emphasizes teaching the application of new XML query languages, sharing user experiences, and ensuring effective document filtering. It highlights essential query mechanisms and indexing strategies using XML patterns, aimed at maximizing database performance and usability.
E N D
On the Path to Efficient XML Queries Andrey Balmin, Kevin Beyer, Fatma Özcan IBM Almaden Research Center Matthias Nicola IBM Silicon Valley Lab
New languages = new abilities + new pitfalls • XQuery • A new query language designed specifically for XML data • SQL / XML • Added XML as a data type, including XQuery sequences • Added XQuery as a sublanguage
Purpose • Teach users of the new XML query languages • Teach the teachers • Share our users’ experiences • Influence languages
Focus • Large databases with many moderate XML documents • Schema flexibility is required • Many schemas in one collection • No schema validation used • Schemas with xs:any • Documents like Atom Syndication and RSS that allow any extension • Therefore • Document filtering is primary concern • Limited type inference • Any data is possible
Index eligibility Index Eligibility: We say that an index I is eligible to answer predicate P of query Q, if for any collection of XML documents D, the following holds: Q(D) = Q( I( P,D )). Where I( P,D ) is the set of XML documents produced, by probing index I with predicate P. This is not as obvious as it is in relational databases.
XML indexes in DB2 • Index a linear XPath pattern over a column as a particular datatype CREATE INDEX index-name ON table(xml-column) USING 'pattern' AS type pattern ::= namespace-decls? (( / | // ) axis? ( name-test | kind-test ))+ axis ::= @ | child:: | attribute:: | self:: | descendant:: | descendant-or-self:: name-test ::= qname | * | ncname:* | *:ncname kind-test ::= node() | text() | comment() | processing-instruction() type ::= varchar | double | date | timestamp
Query pattern index pattern CREATE INDEX li_price ON orders(orddoc)USING XMLPATTERN '//lineitem/@price' AS double • Can use the index: more restrictive for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') //order[ lineitem/@price > 100 ]return $i • Cannot use the index: less restrictive for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') //order[ lineitem/@* > 100 ]return $i
Match index and query predicate data type CREATE INDEX li_price ON orders(orddoc)USING XMLPATTERN '//lineitem/@price' AS double • Can use the index: numeric predicate and index for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') //order[ lineitem/@price > 100 ]return $i • Cannot use the index: string predicate for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') //order[lineitem/@price > "100" ]return $i
Data Types for Joins CREATE INDEX o_custid ON orders(orddoc) USING XMLPATTERN '//custid' AS double CREATE INDEX c_custid ON customer(cdoc) USING XMLPATTERN '/customer/id' AS double • Cannot use the indexes: unknown comparison type for $i in db2-fn:xmlcolumn("ORDERS.ORDDOC")/orderfor $j in db2-fn:xmlcolumn("CUSTOMER.CDOC")/customerwhere $i/custid = $j/idreturn $i • Can use the indexes: at least one cast required for $i in db2-fn:xmlcolumn("ORDERS.ORDDOC")/orderfor $j in db2-fn:xmlcolumn("CUSTOMER.CDOC")/customerwhere $i/custid/xs:double(.) = $j/id/xs:double(.)return $i
SQL/XML Query Functions • XMLQuery Scalar function that returns an (possibly empty) XQuery sequence for every row • XMLExists Predicate that returns true iff the XQuery sequence produced is not empty • XMLTable Produces a table with one row for each item in the row-producing XQuery sequence, and with one column per column-producing XQuery expression. The columns may be XQuery sequences or cast to simple SQL types.
Result XMLQuery does not filter rows (usually) • Cannot use the index: SELECT XMLQuery(‘ $order//lineitem[ @price > 100 ] ‘ passing orddoc as "order")FROM orders • Can use the index: VALUES (XMLQuery(’ db2-fn:xmlcolumn("ORDERS.ORDDOC") //lineitem[ @price > 100 ] ')) • Can use the index: db2-fn:xmlcolumn('ORDERS.ORDDOC') //lineitem[ @price > 100 ]
XMLExists filter rows (usually) • Can use the index SELECT ordid, orddocFROM ordersWHERE XMLExists(‘ $order//lineitem[ @price > 100 ]‘ passing orddoc as "order") • Cannot use the index: false exists SELECT ordid, orddocFROM ordersWHERE XMLExists(‘ $order//lineitem/@price > 100‘ passing orddoc as "order") Need XMLTest which uses XQuery’s Effective Boolean Value
XMLQuery + XMLExists vs. XMLTable • Can use the index SELECT ordid, XMLQuery(‘$order//lineitem[@price > 100] ’ passing orddoc as "order")FROM ordersWHERE XMLExists(‘$order//lineitem[@price > 100] ’ passing orddoc as "order") • XMLTable: More efficient and less redundant SELECT o.ordid, t.lineitemFROM orders o,XMLTable(‘$order//lineitem[@price > 100] ’ passing o.orddoc as "order“ COLUMNS "lineitem" XML BY REF PATH '.')as t(lineitem)
Predicates in XMLTable column expressions • Can use the index SELECT o.ordid, t.lineitemFROM orders o,XMLTable(‘$order//lineitem[@price > 100] ’ passing o.orddoc as "order“ COLUMNS "lineitem" XML BY REF PATH '.')as t(lineitem) • Cannot use the index SELECT o.ordid, t.lineitem, t.priceFROM orders o, XMLTable(‘ $order//lineitem ’ passing o.orddoc as "order" COLUMNS "lineitem" XML BY REF PATH '.', "price" DECIMAL(6,3) PATH '@price[. > 100]‘ )as t(lineitem, price)
Joining XML Values in SQL/XML • Can use index on product/id, but not p.id SELECT p.name, o.orddocFROM products p, orders oWHERE XMLExists(‘$order//lineitem/product[ id eq $pid ] ‘ passing o. orddoc as "order", p.id as "pid") • Can use index on p.id, but not product/id SELECT p.name, o.orddocFROM products p, orders oWHERE p.id = XMLCast( XMLQuery(‘ $order//lineitem/product/id ‘ passing o. orddoc as "order") as VARCHAR(13)) Need to unify XQuery and SQL data types
Joining XML Values in SQL/XML • Probably cannot use XML indexes: SQL types differ from XML SELECT c.name, o.orddocFROM orders o, customer cWHERE XMLCast( XMLQuery(‘ $order/order/custid ’ passing o.orddoc as "order") as DOUBLE) =XMLCast( XMLQuery(‘ $cust/customer/id ’ passing c.cdoc as "cust") as DOUBLE) • Can use XML indexes SELECT c.name, o.orddocFROM orders o, customer cWHERE XMLExists(‘ $order/order[ custid/xs:double(.) = $cust/customer/id/xs:double(.) ] ‘ passing o.orddoc as "order", c.cdoc as "cust")
XQuery Let Clauses • Can use the index for $doc in db2-fn:xmlcolumn('ORDERS.ORDDOC')for $item in $doc//lineitem[ @price > 100 ]return <result>{ $item }</result>for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/orderreturn $ord/lineitem[ @price > 100 ] • Cannot use the index for $doc in db2-fn:xmlcolumn('ORDERS.ORDDOC')let $item:= $doc//lineitem[ @price > 100 ]return <result>{ $item }</result>for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/orderreturn <result>{ $ord/lineitem[ @price > 100 ]}</result>
XQuery Let Clauses • Can use the index for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/orderwhere $ord/lineitem/@price > 100return <result>{ $ord/lineitem }</result> • Same as above for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/orderlet $price := $ord/lineitem/@pricewhere $price > 100return <result>{ $ord/lineitem }</result>
Context is everything • $i is bound to the document node for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') return $i/order/lineitem • $j is bound to <my_order> for $j in ( for $o in db2-fn:xmlcolumn('ORDERS.ORDDOC')/order return <my_order>{ $o/* }</my_order> )return $j/my_order/lineitem
Remember the dot • Produces a type error: no document node at root let $order := <new_order>{ db2-fn:xmlcolumn('ORDERS.ORDDOC') /order[custid > 1001] } </new_order>return $order[ //customer/name ] • Absolute path expressions is a shorthand for fn:root(.) treat as document-node(). • Absolute path expressions are bad style
Construction and View Composition • Want to rewrite this… let $view := for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC')/ order/lineitem return <ordered>{ $i/@quantity, $i/product/@price, <pid>{ $i/product/id/data(.) }</pid> }</ordered>for $j in $viewwhere $j/pid = '17‘return $j/@price
Construction and View Composition • … into this for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') /order/lineitemwhere $i/product/id/data(.) = '17‘return $i/product/@price • but…
Construction and View Composition • Data type changed to untypedAtomic • id is string: comparison is now an error • id is long: comparison is now as double instead of long • List types are concatenated • Error for duplicate @price attributes lost • New node identity lost • Parent axis is broken Any sequence should live in tree without change.Separate identity from construction?
Remember the namespaces • Index definition and query must match namespaces CREATE INDEX li_price ON orders(orddoc)USING XMLPATTERN '//lineitem/@price' AS double • Cannot use the index. Which is right? declare default element namespace "http://ournamespaces.com/order";for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') //order[ lineitem/@price > 100 ]return $i
Elements and text nodes differ CREATE INDEX PRICE_TEXT ON orders.orddocUSING XMLPATTERN '//price' AS varchar • Can not use index for $ord in db2-fn:xmlcolumn(“ORDERS.ORDDOC”) /order[ lineitem/price/text() = “99.50” ]return $ord • Element might have more data than just text <price>99.50<unit>USD</unit></price>
Attributes are shy • No attributes //*//node() • Only attributes //@*//attribute::node() • Empty result due to “principle node kind” //@*/self:*
Between predicates are not obvious • Might not be between: multiple prices lineitem[ price > 100 and price < 200 ] • Between or error lineitem[ price gt 100 and price lt 200 ] • Always between lineitem/price/data()[ . > 100 and . < 200 ] • Between if not list type lineitem/price[ . > 100 and . < 200 ] • Between if not list type lineitem[ @price > 100 and @price < 200 ]
Conclusions • Easy to make mistakes without schema constraints • Many subtle differences in expressions • Improve construction composition • Unify SQL and XQuery type systems • Add XMLTest to SQL/XML