
Storing XML

Storing XML. Sihem Amer-Yahia, AT&T Labs - Research. What's XML? A W3C standard since 1998; a subset of SGML (ISO Standard Generalized Markup Language); a data-description markup language, whereas HTML is a text-rendering markup language; the de facto format for data exchange on the Internet, for example in electronic commerce.


Presentation Transcript


  1. Storing XML • Sihem Amer-Yahia, AT&T Labs - Research

  2. What’s XML? • W3C Standard since 1998 • Subset of SGML (ISO Standard Generalized Markup Language) • Data-description markup language, whereas HTML is a text-rendering markup language • De facto format for data exchange on the Internet • Electronic commerce • Business-to-business (B2B) communication

  3. XML: A Wire Protocol • XML = A minimal wire representation for data and storage exchange • A low-level wire transfer format – like IP in networking • Minimal level of standardization for distributed components to interoperate • Platform, language and vendor agnostic • Easy to understand and extensible • Data exchange enabled via XML transformations

  4. Core XML Technologies • XML Validation: Contract for Data Exchange • DTD, RELAX NG, XML Schema • XML API: Programmatic Access to XML • DOM, SAX • Transformation Languages for Data Exchange and Display • XSL, XSLT, XPATH, XQuery

  5. XML Data Model Highlights • Tagged elements describe semantics of data • Easier to parse for a machine and for a human • Element may have attributes • Element can contain nested sub-elements • Sub-elements may themselves be tagged elements or character data • Tree structure • Can capture any data-model • Easier to navigate

  6. An XML Document
     <?xml version="1.0"?>
     <!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
     <sigmodRecord>
       <issue>
         <volume>1</volume>
         <number>1</number>
         <articles>
           <article>
             <title>XML Research Issues</title>
             <initPage>1</initPage>
             <endPage>5</endPage>
             <authors>
               <author AuthorPosition="00">Tom Hanks</author>
             </authors>
           </article>
         </articles>
       </issue>
     </sigmodRecord>

  7. Document Type Definition (DTD) • An XML document may have a DTD • Grammar for describing document structure • Terminology • well-formed: if tags are correctly nested and closed • valid: if it has a DTD and conforms to it • Validation useful for data exchange
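
The well-formed/valid distinction above can be tried out with a few lines of Python. This is only a sketch of the well-formedness half: the helper name `is_well_formed` and the two sample strings are made up for illustration, and the standard-library parser used here does not validate against a DTD, so checking validity would need a validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML (tags correctly nested and closed)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative inputs (not taken from sigmodRecord.xml itself).
GOOD = "<article><title>XML Research Issues</title></article>"
BAD = "<article><title>XML Research Issues</article>"  # <title> is never closed

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```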

  8. W3C XML Schema • Rich set of scalar types • User-defined simple types • Complex types factor common structure • Sequences, choice, repetition, recursion of elements • Sub-typing supports schema reuse • Integrity constraints

  9. DTD vs XML Schema
     • DTD
       <!ELEMENT article (title, initPage, endPage, author)>
       <!ELEMENT title (#PCDATA)>
       <!ELEMENT initPage (#PCDATA)>
       <!ELEMENT endPage (#PCDATA)>
       <!ELEMENT author (#PCDATA)>
     • XML Schema
       <xsd:element name="article" minOccurs="0" maxOccurs="unbounded">
         <xsd:complexType>
           <xsd:sequence>
             <xsd:element name="title" type="xsd:string"/>
             <xsd:element name="initPage" type="xsd:string"/>
             <xsd:element name="endPage" type="xsd:string"/>
             <xsd:element name="author" type="xsd:string"/>
           </xsd:sequence>
         </xsd:complexType>
       </xsd:element>

  10. XML API: DOM • Hierarchical (tree) object model for XML documents • Associate a list of children with every node (or text value) • Preserves sequence of elements in XML document • May be expensive to materialize for a large XML collection

  11. DOM Features • DOM API supports: • Navigation: access all attribute nodes, children, first/last child, next/previous sibling, parent, … • Creation: create new node • Modification: append, insert, remove, replace node • DOM parser support for validation • Most support DTD • Some support XML Schema • See: http://www.w3.org/XML/Schema
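
The three groups of DOM operations listed above (navigation, creation, modification) might look as follows with Python's `xml.dom.minidom`, used here as one readily available DOM implementation. The sample document is a cut-down article written without whitespace between tags, so that every child node is an element; that layout is an assumption made to keep the sketch short.

```python
from xml.dom import minidom

# Cut-down article; no whitespace between tags, so no text nodes among the children.
DOC = ("<article><title>XML Research Issues</title>"
       "<initPage>1</initPage><endPage>5</endPage></article>")

dom = minidom.parseString(DOC)
article = dom.documentElement

# Navigation: first/last child and next sibling.
first = article.firstChild        # the <title> element
last = article.lastChild          # the <endPage> element
sibling = first.nextSibling       # the <initPage> element

# Creation: build an <authors><author .../></authors> subtree.
authors = dom.createElement("authors")
author = dom.createElement("author")
author.setAttribute("AuthorPosition", "00")
author.appendChild(dom.createTextNode("Tom Hanks"))
authors.appendChild(author)

# Modification: append the new subtree and remove <initPage>.
article.appendChild(authors)
article.removeChild(sibling)

child_names = [node.tagName for node in article.childNodes]
print(child_names)
```

The whole tree is held in memory while these calls run, which is the materialization cost the DOM slide warns about for large collections.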

  12. XML API: SAX • Event-driven: fire an event for every open tag/end tag • Does not require full parsing: reads XML document in streaming fashion • Read-only interface • Consumes less memory than DOM • Could be significantly faster than DOM

  13. SAX Features • Stack-oriented (LIFO) access • Read-once processing of very large documents • E.g., load XML document into a storage system • SAX parser support for validation • Most support DTD • Microsoft XML Parser (MSXML) supports XML Schema
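
The stack-oriented, read-once style of SAX can be sketched with Python's `xml.sax`. The handler below keeps a stack of open tags and records the root-to-node path of each element as it streams past, which is roughly what a loader feeding a storage system would need. The class name `PathCollector` and the sample document are hypothetical.

```python
import xml.sax


class PathCollector(xml.sax.ContentHandler):
    """Record the path of every element using a LIFO stack of open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements, innermost last
        self.paths = []      # one path per start-tag event, in document order
        self.max_depth = 0

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/" + "/".join(self.stack))
        self.max_depth = max(self.max_depth, len(self.stack))

    def endElement(self, name):
        self.stack.pop()     # the end tag always matches the top of the stack


# Illustrative input; a real loader would stream from a file instead.
DOC = (b"<issue><volume>1</volume><articles><article>"
       b"<title>T</title></article></articles></issue>")

handler = PathCollector()
xml.sax.parseString(DOC, handler)
print(handler.paths)
```

Memory use is bounded by the nesting depth rather than the document size, since only the stack of open tags is kept.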

  14. XSL • Styling is rendering information for consumption • XSL = A language to express styling (“Stylesheet language”) • Two components of a stylesheet • Transform: Source to a target tree using template rules expressed in XSLT • Format: Controls appearance

  15. XSLT • XPATH acts as the pattern language • Primary goal is to transform XML vocabularies to XSL formatting vocabularies • But often adequate for many other transformation needs

  16. XPATH • [www.w3.org/TR/xpath] • Common sub-language of • XSLT: a loosely-typed "scripting" language • XQuery: a strongly-typed query language • Syntax for tree navigation and node selection • Navigation is described using location paths

  17. XPATH • . : current node • .. : parent of the current node • / : root node, or a separator between steps in a path • // : descendants of the current node • @ : attributes of the current node • * : "any" (node with unrestricted name) • [] : a predicate for a given step • [n] : the element with the given ordinal number from a list of elements

  18. XPATH 2.0 • Arithmetic: Expr (+ | - | * | div | mod) Expr • Logical: Expr (or | and) Expr, not(Expr) • Comparison: Expr (= | != | <= | >=) Expr • Conditional: if Expr then Expr else Expr • Iteration: for Var in Expr return Expr • Quantified: (some | every) Var in Expr satisfies Expr

  19. XPATH Example • List the titles of articles that have “Tom Hanks” as an author • //article[.//author="Tom Hanks"]/title • Find the titles of articles authored by “Tom Hanks” in volume 1 • //issue[volume="1"]/articles/article[.//author="Tom Hanks"]/title
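
The two queries can be approximated with Python's `xml.etree.ElementTree`, which implements only a subset of XPath. In this sketch the location paths and the `volume='1'` test are given to `findall`, while the descendant test on `author` is evaluated in Python. The helper `by_author` is hypothetical, and the second and third articles in the sample document are made-up data added so that the filters have something to exclude.

```python
import xml.etree.ElementTree as ET

# Sample data: the first article follows the slides; the other two are invented.
DOC = """<sigmodRecord>
  <issue><volume>1</volume><number>1</number>
    <articles>
      <article><title>XML Research Issues</title>
        <authors><author AuthorPosition="00">Tom Hanks</author></authors>
      </article>
      <article><title>Another Paper</title>
        <authors><author AuthorPosition="00">Someone Else</author></authors>
      </article>
    </articles>
  </issue>
  <issue><volume>2</volume><number>1</number>
    <articles>
      <article><title>Later Work</title>
        <authors><author AuthorPosition="00">Tom Hanks</author></authors>
      </article>
    </articles>
  </issue>
</sigmodRecord>"""

root = ET.fromstring(DOC)


def by_author(articles, name):
    """Titles of the given articles that have a descendant <author> equal to name."""
    return [a.findtext("title") for a in articles
            if any(au.text == name for au in a.findall(".//author"))]


# All articles by Tom Hanks, in document order.
all_titles = by_author(root.findall(".//article"), "Tom Hanks")

# Only articles in volume 1: the [volume='1'] predicate is within ElementTree's subset.
vol1_titles = by_author(
    root.findall(".//issue[volume='1']/articles/article"), "Tom Hanks")

print(all_titles)
print(vol1_titles)
```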

  20. Beyond XPATH • Joining, aggregating XML from multiple documents • Constructing new XML • Recursive processing of recursive XML data • Supported by XSLT & XQuery • Differences between XSLT & XQuery • Safety: XQuery enforces input & output types • Compositionality: XQuery maps XML to XML; XSLT maps XML to anything

  21. XQuery • Functional language • Query is an expression • Expressions are recursively constructed • Includes XPATH as a sub-language • SQL-like FLWR expression • Borrows features from many other languages: XQL, XML-QL, ML, ...

  22. XQuery: FLWR expression • FOR/LET Clauses • Ordered list of tuples of bound variables • WHERE Clause • Pruned list of tuples of bound variables • RETURN Clause • Instance of XML Query data model

  23. XQuery: Example
     List the titles of the articles authored by "Tom Hanks"
     Query Expression
       for $b in document("sigmodRecord.xml")//article
       where $b//author = "Tom Hanks"
       return <title>{ $b/title/text() }</title>
     Query Result
       <title>XML Research Issues</title>

  24. XQuery: Example
     List the articles authored by "Tom Hanks"
     Query Expression
       <articles> {
         for $b in document("sigmodRecord.xml")//article
         where $b//author = "Tom Hanks"
         return $b
       } </articles>
     Query Result
       <articles>
         <article>
           <title>XML:Where are we heading for?</title>
           <initPage>6</initPage>
           <endPage>10</endPage>
           <authors><author AuthorPosition="00">Tom Hanks</author></authors>
         </article>
       </articles>
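
The FOR / WHERE / RETURN stages of a FLWR expression can be mimicked step by step in Python. This is an analogue for intuition only, not an XQuery implementation: FOR produces an ordered list of bindings, WHERE prunes that list, and RETURN constructs new XML from what is left. The second article in the sample input is made-up data.

```python
import xml.etree.ElementTree as ET

# Stand-in for sigmodRecord.xml; the second article is invented for the example.
DOC = ("<sigmodRecord><issue><articles>"
       "<article><title>XML Research Issues</title>"
       "<authors><author>Tom Hanks</author></authors></article>"
       "<article><title>Another Paper</title>"
       "<authors><author>Someone Else</author></authors></article>"
       "</articles></issue></sigmodRecord>")
root = ET.fromstring(DOC)

# FOR: bind $b to every article, in document order.
bindings = list(root.iter("article"))

# WHERE: keep only the bindings with some author equal to "Tom Hanks".
kept = [b for b in bindings
        if any(a.text == "Tom Hanks" for a in b.iter("author"))]

# RETURN: construct a new XML value from the surviving bindings.
result = ET.Element("articles")
for b in kept:
    title = ET.SubElement(result, "title")
    title.text = b.findtext("title")

out = ET.tostring(result, encoding="unicode")
print(out)
```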

  25. Where’s the XML Data? [Figure. Labels: Business Application Logic; Wrap: SOAP/CORBA/Java RMI; Export: legacy databases; Import: warehouse XML data; View: minimal result]

  26. XML and Databases • Data stored in SQL databases need to be published in XML for data exchange • Specification schemes for publishing needed • Efficient publishing algorithms needed • Storage and retrieval of XML documents • Need to support mapping schemes • Need to support data manipulation through XML APIs

  27. Storing XML • Storage is the foundation of efficient XML processing • XML demands its own storage techniques • Characteristics of XML data: Optional elements & values, repetition, choice, inherent order, large text fragments, mixed content • Characteristics of XML queries: Document order & structure, full-text search, transformation • Goals of tutorial • Existing storage features for XML • New storage features for XML

  28. Outline • Introduction • XML Documents • XML Queries • Existing Storage Techniques • Non-native • Native • Physical Storage Features for XML

  29. I. Introduction

  30. Classes of XML Documents • Structured • “Un-normalized” relational data • Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes • Mixed • Structured data embedded in large text fragments • Ex: On-line manuals, transcripts, tax forms • Application may process XML in both classes • Ex: SOAP messages: header is structured; payload is mixed

  31. Structured Data: HL7 Lab Report
     Health-care industry data-exchange format
     <HL7>
       <PATIENT>
         <PID IDNum="PATID1234">
           <PaNa><FaNa>Jones</FaNa><GiNa>William</GiNa></PaNa>
           <DTofBi><date>1961-06-13</date></DTofBi>
           <Sex>M</Sex>
         </PID>
         <OBX SetID="1">
           <ObsVa>150</ObsVa>
           <ObsId>Na</ObsId>
           <AbnFl>Above high</AbnFl>
         </OBX>
         ...

  32. Queries on Structured Data • Analogs of SQL • Select-Project-Join, Sort by value • Ex: Return admission records of patients discharged on 8/30/01 sorted by family and given names • Grouping & schema transformation • Ex: Return per-patient record of admission, lab reports, doctors’ observations

  33. Mixed Data: Library of Congress
     Documents of U.S. Legislation
     <bill bill-stage="Introduction">
       <congress>110th CONGRESS</congress>
       <session>1st Session</session>
       <legis-num>H.R. 133</legis-num>
       <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
       <action date="June 5, 2008">
         <action-desc>
           <sponsor>Mr. English</sponsor> (for himself and <cosponsor>Mr. Coyne</cosponsor>) introduced the following bill; which was referred to the <committee-name>Committee on Financial Services</committee-name> ...
         </action-desc>

  34. Queries on Mixed Data • Full-text search operators • Ex: Find all <bill>s where "striking" & "amended" are within 6 intervening words • Queries on structure & text • Ex: Return <text> element containing both "exemption" & "social security" and preceding & following <text> elements • Queries that span (ignore) structure • Ex: Return <bill> that contains "referred to the Committee on Financial Services"
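
The first and third kinds of query above both treat the text of a `<bill>` as one stream that ignores markup. A minimal Python sketch of the proximity operator follows; the function names `within` and `bill_text`, the tokenization rule (lowercased runs of word characters), and both sample bills are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET


def bill_text(bill):
    """Concatenate all character data under an element, ignoring the tags."""
    return " ".join(bill.itertext())


def within(text, w1, w2, max_between):
    """True if w1 and w2 occur with at most max_between words between them."""
    words = re.findall(r"\w+", text.lower())
    pos1 = [i for i, w in enumerate(words) if w == w1]
    pos2 = [i for i, w in enumerate(words) if w == w2]
    return any(abs(i - j) - 1 <= max_between for i in pos1 for j in pos2)


# Invented sample bills: the first has the two words close together and split
# across element boundaries; the second has them far apart.
NEAR = ET.fromstring(
    "<bill><text>Section 2 is <quote>amended</quote> by striking "
    "the first sentence</text></bill>")
FAR = ET.fromstring(
    "<bill><text>The Act is amended as follows. In the first paragraph of "
    "section four, insert a comma after striking.</text></bill>")

hits = [within(bill_text(b), "striking", "amended", 6) for b in (NEAR, FAR)]
print(hits)
```

A storage system that only keeps shredded element values would have to reassemble this text stream before it could answer such a query, which is why mixed content is singled out on these slides.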

  35. Properties of XML Data • Variance in structured content • Elements of same type have different structure • Nested sub-element might depend on parent • Direct access to sub-element not required • Order significant in sequence & mixed content • Structured data embedded in text • Schema known a priori or “open content model” • Desirable: explicit support in storage system

  36. Properties of Queries • Query expressions depend on data properties • Variance • /PATIENT/(SURGERY | CHECK-UP) • Document order: XPath axes • /bill/co-sponsor[./text() = "Mrs. Clinton" and following-sibling::co-sponsor/text() = "Mr. Torricelli"] • Node identity: equality, union/intersect/except • If not supported in the storage system, then operators are semantically incorrect or incomplete

  37. II. Existing Storage Techniques

  38. Storage Techniques • Non-native • (Object) Relational, OO, LDAP directories • Indexing, recovery, transactions, updates, optimizers • Mapping from XML to target data model necessary • Captures variance in structured content • No support for mixed content • Recovering XML documents is expensive! • Native • Logical data model is XML • Physical storage features designed for XML

  39. Non-native Techniques • Generic • Mapping from XML data to relational tables • Models XML as tree: semi-structured approach • Does not use DTD or XML Schema • Schema-driven • Mapping from schema constructs to relational • Fixed mapping from DTD to relational schema • Flexible mapping from XML Schema to relational • User-defined • Labor-intensive

  40. Generic Mappings • Edge relation • Store all edges in one table • Scalar values stored in separate table • Attribute relations • Horizontal partition of Edge relation • Scalar values inlined in same table • Universal relation • Full outer-join, redundancy • Captures node identity & document order • Element reconstruction requires multiple joins

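  41. Edge Relation Example [Figure: the HL7 document drawn as an edge-labeled tree. Node &0 has an HL7 edge to &1, which has a PATIENT edge to &2; &2 has PID and OBX edges to &3 and &4 (…); &3 has @IDNum, PaNa and DTofBi edges to &5, &6 and &7, with values PATID1234 and “Jones Wm”; &7 has a date edge to &8 with value 1961-06-13. The tree is shown next to the resulting Edge Table and Value Table.]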
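
A minimal sketch of the edge-relation mapping, assuming a layout of `edge_tab(source, ordinal, name, target)` plus a separate `value_tab(node, val)`; these table and column names, the node numbering, and the shortened HL7 fragment are all choices made for this example rather than the layout of any particular system. It uses Python's `sqlite3` and shows why reconstruction needs one self-join per path step.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Shortened HL7 fragment (PaNa flattened to a single string).
DOC = ('<HL7><PATIENT><PID IDNum="PATID1234">'
       '<PaNa>Jones Wm</PaNa><DTofBi><date>1961-06-13</date></DTofBi>'
       '</PID></PATIENT></HL7>')


def shred(root):
    """Turn a tree into (source, ordinal, name, target) edges and (node, val) values."""
    edges, values = [], []
    counter = [0]                      # node 0 is the document root

    def new_id():
        counter[0] += 1
        return counter[0]

    def visit(elem, parent_id, ordinal):
        node_id = new_id()
        edges.append((parent_id, ordinal, elem.tag, node_id))
        if elem.text and elem.text.strip():
            values.append((node_id, elem.text.strip()))
        pos = 0
        for name, val in elem.attrib.items():   # attributes become '@name' edges
            pos += 1
            attr_id = new_id()
            edges.append((node_id, pos, "@" + name, attr_id))
            values.append((attr_id, val))
        for child in elem:                       # ordinal keeps document order
            pos += 1
            visit(child, node_id, pos)

    visit(root, 0, 1)
    return edges, values


edges, values = shred(ET.fromstring(DOC))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge_tab (source INT, ordinal INT, name TEXT, target INT)")
conn.execute("CREATE TABLE value_tab (node INT, val TEXT)")
conn.executemany("INSERT INTO edge_tab VALUES (?, ?, ?, ?)", edges)
conn.executemany("INSERT INTO value_tab VALUES (?, ?)", values)

# /HL7/PATIENT/PID/PaNa: four path steps, hence four copies of edge_tab.
row = conn.execute("""
    SELECT v.val
    FROM edge_tab e1
    JOIN edge_tab e2 ON e2.source = e1.target
    JOIN edge_tab e3 ON e3.source = e2.target
    JOIN edge_tab e4 ON e4.source = e3.target
    JOIN value_tab v ON v.node = e4.target
    WHERE e1.source = 0 AND e1.name = 'HL7' AND e2.name = 'PATIENT'
      AND e3.name = 'PID' AND e4.name = 'PaNa'
""").fetchone()
print(row)
```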

  42. Generic Mappings: LDAP Directories • Flexible schema; easy schema evolution • Supports heterogeneous elements with optional values • Captures node identity & document order • Query language captures subset of XPath

  43. LDAP Example
     XMLElement OC {
       SUBCLASS OF {XMLNode}
       MUST CONTAIN {order}
       MAY CONTAIN {value}
       TYPE order INTEGER
       TYPE value STRING
     }
     XMLAttribute OC {
       SUBCLASS OF {XMLNode}
       MUST CONTAIN {value}
       TYPE value STRING
     }
     Entries:
       oc: XMLElement    oid: 1    name: PID    order: 1
       oc: XMLAttribute  oid: 1.1  name: IDNum  value: PATID1234
       oc: XMLElement    oid: 1.2  name: PaNa   order: 1  value: Jones Wm
     [Figure: tree rooted at PID with children @IDNum (“PATID1234”), PaNa (“Jones Wm”), DTofBi (date: 1961-06-13) and Sex (M)]

  44. Schema-driven Mappings • Repetition: separate tables • Non-repeated sub-elements may be “inlined” • Optionality: nullable fields • Choice: multiple tables or universal table • Order: explicit ordinal value • Mixed content ignored • Element reconstruction may require multi-table joins because of normalization

  45. Fixed Mapping: Hybrid Inlining
     <!ELEMENT PATIENT (Name, (OBX)*)>
     <!ELEMENT OBX (Name, Value)>
     <!ELEMENT Name (#PCDATA)>
     <!ELEMENT Value (#PCDATA)>
     [Figure: DTD graph in which PATIENT points to Name and, through *, to OBX; OBX points to Name and Value. Resulting relations: PATIENT, OBX]
     • Element with in-degree = 0 or > 1 in DTD graph → relation
     • Elements with in-degree = 1 inlined except those reached by *
     • Non-* & non-recursive elements with in-degree > 1 inlined
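
One reading of the three rules above, taken together, is that an element gets its own relation only if it has in-degree 0, is reached through a `*`, or is recursive; everything else is inlined into its parent. The Python sketch below applies that reading to the four-element DTD. The dictionary encoding of the DTD graph, the function names, and the simplification that any element on a cycle counts as recursive are assumptions of this example.

```python
# DTD graph transcribed by hand from the four <!ELEMENT> declarations:
# element -> list of (child, reached_through_star).
DTD = {
    "PATIENT": [("Name", False), ("OBX", True)],
    "OBX": [("Name", False), ("Value", False)],
    "Name": [],
    "Value": [],
}


def reachable(dtd, start):
    """All elements reachable from `start` by following one or more edges."""
    seen, todo = set(), [child for child, _ in dtd[start]]
    while todo:
        elem = todo.pop()
        if elem not in seen:
            seen.add(elem)
            todo.extend(child for child, _ in dtd[elem])
    return seen


def hybrid_relations(dtd):
    """Elements that get their own relation under the reading stated above."""
    in_degree = {elem: 0 for elem in dtd}
    starred = set()
    for children in dtd.values():
        for child, star in children:
            in_degree[child] += 1
            if star:
                starred.add(child)
    return {elem for elem in dtd
            if in_degree[elem] == 0
            or elem in starred
            or elem in reachable(dtd, elem)}     # on a cycle: recursive


def columns(dtd, relations, elem):
    """Inlined descendants of a relation element become its columns."""
    cols = []
    for child, _ in dtd[elem]:
        if child not in relations:
            cols.append(child)
            cols.extend(child + "_" + c for c in columns(dtd, relations, child))
    return cols


relations = hybrid_relations(DTD)
schema = {rel: columns(DTD, relations, rel) for rel in sorted(relations)}
print(schema)
```

Name has in-degree 2 but is neither starred nor recursive, so it is inlined into both PATIENT and OBX, matching the two relations shown on the slide.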

  46. Flexible Mapping: LegoDB • Canonical mapping from XML Schema to relational • Every complex type → relation • Semantic-preserving XML Schema to XML Schema transformations • Ex: Inlining/outlining, Union factorization/distribution, Repetition split • Greedy algorithm for choosing mapping • Mapping cost determined by query mix • Use relational optimizer to estimate cost of mapping

  47. LegoDB Example
     • Inline type in parent vs. outline type in own relation
     Outlined:
       XML:
         type OBX = element value { Integer }, type Description
         type Description = element description { String }
       Relational:
         TABLE OBX (OBX_id INT, value STRING, parent_PATIENT INT)
         TABLE Description (Description_id INT, description STRING, parent_OBX INT)
     Inlined:
       XML:
         type OBX = element value { Integer }, element description { String }
       Relational:
         TABLE OBX (OBX_id INT, value STRING, description STRING, parent_PATIENT INT)

  48. User-Defined Mappings • No automatic translation from DTD or XML Schema • Annotated schemas or special-purpose queries • Value-based semantics only • Document structure represented by keys/foreign keys • No explicit representation of document order or node identity • Some support for mixed content

  49. Oracle 9i
     • Canonical mapping into user-defined object-relational tables • Arbitrary XML input • XSLT preprocessing into multiple XML documents, load individually • Stores XML documents in CLOBs (character large objects) • Permits full-text search • Hybrid of canonical mapping & CLOB
     <row>
       <Person>
         <Name><FN>…</FN><LN>…</LN></Name>
         <Addr><City>…</City></Addr>*
       </Person>
     </row>
     table PERSON(Name NAME, Alist ALIST)
     object NAME(FN STR, LN STR)
     table ALIST of ADDR
     object ADDR(City CITY)

  50. IBM DB2 XML Extender
     • Declarative decomposition of arbitrary XML • Pure relational mapping (no object features used)
     <element_node name="Order">
       <table name="order_tab"/>
       <table name="part_tab"/>
       <condition>order_tab.order_key = part_tab.order_key</condition>
       <attribute_node name="key">
         <table name="order_tab"/>
         <column name="order_key"/>
       </attribute_node>
     </element_node>
     • Mixed content: CLOBs + side tables for indexing structured data embedded in text
