XML Data

XML Data
  <book> <title> database systems</title> <author> John <lastname> Korth</lastname></author> <price currency="USD"> 5.87</price> </book>
• DTD
  <!ELEMENT book (title, author+, price)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA | lastname)*>

HTML Version
  <tr>
    <td width="20%" valign="top"> Firma Karl-Heinz Rosowski </td>
    <td width="20%" valign="top"> Maikstraße 14 </td>
    <td width="20%" valign="top"> 22041 Hamburg </td>
    <td width="20%" valign="top"> 721 99 64 </td>
    <td width="20%" valign="top"> 21110111 </td>
  </tr>
XML Version
  <?xml version="1.0"?>
  <Addresses>
    <Address id="12359">
      <Name>Firma Karl-Heinz Rosowski</Name>
      <Street>Maikstraße 14</Street>
      <ZIP>22041</ZIP>
      <City>Hamburg</City>
      <Tel>721 99 64</Tel>
      <Fax>21110111</Fax> <Email/>
    </Address> …
  </Addresses>
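To illustrate why the XML version is easier to process than the HTML table row, here is a minimal Python sketch that reads the record with the standard library's xml.etree.ElementTree. The helper name address_to_dict is an assumption made for this example, and only the single Address shown on the slide is included.

```python
import xml.etree.ElementTree as ET

# The XML version of the address record from the slide (one Address only).
ADDRESSES_XML = """<?xml version="1.0"?>
<Addresses>
  <Address id="12359">
    <Name>Firma Karl-Heinz Rosowski</Name>
    <Street>Maikstraße 14</Street>
    <ZIP>22041</ZIP>
    <City>Hamburg</City>
    <Tel>721 99 64</Tel>
    <Fax>21110111</Fax>
    <Email/>
  </Address>
</Addresses>"""


def address_to_dict(address):
    """Map each child tag to its text; empty elements such as <Email/> become ''.

    Hypothetical helper: repeated tags (several Tel elements) would overwrite
    each other here, which is acceptable only for this one-record sketch.
    """
    record = {"id": address.get("id")}
    for child in address:
        record[child.tag] = (child.text or "").strip()
    return record


root = ET.fromstring(ADDRESSES_XML)
records = [address_to_dict(a) for a in root.findall("Address")]
print(records[0]["Name"], "/", records[0]["City"])
```

In the HTML row, the same lookup would depend on the position of each <td>; here each field is found by its tag name.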

XML - Document - Continued
• <?xml version="1.0"?> is the XML declaration.
• Elements: the most common form of markup, written <element> … </element>. For example, <name>Jack Lemon</name>.
• Attributes: name-value pairs that occur inside start-tags after the element name. For example, <Address id="12359"> attaches the value 12359 to the attribute id of the Address element.
• Entity references: used to write characters that are special in XML, such as "<" (written &lt;), inside XML documents.

• Comments: <!-- this is a comment -->
• CDATA sections: a CDATA (character data) section instructs the parser to ignore most markup characters. For example, source code can be embedded as <![CDATA[ *p = &q; b = (I <= 3);]]>. Between <![CDATA[ and ]]>, all character data is passed to the application without interpretation.
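The two ways of protecting special characters can be compared in a short Python sketch. It uses xml.sax.saxutils.escape and xml.etree.ElementTree from the standard library; the wrapper element name code is an assumption chosen for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The source-code snippet from the slide, containing "&" and "<".
source = "*p = &q; b = (I <= 3);"

# Option 1: entity references. escape() rewrites &, < and > as &amp;, &lt;, &gt;.
escaped_doc = "<code>" + escape(source) + "</code>"

# Option 2: a CDATA section. The text is written as-is between the delimiters.
cdata_doc = "<code><![CDATA[" + source + "]]></code>"

# Either way, the parser hands the original characters to the application.
text_from_entities = ET.fromstring(escaped_doc).text
text_from_cdata = ET.fromstring(cdata_doc).text
print(escaped_doc)
print(text_from_entities == text_from_cdata == source)
```

Writing the snippet directly inside <code> without either mechanism would make the document ill-formed, because the parser would try to read "&q;" as an entity reference.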

XML - DTD - Element Type Declarations
• Element type declarations: identify the names of elements and the nature of their content. A typical element type declaration looks like:
  <!ELEMENT Address (Name, Street, ZIP?, City, Tel+, Fax*, Email?)>
• Address is the element name, and (Name, Street, ZIP?, City, Tel+, Fax*, Email?) is the content model. Every Address must contain Name, Street, and City, plus one or more Tel elements. ZIP and Email are optional, and there can be zero or more Fax numbers.

• The declarations for Name, Street, ZIP, … must also be given. For example:
  <!ELEMENT Name (#PCDATA)>
• Attribute list declarations: identify which elements may have attributes, what values the attributes may hold, and what the default value is. Attribute values appear only within start-tags and empty-element tags, for example:
  <Address id="12359">
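The occurrence operators in a content model (?, +, *) behave like the same operators in a regular expression. The Python sketch below makes that analogy concrete by checking the sequence of child tags of an Address element against a hand-written regular expression. This is only an illustration of the content model (Name, Street, ZIP?, City, Tel+, Fax*, Email?); it is not a DTD validator, and the names ADDRESS_MODEL and matches_address_model are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Content model (Name, Street, ZIP?, City, Tel+, Fax*, Email?) written as a
# regular expression over child tag names, each followed by one space.
ADDRESS_MODEL = re.compile(r"Name Street (ZIP )?City (Tel )+(Fax )*(Email )?")


def matches_address_model(address):
    """Return True if the child tags of `address` follow the content model."""
    child_tags = "".join(child.tag + " " for child in address)
    return ADDRESS_MODEL.fullmatch(child_tags) is not None


# ZIP, Fax and Email omitted, two Tel elements: allowed by the model.
ok = ET.fromstring(
    "<Address><Name>A</Name><Street>B</Street><City>C</City>"
    "<Tel>1</Tel><Tel>2</Tel></Address>"
)
# No Tel element: rejected, because Tel+ requires at least one.
bad = ET.fromstring(
    "<Address><Name>A</Name><Street>B</Street><City>C</City></Address>"
)
print(matches_address_model(ok), matches_address_model(bad))
```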

XML - Summary
• HTML describes presentation
• XML describes content
• XML vs. HTML:
• users define new tags
• arbitrary nesting
• validation is possible

XML and Semistructured Data Model
• XML data is fundamentally different from relational and object-oriented data.
• XML is not rigidly structured.
• In the relational and OO data models, every data instance has a schema that is separate from and independent of the data.
• XML data is self-describing and can naturally model irregularities that cannot be modeled by the relational or OO data models.

• For example, data items may have missing elements or multiple occurrences of the same element; elements may have atomic values in some data items and structured values in others; and collections of elements can have heterogeneous structure.
• Even XML data that has an associated DTD is self-describing (the schema is always stored with the data) and, except for very restricted forms of DTDs, may have all the irregularities described above.
• XML is an instance of semistructured data.

XML-QL
• Regular path expressions
• Pattern matching
• Uses edge-labeled graphs
• Extracts data from existing XML documents and constructs new XML documents
• Supports ordered and unordered views of an XML document
• Simple and declarative

XML-QL
• The simplest XML-QL queries extract data from an XML document. Consider the following DTD:
  <!ELEMENT book (author+, title, publisher)>
  <!ATTLIST book year CDATA>
  <!ELEMENT article (author+, title, year?, (shortversion | longversion))>
  <!ATTLIST article type CDATA>
  <!ELEMENT publisher (name, address)>
  <!ELEMENT author (firstname?, lastname)>

XML-QL Example Data
  <bib>
    <book year="1995">
      <title> An Introduction to DB Systems </title>
      <author> <lastname> Date </lastname></author>
      <publisher><name> Addison-Wesley</name> </publisher>
    </book>
    <book year="1995">
      <title> Foundations for OR Databases </title>
      <author> <lastname> Date </lastname></author>
      <author> <lastname> Darwen </lastname></author>
      <publisher><name> Addison-Wesley</name> </publisher>
    </book>
  </bib>

Matching Data Using Patterns
• XML-QL uses element patterns to match data in an XML document.
• Find all authors of books whose publisher is Addison-Wesley in the XML document www.a.b.c/bib.xml:
  WHERE <book>
          <publisher><name>Addison-Wesley</name></publisher>
          <title> $t </title>
          <author> $a </author>
        </book> IN "www.a.b.c/bib.xml"
  CONSTRUCT $a
• The query matches every <book> element in the XML document that has at least one <title> element, one <author> element, and one <publisher> element whose <name> is Addison-Wesley. For each such match, it binds $t and $a to every title and author pair.
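The binding behavior described above can be imitated in Python to see which (title, author) pairs a pattern produces. This is a sketch of the matching semantics only, not an XML-QL implementation; the function name match_books is an assumption, and the data is the example document from the earlier slide.

```python
import xml.etree.ElementTree as ET

# Example data from the slides.
BIB_XML = """<bib>
  <book year="1995">
    <title> An Introduction to DB Systems </title>
    <author> <lastname> Date </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
  <book year="1995">
    <title> Foundations for OR Databases </title>
    <author> <lastname> Date </lastname> </author>
    <author> <lastname> Darwen </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
</bib>"""


def match_books(bib, publisher_name):
    """Yield one (title, author) element pair per binding of $t and $a."""
    for book in bib.iter("book"):
        names = [(n.text or "").strip() for n in book.findall("publisher/name")]
        if publisher_name not in names:
            continue  # the <publisher><name> part of the pattern fails
        for title in book.findall("title"):
            for author in book.findall("author"):
                yield title, author


bib = ET.fromstring(BIB_XML)
bindings = [
    (t.text.strip(), a.findtext("lastname").strip())
    for t, a in match_books(bib, "Addison-Wesley")
]
for title, lastname in bindings:
    print(lastname, "|", title)
```

The second book yields two bindings, one per author, which is why a flat CONSTRUCT clause repeats the title for each author.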

XML-QL Constructing XML Data
• Often we would like to format the result.
• Find all authors and titles of books whose publisher is Addison-Wesley in the XML document www.a.b.c/bib.xml:
  WHERE <book>
          <publisher><name>Addison-Wesley</></>
          <title> $t </title>
          <author> $a </author>
        </book> IN "www.a.b.c/bib.xml"
  CONSTRUCT <result>
              <author> $a </>
              <title> $t </>
            </>

Constructing XML Data -cont.
Result of the query:
  <result>
    <author><lastname> Date </lastname></author>
    <title> An Introduction to DB Systems </title>
  </result>
  <result>
    <author><lastname> Date </lastname></author>
    <title> Foundations for OR Databases </title>
  </result>
  <result>
    <author><lastname> Darwen </lastname></author>
    <title> Foundations for OR Databases </title>
  </result>
One result is produced for each author, so title information is duplicated.
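A rough Python equivalent of the CONSTRUCT clause builds one new <result> element per binding. The sketch below is an approximation using xml.etree.ElementTree; the function name construct_results is an assumption.

```python
import copy
import xml.etree.ElementTree as ET

# Example data from the slides.
BIB_XML = """<bib>
  <book year="1995">
    <title> An Introduction to DB Systems </title>
    <author> <lastname> Date </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
  <book year="1995">
    <title> Foundations for OR Databases </title>
    <author> <lastname> Date </lastname> </author>
    <author> <lastname> Darwen </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
</bib>"""


def construct_results(bib, publisher_name):
    """Build one <result> with an <author> and a <title> per (title, author) pair."""
    results = []
    for book in bib.iter("book"):
        if book.findtext("publisher/name", default="").strip() != publisher_name:
            continue
        for title in book.findall("title"):
            for author in book.findall("author"):
                result = ET.Element("result")
                # Copy the matched elements so the source document is not modified.
                result.append(copy.deepcopy(author))
                result.append(copy.deepcopy(title))
                results.append(result)
    return results


results = construct_results(ET.fromstring(BIB_XML), "Addison-Wesley")
for r in results:
    print(ET.tostring(r, encoding="unicode"))
```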

XML-QL Nested Queries
  WHERE <book>
          <title> $t </>
          <publisher><name>Addison-Wesley</></>
        </> CONTENT_AS $p IN "www.a.b.c/bib.xml"
  CONSTRUCT <result>
              <title> $t </>
              WHERE <author> $a </> IN $p
              CONSTRUCT <author> $a </>
            </>
Result of the query:
  <result>
    <title> An Introduction to DB Systems </title>
    <author><lastname> Date </lastname></author>
  </result>
  <result>
    <title> Foundations for OR Databases </title>
    <author><lastname> Date </lastname></author>
    <author><lastname> Darwen </lastname></author>
  </result>
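The effect of nesting can be sketched in Python as an outer loop over books and an inner loop over the authors found in the content of each matched book (the role played by $p). This is an approximation of the query's behavior, not an XML-QL engine; the function name construct_grouped_results is an assumption.

```python
import copy
import xml.etree.ElementTree as ET

# Example data from the slides.
BIB_XML = """<bib>
  <book year="1995">
    <title> An Introduction to DB Systems </title>
    <author> <lastname> Date </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
  <book year="1995">
    <title> Foundations for OR Databases </title>
    <author> <lastname> Date </lastname> </author>
    <author> <lastname> Darwen </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
</bib>"""


def construct_grouped_results(bib, publisher_name):
    """Build one <result> per title, with all authors of that book nested inside."""
    results = []
    for book in bib.iter("book"):
        if book.findtext("publisher/name", default="").strip() != publisher_name:
            continue
        for title in book.findall("title"):
            result = ET.Element("result")
            result.append(copy.deepcopy(title))
            # Inner query: every <author> in the content of the matched book.
            for author in book.findall("author"):
                result.append(copy.deepcopy(author))
            results.append(result)
    return results


grouped = construct_grouped_results(ET.fromstring(BIB_XML), "Addison-Wesley")
for r in grouped:
    print(ET.tostring(r, encoding="unicode"))
```

Compared with the flat query, the title appears once per book and the authors are grouped under it.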

XML-QL Join Queries
• XML-QL queries can express "joins" by matching two or more elements that contain the same value.
• Find all articles that have at least one author who has written a book since 1995:
  WHERE <article>
          <author>
            <firstname> $f </>   // firstname $f
            <lastname> $l </>    // lastname $l
          </>
        </> CONTENT_AS $a IN "www.a.b.c/bib.xml",
        <book year=$y>
          <author>
            <firstname> $f </>   // join on same firstname $f
            <lastname> $l </>    // join on same lastname $l
          </>
        </> IN "www.a.b.c/bib.xml",
        $y > 1995
  CONSTRUCT <article> $a </>
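The join on $f and $l can be approximated in Python by collecting the (firstname, lastname) pairs of recent book authors and then testing each article's authors against that set. The sample data below is hypothetical (the slides give no article data), and the names author_key and articles_by_recent_book_authors are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped like the bib DTD; not taken from the slides.
SAMPLE_XML = """<bib>
  <book year="1998">
    <author><firstname>Ann</firstname><lastname>Lee</lastname></author>
    <title>Book A</title>
  </book>
  <book year="1990">
    <author><firstname>Bob</firstname><lastname>Ray</lastname></author>
    <title>Book B</title>
  </book>
  <article>
    <author><firstname>Ann</firstname><lastname>Lee</lastname></author>
    <title>Article 1</title>
  </article>
  <article>
    <author><firstname>Bob</firstname><lastname>Ray</lastname></author>
    <title>Article 2</title>
  </article>
</bib>"""


def author_key(author):
    """Return (firstname, lastname), or None if either element is missing."""
    first = author.findtext("firstname")
    last = author.findtext("lastname")
    if first is None or last is None:
        return None  # the pattern needs both elements to bind $f and $l
    return (first.strip(), last.strip())


def articles_by_recent_book_authors(bib, after_year):
    """Articles with at least one author who also wrote a book with year > after_year."""
    recent_authors = set()
    for book in bib.iter("book"):
        if int(book.get("year", "0")) > after_year:
            for author in book.findall("author"):
                key = author_key(author)
                if key is not None:
                    recent_authors.add(key)
    return [
        article
        for article in bib.iter("article")
        if any(author_key(a) in recent_authors for a in article.findall("author"))
    ]


matches = articles_by_recent_book_authors(ET.fromstring(SAMPLE_XML), 1995)
titles = [m.findtext("title") for m in matches]
print(titles)
```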

XML-QL Data Model for XML
• An XML graph is a graph G in which each node is represented by a unique string called an object identifier (OID); G's edges are labeled with element tags; G's nodes are labeled with sets of attribute-value pairs; G's leaves are labeled with one string value; and G has a distinguished node called the root.

XML-QL Data Model for XML
• The model allows several edges between the same two nodes, with the following restrictions: (1) between any two nodes there can be at most one edge with a given label; (2) a node cannot have two leaf children with the same label and the same string value.
• XML graphs are not only derived from XML documents, but are also generated by queries.
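To make the graph model more tangible, the following Python sketch converts a small XML document into nodes with OIDs, attribute sets, and labeled edges. It is a simplification under stated assumptions: an element without child elements is treated as a leaf carrying its text, mixed content is ignored, and the OID scheme (o1, o2, ...) and the names Node and build_graph are invented for the example.

```python
import itertools
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    oid: str                                        # unique object identifier
    attributes: dict = field(default_factory=dict)  # attribute-value pairs
    value: Optional[str] = None                     # string value, leaves only
    edges: list = field(default_factory=list)       # (element tag, child Node)


def build_graph(xml_text):
    """Return the distinguished root node of a tree-shaped XML graph."""
    counter = itertools.count(1)

    def visit(element):
        node = Node(oid=f"o{next(counter)}", attributes=dict(element.attrib))
        if len(element) == 0:
            # Simplification: an element with no children becomes a leaf.
            node.value = (element.text or "").strip()
        for child in element:
            # The edge to each child is labeled with the child's element tag.
            node.edges.append((child.tag, visit(child)))
        return node

    document_element = ET.fromstring(xml_text)
    root = Node(oid="root")
    root.edges.append((document_element.tag, visit(document_element)))
    return root


graph = build_graph('<book year="1995"><title> Foundations for OR Databases </title></book>')
label, book = graph.edges[0]
print(label, book.oid, book.attributes, book.edges[0][0], book.edges[0][1].value)
```

Because every element gets a fresh node here, the restriction on duplicate edge labels between two nodes holds trivially; it becomes relevant once ID and IDREF attributes or queries introduce shared nodes.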

XML- Element Identity, Ids, and IDREFS
• For element sharing, XML reserves an attribute of type ID, which allows a unique key to be associated with an element.
• An attribute of type IDREF allows an element to refer to another element with the designated key, and one of type IDREFS may refer to multiple elements.

  <!ATTLIST person ID ID #REQUIRED>
  <!ATTLIST article author IDREFS #IMPLIED>
  <person ID="o123">
    <firstname>John</firstname>
    <lastname>Smith</lastname>
  </person>
  <person ID="o234">
    . . .
  </person>
  <article author="o123 o234">
    <title> ... </title>
    <year> 1995 </year>
  </article>

XML- Element Identity, Ids, and IDREFS

• The following query produces all lastname, title pairs by joining the article element's author attribute (of type IDREFS) with the person element's ID attribute value.
  WHERE <article author=$i>
          <title> </> ELEMENT_AS $t
        </>,
        <person ID=$i>
          <lastname> </> ELEMENT_AS $l
        </>
  CONSTRUCT <result> $t $l </>
• The idiom <title></> ELEMENT_AS $t binds $t to a <title> element with arbitrary contents. The element expression <title/> matches a <title> element with empty contents.
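A Python approximation of this ID/IDREFS join first indexes person elements by their ID attribute and then resolves each reference listed in an article's author attribute. The sketch assumes that each ID in the IDREFS value is matched separately. The wrapper element db, the second person's name, and the article title are hypothetical fill-ins, since the slide elides them; the function name lastname_title_pairs is also an assumption.

```python
import xml.etree.ElementTree as ET

# Modeled on the slide's data. The <db> wrapper, "Jane Doe", and
# "Sample Title" are hypothetical; the slide leaves those parts elided.
DATA_XML = """<db>
  <person ID="o123">
    <firstname>John</firstname>
    <lastname>Smith</lastname>
  </person>
  <person ID="o234">
    <firstname>Jane</firstname>
    <lastname>Doe</lastname>
  </person>
  <article author="o123 o234">
    <title>Sample Title</title>
    <year>1995</year>
  </article>
</db>"""


def lastname_title_pairs(root):
    """Join article/@author references with person/@ID; return (title, lastname) pairs."""
    people_by_id = {p.get("ID"): p for p in root.iter("person")}
    pairs = []
    for article in root.iter("article"):
        # An IDREFS value is a whitespace-separated list of IDs.
        for ref in article.get("author", "").split():
            person = people_by_id.get(ref)
            if person is None:
                continue  # dangling reference: no person binds $i
            for title in article.findall("title"):
                for lastname in person.findall("lastname"):
                    pairs.append(((title.text or "").strip(), (lastname.text or "").strip()))
    return pairs


pairs = lastname_title_pairs(ET.fromstring(DATA_XML))
print(pairs)
```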

XML-QL- Advanced Examples
• Tag variables
• Regular path expressions
• Transforming XML data (from one DTD to another)
• Integrating data from different XML sources
• Embedding queries in data
• For more on XML-QL, see http://www.w3.org/TR/NOTE-xml-ql
