
Querying XML Documents and Data

Querying XML Documents and Data. CBU Summer School 13.8. - 20.8.2007 (2 ECTS). Prof. Pekka Kilpeläinen, Univ of Kuopio, Dept of Computer Science, Pekka.Kilpelainen@cs.uku.fi.


  1. Querying XML Documents and Data
     CBU Summer School 13.8. - 20.8.2007 (2 ECTS)
     Prof. Pekka Kilpeläinen, Univ of Kuopio, Dept of Computer Science
     Pekka.Kilpelainen@cs.uku.fi

  2. Introduction & Motivation
     (Figure labels: order, invoice, XML, Internet)
     • XML appears everywhere
     • How to query it?
     Querying XML: Introduction

  3. Main Topic: Two XML Query Models
     • Region algebra
       • for retrieval of structured text
       • "lightweight"
         • reduced language; for ad-hoc files; efficient free implementation
     • XQuery
       • for general querying/manipulation of XML
       • "heavy"
         • comprehensive and complex language; for (data viewed as) XML only; production-use implementations?

  4. Course Outline
     Intro and Arrangements; Structured documents
     1 Review of XML Basics
       1.1 XML and XML docs; 1.2 Document grammars; 1.3 XML DTDs; 1.4 XML Namespaces; 1.5 XML Schema
     2 Region Algebra and sgrep
     3 W3C XQuery, and XPath 2.0
     (Apologies for potential disorganization!)

  5. Arrangements
     • Background:
       • W3C Recommendations (XML, XQuery)
       • Reports (Region Algebra, sgrep)
       • Earlier courses; own research and experiments
     • Some material (to be posted) at http://www.cs.uku.fi/~kilpelai/CBU07/
     • Plan: Lectures 12 h; hands-on exercises 8 h

  6. Structured Documents
     • Document:
       • a structured representation of information on some medium (message)
       • normally for a human reader
         • memos, manuals, articles, books, …
       • also application-to-application messages
         • e.g., between client and server in Web Services
         • "prose-oriented XML" vs "data-oriented XML"
       • can be treated as a single unit
         • (a web page vs a web site)

  7. Presentation vs Structure
     • Presentation informs the human reader about the meaning of text and the role of its parts
     • Markup indicates the presentation or the meaning of different parts of text
       • originally hand-written annotations for the typesetter
       • nowadays primarily codes embedded in digital documents; <Tags>

  8. Markup and Markup Language
     • Procedural markup
       • commands (start boldface, produce empty line, indent 5 mm, ...)
       • proprietary word processor formats, nroff, TeX, ...
     • Descriptive or generic markup
       • indicates conceptual structures using chosen names
       • LaTeX: \begin{abstract} ... \end{abstract}
       • HTML: <TITLE> ... </TITLE>
     • Markup language
       • a fixed set of markup notations (e.g. nroff, TeX, HTML, SVG, …)

  9. Structure in Documents
     • Hierarchy or nesting is ubiquitous
       • Sections with subsections, etc.
       • (Also overlapping hierarchies!)
     • Linear order is essential in prose documents
       • less important in documents representing data objects
     • Hypertext and cross-references
     • XML: proper hierarchies, tree-like structures, with cross-references via attribute values

  10. 1 Document Instances and Grammars
      Overview of fundamentals, and some details, of XML
      1.1 XML and XML documents
      1.2 Basics of document grammars
      1.3 Basics of XML DTDs
      1.4 XML Namespaces
      1.5 XML Schema

  11. 1.1 XML and XML documents
      • XML - Extensible Markup Language, W3C Recommendation, February 1998
        • not an official standard, but a stable industry standard
        • 2nd Ed 2000, 3rd Ed 2004, 4th Ed 2006
          • editorial revisions, not new versions of XML 1.0
      • a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987
        • valid XML documents are also SGML documents

  12. What is XML?
      • Extensible Markup Language is not a markup language!
        • it does not fix a tag set nor its semantics (as markup languages such as HTML do)
      • XML documents have no inherent (processing or presentation) semantics
        • even though many think that XML is semantic or self-describing; see next

  13. Semantics of XML Markup
      • Meaning of this XML fragment?
      • The application has to "understand" the tags
      • Still, we are better off with the tags!

  14. What is XML (2)?
      • XML is
        • a way to use markup to represent information
        • a metalanguage
          • supports definition of specific markup languages through XML DTDs (Document Type Definitions) or Schemas
          • e.g. XHTML, a reformulation of HTML using XML
      • Often "XML" means XML plus related XML technology

  15. How does it look?
          <?xml version='1.0' encoding="iso-8859-1" ?>
          <invoice num="1234">
            <client clNum="00-01">
              <name>Pekka Kilpeläinen</name>
              <email>kilpelai@cs.uku.fi</email>
            </client>
            <item price="60" unit="EUR">XML Handbook</item>
            <item price="350" unit="FIM">XSLT Programmer's Ref</item>
          </invoice>
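
As a first taste of querying, the invoice above can be read with a general-purpose XML parser. The following is only a minimal sketch using Python's standard `xml.etree.ElementTree` module, which is not part of the course material; the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# The invoice from the slide, with plain ASCII quotes. It is encoded to
# bytes so that the parser honours the encoding declaration.
INVOICE = """<?xml version='1.0' encoding='iso-8859-1'?>
<invoice num="1234">
  <client clNum="00-01">
    <name>Pekka Kilpeläinen</name>
    <email>kilpelai@cs.uku.fi</email>
  </client>
  <item price="60" unit="EUR">XML Handbook</item>
  <item price="350" unit="FIM">XSLT Programmer's Ref</item>
</invoice>
""".encode("iso-8859-1")

root = ET.fromstring(INVOICE)

invoice_number = root.get("num")              # attribute of the root element
client_name = root.findtext("client/name")    # text of a nested element

# One (title, price, unit) triple per item element, in document order.
items = [(item.text.strip(), int(item.get("price")), item.get("unit"))
         for item in root.findall("item")]
```

The path expressions passed to `findtext` and `findall` are a small XPath-like subset; XQuery and XPath 2.0, treated later in the course, are far more expressive.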

  16. Essential Features of XML
      • Overview of XML essentials
        • many details skipped
        • Learn to consult original sources (specifications, documentation, etc.) for details!
          • The XML specification is easy to browse
      • First of all, XML is a textual or character-based way to represent data

  17. XML Document Characters
      • XML documents are made of ISO-10646 (32-bit) characters; in practice of their 16-bit Unicode subset (used, e.g., in Java)
        • Unicode 2.0 defines almost 39,000 distinct characters
      • Characters have three different aspects:
        • their identification as numeric code points
        • their representation by bytes
        • their visual presentation

  18. External Aspects of Characters
      • Documents are stored/transmitted as a sequence of bytes (of 8 bits). An encoding determines how characters are represented by bytes.
        • UTF-8 (which coincides with 7-bit ASCII for ASCII characters) is the XML default encoding
        • encoding="KOI8-R" should be OK for Cyrillic texts
          • (I cannot comment on parser support)
      • A font determines the visual presentation of characters
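
The difference between a character's code point and its byte representation can be checked directly. The snippet below is a small illustration in Python (chosen here for convenience, not used in the slides); the sample characters are arbitrary.

```python
ch = "ä"                                 # LATIN SMALL LETTER A WITH DIAERESIS
code_point = ord(ch)                     # numeric identification: U+00E4
utf8_bytes = ch.encode("utf-8")          # two bytes in UTF-8
latin1_bytes = ch.encode("iso-8859-1")   # a single byte in ISO-8859-1

cyr = "ж"                                # CYRILLIC SMALL LETTER ZHE, U+0436
koi8_bytes = cyr.encode("koi8_r")        # a single byte in KOI8-R
cyr_utf8 = cyr.encode("utf-8")           # two bytes in UTF-8
```

The same code point thus maps to different byte sequences under different encodings, which is why an XML document must declare its encoding unless it uses the default.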

  19. XML Encoding of Structure 1
      • XML is, essentially, a textual encoding scheme of labelled, ordered and attributed trees:
        • internal nodes are elements labelled by type names
        • leaves are text nodes labelled by string values, or empty element nodes
        • the left-to-right order of children of a node matters
        • element nodes may carry attributes = (name, string-value) pairs
      • This view is shared by many XML techniques (DOM, XPath, XSLT, XQuery, ...)

  20. XML Encoding of Structure 2
      • XML encoding of a tree
        • corresponds to a pre-order walk
        • start of an element node with type name A denoted by a start tag <A>, and its end denoted by end tag </A>
        • possible attributes written within the start tag: <A attr1="value1" … attrN="valueN">
          • Names attr1, …, attrN must be distinct
        • text nodes written as their string value

  21. XML Encoding of Structure: Example
      (Figure: tree with root S and children W with text "Hello", empty E with attribute A=1, and W with text "world!")
      XML encoding:
          <S>
            <W>Hello</W>
            <E A='1'/>
            <W>world!</W>
          </S>
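
The pre-order encoding of a labelled, ordered, attributed tree can be sketched in a few lines. This is only an illustration in Python; the `Node` class and `serialize` function are hypothetical helpers written for this example, not an API from the course.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Node:
    name: str                                      # element type name
    attrs: dict = field(default_factory=dict)      # attribute name -> string value
    children: list = field(default_factory=list)   # Node objects or text strings


def serialize(node) -> str:
    """Write a tree as XML by a pre-order walk."""
    if isinstance(node, str):                      # text node: its string value
        return escape(node)
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in node.attrs.items())
    if not node.children:                          # empty element
        return f"<{node.name}{attrs}/>"
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.name}{attrs}>{inner}</{node.name}>"


tree = Node("S", children=[
    Node("W", children=["Hello"]),
    Node("E", {"A": "1"}),
    Node("W", children=["world!"]),
])
xml_text = serialize(tree)
```

The start tag is emitted when the walk enters a node and the end tag when it leaves, so proper nesting of tags follows from the recursion.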

  22. XML: Logical Document Structure
      • Elements
        • indicated by matching (case-sensitive!) tags <ElementTypeName> … </ElementTypeName>
        • can contain text and/or subelements
        • can be empty: <elem-type></elem-type> or <elem-type/> (e.g. <br/> in XHTML)
        • unique root element -> the document is a single tree

  23. Logical document structure (2)
      • Attributes
        • name-value pairs attached to elements
        • in the start-tag after the element type name: <div class="preface" date='990126'> …
        • forms "..." and '...' are interchangeable
      • Also:
        • <!--comments outside other markup-->
        • <?note this would be passed to the application as a processing instruction named 'note'?>
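
That the two quoting styles and the two empty-element forms are equivalent can be observed through any conforming parser. A minimal check, assuming Python's standard `xml.etree.ElementTree` (not part of the slides):

```python
import xml.etree.ElementTree as ET

# Same element written with different quote styles and empty-element forms.
a = ET.fromstring("""<div class="preface" date='990126'/>""")
b = ET.fromstring("""<div class='preface' date="990126"></div>""")

same_attributes = a.attrib == b.attrib
both_empty = len(a) == 0 and len(b) == 0 and not a.text and not b.text
```

Note that this particular parser discards comments and processing instructions by default, so they are not visible in the resulting tree.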

  24. CDATA Sections
      • "CDATA Sections" to include XML markup characters as textual content
          <![CDATA[Here we can easily include markup characters and, for example, code fragments:
          <example>if (Count < 5 && Count > 0) </example> ]]>
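
A CDATA section is only a notational convenience: the parser reports the same character data as for the escaped form. A short sketch with Python's standard `xml.etree.ElementTree` (an assumption of this example, not a tool from the slides):

```python
import xml.etree.ElementTree as ET

with_cdata = "<example><![CDATA[if (Count < 5 && Count > 0)]]></example>"
with_escapes = "<example>if (Count &lt; 5 &amp;&amp; Count &gt; 0)</example>"

cdata_text = ET.fromstring(with_cdata).text
escaped_text = ET.fromstring(with_escapes).text
```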

  25. Two levels of correctness (1)
      • Well-formed documents
        • roughly: follow the syntax of XML, with correct markup (elements properly nested, tag names match, attributes of an element have unique names, ...)
        • violation is a fatal error
      • Valid documents
        • (in addition to being well-formed) obey an associated grammar (DTD/Schema)
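
Well-formedness is what every XML parser checks, so it can be tested simply by attempting to parse. The helper below is a hypothetical sketch using Python's standard `xml.etree.ElementTree`; that module does not validate against a DTD or Schema, so it covers only the first level of correctness.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as an XML document, False on a fatal error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


examples = {
    "<a><b/></a>": is_well_formed("<a><b/></a>"),
    "<a><b></a></b>": is_well_formed("<a><b></a></b>"),      # improper nesting
    '<a x="1" x="2"/>': is_well_formed('<a x="1" x="2"/>'),  # duplicate attribute
    "<A></a>": is_well_formed("<A></a>"),                    # tags are case-sensitive
}
```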

  26. XML docs and valid XML docs
      (Figure: sets of DTD-valid documents and Schema-valid documents within the set of XML documents = well-formed XML documents)

  27. An XML Processor (Parser)
      • Reads XML documents and reports their contents to an application
        • relieves the application from details of markup
      • XML Recommendation specifies:
        • recognition of characters as markup or data; what information to pass to applications; how to check the correctness of documents;
        • validation based on comparing document against its grammar
      Next: Basics of document grammars
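
How a processor reports document contents to an application can be illustrated with an event-based parser. The sketch below uses Python's standard `xml.parsers.expat` binding as one possible processor; `parse_events` is a hypothetical helper that simply records what the parser reports.

```python
import xml.parsers.expat


def parse_events(text: str) -> list:
    """Collect the start-tag, end-tag and character-data events of a document."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True   # deliver adjacent character data in one piece
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.Parse(text, True)    # True: this is the final chunk of input
    return events


events = parse_events("<S><W>Hello</W><E A='1'/><W>world!</W></S>")
```

The application never sees angle brackets or quotes, only element names, attribute dictionaries and text, which is the sense in which the processor relieves it from details of markup.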

  28. 1.2 Basics of document grammars
      • DTDs are variations of context-free grammars (CFGs), which are widely used for syntax specification (programming languages, XML, …) and for parser/compiler generation (e.g. YACC/GNU Bison)
      • No knowledge of them is necessary, but connections with CFGs may be informative for those who know about them

  29. DTD/CFG Correspondence
      DTD                         CFG
      --------------------------  ------------------
      XML document                parse/syntax tree
      element type                nonterminal
      element type declaration    production
      #PCDATA                     terminal

  30. Example: Three Authors of a Ref
      • Ref -> Author* Title PublData ∈ P
      • Author Author Author Title PublData ∈ L(Author* Title PublData)
      (Figure: parse tree with root Ref and children Author, Author, Author, Title, PublData; leaves Aho, Hopcroft, Ullman, The Design and Analysis ..., . . .)

  31. Extended Productions
      • Notice the regular expressions in productions
        • to describe (potentially infinite) sequences
      • That is, we are using extended CFGs
        • content models (of a DTD) correspond to regular expressions (in an ECFG production)
        • ⇒ the number of an element's children is generally unlimited

  32. 1.3 Basics of XML DTDs
      • A Document Type Declaration provides a grammar (document type definition, DTD) for a class of documents [defined in the XML Rec]
      • Syntax (in the prolog of a document instance), where the file ex.dtd holds the "external subset":
            <!DOCTYPE rootElemType SYSTEM "ex.dtd"
              [ <!-- an optional "internal subset" --> ]
            >
      • DTD = union of the external and internal subsets
        • the internal subset takes precedence for attribute and entity declarations

  33. Markup Declarations
      • DTD consists of markup declarations
        • element type declarations (logical structures)
          • ≈ productions of ECFGs
        • attribute-list declarations (logical structures)
          • for declared element types
        • entity declarations
          • for physical structures
        • notation declarations

  34. What do Declarations Look Like?
          <!ELEMENT invoice (client, item+)>
          <!ATTLIST invoice num NMTOKEN #REQUIRED>
          <!ELEMENT client (name, email?)>
          <!ATTLIST client clNum NMTOKEN #REQUIRED>
          <!ELEMENT name (#PCDATA)>
          <!ELEMENT email (#PCDATA)>
          <!ELEMENT item (#PCDATA)>
          <!ATTLIST item price NMTOKEN #REQUIRED
                         unit (FIM | EUR) "EUR" >

  35. Element Type Declarations
      • General form: <!ELEMENT elementTypeName (E)> where E is a content model
        • regular expression of element names
      • Content model operators:
            E | F : choice
            E, F  : concatenation
            E?    : optional
            E*    : zero or more
            E+    : one or more
            (E)   : grouping
      • Must group: (A,B)|C or A,(B|C), but A,B|C is forbidden
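
Because content models are regular expressions over element names, checking the children of an element against its declaration can be reduced to ordinary regular-expression matching. The following Python sketch makes that reduction explicit; `content_model_to_regex` and `children_match` are hypothetical helpers, and the translation handles only element content (no #PCDATA, EMPTY or ANY).

```python
import re


def content_model_to_regex(model: str):
    """Translate a DTD content model such as '(client, item+)' to a regex."""
    def translate(match):
        token = match.group(0)
        if token == "," or token.isspace():
            return ""                          # concatenation is implicit in a regex
        return f"(?:{re.escape(token)} )"      # an element name plus a separator
    # Names, commas and whitespace are rewritten; | ? * + ( ) pass through as is.
    return re.compile(re.sub(r"[A-Za-z_][\w.\-]*|,|\s+", translate, model))


def children_match(model: str, child_names: list) -> bool:
    """Check a sequence of child element names against a content model."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None


ok = children_match("(client, item+)", ["client", "item", "item"])
missing_item = children_match("(client, item+)", ["client"])
three_authors = children_match("(Author*, Title, PublData)",
                               ["Author", "Author", "Author", "Title", "PublData"])
```

Each name is followed by a space in the encoded sequence so that one element name can never match as a prefix of another.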

  36. Attribute-List Declarations
      • Can declare attributes for elements:
        • Name, data type and possible default value
      • Example:
            <!ATTLIST FIG id    ID        #IMPLIED
                          descr CDATA     #REQUIRED
                          class (a | b | c) "a">
      • Semantics mainly up to the application
        • the processor checks that ID attributes are unique and that targets of IDREF attributes exist
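
The ID/IDREF checks a validating processor performs can be imitated by hand. The sketch below is plain Python over `xml.etree.ElementTree`, which does not read DTDs; it therefore has to be told which attributes play the ID and IDREF roles. The function name, the attribute names `id` and `ref`, and the `see` element are assumptions of this example.

```python
import xml.etree.ElementTree as ET


def check_ids(root, id_attr="id", idref_attr="ref"):
    """Return (duplicate ID values, IDREF values without a target)."""
    seen, duplicates = set(), []
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.append(value)
            seen.add(value)
    dangling = [elem.get(idref_attr) for elem in root.iter()
                if elem.get(idref_attr) is not None
                and elem.get(idref_attr) not in seen]
    return duplicates, dangling


doc = ET.fromstring(
    '<doc><FIG id="f1" descr="a"/><FIG id="f1" descr="b"/>'
    '<see ref="f2"/><see ref="f1"/></doc>'
)
duplicates, dangling = check_ids(doc)
```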

  37. Mixed, Empty and Arbitrary Content
      • Mixed content: <!ELEMENT P (#PCDATA | I | IMG)*>
        • may contain text and elements
      • Empty content: <!ELEMENT IMG EMPTY>
      • Unrestricted content: ANY (= (#PCDATA | choice-of-all-declared-element-types)* )

  38. Entities (1)
      • Named storage units of XML documents
      • Multiple uses:
        • character entities:
          • &lt;, &#60; and &#x3C; all expand to '<' (treated as data, not as start-of-markup)
          • other predefined entities: &amp; &gt; &apos; &quot; expand to &, >, ' and "
        • general entities are shorthand notations: <!ENTITY UKU "University of Kuopio">
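
Entity expansion is done by the parser, so the application sees only the replacement text. A small check, assuming Python's standard `xml.etree.ElementTree` (the element name `t` is arbitrary):

```python
import xml.etree.ElementTree as ET

# Predefined entity, decimal and hexadecimal character references.
less_than = ET.fromstring("<t>&lt;&#60;&#x3C;</t>").text

# The other predefined entities.
others = ET.fromstring("<t>&amp;&gt;&apos;&quot;</t>").text

# A general entity declared in the internal DTD subset.
doc = '<!DOCTYPE t [<!ENTITY UKU "University of Kuopio">]><t>&UKU;</t>'
expanded = ET.fromstring(doc).text
```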

  39. Entities (2)
      • physical storage units comprising a document
        • parsed entities: <!ENTITY chap1 SYSTEM "http://myweb/ch1">
        • document entity is the starting point of processing
        • entities and elements must nest properly
      Example: the external entity chap1 contains
            <sec num="1"> … </sec>
            <sec num="2"> … </sec>
      and the document entity is
            <!DOCTYPE doc [ <!ENTITY chap1 (… as above …)> ]>
            <doc> &chap1; </doc>

  40. Unparsed Entities and Parameter Entities
      • Unparsed entities allow XML documents to refer to external binary objects like graphics files
        • XML processor handles only text
        • I've rarely used these
      • Parameter entities are used in DTDs
        • useful for modularizing declarations
        • We skip these

  41. 1.4 XML Namespaces
      • Documents often comprise parts processed by different applications (and/or defined by different grammars)
        • for example, in XSLT scripts:
            <xsl:template match="doc/title">
              <H1>
                <xsl:apply-templates />
              </H1>
            </xsl:template>
          (Figure labels: H1 is an HTML element; xsl:template and xsl:apply-templates are XSLT elements/instructions)
      • How to manage multiple sets of names?

  42. XML Namespaces (2/5)
      • Solution:
        • By introducing (arbitrary) local name prefixes, and binding them to (fixed) globally unique URIs
        • For example, the local prefix "xsl:" conventionally used in XSLT scripts

  43. XML Namespaces briefly (3/5)
      • Namespace identified by a URI (through the associated local prefix), e.g. http://www.w3.org/1999/XSL/Transform for XSLT
        • conventional but not required to use URLs
        • the identifier has to be unique, but need not be an address
      • Association is inherited by sub-elements
        • see the next example (of an XSLT script)

  44. XML Namespaces (4/5)
          <xsl:stylesheet version="1.0"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns="http://www.w3.org/TR/xhtml1/strict">
            <!-- XHTML is the 'default namespace' -->
            <xsl:template match="doc/title">
              <H1>
                <xsl:apply-templates />
              </H1>
            </xsl:template>
          </xsl:stylesheet>
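
A namespace-aware parser resolves each prefix to its URI, so the prefix itself carries no meaning. The sketch below shows this with Python's standard `xml.etree.ElementTree`, which reports names in the form `{URI}localname`; the alternative prefix `t` is an arbitrary choice for the illustration, and the XHTML URI is the one used on the slide.

```python
import xml.etree.ElementTree as ET

XSL = "http://www.w3.org/1999/XSL/Transform"
XHTML = "http://www.w3.org/TR/xhtml1/strict"   # URI as used on the slide

sheet = f"""<xsl:stylesheet version="1.0" xmlns:xsl="{XSL}" xmlns="{XHTML}">
  <xsl:template match="doc/title">
    <H1><xsl:apply-templates/></H1>
  </xsl:template>
</xsl:stylesheet>"""

# The same stylesheet with a different local prefix bound to the same URI.
renamed = sheet.replace("xmlns:xsl=", "xmlns:t=").replace("xsl:", "t:")

# Element names in document order, with prefixes resolved to URIs.
tags = [elem.tag for elem in ET.fromstring(sheet).iter()]
renamed_tags = [elem.tag for elem in ET.fromstring(renamed).iter()]
```

Both documents yield the same list of expanded names, and the unprefixed `H1` ends up in the default namespace inherited from the root element.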

  45. XML Namespaces briefly (5/5)
      • Mechanism built on top of basic XML
        • overloads attribute syntax (xmlns:) to introduce namespaces
        • does not affect validation
          • namespace attributes have to be declared for DTD-validity
          • all element type names have to be declared (with their initial prefixes!)
        • ⇒ Other schema languages (XML Schema, Relax NG) are better for validating documents with Namespaces

  46. 1.5 XML Schemas
      • A quick look at XML Schema
        • W3C Recommendation, 1st Ed. May 2001; 2nd Ed. Oct 2004:
          • XML Schema Part 0: Primer (readable non-normative introduction; Recommended)
          • XML Schema Part 1: Structures
          • XML Schema Part 2: Datatypes
        • W3C Draft (didn't lead anywhere?):
          • Formal Description, 9/2001

  47. Advantages of XML Schema (1)
      • XML syntax
        • easier to manipulate by programs (than DTDs)
      • Compatibility with namespaces
        • can validate against declarations from multiple sources
      • Content datatypes
        • 44 built-in datatypes (including primitive Java datatypes, datatypes of SQL, and XML attribute types)
        • mechanisms to derive user-defined datatypes
        • used as types of XQuery

  48. XSDL built-in types (Part 2, Chap. 3)
      (Figure: diagram of the built-in datatypes, including CDATA; types marked * are XML attribute types)
      NB: all simple values in documents are strings

  49. Advantages of XML Schema (2)
      • Element names and content types are independent; compare with CFGs/DTDs
        • For example, could define titles
          • of people as "Mr."/"Mrs."/"Ms.", and
          • of chapters as strings
        • ⇒ extends the power of CFGs/DTDs
          • where the non-terminal / tag name alone determines its allowed content
        • (Is this relevant in practice?)

  50. Advantages of XML Schema (3)
      • Ability to specify uniqueness and keys within selected parts of the document
        • for example, that titles of chapters should be unique; or key attributes of relations
        • uses XPath
      • Support for schema documentation
        • element annotation with sub-elements documentation (for human readers) and appinfo (for applications)
          • Only these contain text (#PCDATA)
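
The idea behind a uniqueness constraint, a selector that picks the constrained elements and a field that picks the value to compare, can be emulated in a few lines. The sketch below is plain Python over `xml.etree.ElementTree`, not XML Schema validation; `duplicate_values` and the `book`/`chapter`/`title` element names are assumptions of this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_values(root, selector: str, field: str) -> list:
    """Return field values that occur more than once among the selected elements."""
    values = [elem.findtext(field) for elem in root.findall(selector)]
    counts = Counter(v for v in values if v is not None)
    return sorted(v for v, n in counts.items() if n > 1)


book = ET.fromstring(
    "<book>"
    "<chapter><title>Intro</title></chapter>"
    "<chapter><title>XML</title></chapter>"
    "<chapter><title>Intro</title></chapter>"
    "</book>"
)
repeated_titles = duplicate_values(book, "chapter", "title")
```

In an actual schema the selector and field would be given as restricted XPath expressions inside a uniqueness declaration, and the check would be performed by a schema validator rather than by application code.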
