XML Introduction

XML Introduction

Introducing XML • XML stands for Extensible Markup Language. A markup language specifies the structure and content of a document. • Because it is extensible, XML can be used to create a wide variety of document types.

Introducing XML • XML is a subset of a the Standard Generalized Markup Language (SGML) which was introduced in the 1980s. SGML is very complex and can be costly. • These reasons led to the creation of Hypertext Markup Language (HTML), a more easily used markup language. XML can be seen as sitting between SGML and HTML – easier to learn than SGML, but more robust than HTML.

The Limits of HTML • HTML was designed for formatting text on a Web page. It was not designed for dealing with the content of a Web page. Additional features have been added to HTML, but they do not solve data description or cataloging issues in an HTML document. • Because HTML is not extensible, it cannot be modified to meet specific needs. Browser developers have added features making HTML more robust, but this has resulted in a confusing mix of different HTML standards.

Introducing XML • HTML cannot be applied consistently. Different browsers require different standards making the final document appear differently on one browser compared with another.

Introduction to XML Markup • XML document (intro.xml) • Marks up message as XML • Commonly stored in text files • Extension .xml

1 <?xml version = "1.0"?> Document begins with declaration that specifies XML version 1.0 2 3  Element message is child element of root elementmyMessage 4  Line numbers are not part of XML document. We include them for clarity. 5 6 <myMessage> 7 <message>Welcome to XML!</message> 8 </myMessage>

Introduction to XML Markup (cont.) • XML documents • Must contain exactly one root element • Attempting to create more than one root element is erroneous • Elements must be nested properly • Incorrect:<x><y>hello</x></y> • Correct:<x><y>hello</y></x> • Must be well-formed

XML Parsers • An XML processor (also called XML parser) evaluates the document to make sure it conforms to all XML specifications for structure and syntax. • XML parsers are strict. It is this rigidity built into XML that ensures XML code accepted by the parser will work the same everywhere.

XML Architecture

Structure of a Well-formed XML Document <?xml version="1.0" ?> <!DOCTYPE publication [ <!ELEMENT publications (journals, conferences, books)> ... <!ELEMENT author (#PCDATA)> <!ELEMENT issue (#PCDATA)> <!ATTLIST issue pages CDATA #REQUIRED> <!ENTITY JSI " <journal>Journal of Systems Integration</journal> <publisher>Kluwer Academic Publishers</publisher>"> ]> <publications> <journals> ... &JSI; ... </publications>

XML Parsers • Microsoft’s parser is called MSXML and is built directly in IE versions 5.0 and above. • Netscape developed its own parser, called Mozilla, which is built into version 6.0 and above.

Parsers and Well-formed XML Documents (cont.) • XML parsers support • Document Object Model (DOM) • Builds tree structure containing document data in memory • Simple API for XML (SAX) • Generates events when tags, comments, etc. are encountered • (Events are notifications to the application)

Parsing an XML Document with MSXML • XML document • Contains data • Does not contain formatting information • Load XML document into Internet Explorer 5.0 • Document is parsed by msxml. • Places plus (+) or minus (-) signs next to container elements • Plus sign indicates that all child elements are hidden • Clicking plus sign expands container element • Displays children • Minus sign indicates that all child elements are visible • Clicking minus sign collapses container element • Hides children • Error generated, if document is not well formed

XML document shown in IE6.

Character Set • XML documents may contain • Carriage returns • Line feeds • Unicode characters • Enables computers to process characters for several languages

Characters vs. Markup • XML must differentiate between • Markup text • Enclosed in angle brackets (< and >) • e.g,. Child elements • Character data • Text between start tag and end tag • Welcome to XML! • Elements versus Attributes

White Space, Entity References and Built-in Entities • Whitespace characters • Spaces, tabs, line feeds and carriage returns • Significant (preserved by application) • Insignificant (not preserved by application) • Normalization • Whitespace collapsed into single whitespace character • Sometimes whitespace removed entirely <markup>This is character data</markup> after normalization, becomes <markup>This is character data</markup>

White Space, Entity References and Built-in Entities (cont.) • XML-reserved characters • Ampersand (&) • Left-angle bracket (<) • Right-angle bracket (>) • Apostrophe (’) • Double quote (”) • Entity references • Allow to use XML-reserved characters • Begin with ampersand (&) and end with semicolon (;) • Prevents from misinterpreting character data as markup

White Space, Entity References and Built-in Entities (cont.) • Build-in entities • Ampersand (&) • Left-angle bracket (<) • Right-angle bracket (>) • Apostrophe (') • Quotation mark (") • Mark up characters “<>&” in element message <message><>&</message>

Document Object Model (DOM) • XML Document Object Model (DOM) • Build tree structure in memory for XML documents • DOM-based parsers parse these structures • Exist in several languages (Java, C, C++, Python, Perl, C#, VB.NET, VB, etc)

Document Object Model (DOM) • DOM tree • Each node represents an element, attribute, etc. <?xml version ="1.0"?><message from = "Paul" to = "Tem"> <body>Hi, Tim!</body></message> • Node created for element message • Element message has child node for body element • Element body has child node for text "Hi, Tim!" • Attributes from and to also have nodes in tree

DOM Implementations • DOM-based parsers • Microsoft’s msxml • Microsoft.NET System.Xml Namspace • Sun Microsystem’s JAXP

Creating Nodes • Create XML document at run time

Traversing the DOM • Use DOM to traverse XML document • Output element nodes • Output attribute nodes • Output text nodes

DOM Components • Manipulate XML document

XPATH • XML Path Language (XPath) • Syntax for locating information in XML document • e.g., attribute values • String-based language of expressions • Not structural language like XML • Used by other XML technologies • XSLT

XPATH - Nodes • XML document • Tree structure with nodes • Each node represents part of XML document • Seven types • Root • Element • Attribute • Text • Comment • Processing instruction • Namespace • Attributes and namespaces are not children of their parent node • They describe their parent node

XPath node types

XPath node types. (Part 2)

Location Paths • Location path • Expression specifying how to navigate XPath tree • Composed of location steps • Each location step composed of • Axis • Node test • Predicate

Axes • XPath searches are made relative to context node • Axis • Indicates which nodes are included in search • Relative to context node • Dictates node ordering in set • Forward axes select nodes that follow context node • Reverse axes select nodes that precede context node

Node Tests • Node tests • Refine set of nodes selected by axis • Rely upon axis’ principle node type • Corresponds to type of node axis can select

Node-set Operators and Functions (cont.) • Location-path expressions • Combine node-set operators and functions • Select all head and body children element nodes head | body • Select last bold element node in head element node head/title[ last() ] • Select third book element book[ position() = 3 ] • Or alternatively book[ 3 ] • Return total number of element-node children count( * ) • Select all book element nodes in document //book

Sample Data for Queries <bib><book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><bookprice=“55”> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book> </bib>

bib Data Model for XPath The root The root element book book publisher author . . . . Addison-Wesley Serge Abiteboul

XPath: Simple Expressions Result: <year> 1995 </year> <year> 1998 </year> Result: empty (there were no papers) /bib/book/year /bib/paper/year

XML Document Type Definitions • Declarations Definition of element and attribute • Content Model (regular expressions) • Association of attributes with elements • Association of elements with other • Order and cardinality constraints

Element Declarations • Basic form – <!ELEMENT elementname (contentmodel)> – Contentmodel determines which – Given by a regular expression • Atomic contents • Element content <!ELEMENT example ( a )> • Text content <!ELEMENT example (#PCDATA)> – Empty Element <!ELEMENT example EMPTY> – Arbitrary content <!ELEMENT example ANY>

Element Declarations • Sequence <!ELEMENT example ( a, b )> • Alternative <!ELEMENT example ( a | b )> • Optional (zero or one) <!ELEMENT example ( a )?> • Optional and repeatable (zero or more) <!ELEMENT example ( a )*> • Required and repeatable (one or more) <!ELEMENT example ( a )+> • Mixed content <!ELEMENT example (#PCDATA | a)*> • Content model can be grouped by parentheses • Cyclic element containment is allowed

Attribute Declarations • Each element can be associated with an arbitrary number of attributes • Basic form – <!ATTLIST Elementname Attributename Type Default Attributename Type Default ... > • Example: Document Type Definition <!ELEMENT shipTo ( #PCDATA)> <!ATTLIST shipTo country CDATA #REQUIRED "US" state CDATA #IMPLIED version CDATA #FIXED "1.0" payment (cash|creditCard) "cash"> Document <shipTo country="Switzerland" version="1.0" payment="creditCard"> … </shipTo>

Attribute Declarations - Types • CDATA – String – <!ATTLIST example HREF CDATA • Enumeration – Token from given set of values, Default – <!ATTLIST example selection ( • Possible Defaults – Required attribute: #REQUIRED – Optional attribute: #IMPLIED – Fixed attribute: #FIXED – Default for enumeration: "value" • Other attribute types: IF, IDREF, ENTITY, ENTITIES, NOTATION, NAME, NAMES, NMTOKEN, NMTOKENS

ID/IDREFExample: ID/IDREF • ID, IDREF – ID is a unique identifier within the document – IDREF is a reference to an ID – Referential integrity checked by the parser – ID's determined by the application – <!ATTLIST example identity ID #IMPLIED reference IDREF #IMPLIED>

Inclusion of XML Document Type Definitions • External DTD Declaration <?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE test PUBLIC "-//Test AG//DTD test V1.0//EN" SYSTEM "http://www.test.org/test.dtd"> <test> "test" is a document element </test> • Internal DTD Declaration <!DOCTYPE test [ <!ELEMENT test EMPTY> ]> <test/> • Mixed usage <!DOCTYPE test SYSTEM "http://www.test.org/test.dtd" [ <!ENTITY hello "hello world"> ]> <test>&hello;</test>

Working with Namespaces • Name collision occurs when elements from two or more documents share the same name. • Name collision isn’t a problem if you are not concerned with validation. The document content only needs to be well-formed. • However, name collision will keep a document from being validated.

Name Collision This figure shows two documents each with a Name element

Using Namespaces to Avoid Name Collision This figure shows how to use a namespace to avoid collision

Declaring a Namespace • A namespace is a defined collection of element and attribute names. • Names that belong to the same namespace must be unique. Elements can share the same name if they reside in different namespaces. • Namespaces must be declared before they can be used.

Declaring a Namespace • A namespace can be declared in the prolog or as an element attribute. The syntax to declare a namespace in the prolog is: <?xml:namespace ns=“URI” prefix=“prefix”?> • Where URI is a Uniform Resource Identifier that assigns a unique name to the namespace, and prefix is a string of letters that associates each element or attribute in the document with the declared namespace.

Declaring a Namespace • For example, <?xml:namespace ns=http://uhosp/patients/ns prefix=“pat”> • Declares a namespace with the prefix “pat” and the URI http://uhosp/patients/ns. • The URI is not a Web address. A URI identifies a physical or an abstract resource.

XML Introduction