
XML Schema: An Intensive One-Day Tutorial

Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh



  1. XML Schema: An Intensive One-Day Tutorial • Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh

  2. When you see this [handbook: 2], it means there’s accompanying information in the Additional Materials handbook • Overview • What are schemata, anyway? • The nature of document structure • Schema as contract • Taking control of structure definition • XML Schema: the activity • The W3C and its WGs • The Charter and Requirements • The state of play • The Draft RECs • A detailed walkthrough • Schemas and Layered Architecture

  3. Terminology • Documents have structure • Document types • Document instances • Structure can be defined • Informally (D. S. D.) • SGML DTD • XML DTD • Schema using XML

  4. Background • SGML DTDs for D. S. D. • Sperberg-McQueen • Others • Considered for XML itself • MCF, then RDF, now DCD, by Bray et al. • XML-Data, two versions, now XML-Data Reduced, by Layman et al., then Frankston and Thompson • SOX, from Veo Corp. • XSchema, from an ad-hoc group of designers

  5. Document Structure • Two relations are constitutive • Part-of • Kind-of • Existing DSD mechanisms use Content Models to specify part-of relations • But they only specify kind-of relations implicitly or informally • Making kind-of relations explicit would make both understanding and maintenance easier

  6. Taking Control of D. S. D. • Erik Naggum used to talk about SGML allowing users to take control of their data • XML allows the same move one level up, for developers • The starting point is much simpler • The architecture is congenial • The demand is there • We need to do this, to make the transition to validation easier

  7. Why validate? • A D. S. D. is a contract between producers and consumers • It provides a guaranteed interface • Producers validate to ensure they are providing what they promised • Consumers validate to check up on producers • and to protect their applications • Application authors validate to simplify their task • Leave error detection and analysis to the validating parser

  8. Reconstructing DTDs • The Schema DTD is expressed in vanilla XML • Top level elements for declaring • Elements :-) • Types • Notations • . . . • Subordinate element types for declaring • Attributes • Content models • . . .

  9. An aside about terminology • SGML and XML 1.0 talk about element types • XML Schema to date has been more casual and just talked about elements • Meaning either an element in an instance • Or the abstraction which is described in a DTD or Schema • Further confused by XML Schema making extensive use of type • Also, schema means many different things to different people • I'll try always to say/write XML Schema. . .

  10. A simple example • As an XML DTD: <!ELEMENT text (#PCDATA|emph|name)*> <!ATTLIST text timestamp NMTOKEN #REQUIRED> • As an XML Schema: <element name="text"> <type content="mixed"> <element ref="emph"/> <element ref="name"/> <attribute name="timestamp" type="date" minOccurs="1"/> </type> </element>

  11. The Schema Architecture: Static • A document or an application or a user identifies a schema • Each is well-formed XML • The schema is valid w.r.t. the Schema DTD • The document is schema-valid w.r.t. the schema • The schema is schema-valid w.r.t. the schema for schemas

  12. The Schema Architecture: Dynamic • An XML application (XSP) which schema-validates • ‘Takes control’ because changing how schemata work means • changing the Schema DTD/schema for schemas • upgrading XSP accordingly • not changing XML itself

  13. The W3C • XML Schema hopes to be a W3C Recommendation • The W3C is the World Wide Web Consortium, a voluntary association of companies and non-profit organisations • Membership costs serious money and confers voting rights • Complex procedures, with the Director (Tim Berners-Lee) holding all the high cards, but the big vendors (e.g. Microsoft, Adobe, Netscape) have a lot of power

  14. . . . and its WGs • The XML recommendation was written by the W3C’s XML Working Group • Which split itself into pieces, of which one is the XML Schema WG • Chartered in the autumn of 1998 • Requirements document out in February of 1999 • Due to go to Last Call early in 2000

  15. Requirements document [handbook: 5] • Full of good and hopeful requirements • DTDs and more • Support inheritance • Data-friendly • Good inventory of primitive datatypes

  16. The state of play [handbook: 6, 8] • Two component documents • Structures • Datatypes • Three public working drafts so far • May 1999 • September 1999 • November 1999 • Further (near-final) PWD out December 1999: http://www.w3.org/TR/xmlschema-1/ [contains pointers to previous drafts]

  17. The XML Schema worldview • Validity and well-formedness are XML 1.0 concepts • They are defined over character sequences • Namespace-compliant is a Namespace concept • It's defined over character sequences too • Schema-validity is the XML Schema concept • It is defined over XML document Infosets • So the whole XML Schema exercise is predicated on and layered on top of XML 1.0 well-formedness plus Namespaces • Because they are constitutive of the Infoset

  18. What's the Infoset? • The XML 1.0 plus Namespaces abstract data model • Defines a modest number of information items • Element, attribute, namespace declaration, ... • Each has required and optional properties • Name, children, …

  19. What the Infoset isn't • It's not the DOM • Much higher level • It's not about implementation or interfacing at all • But you can think of it as a data structure if that helps • It's not an SGML property set/grove • But it's close • It doesn't have the entity problem • a mixed blessing, as we will see

  20. The Schema and the Infoset • So crucially, schemas are about infosets, not character sequences • You could schema-validate a DOM tree you built by hand! • Using a schema which exists only as a DOM tree ditto • This simplifies things tremendously • but is hard to get your head around at first
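
The point about validating a hand-built tree can be illustrated with a minimal Python sketch. It assumes `xml.etree.ElementTree` as a rough stand-in for an infoset (it is not one) and hard-codes the rules for `text` from the slide 10 example; the function name `text_is_valid` and the sample values are hypothetical. No character sequence is ever parsed or serialised.

```python
import xml.etree.ElementTree as ET

def text_is_valid(element):
    """Check the slide-10 rules for <text> directly on a tree:
    a required timestamp attribute and only emph/name children."""
    return (
        element.tag == "text"
        and "timestamp" in element.attrib
        and all(child.tag in {"emph", "name"} for child in element)
    )

# Build the tree by hand: there is no XML text anywhere in this example.
root = ET.Element("text", {"timestamp": "1999-12-01"})
root.text = "Hello "
emph = ET.SubElement(root, "emph")
emph.text = "world"
emph.tail = " from "
ET.SubElement(root, "name").text = "Henry"

# A second hand-built tree that breaks the rules (no timestamp attribute).
bad = ET.Element("text")

print(text_is_valid(root), text_is_valid(bad))
```

The check only inspects element names, attributes and children, which is the sense in which validation is defined over the tree rather than over its serialisation.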

  21. Basic XML Schema concepts • Syntax is not the Schema • Namespaces are fundamental • But a schema is not a namespace • Separation of tag from type • Simple and Complex types • Modular Schema construction • Powerful type construction • Local tag-type association • Powerful wildcards • Element equivalence classes • Extension mechanism • Documentation mechanism

  22. Schema Walkthrough 1 [handbook: 10] • A Toy Purchase Order schema

  23. Types and Type Derivation • For purposes of discussion, consider only the content type aspects of types (attributes are analogous) • A content type definition (simple or complex) consists of a set of constraints on what's allowed as content.

  24. Permissions and obligations • You can think of the type itself as the set of strings/EIIs its constraints allow • It's helpful to think of constraints as composed of obligations and permissions: (\d )?(\d{3}-)?\d{3}-\d{4} • regexp definition facet for [US] 'phone number type • the ? and the \d can be seen as permissions, the - and the {3} as obligations • 1 337-6818 and 207-422-6240 belong to this type
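
As a quick check of the membership claims, here is a minimal Python sketch. It assumes that Python's `re` syntax is close enough to the draft's regexp facet for this pattern, and it uses `fullmatch` so that the whole string must belong to the type; the function name `is_us_phone` is made up for the example.

```python
import re

# Pattern copied from the slide; note the literal space after the first \d.
US_PHONE = re.compile(r"(\d )?(\d{3}-)?\d{3}-\d{4}")

def is_us_phone(text: str) -> bool:
    """Return True if the whole string is a member of the type."""
    return US_PHONE.fullmatch(text) is not None

for candidate in ["1 337-6818", "207-422-6240", "337-6818", "3376818"]:
    print(candidate, is_us_phone(candidate))
```

The last candidate drops the hyphen, which the pattern treats as an obligation, so it falls outside the type.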

  25. Complex types (title?,forename*,surname) • (shorthand for) content model for name • the ? can be seen as permission, the , and the 'surname' as obligations (at the end of the day, each component involves both permission AND obligation, but the balance of impact is as suggested)

  26. Complex types, cont'd (title?,forename*,surname) <name> <forename>...</forename> <surname>...</surname> </name> • and <name> <title>...</title> <surname>...</surname> </name> • are both members of this type
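
One way to test membership in a content model is to treat it as a regular expression over the sequence of child element names. The sketch below does this in Python for (title?,forename*,surname); the encoding of each tag as a space-terminated token and the helper names are assumptions made for illustration, not part of any XML Schema draft.

```python
import re
import xml.etree.ElementTree as ET

# (title?,forename*,surname) rewritten over space-terminated tag names.
NAME_MODEL = re.compile(r"(title )?(forename )*surname ")

def child_sequence(element):
    """Flatten the element's children into a string such as 'title surname '."""
    return "".join(child.tag + " " for child in element)

def has_name_content(xml_text):
    """Return True if the children of the root element satisfy the content model."""
    element = ET.fromstring(xml_text)
    return NAME_MODEL.fullmatch(child_sequence(element)) is not None

first = "<name><forename>...</forename><surname>...</surname></name>"
second = "<name><title>...</title><surname>...</surname></name>"
wrong_order = "<name><surname>...</surname><forename>...</forename></name>"

print(has_name_content(first), has_name_content(second), has_name_content(wrong_order))
```

Only the order and count of child elements are checked here; text content and attributes are ignored.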

  27. Restriction • A type definition may be a restriction of another type's definition if it reduces permissions, sometimes to the point of inducing obligations: \d[01]\d-\d{3}-\d{4} (a restriction of US p#, (\d )?(\d{3}-)?\d{3}-\d{4}) • The membership of this type, which includes 207-422-6240 but not 1 337-6818, is a (proper) subset of the membership of the original type, because by construction every member of the new type is a member of the original.
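
The subset relation can be spot-checked in Python. This is a sketch over a handful of hand-picked sample strings (all but the two from the slides are invented), so it illustrates the claim rather than proving it.

```python
import re

ORIGINAL = re.compile(r"(\d )?(\d{3}-)?\d{3}-\d{4}")
RESTRICTED = re.compile(r"\d[01]\d-\d{3}-\d{4}")

def member(pattern, text):
    """Whole-string membership test."""
    return pattern.fullmatch(text) is not None

# Sample strings; only the first two come from the slides.
samples = ["207-422-6240", "1 337-6818", "337-6818", "555-123-4567", "212-555-0100"]

restricted_members = [s for s in samples if member(RESTRICTED, s)]
original_members = [s for s in samples if member(ORIGINAL, s)]

# Every sampled member of the restricted type is also in the original type.
is_subset = set(restricted_members) <= set(original_members)
print(restricted_members, is_subset)
```

The restricted pattern makes the three-digit prefix obligatory and constrains its middle digit, so strings such as 337-6818 drop out while remaining valid for the original type.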

  28. Restriction, cont'd • Similarly, (forename+,surname) • is a restriction of the original type definition for name (title?,forename*,surname) • and the same relation holds.

  29. Restriction, cont'd • Note first that <name> <forename>...</forename> <surname>...</surname> </name> • is a member of the new type (forename+,surname), but <name> <title>...</title> <surname>...</surname> </name> • is not.
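
For content models the subset relation can be checked by brute force over short child sequences. The Python sketch below enumerates every sequence of up to four children drawn from the three element names and compares memberships; the bound of four and the space-terminated encoding of tag names are arbitrary choices for the example.

```python
import re
from itertools import product

ORIGINAL = re.compile(r"(title )?(forename )*surname ")   # (title?,forename*,surname)
RESTRICTED = re.compile(r"(forename )+surname ")          # (forename+,surname)
TAGS = ["title", "forename", "surname"]

def sequences(max_len):
    """Yield every child-name sequence of length 0..max_len as a token string."""
    for length in range(max_len + 1):
        for combo in product(TAGS, repeat=length):
            yield "".join(tag + " " for tag in combo)

def members(model, max_len=4):
    """Collect the sequences that the content model accepts."""
    return {seq for seq in sequences(max_len) if model.fullmatch(seq)}

original = members(ORIGINAL)
restricted = members(RESTRICTED)

print(sorted(restricted))
print(restricted < original)   # proper subset within the enumerated bound
```

Within this bound the restricted model accepts three sequences and the original accepts seven, including "title surname ", which the restriction excludes.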

  30. Extension • Now consider (title?, forename*, surname, genMark?) • This type extends the original type definition for name. <name> <forename>Al</forename> <surname>Gore</surname> <genMark>Jr</genMark></name> • is an instance of this new type, but not of the original.

  31. Any • Finally note that the <any/> content model particle, in all of its forms, introduces particularly broad permissions into complex content types.

  32. Where are we headed? • A number of design decisions can now be stated: • Should we make it easy to construct type definitions which restrict or extend other type definitions, by specifying only the method of derivation and the differences between the source and derived type definitions? • The new proposal says 'yes', you do this by using the "source" and "derivedBy" attributes on your <type> or <datatype> element.

  33. Datatype example • Consider the simple type case first: <datatype name='bodytemp' source='decimal'> <precision value='4'/> <scale value='1'/> <minInclusive value='97.0'/> <maxInclusive value='105.0'/> </datatype>

  34. Derived type <datatype name='healthyBodytemp' source='bodytemp'> <maxInclusive value='99.5'/> </datatype> • The healthyBodytemp type definition is defined by closing down the permitted range of bodytemp. We say it 'inherits' the other facets of bodytemp, so the 'effective type definition' of healthyBodytemp is

  35. Effective type <datatype name='healthyBodytemp' source='decimal'> <precision value='4'/> <scale value='1'/> <minInclusive value='97.0'/> <maxInclusive value='99.5'/> </datatype> • Since it doesn't in general make sense to extend one simple type by another, the "derivedBy" attribute is actually redundant for <datatype>.
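
The inheritance of facets can be sketched in Python as a walk up the source chain, with nearer definitions overriding inherited ones. The dictionary layout and function names are assumptions, and so is the reading of the facets: precision is taken as the maximum total number of digits and scale as the maximum number of fraction digits.

```python
from decimal import Decimal, InvalidOperation

# Facet tables transcribed from the slides; the layout itself is hypothetical.
DATATYPES = {
    "bodytemp": {
        "source": "decimal",
        "facets": {"precision": 4, "scale": 1,
                   "minInclusive": Decimal("97.0"),
                   "maxInclusive": Decimal("105.0")},
    },
    "healthyBodytemp": {
        "source": "bodytemp",
        "facets": {"maxInclusive": Decimal("99.5")},
    },
}

def effective_facets(name):
    """Merge facets along the source chain; a built-in such as 'decimal' adds none."""
    if name not in DATATYPES:
        return {}
    definition = DATATYPES[name]
    facets = effective_facets(definition["source"])
    facets.update(definition["facets"])
    return facets

def is_valid(name, literal):
    """Check a decimal literal against the effective facets of a datatype."""
    try:
        value = Decimal(literal)
    except InvalidOperation:
        return False
    if not value.is_finite():
        return False
    facets = effective_facets(name)
    _, digits, exponent = value.as_tuple()
    total_digits = len(digits) + max(exponent, 0)
    fraction_digits = max(-exponent, 0)
    if "precision" in facets and total_digits > facets["precision"]:
        return False
    if "scale" in facets and fraction_digits > facets["scale"]:
        return False
    if "minInclusive" in facets and value < facets["minInclusive"]:
        return False
    if "maxInclusive" in facets and value > facets["maxInclusive"]:
        return False
    return True

print(effective_facets("healthyBodytemp"))
print(is_valid("bodytemp", "101.3"), is_valid("healthyBodytemp", "101.3"))
```

A value of 101.3 is a bodytemp but not a healthyBodytemp, because only the maxInclusive facet was tightened while the other three were inherited unchanged.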

  36. Extension for complex types • The next simplest case is extension for complex types: <type name='name'> <element name='title' minOccurs='0'/> <element name='forename' minOccurs='0' maxOccurs='*'/> <element name='surname'/> </type>

  37. Derived type <type name='fullName' source='name' derivedBy='extension'> <element name='genMark' minOccurs='0'/> </type>

  38. The effective type <type name='fullName'> <element name='title' minOccurs='0'/> <element name='forename' minOccurs='0' maxOccurs='*'/> <element name='surname'/> <element name='genMark' minOccurs='0'/> </type>

  39. Restriction for complex types • Restriction for complex types is harder to handle syntactically, because of the significance of linear order in content models, but the semantics are completely parallel to the simple type case:

  40. Restriction example <type name='simpleName' source='name' derivedBy='restriction'> <restrictions> <element name='title' maxOccurs='0'/> <element name='forename' minOccurs='1'/> </restrictions> </type>

  41. Restriction and Inheritance • Just as in the <datatype> case, the content model aspects not mentioned are left alone, including the "maxOccurs='*'" on <forename> and the whole particle for <surname>, so the 'effective content model' of 'simpleName' is

  42. Effective type <type name='simpleName'> <element name='title' maxOccurs='0' minOccurs='0'/> <!-- i.e. forbidden --> <element name='forename' minOccurs='1' maxOccurs='*'/> <element name='surname'/> </type>
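
Both derivation methods can be mimicked in a few lines of Python. In this sketch each particle is a (name, minOccurs, maxOccurs) tuple, defaults of 1 are assumed where the slides omit an attribute, and `None` in a restriction means "not mentioned, so inherit". The data layout and the function name `effective_particles` are hypothetical.

```python
# Particles are (name, minOccurs, maxOccurs); "*" means unbounded.
TYPES = {
    "name": {
        "source": None, "derivedBy": None,
        "particles": [("title", 0, 1), ("forename", 0, "*"), ("surname", 1, 1)],
    },
    "fullName": {
        "source": "name", "derivedBy": "extension",
        "particles": [("genMark", 0, 1)],
    },
    "simpleName": {
        "source": "name", "derivedBy": "restriction",
        # None marks an occurrence bound the restriction does not mention.
        "particles": [("title", None, 0), ("forename", 1, None)],
    },
}

def effective_particles(type_name):
    """Compute the effective content model by following the source chain."""
    definition = TYPES[type_name]
    if definition["source"] is None:
        return list(definition["particles"])
    inherited = effective_particles(definition["source"])
    if definition["derivedBy"] == "extension":
        # Extension appends the new particles after the inherited ones.
        return inherited + list(definition["particles"])
    # Restriction overrides only the bounds it mentions, keeping the order.
    overrides = {name: (low, high) for name, low, high in definition["particles"]}
    result = []
    for name, low, high in inherited:
        new_low, new_high = overrides.get(name, (None, None))
        result.append((name,
                       low if new_low is None else new_low,
                       high if new_high is None else new_high))
    return result

print(effective_particles("fullName"))
print(effective_particles("simpleName"))
```

The sketch does not verify that a restriction really narrows its source (for example, that a new minOccurs is not lower than the inherited one); a real processor would have to.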

  43. Instances • Given all the example definitions above, all of <name><title>Ms</title><surname>Steinem</surname></name> • <name xsi:type='simpleName'> <forename>Harry</forename> <forename>S</forename> <surname>Truman</surname> </name>

  44. Another instance <name xsi:type='fullName'> <forename>Al</forename> <surname>Gore</surname> <genMark>Jr</genMark> </name> • all would be schema-valid per <element name='name' type='name'/>
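
How xsi:type interacts with the declared type can be sketched as follows in Python. The effective content models are written out by hand as regular expressions over child names, the `SOURCE` table records which type each derived type comes from, and the xsi namespace URI is an assumption (the slides do not show the namespace declaration, which a parser needs).

```python
import re
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/1999/XMLSchema-instance"   # assumed namespace URI
XSI_TYPE = "{%s}type" % XSI

# Effective content models over space-terminated child names (hand-written).
MODELS = {
    "name": r"(title )?(forename )*surname ",
    "simpleName": r"(forename )+surname ",
    "fullName": r"(title )?(forename )*surname (genMark )?",
}
SOURCE = {"simpleName": "name", "fullName": "name"}   # derived type -> source type

def derives_from(type_name, ancestor):
    """True if type_name is ancestor or is derived from it, directly or not."""
    while type_name is not None:
        if type_name == ancestor:
            return True
        type_name = SOURCE.get(type_name)
    return False

def schema_valid(xml_text, declared_type="name"):
    """Validate against the xsi:type if present, else the declared type."""
    element = ET.fromstring(xml_text)
    actual = element.get(XSI_TYPE, declared_type)
    if actual not in MODELS or not derives_from(actual, declared_type):
        return False
    tags = "".join(child.tag + " " for child in element)
    return re.fullmatch(MODELS[actual], tags) is not None

steinem = "<name><title>Ms</title><surname>Steinem</surname></name>"
truman = ("<name xmlns:xsi='%s' xsi:type='simpleName'><forename>Harry</forename>"
          "<forename>S</forename><surname>Truman</surname></name>" % XSI)
gore = ("<name xmlns:xsi='%s' xsi:type='fullName'><forename>Al</forename>"
        "<surname>Gore</surname><genMark>Jr</genMark></name>" % XSI)

print(schema_valid(steinem), schema_valid(truman), schema_valid(gore))
```

The instance may name any type derived from the declared one, and the content is then checked against that more specific type rather than against `name` itself.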

  45. Connecting Instances and Schemas • Like I said • A schema is not a namespace • The connection cannot be made rigid • The draft identifies three layers, first is • schema-valid(EII,TypeName,ComponentSet) • The TypeName is a (namespaceURI,NCName) pair • The component set is made up of (namespaceURI,NCName,component) triples
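
The first layer's data shapes can be pictured with a small Python sketch: a component set as a list of (namespaceURI, NCName, component) triples, and a type name as a (namespaceURI, NCName) pair used to look one up. The namespace URIs, the string placeholders standing in for real components, and the function name are all invented for illustration.

```python
from typing import NamedTuple, Optional

class Component(NamedTuple):
    namespace_uri: Optional[str]   # None stands for "no namespace"
    nc_name: str
    component: str                 # placeholder for a real type definition

# Hypothetical component set; it mixes two namespaces, which is one way to
# see that a schema is not the same thing as a namespace.
COMPONENT_SET = [
    Component("http://example.org/people", "name", "(title?,forename*,surname)"),
    Component("http://example.org/people", "fullName",
              "(title?,forename*,surname,genMark?)"),
    Component("http://example.org/products", "name", "(#PCDATA)"),
]

def find_component(component_set, type_name):
    """Look up a (namespaceURI, NCName) pair; return the component or None."""
    namespace_uri, nc_name = type_name
    for entry in component_set:
        if (entry.namespace_uri, entry.nc_name) == (namespace_uri, nc_name):
            return entry.component
    return None

print(find_component(COMPONENT_SET, ("http://example.org/people", "name")))
print(find_component(COMPONENT_SET, ("http://example.org/products", "name")))
```

Two components share the NCName `name` here without clashing, because the lookup key is the pair, not the local name alone.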

  46. Other layers • Layer 2: transfer syntax • Layer 3: web connections

  47. Schema Walkthrough 2 [handbook: 13] • The Schema for Datatypes

  48. Schema Walkthrough 3 [handbook: 21] • The Schema for Schemas

  49. Change of Gear • Let's look at the role of schemas in supporting the layered architecture which is emerging all around us

  50. XML is ASCII for the 21st century • ASCII (ISO 646) solved a fundamental interchange problem for flat text documents • What bits encode what characters • (For a pretty parochial definition of 'character') • UNICODE/ISO 10646 extends that solution to the whole world • XML thought it was doing the same for simple tree-structured documents • The emphasis in the XML design was on simplifying SGML to move it to the Web • XML didn't touch SGML's architectural vision • flexible linearisation/transfer syntax • for tree-structured documents with internal links
