
Effective XML



Presentation Transcript


  1. Effective XML • Elliotte Rusty Harold • elharo@metalab.unc.edu • http://www.cafeconleche.org/

  2. Part I: Syntax

  3. Stay with XML 1.0 • XML 1.1: • New name characters • C0 control characters • C1 control characters • NEL • Undeclare namespace prefixes • Incompatible with • Most XML parsers • W3C and RELAX NG schema languages • XOM, JDOM

  4. Part II: Structure

  5. The XML Stack

  6. Allow All XML syntax • CDATA sections • Entity references • Processing instructions • Comments • Numeric character references • Document type declarations • Different ways of representing the same core content; not different information

  7. Distinguish text from markup • A DocBook element <programlisting><![CDATA[<value> <double>28657</double></value>]]></programlisting> • The content is:<value> <double>28657</double></value> • This is the same:<programlisting>&lt;value&gt; &lt;double&gt;28657&lt;/double&gt; &lt;/value&gt;</programlisting>
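
A minimal sketch of the point above, assuming the JDK's built-in DOM parser: the CDATA form and the escaped form yield the same text content. The class and method names are illustrative, not from the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TextVersusMarkup {

    // Parses an XML string and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String withCdata = "<programlisting><![CDATA[<value><double>28657</double></value>]]></programlisting>";
        String withEscapes = "<programlisting>&lt;value&gt;&lt;double&gt;28657&lt;/double&gt;&lt;/value&gt;</programlisting>";

        // Both forms carry the same characters; only the syntax differs.
        System.out.println(rootText(withCdata));
        System.out.println(rootText(withCdata).equals(rootText(withEscapes)));
    }
}
```

In both cases the element contains plain text that happens to look like tags; it has no child elements.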

  8. The reverse problem • Tools that create XML from strings: • Tree-based editors like <Oxygen/> or XML Spy • WYSIWYG applications like OpenOffice Writer • Programming APIs such as DOM, JDOM, and XOM • The tool automatically escapes reserved characters like <, >, or &. • Just because something looks like an XML tag does not mean it is an XML tag.
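
The reverse direction can be sketched the same way. Assuming the JDK's DOM and identity Transformer (other APIs such as JDOM or XOM behave similarly but are not shown), a string handed to the API as text comes out escaped. The helper name is hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class EscapedByTheApi {

    // Builds a one-element document whose content is the given string, then serializes it.
    static String serializeWithText(String elementName, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement(elementName);
            root.setTextContent(text); // stored as text, never parsed as markup
            doc.appendChild(root);

            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        // The "<b>" in the input is text, so the serializer escapes it.
        System.out.println(serializeWithText("para", "<b>bold</b> & more"));
    }
}
```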

  9. White space matters • Parsers report all white space in element content, including boundary white space • An xml:space attribute is for the client application only, not the parser • White space in attribute values is normalized • Parsers do not report white space in the prolog, epilog, the document type declaration, and tags.
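
A small sketch of the first and third bullets, assuming the JDK's default (non-validating) DOM parser: boundary white space shows up as text nodes, and a literal line break in an attribute value is normalized to a space. Names are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class WhiteSpaceReport {

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // Counts every child node of the root, including white-space-only text nodes.
    static int childNodeCount(String xml) {
        return parseRoot(xml).getChildNodes().getLength();
    }

    static String attributeValue(String xml, String name) {
        return parseRoot(xml).getAttribute(name);
    }

    public static void main(String[] args) {
        System.out.println(childNodeCount("<a>\n  <b/>\n</a>")); // text, element, text
        System.out.println(childNodeCount("<a><b/></a>"));       // element only
        System.out.println(attributeValue("<a note=\"line one\nline two\"/>", "note"));
    }
}
```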

  10. Make structure explicit through markup • Bad <Transaction>Withdrawal 2003 12 15 200.00</Transaction> • Better <Transaction type="withdrawal"> <Date>2003-12-15</Date> <Amount>200.00</Amount> </Transaction>
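
One way to see the benefit, sketched with the JDK DOM parser: with the "Better" form, each field can be read by name and converted to a typed value, with no string splitting. The `TransactionReader` class is hypothetical.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import java.time.LocalDate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TransactionReader {
    final String type;
    final LocalDate date;
    final BigDecimal amount;

    TransactionReader(String type, LocalDate date, BigDecimal amount) {
        this.type = type;
        this.date = date;
        this.amount = amount;
    }

    // Reads the explicitly marked-up form; every field is addressed by name.
    static TransactionReader parse(String xml) {
        try {
            Element tx = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            String date = tx.getElementsByTagName("Date").item(0).getTextContent().trim();
            String amount = tx.getElementsByTagName("Amount").item(0).getTextContent().trim();
            return new TransactionReader(tx.getAttribute("type"),
                    LocalDate.parse(date), new BigDecimal(amount));
        } catch (Exception e) {
            throw new IllegalStateException("Could not read transaction", e);
        }
    }

    public static void main(String[] args) {
        TransactionReader tx = parse("<Transaction type=\"withdrawal\">"
                + " <Date>2003-12-15</Date> <Amount>200.00</Amount> </Transaction>");
        System.out.println(tx.type + " of " + tx.amount + " on " + tx.date);
    }
}
```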

  11. Store metadata in attributes • Material the reader doesn’t want to see • URLs • IDs • Styles • Revision dates • Author’s name • No substructure • Revision tracking • Citations • No multiple elements

  12. Remember mixed content • Narrative documents • Record-like documents • The RSS problem <item> <title>Xerlin 1.3 released</title> <description> Xerlin 1.3, an open source XML Editor written in Java, has been released. Users can extend the application via custom editor interfaces for specific DTDs. New features in version 1.3 include XML Schema support, WebDAV capabilities, and various user interface enhancements. Java 1.2 or later is required. </description> <link>http://www.cafeconleche.org/#news2003April7</link> </item>

  13. What you really want is this: <description> <p><a href="http://www.xerlin.org"><strong>Xerlin 1.3</strong></a>, an open source XML Editor written in Java, has been released. Users can extend the application via custom editor interfaces for specific DTDs. New features in version 1.3 include:</p> <ul> <li>XML Schema support</li> <li>WebDAV capabilities</li> <li>Various user interface enhancements</li> </ul> <p>Java 1.2 or later is required.</p> </description>

  14. What people do is this: <description>&lt;p>&lt;a href="http://www.xerlin.org">&lt;strong>Xerlin 1.3&lt;/strong>&lt;/a>, an open source XML Editor written in Java, has been released. Users can extend the application via custom editor interfaces for specific DTDs. New features in version 1.3 include:&lt;/p> &lt;ul> &lt;li>XML Schema support&lt;/li> &lt;li>WebDAV capabilities&lt;/li> &lt;li>Various user interface enhancements&lt;/li> &lt;/ul> &lt;p>Java 1.2 or later is required.&lt;/p> </description>
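
The difference between the last two slides can be checked mechanically. In this sketch (JDK DOM parser, shortened descriptions, hypothetical class name), the real mixed content exposes elements to the parser, while the escaped version is one opaque string that a consumer would have to parse a second time.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EscapedMarkupCheck {

    // Counts the elements nested anywhere inside the root element.
    static int descendantElements(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return root.getElementsByTagName("*").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String mixed = "<description><p><a href=\"http://www.xerlin.org\">"
                + "<strong>Xerlin 1.3</strong></a> has been released.</p></description>";
        String escaped = "<description>&lt;p>&lt;a href=\"http://www.xerlin.org\">"
                + "&lt;strong>Xerlin 1.3&lt;/strong>&lt;/a> has been released.&lt;/p></description>";

        System.out.println(descendantElements(mixed));   // p, a, strong
        System.out.println(descendantElements(escaped)); // no elements, only text
    }
}
```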

  15. Prefer URLs to unparsed entities and notations • URLs are simple and well understood • Notations and unparsed entities are confusing and little used • URLs don’t require the DTD to be read • Many APIs don’t even support notations and unparsed entities

  16. Part III: Semantics

  17. Use processing instructions for process-specific content • For a very particular, even local, process • Describes how a particular process acts on the data in the document • Does not describe or add to the content itself • A unit that can be treated in isolation • Content is not XML-like. • Applies to the entire document

  18. Processing instructions are not appropriate when: • Content is closely related to the content of the document itself • Structure extends beyond a single processing instruction • Needs to be validated

  19. Include all information in instance documents • Not all parsers read the DTD • Especially browsers • Beware • Default attribute values • Parsed entity references • XInclude • ID type dependence (XPath, DOM, etc.)

  20. Encode binary data using quoted printable and/or Base64 • Quoted printable works well for mostly text • Base64 for non-text data • Can you link to the data with a URL instead?
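
A rough sketch of the Base64 option using `java.util.Base64`. The `data` element and its `encoding` attribute are made up for illustration; the slides do not prescribe a vocabulary.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class BinaryInXml {

    // Base64 output uses only letters, digits, '+', '/', and '=', so it needs no XML escaping.
    static String wrap(byte[] data) {
        return "<data encoding=\"base64\">" + Base64.getEncoder().encodeToString(data) + "</data>";
    }

    // The MIME decoder is used so that line breaks added by an editor or pretty-printer are ignored.
    static byte[] unwrap(String xml) {
        try {
            String text = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
            return Base64.getMimeDecoder().decode(text);
        } catch (Exception e) {
            throw new IllegalStateException("Could not decode document", e);
        }
    }

    public static void main(String[] args) {
        byte[] original = {0, -1, 60, 38, 62}; // includes the bytes for '<', '&', '>'
        String xml = wrap(original);
        System.out.println(xml);
        System.out.println(Arrays.equals(original, unwrap(xml)));
    }
}
```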

  21. Use namespaces for modularity and extensibility • Not hard; simple cases can use one default namespace • http URIs are normally preferred • DTD validation is tricky • Code to namespace URIs, not prefixes • Avoid namespace prefixes in element content and attribute values
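
"Code to namespace URIs, not prefixes" might look like this with a namespace-aware JDK DOM parser. The namespace URI below is a placeholder, not MegaBank's real one; the same lookup works whether the document uses a default namespace or any prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceLookup {
    // Placeholder URI, assumed for this example only.
    static final String NS = "http://www.example.com/ns/statement";

    // Finds elements by namespace URI and local name; the prefix never appears in the code.
    static int count(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String defaultNs = "<Statement xmlns=\"" + NS + "\"><Amount>1</Amount></Statement>";
        String prefixed = "<mb:Statement xmlns:mb=\"" + NS + "\"><mb:Amount>1</mb:Amount></mb:Statement>";
        System.out.println(count(defaultNs, NS, "Amount"));
        System.out.println(count(prefixed, NS, "Amount"));
    }
}
```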

  22. Reuse XHTML for generic narrative content

  23. Choose the right schema language for the job • DTDs • The W3C XML Schema Language • RELAX NG • Schematron

  24. Use only what you need • You need • Well-formed XML 1.0 • A parser • You probably need: • Namespaces • You may not need: • DTDs • Schemas • XInclude • SOAP • WS-Kitchen-Sink • etc.

  25. Always use a parser • Can’t use regular expressions: • Detecting encoding • Comments and processing instructions that contain tags • CDATA sections • Unexpected placement of spaces and line breaks within tags • Default attribute values • Character and entity references • Malformed documents • Internal DTD Subset • Why not? • Unfamiliarity with parsers • Too slow
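
To make two of these failure modes concrete, here is a sketch that counts `item` elements twice: once with a naive regular expression and once with the JDK DOM parser. The sample document and class name are invented; the regex is deliberately simplistic.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class RegexVersusParser {
    static final String SAMPLE = "<list>"
            + "<!-- <item>old</item> -->"
            + "<item>real</item>"
            + "<note><![CDATA[<item>quoted</item>]]></note>"
            + "</list>";

    // Naive approach: counts anything that looks like a start-tag with this name.
    static int countWithRegex(String xml, String name) {
        Matcher m = Pattern.compile("<" + Pattern.quote(name) + "[\\s/>]").matcher(xml);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // The parser knows that comments and CDATA sections contain no elements.
    static int countWithParser(String xml, String name) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName(name)
                    .getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("regex:  " + countWithRegex(SAMPLE, "item"));
        System.out.println("parser: " + countWithParser(SAMPLE, "item"));
    }
}
```

The regex also matches the tag inside the comment and the one inside the CDATA section; only the parser reports the single real element.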

  26. Layer Functionality

  27. Program to standard APIs • Easier to deploy in Java 1.4/1.5 • Different implementations have different performance characteristics • SAX is fast • DOM interoperates • Semi-standard: • JDOM • XOM • Bleeding edge • StAX • JAXB

  28. Read the complete DTD • Be conservative in what you generate; liberal in what you accept • Important content from DTD: • Default attribute values • Namespace declarations • Entity references • ID types
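
A sketch of why the DTD matters, using an internal DTD subset so the example stays self-contained. It assumes the JDK DOM parser with its default settings, which reads the internal subset; the `currency` attribute and `bank` entity are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DtdDefaults {
    // The instance itself never mentions currency="USD" or the word "MegaBank".
    static final String DOC =
            "<!DOCTYPE Transaction [\n"
          + "  <!ATTLIST Transaction currency CDATA \"USD\">\n"
          + "  <!ENTITY bank \"MegaBank\">\n"
          + "]>\n"
          + "<Transaction>&bank; withdrawal</Transaction>";

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Element tx = parseRoot(DOC);
        // Both values come from the DTD, not from the element's own markup.
        System.out.println(tx.getAttribute("currency"));
        System.out.println(tx.getTextContent());
    }
}
```

A processor that skips the DTD would see neither the default attribute nor the entity's replacement text, which is also why the earlier advice is to put all information in the instance document when you are the one generating it.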

  29. Navigate with XPath • More robust against unexpected structure • Allow optimization by engine • Easier to code; enhanced programmer productivity
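
As a small illustration with the standard `javax.xml.xpath` API: the expression below totals withdrawals wherever they sit in the tree, so an unexpected wrapper element (the invented `Batch` here) does not break the code. The document and class name are assumptions for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathTotals {
    static final String STATEMENT = "<Statement>"
            + "<Transaction type=\"withdrawal\"><Amount>200.00</Amount></Transaction>"
            + "<Batch><Transaction type=\"withdrawal\"><Amount>50.00</Amount></Transaction></Batch>"
            + "<Transaction type=\"deposit\"><Amount>75.00</Amount></Transaction>"
            + "</Statement>";

    // Sums the Amount of every withdrawal at any depth in the document.
    static double totalWithdrawals(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Double) xpath.evaluate(
                    "sum(//Transaction[@type='withdrawal']/Amount)",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath expression", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(totalWithdrawals(STATEMENT));
    }
}
```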

  30. Validate inside your program with schemas
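
One possible shape for this, using the standard `javax.xml.validation` API with a W3C XML Schema embedded as a string. The schema is a hypothetical one written for the Transaction example earlier in the deck, not something from the slides.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class InProgramValidation {
    // Hypothetical schema for the Transaction element.
    static final String XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xs:element name=\"Transaction\"><xs:complexType>"
          + "<xs:sequence>"
          + "<xs:element name=\"Date\" type=\"xs:date\"/>"
          + "<xs:element name=\"Amount\" type=\"xs:decimal\"/>"
          + "</xs:sequence>"
          + "<xs:attribute name=\"type\" type=\"xs:string\" use=\"required\"/>"
          + "</xs:complexType></xs:element>"
          + "</xs:schema>";

    static Schema compileSchema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("Schema itself is broken", e);
        }
    }

    // Returns false when the document is invalid or not well-formed.
    static boolean isValid(String xml) {
        try {
            compileSchema().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException("Could not read document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<Transaction type=\"withdrawal\">"
                + "<Date>2003-12-15</Date><Amount>200.00</Amount></Transaction>"));
        System.out.println(isValid("<Transaction type=\"withdrawal\">"
                + "<Date>2003 12 15</Date><Amount>200.00</Amount></Transaction>"));
    }
}
```

In a real program the compiled `Schema` would normally be created once and reused, since it is thread-safe while `Validator` instances are not.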

  31. Part IV: Implementation

  32. Write documents in Unicode • Prefer UTF-8 • Smaller in English • ASCII compatible • Normalization • É, ü, ì and so forth • NFC • ICU
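
A brief sketch of both points using only the JDK (`java.text.Normalizer` rather than ICU, which offers similar functionality): NFC turns a base letter plus combining accent into the single precomposed character, and UTF-8 is half the size of UTF-16 for ASCII text. The class name is illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class UnicodeChecks {

    // Normalizes to Normalization Form C (canonical composition).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Big-endian UTF-16 without a byte order mark, so only the text itself is counted.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String decomposed = "E\u0301"; // 'E' followed by COMBINING ACUTE ACCENT
        String composed = toNfc(decomposed);
        System.out.println(decomposed.length() + " char(s) before, " + composed.length() + " after NFC");
        System.out.println(composed.equals("\u00C9")); // LATIN CAPITAL LETTER E WITH ACUTE

        String english = "Effective XML";
        System.out.println(utf8Length(english) + " bytes in UTF-8, " + utf16Length(english) + " in UTF-16");
    }
}
```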

  33. Avoid Vendor Lock-in; Beware • Opaque, binary data used in place of marked up text. • Over-abbreviated, nonobvious names like F17354 and grgyt • APIs that hide the XML • Products that focus on the "Infoset" • Alternate serializations of XML • Patented formats

  34. Hang on to your relational database

  35. Document Namespaces with RDDL <!DOCTYPE html PUBLIC "-//XML-DEV//DTD XHTML RDDL 1.0//EN" "http://www.rddl.org/rddl-xhtml.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:rddl="http://www.rddl.org/"> <head> <title>MegaBank Statement Markup Language (MBSML)</title> </head> <body> <p> This is the XML namespace for the <a href="http://developer.megabank.com/xml/">MegaBank Statement Markup Language</a>. </p> <rddl:resource xlink:type="simple" xlink:href="http://developer.megabank.com/xml/spec.html" xlink:role="http://www.w3.org/TR/html4/" xlink:arcrole="http://www.rddl.org/purposes#normative-reference"> <p> The <a href="http://developer.megabank.com/xml/spec.html">MegaBank Statement Markup Language Specification 1.0</a> </p> </rddl:resource> </body> </html>

  36. Pick the correct MIME type • application/xml • Not text/xml! • Don't use charset • application/mathml+xml • image/svg+xml • application/xslt+xml

  37. TagSoup Your HTML

  38. Catalog common resources <?xml version="1.0"?> <catalog xmlns= "urn:oasis:names:tc:entity:xmlns:xml:catalog" > <public publicId= "-//OASIS//DTD DocBook XML V4.2//EN" uri= "file:///opt/xml/docbook/docbookx.dtd"/> </catalog>

  39. Compress if space is a problem /* output */ OutputStream fout = new FileOutputStream("data.xml.gz"); OutputStream out = new GZIPOutputStream(fout); OutputFormat format = new OutputFormat(doc); XMLSerializer output = new XMLSerializer(out, format); output.serialize(doc); out.close(); /* input */ InputStream fin = new FileInputStream("data.xml.gz"); InputStream in = new GZIPInputStream(fin); DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); DocumentBuilder parser = factory.newDocumentBuilder(); Document doc = parser.parse(in); /* work with the document... */

  40. To Learn More • This Presentation: http://cafeconleche.org/slides/lxny/effectivexml • Effective XML: 50 Specific Ways to Improve Your XML Documents • Elliotte Rusty Harold • Addison-Wesley, 2003 • ISBN 0-321-15040-6 • $44.99 • http://cafeconleche.org/books/effectivexml
