XML: A Meta-language for Describing Data

XML CSC207 – Software Design Summer 2011

Markup languages • A markup language is used to tell a printer (a person!) how to lay out text on the page. • SGML: from about 1980 • Standard Generalized Markup Language • HTML • HyperText Markup Language • XML: structure allows description of data • needs a description of the “tags”

XML • Extensible Markup Language (XML) is a meta-language that describes the content of a document • Java: portable language; XML: portable data • XML does not specify the tag set or the grammar of the language • Tag set: markup tags that have meaning to a language processor • Grammar: defines correct usage of the language’s tags

Sample XML
<?xml version="1.0"?>
<catalog>
  <book>
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price currency="USD">44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications with XML.</description>
  </book>
  <book>
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price currency="USD">5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
  </book>
</catalog>
Labels on the slide: Header, Root element, Tags, Element, End tag, Attribute

Rules for well-formed XML • Elements that contain data must have start and end tags • Empty elements must be closed: <br /> or <br></br> • Elements must not overlap • Bad nesting: <trunk> <branch> </trunk> </branch> • All attribute values must be wrapped in quotes: <a href="newpage.html"> • XML is case sensitive (unlike HTML): <TAG> and <Tag> are treated differently. • Convention: use lower case.
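The rules above can be checked mechanically, because any conforming XML parser must reject a document that breaks them. The following is a minimal sketch, assuming the JDK's built-in parser (javax.xml.parsers) rather than JDOM; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, not part of the course code.
class WellFormedCheck {

    /** Returns true if the given text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the parser print them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the document: it is not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<trunk><branch></branch></trunk>")); // true
        System.out.println(isWellFormed("<trunk><branch></trunk></branch>")); // false: overlap
        System.out.println(isWellFormed("<a href=newpage.html></a>"));        // false: no quotes
        System.out.println(isWellFormed("<TAG></Tag>"));                      // false: case differs
    }
}
```

This only tests well-formedness; checking a document against a DTD (validity) is a separate step.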

More Rules • A document begins with: an XML Declaration <?xml version="1.0" encoding="UTF-8"?> and perhaps a DocType Declaration: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <!DOCTYPE publications SYSTEM "publications.dtd"> • A DTD • defines the tags and relationships among tags • Defines the syntax and grammar of an application-specific tag language • Root element immediately follows; encloses entire content of the document. <book> everything that’s part of the book </book>

HTML vs. XML <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html> <head> <title>Jim Clarke</title> </head> <BODY> <H1> Jim Clarke </H1> <P> I am a Senior Lecturer ... <A HREF="http://www.cs.toronto.edu/">Dept. of Computer Science</A> ... <p> Here are some links to topics with which I have various connections: <ul> <li> ... • XML • fundamentally separates content from presentation • Allows any tag or grammar to be used <Book> … </Book> • HTML • specifies presentation • Defines a set of legal tags, as well as grammar <Table> ... </Table> • Both are based on SGML – Standard Generalized Markup Language

XML Parser • What is an XML parser? • Software that reads and parses XML • Passes data to the invoking application • The application does something useful with the data • Since XML is a standard, we can write generic programs to parse XML data • Frees the programmer from writing a new parser each time a new data format comes along

XML Parser Two types of parser • DOM (Document Object Model) • Reads the entire document into memory in a tree structure • SAX (Simple API for XML) • Event driven API • Sends events to the application as the document is read
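To make "event driven" concrete, here is a minimal sketch of a SAX handler that records one event per start tag as the document is read. It assumes the JDK's built-in SAX parser (javax.xml.parsers.SAXParser); the class and method names are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the course code.
class SaxEvents {

    /** Collects element names in the order their start tags are encountered. */
    static List<String> elementNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Called by the parser each time it reads a start tag.
                names.add(qName);
            }
        });
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book><h1>First heading</h1><p>First <em>paragraph</em>.</p></book>";
        System.out.println(elementNames(xml)); // prints [book, h1, p, em]
    }
}
```

Unlike DOM, nothing is kept in memory except what the handler chooses to store, which is why SAX suits very large documents.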

Document Object Model (DOM) • Cross-language API for representing XML documents as trees • Easier to manipulate than strings or streams • But may require a lot of memory • Several implementations in Java • This course uses org.jdom • In Python, xml.dom is standard • xml.dom.minidom doesn’t have everything, but is easy to use and fast.

JDOM Rules • Every document becomes an object of type Document • This has a single child of type Element • The root element of the document • Its children may be: • Other elements, Text objects, Other things that we won't worry about • The Element class also provides methods to access the element: • getName() The name of the element, i.e. the tag name. • getAttributes() Returns the complete set of attributes for this element, as a List of Attribute objects in no particular order, or an empty list if there are none. • getText() The text that might be contained in the element. • See also: setName(), setAttributes(), setAttribute(), setText().

JDOM Rules • The Attribute class also provides methods to access the Attribute object: • getName() Retrieves the name of the attribute. • setName(x) Sets the name of the attribute to x. • getValue() Returns the attribute value as a string. • setValue(x) Sets the attribute value to the string x. • There are also typed accessors: • getIntValue(), getLongValue(), getFloatValue(), getDoubleValue(), getBooleanValue()
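As a rough illustration of reading attribute and text values, the sketch below pulls the currency attribute and numeric price out of a catalog fragment. It deliberately uses the JDK's org.w3c.dom API instead of JDOM so that it runs without extra jars: getAttribute() plays the role of JDOM's Attribute.getValue(), and the explicit Double.parseDouble() stands in for a typed accessor. The class name is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class using the JDK DOM, not JDOM.
class PriceReader {

    /** Returns one "CURRENCY amount" line per <price> element. */
    static String describePrices(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList prices = doc.getElementsByTagName("price");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < prices.getLength(); i++) {
            Element price = (Element) prices.item(i);
            String currency = price.getAttribute("currency");   // attribute value as a string
            double amount = Double.parseDouble(price.getTextContent().trim()); // element text as a number
            out.append(currency).append(' ').append(amount).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog>"
                + "<book><price currency=\"USD\">44.95</price></book>"
                + "<book><price currency=\"USD\">5.95</price></book>"
                + "</catalog>";
        System.out.print(describePrices(xml));
        // prints:
        // USD 44.95
        // USD 5.95
    }
}
```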

JDOM Key Objects • Document • JDOM Document • Element • XML tag and its content • SAXBuilder • Creating a JDOM Document from an XML file • XMLOutputter • Writing a JDOM Document to an XML file

Tree Structure • Let’s look at this document: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html> <body> <h1>Title</h1> <p>A <em> word </em></p> </body> </html>

Using JDOM
// Imports assumed for JDOM 1.x (package org.jdom, as used in this course)
import java.io.IOException;
import org.jdom.Document;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;

public static void main(String[] args) {
    try {
        String filename = args[0];
        // Build the DOM tree
        SAXBuilder builder = new SAXBuilder();
        Document doc = builder.build(filename);
        // Show top-level elements (next slide)
    } catch (JDOMException je) {
        System.err.println(je.getMessage());
        System.exit(1);
    } catch (IOException ioe) {
        System.err.println("IOException: Cannot open the file for some reason.");
        System.exit(1);
    }
}

Iterate over children
// Show top-level elements
Element root = doc.getRootElement();            // get root element
Iterator ic = root.getChildren().iterator();    // get all children (excluding text)
while (ic.hasNext()) {
    Element elt = (Element) ic.next();
    System.out.println(elt.getName());
}

Input and output
Input
<?xml version="1.0" ?>
<book>
  <h1>First heading</h1>
  <p>First <em>paragraph</em>.</p>
  <p><em>Second paragraph.</em></p>
</book>
Output (element names, indented by depth)
book
 h1
 p
  em
 p
  em

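Printing the tree of nodes
public static void descend(Element elt, int depth) {
    // Indent according to the depth of this element in the tree
    for (int i = 0; i < depth; ++i) {
        System.out.print(" ");
    }
    System.out.println(elt.getName());
    // In JDOM 1.x getChildren() returns an untyped List, so each child is cast
    for (Object child : elt.getChildren()) {
        descend((Element) child, depth + 1);
    }
}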
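The same recursive walk can be tried without the JDOM jar. The sketch below is an assumption-laden variant that uses the JDK's org.w3c.dom API: getChildNodes() also returns text nodes, so they are skipped explicitly, whereas JDOM's getChildren() returns elements only. It indents two spaces per level and collects the outline in a string; the class and method names are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class using the JDK DOM, not JDOM.
class TreePrinter {

    /** Appends elt's name, indented by depth, then recurses into child elements. */
    static void descend(Element elt, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(elt.getTagName()).append('\n');
        NodeList kids = elt.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() == Node.ELEMENT_NODE) { // skip text and other nodes
                descend((Element) kid, depth + 1, out);
            }
        }
    }

    /** Parses the XML text and returns its element outline. */
    static String outline(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        descend(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book><h1>First heading</h1>"
                + "<p>First <em>paragraph</em>.</p>"
                + "<p><em>Second paragraph.</em></p></book>";
        System.out.print(outline(xml));
    }
}
```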

DBLPAnalyzer • To save the info of an author’s publications into an XML file • Creating the XML Document • Saving the XML Document • Retrieving the XML Document
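The three steps (create, save, retrieve) might look roughly like the sketch below. It is not the assignment's design: the element names author and publication, the name attribute, and the class and method names are all hypothetical, and it uses the JDK's DOM and Transformer classes in place of JDOM's Document and XMLOutputter so that it runs without extra jars.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class; the XML layout is an assumption.
class PublicationStore {

    /** Creates a small document for one author and saves it to the file. */
    static void save(String author, String title, File file) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Create the XML document in memory.
        Document doc = builder.newDocument();
        Element root = doc.createElement("author");
        root.setAttribute("name", author);
        doc.appendChild(root);
        Element pub = doc.createElement("publication");
        pub.setTextContent(title);
        root.appendChild(pub);
        // Save the document: an identity transform serializes the tree.
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(file));
    }

    /** Retrieves the document from the file and returns the first title. */
    static String loadFirstTitle(File file) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(file);
        return doc.getElementsByTagName("publication").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("publications", ".xml");
        save("Ralls, Kim", "Midnight Rain", file);
        System.out.println(loadFirstTitle(file)); // prints Midnight Rain
        file.delete();
    }
}
```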

Further Reading • DTD • Document Type Definition • XPath • XML Path Language • XSLT • Extensible Stylesheet Language Transformations

References • www.jdom.org • www.dom4j.org • http://www.jdom.org/docs/apidocs/ • http://www.ibiblio.org/xml/books/xmljava/chapters/index.html • http://www.javaworld.com/javaworld/jw-05-2000/jw-0518-jdom.html

A2 Comments

Design / Exception comments: • In some cases, exceptions are caught without any action being taken, which is confusing. • Further exception handling is needed. • Your exception handling code is done only in one location, which handles all exceptional cases for each of the three methods (initRiskTypes, readRiskItemSpec, printRiskValues). Each of these three operations could have different exceptional conditions, but they are all handled in the same way.
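To illustrate handling different exceptional conditions in different ways, here is a minimal sketch. The method below is hypothetical and unrelated to the A2 starter code; it only shows two specific catch clauses, each with its own response, in place of a single general handler.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical example class, not the A2 code.
class SpecificCatch {

    /** Reads one line and parses it as an integer, reporting each failure separately. */
    static String readFirstValue(BufferedReader in) {
        try {
            String line = in.readLine();
            if (line == null) {
                return "empty input";
            }
            return "value " + Integer.parseInt(line.trim());
        } catch (NumberFormatException e) {
            // The data was read but is malformed: report the bad content.
            return "bad number: " + e.getMessage();
        } catch (IOException e) {
            // The read itself failed: a different problem with a different response.
            return "read failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(readFirstValue(new BufferedReader(new StringReader("42"))));  // value 42
        System.out.println(readFirstValue(new BufferedReader(new StringReader("abc"))));
    }
}
```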

Design / Exception comments: • Exceptions are meant to handle *specific* exceptional behaviour; catching all exceptions with a class as general as Exception is not the correct way to program with exception handling. See the course notes on exception handling for more details. • Code is poorly unit tested. Coverage is minimal, with only one method tested. More concerning is that your tests are neither documented nor explained. • Code contains no unit testing!

Style comments: • Each instance variable should be declared at the beginning of the class. • Each instance variable should have a comment explaining its purpose or use. • It would be best to have a class-level comment describing what the class represents, not just what it contains. • I would also recommend the use of more whitespace. Without an editor that provides syntax highlighting, your code would be extremely difficult to read. • Javadoc comments: • At a minimum, you need @param, @return (if applicable), @throws (if applicable), and a method-level comment briefly describing the action taken.

Midterm Review
