DOM (Document Object Model)

Cheng-Chia Chen

What is DOM? • DOM (Document Object Model) • A tree-based data model of XML documents • An API for XML document processing • Usable across programming languages: the API is language neutral • Defined in terms of CORBA IDL • Language-specific bindings supplied for ECMAScript, Java, ….

Document Object Model • Defines how XML and HTML documents are represented as objects in programs • W3C Standard • Defined in IDL; thus language independent • HTML as well as XML • Writing as well as reading • Covers everything except internal and external DTD subsets

Trees • An XML document can be represented as a tree. • It has a root. • It has nodes. • It is amenable to recursive processing.

DOM (Document Object Model) • What is the tree view of the document?
    <?xml version="1.0" encoding="UTF-8" ?>
    <TABLE>
      <TBODY>
        <TR> <TD>紅樓夢</TD> <TD>曹雪芹</TD> </TR>
        <TR> <TD>三國演義</TD> <TD>羅貫中</TD> </TR>
      </TBODY>
    </TABLE>

Tree view (DOM view) of an XML Document • [Figure: the document node is the root; below it are the element nodes TABLE, TBODY, TR, and TD; the leaves are the text nodes 紅樓夢, 曹雪芹, 三國演義, and 羅貫中]

DOM Evolution • DOM Level 0: • DOM Level 1, a W3C Standard • DOM Level 2, a W3C Standard • DOM Level 3, W3C Standards: • Document Object Model (DOM) Level 3 Core Specification • Document Object Model (DOM) Level 3 Load and Save Specification • Document Object Model (DOM) Level 3 Validation Specification • DOM Level 3, W3C Working Group Notes: • Document Object Model (DOM) Level 3 XPath Specification Version 1.0 • Document Object Model (DOM) Level 3 Views and Formatting Specification • Document Object Model (DOM) Level 3 Events Specification Version 1.0 • W3C DOM Working Group • W3C DOM Tech Reports

DOM Implementations for Java • Apache XML Project's Xerces parsers: • http://xml.apache.org/xerces2-j/index.html • Oracle/Sun's Java API for XML • http://jaxp.java.net/ • GNU JAXP: • http://www.gnu.org/software/classpathx/jaxp/jaxp.html • Now part of GNU Classpath

Modules • Core: org.w3c.dom (L1~L3) • Traversal: org.w3c.dom.traversal (L2) • XPath, Load and Save, Validation (L3) • Range: org.w3c.dom.range (L2) • HTML: org.w3c.dom.html (L2) • Views: org.w3c.dom.views (L2) • StyleSheets: org.w3c.dom.stylesheets • CSS: org.w3c.dom.css • Events: org.w3c.dom.events (L2) • Only the Core, Traversal, XPath, Load and Save, and Validation modules really apply to XML. The others are for HTML.

DOM Trees • Entire document is represented as a tree. • A tree contains nodes. • Some nodes may contain other nodes (depending on node type). • Each document node contains: • zero or one doctype nodes • one root element node • zero or more comment and processing instruction nodes

17 interfaces in org.w3c.dom: Attr, CDATASection, CharacterData, Comment, Document, DocumentFragment, DocumentType, DOMImplementation, Element, Entity, EntityReference, NamedNodeMap, Node, NodeList, Notation, ProcessingInstruction, Text • Plus one exception: DOMException • Plus a bunch of HTML stuff in org.w3c.dom.html and other packages

The DOM Interface Hierarchy • Fundamental interfaces: NamedNodeMap, DOMImplementation, NodeList, DOMException, Node, Document, CharacterData, Comment, Attr, Text, Element • Extended interfaces: DocumentType, CDATASection, Notation, Entity, EntityReference, ProcessingInstruction, DocumentFragment

Steps to use DOM • Create a parser using library-specific code • Use the parser to parse the document and return a DOM org.w3c.dom.Document object. • The entire document is stored in memory. • Use DOM methods and interfaces to extract data from this object

Parsing documents with a (Xerces) DOM Parser Example
    import com.sun.org.apache.xerces.internal.parsers.*;
    // import org.apache.xerces.parsers.*;   // when using the standalone Xerces distribution
    import org.w3c.dom.*;
    import org.xml.sax.*;
    import java.io.*;

    public class DOMParserMaker {
      public static void main(String[] args) {
        DOMParser parser = new DOMParser();
        for (int i = 0; i < args.length; i++) {
          try {
            // Read the entire document into memory
            parser.parse(args[i]);
            Document d = parser.getDocument();
            // work with the document...
          }
          catch (SAXException e) { System.err.println(e); }
          catch (IOException e) { System.err.println(e); }
        }
      }
    }

Parsing process using JAXP • javax.xml.parsers.DocumentBuilderFactory.newInstance() creates a DocumentBuilderFactory • Configure the factory • The factory's newDocumentBuilder() method creates a DocumentBuilder • Configure the builder • The builder parses the document and returns a DOM org.w3c.dom.Document object. • The entire document is stored in memory. • DOM methods and interfaces are used to extract data from this object

JAXP’s DOM pluggability mechanism

Parsing documents with a JAXP DocumentBuilder
    import javax.xml.parsers.*;
    import org.w3c.dom.*;
    import org.xml.sax.*;
    import java.io.*;

    public class JAXPParserMaker {
      public static void main(String[] args) {
        try {
          DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
          builderFactory.setNamespaceAware(true);
          DocumentBuilder parser = builderFactory.newDocumentBuilder();
          for (int i = 0; i < args.length; i++) {
            try {
              // Read the entire document into memory
              Document d = parser.parse(args[i]);
              // work with the document...
            }
            catch (SAXException e) { System.err.println(e); }
            catch (IOException e) { System.err.println(e); }
          } // end for
        }
        catch (ParserConfigurationException e) {
          System.err.println("You need to install a JAXP aware parser.");
        }
      }
    }

The Node Interface
    package org.w3c.dom;
    public interface Node {
      // NodeType
      public static final short ELEMENT_NODE                = 1;
      public static final short ATTRIBUTE_NODE              = 2;
      public static final short TEXT_NODE                   = 3;
      public static final short CDATA_SECTION_NODE          = 4;
      public static final short ENTITY_REFERENCE_NODE       = 5;
      public static final short ENTITY_NODE                 = 6;
      public static final short PROCESSING_INSTRUCTION_NODE = 7;
      public static final short COMMENT_NODE                = 8;
      public static final short DOCUMENT_NODE               = 9;
      public static final short DOCUMENT_TYPE_NODE          = 10;
      public static final short DOCUMENT_FRAGMENT_NODE      = 11;
      public static final short NOTATION_NODE               = 12;

The Node interface • Node Properties • name (qname, uri, lname, prefix), type, value
    public String getNodeName();
    public String getNamespaceURI();
    public String getPrefix();
    public void setPrefix(String prefix) throws DOMException;
    public String getLocalName();
    public String getNodeValue() throws DOMException;
    public void setNodeValue(String value) throws DOMException;
    public short getNodeType();

The Node interface • Tree navigation public Node getParentNode(); public NodeList getChildNodes(); public Node getFirstChild(); public Node getLastChild(); public Node getPreviousSibling(); public Node getNextSibling(); public NamedNodeMap getAttributes(); public Document getOwnerDocument(); public boolean hasChildNodes(); public boolean hasAttributes();

Node navigation • [Figure: the node "this" with its parentNode above it, previousSibling and nextSibling on either side, and its childNodes below it, running from firstChild to lastChild]
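The navigation methods can be tried out with a short sketch like the one below. The class name NavigationSketch, its two helper methods, and the sample XML are assumptions made for illustration; only the org.w3c.dom and JAXP calls come from the standard API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NavigationSketch {

    // Parses an XML string; checked exceptions are wrapped to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Walks the children of 'parent' via firstChild/nextSibling and collects element names.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Node tr = parse("<TR><TD>a</TD><TD>b</TD></TR>").getDocumentElement();
        System.out.println(childElementNames(tr));                 // prints [TD, TD]
        Node first = tr.getFirstChild();
        System.out.println(first.getParentNode().isSameNode(tr));  // prints true
        System.out.println(first.getPreviousSibling() == null);    // prints true
    }
}
```

The sibling-chain loop is an alternative to indexing into getChildNodes(); both visit the same children in document order.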

The Node interface • Tree Modification public Node insertBefore (Node newNode, Node refNode) throws DOMException; public Node appendChild(Node newNode) throws DOMException; public Node replaceChild (Node newNode, Node refNode) throws DOMException; public Node removeChild(Node node) throws DOMException;

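Node manipulation • [Figure: the node "this" with its childNodes from firstChild to lastChild, one of which is refNode, and a separate newNode; the four operations shown are this.appendChild(newNode), this.insertBefore(newNode, refNode), this.replaceChild(newNode, refNode), and this.removeChild(refNode)]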
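A minimal sketch of the four modification methods, applied in sequence to a small in-memory tree. The class name ManipulationSketch, its helpers, and the element names are hypothetical; the comments track the child list after each call.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class ManipulationSketch {

    // Creates an empty in-memory document.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Joins the names of all children of 'parent' with commas.
    static String childNames(Node parent) {
        StringBuilder sb = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(n.getNodeName());
        }
        return sb.toString();
    }

    static String demo() {
        Document doc = newDocument();
        Element list = doc.createElement("list");
        doc.appendChild(list);
        Element a = doc.createElement("a");
        Element b = doc.createElement("b");
        Element c = doc.createElement("c");
        Element x = doc.createElement("x");

        list.appendChild(b);       // children: b
        list.appendChild(c);       // children: b,c
        list.insertBefore(a, b);   // children: a,b,c
        list.replaceChild(x, b);   // children: a,x,c (returns the replaced node b)
        list.removeChild(c);       // children: a,x   (returns the removed node c)
        return childNames(list);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints a,x
    }
}
```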

The Node interface • Utilities public Node cloneNode(boolean deep); public void normalize(); • merges all adjacent text nodes into one. • CDATA sections are preserved (not merged into text nodes) • leaves no empty text nodes public boolean isSupported(String feature, String version); • Tests whether the DOM implementation implements a specific feature and that feature is supported by this node.
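The effect of normalize() and of shallow versus deep cloning can be seen by counting children, as in the sketch below. The class name UtilitiesSketch and the method childCounts() are hypothetical names used only for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class UtilitiesSketch {

    // Returns {children before normalize, children after normalize,
    //          children of a shallow clone, children of a deep clone}.
    static int[] childCounts() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element p = doc.createElement("p");
        p.appendChild(doc.createTextNode("Hello, "));
        p.appendChild(doc.createTextNode("DOM"));
        p.appendChild(doc.createTextNode(""));   // empty text node
        int before = p.getChildNodes().getLength();

        p.normalize();                           // merges adjacent text, drops the empty node
        int after = p.getChildNodes().getLength();

        Node shallow = p.cloneNode(false);       // copies the element only
        Node deep = p.cloneNode(true);           // copies the element and its subtree
        return new int[] {
            before, after,
            shallow.getChildNodes().getLength(),
            deep.getChildNodes().getLength()
        };
    }

    public static void main(String[] args) {
        int[] c = childCounts();
        System.out.println(c[0] + " " + c[1] + " " + c[2] + " " + c[3]); // prints 3 1 0 1
    }
}
```

A cloned node has no parent until it is inserted somewhere, so a clone never appears in the tree by itself.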

Node (continued) new in DOM 3 • String getTextContent() • returns the text content of this node and its descendants. • i.e., the string-value of the node in the XPath view (except for Document, for which getTextContent() returns null). • void setTextContent(String arg) • makes a text node containing arg the only child of this node • boolean isDefaultNamespace(String namespaceURI) • checks whether the specified namespaceURI is the default namespace. • boolean isEqualNode(Node arg) // Tests if two nodes are equal. • When are two nodes equal? • Same type, names, attributes, value, and childNodes • boolean isSameNode(Node other) • Returns whether this node is the same node as the given one. • String lookupNamespaceURI(String prefix) • Looks up the namespace URI associated with the given prefix, starting from this node.
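The DOM 3 additions to Node can be exercised as in the following sketch. The class name Dom3NodeSketch, the method report(), and the namespace URIs urn:default and urn:example are made up for the example; the parser is made namespace aware so that the namespace lookups have something to work with.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class Dom3NodeSketch {

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Collects the results of several DOM 3 calls, separated by '|'.
    static String report() {
        Document doc = parse("<r xmlns='urn:default' xmlns:p='urn:example'>"
                + "<a x='1'>t</a><a x='1'>t</a><p:b>u</p:b></r>");
        Element r = doc.getDocumentElement();
        NodeList as = r.getElementsByTagName("a");
        Node a1 = as.item(0);
        Node a2 = as.item(1);
        return r.getTextContent()                    // text of all descendants: ttu
                + "|" + doc.getTextContent()         // null for a Document node
                + "|" + a1.isEqualNode(a2)           // same type, name, attributes, children
                + "|" + a1.isSameNode(a2)            // but two distinct nodes
                + "|" + a1.lookupNamespaceURI("p")   // prefix declared on the root
                + "|" + r.isDefaultNamespace("urn:default");
    }

    public static void main(String[] args) {
        System.out.println(report()); // prints ttu|null|true|false|urn:example|true
    }
}
```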

short compareDocumentPosition(Node other) • Compares ‘this’ node with the ‘other’ node with regard to their position in the document • the result is a bitmask of: Node.DOCUMENT_POSITION_PRECEDING, _FOLLOWING, _CONTAINS, _CONTAINED_BY, _DISCONNECTED, _IMPLEMENTATION_SPECIFIC • String getBaseURI() // order: xml:base, then entity, then document • The absolute base URI of this node or null if the implementation wasn't able to obtain an absolute URI. • Object getUserData(String key) • Retrieves the object associated with a key on this node. • Object setUserData(String key, Object data, UserDataHandler handler) • Associates an object with a key on this node. • handler can handle events (clone, delete, rename, etc.) for the node
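Because compareDocumentPosition() returns a bitmask, results are usually tested bit by bit. A small sketch, assuming a hypothetical helper has() and a made-up sample document; the bits describe where the argument sits relative to the node the method is called on.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class PositionSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // True when 'other', seen from 'ref', carries the given DOCUMENT_POSITION_* bit.
    static boolean has(Node ref, Node other, short bit) {
        return (ref.compareDocumentPosition(other) & bit) != 0;
    }

    public static void main(String[] args) {
        Document doc = parse("<r><a><c/></a><b/></r>");
        Node a = doc.getElementsByTagName("a").item(0);
        Node b = doc.getElementsByTagName("b").item(0);
        Node c = doc.getElementsByTagName("c").item(0);

        System.out.println(has(a, b, Node.DOCUMENT_POSITION_FOLLOWING));    // b follows a: true
        System.out.println(has(b, a, Node.DOCUMENT_POSITION_PRECEDING));    // a precedes b: true
        System.out.println(has(a, c, Node.DOCUMENT_POSITION_CONTAINED_BY)); // c is inside a: true
        System.out.println(has(c, a, Node.DOCUMENT_POSITION_CONTAINS));     // a contains c: true

        // User data: attach an application object to a node, without a handler.
        a.setUserData("visited", Boolean.TRUE, null);
        System.out.println(a.getUserData("visited"));                       // prints true
    }
}
```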

UserDataHandler (skipped!) • gets called when the node the object is associated to is being cloned, adopted, deleted , imported, or renamed. • can be used by the application to implement various behaviors regarding the data it associates to the DOM nodes. • void handle(short operation, String key, Object data, Node src, Node dst) • This method is called whenever the node for which this handler is registered is imported or cloned. • operation - Specifies the type of operation that is being performed on the node. : { adopt, clone, import, delete, rename } • methods: Doc.adoptNode(), Node.cloneNode(), Doc.importNode(),Node.removeChild(node), Doc.renameNode() • key - Specifies the key for which this handler is being called. • data - Specifies the data for which this handler is being called. • src - Specifies the node being cloned, adopted, imported, or renamed. This is null when the node is being deleted. • dst - Specifies the node newly created if any, or null.

The NodeList Interface • Represent an ordered collection of nodes, without defining or constraining how this collection is implemented. • package org.w3c.dom; public interface NodeList { // 0-based public Node item(int index); // access by position! public int getLength(); } • Why not just List<Node> ? • applicable to Java only. • but DOM is defined not only for Java.

The NamedNodeMap interface • Represents collections of nodes that can be accessed by name.
    public interface NamedNodeMap {
      public Node item(int index);   // same as NodeList
      public int getLength();
      public Node getNamedItem(String name);   // key = nodeName
      // Inserts or replaces a node, depending on whether the map already has a
      // node with the same name as arg.getNodeName();
      // the old node is returned if this is a replacement!
      public Node setNamedItem(Node arg) throws DOMException;
      public Node removeNamedItem(String name) throws DOMException;
      // Introduced in DOM Level 2: key = URI + localName
      public Node getNamedItemNS(String namespaceURI, String localName);
      public Node setNamedItemNS(Node arg) throws DOMException;
      public Node removeNamedItemNS(String namespaceURI, String localName) throws DOMException;
    }
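The most common NamedNodeMap is the one returned by getAttributes() on an element. The sketch below reads, replaces, and removes attributes through the map; the class name NamedNodeMapSketch, the method demo(), and the img example are hypothetical. The result is sorted because the order of nodes in a NamedNodeMap is not defined.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NamedNodeMapSketch {

    // Copies a NamedNodeMap of attributes into a sorted name -> value map.
    static Map<String, String> toMap(NamedNodeMap attrs) {
        Map<String, String> result = new TreeMap<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            result.put(attr.getNodeName(), attr.getNodeValue());
        }
        return result;
    }

    static String demo() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<img src='a.png' alt='logo'/>")));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        NamedNodeMap attrs = doc.getDocumentElement().getAttributes();
        StringBuilder log = new StringBuilder(toMap(attrs).toString()); // {alt=logo, src=a.png}

        Attr newAlt = doc.createAttribute("alt");
        newAlt.setValue("new");
        Node old = attrs.setNamedItem(newAlt);           // replaces, returns the old alt node
        log.append(' ').append(old.getNodeValue());      // logo

        attrs.removeNamedItem("src");
        log.append(' ').append(toMap(attrs));            // {alt=new}
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {alt=logo, src=a.png} logo {alt=new}
    }
}
```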

DOMStringList, NameList • DOMStringList // List<String> • an ordered collection of DOMString (i.e., Java String) values. • boolean contains(String str) • Tests if a string is part of this DOMStringList. • int getLength() • String item(int index) • NameList // List<(name, namespaceURI)> • an ordered collection of pairs of name and namespace values (which could be null values). • int getLength() • String getName(int index) • String getNamespaceURI(int index) • boolean contains(String str) • Tests if a name is part of this NameList. • boolean containsNS(String namespaceURI, String name) • Tests if the pair namespaceURI/name is part of this NameList.

NodeReporter
    import javax.xml.parsers.*;
    import org.w3c.dom.*;
    import org.xml.sax.*;
    import java.io.*;

    public class NodeReporter {
      public static void main(String[] args) {
        try {
          DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
          DocumentBuilder parser = builderFactory.newDocumentBuilder();
          NodeReporter iterator = new NodeReporter();
          for (int i = 0; i < args.length; i++) {
            try {
              // Read the entire document into memory
              Document doc = parser.parse(args[i]);
              iterator.followNode(doc);
            }
            catch (SAXException ex) { System.err.println(args[i] + " is not well-formed."); }
            catch (IOException ex) { System.err.println(ex); }
          }
        }
        catch (ParserConfigurationException ex) {
          System.err.println("You need to install a JAXP aware parser.");
        }
      } // end main

      // note use of recursion
      public void followNode(Node node) {
        processNode(node);
        if (node.hasChildNodes()) {
          NodeList children = node.getChildNodes();
          for (int i = 0; i < children.getLength(); i++) {
            followNode(children.item(i));
          }
        }
      }

      public void processNode(Node node) {
        String name = node.getNodeName();
        String type = typeName[node.getNodeType()];
        System.out.println("Type " + type + ": " + name);
      }

      // Type2TypeName: indexed by the Node.*_NODE constants (1 to 12)
      public String[] typeName = new String[] {
        "UnknownType", "Element", "Attribute", "Text", "CDATA Section",
        "Entity Reference", "Entity", "Processing Instruction", "Comment",
        "Document", "Document Type Declaration", "Document Fragment", "Notation"
      };
    }

Values of nodeName, nodeValue and attributes in a Node
    Interface              nodeName                   nodeValue                  attributes
    Attr                   name of attribute          value of attribute         null
    CDATASection           #cdata-section             content                    null
    Comment                #comment                   content                    null
    Document               #document                  null                       null
    DocumentFragment       #document-fragment         null                       null
    DocumentType           document type name         null                       null
    Element                tag name                   null                       NamedNodeMap
    Entity                 entity name                null                       null
    EntityReference        name of entity referenced  null                       null
    Notation               notation name              null                       null
    ProcessingInstruction  target                     content excluding target   null
    Text                   #text                      content of the text node   null

The Document Node • The root node representing the entire document; not the same as the root element • Contains: • zero or more processing instruction nodes • zero or more comment nodes • zero or one document type node • one element node

The Document Interface
    package org.w3c.dom;
    public interface Document extends Node {
      public DocumentType getDoctype();
      public DOMImplementation getImplementation();
      public Element getDocumentElement();
      // DOM Level 3; null if not specified or if the document was created
      // using DOMImplementation.createDocument()
      public String getDocumentURI();
      public NodeList getElementsByTagName(String tagname);
      public NodeList getElementsByTagNameNS(String namespaceURI, String localName);
      public Element getElementById(String elementId);

The Document Interface // Factory methods public Element createElement(String tagName) throws DOMException; public Element createElementNS(String namespaceURI, String qName) throws DOMException; public DocumentFragment createDocumentFragment(); public Text createTextNode(String data); public Comment createComment(String data); public CDATASection createCDATASection(String data) throws DOMException; public ProcessingInstruction createProcessingInstruction(String target, String data) throws DOMException; public Attr createAttribute(String name) throws DOMException; public Attr createAttributeNS(String namespaceURI, String qName) throws DOMException; public EntityReference createEntityReference(String name) throws DOMException; public Node importNode(Node importedNode, boolean deep) throws DOMException; }
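The factory methods only create nodes; each node still has to be attached to the tree with appendChild() or a related method. A minimal sketch that builds a small table document from scratch follows. The class name BuildDocumentSketch, the method build(), and the element content are assumptions for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BuildDocumentSketch {

    // Builds <!-- book list --><TABLE><TR lang="en"><TD>Title</TD><TD>Author</TD></TR></TABLE>.
    static Document build() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element table = doc.createElement("TABLE");
        doc.appendChild(table);                                    // becomes the root element
        doc.insertBefore(doc.createComment(" book list "), table); // comment before the root

        Element tr = doc.createElement("TR");
        Attr lang = doc.createAttribute("lang");
        lang.setValue("en");
        tr.setAttributeNode(lang);
        table.appendChild(tr);

        for (String cell : new String[] {"Title", "Author"}) {
            Element td = doc.createElement("TD");
            td.appendChild(doc.createTextNode(cell));
            tr.appendChild(td);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println(doc.getChildNodes().getLength());            // prints 2
        System.out.println(doc.getDocumentElement().getTextContent());  // prints TitleAuthor
        System.out.println(doc.getElementsByTagName("TD").getLength()); // prints 2
    }
}
```

Every created node already has this document as its owner document, which is why it can be inserted without importNode().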

New in Document V3 • Node adoptNode(Node node): • adopts (i.e., moves) the subtree rooted at node from its owner document into this document. • The node is detached from its parent if it has one. • cf. importNode(Node, deep) // this makes a copy • DOMConfiguration getDomConfig() • DOMConfiguration is a table of (key, value) parameters used to control how Document.normalizeDocument() behaves. • normalizeDocument() • acts as if the document was going through a save and load cycle, putting the document in a "normal" form.

use case: • DOMConfiguration docConfig = myDocument.getDomConfig(); docConfig.setParameter("infoset", Boolean.TRUE); myDocument.normalizeDocument(); • Node renameNode(Node n, String namespaceURI, String qualifiedName) throws DOMException • Renames an existing node of type ELEMENT_NODE or ATTRIBUTE_NODE. • getXmlVersion(), getXmlEncoding(), getXmlStandalone():boolean • get the respective value from the XML declaration of a document. • <?xml version="1.1" encoding="UTF-8" standalone="yes" ?> • setXmlVersion(String), setXmlStandalone(boolean)
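The difference between importNode() (copy) and adoptNode() (move), followed by a renameNode() call, can be sketched as follows. The class name AdoptImportSketch, the method report(), and the element names are hypothetical. The code uses the node returned by renameNode(), since the specification allows an implementation to return a new node.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class AdoptImportSketch {

    static String report() {
        Document src;
        Document dst;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            src = builder.newDocument();
            dst = builder.newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = src.createElement("root");
        src.appendChild(root);
        Element item = src.createElement("item");
        root.appendChild(item);

        // importNode copies: the original stays in the source document.
        Node copy = dst.importNode(item, true);
        boolean copyOwnedByDst = copy.getOwnerDocument() == dst;
        boolean originalStillInSrc = item.getParentNode() == root;

        // adoptNode moves: the node is detached from its parent and changes owner.
        Node moved = dst.adoptNode(item);
        boolean detached = !root.hasChildNodes();
        boolean movedOwnedByDst = moved.getOwnerDocument() == dst;

        // renameNode changes the name of an element or attribute node.
        Node renamed = dst.renameNode(moved, null, "entry");
        dst.appendChild(renamed);

        return copyOwnedByDst + " " + originalStillInSrc + " " + detached + " "
                + movedOwnedByDst + " " + dst.getDocumentElement().getNodeName();
    }

    public static void main(String[] args) {
        System.out.println(report()); // prints true true true true entry
    }
}
```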

Element Nodes • Represents a complete element including its • start-tag, • end-tag, and • content • Content may contain: • Element nodes • ProcessingInstruction nodes • Comment nodes • Text nodes • CDATASection nodes • EntityReference nodes

The Element Interface
    public String getTagName();   // = getNodeName();
    public NodeList getElementsByTagName(String name);
    public NodeList getElementsByTagNameNS(String uri, String localName);
    public String getAttribute(String name);
    public String getAttributeNS(String uri, String localName);
    public void setAttribute(String name, String value) throws DOMException;
    public void setAttributeNS(String uri, String qName, String value) throws DOMException;
    public void removeAttribute(String name) throws DOMException;
    public void removeAttributeNS(String uri, String localName) throws DOMException;
    public Attr getAttributeNode(String name);
    public Attr getAttributeNodeNS(String namespaceURI, String localName);
    public Attr setAttributeNode(Attr newAttr) throws DOMException;
    public Attr setAttributeNodeNS(Attr newAttr) throws DOMException;
    public Attr removeAttributeNode(Attr oldAttr) throws DOMException;
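A brief sketch of the attribute methods, with and without namespaces. The class name ElementAttributesSketch, the method report(), and the attribute values are made up; the XLink namespace URI is the real one. Note that getAttribute() returns an empty string, not null, when the attribute is absent.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ElementAttributesSketch {

    static final String XLINK = "http://www.w3.org/1999/xlink";

    static String report() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element link = doc.createElementNS(null, "link");
        doc.appendChild(link);

        link.setAttribute("title", "Home");                      // plain attribute
        link.setAttributeNS(XLINK, "xlink:href", "index.xml");   // namespaced attribute

        String title = link.getAttribute("title");               // Home
        String href = link.getAttributeNS(XLINK, "href");        // looked up by URI + local name
        boolean missingIsEmpty = link.getAttribute("missing").isEmpty();

        link.removeAttribute("title");
        boolean titleGone = !link.hasAttribute("title");
        int remaining = link.getAttributes().getLength();        // only xlink:href is left

        return title + " " + href + " " + missingIsEmpty + " " + titleGone + " " + remaining;
    }

    public static void main(String[] args) {
        System.out.println(report()); // prints Home index.xml true true 1
    }
}
```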

Example application • RSS-based list of Web logs
    <?xml version="1.0"?>
    <weblogs>
      <log>
        <name>MozillaZine</name>
        <url>http://www.mozillazine.org</url>
        <changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>
        <ownerName>Jason Kersey</ownerName>
        <ownerEmail>kerz@en.com</ownerEmail>
        <description>THE source for news on the Mozilla Organization. DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
        <imageUrl></imageUrl>
        <adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
      </log>
      …
    </weblogs>

DOM Design • Want to find all URLs in the logs • The character data of each url element needs to be read. Everything else can be ignored. • The getElementsByTagName() method in Document gives us a quick list of all the url elements.

The program: WeblogsDOM.java
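The listing for this program is not reproduced in this text. The following is a minimal sketch of the design described for the weblogs example, not the original program: it parses a weblogs document from a string and collects the text of every url element. The class name WeblogUrlsSketch, the method findUrls(), and the abbreviated sample document are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WeblogUrlsSketch {

    // Returns the character data of every <url> element, in document order.
    static List<String> findUrls(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        List<String> urls = new ArrayList<>();
        // Matches elements named exactly "url"; changesUrl, imageUrl, etc. are ignored.
        NodeList urlElements = doc.getElementsByTagName("url");
        for (int i = 0; i < urlElements.getLength(); i++) {
            urls.add(urlElements.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) {
        String xml = "<weblogs><log>"
                + "<name>MozillaZine</name>"
                + "<url>http://www.mozillazine.org</url>"
                + "<changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>"
                + "</log></weblogs>";
        System.out.println(findUrls(xml)); // prints [http://www.mozillazine.org]
    }
}
```

getTextContent() is a DOM Level 3 method; with a Level 2 implementation the same value could be read from the text node children of each url element.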

CharacterData interface • Represents things that are basically text holders • Super interface of • Text • Comment • CDATASection

The CharacterData Interface • Note: applicable to Comment, Text and CDATASection public interface CharacterData extends Node { // content retrieval public String getData() throws DOMException; public int getLength(); public String substringData(int offset, int count) throws DOMException; // 0-based // content modification public void setData(String data) throws DOMException; public void appendData(String arg) throws DOMException; public void insertData(int offset, String arg) throws DOMException; public void deleteData(int offset, int count) throws DOMException; public void replaceData(int offset, int count, String arg) throws DOMException; }
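The editing methods use 0-based offsets counted in characters. A short sketch that applies them one after another to a single text node; the class name CharacterDataSketch and the method edit() are hypothetical, and the comments show the data after each call.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

public class CharacterDataSketch {

    static String edit() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Text t = doc.createTextNode("Hello World");   // Text is a CharacterData
        String first = t.substringData(0, 5);         // "Hello"; data is unchanged
        int length = t.getLength();                   // 11

        t.appendData("!");                            // Hello World!
        t.insertData(5, ",");                         // Hello, World!
        t.replaceData(7, 5, "DOM");                   // Hello, DOM!
        t.deleteData(5, 1);                           // Hello DOM!
        return first + "/" + length + "/" + t.getData();
    }

    public static void main(String[] args) {
        System.out.println(edit()); // prints Hello/11/Hello DOM!
    }
}
```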

Text Nodes • Represents the text content of • an element or • an attribute • Contains only pure text, no markup • Parsers will return a single maximal text node for each contiguous run of pure text. • Editing may change this.

The Text Interface public interface Text extends CharacterData { public Text splitText(int offset) throws DOMException; • split this into two, this becomes the first part and the last part is returned. String getWholeText() • Returns all text of Text nodes logically-adjacent to this node, concatenated in document order. boolean isElementContentWhitespace() • Returns whether this text node contains ignorable whitespace. Text replaceWholeText(String content) • Replaces the text of the current node and all logically-adjacent text nodes with the specified text. • return the Text node created with the new specified content.}
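A sketch of splitText() together with the DOM 3 whole-text methods. The class name TextSketch and the method report() are made-up names; the comments describe the children of the p element after each step.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

public class TextSketch {

    static String report() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element p = doc.createElement("p");
        doc.appendChild(p);
        Text head = doc.createTextNode("HelloWorld");
        p.appendChild(head);                          // children: "HelloWorld"

        Text tail = head.splitText(5);                // children: "Hello", "World"
        int afterSplit = p.getChildNodes().getLength();
        String parts = head.getData() + "+" + tail.getData();

        String whole = head.getWholeText();           // adjacent text joined: HelloWorld

        head.replaceWholeText("Bye");                 // children: "Bye"
        int afterReplace = p.getChildNodes().getLength();

        return afterSplit + " " + parts + " " + whole + " "
                + afterReplace + " " + p.getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(report()); // prints 2 Hello+World HelloWorld 1 Bye
    }
}
```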

CDATA section Nodes • Represents a CDATA section like this example from a hypothetical SVG tutorial: <p>You can use a default <code>xmlns</code> attribute to avoid having to add the svg prefix to all your elements:</p> <![CDATA[ <svg xmlns="http://www.w3.org/2000/svg" width="12cm" height="10cm"> <ellipse rx="110" ry="130" /> <rect x="4cm" y="1cm" width="3cm" height="6cm" /> </svg> ]]> • No children

The CDATASection Interface // no additional methods other than those from Text public interface CDATASection extends Text { }
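Whether a parsed document contains CDATASection nodes at all depends on the parser configuration. The sketch below, with a hypothetical class CdataSketch and method childNames(), parses the same input twice: once with default settings and once with JAXP's setCoalescing(true), which converts CDATA sections to text and merges them with adjacent text nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class CdataSketch {

    // Lists the node names of the root element's children, separated by commas.
    static String childNames(String xml, boolean coalescing) {
        Document doc;
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setCoalescing(coalescing);
            doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(n.getNodeName());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<p>a<![CDATA[<b>]]>c</p>";
        System.out.println(childNames(xml, false)); // prints #text,#cdata-section,#text
        System.out.println(childNames(xml, true));  // prints #text
    }
}
```

In both cases getTextContent() on the p element yields the same string, a<b>c; only the node structure differs.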
