Extensible Markup Language: XML

• XML was developed by the World Wide Web Consortium's (W3C's) XML Working Group (1996)
• XML is a portable, widely supported technology for describing data
• XML is quickly becoming a standard for data exchange between applications

15.2 XML Documents
• XML marks up data using tags, which are names enclosed in angle brackets < >
• All tags appear in pairs: <myTag> .. </myTag>
• Elements: units of data (i.e., everything included between a start tag and its corresponding end tag)
• Root element contains all other document elements
• Tag pairs cannot appear interleaved: <a><b></a></b> must be <a><b></b></a>
• Nested elements form hierarchies (trees)
Thus: what defines an XML document is not its tag names but that its tags are formatted in this way.
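The nesting rule can be checked mechanically. The following is a minimal sketch, written for Python 3 (the listings on these slides use Python 2 print statements); the helper name is_well_formed is made up for this illustration. It relies on the standard xml.dom.minidom parser raising ExpatError when the tags are not properly nested.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False.

    Hypothetical helper for illustration only.
    """
    try:
        parseString(xml_text)
    except ExpatError:
        return False
    return True


nested = "<a><b></b></a>"        # tag pairs properly nested
interleaved = "<a><b></a></b>"   # tag pairs interleaved

print(is_well_formed(nested))
print(is_well_formed(interleaved))
```

The first call reports a well-formed document; the second is rejected because </a> arrives while <b> is still open.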

article.xml

    <?xml version = "1.0"?>

    <article>

       <title>Simple XML</title>

       <date>December 21, 2001</date>

       <author>
          <firstName>John</firstName>
          <lastName>Doe</lastName>
       </author>

       <summary>XML is pretty easy.</summary>

       <content>In this chapter, we present a wide variety of examples
          that use XML.
       </content>

    </article>

• Optional XML declaration includes version information parameter
• XML comments are delimited by <!-- and -->
• Root element (article) contains all other document elements
• End tag has format </start tag name>

Because of the nice <tag> .. </tag> structure, the data can be viewed as organized in a tree:

    article
        title
        date
        author
            firstName
            lastName
        summary
        content

An I-sequence structured as XML

    <?xml version = "1.0"?>
    <!-- I-sequence structured with XML. -->
    <SEQUENCEDATA>
       <TYPE>dna</TYPE>
       <SEQ>
          <NAME>Aspergillus awamori</NAME>
          <ID>U03518</ID>
          <DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggccca
          acctcccatccgtgtctattgtaccctgttgcttcgg
          cgggcccgccgcttgtcggccgccgggggggcgcctctg
          ccccccgggcccgtgcccgccggagaccccaacacgaac
          actgtctgaaagcgtgcagtctgagttgattgaatgcaat
          cagttaaaactttcaacaatggatctcttggttccggc
          </DATA>
       </SEQ>
    </SEQUENCEDATA>

The corresponding tree:

    SEQUENCEDATA
        TYPE
        SEQ
            NAME
            ID
            DATA

Parsing and displaying XML
• XML is just another data format
• Do we need to write yet another parser? No more filters, please!
• No! XML is becoming standard
• Many different systems can read XML; not many systems can read our I-sequence format
• Thus, parsers exist already

XML document opened in Internet Explorer: each parent element/node can be expanded (plus sign) and collapsed (minus sign)

XML document opened in Mozilla Again: Each parent element/node can be expanded and collapsed (here by pressing the minus, not the element)

Attributes
• Data can also be placed in attributes: name/value pairs, with the value in quotes
• Element contact has the attribute type, which has the value "from"
• Empty elements do not contain character data. The tags of an empty element may be written as one, like this: <myTag />

letter.xml

    <?xml version = "1.0"?>

    <letter>
       <contact type = "from">
          <name>Jane Doe</name>
          <address1>Box 12345</address1>
          <address2>15 Any Ave.</address2>
          <city>Othertown</city>
          <state>Otherstate</state>
          <zip>67890</zip>
          <phone>555-4321</phone>
          <flag gender = "F" />
       </contact>

       <contact type = "to">
          <name>John Doe</name>
          <address1>123 Main St.</address1>
          <address2></address2>
          <city>Anytown</city>
          <state>Anystate</state>
          <zip>12345</zip>
          <phone>555-1234</phone>
          <flag gender = "M" />
       </contact>

       <salutation>Dear Sir:</salutation>

       <paragraph>It is our privilege to inform you about our new
          database managed with <technology>XML</technology>. This
          new system allows you to reduce the load on
          your inventory list server by having the client machine
          perform the work of sorting and filtering the data.
       </paragraph>

       <paragraph>Please visit our Web site for availability
          and pricing.
       </paragraph>

       <closing>Sincerely</closing>

       <signature>Ms. Doe</signature>
    </letter>
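As a rough illustration of how attribute values and empty elements look from a program, here is a small sketch in Python 3 (the slides' own listings are Python 2). It uses a shortened copy of letter.xml embedded as a string, and reads the attributes with getAttribute from the standard xml.dom.minidom module. The variable names are chosen for this example only.

```python
from xml.dom.minidom import parseString

# Shortened version of letter.xml, embedded so the example is self-contained.
LETTER = """<letter>
  <contact type="from">
    <name>Jane Doe</name>
    <flag gender="F" />
  </contact>
  <contact type="to">
    <name>John Doe</name>
    <flag gender="M" />
  </contact>
</letter>"""

document = parseString(LETTER)

contacts = []
for contact in document.getElementsByTagName("contact"):
    role = contact.getAttribute("type")             # attribute of <contact>
    name = contact.getElementsByTagName("name")[0].firstChild.nodeValue
    flag = contact.getElementsByTagName("flag")[0]  # empty element
    gender = flag.getAttribute("gender")
    # An empty element has attributes but no child nodes.
    contacts.append((role, name, gender, flag.hasChildNodes()))

for entry in contacts:
    print(entry)
```

Each tuple holds the contact type, the name, the gender attribute of the empty flag element, and whether that element has any children.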

Intermezzo 1
http://www.daimi.au.dk/~chili/CSS/Intermezzi/30.10.1.html
All files found from the Example Programs page
1. Finish this i2xml.py filter so it translates a list of Isequence objects into XML (following the above structure) and saves it in a file. Assume the list contains only one Isequence object. Use your module with this driver program and translate this Fasta file into XML. Load the resulting XML file into a browser.
2. Change the XML structure defined by your filter so that TYPE is no longer a tag by itself but an attribute of the SEQ tag (see page 496).
3. Modify your i2xml filter so that it can now translate a list of several Isequence objects into one XML file, using the structure from part 2. Test your program with the same driver on this Fasta file.

Solution

    from Isequence import Isequence
    import sys

    # Save a list of Isequences in XML
    class SaveToFiles:
        """Stores a list of ISequences in XML format"""

        def save_to_files(self, iseqlist, savefilename):
            try:
                savefile = open(savefilename, "w")
                print >> savefile, "<?xml version = \"1.0\"?>"
                print >> savefile, "<SEQUENCEDATA>"

                for seq in iseqlist:
                    print >> savefile, ' <SEQ type="%s">' % seq.get_type()
                    print >> savefile, " <NAME>%s</NAME>" % seq.get_name()
                    print >> savefile, " <ID>%s</ID>" % seq.get_id()
                    print >> savefile, " <DATA>%s</DATA>" % seq.get_sequence()
                    print >> savefile, " </SEQ>"

                print >> savefile, "</SEQUENCEDATA>"
                savefile.close()

            except IOError, message:
                sys.exit(message)
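One caveat with building XML through plain string formatting: if a name or ID ever contains characters such as & or <, the output is no longer well-formed. The sketch below (Python 3, whereas the solution above is Python 2) shows one possible way to guard against that with escape and quoteattr from the standard xml.sax.saxutils module. The function seq_to_xml and the sample name "Strain A & B <draft>" are invented for this illustration and are not part of the course solution.

```python
from xml.dom.minidom import parseString
from xml.sax.saxutils import escape, quoteattr


def seq_to_xml(seq_type, name, seq_id, data):
    """Format one sequence as a SEQ element, escaping special characters.

    Hypothetical helper; takes plain strings instead of an Isequence object.
    """
    lines = [
        " <SEQ type=%s>" % quoteattr(seq_type),  # quoteattr adds the quotes
        " <NAME>%s</NAME>" % escape(name),
        " <ID>%s</ID>" % escape(seq_id),
        " <DATA>%s</DATA>" % escape(data),
        " </SEQ>",
    ]
    return "\n".join(lines)


# Made-up name containing characters that are special in XML.
body = seq_to_xml("dna", "Strain A & B <draft>", "U03518", "aacctg")
xml_text = "<SEQUENCEDATA>\n%s\n</SEQUENCEDATA>" % body

# The escaped file parses, and the parser hands back the original text.
name_node = parseString(xml_text).getElementsByTagName("NAME")[0]
print(name_node.firstChild.nodeValue)
```

For pure DNA data the escaping changes nothing, so the original solution produces the same output on such input.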

solution XML file loaded in Internet Explorer

Parsers and trees
• We've already seen that XML markup can be displayed as a tree
• Some XML parsers exploit this. They
    • parse the file
    • extract the data
    • return it organized in a tree data structure called a Document Object Model

    article
        title
        date
        author
            firstName
            lastName
        summary
        content

15.4 Document Object Model (DOM)
• DOM parser retrieves data from XML document
• Hierarchical tree structure called a DOM tree
• Each component of an XML document represented as a tree node
• Parent nodes contain child nodes
• Sibling nodes have same parent
• Single root (or document) node contains all other document nodes

DOM tree of previous example (article.xml)

    article                (one single document root node; parent node of the nodes below)
        title              (title, date, author, summary and content are
        date                child nodes of article and sibling nodes of each other)
        author
            firstName
            lastName
        summary
        content

Fig. 15.6 Tree structure for article.xml.

Python provides a DOM parser!
• all nodes have name (of tag) and value
• text (incl. whitespace) represented in nodes with tag name #text

DOM tree of article.xml including the #text nodes:

    article
        #text
        title
            #text  (Simple XML)
        #text
        date
            #text  (Dec..2001)
        #text
        author
            #text
            firstName
                #text  (John)
            #text
            lastName
                #text  (Doe)
            #text
        #text
        summary
            #text  (XML..easy.)
        #text
        content
            #text  (In this..XML.)
        #text
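The whitespace #text nodes are easy to overlook. The following sketch (Python 3, unlike the Python 2 listings on the slides) parses a shortened article from a string and lists the children of the root twice: once as they are, and once keeping only element nodes by comparing nodeType with ELEMENT_NODE. The variable names are assumptions made for this example.

```python
from xml.dom.minidom import parseString

# Shortened article.xml; the line breaks and indentation between the tags
# become whitespace-only #text nodes in the DOM tree.
ARTICLE = """<article>
  <title>Simple XML</title>
  <date>December 21, 2001</date>
</article>"""

root = parseString(ARTICLE).documentElement

# Every child of the root, including whitespace text nodes.
all_names = [node.nodeName for node in root.childNodes]

# Only the element children (the tags themselves).
element_names = [
    node.nodeName
    for node in root.childNodes
    if node.nodeType == node.ELEMENT_NODE
]

print(all_names)
print(element_names)
```

Filtering on nodeType is one way to get the "ignoring whitespace nodes" view of the tree used on later slides.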

revisedfig16_04.py: parse XML document and load data into variable document

    import sys
    from xml.dom.minidom import parse          # stuff we have to import
    from xml.parsers.expat import ExpatError   # the book uses an old version

    ..  # << open xml file >>

    try:
        document = parse( file )
        file.close()
    except ExpatError:
        sys.exit( "Error processing XML file" )

    # get root element of the DOM tree
    rootElement = document.documentElement
    print "Here is the root element of the document: %s" % rootElement.nodeName

    # traverse all child nodes of root element
    for node in rootElement.childNodes:
        print node.nodeName

    # get first child node of root element
    child = rootElement.firstChild
    print "\nThe first child of root element is:", child.nodeName
    print "whose next sibling is:",

    # get next sibling of first child
    sibling = child.nextSibling
    print sibling.nodeName

    print "Text inside " + sibling.nodeName + " tag is",
    textnode = sibling.firstChild
    print textnode.nodeValue

    print "Parent node of %s is: %s" % ( sibling.nodeName, sibling.parentNode.nodeName )

• documentElement attribute refers to the root node
• nodeName refers to the element's tag name
• childNodes: list of a node's children
• Other node attributes: firstChild, nextSibling, nodeValue, parentNode

Program output

    Here is the root element of the document: article
    The following are its child elements:
    #text
    title
    #text
    date
    #text
    author
    #text
    summary
    #text
    content
    #text

    The first child of root element is: #text
    whose next sibling is: title
    Text inside "title" tag is Simple XML
    Parent node of title is: article

The text of the title element is stored in its #text child node, which is why the program reads it through firstChild:

    ..
    print "Text inside " + sibling.nodeName + " tag is",
    textnode = sibling.firstChild
    # print text value of sibling
    print textnode.nodeValue
    ..

Parsing XML sequence?
• We have the i2xml filter; we want xml2i also
• Don't have to write XML parser, Python provides one
• Thus, algorithm:
    • Open file
    • Use Python parser to obtain the DOM tree
    • Traverse tree to extract sequence information, build Isequence objects

Ignoring whitespace nodes, we have to search a tree like this:

    SEQUENCEDATA
        SEQ (type)
            NAME
            ID
            DATA
        SEQ (type)
            NAME
            ID
            DATA

xml2i filter, part 1 of 2

    from Isequence import Isequence
    import sys
    from xml.dom.minidom import parse
    from xml.parsers.expat import ExpatError

    class Parser:
        """Parses xml file, stores sequences in Isequence list"""

        def __init__( self ):
            self.iseqlist = []  # make empty list

        def parse_file( self, loadfilename ):
            try:
                loadfile = open( loadfilename, "r" )
            except IOError, message:
                sys.exit( message )

            # Use Python's own xml parser to parse xml file:
            try:
                dom = parse( loadfilename )
                loadfile.close()
            except ExpatError:
                sys.exit( "Couldn't parse xml file" )

            # now dom is our dom tree structure. Was the xml file a sequence file?
            if dom.documentElement.nodeName == "SEQUENCEDATA":
                # recursively search the parse tree:
                for child in dom.documentElement.childNodes:
                    self.traverse_dom_tree( child )
            else:
                sys.exit( "This is not a sequence file" )

            return self.iseqlist

xml2i filter, part 2 of 2 (method of class Parser)

        def traverse_dom_tree( self, node ):
            """Recursive method that traverses the DOM tree"""

            if node.nodeName == "SEQ":
                # marks the beginning of a new sequence
                self.iseq = Isequence()             # make new Isequence object
                self.iseqlist.append( self.iseq )   # add to list

                newformat = 0
                # the type should be an attribute of the SEQ tag.
                # go through all attributes of this node:
                for i in range( node.attributes.length ):
                    if node.attributes.item(i).name == "type":
                        # good, found a 'type' attribute
                        newformat = 1
                        # get the value of the attribute, put it in the Isequence:
                        self.iseq.set_type( node.getAttribute( "type" ) )
                        break

                if not newformat:
                    # we didn't find any 'type' attribute, this is old format
                    print "No 'type' attribute in element SEQ"

                # next recursively traverse the child nodes of this SEQ node:
                for child in node.childNodes:
                    self.traverse_dom_tree( child )

            elif node.nodeName == "NAME":
                self.iseq.set_name( node.firstChild.nodeValue )

            elif node.nodeName == "ID":
                self.iseq.set_id( node.firstChild.nodeValue )

            elif node.nodeName == "DATA":
                self.iseq.set_sequence( node.firstChild.nodeValue )

The subtree handled for each sequence:

    SEQ (type)
        NAME
        ID
        DATA
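As an alternative to the recursive traversal, the same information can be collected with getElementsByTagName, which searches a whole subtree for elements with a given tag name. The sketch below is written in Python 3 (the filter above is Python 2) and is not the course's solution: it stores each sequence in a plain dictionary rather than an Isequence object, parses from an embedded string with shortened sequence data, and the helper names text_of and read_sequences are invented here.

```python
from xml.dom.minidom import parseString

# Sequence file in the new format (type as attribute), with shortened DATA.
SEQUENCES = """<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcggaagg</DATA>
  </SEQ>
</SEQUENCEDATA>"""


def text_of(parent, tag):
    """Return the text of the first <tag> element below parent, or None."""
    elements = parent.getElementsByTagName(tag)
    if not elements or elements[0].firstChild is None:
        return None
    return elements[0].firstChild.nodeValue


def read_sequences(xml_text):
    """Return one dictionary per SEQ element found in xml_text."""
    dom = parseString(xml_text)
    found = []
    for seq in dom.getElementsByTagName("SEQ"):
        found.append({
            "type": seq.getAttribute("type"),
            "name": text_of(seq, "NAME"),
            "id": text_of(seq, "ID"),
            "data": text_of(seq, "DATA"),
        })
    return found


records = read_sequences(SEQUENCES)
print(records)
```

This version never sees the whitespace #text nodes, at the cost of searching the subtree once per tag name.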

What if the XML sequence format changes?
• Now the name of the finder of the sequence is also stored as a new tag:

    SEQUENCEDATA
        SEQ (type)
            NAME
            FOUNDBY
            ID
            DATA
        SEQ (type)
            NAME
            FOUNDBY
            ID
            DATA

Robustness of XML format
• Our xml2i filter still works:
    • Can't extract the finder information: it ignores the FOUNDBY node
    • But: doesn't crash! Still extracts other information
    • Easy to incorporate new info

        def traverse_dom_tree( self, node ):
            """Recursive method that traverses the DOM tree"""

            if node.nodeName == "SEQ":
                ..
                # next recursively traverse the child nodes of this SEQ node:
                for child in node.childNodes:
                    self.traverse_dom_tree( child )

            elif node.nodeName == "NAME":
                self.iseq.set_name( node.firstChild.nodeValue )

            elif node.nodeName == "ID":
                self.iseq.set_id( node.firstChild.nodeValue )

            elif node.nodeName == "DATA":
                self.iseq.set_sequence( node.firstChild.nodeValue )

    SEQ (type)
        NAME
        FOUNDBY
        ID
        DATA
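To make the "ignores unknown tags, easy to extend" point concrete, here is a small sketch in Python 3 (not the Python 2 filter from the slides). It uses a table of known tag names instead of an if/elif chain and stores the result in a dictionary instead of an Isequence object. The names FIELDS and read_seq, the field name found_by, and the ID value X1 are all made up for the illustration.

```python
from xml.dom.minidom import parseString

# Tags the reader knows about, mapped to the field they fill.
FIELDS = {"NAME": "name", "ID": "id", "DATA": "sequence"}

# New-format file containing the additional FOUNDBY tag.
NEW_FORMAT = (
    '<SEQUENCEDATA><SEQ type="dna">'
    "<NAME>HSBGPG</NAME><FOUNDBY>BiRC</FOUNDBY>"
    "<ID>X1</ID><DATA>cgagacgg</DATA>"
    "</SEQ></SEQUENCEDATA>"
)


def read_seq(seq_node, fields):
    """Collect the text of all known child tags; unknown tags are skipped."""
    record = {}
    for child in seq_node.childNodes:
        if child.nodeName in fields and child.firstChild is not None:
            record[fields[child.nodeName]] = child.firstChild.nodeValue
    return record


seq = parseString(NEW_FORMAT).getElementsByTagName("SEQ")[0]

# Old reader: does not know FOUNDBY, silently ignores it.
old_view = read_seq(seq, FIELDS)

# Extended reader: one extra table entry picks up the new information.
new_view = read_seq(seq, dict(FIELDS, FOUNDBY="found_by"))

print(old_view)
print(new_view)
```

The old table still yields name, ID and sequence from the new file; supporting the finder takes a single additional entry.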

Compare with extending Fasta format
Say that the Fasta format is modified so the finder appears in the second line after a >:

    >HSBGPG Human gene for bone gla protein (BGP)
    >BiRC
    CGAGACGGCGCGCGTCCCCTTCGGAGGCGCGGCGCTCTATTACGCGCGATCGACCC
    ..

Our Fasta parser would go wrong:

    for line in lines:
        if line[0] == '>':
            # new sequence starts
            items = line.split()
            # put new Isequence obj. in list
            ..
        elif self.iseq:
            # we are currently building an iseq object, extend its sequence
            self.iseq.extend_sequence( line.strip() )  # skip trailing newline
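The failure can be reproduced with a stripped-down stand-in for the Fasta parser. The sketch below is Python 3 and is not the course's parser: parse_fasta is a hypothetical function that follows the same rule (every line starting with > begins a new sequence) and stores records as dictionaries.

```python
def parse_fasta(lines):
    """Naive Fasta reader: every line starting with '>' begins a new record."""
    found = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line[0] == ">":
            # new sequence starts
            current = {"header": line[1:], "sequence": ""}
            found.append(current)
        elif current is not None:
            # we are currently building a record, extend its sequence
            current["sequence"] += line
    return found


# Modified format from the slide: the finder on a second '>' line.
modified = [
    ">HSBGPG Human gene for bone gla protein (BGP)",
    ">BiRC",
    "CGAGACGGCGCGCGTCCCCTTCGGAGGCGCGGCGCTCTATTACGCGCGATCGACCC",
]

records = parse_fasta(modified)
for record in records:
    print(record["header"], "->", record["sequence"])
```

Instead of one sequence with a finder, the reader reports two sequences: an empty one for HSBGPG and one named BiRC that carries the data.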

XML robust
• So, the good thing about XML is that it is robust because of its well-defined structure
• Widely used, i.e. this overall tag structure won't change
• Parsers available in Python already:
    • Read XML into a DOM tree
    • DOM tree can be traversed but also manipulated (see next slide)
    • Read XML using the so-called SAX method
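For comparison with the DOM approach, here is a minimal SAX sketch in Python 3, using the standard xml.sax module. With SAX no tree is built: the parser calls handler methods as it meets start tags, character data and end tags. The class name SeqNameHandler and the second organism name are invented for this example; the sample data is embedded as a byte string.

```python
import xml.sax


class SeqNameHandler(xml.sax.ContentHandler):
    """Collects the text of every NAME element while the parser streams by."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "NAME":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "NAME":
            self.names.append("".join(self._buffer))
            self._in_name = False


# "Example organism" is a made-up second entry.
DATA = (
    b"<SEQUENCEDATA>"
    b'<SEQ type="dna"><NAME>Aspergillus awamori</NAME><ID>U03518</ID></SEQ>'
    b'<SEQ type="dna"><NAME>Example organism</NAME><ID>X1</ID></SEQ>'
    b"</SEQUENCEDATA>"
)

handler = SeqNameHandler()
xml.sax.parseString(DATA, handler)
print(handler.names)
```

Because nothing is kept in memory except what the handler stores, this style suits very large files, while the DOM tree is more convenient when the data must be navigated or modified.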

See all the methods and attributes of a DOM tree on pages 537ff. It is possible to manipulate the DOM tree using these methods: add new nodes, remove nodes, set attributes, etc.
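A short sketch of such manipulation, in Python 3 with the standard xml.dom.minidom module (the slides' own listings are Python 2). It sets an attribute, adds a new node and removes an existing one, then serializes the tree with toxml. The small SEQ document is an assumption made for this example.

```python
from xml.dom.minidom import parseString

dom = parseString("<SEQ><NAME>Aspergillus awamori</NAME><ID>U03518</ID></SEQ>")
seq = dom.documentElement

# Set an attribute on the root element.
seq.setAttribute("type", "dna")

# Add a new element node with a text node inside it.
found_by = dom.createElement("FOUNDBY")
found_by.appendChild(dom.createTextNode("BiRC"))
seq.appendChild(found_by)

# Remove an existing node.
id_node = seq.getElementsByTagName("ID")[0]
seq.removeChild(id_node)

# Serialize the modified tree back to XML text.
result = seq.toxml()
print(result)
```

New nodes are always created through the document object (createElement, createTextNode) and only become part of the tree once they are appended to a parent.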

Remark: book uses old version of DOM parser
• XML examples in the book won't work (except the revised fig16.04)
• Look in the presented example programs to see what you have to import
• All the methods and attributes of a DOM tree on pages 537ff are the same

Intermezzo 2
http://www.daimi.au.dk/~chili/CSS/Intermezzi/30.10.2.html
1. Copy this file and take a look at it in your editor: /users/chili/CSS.E03/Intermezzi/data.xml. Any idea what this data is?
2. Open the file in a browser. Expand and collapse nodes by clicking the - and + symbols. Do you see the structure of the tree? Any idea what the data represents now?
3. Copy this program to the same directory. Run it and find the name of Jakob's mother's father's mother. See how the program works?
4. Modify the program so it reports the birth year of the current person as well as the name.
5. Enhance the program so the user can also go back to the son or daughter of the current person. See table on page 537.
6. If you have time: Enhance the program so it prints the current person's mother-in-law, if she exists.

Solution

    # Fragment: body of the program's main loop (the loop header is not shown on the slide)

    name = person.getAttribute( "n" )
    print( "%s" % name )

    if name != 'Jakob':
        print "%s's mother in law is" % name,
        parentNode = person.parentNode

        # parentNode is either an 'm' or an 'f' node. If it is a mother
        # node, we need the father node, and vice versa:
        if parentNode.nextSibling:
            spouse = parentNode.nextSibling.firstChild
        else:
            spouse = parentNode.previousSibling.firstChild

        # Now we need the mother of the spouse:
        for childNode in spouse.childNodes:
            if childNode.nodeName == 'm':
                print childNode.firstChild.getAttribute( 'n' )
                break

    input = raw_input( "Report (m)other or (f)ather or (o)ffspring of %s? " % name )

    if input != 'm' and input != 'f' and input != 'o':
        break

    if input == 'o':
        print "\n" + name + "'s offspring is",
        person = person.parentNode.parentNode
    else:
        for child in person.childNodes:
            if child.nodeName == input:
                if input == 'm':
                    print "\nMother of " + name + " is",
                elif input == 'f':
                    print "\nFather of " + name + " is",
                person = child.firstChild
                break
