XML And XPath

ailsa

  1. XML And XPath DSA Term 2 Week 14 DSA/2006/week 14

  2. Lecture overview
     • Matters arising
     • Character coding
     • Well-formed XML
     • Creating simple XML files
     • Placename to BBC code
     • Introduction to XPath

  3. Character Coding
     • Character set
     • ISO 8859 - 1 byte per character
       • 0-127 are ASCII
       • 128-255 vary depending on the part of the standard
       • 15 different character maps
       • ISO-8859-1 - Latin-1 - the default for HTML
       • ISO-8859-2 - Central European
       • A document must be in one encoding
       • Problem of mixing characters, e.g. an Arabic quotation in a Cyrillic text
     • UTF-8 - Unicode, 1-4 byte variable length, to support a huge range of international languages in a single code
       • ASCII is included as characters 0-127
       • Ensures that the internet is truly multi-lingual
       • Key invention by Ken Thompson of self-synchronisation, allowing character boundaries to be detected
     • Character references in HTML (here for the degree sign)
       • Named &deg;
       • Decimal &#176;
       • Hexadecimal &#xB0;
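The encoding points above can be checked with a small Python sketch. It is only an illustration using the standard library: it encodes the degree sign in ISO-8859-1 and UTF-8, looks at the bit pattern that makes UTF-8 self-synchronising, and decodes the three kinds of character reference. The variable names are made up for this example.

```python
import html

DEGREE = "\u00b0"  # the degree sign, code point 176 (hex B0)

# ISO-8859-1 stores the character in one byte; UTF-8 needs two.
latin1_bytes = DEGREE.encode("iso-8859-1")
utf8_bytes = DEGREE.encode("utf-8")
print(latin1_bytes)  # b'\xb0'
print(utf8_bytes)    # b'\xc2\xb0'

# ASCII characters (0-127) are encoded identically in both encodings.
ascii_same = "A".encode("iso-8859-1") == "A".encode("utf-8") == b"A"

# Self-synchronisation: every UTF-8 continuation byte has the bit pattern
# 10xxxxxx, so a decoder can always tell where a character starts.
lead_is_continuation = (utf8_bytes[0] & 0xC0) == 0x80
second_is_continuation = (utf8_bytes[1] & 0xC0) == 0x80

# Named, decimal and hexadecimal references decode to the same character.
references = ["&deg;", "&#176;", "&#xB0;"]
decoded = [html.unescape(ref) for ref in references]
```

The mask `0xC0` keeps the top two bits of a byte; comparing with `0x80` tests for the `10` prefix that marks a continuation byte.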

  4. Defining the Encoding
     • Encodings in HTML
       • In a meta tag
         <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
       • In the XML declaration
         <?xml version="1.0" encoding="ISO-8859-1"?>
       • In the HTTP Content-Type header
         Content-Type: text/html; charset=ISO-8859-1
     • Setting encoding in PHP
       header("Content-type: text/html; charset=UTF-8");
     • Setting encoding in the browser
       • Firefox: View / Character Encoding
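Why the declared encoding matters can be seen in a short Python sketch. It assumes Python's standard `xml.etree.ElementTree` parser rather than the PHP used in the workshop, and the `<temp>` element is invented for the example. The same bytes parse correctly when the declaration matches the real encoding and are rejected when it does not.

```python
import xml.etree.ElementTree as ET

# A tiny document containing a degree sign, stored as ISO-8859-1 bytes.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><temp>20\u00b0</temp>'
latin1_doc = text.encode("iso-8859-1")

# The parser reads the declaration and decodes the byte 0xB0 correctly.
root = ET.fromstring(latin1_doc)
print(root.text)

# The same bytes with a wrong declaration: 0xB0 on its own is not valid
# UTF-8, so the parser reports an error instead of guessing.
wrong_doc = latin1_doc.replace(b"ISO-8859-1", b"UTF-8")
try:
    ET.fromstring(wrong_doc)
    mismatch_detected = False
except ET.ParseError:
    mismatch_detected = True
```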

  5. Design a simple XML file
     • Design an XML vocabulary to represent pairs of place names and codes
       • Bristol 1263
       • Bath 1123
     • First review XML structure

  6. Example
     <MapSet>
       <Map id="P2" desc="P Block level 2">
         <room id="2P2">
           <area shape="rect" coords="118,39,138,68"/>
           <type>Staff Room</type>
           <occupant>Tony Solomonides</occupant>
         </room>
         <room id="2P3">
           <area shape="rect" coords="141,40,162,69"/>
           <type>Staff Room</type>
           <occupant>Richard Lawson</occupant>
         </room>
         <room id="2P4">
           <area shape="poly" coords="201,40,234,40,234,118,164,119,163,71,200,71"/>
           <type>Office</type>
           <occupant>Eleanor Gibbons</occupant>
           <occupant>Dee Evans</occupant>
           <occupant>Ali Jack</occupant>
         </room>
         ….
       </Map>
     </MapSet>

  7. Well-formed XML documents (1)
     Every XML document must be well-formed and must therefore adhere to the following rules (among others):
     • Every start-tag must have a matching end-tag.
     • Elements may nest but must not overlap.
       <name>Anna<em>Coffey</em></name> - √
       <name><em>Anna</name>Coffey</em> - ×
     • There must be exactly one root element.
     • Attribute values must be quoted.
     • An element must not have two attributes with the same name.
     • Comments and processing instructions may not appear inside tags.
     • No unescaped < or & signs may occur in the character data of an element.
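These rules can be tried out mechanically. The sketch below assumes Python's standard `xml.etree.ElementTree` parser, which rejects anything that is not well-formed; `is_well_formed` is a helper name invented for this example, and the last three test strings are illustrative rather than taken from the lecture.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

nested = "<name>Anna<em>Coffey</em></name>"       # elements nest properly
overlapping = "<name><em>Anna</name>Coffey</em>"  # elements overlap
unquoted = "<room id=2P2></room>"                 # attribute value not quoted
two_roots = "<a/><b/>"                            # more than one root element
bare_amp = "<type>Bed & Breakfast</type>"         # unescaped & in character data

results = {
    "nested": is_well_formed(nested),
    "overlapping": is_well_formed(overlapping),
    "unquoted": is_well_formed(unquoted),
    "two_roots": is_well_formed(two_roots),
    "bare_amp": is_well_formed(bare_amp),
}
print(results)
```

Only the first string should be accepted; each of the others breaks exactly one of the rules listed above.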

  8. Well-formed XML documents (2)
     Element names are case sensitive - <NAME>, <name>, <Name> & <NaMe> are four different element types.
     No white space in element names - <First Name> not allowed; <First_Name> OK.
     Element names cannot start with the letters "XML" or "xml" - reserved terms.
     Element names must start with a letter or an underscore.
     Element names cannot start with a number but numbers may be embedded within an element name - <2you> not allowed; <me2you> is OK.
     Attribute names are constrained by the above rules for element names.
     Entity references are used to substitute specific characters. There are five predefined entities built into XML:

     Entity   Char   Notes
     &amp;    &      Do not use inside processing instructions.
     &lt;     <      Use inside attribute values quoted with ".
     &gt;     >      Use after ]] in normal text and inside processing instructions.
     &quot;   "      Use inside attribute values quoted with ".
     &apos;   '      Use inside attribute values quoted with '.
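The predefined entities rarely need to be typed by hand. As a minimal sketch, assuming Python's standard library rather than anything used in the course, `xml.sax.saxutils.escape` replaces `&`, `<` and `>` in character data, and `quoteattr` returns an attribute value already quoted and escaped. The sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw_text = "Fish & chips < 5 pounds"
raw_desc = 'P Block "level 2"'

# escape() substitutes the entity references needed in character data.
escaped_text = escape(raw_text)
print(escaped_text)  # Fish &amp; chips &lt; 5 pounds

# quoteattr() chooses the quote characters and escapes the value to fit them.
document = f"<Map desc={quoteattr(raw_desc)}><type>{escaped_text}</type></Map>"

# Parsing the result turns the entity references back into the characters.
root = ET.fromstring(document)
round_trip_text = root.find("type").text
round_trip_desc = root.get("desc")
```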

  9. Errors
     • Look at the listing of the XML file and identify all the places which prevent this XML from being well-formed

  10. <Map id=P2 desc="P Block level 2'>
        <room id="2P2"> This is a nice big office
          <area rect coords="118,39,138,68">
          <typo>Staff Room</typo>
          <occupant>Tony Solomonides</occupant>
        </Room>
        <room id="2P3">
          <area rect coords="141,40,162,69"></area>
          <typo>Staff Room</typo>
          <occupant>”Richard Lawson”</occupant>
        </Room>
        <room id="2P4">
          <area poly coords="201,40,234,40,234,118,164,119,163,71,200,71"/>
          <typo>Office</typo>
          <occupant>Eleanor Gibbons</occupant>
          <person>Dee Evans</person>
          <occupant>Ali Jack</occupant
        </Room>
      ---

  11. Task
      • Draw the structure
        • Use ER notation
        • Attributes in the Entity
        • Crow's-foot notation for one-many, optional
      • Identify any restricted sets of values (enumerated types)
      • In the lab, QSEE will allow you to define the structure and generate the schema definition (XML Schema or DTD)

  13. XPath
      • Core language for selecting nodes in XML
      • Version 1.0 used in XSLT 1.0
        • client-side in browsers
        • Xalan engine
        • W3Schools tutorial is for XPath 1.0
        • SimpleXML in PHP
      • Version 2.0 used in XSLT 2.0
        • Saxon processor
        • XQuery 1.0
      • Differences
        • Core data structure in 2.0 is a sequence (of nodes or atomic values)
        • Full support for all XML Schema datatypes
        • Two kinds of equality operators
        • Larger function library

  14. XPath Language
      • Not a programming language
      • Expressions to be evaluated
      • Focus on
        • Navigation in a tree structure
        • Multiple directions or 'axes'
          • Down to children (child axis)
          • Up to parent (parent axis)
          • Down to attributes (attribute axis)
          • Across to siblings (following-sibling and preceding-sibling axes)
        • Operators
        • Functions

  16. XPath operators
      • Arithmetic operators: + - * div idiv mod
      • Value comparisons: eq, ne, lt, le, gt, ge
      • Sequence comparisons: =, !=
        • = is true if some item in one sequence equals some item in the other
        • != is true if some item in one sequence differs from some item in the other
        • (1,2,3) = (2,3,4) is true
        • (1,2,3) != (2,3,4) is also true
        • not((1,2,3) = (2,3,4)) is false
      • Logical operators: and, or, not()
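The behaviour of = and != on sequences is easier to see when written out as a search over pairs of items. The following is a Python analogy, not an XPath engine: `general_eq` and `general_ne` are names invented here, and each returns true as soon as one pair of items satisfies the comparison.

```python
from itertools import product

def general_eq(left, right):
    """Analogue of XPath '=': true if any pair of items is equal."""
    return any(a == b for a, b in product(left, right))

def general_ne(left, right):
    """Analogue of XPath '!=': true if any pair of items is unequal."""
    return any(a != b for a, b in product(left, right))

first, second = (1, 2, 3), (2, 3, 4)

eq_result = general_eq(first, second)       # 2 = 2 is one matching pair
ne_result = general_ne(first, second)       # 1 != 2 is one differing pair
not_eq_result = not general_eq(first, second)

# != only fails when every pair of items is equal.
all_same = general_ne((1, 1), (1,))
print(eq_result, ne_result, not_eq_result, all_same)
```

This is why `!=` is not the negation of `=` on sequences: both can be true at once, and `not(... = ...)` is the expression to use for "no item in common".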

  17. Large function library
      • count(seq), max(seq), min(seq), avg(seq)
        • count((1,2,3)) = 3
      • String functions
        • string-length('abc')
        • tokenize('a,b,c', ',')
        • string-join(('a','b','c'), ', ')

  18. Using the eXist database
      • eXist database as an XPath / XQuery engine
        • REST interface
          ..exist/rest/db/chriswallace/rooms?_query=//Map
        • Java client
        • Sandbox (using Ajax to do dynamic syntax checking)
      • Context is the whole database
      • The demo database includes
        • the whole text of Romeo and Juliet
        • the Mondial world database
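When a query is sent through the REST interface, characters such as [ ] @ and quotes need URL-encoding. The sketch below only builds the URL and does not contact any server. The host `example.org` is a placeholder assumption, since the slide elides the server address; the collection path and the `_query` parameter are the ones shown on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder host: the real server address is not given in the slides.
BASE = "http://example.org/exist/rest/db/chriswallace/rooms"

def rest_query_url(xpath: str, base: str = BASE) -> str:
    """Build a REST URL with the XPath expression URL-encoded in _query."""
    return base + "?" + urlencode({"_query": xpath})

expression = "//room[@id='2P4']/occupant"
url = rest_query_url(expression)
print(url)

# Decoding the query string gives back the original expression.
recovered = parse_qs(urlsplit(url).query)["_query"][0]
```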

  19. Examples
      • All rooms
        /MapSet/Map/room
        //room
      • Room 2P5
        //room[@id='2P5']
      • The occupants of room 2P4
        //room[@id='2P4']/occupant
      • The roomNo of the room which Colin Fudge occupies
        //room[occupant = 'Colin Fudge']/@id
      • The number of occupants of 2P4
        count(//room[@id='2P4']/occupant)
      • The floor of Ali Jack's room
        //room[occupant = 'Ali Jack']/../@desc
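Some of these queries can be tried without a database. The sketch below assumes Python's `xml.etree.ElementTree`, which implements only a subset of XPath 1.0: paths are written relative to the root element (`.//room` instead of `//room`), there is no `count()` function, and attributes are read with `.get()` instead of a trailing `/@id`. The data is a shortened copy of the rooms example, so rooms outside that excerpt (such as 2P5) are not present.

```python
import xml.etree.ElementTree as ET

ROOMS_XML = """
<MapSet>
  <Map id="P2" desc="P Block level 2">
    <room id="2P2">
      <area shape="rect" coords="118,39,138,68"/>
      <type>Staff Room</type>
      <occupant>Tony Solomonides</occupant>
    </room>
    <room id="2P4">
      <area shape="poly" coords="201,40,234,40,234,118,164,119,163,71,200,71"/>
      <type>Office</type>
      <occupant>Eleanor Gibbons</occupant>
      <occupant>Dee Evans</occupant>
      <occupant>Ali Jack</occupant>
    </room>
  </Map>
</MapSet>
"""

root = ET.fromstring(ROOMS_XML)  # root is the MapSet element

# All rooms: //room
all_room_ids = [room.get("id") for room in root.findall(".//room")]

# The occupants of room 2P4: //room[@id='2P4']/occupant
occupants = [o.text for o in root.findall(".//room[@id='2P4']/occupant")]

# The number of occupants: count(...) becomes len() on the result list.
occupant_count = len(occupants)

# The room which Ali Jack occupies: the predicate matches if ANY occupant
# child has that text, so rooms with several occupants work too.
ali_room_id = root.find(".//room[occupant='Ali Jack']").get("id")

# The floor of Ali Jack's room: '..' steps up to the parent Map element.
ali_floor = root.find(".//room[occupant='Ali Jack']/..").get("desc")

print(all_room_ids, occupants, occupant_count, ali_room_id, ali_floor)
```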

  20. Notes
      • Note how = tests whether a person is amongst the occupants
      • To 'serialise' an attribute use string()
      • See how ../ allows navigation to the parent element

  21. Examples for you
      • The room number for Richard Lawson
      • The coordinates of room 2P2
      • All rooms with poly shape
      • Who are Ali Jack's office mates?

  22. XML design
      • Rooms is a mixture of text elements and attributes.
      • Could be all attributes - what would change?
      • Could be no attributes - what would change?
      • For the workshop exercise use elements instead of attributes - it's simpler even if more verbose
      • Generally, what do the experts recommend?

  23. Workshop
      • Create a simple XML file containing pairs of place names and BBC codes
      • Change the PHP script to accept a placename
      • Read the new XML file and decode the name to get the code using the PHP SimpleXML interface and xpath()
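The lookup step can be sketched in Python before writing the PHP version. Everything about the vocabulary here is an assumption made for illustration: the element names `places`, `place`, `name` and `code` are invented, and only the two name/code pairs come from the earlier slide. It uses elements rather than attributes, as suggested for the workshop.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: one <place> per name/code pair, elements only.
PLACES_XML = """
<places>
  <place><name>Bristol</name><code>1263</code></place>
  <place><name>Bath</name><code>1123</code></place>
</places>
"""

def lookup_code(root, placename):
    """Return the code for a place name, or None if it is not listed."""
    if "'" in placename:
        # Sketch only: a name containing a quote would break the predicate
        # below, so it is treated as not found.
        return None
    return root.findtext(f"./place[name='{placename}']/code")

places = ET.fromstring(PLACES_XML)
bristol_code = lookup_code(places, "Bristol")
bath_code = lookup_code(places, "Bath")
missing_code = lookup_code(places, "Paris")
print(bristol_code, bath_code, missing_code)
```

In the PHP version the same predicate, of the form `place[name='...']/code`, would be the argument passed to `xpath()`.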
