XML


  1. XML eXtensible Markup Language http://www.cs.huji.ac.il/~dbi

  2. Introduction and Motivation

  3. XML vs. HTML
     • HTML is a HyperText Markup Language
       • Designed for a specific application, namely, presenting and linking hypertext documents
     • XML describes structure and content (“semantics”)
       • The presentation is defined separately from the structure and the content

  4. An Address Book as an XML Document
     <addresses>
       <person>
         <name>Donald Duck</name>
         <tel>04-828-1345</tel>
         <email>donald@cs.technion.ac.il</email>
       </person>
       <person>
         <name>Miki Mouse</name>
         <tel>03-426-1142</tel>
         <email>miki@yahoo.com</email>
       </person>
     </addresses>

  5. Main Features of XML
     • No fixed set of tags
       • New tags can be added for new applications
     • An agreed-upon set of tags can be used in many applications
       • Namespaces facilitate uniform and coherent descriptions of data
       • For example, a namespace for address books determines whether to use <tel> or <phone>

  6. Main Features of XML (cont’d)
     • XML has the concept of a schema
       • DTD and the more expressive XML Schema
     • XML is a data model
       • Similar to the semistructured data model
     • XML supports internationalization (Unicode) and platform independence (an XML file is just a character file)

  7. XML is Self-Describing Data
     • Traditionally, a data file is just a bit stream
     • Only a program that reads or writes this file has the details about
       • How to break the bit stream into records
       • How to break each record into fields
       • The type of each data field
     • Over the years, companies retained valuable data (e.g., on magnetic tapes), but lost the programs that had the above information
     • As a result, the data was practically lost
     • This cannot happen with XML data, because the tags that describe the structure are stored with the data

  8. XML is the Standard for Data Exchange
     • Web services (e.g., e-commerce) require exchanging data between various applications that run on different platforms
     • XML (augmented with namespaces) is the preferred syntax for data exchange on the Web

  9. XML is not Alone
     • XML Schemas strengthen the data-modeling capabilities of XML (in comparison to XML with only DTDs)
     • XPath is a language for accessing parts of XML documents
     • XLink and XPointer support cross-references
     • XSLT is a language for transforming XML documents into other XML documents (including XHTML, for displaying XML files)
       • Limited styling of XML can be done with CSS alone
     • XQuery is a language for querying XML documents

  10. The Two Facets of XML
     • Some XML files are just text documents with tags that denote their structure and include some metadata (e.g., an attribute that gives the name of the person who did the proofreading)
       • See an example on the next slide
       • XML is a subset of SGML (Standard Generalized Markup Language)
     • Other XML documents are similar to database files (e.g., an address book)

  11. XML can Describe the Structure of a Document
     <paper>
       <title>Complexity of Computations</title>
       <author>
         <name>M. O. Rabin</name>
         <institute>Hebrew University</institute>
       </author>
       <abstract> … </abstract>
       <section> … </section>
       <section> … </section>
       <references> … </references>
     </paper>

  12. XML Syntax
     W3Schools Resources on XML Syntax

  13. The Structure of XML
     • XML consists of tags and text
     • Tags come in pairs: <date> ... </date>
     • They must be properly nested
       • good: <date> ... <day> ... </day> ... </date>
       • bad: <date> ... <day> ... </date> ... </day>
     (<i> ... <b> ... </i> ... </b> is not valid HTML either, although browsers usually tolerate it)

  14. A Useful Abbreviation
     Abbreviating elements with empty contents:
     • <br/> for <br></br>
     • <hr width="10"/> for <hr width="10"></hr>
     For example:
     <family>
       <person id="lisa">
         <name>Lisa Simpson</name>
         <mother idref="marge"/>
         <father idref="homer"/>
       </person>
       ...
     </family>
     Note that a tag may have a set of attributes, each consisting of a name and a value
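The equivalence of the two forms can be checked with a parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the helper name `describe` is an assumption made for illustration and is not part of the slides.

```python
import xml.etree.ElementTree as ET

# The abbreviated and the long form of an empty element.
short_form = ET.fromstring('<hr width="10"/>')
long_form = ET.fromstring('<hr width="10"></hr>')


def describe(element):
    """Summarize an element as (tag, attributes, text, number of children)."""
    return (element.tag, dict(element.attrib), element.text, len(element))


# Both forms yield the same element as far as the parser is concerned.
print(describe(short_form) == describe(long_form))

# Attributes are name/value pairs attached to the start tag.
person = ET.fromstring(
    '<person id="lisa"><name>Lisa Simpson</name>'
    '<mother idref="marge"/><father idref="homer"/></person>'
)
print(person.attrib)
print(person.find("mother").get("idref"))
```

The parser exposes attributes as a dictionary, which matches the description of an attribute as a name and a value.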

  15. XML Text
     XML has only one “basic” type: text
     It is bounded by tags, e.g.,
       <title>The Big Sleep</title>
       <year>1935</year>    (1935 is still text)
     • XML text is called PCDATA (for parsed character data)
     • It uses Unicode; a character can also be written as a numeric character reference, e.g., &#x05DE; for the Hebrew letter Mem
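A short sketch of what “only text” means in practice, again using Python's standard `xml.etree.ElementTree`. The `<movie>` wrapper and the `<letter>` element are hypothetical names added for this illustration; U+05DE is the Unicode code point of the Hebrew letter Mem.

```python
import xml.etree.ElementTree as ET

movie = ET.fromstring(
    "<movie><title>The Big Sleep</title><year>1935</year>"
    "<letter>&#x05DE;</letter></movie>"
)

year = movie.find("year").text
letter = movie.find("letter").text

# The parser hands back a string; any numeric interpretation is up to the application.
print(type(year).__name__, year)
print(int(year) + 1)

# The numeric character reference is resolved to the character itself.
print(letter == "\u05de")
```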

  16. XML Structure
     • Nesting tags can be used to express various structures, e.g., a tuple (record):
     <person>
       <name>Lisa Simpson</name>
       <tel>02-828-1234</tel>
       <tel>054-470-777</tel>
       <email>lisa@cs.huji.ac.il</email>
     </person>

  17. XML Structure (cont’d)
     • We can represent a list by using the same tag repeatedly:
     <addresses>
       <person> … </person>
       <person> … </person>
       <person> … </person>
       <person> … </person>
       …
     </addresses>

  18. XML Structure (cont’d)
     <addresses>
       <person>
         <name>Donald Duck</name>
         <tel>04-828-1345</tel>
         <email>donald@cs.technion.ac.il</email>
       </person>
       <person>
         <name>Miki Mouse</name>
         <tel>03-426-1142</tel>
         <email>miki@yahoo.com</email>
       </person>
     </addresses>
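As a rough illustration of how nesting yields records and repetition yields lists, the sketch below reads one `<person>` into a Python dictionary. The function name `to_record` and the choice of a dictionary of lists are assumptions made for this example, not something the slides prescribe.

```python
import xml.etree.ElementTree as ET

PERSON = """
<person>
  <name>Lisa Simpson</name>
  <tel>02-828-1234</tel>
  <tel>054-470-777</tel>
  <email>lisa@cs.huji.ac.il</email>
</person>
"""


def to_record(person):
    """Group the children of a <person> by tag; a repeated tag becomes a longer list."""
    record = {}
    for child in person:
        record.setdefault(child.tag, []).append(child.text.strip())
    return record


lisa = to_record(ET.fromstring(PERSON))
print(lisa["name"])
print(lisa["tel"])
```

Every field is stored as a list here so that a tag that happens to occur once and a tag that occurs several times are handled uniformly.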

  19. Terminology
     The segment of an XML document between an opening and a corresponding closing tag is called an element
     <person>
       <name>Bart Simpson</name>
       <tel>02 – 444 7777</tel>
       <tel>051 – 011 022</tel>
       <email>bart@tau.ac.il</email>
     </person>
     [Slide callouts: “element”; “element, a sub-element of”; “not an element”]

  20. An XML Document is a Tree
     [Figure: a tree with root person and children name, tel, tel, email; the leaves hold the text Bart Simpson, 02 – 444 7777, 051 – 011 022, bart@tau.ac.il]
     • Leaves are either empty or contain PCDATA
     • Note that semistructured data models typically put the labels on the edges, and are arbitrary graphs and not just trees
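The tree view can be made concrete with a small preorder traversal. This is a sketch using `xml.etree.ElementTree`, which keeps labels on the nodes; the function name `outline` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

PERSON = (
    "<person><name>Bart Simpson</name><tel>02 – 444 7777</tel>"
    "<tel>051 – 011 022</tel><email>bart@tau.ac.il</email></person>"
)


def outline(element, depth=0):
    """Return (depth, tag, text) triples in document order (preorder)."""
    rows = [(depth, element.tag, (element.text or "").strip())]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


rows = outline(ET.fromstring(PERSON))
for depth, tag, text in rows:
    # Indentation mirrors the depth of the node in the tree.
    print("  " * depth + tag, text)
```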

  21. Mixed Content
     An element may contain a mixture of sub-elements and PCDATA
     <airline>
       <name>British Airways</name>
       <motto>World’s <dubious>favorite</dubious> airline</motto>
     </airline>
     • How many leaves are there in the corresponding tree?
     • How many leaves are empty?
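One way to see how a parser exposes mixed content is sketched below with `xml.etree.ElementTree`. That library stores the character data before the first child in `text` and the character data following a child in that child's `tail`; this is a property of this particular API, not of XML itself.

```python
import xml.etree.ElementTree as ET

motto = ET.fromstring(
    "<motto>World’s <dubious>favorite</dubious> airline</motto>"
)
dubious = motto.find("dubious")

print(repr(motto.text))    # character data before the sub-element
print(repr(dubious.text))  # character data inside the sub-element
print(repr(dubious.tail))  # character data after the sub-element

# itertext() walks all character data in document order.
full_text = "".join(motto.itertext())
print(full_text)
```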

  22. The Header Tag (the XML Declaration)
     • <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
       • The attributes must appear in this order; standalone is either "yes" or "no"
     • standalone="no" means that there is an external DTD
     • You can leave out the encoding attribute and the processor will use the UTF-8 default

  23. Processing Instructions
     <?xml version="1.0"?>
     <?xml-stylesheet href="doc.xsl" type="text/xsl"?>
     <!DOCTYPE doc SYSTEM "doc.dtd">
     <doc>Hello, world!<!-- Comment 1 --></doc>
     <?pi-without-data?>
     <!-- Comment 2 -->
     <!-- Comment 3 -->

  24. Using CDATA
     We want to see the text as is, even though it includes tags
     <HEAD1>Entering a Kennel Club Member</HEAD1>
     <DESCRIPTION>Enter the member by the name on his or her papers. Use the NAME tag. The NAME tag has two attributes. Common (all in lowercase, please!) is the dog's call name. Breed (also in all lowercase) is the dog's breed. Please see the breed reference guide for acceptable breeds. Your entry should look something like this:</DESCRIPTION>
     <EXAMPLE><![CDATA[<NAME common="freddy" breed="springer-spaniel">Sir Fredrick of Ledyard's End</NAME>]]></EXAMPLE>
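The effect of a CDATA section can be observed by parsing it. The sketch below uses a shortened version of the kennel-club example (the constant `RAW` is an abbreviation introduced here) and compares it with the same text written using escaped angle brackets via `xml.sax.saxutils.escape`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Shortened form of the NAME example; the tags here are meant to be shown, not parsed.
RAW = '<NAME common="freddy" breed="springer-spaniel">Sir Fredrick</NAME>'

with_cdata = ET.fromstring("<EXAMPLE><![CDATA[" + RAW + "]]></EXAMPLE>")
with_escapes = ET.fromstring("<EXAMPLE>" + escape(RAW) + "</EXAMPLE>")

# Inside CDATA the markup is kept as plain text, so EXAMPLE has no child elements.
print(with_cdata.text)
print(len(with_cdata))

# Escaping < and > by hand gives the same text; CDATA is only more convenient.
print(with_cdata.text == with_escapes.text)
```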

  25. A Complete XML Document
     <?xml version="1.0" encoding="UTF-8" standalone="no"?>
     <!DOCTYPE addresses SYSTEM "http://www.cs.huji.ac.il/~dbi/dbi-addresses.dtd">
     <addresses>
       <person>
         <name>Lisa Simpson</name>
         <tel>02-828-1234</tel>
         <tel>054-470-777</tel>
         <email>lisa@cs.huji.ac.il</email>
       </person>
     </addresses>

  26. Well-Formed XML Documents
     • An XML document (with or without a DTD) is well-formed if
       • Tags are syntactically correct
       • Every start tag has a matching end tag
       • Tags are properly nested
       • There is a single root element
       • A start tag does not have two occurrences of the same attribute
     An XML document must be well-formed
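These conditions can be tried out against a parser. The following is a minimal sketch that relies on `xml.etree.ElementTree` raising `ParseError` for input that is not well-formed; the sample documents and the name `is_well_formed` are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "good": "<date><day>1</day></date>",
    "missing end tag": "<date><day>1</date>",
    "bad nesting": "<date><day>1</date></day>",
    "two roots": "<date/><date/>",
    "repeated attribute": '<date day="1" day="2"/>',
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```

Only the first sample is accepted; each of the others violates one of the conditions listed above.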

  27. DTD (Document Type Definition)
     Imposing Structure on XML Documents
     (W3Schools on DTDs)

  28. Motivation
     • A DTD adds syntactical requirements in addition to the well-formed requirement
     • It helps in eliminating errors when creating or editing XML documents
     • It clarifies the intended semantics
     • It simplifies the processing of XML documents

  29. An Example
     • In an address book, where can a phone number appear?
       • Under <person>, under <name>, or under both?
     • If we have to check for all possibilities, processing takes longer and it may not be clear to whom a phone belongs
     • We would like to know that a phone number is allowed to appear under both a department and the manager of that department
     • If we don’t know that and there is only one phone number, we may not know whether it serves both the department and its manager or just one of them

  30. Document Type Definitions
     • Document Type Definitions (DTDs) impose structure on XML documents
     • There is some relationship between a DTD and a schema, but it is not close; hence the need for additional “typing” systems (XML schemas)
     • The DTD is a syntactic specification

  31. Example: An Address Book
     <person>
       <name>Homer Simpson</name>
       <greet>Dr. H. Simpson</greet>
       <addr>1234 Springwater Road</addr>
       <addr>Springfield USA, 98765</addr>
       <tel>(321) 786 2543</tel>
       <fax>(321) 786 2544</fax>
       <tel>(321) 786 2544</tel>
       <email>homer@math.springfield.edu</email>
     </person>
     • Exactly one name
     • At most one greeting
     • As many address lines as needed (in order)
     • Mixed telephones and faxes
     • As many email addresses as needed

  32. Specifying the Structure
     • name             to specify a name element
     • greet?           to specify an optional (0 or 1) greet element
     • name, greet?     to specify a name followed by an optional greet

  33. Specifying the Structure (cont’d)
     • addr*            to specify 0 or more address lines
     • tel | fax        a tel or a fax element
     • (tel | fax)*     0 or more repeats of tel or fax
     • email*           0 or more email elements

  34. Specifying the Structure (cont’d)
     • So the whole structure of a person entry is specified by
       name, greet?, addr*, (tel | fax)*, email*
     • This is known as a regular expression
     • Why is it important?

  35. Summary of Regular Expressions
     • A          The tag (i.e., element) A occurs
     • e1, e2     The expression e1 followed by e2
     • e*         0 or more occurrences of e
     • e?         Optional: 0 or 1 occurrences
     • e+         1 or more occurrences
     • e1 | e2    Either e1 or e2
     • (e)        Grouping
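Because a content model is a regular expression over element names, it can be mimicked with an ordinary regular-expression engine. The sketch below is only an illustration of that idea, not how a validating parser works: it encodes the children of a `<person>` as a string of `tag;` tokens (an encoding chosen here for convenience) and matches it against the person model from the earlier slides.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*  rewritten over "tag;" tokens.
PERSON_MODEL = re.compile(r"name;(greet;)?(addr;)*((tel|fax);)*(email;)*")


def matches_model(person):
    """Check the sequence of child tags of an element against the content model."""
    child_tags = "".join(child.tag + ";" for child in person)
    return PERSON_MODEL.fullmatch(child_tags) is not None


homer = ET.fromstring(
    "<person><name>Homer Simpson</name><greet>Dr. H. Simpson</greet>"
    "<addr>1234 Springwater Road</addr><addr>Springfield USA, 98765</addr>"
    "<tel>(321) 786 2543</tel><fax>(321) 786 2544</fax>"
    "<tel>(321) 786 2544</tel><email>homer@math.springfield.edu</email></person>"
)
out_of_order = ET.fromstring("<person><tel>1</tel><name>x</name></person>")

print(matches_model(homer))
print(matches_model(out_of_order))
```

The `;` separator cannot occur in an element name, so two different tag sequences never produce the same string.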

  36. The Definition of an Element Consists of Exactly One of the Following
     • A regular expression (as defined earlier)
     • EMPTY means that the element has no content
     • ANY means that content can be any mixture of PCDATA and elements defined in the DTD
     • Mixed content, which is defined as described on the next slide
     • (#PCDATA)

  37. The Definition of Mixed Content
     • Mixed content is described by a repeatable OR group
       (#PCDATA | element-name | …)*
     • Inside the group, no regular expressions, just element names
     • #PCDATA must be first, followed by 0 or more element names, separated by |
     • The group can be repeated 0 or more times

  38. An Address-Book XML Document with an Internal DTD
     <?xml version="1.0" encoding="UTF-8"?>
     <!DOCTYPE addressbook [
       <!ELEMENT addressbook (person*)>
       <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)>
       <!ELEMENT name (#PCDATA)>
       <!ELEMENT greet (#PCDATA)>
       <!ELEMENT address (#PCDATA)>
       <!ELEMENT tel (#PCDATA)>
       <!ELEMENT fax (#PCDATA)>
       <!ELEMENT email (#PCDATA)>
     ]>
     • The name of the DTD is addressbook
     • “Internal” means that the DTD and the XML document are in the same file
     • The syntax of a DTD is not XML syntax

  39. The Rest of the Address-Book XML Document
     <addressbook>
       <person>
         <name>Jeff Cohen</name>
         <greet>Dr. Cohen</greet>
         <email>jc@penny.com</email>
       </person>
     </addressbook>

  40. Regular Expressions
     • Each regular expression determines a corresponding finite-state automaton
     • Let’s start with a simpler example: name, addr*, email
     [Figure: finite-state automaton with transitions labeled name, addr, and email]
     • A double circle denotes an accepting state
     • This suggests a simple parsing program
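A possible shape for such a parsing program is sketched below for the expression name, addr*, email. The state numbering and the transition table are assumptions made for this example, since the automaton drawn on the slide is not recoverable from the transcript.

```python
# States: 0 = start, 1 = after name (addr may repeat here), 2 = after email (accepting).
TRANSITIONS = {
    (0, "name"): 1,
    (1, "addr"): 1,
    (1, "email"): 2,
}
ACCEPTING = {2}


def accepts(tags):
    """Run the automaton over a sequence of child tags."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:
            # No transition for this tag in the current state.
            return False
    return state in ACCEPTING


print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "addr"]))
```

Each child element is consumed with a single table lookup, so checking a sequence takes time linear in its length.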

  41. Another Example
     name, address*, (tel | fax)*, email*
     [Figure: finite-state automaton with transitions labeled name, address, tel, fax, and email]
     Adding in the optional greet further complicates things

  42. Deterministic Requirement
     • If element-type declarations are deterministic, documents are easier to parse
     • Formally, the Glushkov automaton is deterministic
       • The states of this automaton are the positions of the regular expression (semantic actions)
       • The transitions are based on the “follows set”

  43. Deterministic Requirement (cont’d)
     • The associated automata are succinct
     • A regular language may not have an associated deterministic regular expression, e.g.,
       <!ELEMENT ndeter ((movie|director)*,movie,(movie|director))>
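The determinism condition can be illustrated on the position sets of the ndeter example. In the sketch below the first set and the follow sets are written out by hand (they are not computed from the expression), and a position is modeled as a hypothetical (symbol, index) pair. The check simply looks for a set that offers two different positions with the same symbol.

```python
def is_deterministic(first, follow):
    """first: set of positions; follow: dict mapping a position to a set of positions.

    The expression is treated as deterministic if no set contains two
    different positions that carry the same symbol.
    """
    for positions in [first, *follow.values()]:
        symbols = [symbol for symbol, _ in positions]
        if len(symbols) != len(set(symbols)):
            return False
    return True


# (movie1 | director2)*, movie3, (movie4 | director5)
m1, d2, m3 = ("movie", 1), ("director", 2), ("movie", 3)
m4, d5 = ("movie", 4), ("director", 5)
ndeter_first = {m1, d2, m3}
ndeter_follow = {
    m1: {m1, d2, m3},
    d2: {m1, d2, m3},
    m3: {m4, d5},
    m4: set(),
    d5: set(),
}

# name1, addr2*, email3
n1, a2, e3 = ("name", 1), ("addr", 2), ("email", 3)
simple_first = {n1}
simple_follow = {n1: {a2, e3}, a2: {a2, e3}, e3: set()}

print(is_deterministic(ndeter_first, ndeter_follow))
print(is_deterministic(simple_first, simple_follow))
```

For ndeter, a leading movie could be position 1 or position 3, and the parser cannot tell which without looking ahead.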

  44. Some Things are Hard to Specify
     Each employee element should contain name, age and ssn elements in some order
     <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) | ... )>
     Suppose that there were many more fields!

  45. Some Things are Hard to Specify (cont’d)
     <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) | ... )>
     Suppose there were many more fields!
     • There are n! different orders of n elements
     • The size of the declaration is not even polynomial in n
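To get a feeling for the growth, the sketch below generates the “every order” declaration for a list of fields and prints n! for a few values of n. The function name `any_order_declaration` is an assumption made for illustration. The generated text is only meant to show the number of alternatives; as written, its alternatives share leading elements, so a real DTD would also have to be factored to satisfy the deterministic requirement.

```python
from itertools import permutations
from math import factorial


def any_order_declaration(element, fields):
    """Build an <!ELEMENT ...> declaration that lists every order explicitly."""
    alternatives = ["(" + ", ".join(order) + ")" for order in permutations(fields)]
    return "<!ELEMENT " + element + " (" + " | ".join(alternatives) + ")>"


decl = any_order_declaration("employee", ["name", "age", "ssn"])
print(decl)

# The number of alternatives is n!, which outgrows any polynomial in n.
for n in (3, 5, 10):
    print(n, factorial(n))
```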

  46. Specifying Attributes in the DTD
     <!ELEMENT height (#PCDATA)>
     <!ATTLIST height
       dimension CDATA #REQUIRED
       accuracy  CDATA #IMPLIED >
     • The dimension attribute is required
     • The accuracy attribute is optional
     • CDATA is the “type” of the attribute; it means “character data,” and may take any literal string as a value

  47. The Format of an Attribute Definition
     • <!ATTLIST element-name attr-name attr-type default-value>
     • The default value is given inside quotes

  48. Summary of Attribute Types
     • CDATA
     • (value | … | … ) is an enumeration of allowed values
     • ID, IDREF, IDREFS
       • to be explained later
     • ENTITY, ENTITIES
       • to be explained later
     • NMTOKEN, NMTOKENS, NOTATION

  49. Summary of Attribute Default Values
     • #REQUIRED means that the attribute must be included in the element
     • #IMPLIED means that the attribute is optional
     • #FIXED "value"
       • The given value (inside quotes) is the only possible one
     • "value"
       • The default value of the attribute if none is given
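The four kinds of default can be read as rules that turn the attributes written in a document into the attributes an application sees. The sketch below applies them by hand; it does not read a DTD. The dictionary `HEIGHT_ATTLIST` is a hypothetical encoding of an ATTLIST, and the `system` and `rounding` attributes are invented here so that all four cases appear.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of an ATTLIST: attribute name -> (kind, value).
HEIGHT_ATTLIST = {
    "dimension": ("#REQUIRED", None),
    "accuracy": ("#IMPLIED", None),
    "system": ("#FIXED", "metric"),
    "rounding": ("default", "none"),
}


def apply_defaults(element, attlist):
    """Return the effective attributes, or raise ValueError on a violation."""
    effective = dict(element.attrib)
    for name, (kind, value) in attlist.items():
        if kind == "#REQUIRED" and name not in effective:
            raise ValueError("missing required attribute: " + name)
        if kind == "#FIXED":
            # A fixed attribute is filled in when absent and must match when present.
            if effective.setdefault(name, value) != value:
                raise ValueError("attribute " + name + " must be " + value)
        if kind == "default":
            effective.setdefault(name, value)
        # "#IMPLIED": nothing to do, the attribute may simply be absent.
    return effective


height = ET.fromstring('<height dimension="cm">180</height>')
print(apply_defaults(height, HEIGHT_ATTLIST))
```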

  50. Recursive DTDs
     Each person should have a father and a mother. This leads to either infinite data or a person that is a descendant of herself.
     <!DOCTYPE genealogy [
       <!ELEMENT genealogy (person*)>
       <!-- In person, the first nested person is the mother and the second is the father -->
       <!ELEMENT person (name, dateOfBirth, person, person)>
       ...
     ]>
     What is the problem with this? A parser does not notice it!
