
XML Validation I DTDs

Robin Burke, ECT 360, Winter 2004


Presentation Transcript


  1. XML Validation I DTDs • Robin Burke • ECT 360 • Winter 2004

  2. Outline • History • Grammars / Regular expressions • DTDs • elements • attributes • entities • Declarations

  3. Validation • Why bother?

  4. SGML • SGML was designed for text documents • Only content of interest = text • Interest in data types is new (XML) • SGML used existing notation for programming languages • production-based grammar • Extended Backus-Naur Form (EBNF)

  5. The idea • Language consists of terminals • a, b, c • Set of productions • beginning with non-terminals • A, B, C • rules specifying how to generate sequences of terminals

  6. Example • A → aB • A → aBA • B → b • generates strings • ab, abab, ababab, etc.
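
The productions on this slide can be run mechanically. The following is a minimal sketch, not part of the original slides: it expands the leftmost non-terminal until only terminals remain, with a length cap (`max_len`, an assumed parameter) so that the recursion stops.

```python
# Productions from the slide: A -> aB | aBA, B -> b
PRODUCTIONS = {
    "A": ["aB", "aBA"],
    "B": ["b"],
}

def generate(symbols, max_len):
    """Yield every terminal string derivable from `symbols` with length <= max_len."""
    if len(symbols) > max_len:
        return  # each symbol produces at least one terminal, so this branch is too long
    for i, sym in enumerate(symbols):
        if sym in PRODUCTIONS:  # leftmost non-terminal
            for rhs in PRODUCTIONS[sym]:
                yield from generate(symbols[:i] + rhs + symbols[i + 1:], max_len)
            return
    yield symbols  # only terminals are left

strings = sorted(set(generate("A", 6)), key=len)
print(strings)
```

Raising the cap yields longer repetitions of "ab", which is all this grammar can produce.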

  7. Grammar • Can be used to efficiently parse a language • basis of all modern programming language parsing since Algol-60

  8. Java Language Spec • ClassOrInterfaceDeclaration: • ModifiersOpt (ClassDeclaration | InterfaceDeclaration) • InterfaceDeclaration: • interface Identifier [extends TypeList] InterfaceBody • TypeList: • Type { , Type} • InterfaceBody: • { {InterfaceBodyDeclaration} } • InterfaceBodyDeclaration: • (ModifiersOpt InterfaceMemberDecl )* • InterfaceMemberDecl: • InterfaceMethodOrFieldDecl • void Identifier VoidInterfaceMethodDeclaratorRest • ClassOrInterfaceDeclaration • InterfaceMethodOrFieldDecl: • Type Identifier InterfaceMethodOrFieldRest • InterfaceMethodOrFieldRest: • ConstantDeclaratorsRest • InterfaceMethodDeclaratorRest • InterfaceMethodDeclaratorRest: • FormalParameters BracketsOpt [throws QualifiedIdentifierList] ; • VoidInterfaceMethodDeclaratorRest: • FormalParameters [throws QualifiedIdentifierList] ;

  9. Grammar • XML • grammar-based syntax • adheres to EBNF • SGML • SGML had a more complex language definition syntax • HTML is defined the SGML way

  10. Regular expressions • Language for expressing patterns • Basic components • pattern elements • optional element = ? • repetition (1 or more) = + • repetition (0 or more) = * • choice = | • grouping = ( ) • sequence = ,

  11. Examples • (a, b)* • all strings "ab" "abab" etc. (and the empty string) • (a | b | c)+, q, (b | c)* • aaqb • bq • bqcccccccc
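
These patterns can be checked with Python's `re` module. This is a sketch under one assumption: the symbols are single letters, so the DTD-style sequence operator `,` can be translated to plain juxtaposition. The helper names `dtd_to_regex` and `matches` are made up for illustration. Note that the three sample strings are all accepted only when the final group is a choice, `(b | c)*`; with a sequence, `(b, c)*`, only "bq" would be.

```python
import re

def dtd_to_regex(pattern):
    """Translate a DTD-style pattern over single-letter symbols into a Python regex.

    ',' (sequence) becomes juxtaposition; ? + * | ( ) mean the same in both notations.
    """
    return re.compile(pattern.replace(",", "").replace(" ", ""))

def matches(pattern, text):
    return dtd_to_regex(pattern).fullmatch(text) is not None

samples = ["aaqb", "bq", "bqcccccccc"]
with_choice = [matches("(a | b | c)+, q, (b | c)*", s) for s in samples]
with_sequence = [matches("(a | b | c)+, q, (b, c)*", s) for s in samples]
print(with_choice)
print(with_sequence)
print(matches("(a, b)*", "abab"), matches("(a, b)*", "aba"))
```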

  12. Note • Regular expressions are different in different applications • Perl • Javascript • XML Schemas • DTDs only support • ?+*|,()

  13. EBNF • EBNF is a more compact version of BNF • it uses regular expressions to simplify grammar expression • A → aB • A → aBA • turns into • A → aB(A)? • only one production per non-terminal allowed
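
A single EBNF production maps naturally onto a single parsing function. The recursive-descent recognizer below is an illustrative sketch for A → aB(A)? and B → b; the function names are assumptions, not anything from the course material.

```python
def parse_B(s, pos):
    """B -> b. Return the position after the match, or None on failure."""
    return pos + 1 if s.startswith("b", pos) else None

def parse_A(s, pos=0):
    """A -> a B (A)?  Return the position after the match, or None on failure."""
    if not s.startswith("a", pos):
        return None
    pos = parse_B(s, pos + 1)
    if pos is None:
        return None
    rest = parse_A(s, pos)  # the optional trailing A
    return pos if rest is None else rest

def accepts(s):
    """True when the whole string is derived from A."""
    return parse_A(s) == len(s)

print([w for w in ["ab", "abab", "aba", "abb", ""] if accepts(w)])
```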

  14. DTDs • Use EBNF to specify structure of XML documents • Plus • attributes • entities • Syntax • holdover from SGML • Ugly

  15. DTD Syntax • <!ELEMENT element-name content_model> • Content model contains the RHS of the production rule • Example <!ELEMENT name (firstName, lastName)>

  16. DTD Syntax cont'd • Not XML • <! begins a declaration • No "content" • Empty elements not indicated

  17. Some special cases • Content can be any text • #PCDATA • Content can be anything at all • (useful for debugging) • ANY • Element has no content • EMPTY

  18. Example <grades> <grade> <student>Jane Doe</student> <assigned-grade>A</assigned-grade> </grade> <grade> <student>John Doe</student> <assigned-grade>A-</assigned-grade> </grade> </grades>

  19. Example <grades> <grade> <student>Jane Doe</student> <assigned-grade>A</assigned-grade> </grade> <grade> <student>John Doe</student> <assigned-grade>A-</assigned-grade> </grade> <grade> <student>Wayne Doe</student> <assigned-grade>I</assigned-grade> <reason>Alien abduction</reason> </grade> </grades>
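
The slides show the two grades documents without a DTD. One possible set of content models, which is an assumption rather than the course's answer, is `grades (grade+)` and `grade (student, assigned-grade, reason?)`, with every other element holding text only. The sketch below checks a document against those models by turning each content model into a regular expression over child element names. It uses only the standard library and is not a real validating parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models for the grades documents (not from the slides).
MODELS = {
    "grades": "(grade+)",
    "grade": "(student, assigned-grade, reason?)",
}

def model_to_regex(model):
    """Build a regex over a string of child names, each followed by one space."""
    body = re.sub(r"[A-Za-z_][\w.\-]*",
                  lambda m: "(?:" + re.escape(m.group(0)) + " )",
                  re.sub(r"\s+", "", model))
    return re.compile(body.replace(",", ""))  # ',' is sequence, i.e. juxtaposition

def valid(element):
    """Check the element and its descendants against MODELS."""
    model = MODELS.get(element.tag)
    children = list(element)
    if model is None:
        return not children  # unlisted elements are treated as (#PCDATA)
    encoded = "".join(child.tag + " " for child in children)
    if model_to_regex(model).fullmatch(encoded) is None:
        return False
    return all(valid(child) for child in children)

good = ET.fromstring(
    "<grades>"
    "<grade><student>Jane Doe</student><assigned-grade>A</assigned-grade></grade>"
    "<grade><student>Wayne Doe</student><assigned-grade>I</assigned-grade>"
    "<reason>Alien abduction</reason></grade>"
    "</grades>"
)
bad = ET.fromstring("<grades><grade><assigned-grade>A</assigned-grade></grade></grades>")
print(valid(good), valid(bad))
```

The `reason?` in the model is what lets the third grade in the second document carry an explanation while the others omit it.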

  20. Mixed content • Legal to have a content model with text and element data <story category="national" byline="Karen Wheatley"> <headline>President Meets with Congress</headline> <![CDATA[ The President met with Congressional leaders today in an effort to jump-start faltering budget negotiations. Sources described the mood of the meeting as "cordial". ]]> <full_text ref="news801" /> <image src="img2071.jpg" /> <image src="img2072.jpg" /> <image src="img2073.jpg" /> </story>

  21. CDATA? • Forgot to mention last week • Content that appears here will not be parsed • Can include arbitrary text including <, &, etc. • Only restriction • termination sequence • ]]>
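
A quick way to see what a CDATA section does is to parse one. In this sketch (the element name `note` and the sample text are made up), the parser hands back the characters inside the section as ordinary text, exactly as if they had been written with entity escapes.

```python
import xml.etree.ElementTree as ET

# Markup characters inside a CDATA section are not parsed.
raw = ET.fromstring("<note><![CDATA[if (a < b && c > d) { <not-a-tag> }]]></note>")

# The same text written with the predefined entities instead.
escaped = ET.fromstring(
    "<note>if (a &lt; b &amp;&amp; c &gt; d) { &lt;not-a-tag&gt; }</note>"
)

print(raw.text)
print(len(list(raw)))  # number of child elements
```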

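  22. Mixed content, cont'd • <!ELEMENT story (#PCDATA | headline | full_text | image)*> • A DTD allows mixed content only in this form: #PCDATA first, then a choice of element names, with * on the whole group • so the order and number of child elements cannot be constrained • Mixed content is usually discouraged • Makes transformations more difficult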

  23. Recursion • Unlike grammars • recursive formulation ≠ repetition • Difference between • <!ELEMENT students (student+)> • <!ELEMENT students (student, students?)>
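
The difference shows up in the documents each declaration accepts. As a sketch, with empty `student` elements used purely for brevity, the first document below fits `(student+)` and the second fits `(student, students?)`: the same three students, but nested one level deeper for each one.

```python
import xml.etree.ElementTree as ET

# Shape accepted by <!ELEMENT students (student+)>
flat = ET.fromstring("<students><student/><student/><student/></students>")

# Shape accepted by <!ELEMENT students (student, students?)>
nested = ET.fromstring(
    "<students><student/>"
    "<students><student/>"
    "<students><student/></students>"
    "</students></students>"
)

def depth(element):
    """Number of levels in the tree rooted at element."""
    return 1 + max((depth(child) for child in element), default=0)

print(depth(flat), depth(nested))
print(len(flat.findall("student")), len(nested.findall("student")))
```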

  24. Restriction • The grammar cannot be ambiguous • A → (a, b) | (a, c) • this makes the parser implementation difficult • Usually easy to make non-ambiguous • A → a, (b | c)
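
The rewrite changes the form of the rule, not the language. The sketch below uses Python regular expressions as stand-ins for the two content models and confirms by brute force that they accept the same strings over {a, b, c} up to length 3. The difference is that in the second form a parser knows what to do after reading "a" without looking ahead.

```python
import re
from itertools import product

ambiguous = re.compile(r"(ab)|(ac)")    # A -> (a, b) | (a, c)
deterministic = re.compile(r"a(b|c)")   # A -> a, (b | c)

# Every string over {a, b, c} of length 0 to 3.
candidates = ["".join(p) for n in range(4) for p in product("abc", repeat=n)]

same_language = all(
    (ambiguous.fullmatch(s) is None) == (deterministic.fullmatch(s) is None)
    for s in candidates
)
accepted = [s for s in candidates if deterministic.fullmatch(s)]
print(same_language, accepted)
```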

  25. Attribute lists • Declared separately from elements • Specification includes • name of the element • name of the attribute • attribute type • default

  26. Attribute types • Character data • CDATA • different from XML CDATA section! • Enumerated • (yes|no) • ID • must be unique in the document • IDREF • must refer to an id in the document • NMTOKEN • a restriction of CDATA to single "word" • Also IDREFS and NMTOKENS
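
The ID and IDREF rules are easy to state as a check. The sketch below is not a DTD validator: it assumes, for illustration only, that the ID attribute is called `id` and the IDREF attribute is called `ref`, and it reports duplicated IDs and references that point nowhere.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def check_ids(root, id_attr="id", ref_attr="ref"):
    """Return (duplicate IDs, dangling references) for the tree under root."""
    ids = Counter(e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None)
    duplicates = sorted(value for value, count in ids.items() if count > 1)
    dangling = sorted({
        e.get(ref_attr) for e in root.iter()
        if e.get(ref_attr) is not None and e.get(ref_attr) not in ids
    })
    return duplicates, dangling

doc = ET.fromstring(
    '<story>'
    '<image id="img1"/><image id="img1"/><image id="img2"/>'
    '<caption ref="img2"/><caption ref="img9"/>'
    '</story>'
)
duplicates, dangling = check_ids(doc)
print(duplicates, dangling)
```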

  27. Default declaration • #REQUIRED • #IMPLIED • means optional • Value • this becomes the default • #FIXED • value provided

  28. Examples <!ATTLIST img src CDATA #REQUIRED alt CDATA #REQUIRED align (left|right|center) "left" id ID #IMPLIED > <!ATTLIST timestamp time-zone NMTOKEN #IMPLIED>
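
Attribute defaults are visible even to a non-validating parser when the declarations sit in the document's internal DTD subset. The sketch below uses Python's `xml.parsers.expat` for this; the `page` wrapper element and the file names are made up for the example. Expat reads the `ATTLIST` and supplies the default `align="left"`, but because it does not validate, it would not complain about a missing `#REQUIRED` attribute.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE page [
  <!ELEMENT page (img*)>
  <!ELEMENT img EMPTY>
  <!ATTLIST img
      src   CDATA                #REQUIRED
      alt   CDATA                #REQUIRED
      align (left|right|center)  "left"
      id    ID                   #IMPLIED>
]>
<page>
  <img src="a.png" alt="first"/>
  <img src="b.png" alt="second" align="right" id="pic2"/>
</page>"""

seen = []  # attribute dictionaries of the img elements, in document order

def start(name, attrs):
    if name == "img":
        seen.append(dict(attrs))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start
parser.Parse(DOC, True)

for attrs in seen:
    print(attrs.get("align"), attrs.get("id"))
```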

  29. Entities • Like macros • content to be inserted • indicated with &name; • Predefined general entities • &amp; &lt; • essential part of XML • User-defined general entities • &disclaimer;

  30. Entities, cont'd • Parameter entities • can also be used to simplify DTD creation • or to combine DTDs • indicated with a % • Example from book • %Books; • %Mags;

  31. Defining entities • General entities • <!ENTITY name "content"> • Example <!ENTITY disclaimer "This is a work of fiction. Any resemblance to persons living or dead is unintentional.">
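
To see the macro-like behaviour, the entity can be declared in an internal DTD subset and the document parsed. This sketch uses Python's `xml.etree.ElementTree`; the `novel` and `legal` element names are assumptions made for the example. The user-defined entity and the predefined ones all arrive as plain text.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE novel [
  <!ENTITY disclaimer "This is a work of fiction. Any resemblance to persons living or dead is unintentional.">
]>
<novel><legal>&disclaimer;</legal><legal>&amp; &lt;</legal></novel>"""

novel = ET.fromstring(DOC)
legal = [element.text for element in novel.findall("legal")]
print(legal[0])
print(legal[1])
```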

  32. Defining entities, cont'd • Parameter entities <!ENTITY % name "content"> • or, more typically, <!ENTITY % name SYSTEM "url">

  33. Example

  34. Unparsed data • What about non-text data? • images, audio files • In XML • we define a notation • create a name and associate an application • suggestion to the application • how to interpret the unparsed data • not part of parsing operation

  35. Using Notation • <!NOTATION name SYSTEM "url"> • Example • <!NOTATION jpeg SYSTEM "IExplore.exe"> • declares the jpeg notation • Example • <!ENTITY photo53 SYSTEM "photo53.jpg" NDATA jpeg>
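
A parser does not open the image. It only reports the declarations to the application. The sketch below registers two `xml.parsers.expat` callbacks to collect them. The `story` element and its `picture` attribute are assumptions added so the document is complete, and the entity name is written without quotes, as XML requires.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE story [
  <!NOTATION jpeg SYSTEM "IExplore.exe">
  <!ENTITY photo53 SYSTEM "photo53.jpg" NDATA jpeg>
  <!ELEMENT story EMPTY>
  <!ATTLIST story picture ENTITY #IMPLIED>
]>
<story picture="photo53"/>"""

notations = {}  # notation name -> system identifier
unparsed = {}   # entity name -> (system identifier, notation name)

def on_notation(name, base, system_id, public_id):
    notations[name] = system_id

def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    unparsed[name] = (system_id, notation_name)

parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse(DOC, True)

print(notations)
print(unparsed)
```

What to do with "photo53.jpg" and the suggested helper application is left entirely to the program that receives these callbacks.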

  36. Notation, cont'd • Note that the content is defined in the DTD • not the document • binary data embedded in XML document • Not that useful in practice • more likely

  37. Typical Example <story category="national" byline="Karen Wheatley"> ... <full_text ref="news801" /> <image src="img2071.jpg" /> <image src="img2072.jpg" /> <image src="img2073.jpg" /> </story> • Now it is up to the application to do something appropriate with the src attribute

  38. A better solution • Use XLink

  39. DTD limitations • Not in XML • need a special parser for the DTD • No content type restrictions • #PCDATA can be anything • Element names must be globally unique • cannot reuse a common term at different places in the document • course-name • professor-name

  40. Lab
