
Processing of structured documents

Processing of structured documents. Spring 2001 Helena Ahonen-Myka. Course organization. 581290-5 laudatur course, 3 cu lectures (in Finnish) 27.2.-5.4. Tue 12-14, Thu 10-12 exceptions: no lectures 6. and 8.3. exercise sessions


Presentation Transcript


  1. Processing of structured documents Spring 2001 Helena Ahonen-Myka

  2. Course organization • 581290-5 laudatur course, 3 cu • lectures (in Finnish) • 27.2.-5.4. Tue 12-14, Thu 10-12 • exceptions: no lectures 6. and 8.3. • exercise sessions • 6.3.-5.4. Tue 10-12 A318 (in English?), Thu 12-14 C454 (in Finnish; 22.3. at 8-10) • course assistant: Olli Lahti • not obligatory

  3. Project work • an XML application that is constructed during the course • a framework is given in the first lecture • in connection with the exercises, more requirements are given • a report has to be returned by 12.4.

  4. Requirements • Exam (Wed 11.4. at 16-20): 45 points • Project: 15 points • Exercises: 5 extra points • Maximum of points: 60

  5. Outline (preliminary) • 1. Introduction • 2. Descriptions of structure • context-free grammars • XML DTD, XML Schema • 3. Programming interfaces • SAX, DOM • 4. Querying structured documents • XML Query

  6. Outline... • 5. Transforming structured documents • XSL (XSLT, formatting objects) • presentation issues • 6. Document architectures • 7. Metadata: RDF • 8. Compressing XML data • 9. ...

  7. 1. Introduction

  8. Structured documents • Document? • A structured representation of (textual) information on some medium • normally for a human reader • messages, manuals, memos, books… • also to/from/between applications • source code, program-generated mail, EDI (electronic data interchange) • static - dynamic

  9. Presentation and structure • Presentation informs the human reader about the meaning of text and the role of its parts • markup: indicating the presentation or the meaning of different parts of text • originally hand-written annotations for the typesetter • nowadays primarily codes embedded in digital documents

  10. Markup • Procedural markup • formatting commands (start boldface, produce an empty line, indent 5mm…) • Descriptive markup • indicating the logical structure of text using chosen names

  11. Structured documents? • Generally speaking any text is structured (punctuation, words, sentences…) • but especially descriptively marked-up documents… • especially if they adhere to a rigorous specification of structure.

  12. "Document":
      <memo importance="high" date="19990323">
        <from>Paul V. Biron</from>
        <to>Ashok Malhotra</to>
        <subject>Latest draft</subject>
        <body>
          We need to discuss the latest draft <emph>immediately</emph>.
          Either email me at <email>mailto:paul.v.biron@kp.org</email>
          or call <phone>555-9876</phone>
        </body>
      </memo>

  13. "Data":
      <invoice>
        <orderDate>19990121</orderDate>
        <shipDate>19990125</shipDate>
        <billingAddress>
          <name>Ashok Malhotra</name>
          <street>123 IBM Ave.</street>
          <city>Hawthorne</city>
          <state>NY</state>
          <zip>10532-0000</zip>
        </billingAddress>
        <voice>555-1234</voice>
        <fax>555-4321</fax>
      </invoice>

  14. <body>
        <p><b>Order date:</b> 19990121</p>
        <p><b>Shipping date:</b> 19990125</p>
        <p><b>Address:</b></p>
        <table>
          <tr><th>name<th>street<th>city<th>state<th>zip
          <tr><td>Ashok Malhotra <td>123 IBM Ave. <td>Hawthorne <td>NY <td>10532-0000
        </table>
        <p>Phone: 555-1234</p>
        <p>Fax: 555-4321</p>
      </body>

  15. Theses of structured documenting • Separation of structure and presentation • markup of structure and other (meta) information should be done • at creation time • for future needs • rigor of markup • automatization of processing
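To make the separation of structure and presentation concrete, here is a minimal Python sketch. It assumes the invoice markup of slide 13 and uses only the standard library; the function names `to_html` and `to_text` are hypothetical and the two layouts are just examples. The point is that one structured source can feed several presentations.

```python
import xml.etree.ElementTree as ET
from html import escape

INVOICE_XML = """<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>"""


def to_html(invoice):
    """Render the invoice as an HTML fragment (one possible layout)."""
    addr = invoice.find("billingAddress")
    cells = "".join(f"<td>{escape(child.text)}</td>" for child in addr)
    return (
        f"<p><b>Order date:</b> {escape(invoice.findtext('orderDate'))}</p>"
        f"<p><b>Shipping date:</b> {escape(invoice.findtext('shipDate'))}</p>"
        f"<table><tr>{cells}</tr></table>"
    )


def to_text(invoice):
    """Render the same invoice as a plain-text listing."""
    addr = invoice.find("billingAddress")
    lines = [
        f"Order date:    {invoice.findtext('orderDate')}",
        f"Shipping date: {invoice.findtext('shipDate')}",
        "Address:       " + ", ".join(child.text for child in addr),
    ]
    return "\n".join(lines)


invoice = ET.fromstring(INVOICE_XML)
html_view = to_html(invoice)
text_view = to_text(invoice)
print(html_view)
print(text_view)
```

Neither function changes the source document: a new layout only needs a new rendering function.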

  16. Advantages of structure • Better control over documents • guidance of writing, validation of structure • higher-precision retrieval (conditions for parts) • reuse of information • automated processing • control of uniform style

  17. Advantages of structure • Transport of documents between different environments and applications • archival of documents • storing in databases • multiuse of documents • different layout styles • paper, online, CD-ROM, pda • different versions

  18. Disadvantages of structure • Start-up costs • design of document structures • conversion of legacy (non-structured) documents • implementation/adaptation of tools, procedures and policies • attitudes of authors • from a producer of a final publication to an information-feeding clerk?

  19. 2. Project work • The goal: everyone builds a (non-trivial) XML application that can be used during the course to train different concepts and methods • Example: I would need a system to track the work of my Master’s thesis students

  20. A wish list: • I want to store information about my students, e.g., name, contact information, scheduled meetings and deadlines, comments, problems, ”deals”, links to the drafts and the homepages of the students, etc. • As a primary interface I’d like to have a web page (with forms)

  21. A wish list: functions • I want to add information using the HTML form on the web page (easily!) • I want to have a listing on the web page of 1) all the students 2) information about one student • I need also other listings (e.g. simple ASCII) for reporting the state of my students (or just a list of my current students)

  22. And now you... • Design an application that is somehow ”similar” to mine • set of persons (or other objects) with information (e.g. your customer contacts) • some parts free text • several different ways to use the data, e.g. several listings (both content and presentation)

  23. Requirements • More requirements follow later... • return a report by 12.4. • The report should include • (short) requirements analysis • descriptions of the structure (DTD, Schema) • other designs, architecture, ... • Some kind of a working prototype • not necessarily the whole system

  24. 3. Structure descriptions • Regular expressions, context-free grammars • XML Document type definitions • XML Schema

  25. Regular expressions • A way to describe set of strings over an alphabet (of chars, events, elements…) • many uses: • text searching (e.g. emacs, grep, perl) • in grammatical formalisms (e.g. XML DTDs) • relevant for document structures: what kind of structural content is allowed for different document components

  26. Regular expressions • A regular expression over alphabet Σ is either • ∅ (an empty set) • ε (epsilon; sometimes lambda, λ) • a, where a ∈ Σ • R | S (choice; sometimes R ∪ S) • R S (catenation) or • R* (Kleene closure) • where R and S are regular expressions

  27. Regular expressions • Regular expression E denotes a language (a set of strings) L(E): • L(∅) = ∅ (empty set) • L(ε) = {ε} (singleton set of empty string) • L(a) = {a} (singleton set of a ∈ Σ) • L(R|S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)} • L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)} • L(R*) = L(R)* = {x1…xn | xk ∈ L(R), k=1,…,n; n ≥ 0}
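The inductive definition of L(E) can be turned almost line by line into code. The following Python sketch is an illustration only: the tuple encoding of expressions and the function name `language` are assumptions, and because L(R*) is usually infinite the function only enumerates strings up to a given length.

```python
from itertools import product

# Regular expressions as nested tuples (a hypothetical encoding):
#   ("empty",)       the empty set
#   ("eps",)         the empty string
#   ("sym", a)       a single symbol a
#   ("alt", R, S)    choice R | S
#   ("cat", R, S)    catenation R S
#   ("star", R)      Kleene closure R*


def language(expr, max_len):
    """Return the strings of L(expr) with at most max_len symbols, as tuples."""
    kind = expr[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {()}
    if kind == "sym":
        return {(expr[1],)} if max_len >= 1 else set()
    if kind == "alt":
        return language(expr[1], max_len) | language(expr[2], max_len)
    if kind == "cat":
        left = language(expr[1], max_len)
        right = language(expr[2], max_len)
        return {x + y for x, y in product(left, right) if len(x + y) <= max_len}
    if kind == "star":
        # Repeatedly append non-empty strings of L(R) until nothing new fits.
        base = language(expr[1], max_len) - {()}
        result = {()}
        frontier = {()}
        while frontier:
            frontier = {
                x + y for x, y in product(frontier, base)
                if len(x + y) <= max_len
            } - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")


# (a | b)* a : all strings over {a, b} that end in a
expr = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
words = sorted("".join(w) for w in language(expr, 2))
print(words)  # ['a', 'aa', 'ba']
```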

  28. Example • top-level structure of a document: • Σ = {title, auth, date, sect} • title followed by an optional list of authors, followed by an optional date, followed by one or more sections: • title auth* (date | ε) sect sect* • common abbreviations: • E? = (E | ε); E+ = E E* • -> title auth* date? sect+
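The abbreviated expression title auth* date? sect+ can be tried out with Python's `re` module. This is a sketch under one assumption: a sequence of element names is encoded as a string in which every name is followed by a single space, so that names cannot run together. The helper name `matches_top_level` is hypothetical.

```python
import re

# title auth* date? sect+ over space-terminated element names.
TOP_LEVEL = re.compile(r"(title )(auth )*(date )?(sect )+")


def matches_top_level(names):
    """Check whether a sequence of element names fits title auth* date? sect+."""
    encoded = "".join(name + " " for name in names)
    return TOP_LEVEL.fullmatch(encoded) is not None


print(matches_top_level(["title", "auth", "auth", "date", "sect"]))  # True
print(matches_top_level(["title", "sect", "sect"]))                  # True
print(matches_top_level(["title", "auth", "date"]))                  # False
```

The last sequence is rejected because sect+ requires at least one section.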

  29. Context-free grammars • Used widely for syntax specification (programming languages) • G = (V, Σ, P, S) • V: the alphabet of the grammar G; V = Σ ∪ N • Σ: the set of terminal symbols; N = V − Σ: the set of nonterminal symbols • P: set of productions • S ∈ N: the start symbol

  30. Productions and derivations • Productions: A -> ω, where A ∈ N, ω ∈ V* • e.g. A -> aBa (1) • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> ω is a production of the grammar • e.g. AA => AaBa (assuming prod. 1 above)

  31. Language generated by a context-free grammar • γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ • The language generated by a CFG G: • L(G) = {w ∈ Σ* | S =>* w} • L(G) is a set of strings: to model structural elements, we consider parse trees
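A small Python sketch may help to see how L(G) arises from repeated direct derivations. The grammar below (S -> a S b | ε) is a hypothetical example, not one from the course, and the search relies on an assumption that holds for it: every production either adds terminals or is empty, so sentential forms can be pruned by their number of terminals. Only leftmost derivations are explored, which is enough to reach every string of the language.

```python
from collections import deque

# Hypothetical grammar for illustration: S -> a S b | epsilon,
# which generates { a^n b^n | n >= 0 }.
TERMINALS = {"a", "b"}
PRODUCTIONS = {"S": [("a", "S", "b"), ()]}
START = "S"


def generate(max_len):
    """Return all terminal strings w with S =>* w and len(w) <= max_len."""
    words = set()
    seen = {(START,)}
    queue = deque([(START,)])
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal of the sentential form.
        index = next((i for i, s in enumerate(form) if s not in TERMINALS), None)
        if index is None:
            words.add("".join(form))  # only terminals left: a word of L(G)
            continue
        for rhs in PRODUCTIONS[form[index]]:
            # One direct derivation step: replace the nonterminal by rhs.
            new_form = form[:index] + rhs + form[index + 1:]
            terminal_count = sum(1 for s in new_form if s in TERMINALS)
            if terminal_count <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return words


print(sorted(generate(4), key=len))  # ['', 'ab', 'aabb']
```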

  32. Parse trees of a CFG • Aka syntax trees or derivation trees • nodes labelled by symbols of V (or by ε): • internal nodes by nonterminals, root by start symbol • leaves using terminal symbols (or ε) • parent with label A can have children labeled by X1,…,Xk only if A -> X1…Xk is a production

  33. CFGs for document structures • Nonterminals represent document structures • e.g. • Ref -> AuthorList Title PublData • AuthorList -> Author AuthorList • AuthorList -> ε • problem: • obscures the relation of elements (the last Author several hierarchical levels away from Ref) -> solution: extended CFGs

  34. Extended CFGs (ECFGs) • Like CFGs, but right-hand-sides of productions are regular expressions over V, e.g. Ref -> Author* Title PublData • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> E is a production such that ω ∈ L(E) • e.g. Ref => Author Author Author Title PublData

  35. Language generated by an ECFG • Defined similarly to CFGs • Theorem: Languages generated by extended and ordinary CFGs are the same

  36. Parse trees of an ECFG • Similar to parse trees of an ordinary CFG, except that… • parent with label A can have children labeled by X1,…,Xk when A -> E is a production such that X1…Xk ∈ L(E) • -> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
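The difference between the two kinds of parse trees can be checked with a short Python sketch. It builds the tree for a reference with three authors, once with the recursive AuthorList productions and once with Ref -> Author* Title PublData, and compares how deep the last Author ends up. The tuple encoding of nodes, the function names and the author labels A1 to A3 are assumptions made for the illustration.

```python
def cfg_tree(authors):
    """Parse tree under Ref -> AuthorList Title PublData,
    AuthorList -> Author AuthorList | epsilon. Nodes are (label, children)."""
    author_list = ("AuthorList", [])  # AuthorList -> epsilon
    for name in reversed(authors):
        author_list = ("AuthorList", [("Author", [name]), author_list])
    return ("Ref", [author_list, ("Title", []), ("PublData", [])])


def ecfg_tree(authors):
    """Parse tree under Ref -> Author* Title PublData."""
    children = [("Author", [name]) for name in authors]
    return ("Ref", children + [("Title", []), ("PublData", [])])


def depth_of(tree, author_name, depth=0):
    """Depth of the Author node whose only child is author_name, or None."""
    label, children = tree
    if label == "Author" and children == [author_name]:
        return depth
    for child in children:
        if isinstance(child, tuple):
            found = depth_of(child, author_name, depth + 1)
            if found is not None:
                return found
    return None


authors = ["A1", "A2", "A3"]
print(depth_of(cfg_tree(authors), "A3"))   # 4
print(depth_of(ecfg_tree(authors), "A3"))  # 1
```

With the ordinary CFG the depth of the last author grows with the number of authors; with the ECFG every Author is a direct child of Ref.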

  37. What is XML? • W3C Recommendation Feb 1998 • metalanguage that can be used to define markup languages • gives syntax for defining extended context free grammars • XML documents that adhere to the ECFG are strings in the language • document types (grammars)- document instances (strings in the language)

  38. XML encoding of structure • XML document essentially a parenthesized linear encoding of a parse tree • corresponds to a preorder walk • start of inner node (element) A denoted by a start tag <A>, end denoted by end tag </A> • leaves are strings (or empty elements) • + certain extensions (especially attributes)
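The "parenthesized linear encoding" can be sketched in a few lines of Python. The sketch assumes a tree given as nested (label, children) pairs with strings as leaves, and it assumes that leaf strings contain no markup characters, so no escaping is done; the function name `serialize` and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET


def serialize(node):
    """Preorder walk: emit the start tag, then the children, then the end tag.
    A node is either a string (leaf) or a (label, children) pair."""
    if isinstance(node, str):
        return node
    label, children = node
    inner = "".join(serialize(child) for child in children)
    return f"<{label}>{inner}</{label}>"


tree = ("Ref", [
    ("Author", ["A. Author"]),
    ("Title", ["A title"]),
    ("PublData", ["2001"]),
])
xml_text = serialize(tree)
print(xml_text)

# Parsing recovers the tree: element tags in document order
# are the labels of the inner nodes in preorder.
root = ET.fromstring(xml_text)
print([element.tag for element in root.iter()])
```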

  39. Terminal symbols in practice • Leaves of parse trees are labeled by single characters (symbols of Σ) • too granular in practice: instead terminal symbols which stand for all values of a type • e.g. #PCDATA in XML for variable length content of data characters • richer data types in proposed XML schema formalisms

  40. XML: logical structure • Elements • correspond to internal nodes of the parse tree • unique root element -> document is a single parse tree • indicated by matching (case-sensitive!) tags <ElementTypeName>…</ElementTypeName> • can contain text and/or subelements • can be empty: • <elem-type></elem-type> • <br />

  41. Logical structure • Attributes • name-value pairs attached to elements • "metadata", usually not treated as content • e.g. <div class="preface" date="990126"> • also: • <!-- comments --> • <?note this text would be passed to the application as a processing instruction named 'note'?>
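How elements, attributes and mixed content look to a program can be seen with Python's standard `xml.etree.ElementTree` module. The sketch uses an abbreviated version of the memo from slide 12 (the email and phone parts are left out to keep it short).

```python
import xml.etree.ElementTree as ET

MEMO_XML = """<memo importance="high" date="19990323">
  <from>Paul V. Biron</from>
  <to>Ashok Malhotra</to>
  <subject>Latest draft</subject>
  <body>We need to discuss the latest draft <emph>immediately</emph>.</body>
</memo>"""

memo = ET.fromstring(MEMO_XML)

# Attributes are name-value pairs attached to the element, separate from content.
print(memo.tag, memo.attrib)

# Subelements are the children of the element node.
print([child.tag for child in memo])

# Mixed content: text before <emph>, the emph element, and the text after it.
body = memo.find("body")
emph = body.find("emph")
print(repr(body.text), repr(emph.text), repr(emph.tail))

# itertext() yields all character data below an element in document order.
print("".join(body.itertext()))
```

ElementTree stores the text that follows a subelement in that subelement's `tail`, which is why the final period belongs to `emph.tail` rather than to `body`.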

  42. Document type declaration • Provides a grammar (document type definition, DTD) for a class of documents • syntax: • <!DOCTYPE root-type-name SYSTEM "ex.dtd" [ <!-- internal subset may come here --> ]> • the external subset is in the file ex.dtd • external and internal subset make up the DTD; internal has higher precedence

  43. XML declaration • <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

  44. Defining the structure: DTD • document type definition (DTD) • content model for each element • describes how the elements are formed from the other elements and text • defines which attributes an element may/must have; default values • content models are regular expressions

  45. Markup declarations • Element type declarations (similar to productions of ECFGs) • attribute-list declarations (for declared element types) • entity declarations • notation declarations

  46. Element type declarations • The general form is • <!ELEMENT elem-type-name (E)> • where E is a content model = regular expression over element names

  47. Regular expression syntax • + : 1 or more • * : 0 or more • ? : 0 or 1 • | : choice (one has to be chosen) • () : grouping • , : order

  48. Examples of definitions • <!ELEMENT name (fname+, lname)> • <!ELEMENT address (name, street, ((city, state, zipcode) | (zipcode, city)))> • <!ELEMENT contact (address, phone*, email?)> • <!ELEMENT contact2 (address | phone | email)*>

  49. DTD for the Invoice example
      <!DOCTYPE invoice [
        <!ELEMENT invoice (orderDate, shipDate, billingAddress, voice*, fax?)>
        <!ELEMENT orderDate (#PCDATA)>
        <!ELEMENT shipDate (#PCDATA)>
        <!ELEMENT billingAddress (name, street, city, state, zip)>
        <!ELEMENT voice (#PCDATA)>
        <!ELEMENT fax (#PCDATA)>
        <!ELEMENT name (#PCDATA)>
        <!ELEMENT street (#PCDATA)>
        <!ELEMENT city (#PCDATA)>
        <!ELEMENT state (#PCDATA)>
        <!ELEMENT zip (#PCDATA)>
      ]>
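Since content models are regular expressions, checking an element against its declaration amounts to matching the sequence of its children against the model. The Python standard library has no validating XML parser, so the following is only a sketch of the idea: the content models of the invoice DTD are translated by hand into Python regular expressions, and the names `CONTENT_MODELS` and `validate` are hypothetical. It does not check for stray character data inside element-only content.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the invoice DTD, hand-translated into Python regular
# expressions over child element names (each name followed by one space).
# None stands for (#PCDATA): character data only, no child elements.
CONTENT_MODELS = {
    "invoice": r"orderDate shipDate billingAddress (voice )*(fax )?",
    "billingAddress": r"name street city state zip ",
    "orderDate": None, "shipDate": None, "voice": None, "fax": None,
    "name": None, "street": None, "city": None, "state": None, "zip": None,
}


def validate(element):
    """Return a list of error messages; an empty list means the tree conforms."""
    if element.tag not in CONTENT_MODELS:
        return [f"undeclared element type: {element.tag}"]
    model = CONTENT_MODELS[element.tag]
    children = "".join(child.tag + " " for child in element)
    errors = []
    if model is None:
        if children:
            errors.append(f"{element.tag}: child elements not allowed")
    elif re.fullmatch(model, children) is None:
        errors.append(f"{element.tag}: unexpected children: {children.strip()}")
    for child in element:
        errors.extend(validate(child))
    return errors


good = ET.fromstring(
    "<invoice><orderDate>19990121</orderDate><shipDate>19990125</shipDate>"
    "<billingAddress><name>Ashok Malhotra</name><street>123 IBM Ave.</street>"
    "<city>Hawthorne</city><state>NY</state><zip>10532-0000</zip></billingAddress>"
    "<voice>555-1234</voice><fax>555-4321</fax></invoice>"
)
bad = ET.fromstring(
    "<invoice><shipDate>19990125</shipDate><orderDate>19990121</orderDate></invoice>"
)
print(validate(good))  # []
print(validate(bad))   # ['invoice: unexpected children: shipDate orderDate']
```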

  50. Attribute-list declarations • Name, data type and possible default value for each attribute for a given element type • Example:
      <!ATTLIST FIG
        id    ID        #IMPLIED
        descr CDATA     #REQUIRED
        class (a | b | c) "a">
      • semantics mainly up to the application
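What the FIG declaration asks of a processor can be sketched in Python: #REQUIRED attributes must be present, #IMPLIED ones may be missing, enumerated attributes must take one of the listed values, and a declared default is supplied when the attribute is absent. The dictionary encoding and the name `apply_attlist` are assumptions for the illustration, and the sketch does not check that ID values are unique within a document.

```python
# The FIG attribute list as a plain dictionary (a hypothetical encoding):
# name -> (allowed values or None for CDATA/ID, default kind, default value)
FIG_ATTLIST = {
    "id": (None, "#IMPLIED", None),
    "descr": (None, "#REQUIRED", None),
    "class": ({"a", "b", "c"}, "default", "a"),
}


def apply_attlist(attributes, attlist=FIG_ATTLIST):
    """Return (effective attributes, errors) for one element's attributes."""
    effective = dict(attributes)
    errors = []
    for name in attributes:
        if name not in attlist:
            errors.append(f"undeclared attribute: {name}")
    for name, (allowed, kind, default) in attlist.items():
        if name not in effective:
            if kind == "#REQUIRED":
                errors.append(f"missing required attribute: {name}")
            elif kind == "default":
                effective[name] = default  # supply the declared default
        elif allowed is not None and effective[name] not in allowed:
            errors.append(f"invalid value for {name}: {effective[name]}")
    return effective, errors


print(apply_attlist({"descr": "A figure"}))
print(apply_attlist({"id": "f1", "class": "d"}))
```

The first call yields the attributes with class defaulted to "a" and no errors; the second reports the missing descr and the invalid class value.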
