Processing Structured Documents Course - Spring 2001

Processing of structured documents Spring 2001 Helena Ahonen-Myka

Course organization • 581290-5 laudatur course, 3 cu • lectures (in Finnish) • 27.2.-5.4. Tue 12-14, Thu 10-12 • exceptions: no lectures 6. and 8.3. • exercise sessions • 6.3.-5.4. Tue 10-12 A318 (in English?), Thu 12-14 C454 (in Finnish; 22.3. at 8-10) • course assistant: Olli Lahti • not obligatory

Project work • an XML application that is constructed during the course • a framework is given in the first lecture • in connection with the exercises, more requirements are given • a report has to be returned by 12.4.

Requirements • Exam (Wed 11.4. at 16-20): 45 points • Project: 15 points • Exercises: 5 extra points • Maximum of points: 60

Outline (preliminary) • 1. Introduction • 2. Descriptions of structure • context-free grammars • XML DTD, XML Schema • 3. Programming interfaces • SAX, DOM • 4. Querying structured documents • XML Query

Outline... • 5. Transforming structured documents • XSL (XSLT, formatting objects) • presentation issues • 6. Document architectures • 7. Metadata: RDF • 8. Compressing XML data • 9. ...

1. Introduction

Structured documents • Document? • A structured representation of (textual) information on some medium • normally for a human reader • messages, manuals, memos, books… • also to/from/between applications • source code, program-generated mail, EDI (electronic data interchange) • static - dynamic

Presentation and structure • Presentation informs the human reader about the meaning of text and the role of its parts • markup: indicating the presentation or the meaning of different parts of text • originally hand-written annotations for the typesetter • nowadays primarily codes embedded in digital documents

Markup • Procedural markup • formatting commands (start boldface, produce an empty line, indent 5mm…) • Descriptive markup • indicating the logical structure of text using chosen names

Structured documents? • Generally speaking any text is structured (punctuation, words, sentences…) • but especially descriptively marked-up documents… • especially if they adhere to a rigorous specification of structure.

"Document":
<memo importance="high" date="19990323">
  <from>Paul V. Biron</from>
  <to>Ashok Malhotra</to>
  <subject>Latest draft</subject>
  <body>
    We need to discuss the latest draft <emph>immediately</emph>.
    Either email me at <email>mailto:paul.v.biron@kp.org</email>
    or call <phone>555-9876</phone>
  </body>
</memo>

"Data":
<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>

<body>
  Order date: 19990121
  Shipping date: 19990125
  Address:
  <table>
    <tr><th>name<th>street<th>city<th>state<th>zip
    <tr><td>Ashok Malhotra <td>123 IBM Ave. <td>Hawthorne <td>NY <td>10532-0000
  </table>
  Phone: 555-1234
  Fax: 555-4321
</body>

Theses of structured documenting • Separation of structure and presentation • markup of structure and other (meta) information should be done • at creation time • for future needs • rigor of markup • automatization of processing

Advantages of structure • Better control over documents • guidance of writing, validation of structure • higher-precision retrieval (conditions for parts) • reuse of information • automated processing • control of uniform style
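The point about higher-precision retrieval can be illustrated with the invoice example from the introduction. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is part of the course material); the variable names are illustrative only. With descriptive markup, a part of the document is addressed by its element name rather than by its position in a layout table.

```python
import xml.etree.ElementTree as ET

# The descriptively marked-up invoice from the introduction.
INVOICE = """<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>"""

invoice = ET.fromstring(INVOICE)

# Conditions on parts: ask for a named element instead of "the fifth table cell".
zip_code = invoice.findtext("billingAddress/zip")
ship_date = invoice.findtext("shipDate")
```

In the table-based version of the same data, the zip code could only be found by counting cells, which breaks as soon as the layout changes.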

Advantages of structure • Transport of documents between different environments and applications • archival of documents • storing in databases • multiuse of documents • different layout styles • paper, online, CD-ROM, pda • different versions

Disadvantages of structure • Start-up costs • design of document structures • conversion of legacy (non-structured) documents • implementation/adaptation of tools, procedures and policies • attitudes of authors • from a producer of a final publication to an information-feeding clerk?

2. Project work • The goal: everyone builds a (non-trivial) XML application that can be used during the course to train different concepts and methods • Example: I would need a system to track the work of my Master’s thesis students

A wish list: • I want to store information about my students, e.g., name, contact information, scheduled meetings and deadlines, comments, problems, ”deals”, links to the drafts and the homepages of the students, etc. • As a primary interface I’d like to have a web page (with forms)

A wish list: functions • I want to add information using the HTML form on the web page (easily!) • I want to have a listing on the web page of 1) all the students 2) information about one student • I need also other listings (e.g. simple ASCII) for reporting the state of my students (or just a list of my current students)

And now you... • Design an application that is somehow ”similar” to mine • set of persons (or other objects) with information (e.g. your customer contacts) • some parts free text • several different ways to use the data, e.g. several listings (both content and presentation)

Requirements • More requirements follow later... • return a report by 12.4. • The report should include • (short) requirements analysis • descriptions of the structure (DTD, Schema) • other designs, architecture, ... • Some kind of a working prototype • not necessarily the whole system

3. Structure descriptions • Regular expressions, context-free grammars • XML Document type definitions • XML Schema

Regular expressions • A way to describe sets of strings over an alphabet (of chars, events, elements…) • many uses: • text searching (e.g. emacs, grep, perl) • in grammatical formalisms (e.g. XML DTDs) • relevant for document structures: what kind of structural content is allowed for different document components

Regular expressions • A regular expression over alphabet Σ is either • ∅ (the empty set) • ε (epsilon; sometimes written λ, lambda) • a, where a ∈ Σ • R | S (choice; sometimes written R ∪ S) • R S (catenation) or • R* (Kleene closure) • where R and S are regular expressions

Regular expressions • Regular expression E denotes a language (a set of strings) L(E): • L(∅) = ∅ (empty set) • L(ε) = {ε} (singleton set of the empty string) • L(a) = {a} (singleton set of a ∈ Σ) • L(R|S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)} • L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)} • L(R*) = L(R)* = {x1…xn | xk ∈ L(R), k=1,…,n; n ≥ 0}
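The definition of L(E) can be turned almost line by line into a membership test. The following is a minimal sketch in Python (the class and function names are hypothetical, chosen for this illustration): each constructor of the definition becomes a small class, and ends(r, w, i) returns every position j such that the slice w[i:j] belongs to L(r).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty:            # the empty set
    pass

@dataclass(frozen=True)
class Eps:              # epsilon, the empty string
    pass

@dataclass(frozen=True)
class Sym:              # a single symbol of the alphabet
    name: str

@dataclass(frozen=True)
class Alt:              # R | S
    left: object
    right: object

@dataclass(frozen=True)
class Cat:              # R S
    left: object
    right: object

@dataclass(frozen=True)
class Star:             # R*
    inner: object

def ends(r, w, i):
    """Return every j such that w[i:j] is in L(r)."""
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Eps):
        return {i}
    if isinstance(r, Sym):
        return {i + 1} if i < len(w) and w[i] == r.name else set()
    if isinstance(r, Alt):
        return ends(r.left, w, i) | ends(r.right, w, i)
    if isinstance(r, Cat):
        return {k for j in ends(r.left, w, i) for k in ends(r.right, w, j)}
    if isinstance(r, Star):
        # Zero or more repetitions: grow the set of reachable positions
        # until no new position appears.
        reached, frontier = {i}, {i}
        while frontier:
            frontier = {k for j in frontier for k in ends(r.inner, w, j)} - reached
            reached |= frontier
        return reached
    raise TypeError(f"not a regular expression: {r!r}")

def matches(r, w):
    """True if the whole sequence w is in L(r)."""
    return len(w) in ends(r, w, 0)

# (a | b)* a : all strings over {a, b} that end in a
A_OR_B_STAR_A = Cat(Star(Alt(Sym("a"), Sym("b"))), Sym("a"))
```

The input w is a list of symbols rather than a character string, so the same code works when the alphabet consists of element names.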

Example • top-level structure of a document: • Σ = {title, auth, date, sect} • title followed by an optional list of authors, followed by an optional date, followed by one or more sections: • title auth* (date | ε) sect sect* • common abbreviations: • E? = (E | ε); E+ = E E* • -> title auth* date? sect+
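The abbreviated expression title auth* date? sect+ can be tried out with an ordinary regular-expression library. The following is a minimal sketch, assuming Python's re module; the helper name is_valid_top_level is hypothetical. Because re works on characters and not on element names, the sequence of names is first joined with single spaces.

```python
import re

# title auth* date? sect+  written over space-separated element names
TOP_LEVEL = re.compile(r"title( auth)*( date)?( sect)+")

def is_valid_top_level(children):
    """Check a list of element names against the top-level structure."""
    return TOP_LEVEL.fullmatch(" ".join(children)) is not None
```

For instance, is_valid_top_level(["title", "sect", "sect"]) holds, whereas a document without any section, such as ["title", "date"], is rejected.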

Context-free grammars • Widely used for syntax specification (e.g. of programming languages) • G = (V, Σ, P, S) • V: the alphabet of the grammar G; V = Σ ∪ N • Σ: the set of terminal symbols; N = V − Σ: the set of nonterminal symbols • P: set of productions • S ∈ N: the start symbol

Productions and derivations • Productions: A -> α, where A ∈ N, α ∈ V* • e.g. A -> aBa (1) • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> ω is a production of the grammar • e.g. AA => AaBa (assuming prod. 1 above)

Language generated by a context-free grammar • γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ • The language generated by a CFG G: • L(G) = {w ∈ Σ* | S =>* w} • L(G) is a set of strings: to model structural elements, we consider parse trees
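The definition of L(G) can be explored by enumerating derivations mechanically. The following is a minimal Python sketch under stated assumptions: the grammar S -> a S b | ε is a hypothetical example that does not come from the slides, the function name generate is illustrative, and the search only terminates for grammars in which sentential forms cannot grow indefinitely without producing terminals.

```python
from collections import deque

def generate(productions, start, max_len):
    """Collect all terminal strings with at most max_len symbols derivable from start.

    productions maps each nonterminal to a list of right-hand sides (tuples
    of symbols). Only the leftmost nonterminal is expanded at each step,
    which is sufficient because every string of L(G) has a leftmost derivation.
    """
    words = set()
    seen = {(start,)}
    queue = deque(seen)
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, if any.
        idx = next((i for i, s in enumerate(form) if s in productions), None)
        if idx is None:
            words.add("".join(form))      # only terminals left: a word of L(G)
            continue
        for rhs in productions[form[idx]]:
            new_form = form[:idx] + tuple(rhs) + form[idx + 1:]
            terminals = sum(1 for s in new_form if s not in productions)
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return words

# Hypothetical grammar: S -> a S b | epsilon
GRAMMAR = {"S": [("a", "S", "b"), ()]}
```

Calling generate(GRAMMAR, "S", 4) collects the strings of the language that have at most four symbols.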

Parse trees of a CFG • Aka syntax trees or derivation trees • nodes labeled by symbols of V (or by ε): • internal nodes by nonterminals, root by start symbol • leaves by terminal symbols (or ε) • parent with label A can have children labeled by X1,…,Xk only if A -> X1…Xk is a production

CFGs for document structures • Nonterminals represent document structures • e.g. Ref -> AuthorList Title PublData; AuthorList -> Author AuthorList; AuthorList -> ε • problem: • obscures the relation of elements (the last Author several hierarchical levels away from Ref) -> solution: extended CFGs

Extended CFGs (ECFGs) • Like CFGs, but right-hand sides of productions are regular expressions over V, e.g. Ref -> Author* Title PublData • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> E is a production such that ω ∈ L(E) • e.g. Ref => Author Author Author Title PublData

Language generated by an ECFG • Defined similarly to CFGs • Theorem: Languages generated by extended and ordinary CFGs are the same

Parse trees of an ECFG • Similar to parse trees of an ordinary CFG, except that… • parent with label A can have children labeled by X1,…,Xk when A -> E is a production such that X1…Xk ∈ L(E) • -> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
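The parse-tree condition for ECFGs can be checked recursively. The following is a minimal Python sketch, assuming that a tree is a (label, children) pair and that symbols without a production (here Author, Title and PublData) are treated as leaves; this representation and the names PRODUCTIONS, is_parse_tree and leaf are assumptions made for the illustration. The production Ref -> Author* Title PublData is the one used in the ECFG example.

```python
import re

# Right-hand sides as regular expressions over labels; every label in the
# word to be matched is followed by one space.
PRODUCTIONS = {
    "Ref": re.compile(r"(Author )*Title PublData "),
}

def is_parse_tree(node):
    """Check that every internal node's children spell a word of L(E)."""
    label, children = node
    if label not in PRODUCTIONS:
        return not children               # treated as a leaf in this sketch
    word = "".join(child[0] + " " for child in children)
    return (PRODUCTIONS[label].fullmatch(word) is not None
            and all(is_parse_tree(child) for child in children))

def leaf(label):
    return (label, [])

# A Ref node with three Author children directly below it.
GOOD = ("Ref", [leaf("Author"), leaf("Author"), leaf("Author"),
                leaf("Title"), leaf("PublData")])
# Wrong order: an Author after the Title.
BAD = ("Ref", [leaf("Title"), leaf("Author"), leaf("PublData")])
```

Unlike in the ordinary CFG with AuthorList, all Author nodes are direct children of Ref, however many there are.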

What is XML? • W3C Recommendation Feb 1998 • metalanguage that can be used to define markup languages • gives syntax for defining extended context-free grammars • XML documents that adhere to the ECFG are strings in the language • document types (grammars) vs. document instances (strings in the language)

XML encoding of structure • XML document essentially a parenthesized linear encoding of a parse tree • corresponds to a preorder walk • start of inner node (element) A denoted by a start tag <A>, end denoted by end tag </A> • leaves are strings (or empty elements) • + certain extensions (especially attributes)
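The correspondence between a parse tree and its XML encoding can be sketched as a preorder walk. The following minimal Python example assumes a tree is a (label, children) pair whose leaves are plain strings; this representation, the function name serialize and the sample text are illustrative assumptions. The result is parsed back with the standard xml.etree.ElementTree module to confirm that the tree is recovered.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def serialize(node):
    """Preorder walk: start tag, then the children left to right, then end tag."""
    if isinstance(node, str):
        return escape(node)               # leaf: character data
    label, children = node
    return f"<{label}>" + "".join(serialize(c) for c in children) + f"</{label}>"

TREE = ("Ref", [
    ("Author", ["A. Author"]),
    ("Title", ["On Trees & Tags"]),
    ("PublData", []),
])

ENCODED = serialize(TREE)
PARSED = ET.fromstring(ENCODED)           # parsing recovers the tree
```

The start and end tags play the role of the opening and closing parentheses of the linear encoding.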

Terminal symbols in practice • Leaves of parse trees are labeled by single characters (symbols of Σ) • too granular in practice: instead terminal symbols which stand for all values of a type • e.g. #PCDATA in XML for variable length content of data characters • richer data types in proposed XML schema formalisms

XML: logical structure • Elements • correspond to internal nodes of the parse tree • unique root element -> document is a single parse tree • indicated by matching (case-sensitive!) tags <ElementTypeName>…</ElementTypeName> • can contain text and/or subelements • can be empty: • <elem-type></elem-type> • <elem-type/>
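The rules for elements can be tried out with any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical. It shows that tag names must match including case, that a document has a single root element, and that an empty element written with a start and end tag is equivalent to the single empty-element tag that XML also allows.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if text parses as a single well-formed XML element tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Both spellings of an empty element yield the same tree.
SAME = (ET.tostring(ET.fromstring("<elem-type></elem-type>"))
        == ET.tostring(ET.fromstring("<elem-type/>")))
```

Here is_well_formed("<Memo></memo>") fails because the tag names differ in case, and is_well_formed("<a/><b/>") fails because there are two root elements.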

Logical structure • Attributes • name-value pairs attached to elements • "metadata", usually not treated as content • e.g. <div class="preface" date="990126"> • also: • <?note this text would be passed to the application as a processing instruction named 'note'?>
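How an application sees attributes can be shown with a parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the element content and the attribute name lang are made up for the illustration. Attributes arrive as name-value pairs that are kept separate from the element's content.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring('<div class="preface" date="990126">Some preface text</div>')

attributes = dict(div.attrib)             # the name-value pairs
content = div.text                        # the content is kept separately
# An attribute that is absent can be given a fallback by the application.
language = div.get("lang", "unspecified")
```

All attribute values are delivered as strings; interpreting "990126" as a date is left to the application.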

Document type declaration • Provides a grammar (document type definition, DTD) for a class of documents • syntax: • <!DOCTYPE root-type-name SYSTEM "ex.dtd" [ internal subset declarations ]> • the external subset (here the file ex.dtd) and the internal subset (the declarations between [ and ]) make up the DTD; internal has higher precedence

XML declaration • <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

Defining the structure: DTD • document type definition (DTD) • content model for each element • describes how the elements are formed from the other elements and text • defines which attributes an element may/must have; default values • content models are regular expressions

Markup declarations • Element type declarations (similar to productions of ECFGs) • attribute-list declarations (for declared element types) • entity declarations • notation declarations

Element type declarations • The general form is • <!ELEMENT elem-type-name (E)> • where E is a content model = regular expression over element names

Regular expression syntax • + : 1 or more • * : 0 or more • ? : 0 or 1 • | : choice (one has to be chosen) • () : grouping • , : order

Examples of definitions • <!ELEMENT name (fname+, lname)> • <!ELEMENT address (name, street, ((city, state, zipcode) | (zipcode, city)))> • <!ELEMENT contact (address, phone*, email?)> • <!ELEMENT contact2 (address | phone | email)*>

DTD for the Invoice example
<!DOCTYPE invoice [
  <!ELEMENT invoice (orderDate, shipDate, billingAddress, voice*, fax?)>
  <!ELEMENT orderDate (#PCDATA)>
  <!ELEMENT shipDate (#PCDATA)>
  <!ELEMENT billingAddress (name, street, city, state, zip)>
  <!ELEMENT voice (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT street (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT state (#PCDATA)>
  <!ELEMENT zip (#PCDATA)>
]>
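Since content models are regular expressions over element names, checking a document against them can be sketched with ordinary regex matching. The following minimal Python example is not a DTD validator: it assumes element-only content models (elements declared as #PCDATA are simply not checked), it writes the invoice model with commas between all particles, and the names model_to_regex, children_ok and is_valid are hypothetical. Python's standard library does not validate against DTDs, so the translation is done by hand.

```python
import re
import xml.etree.ElementTree as ET

# Tokens that need rewriting: element names, commas, opening parentheses
# and whitespace. The symbols ) | * + ? mean the same in Python regexes.
TOKEN = re.compile(r"[A-Za-z_][\w.-]*|,|\(|\s+")

def model_to_regex(model):
    """Translate an element-only content model into a compiled regex."""
    def convert(match):
        token = match.group(0)
        if token == "(":
            return "(?:"                  # grouping without capture
        if token == "," or token.isspace():
            return ""                     # sequence is plain concatenation
        return "(?:" + re.escape(token) + " )"   # element name plus separator
    return re.compile(TOKEN.sub(convert, model))

CONTENT_MODELS = {
    "invoice": "(orderDate, shipDate, billingAddress, voice*, fax?)",
    "billingAddress": "(name, street, city, state, zip)",
}

def children_ok(element):
    """Check the sequence of child element names against the content model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True                       # e.g. #PCDATA elements: not checked here
    word = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(word) is not None

def is_valid(element):
    return children_ok(element) and all(is_valid(child) for child in element)
```

An invoice with several voice elements and no fax passes this check, while one that lacks shipDate does not. A real validator would also handle mixed content, attributes and entities.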

Attribute-list declarations • Name, data type and possible default value for each attribute for a given element type • Example:
<!ATTLIST FIG
  id    ID          #IMPLIED
  descr CDATA       #REQUIRED
  class (a | b | c) "a">
• semantics mainly up to the application
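What the FIG declaration asks of a document can be sketched as a small check. The following minimal Python example uses a hypothetical dictionary representation of the attribute list and a hypothetical function apply_attlist; it covers required attributes, enumerated types and default values, and does not check that ID values are unique within the document.

```python
REQUIRED = object()     # stands for #REQUIRED
IMPLIED = object()      # stands for #IMPLIED

# name -> (type, default); an enumerated type is a tuple of allowed values
FIG_ATTLIST = {
    "id": ("ID", IMPLIED),
    "descr": ("CDATA", REQUIRED),
    "class": (("a", "b", "c"), "a"),
}

def apply_attlist(attlist, given):
    """Return the attributes the application would see, or raise ValueError."""
    unknown = set(given) - set(attlist)
    if unknown:
        raise ValueError(f"undeclared attributes: {sorted(unknown)}")
    result = dict(given)
    for name, (attr_type, default) in attlist.items():
        if name not in result:
            if default is REQUIRED:
                raise ValueError(f"missing required attribute: {name}")
            if default is not IMPLIED:
                result[name] = default    # the declared default value applies
        if name in result and isinstance(attr_type, tuple) \
                and result[name] not in attr_type:
            raise ValueError(f"{name} must be one of {attr_type}")
    return result
```

For example, apply_attlist(FIG_ATTLIST, {"descr": "A figure"}) adds class with its default value "a", while omitting descr raises an error.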
