Use of XML in the Publications Office: Critical issues for publishing



  1. Use of XML in the Publications Office: Critical issues for publishing. Dr. Holger Bagola, Publications Office, DIR/R 5 “IT Projects”, section “Formats & Linguistic Informatics”. October 2006.

  2. History • From SGML to XML • Structure of publications in Formex • Streamlining of models • Current status of Formex • Particular needs for publishing • Conclusion

  4. History • In the 1970s, more and more publication procedures were supported by computer applications. • There was no common standard for applications in the context of publishing. • Publishing houses were confronted with a large variety of formats.

  5. History • A considerable number of documents published in the Official Journal (OJ) can be totally or partially reused for the publication of other documents. • As the electronic formats of published documents were not standardized, it was impossible to set up convenient procedures for such reuse.

  6. History • The first information on SGML as a future standard for the exchange of documents was published in the early 1980s. • Main advantages of the approach: independence from any application or operating platform; description of the logical document structure instead of its presentation.

  7. History • In 1982, the Publications Office decided to define a format for the exchange of published documents: Formex (Format for the exchange of electronic publications).

  8. History • 1984/1985: Publication of the Formex specifications • 1985: Formex became part of the framework contract for OJ publications • 1986: Adoption of the SGML standard by ISO (ISO 8879)

  9. History • But there was no real support for the format on the market (parsers, editors, etc.). • The approach seemed rather exotic to printing houses, which were used to working with the presentation of documents. • The quality of delivered SGML documents was rather poor.

  10. History • Revision and partial redesign of Formex • Addition of a basic table model • Formex 2 was easier for the framework contractors to understand. • Quality improved but was still insufficient for publication: it was impossible to derive the document presentation from the rough description of the document structure.

  11. History • Total redesign of the Formex specifications • Implementation of a more flexible table model • Integration of metadata into the SGML document structure • Finer granularity and distinct elements for the description of document structure (making it possible to derive presentation from structure)

  12. History • The result was a rather complex specification, which required intensive validation of the deliveries.

  13. History • 1998: XML was adopted by the W3C as a new standard compatible with SGML. • XML was immediately accompanied by additional standards which supported the navigation and transformation of documents. • 2001: A new standard for the specification of XML grammars was adopted, XML Schema.

  14. History • In 2001, the Publications Office organized a Formex user meeting to discuss the future development of the approach. The main results of this meeting were: • Migration to XML, for which various tools were available on the market (partly as open source) • Replacement of the DTD methodology for specifying XML grammars by XML Schema

  16. From SGML to XML • Revision of the approach in order to define a grammar which meets the needs of printing houses without abandoning the description of the logical document structure • Definition of a table model based on the HTML model (keeping logical relations and functions in attributes)

  17. From SGML to XML • Abandonment of parallel models: the distinction is made by context analysis • Replacement of the character encoding based on ISO 2022 by Unicode (UTF-8, the default for XML instances) • All documents contain a reference to the Formex schema on the web: http://formex.publications.europa.eu

  18. From SGML to XML • Distinction of up to four levels of a publication • Definition of rules for the automatic validation of Formex instances beyond parsing • Development of a tool for comparing the contents of Formex instances with the corresponding PDF instances • Automatic extraction of metadata for updating EUR-Lex

  19. From SGML to XML • The XML-based version of the Formex 4 specifications entered into force on 1 May 2004. • The current release is 3.00.

  21. Structure of publications in Formex • Formex instances currently cover OJ publications only (L and C series). • Other publications are possible but have not yet been implemented.

  22. Structure of publications in Formex • Description of the publication structure: • Description of the structure and composition of the publication stricto sensu • Description of the structure and composition of a document • Contents of the document and sub-documents • Non-XML parts or fragments of documents

  23. Structure of publications in Formex [Diagram: hierarchy of a publication]
      Publication: description of logical structure and composition; references to documents
        Document: references to main and sub-documents
          Main document
            Non-XML instance
          Sub-document
            Non-XML instance
          Sub-document
            Non-XML instance
        Document: references to main and sub-documents
          Main document

  24. Structure of publications in Formex • To keep a minimum of metadata together with the contents of a document, some metadata items are present on several levels. • All sub-levels contain references to the next higher hierarchical level (except for non-XML instances).

  26. Streamlining of models • In Formex 3, whenever an element could appear in various contexts, distinct elements were created. Thus there were parallel models such as TI.DOC, TI.ANNEX, TI.GRSEQ, etc. • These elements were merged into a single element, with the context expressing the distinct functions.

  27. Streamlining of models
      Old               New             Selection by context (XPath)
      ACT/TI.DOC        ACT/TITLE       TITLE[parent::ACT]
      ANNEX/TI.ANNEX    ANNEX/TITLE     TITLE[parent::ANNEX]
      GR.SEQ/TI.GRSEQ   GR.SEQ/TITLE    TITLE[parent::GR.SEQ]
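
The context-based distinction above can be illustrated with a minimal Python sketch. It is not taken from the Formex tooling: the `SAMPLE` wrapper element and the fragment itself are hypothetical and far simpler than a real Formex instance, and only the element names ACT, ANNEX, GR.SEQ and TITLE come from the slide. Because the standard-library ElementTree module does not support the `parent::` axis, the sketch walks each parent and looks at its direct TITLE children, which is equivalent to `TITLE[parent::X]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment. <SAMPLE> is only a wrapper for
# this illustration and is not a Formex element.
SAMPLE = """<SAMPLE>
  <ACT><TITLE>Title of the act</TITLE></ACT>
  <ANNEX><TITLE>Title of the annex</TITLE></ANNEX>
  <GR.SEQ><TITLE>Title of the group</TITLE></GR.SEQ>
</SAMPLE>"""


def titles_by_parent(root):
    """Map each parent element name to the text of its direct TITLE children."""
    result = {}
    for parent in root.iter():
        # findall("TITLE") matches direct children only, so the parent's
        # name is the context that distinguishes the titles.
        for title in parent.findall("TITLE"):
            result.setdefault(parent.tag, []).append(title.text)
    return result


titles = titles_by_parent(ET.fromstring(SAMPLE))
for parent_name, texts in titles.items():
    print(parent_name, texts)
```

A single TITLE element is enough here because the name of the parent carries the information that the old element names TI.DOC, TI.ANNEX and TI.GRSEQ used to encode.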

  28. Streamlining of models: old table model • The table model in Formex 1 to 3 was a logical one, distinguishing between the column headings, the row headings, and the body. • The body could easily be identified and copied to another language version.

  29. Streamlining of models: old table model • Empty cells were not present in old instances. • Attributes expressed the relation between cells and columns.

  30. Streamlining of models: new table model • Top-down model for headings and body. • Attributes express the distinct function of a specific cell. • Empty cells are present and carry a special attribute which explicitly confirms the absence of any content.
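
As a rough illustration of the empty-cell rule in the new table model, the following Python sketch flags cells that have no content but lack the confirming attribute. All markup names here (TABLE, ROW, CELL and the EMPTY attribute) are placeholders chosen for this example and are not taken from the Formex schema; the actual names should be looked up in the published specification.

```python
import xml.etree.ElementTree as ET

# Placeholder markup: element and attribute names are assumptions made for
# this sketch, not the names used by Formex.
SAMPLE = """<TABLE>
  <ROW><CELL>Product</CELL><CELL>Quota</CELL></ROW>
  <ROW><CELL>Wheat</CELL><CELL EMPTY="YES"/></ROW>
  <ROW><CELL>Barley</CELL><CELL/></ROW>
</TABLE>"""


def unconfirmed_empty_cells(table):
    """Return (row, column) positions of empty cells without the marker attribute."""
    problems = []
    for r, row in enumerate(table.findall("ROW"), start=1):
        for c, cell in enumerate(row.findall("CELL"), start=1):
            # A cell counts as having content if it holds text or child elements.
            has_content = bool((cell.text or "").strip()) or len(cell) > 0
            if not has_content and cell.get("EMPTY") != "YES":
                problems.append((r, c))
    return problems


problems = unconfirmed_empty_cells(ET.fromstring(SAMPLE))
print(problems)
```

In this sample only the last cell of the third row is reported, since the empty cell in the second row carries the marker. A check of this kind goes beyond what schema parsing alone would report.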

  32. Current status of Formex • Formex 4 is entirely based on W3C XML Schema. • It has been in use since May 2004. • Minor changes have been integrated (release 3.0). • All OJ (L and C) documents are covered. • Further document types (not published in the OJ) will be taken into account.

  33. Current status of Formex • The specification, the documentation of all elements, the physical specification, and more than 600 examples are publicly available on the website: http://formex.publications.europa.eu

  34. Current status of Formex • Availability of Formex via the LegisWrite interface • XML instances are not (yet?) publicly accessible • Different quality levels according to validation

  35. Current status of Formex [Workflow diagram; labels: Printing house, CERES, Quality 1, Quality 2, Quality 3, EUDOR, Automatic validation, Manual validation, Conversion to LW, LegisWrite interface, Client, Client]

  37. Particular needs for publishing • Publishing mostly concerns the presentation of documents in a readable form. • A “good” logical XML model allows the presentation of a given document to be derived from it. • Printing houses are obliged to work with Formex instances throughout the production process.

  38. Particular needs for publishing • Some parts of a document (words, parts of a sentence) require a specific presentation which does not always follow from the logical structure. • Specific elements for text highlighting and presentation had to be created. Example: foreign words are set in italics in some language versions.

  39. Particular needs for publishing • Quotation marks differ from one language version to another. • Exceptions in their use on nested levels require the specific symbols to be present in the instance.

  40. Particular needs for publishing • For special cases the printing houses are allowed to use temporary additional markup (processing instructions, elements from other namespaces). • In most cases this kind of information depends on the publishing system.

  41. Particular needs for publishing • All this additional information has to be deleted before the electronic version of the publication is sent. • For the design of new elements, the relation to presentation has to be analyzed. • In most cases, the correct identification of the new element has to be guaranteed.
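
The clean-up of temporary markup before delivery could look roughly like the Python sketch below. It rests on several assumptions that are not stated in the slides: that the Formex elements themselves are in no namespace, that temporary elements are therefore recognizable by having a namespace, and that their own content can be discarded. The namespace URI, the `ph:hint` element and the `layout` processing instruction are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample: the ph: namespace and the <?layout?> instruction stand in
# for printing-house specific markup.
SAMPLE = """<ACT xmlns:ph="urn:example:printing-house">
  <TITLE>Sample title</TITLE>
  <?layout page-break?>
  <P>First part<ph:hint type="no-break"/> and second part.</P>
</ACT>"""


def strip_foreign(elem):
    """Remove namespaced child elements in place, keeping the text that follows them."""
    previous = None
    for child in list(elem):
        if isinstance(child.tag, str) and child.tag.startswith("{"):
            # ElementTree stores namespaced tags as "{uri}name". Move the
            # text after the removed element to where it now belongs.
            tail = child.tail or ""
            if previous is None:
                elem.text = (elem.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            elem.remove(child)
        else:
            strip_foreign(child)
            previous = child


# The default ElementTree parser does not keep processing instructions in the
# tree, so they are already gone after parsing.
root = ET.fromstring(SAMPLE)
strip_foreign(root)
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

A production clean-up would normally run inside the publishing system that added the markup, and the result would be validated against the Formex schema afterwards.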

  42. Particular needs for publishing • Conversion into other electronic formats requires similar measures. • Regular derivations are: • Presentation in the Official Journal • Presentation in LegisWrite • Presentation in HTML

  43. Particular needs for publishing [Diagram] Formex (XML) instance, from which are derived: format “Official Journal” (PDF); format “LegisWrite” (RTF); format “EUR-Lex” (HTML)

  45. Conclusion • From the beginning, Formex has been a common exchange format which is independent of any application or platform. • Clear character encoding in all versions

  46. Conclusion • Availability of tools on the market for XML-based instances: • RXP for validating parsing against a DTD • XSV for validating parsing against an XML Schema • XMLSpy for development (+ Saxon) • XMetaL for content editing • RenderX for the generation of PDF

  47. Conclusion • Stylesheets (based on XSL-FO) for presentation • Future enhancements: • Better integration of other source formats (RTF/LegisWrite) • Addition of other document types not necessarily related to the Official Journal
