
XML for Information Management

XML for Information Management. 12.1.-16.1. 2009. University of Erlangen-Nuremberg Computational Linguistics Instructor: Professor Airi Salminen http://users.jyu.fi/~airi/. Day 1: Course introduction, XML examples and concepts. Outline. 1. Course introduction 2. XML examples





Presentation Transcript


  1. XML for Information Management 12.1.-16.1. 2009 University of Erlangen-Nuremberg Computational Linguistics Instructor: Professor Airi Salminen http://users.jyu.fi/~airi/

  2. Day 1: Course introduction, XML examples and concepts Outline • 1. Course introduction • 2. XML examples • 3. XML concepts

  3. 1. Course introduction: Instructor • Home university: University of Jyväskylä in Finland, Faculty of Information Technology • Home page: http://users.jyu.fi/~airi/ • Experience Jyväskylä: http://www3.jkl.fi/international/experience/index.html

  4. 1. Course introduction: Instructor • My research areas: structured documents, content management in organizations, document standardization, semantic web, information retrieval • My XML-related research has concerned: • modelling structured text • querying structured text • SGML/XML standardization

  5. 1. Course introduction: Instructor • Tague, J., Salminen, A., & McClellan, C. (1991). Complete formal model for information retrieval systems. In Proc. of the 14th ACM SIGIR Conference, 14-20. New York: ACM Press. • Salminen, A., & Watters, C. (1992). A two-level structure for textual databases to support hypertext access. Journal of the American Society for Information Science 43 (6), 432-447. • Salminen, A., & Tompa, F. (1993). PAT expressions: an algebra for text search. Acta Linguistica Hungarica, 41 (1-4), 277-306. http://www.cs.jyu.fi/~airi/papers/COMPLEX-1992.pdf • Salminen, A., Tague-Sutcliffe, J., & McClellan, C. (1995). From text to hypertext by indexing. ACM Transactions on Information Systems 13 (1), 69-99. • Salminen, A., Lehtovaara, M., & Kauppinen, K. (1996). Standardization of digital legislative documents - a case study. In Proceedings of the Twenty-Ninth Hawaii International Conference on System Sciences (pp. 72-81). Los Alamitos, CA: IEEE Computer Society Press. • Kuikka, E., & Salminen, A. (1997). Two-dimensional filters for structured text. Information Processing and Management 33 (1), 37-54.

  6. 1. Course introduction: Instructor • Salminen, A., Kauppinen, K., & Lehtovaara, M. (1997). Towards a methodology for document analysis. Journal of the American Society for Information Science 48 (7), Special Issue on Structured Information/Standards for Document Architectures, 644-655. • Salminen, A., & Tompa, F. (1999). Grammars++ for modelling information in text. Information Systems 24 (1), 1-24. • Salminen, A., Tiitinen, P., & Lyytikäinen, V. (1999). Usability evaluation of a structured document archive. In Proc. of the Thirty-Second Hawaii International Conference on System Sciences. Los Alamitos, CA: IEEE Computer Society Press. • Lyytikäinen, V., Tiitinen, P., & Salminen, A. (2001). XML metadata for accessing heterogeneous legal databases. In Proc. of the XML Europe 2001 Conference. http://www.gca.org/papers/xmleurope2001/papers/html/s27-4.html • Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. In Proc. of the ACM Symposium on Document Engineering (DocEng '01), 85-94. New York: ACM Press. • Salminen, A., Lyytikäinen, V., Tiitinen, P., & Mustajärvi, O. (2001). Experiences of SGML standardization: The case of the Finnish legislative documents. In Proc. of the Thirty-Fourth Hawaii International Conference on System Sciences. Los Alamitos, CA: IEEE Computer Society Press.

  7. 1. Course introduction: Instructor • Salminen, A. (2003). Document analysis methods. Encyclopedia of Library and Information Science, Second Edition, Revised and Expanded (pp. 916-927). New York: Marcel Dekker. • Korhonen, R., & Salminen, A. (2003). Visualization of EDI messages: Facing the problems in the use of XML. In Proc. of the Fifth International Conference on Electronic Commerce, 466-473. New York: ACM Press. • Salminen, A., Lyytikäinen, V., Tiitinen, P., & Mustajärvi, O. (2004). Implementing digital government in the Finnish Parliament. In Digital Government: Strategies and Implementation (pp. 242-259). Hershey, PA: IDEA Group Publishing. • Salminen, A. (2005). Building digital government by XML. In Proc. of the Thirty-Eighth Hawaii International Conference on System Sciences. Los Alamitos, CA: IEEE Computer Society Press. • Salminen, A., Nurmeksela, R., Lehtinen, A., Lyytikäinen, V., & Mustajärvi, O. (2006). Content production strategies for e-Government. In Encyclopedia of Digital Government, Vol. I (pp. 224-230). Hershey, PA: IDEA Group Publishing. • Nurmeksela, R., Jauhiainen, E., Salminen, A., & Honkaranta, A. (2007). XML document implementation: Experiences from three cases. In Proceedings of the Second International Conference on Digital Information Management (pp. 224-229). Los Alamitos, CA: IEEE.

  8. 1. Course introduction: Instructor XML-related projects • RASKE (1994-1998): Developing Standards for Structured Documents • inSGML (1998-2001): Methods for SGML standardization in industry • EULEGIS (1998-2000): European User Views to Legislative Information in Structured Form • AirXML (2002-2004): XML and Data Warehousing in Air Defence • RASKE2 (2003-2006): Methods for the Integration of Systems and Services in e-Government

  9. 1. Course introduction • Syllabus: http://users.jyu.fi/~airi/opetus/xml/erlangen/ • Course Readings: available on the course web site • Project Assignment: http://users.jyu.fi/~airi/opetus/xml/erlangen/project.html • Contact by email: airi.salminen@jyu.fi

  10. 1. Course introduction: project • Purpose • The projects are intended to explore the application of XML in various contexts. Students interested in practical XML exercises are free to suggest a practical project where they can test some XML software and/or build an application of their own. • The project can also be an investigation of an existing or planned XML solution in an organizational context together with an analysis of the impacts of the solution. • Topics: Proposed by students • Teams of two, or individual projects • The phases • 2-page topic proposal: due on Feb. 20 • Project report: due on March 31

  11. 2. XML examples • separation of the primary content and markup • markup is metadata adding some information to the primary content <?xml version = "1.0"?> <poem author = "Murasaki Shikibu" author_born = "974"> <stanza> <line>This life of ours would not cause you sorrow</line> <line>if you thought of it as like</line> <line>the mountain cherry blossoms</line> <line>which bloom and fade in a day.</line> </stanza> </poem> Note: The text of the line elements is taken from http://www.bopsecrets.org/rexroth/translations/japanese.htm, containing Kenneth Rexroth's translations of Japanese poetry
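The separation of primary content and markup in the poem example can be sketched in Python. This is a minimal illustration using the standard library module xml.etree.ElementTree; the variable names are assumptions introduced here, not part of the course material.

```python
import xml.etree.ElementTree as ET

# The poem document from the slide, with straight quotes around attribute values.
POEM = """<?xml version="1.0"?>
<poem author="Murasaki Shikibu" author_born="974">
  <stanza>
    <line>This life of ours would not cause you sorrow</line>
    <line>if you thought of it as like</line>
    <line>the mountain cherry blossoms</line>
    <line>which bloom and fade in a day.</line>
  </stanza>
</poem>"""

poem = ET.fromstring(POEM)

# Metadata carried by the markup: the attributes of the poem element.
metadata = dict(poem.attrib)

# Primary content: the character data inside the line elements.
primary_content = [line.text for line in poem.iter("line")]

print(metadata)
print("\n".join(primary_content))
```

The same parsed document yields the metadata and the text separately, which is what lets a stylesheet or an application treat them independently.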

  12. 2. XML examples External presentation for human perception can be defined in a separate stylesheet. With a suitable stylesheet the previous XML document might look like: This life of ours would not cause you sorrow if you thought of it as like the mountain cherry blossoms which bloom and fade in a day. For examples of attaching stylesheets, try searching Google for "xml examples".

  13. 2. XML examples A piece of prose in the TEI Guidelines: http://www.tei-c.org/Guidelines/Customization/Lite/U5-eg.html

  14. 3. XML concepts XML = Extensible Markup Language A set of rules for defining and representing information as structured documents for applications on the Internet. XML is a restricted form of the older markup language called SGML. T. Bray, J. Paoli, & C. M. Sperberg-McQueen (Eds.), Extensible Markup Language (XML) 1.0, W3C Recommendation 10 February 1998, http://www.w3.org/TR/1998/REC-xml-19980210/ T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, & F. Yergeau (Eds.), Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C Recommendation 26 November 2008, http://www.w3.org/TR/2008/REC-xml-20081126/ T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, F. Yergeau, & J. Cowan (Eds.), Extensible Markup Language (XML) 1.1 (Second Edition), W3C Recommendation 16 August 2006, http://www.w3.org/TR/2006/REC-xml11-20060816/ XML Development History: http://www.w3.org/XML/hist2002

  15. 3. XML concepts: XML processor Processing XML documents XML Document XML Processor Application

  16. 3. XML concepts: physical and logical structure XML processor recognizes from a document two structures: • physical structure, consisting of entities • logical structure where elements are the core composites

  17. 3. XML concepts: entity Entity • file (text or some other kind of data) • named piece of text

  18. 3. XML concepts: entity Example of an entity structure root entity part 1 part 2 figure1.jpg figure2.jpg figure3.jpg entity entity reference

  19. 3. XML concepts: entity Entity as a named piece of text, like in HTML: Y&ouml; Jyv&auml;skyl&auml;ss&auml; is displayed as Yö Jyväskylässä
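Unlike HTML, XML predefines only five named entities (amp, lt, gt, apos, quot), so names such as ouml must be declared before use. The sketch below assumes a hypothetical document that declares ouml and auml in its internal DTD subset and lets the Python standard library parser expand the references.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the two entities are declared in the internal subset,
# each as a named piece of text consisting of one character reference.
DOC = """<!DOCTYPE title [
  <!ENTITY ouml "&#246;">
  <!ENTITY auml "&#228;">
]>
<title>Y&ouml; Jyv&auml;skyl&auml;ss&auml;</title>"""

title = ET.fromstring(DOC)

# The processor replaces each entity reference with its replacement text.
expanded_text = title.text
print(expanded_text)
```

Without the two declarations the parser would reject the document, because a reference to an undeclared entity is an error in XML.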

  20. 3. XML concepts: element Element An element is marked up by a begin-tag and an end-tag. <year>1654</year> (begin-tag: <year>, content: 1654, end-tag: </year>)

  21. 3. XML concepts: element Example 1: a document of seven elements <?xml version="1.0"?> <rhymecollection> <rhyme> <line>Ole aina iloinen</line> <line> niin kuin pikku varpunen</line> </rhyme> <rhyme> <line>See, see! What shall I see?</line> <line>A horse's head where his tail should be</line> </rhyme> </rhymecollection>

  22. 3. XML concepts: tree structure Example 1 as an element tree: the root element rhymecollection has two rhyme child elements, and each rhyme has two line child elements. • There is always one root element • Every non-root element is a child element of a parent element
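The element tree of Example 1 can be inspected with a short Python sketch. The helper name outline is an assumption made for this illustration; the parsing itself uses the standard library module xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Example 1: a document of seven elements.
RHYMES = """<?xml version="1.0"?>
<rhymecollection>
  <rhyme>
    <line>Ole aina iloinen</line>
    <line>niin kuin pikku varpunen</line>
  </rhyme>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>"""

root = ET.fromstring(RHYMES)


def outline(element, depth=0):
    """Return (depth, element name) pairs in document order."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


tree_rows = outline(root)
element_count = len(tree_rows)

# Print the tree with one level of indentation per generation.
for depth, tag in tree_rows:
    print("  " * depth + tag)
```

Every row except the first has a depth greater than zero, which reflects the rule that every non-root element is a child of some parent element.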

  23. 3. XML concepts: attribute Extra information can be attached to elements by attributes An attribute has: • name • value (character string) <lastname earlier="Rantanen">Korhonen</lastname> (name: earlier, value: Rantanen) Two predefined attributes: xml:lang and xml:space. xml:lang for identifying the language of the content of an element xml:space for signaling that white space should be preserved by the application
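A small Python sketch of reading attribute names and values. The xml:lang attribute on lastname is an assumption added for illustration; ElementTree reports the reserved xml prefix by its namespace URI.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the reserved prefix "xml".
XML_NS = "http://www.w3.org/XML/1998/namespace"

# The slide's element, extended here with an assumed xml:lang attribute.
person = ET.fromstring(
    '<lastname earlier="Rantanen" xml:lang="fi">Korhonen</lastname>'
)

earlier = person.get("earlier")              # ordinary attribute
language = person.get("{%s}lang" % XML_NS)   # predefined attribute xml:lang
content = person.text                        # element content

print(earlier, language, content)
```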

  24. 3. XML concepts: elements and attributes Data in XML elements: • as element content • as attribute value

  25. 3. XML concepts: elements and attributes Three alternative ways for giving two lastnames for a person: 1. <lastname earlier="Rantanen">Korhonen</lastname> 2. <lastname> <earlier>Rantanen</earlier> <now>Korhonen </now> </lastname> 3. <lastname earlier="Rantanen" now="Korhonen"> </lastname> What is the difference?

  26. 3. XML concepts: elements and attributes In the logical structure Child elements of a parent element are ordered. The writing order of attributes in an element is insignificant.

  27. 3. XML concepts: elements and attributes Different structures: <lastname> <earlier>Rantanen</earlier> <now>Korhonen </now> </lastname> (earlier is the 1st child element, now is the 2nd) and <lastname> <now>Korhonen </now> <earlier>Rantanen</earlier> </lastname> (now is the 1st child element, earlier is the 2nd)

  28. 3. XML concepts: elements and attributes Equivalent solutions: <lastname earlier="Rantanen" now="Korhonen"> </lastname> <lastname now="Korhonen" earlier="Rantanen" > </lastname>
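The difference between ordered child elements and unordered attributes can be checked with a brief Python sketch; the variable names are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

# Same attributes written in a different order.
a = ET.fromstring('<lastname earlier="Rantanen" now="Korhonen"></lastname>')
b = ET.fromstring('<lastname now="Korhonen" earlier="Rantanen"></lastname>')

# Attributes are exposed as a mapping, and mapping equality ignores order.
same_attributes = a.attrib == b.attrib

# Same child elements written in a different order.
c = ET.fromstring(
    "<lastname><earlier>Rantanen</earlier><now>Korhonen</now></lastname>"
)
d = ET.fromstring(
    "<lastname><now>Korhonen</now><earlier>Rantanen</earlier></lastname>"
)

# Children form a sequence, so the order is part of the logical structure.
same_children = [child.tag for child in c] == [child.tag for child in d]

print(same_attributes, same_children)
```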

  29. 3. XML concepts: Unicode XML documents are encoded in Unicode, which is intended for content written in any natural language of the world. The development work is done by the Unicode Consortium. The latest version: Unicode 5.1.0

  30. 3. XML concepts: DTD XML is a meta language intended to define languages for special application areas Document Type Definition (DTD) is the mechanism to define languages

  31. 3. XML concepts: DTD DTD : <!DOCTYPE rhymecollection [ <!ELEMENT rhymecollection (title?, rhyme+)> <!ELEMENT title (#PCDATA)> <!ELEMENT rhyme (line+)> <!ELEMENT line (#PCDATA)> ]> Example 1 meets the constraints defined in the DTD.
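Element declarations such as (title?, rhyme+) are regular expressions over child element names. The Python sketch below is only an illustration of that idea, not a DTD validator: the content models are transcribed by hand into regular expressions, and the names CONTENT_MODELS, PCDATA_ONLY, and children_match are assumptions introduced here.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written equivalents of the element declarations in the DTD.
# Each child name is followed by one space so names cannot run together.
CONTENT_MODELS = {
    "rhymecollection": re.compile(r"(title )?(rhyme )+"),  # (title?, rhyme+)
    "rhyme": re.compile(r"(line )+"),                      # (line+)
}
PCDATA_ONLY = {"title", "line"}                            # (#PCDATA)


def children_match(element):
    """Check the element and its descendants against the content models."""
    if element.tag in PCDATA_ONLY:
        return len(element) == 0  # character data only, no child elements
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # element type not declared in this DTD
    child_names = "".join(child.tag + " " for child in element)
    if model.fullmatch(child_names) is None:
        return False
    return all(children_match(child) for child in element)


example_1 = ET.fromstring(
    "<rhymecollection>"
    "<rhyme><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>"
    "<rhyme><line>See, see! What shall I see?</line></rhyme>"
    "</rhymecollection>"
)
empty_rhyme = ET.fromstring("<rhymecollection><rhyme/></rhymecollection>")

print(children_match(example_1), children_match(empty_rhyme))
```

A real validating processor also checks attributes, entities, and other validity constraints; this sketch covers element content only.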

  32. 3. XML concepts: DTD Attributes added <!DOCTYPE rhymecollection [ <!ELEMENT rhymecollection (title?, rhyme+)> <!ELEMENT title (#PCDATA)> <!ELEMENT rhyme (line+)> <!ATTLIST rhyme xml:lang NMTOKEN #REQUIRED author CDATA #IMPLIED > <!ELEMENT line (#PCDATA)> ]>

  33. 3. XML concepts: DTD DTD can be attached to a document • as an internal subset • as an external subset • by combining internal and external markup declarations DTD consists of all markup declarations together.

  34. 3. XML concepts: DTD Internal DTD <?xml version="1.0" ?> <!DOCTYPE rhymecollection [ <!ELEMENT rhymecollection (title?, rhyme+)> <!ELEMENT title (#PCDATA)> <!ELEMENT rhyme (line+)> <!ATTLIST rhyme xml:lang NMTOKEN #REQUIRED author CDATA #IMPLIED > <!ELEMENT line (#PCDATA)> ]> <rhymecollection> <rhyme> <line>See, see! What shall I see?</line> <line>A horse's head where his tail should be</line> </rhyme> </rhymecollection>

  35. 3. XML concepts: DTD <?xml version="1.0"?> <!DOCTYPE rhymecollection SYSTEM "myrhyme.dtd"> <rhymecollection> <rhyme> <line>See, see! What shall I see?</line> <line>A horse's head where his tail should be</line> </rhyme> </rhymecollection> System identifier "myrhyme.dtd" gives the address of the external DTD

  36. 3. XML concepts: DTD Markup declarations in "myrhyme.dtd", preceded by a text declaration (an external subset contains only the declarations, with no <!DOCTYPE ... [ ]> wrapper, and a text declaration must include an encoding declaration): <?xml version="1.0" encoding="UTF-8"?> <!ELEMENT rhymecollection (title?, rhyme+)> <!ELEMENT title (#PCDATA)> <!ELEMENT rhyme (line+)> <!ATTLIST rhyme xml:lang NMTOKEN #REQUIRED author CDATA #IMPLIED > <!ELEMENT line (#PCDATA)>

  37. 3. XML concepts: DTD DTD is just one definition mechanism available for constraining XML data. The most important others: • XML Schema • RELAX NG The term schema (or XML schema) can refer to a definition written by any definition mechanism developed for XML data. The languages for defining schemas are called schema languages.

  38. 3. XML concepts: XML application An XML application is an XML-based language, (usually) defined by some schema language. Examples of XML applications: • XHTML: http://www.w3.org/TR/xhtml1/ • RSS (Really Simple Syndication): http://blogs.law.harvard.edu/tech/rss • TEI (Text Encoding Initiative): http://www.tei-c.org/index.xml • ebXML (Electronic Business using XML): http://www.ebxml.org/

  39. 3. XML concepts: XML application XML -- SGML – HTML -- XHTML • XML is a subset of SGML • HTML is an SGML application • XHTML is an XML application

  40. 3. XML concepts: well-formed and valid Two kinds of constraints in the XML specification: • well-formedness constraints: all XML documents have to meet them; documents that do are called well-formed • validity constraints: documents that are associated with a DTD and meet these constraints (including the constraints expressed in the DTD) are called valid

  41. 3. XML concepts: well-formed and valid A requirement for well-formed documents: each child element has to be contained in the parent element <date><day>24<month>1</day></month><year>2005</year></date> NOT well-formed
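A well-formedness check can be sketched in Python by letting a non-validating parser attempt to read the text. The helper name is_well_formed and the corrected variant of the date element are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# The slide's example: day and month overlap instead of nesting.
overlapping = "<date><day>24<month>1</day></month><year>2005</year></date>"

# An assumed corrected variant in which each child is closed inside its parent.
nested = "<date><day>24</day><month>1</month><year>2005</year></date>"

print(is_well_formed(overlapping), is_well_formed(nested))
```

The parser stops at the first violation, here the end-tag </day> arriving while month is still open.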

  42. 3. XML concepts: well-formed and valid <?xml version="1.0" ?> <!DOCTYPE rhymecollection [ <!ELEMENT rhymecollection (title?, rhyme+)> <!ELEMENT title (#PCDATA)> <!ELEMENT rhyme (line+)> <!ATTLIST rhyme xml:lang NMTOKEN #REQUIRED author CDATA #IMPLIED > <!ELEMENT line (#PCDATA)> ]> <rhymecollection> <rhyme xml:lang = "fi"> <line>See, see! What shall I see?</line> <line>A horse's head where his tail should be</line> </rhyme> </rhymecollection> VALID, even though the attribute value is not correct (the rhyme is in English, but "fi" identifies Finnish)

  43. 3. XML concepts: well-formed and valid <?xml version="1.0" ?> <!DOCTYPE rhymecollection [ <!ELEMENT rhymecollection (title?, rhyme+)> <!ELEMENT title (#PCDATA)> <!ELEMENT rhyme (line+)> <!ATTLIST rhyme xml:lang NMTOKEN #REQUIRED author CDATA #IMPLIED > <!ELEMENT line (#PCDATA)> ]> <rhymecollection> <rhyme> <line>See, see! What shall I see?</line> <line>A horse's head where his tail should be</line> </rhyme> </rhymecollection> NOT valid (the required attribute xml:lang is missing from rhyme)
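The Python standard library parser does not validate against a DTD, so the sketch below checks by hand the one attribute declaration that separates the two preceding documents: xml:lang NMTOKEN #REQUIRED on rhyme. The function name and the ASCII-only name token pattern are simplifying assumptions, not a complete implementation of the validity constraints.

```python
import re
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# ASCII-only approximation of a name token; the real definition is broader.
NMTOKEN = re.compile(r"[A-Za-z0-9._:-]+")


def rhyme_attribute_problems(root):
    """List violations of 'xml:lang NMTOKEN #REQUIRED' on rhyme elements."""
    problems = []
    for number, rhyme in enumerate(root.iter("rhyme"), start=1):
        value = rhyme.get(XML_LANG)
        if value is None:
            problems.append("rhyme %d: required attribute xml:lang is missing" % number)
        elif NMTOKEN.fullmatch(value) is None:
            problems.append("rhyme %d: xml:lang value is not a name token" % number)
    return problems


with_lang = ET.fromstring(
    '<rhymecollection><rhyme xml:lang="fi">'
    "<line>See, see! What shall I see?</line>"
    "</rhyme></rhymecollection>"
)
without_lang = ET.fromstring(
    "<rhymecollection><rhyme>"
    "<line>See, see! What shall I see?</line>"
    "</rhyme></rhymecollection>"
)

print(rhyme_attribute_problems(with_lang))
print(rhyme_attribute_problems(without_lang))
```

The check accepts "fi" for an English rhyme, which mirrors the point of the valid example: a DTD constrains the form of the value, not whether it is true.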

  44. 3. XML concepts: Namespaces There is often a need to use elements and attributes originating from different environments (or applications). Vocabularies in two environments may include common names intended for different purposes. If multiple declarations are used in a single DTD, name collisions must be avoided.

  45. 3. XML concepts: Namespaces • XML namespaces • Provide a method for qualifying element and attribute names so that name collisions can be avoided • Motivation: modularity and documentation If a well-understood markup vocabulary for element and attribute names exists, it should be re-used rather than re-invented, especially if there is also software available. http://www.w3c.org/TR/REC-xml-names

  46. 3. XML concepts: Namespaces XML namespace Collection of names, identified by a URI No formal rules for defining names in a namespace URI (Uniform Resource Identifier) • URL (Uniform Resource Locator) or • URN (Uniform Resource Name) Generic Syntax, RFC 3986: http://www.ietf.org/rfc/rfc3986.txt In XML Names 1.1 URI has been replaced by IRI (Internationalized Resource Identifier, RFC 3987: http://www.rfc-editor.org/rfc/rfc3987.txt)

  47. 3. XML concepts: Namespaces • Example • Namespace: http://uwaterloo.ca • Element names: department, name, professor, student, last_name, first_name, ... • Global attribute names: id, ... • Per-element-type attribute names: student: supervisor, ...

  48. 3. XML concepts: Namespaces Namespace declaration: defines a label (prefix) for the namespace and associates it to the namespace identifier (URI) Qualified name: a namespace prefix and a local part, separated by a colon <?xml version="1.0"?> <report xmlns:uw="http://uwaterloo.ca"> <uw:department> <uw:name>Department of Computer Science</uw:name> ... </report>
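How a processor resolves qualified names can be sketched in Python. The slide's report is elided, so the document below is an assumed, completed variant in which uw:department is closed right after uw:name; the prefix map passed to find is likewise an assumption of this illustration.

```python
import xml.etree.ElementTree as ET

# Assumed complete variant of the slide's report document.
REPORT = """<?xml version="1.0"?>
<report xmlns:uw="http://uwaterloo.ca">
  <uw:department>
    <uw:name>Department of Computer Science</uw:name>
  </uw:department>
</report>"""

root = ET.fromstring(REPORT)

# Prefix map used only for searching; it is independent of the document's prefixes.
ns = {"uw": "http://uwaterloo.ca"}
department = root.find("uw:department", ns)
name = department.find("uw:name", ns)

# ElementTree stores the expanded name: {namespace URI}local part.
print(root.tag)
print(department.tag)
print(name.text)

# A different prefix bound to the same URI identifies the same names.
same_element = root.find("w:department", {"w": "http://uwaterloo.ca"})
```

The prefix is only a label; what identifies the name is the pair of namespace URI and local part.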

  49. 3. XML concepts: Namespaces Prefix xml is reserved for W3C development work and its identifier is http://www.w3.org/XML/1998/namespace. The namespace can be declared in a document but it can be used without declaration. Prefix xmlns is used only for declaring namespaces. It cannot be used as a name of a namespace.

  50. Open source software for experimentation: http://www.w3.org/Status
