A Flexible XML-Based Glossary Approach for the Federal Government
By Ken Sall for the US Federal XML Community of Practice
January 19, 2005

Problem Statement
• After examining standard glossary terminology (ISO 1087 and others), define an XML Schema or DTD that models “all useful” aspects of a term and its definition.
• Should be applicable to any government agency.
• Consider flexibility and collaborative development as key design criteria. Many different agencies may use the model and many individuals may author specific term definitions.
• Create an XSLT stylesheet that knows about the model and displays an XML glossary instance document as HTML in any modern browser.
• Eventually consider XSL-FO for PDF rendering of the glossary.

Design Goals
• Standards-Based - XML element names are loosely based on an international standard, ISO 1087.
• Flexible - The Glossary DTD, although initially a strawman to stimulate discussion, is fairly flexible with few required elements, many optional elements, and several repeatable elements.
• Provides a Framework - Since so few elements are required, terms can be added even before definitions are known. These terms act as placeholders that are fully supported by the DTD and XSLT. (For example, see the stub terms "DTD" and "XSLT" in the example instance.)

Design Goals
• Specialized - Any term may have multiple definitions so that different agencies may use the same term with their own specialized meaning, where necessary.
• Collaborative - Since an XSLT stylesheet is used to sort the terms alphabetically, many individuals can work on their own glossary fragments (XML instances of the Glossary DTD). At any time, the various contributions can be easily merged without manual editing.
• Leverages Links - Search links are automatically generated for each term by means of the XSLT, both to help kick-start and to augment the definition.

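The Collaborative and Leverages Links goals can be illustrated outside XSLT. The following is a minimal Python sketch, not the stylesheet from this presentation: it sorts Term names case-insensitively and builds one search link per term. The root element name Glossary, the stub terms containing only a Name, and the search URL are assumptions made for illustration; the slides do not say which search engine the generated links target.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus

# Hypothetical search endpoint (assumption): the slides do not name one.
SEARCH_URL = "https://www.example.com/search?q="

# Assumed instance layout: a Glossary root holding Term elements,
# here reduced to stub terms that carry only a Name.
GLOSSARY_XML = """
<Glossary>
  <Term id="xslt"><Name>XSLT</Name></Term>
  <Term id="ontology"><Name>ontology</Name></Term>
  <Term id="dtd"><Name>DTD</Name></Term>
</Glossary>
"""

def sorted_terms_with_links(xml_text):
    """Return (name, search link) pairs sorted case-insensitively by name."""
    root = ET.fromstring(xml_text)
    names = [term.findtext("Name", default="") for term in root.iter("Term")]
    names.sort(key=str.lower)
    return [(name, SEARCH_URL + quote_plus(name)) for name in names]

for name, link in sorted_terms_with_links(GLOSSARY_XML):
    print(name, link)
```

Because the ordering is computed at display time, authors do not need to keep their fragments sorted, which is what makes independent contributions practical.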
ISO 1087 Terminology (etc.)
• Characteristic: Abstraction of a property of an object or of a set of objects. Note - Characteristics are used for describing concepts. [ISO 1087-1:2000, 3.2.4]
• Concept: A unit of thought constituted through abstraction on the basis of properties common to a set of objects. Note - Concepts are not bound to particular languages. They are, however, influenced by the social or cultural background. (ISO 1087:1990) Unit of knowledge created by a unique combination of characteristics. [ISO 1087-1:2000, 3.2.1]
• Definition: Statement which describes a concept and permits its differentiation from other concepts within a system of concepts. (ISO 1087:1990) Representation of a concept by a descriptive statement which serves to differentiate it from related concepts. [ISO 1087-1:2000, 3.3.1]
Key: ISO 1087 | Used | ISO 1087 & Used | Unused

ISO 1087 Terminology (etc.)
• Designation: Representation of a concept by a sign which denotes it. [ISO 1087-1:2000, 3.4.1]
• Dictionary [see terminology and vocabulary]: Structured collection of lexical units with linguistic information about each of them. (ISO 1087:1990)

ISO 1087 Terminology (etc.)
• Entry, Headword: The term headword appears in two different meanings. In lexicography, a headword is the word used as the heading in a dictionary entry or encyclopedia. In a descriptive terminology entry where no preference is given to any one term, there is no head term, but if preference is given to a term, head term is sometimes used in analogy to lexicography, as is main entry term. (Wright & Budin, 1997)

ISO 1087 Terminology (etc.)
• Glossary [see dictionary, terminology, vocabulary]: Alphabetical list of terms or words found in or relating to a specific topic or text. It may or may not include explanations. Note - The distinguishing criterion is that glossaries are considered to reside in backmatter attached to books and other publications rather than being independent works in their own right. Glossaries are sometimes perceived as being less scientific in intent and methodology than terminologies, terminology standards, and even vocabularies, although a certain degree of synonymy exists. (Wright & Budin, 1997)

ISO 1087 Terminology (etc.)
• Nomenclature: System of terms which is elaborated according to pre-established naming rules. (ISO 1087:1990)
• Object: Anything perceivable or conceivable. Note - Objects may also be material (e.g. an engine, a sheet of paper, a diamond), immaterial (e.g. a conversion ratio, a project plan) or imagined (e.g. a unicorn). [Adapted from ISO 1087-1:2000, 3.1.1]

ISO 1087 Terminology (etc.)
• Synonym: A word with the same meaning or nearly the same meaning as another word in the same language. (Longman Dictionary of English Language and Culture: Longman Group UK Limited 1992) Note: Terminologists distinguish between real synonyms, i.e. terms which can be substituted with each other whatever the context, and the more common quasi-synonyms, which can differ from one another by context and sometimes by subject field. (Sager, 1990)
• Term: Designation of a defined concept in a special language by a linguistic expression. Note - A term may consist of one or more words or even contain symbols. (ISO 1087:1990)

ISO 1087 Terminology (etc.)
• Terminological Dictionary [see dictionary and vocabulary]: Dictionary containing terminological data from one or more specific subject fields. Note - admitted term: technical dictionary (ISO 1087:1990)
• Terminological Record: Structured collection of terminological data relevant to one concept. (ISO 1087:1990)
• Terminological Database: Structured sets of terminological records in an information processing system. (ISO 1087:1990)

ISO 1087 Terminology (etc.)
• Terminology Work: Any activity concerned with the systematization and representation of concepts or with the presentation of terminologies on the basis of established principles and methods. (ISO 1087:1990)
• Vocabulary [see terminology, dictionary, glossary]: Terminological dictionary containing the terminology of a specific subject field or of related subject fields and based on terminology work. (ISO 1087:1990)

Summary: ISO 1087 Terminology
• Unused ISO 1087 Terms: Characteristic, Designation, Dictionary, Nomenclature, Object, PreferredTerm – TBD?, Terminological Dictionary / technical dictionary, Terminological Record, Terminological Database, Terminology Work, Vocabulary
• ISO 1087 Terms Used: Concept, Definition, Term
• Used but not ISO 1087: Glossary, Synonym, RelatedTerm
• Additional Terms by Sall (next slide): Name, Acronym, ExpandedAcronym, DefinitionSection, Source, Usage

Additional (Non-Standard) Terminology
• Glossary – change to Dictionary, Vocabulary, Technical Dictionary or Terminology?
• Name – added only to allow Term to be a container; could change Term to Entry and Name to Term?
• Acronym – necessary option for technical terms
• ExpandedAcronym – ditto
• DefinitionSection - added simply as a repeatable container to encompass all aspects pertaining to a specific definition of a term
• Source - useful for traceability and credibility
• Usage – useful to have an optional example sentence for a given definition (use in context)

XML Example of One Term
<Term id="ontology">
  <Name>ontology</Name>
  <DefinitionSection>
    <Concept>semantic web</Concept>
    <Concept>knowledge management</Concept>
    <Definition>Defines the common words and concepts used to describe and represent an area of knowledge, and so standardizes the meanings. An ontology includes classes in the domains of interest, instances, relationships, properties and their values, functions of and processes involving the objects, and relevant constraints and rules.</Definition>
    <Source>Daconta, Obrst, Smith</Source>
    <Usage>An ontology can range from the simple notion of a taxonomy to a thesaurus, to a conceptual model, to a logical theory. [Daconta, Obrst, Smith]</Usage>
    <Synonym>classification system</Synonym>
    <RelatedTerm>taxonomy</RelatedTerm>
    <RelatedTerm>OWL</RelatedTerm>
  </DefinitionSection>
  <DefinitionSection>
    <Concept>philosophy</Concept>
    <Definition>[sometimes "Ontology"] the metaphysical study of the nature of being and existence</Definition>
    <Source>WordNet</Source>
    <Usage>Both the ontology and manner of human existence are of concern to Existentialism.</Usage>
    <Synonym>metaphysics</Synonym>
  </DefinitionSection>
</Term>

XML Example: XSLT Details
[Figure: rendered glossary entry with callouts]
• DefinitionSection based on Concept
• CSS Styling
• Optional and Repeatable Elements
• New DefinitionSection based on 2nd Concept
• Auto-generated Search Links

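To make the callouts concrete, here is a rough Python sketch of how one Term could be turned into HTML with one block per DefinitionSection, headed by its Concept values. It is an illustration only and does not reproduce the actual XSLT: the HTML tags chosen, the function name render_term, and the abbreviated sample (only part of the ontology entry, with the first definition cut to its first sentence) are all assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Abbreviated version of the "ontology" term, for illustration only.
TERM_XML = """
<Term id="ontology">
  <Name>ontology</Name>
  <DefinitionSection>
    <Concept>semantic web</Concept>
    <Concept>knowledge management</Concept>
    <Definition>Defines the common words and concepts used to describe and represent an area of knowledge, and so standardizes the meanings.</Definition>
    <Synonym>classification system</Synonym>
  </DefinitionSection>
  <DefinitionSection>
    <Concept>philosophy</Concept>
    <Definition>the metaphysical study of the nature of being and existence</Definition>
  </DefinitionSection>
</Term>
"""

def render_term(xml_text):
    """Render one Term as HTML, one block per DefinitionSection (hypothetical layout)."""
    term = ET.fromstring(xml_text)
    lines = ["<h2>%s</h2>" % escape(term.findtext("Name", default=""))]
    for section in term.findall("DefinitionSection"):
        # Concept is repeatable, so join all values into one section heading.
        concepts = ", ".join(c.text or "" for c in section.findall("Concept"))
        lines.append("<h3>%s</h3>" % escape(concepts))
        # Definition is treated as optional so stub terms still render.
        definition = section.findtext("Definition")
        if definition is not None:
            lines.append("<p>%s</p>" % escape(definition))
        # Synonym is optional and repeatable; emit it only when present.
        synonyms = [s.text or "" for s in section.findall("Synonym")]
        if synonyms:
            lines.append("<p>Synonyms: %s</p>" % escape(", ".join(synonyms)))
    return "\n".join(lines)

print(render_term(TERM_XML))
```

The second DefinitionSection has no Synonym in this abbreviated sample, so its block simply omits that line, which mirrors how optional elements are meant to behave.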
Collaboration – Merging Instances
• Since a Glossary consists of one or more Terms, a relatively simple XSLT can be created to merge the Term elements of two or more XML instances.
• This means different authors (from the same or different agencies) can work independently.
• Issue: What if the same Term is defined by different authors? Automatically add each definition, even though they may overlap or conflict, or manually edit collisions (could generate a conflict message)?
• Issue: Should the agency name be a Source or another element (e.g., AgencySource)? The advantage of a separate element is that custom XSLT could extract or render terms on a per-agency basis, if desired. Should there be an optional, repeatable SourceLink element for a URL?

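The merge and the first open issue can be sketched in Python rather than XSLT. This is one possible policy, not a decision made in these slides: terms are matched on their id attribute, colliding terms have their DefinitionSection elements appended to the first occurrence, and the colliding ids are reported so an editor can review them. The function name merge_glossaries, the Glossary root element, and the two sample instances are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical agency instances that both define "ontology".
AGENCY_A = """
<Glossary>
  <Term id="ontology">
    <Name>ontology</Name>
    <DefinitionSection><Concept>semantic web</Concept></DefinitionSection>
  </Term>
</Glossary>
"""
AGENCY_B = """
<Glossary>
  <Term id="ontology">
    <Name>ontology</Name>
    <DefinitionSection><Concept>philosophy</Concept></DefinitionSection>
  </Term>
  <Term id="taxonomy"><Name>taxonomy</Name></Term>
</Glossary>
"""

def merge_glossaries(*xml_texts):
    """Merge Term elements from several instances into one Glossary element.

    Terms are matched on the id attribute (an assumption). When an id
    repeats, the later DefinitionSection elements are appended to the
    first Term and the id is recorded as a collision.
    """
    merged = ET.Element("Glossary")
    by_id = {}
    collisions = []
    for xml_text in xml_texts:
        for term in ET.fromstring(xml_text).iter("Term"):
            term_id = term.get("id")
            if term_id in by_id:
                by_id[term_id].extend(term.findall("DefinitionSection"))
                if term_id not in collisions:
                    collisions.append(term_id)
            else:
                by_id[term_id] = term
                merged.append(term)
    return merged, collisions

merged, collisions = merge_glossaries(AGENCY_A, AGENCY_B)
print("collisions:", collisions)
```

Reporting collisions instead of silently combining them keeps both options from the issue open: the definitions are all retained, and the list of ids can drive a conflict message or a manual review.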
Next Steps
• Determine interested agencies.
• Establish funding.
• Resolve terminology issues for the Glossary model.
• Consider merge or replacement by GlossXML and/or XML Acronym Demystifier.
• Need to finalize DTD or XML Schema before agencies start authoring.
• Revise initial XSLT to match final Glossary model.
• Determine repository and submission mechanisms.
• Could be a good use for CORE.gov?
• Coordinate with Plans for Derived XML Registry Prototype?
• Write additional XSLT stylesheets for merging and pulling agency-specific terms, etc.
• Develop XSL-FO stylesheets for PDF rendering of Glossary.