Comparing approaches to XML-based discourse modeling: Secondary Information Structuring

Felix Sasaki
University of Bielefeld
Research Group "Text-technology", Project "Sekimo"
www.text-technology.de

Overview
• Primary Information Structuring and its shortcomings for discourse modeling
• A solution: Secondary Information Structuring
• Its realization within our project:
  • Annotation format and analysis
  • A conceptual level
  • Mapping between data and conceptual level
  • Representation of the conceptual level
• Resources developed within the framework
• Related approaches

Primary Information Structuring
Textual data with annotations:
  <corpus>
    <bunsetsu> <VN>shucchou</VN> <PGen>no</PGen> </bunsetsu>
    <bunsetsu> <NF>ken</NF> </bunsetsu>
  </corpus>
Document grammar constructs:
  <!ELEMENT bunsetsu (VN|PGen|NF)+>
  <!ELEMENT PGen (#PCDATA)>
  ...
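A minimal sketch of what Primary Information Structuring provides, assuming Python's standard `xml.etree.ElementTree`. The set `BUNSETSU_CHILDREN` is a hand-coded stand-in for the content model `(VN|PGen|NF)+` and is not a real DTD validator; the helper names are illustrative only.

```python
import xml.etree.ElementTree as ET

ANNOTATION = """
<corpus>
  <bunsetsu> <VN>shucchou</VN> <PGen>no</PGen> </bunsetsu>
  <bunsetsu> <NF>ken</NF> </bunsetsu>
</corpus>
"""

# Hand-coded stand-in for <!ELEMENT bunsetsu (VN|PGen|NF)+> (assumption:
# only child element names are checked, not character content).
BUNSETSU_CHILDREN = {"VN", "PGen", "NF"}


def check_bunsetsu(elem):
    """True if elem has at least one child and all children are allowed."""
    children = [child.tag for child in elem]
    return len(children) >= 1 and all(t in BUNSETSU_CHILDREN for t in children)


def primary_data(root):
    """Return (annotation category, text) pairs for all leaf elements."""
    return [(leaf.tag, leaf.text) for leaf in root.iter() if len(leaf) == 0]


root = ET.fromstring(ANNOTATION)
valid = all(check_bunsetsu(b) for b in root.iter("bunsetsu"))
tokens = primary_data(root)
```

The document grammar constrains which annotations may occur; the textual data is recovered by reading the leaf elements in document order.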

Its shortcomings for discourse modeling
Annotation with elements of a Japanese dependency grammar:
  <corpus>
    <bunsetsu> <VN>shucchou</VN> <PGen>no</PGen> </bunsetsu>
    <bunsetsu> <NF>ken</NF> </bunsetsu>
  </corpus>
  <!ELEMENT bunsetsu (VN|PGen|NF)+>
  <!ELEMENT PGen (#PCDATA)>
  ...
Annotation with elements of an HPSG grammar:
  <corpus>
    <NP>
      <COMP><NP.NO>
        <COMP> <VN>shucchou</VN> </COMP>
        <HD> <PGen>no</PGen> </HD>
      </NP.NO></COMP>
      <HD> <NF>ken</NF> </HD>
    </NP>
  </corpus>
  <!ELEMENT COMP (NP.NO|VN)+>
Open question: what is the relation between theory-, language- or domain-specific document grammars and annotations?
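The problem can be illustrated with a small sketch, again assuming Python's standard `xml.etree.ElementTree`; the function names are illustrative. Both annotations cover the same primary data, but the element names that are specific to each document grammar have nothing in common, so the grammars alone say nothing about how, for example, `bunsetsu` relates to `COMP` or `HD`.

```python
import xml.etree.ElementTree as ET

BUNSETSU_LAYER = """<corpus>
  <bunsetsu><VN>shucchou</VN><PGen>no</PGen></bunsetsu>
  <bunsetsu><NF>ken</NF></bunsetsu>
</corpus>"""

HPSG_LAYER = """<corpus><NP>
  <COMP><NP.NO>
    <COMP><VN>shucchou</VN></COMP>
    <HD><PGen>no</PGen></HD>
  </NP.NO></COMP>
  <HD><NF>ken</NF></HD>
</NP></corpus>"""


def primary_text(xml_string):
    """Concatenate all character data and drop whitespace."""
    root = ET.fromstring(xml_string)
    return "".join("".join(root.itertext()).split())


def element_names(xml_string):
    """Set of element names used in an annotation."""
    return {elem.tag for elem in ET.fromstring(xml_string).iter()}


same_primary_data = primary_text(BUNSETSU_LAYER) == primary_text(HPSG_LAYER)
only_bunsetsu = element_names(BUNSETSU_LAYER) - element_names(HPSG_LAYER)
only_hpsg = element_names(HPSG_LAYER) - element_names(BUNSETSU_LAYER)
```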

Secondary Information Structuring
[Figure: overview of the framework]
• Multiple annotations of the same primary data (Annotation 1, Annotation 2, Annotation n): annotation format and annotation analysis
• Primary Information Structuring: pool of unrelated document grammar constructs (bunsetsu, NP, COMP, HD)
• Secondary Information Structuring: conceptual level of interrelated document grammar constructs

Annotation format and annotation analysis
• Multiple annotations of the same primary data
• Analysis 1: multilayer relations, for relations between annotations on separate layers:
  • identity
  • endpoint_is_startingpoint
  • ...
• Analysis 2: caterpillar expressions, for relations within the tree structure of a single layer if there are no meta-relations for annotation units
• Analysis 3: sub-classification of annotation units according to Analyses 1 and 2
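Analysis 1 could be sketched as follows. The relation names `identity` and `endpoint_is_startingpoint` come from the list above; their definition in terms of character offsets over the whitespace-free primary data is an assumption made for this sketch, as are the function names.

```python
import xml.etree.ElementTree as ET

BUNSETSU_LAYER = ("<corpus><bunsetsu><VN>shucchou</VN><PGen>no</PGen></bunsetsu>"
                  "<bunsetsu><NF>ken</NF></bunsetsu></corpus>")
HPSG_LAYER = ("<corpus><NP><COMP><NP.NO><COMP><VN>shucchou</VN></COMP>"
              "<HD><PGen>no</PGen></HD></NP.NO></COMP>"
              "<HD><NF>ken</NF></HD></NP></corpus>")


def spans(xml_string):
    """Return (tag, start, end) offsets over the whitespace-free primary data."""
    result = []

    def walk(elem, pos):
        start = pos
        pos += len("".join((elem.text or "").split()))
        for child in elem:
            pos = walk(child, pos)
            pos += len("".join((child.tail or "").split()))
        result.append((elem.tag, start, pos))
        return pos

    walk(ET.fromstring(xml_string), 0)
    return result


def relations(layer_a, layer_b):
    """Relate every unit of layer_a to every unit of layer_b."""
    found = []
    for tag_a, start_a, end_a in layer_a:
        for tag_b, start_b, end_b in layer_b:
            if (start_a, end_a) == (start_b, end_b):
                found.append((tag_a, "identity", tag_b))
            elif end_a == start_b:
                found.append((tag_a, "endpoint_is_startingpoint", tag_b))
    return found


rels = relations(spans(BUNSETSU_LAYER), spans(HPSG_LAYER))
```

In this example the first `bunsetsu` covers the same span as the outer `COMP`, and its end point is the starting point of the `HD` that contains `ken`.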

Secondary Information Structuring
[Figure: the framework overview again, highlighting the conceptual level]

Conceptual level: Basic structure
• Model: HPSG
• Model-specific concepts: Head-general, Comp-general, Head-sub1, Head-sub2

Conceptual level: Relations
• Model: HPSG
• Model-specific concepts: Head-general, Comp-general, Head-sub1, Head-sub2
• Relation partOf, e.g. Head-general partOf HPSG
• Relation subClassOf, e.g. Head-sub1 subClassOf Head-general
• Interconceptual properties

Interconceptual properties
[Figure: the HPSG model with the properties "caterpillarToComp" and "starting point is end point" drawn between its concepts]
Triple notation:
• Head-general | caterpillarToComp | right 'Comp-general'
• Head-general | starting point is end point | Comp-general
• Head-sub1 | starting point is end point | Head-sub2
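The triple notation maps directly onto a simple data structure. The following is a minimal sketch that stores the three triples from this slide as Python tuples; the helper `objects` is illustrative and not part of the project's tooling.

```python
# (subject, property, object) triples as listed on the slide
TRIPLES = [
    ("Head-general", "caterpillarToComp", "right 'Comp-general'"),
    ("Head-general", "starting point is end point", "Comp-general"),
    ("Head-sub1", "starting point is end point", "Head-sub2"),
]


def objects(subject, prop):
    """All objects that a subject is related to via a given property."""
    return [o for s, p, o in TRIPLES if s == subject and p == prop]


def properties_of(subject):
    """All properties that have the given concept as subject."""
    return sorted({p for s, p, _ in TRIPLES if s == subject})


related = objects("Head-general", "starting point is end point")
head_props = properties_of("Head-general")
```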

Secondary Information Structuring
[Figure: the framework overview again, highlighting the mapping between data and conceptual level]

Mapping between data and conceptual level
• Key concept: interconceptual properties correspond to configurations between annotations on different annotation layers, or to caterpillar expressions
• The document grammar constructs on which the annotation layers are based are mapped manually to superordinate concepts
• For all subordinate concepts, the mapping can be inferred automatically
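One possible reading of this mapping, as a hedged sketch: element names are mapped by hand to superordinate concepts, and an interconceptual property is then checked as a configuration of annotation spans. The span values, the direction of "starting point is end point" (the subject's start equals the object's end) and all names below are assumptions made for illustration.

```python
# (element name, start, end) spans as they might result from an HPSG-style
# annotation of "shucchou no ken" (hypothetical, hand-written values)
HPSG_SPANS = [("COMP", 0, 8), ("HD", 8, 10), ("COMP", 0, 10), ("HD", 10, 13)]

# Manual mapping from document grammar constructs to superordinate concepts
MANUAL_MAPPING = {"HD": "Head-general", "COMP": "Comp-general"}

# Interconceptual property, read here as:
# the subject's starting point is the object's end point
PROPERTY = ("Head-general", "starting point is end point", "Comp-general")


def instances_satisfying(prop, spans, mapping):
    """Pairs of annotation units that realize the property in the data."""
    subject_concept, _, object_concept = prop
    hits = []
    for unit_a in spans:
        for unit_b in spans:
            if (mapping.get(unit_a[0]) == subject_concept
                    and mapping.get(unit_b[0]) == object_concept
                    and unit_a[1] == unit_b[2]):
                hits.append((unit_a, unit_b))
    return hits


hits = instances_satisfying(PROPERTY, HPSG_SPANS, MANUAL_MAPPING)
```

Units that realize such a configuration are candidates for a subordinate concept, which is how a mapping could be derived without further manual work.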

Visualization of the mapping
[Figure]
• Models (intensional, declarative description of axioms for document grammar constructs): HPSG with Head-General, Comp-General, Head-Sub1, Head-Sub2 and the properties caterpillarToComp (right 'Comp-General') and "starting point is end point"
• Pool of document grammar constructs (extension: document grammar constructs and annotated documents): <xsd:element name="hd"/>, <xsd:element name="bunsetsu"/>, <xsd:element name="Comp"/>, ...
• Manual mapping to the superordinate concepts (Head-General, Comp-General); the mapping to the subordinate concepts (Head-Sub1, Head-Sub2) is automatically inferred

Operations between data and conceptual level
[Figure: the framework overview, annotated with operations between the levels]
• Theory-driven interrelation
• Data-based interrelation
• Validation of hypotheses, transformation of data (not yet implemented)
• Primary Information Structuring: language- and theory-specific document grammars (element, attribute)

Conceptual level: Representation as RDFS
• RDF Schema: "Resource Description Framework, Vocabulary Description Language"
• Offers the constructs that are necessary for the representation of the models
  → Integration of many other abstract resources, e.g. lexical knowledge (WordNet)
• More expressive languages, e.g. OWL, build on RDFS
  → Ontological knowledge (e.g. SUMO) or linguistic ontologies (e.g. GOLD) that use these languages can be related to the conceptual level
• "Abstract" language resources can be combined with annotated data

Visualization of the RDFS representation
[Figure: two class hierarchies built with rdfs:subClassOf, rdfs:Class A (A-1, A-2, A-1-1, A-1-2) and rdfs:Class B (B-1, B-2, B-1-1, B-1-2), together with rdf:Property]
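A class hierarchy of this kind could be written out as follows. This is a sketch that emits N-Triples lines with plain string formatting instead of an RDF library. The `rdf:` and `rdfs:` namespace URIs are the standard W3C ones; the namespace `EX` is hypothetical, and the hierarchy (A-1-1 below A-1 below A) is inferred from the class names in the figure.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
EX = "http://example.org/conceptual-level#"  # hypothetical namespace

# child class -> parent class (assumed from the naming in the figure)
SUBCLASS_OF = {"A-1": "A", "A-2": "A", "A-1-1": "A-1", "A-1-2": "A-1"}


def n_triples(subclass_of):
    """Serialize class declarations and rdfs:subClassOf links as N-Triples."""
    lines = []
    names = sorted(set(subclass_of) | set(subclass_of.values()))
    for name in names:
        lines.append(f"<{EX}{name}> <{RDF}type> <{RDFS}Class> .")
    for child, parent in sorted(subclass_of.items()):
        lines.append(f"<{EX}{child}> <{RDFS}subClassOf> <{EX}{parent}> .")
    return lines


def superclasses(name, subclass_of):
    """Follow rdfs:subClassOf upwards; the relation is transitive."""
    found = []
    while name in subclass_of:
        name = subclass_of[name]
        found.append(name)
    return found


lines = n_triples(SUBCLASS_OF)
ancestors = superclasses("A-1-1", SUBCLASS_OF)
```

Because rdfs:subClassOf is transitive, anything mapped to a subordinate class also belongs to all of its superordinate classes.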

Resources: Primary Information Structuring
[Figure: the framework overview, with resources on the level of Primary Information Structuring]
• HPSG annotation of the VERBMOBIL treebank
• bunsetsu annotation of the VERBMOBIL treebank
• Annotation of tinkertoy dialogues

Resources: Secondary Information Structuring
[Figure: the framework overview, with resources on the level of Secondary Information Structuring]
• Japanese functional pragmatics (JadEx project)
• bunsetsu categories
• HPSG-related categories
• General Japanese linguistic categories

Related methodology
• ISO initiative on Language Resources Standards: TC37 SC 4
• The creation of general and specific annotation vocabularies:
  • VAML (Virtual Markup Language)
  • CAML (Concrete Markup Language)
• Applied within the latest version of the Corpus Encoding Standard
• Difference from our methodology: relations between VAMLs and CAMLs are primarily based upon tree-structured data

XML-based discourse modeling
• Discourse modeling
  • A modeling framework (ontologically empty), not a specific model
  • Document grammars supply annotation categories, without a specific interpretation
  • The expressive power of trees is enhanced
  • Applicable mainly to textual data, not multimodal domains
• XML-based? Yes!
  • XML as an enhanced data model
  • Document grammars are sufficient and useful for Primary Information Structuring
