
Andreas Witt Dieter Metzing Jens Pönninghaus Daniela Goecke

Relations between multiple annotations: Representation, Inferences, Context Specification, and Unification. www.text-technology.de


Presentation Transcript


  1. Relations between multiple annotations: Representation, Inferences, Context Specification, and Unification Andreas Witt Dieter Metzing Jens Pönninghaus Daniela Goecke www.text-technology.de

  2. Contents • Project description • Approaches to Multiple Annotations • multiple Levels • multiple Layers • Representation • Inferences • Context Specification • Unification

  3. Project description • The project Secondary Information Structuring and Comparative Discourse Analysis (Sekimo) is part of the DFG-Forschergruppe 437, Text-technological Modelling of Information • Within this project a corpus is annotated on different (linguistic) levels • Aim of the project: inferring, describing, and modelling relations between these levels

  4. Contents • Project description • Approaches to Multiple Annotations • multiple Levels • multiple Layers • Representation • Inferences • Context Specification • Unification

  5. Standard Methodology • A corpus is annotated according to a given tag set • The tag set is defined in a document grammar (e.g. the TEI DTD) • In general, different tag sets exist for annotating different kinds of documents (e.g. poems, encyclopedias) or different kinds of information (e.g. linguistic information) • In particular, a linguistic annotation can depend on: • theoretical assumptions (constituent structure, functional structure, or a more specific theory) • the language • research questions

  6. Problems of the standard methodology • Levels of description are neglected, or • Different levels of annotation are mixed up • Difficulties: multiple hierarchies within one document

  7. General Solutions (cf. TEI Guidelines) • CONCUR: an optional feature of SGML (not available in XML) which allows multiple hierarchies to be marked up concurrently in the same document • milestone elements: empty elements which mark the boundaries between elements in a non-nesting structure • fragmentation of an item: the division of what logically is a single element into two or more parts, each of which nests properly within its context • virtual joins: the recreation of a virtual element from fragments of text (requires a separate interpretation) • redundant encoding of information in multiple forms

  8. Multiple hierarchies and language data • Hypertext linking techniques are used for connecting multiple layers of annotation, e.g.: • Within the EU-Project NITE an annotation format has been developed which allows for specifying links between separate annotation layers • The annotation graphs (AGs) format uses a (possibly abstract) timeline as linking-layer • Modified versions of the AGs are applied by • the TASX-Annotator • the EXMARaLDA-Project

  9. Alternative Methodology • XML-based multi-layer annotation • Technically, each layer becomes a separate and independent XML-document • The same text is annotated several times • Advantages: • seems to be the only way to annotate multiple hierarchies without workarounds • each document instance uses its own DTD (or Schema), i.e. annotation formats are not mixed up • at any time a new annotation can be produced • transformation tools to the NITE and the TASX-format exist (Master’s Thesis by Jan F. Maas)

  10. Contents • Project description • Approaches to Multiple Annotations • multiple Levels • multiple Layers • Representation • Inferences • Context Specification • Unification

  11. Layer vs. Level • We distinguish annotation level vs. annotation layer • Annotation level refers to an abstract level of analysis • Annotation layer refers to the realisation of an annotation in e.g. XML • Examples of annotation levels: morphology in a linguistic grammar, text structure (sections, paragraphs,...), layout (lines and pages), thematic structure, rhetorical structure • Sometimes one layer contains several levels (e.g. HTML), but a level can also be distributed over several layers

  12. Annotation Process • Given: • the textual representation of language material (text) • the text is regarded as primary data • For each annotation layer the primary data is copied • The (copy of the) primary text is annotated according to a schema (e.g. a DTD) • Annotation can be prepared • in any XML editor (e.g. XMetaL, XML-Spy, psgml-emacs) • with a special-purpose annotation tool

  13. Sample annotation with a web-based, special-purpose annotation tool • This tool is used only for flat XML structures, i.e. XML annotations with non-nested elements

  14. Example: XML annotation with the Emacs editor (usable for deep and flat annotations)

  15. Multi-layer-annotation tool (master's thesis by Stefan Michel; work in progress)

  16. Multiple Annotations • Drawbacks: • redundant • the separate documents are independent (i.e. not connected) • But: • since the documents contain exactly the same text, the text can function as the link • Solution: • a common representation format for all separate XML-documents

  17. Contents • Project description • Approaches to Multiple Annotations • multiple Levels • multiple Layers • Representation • Inferences • Context Specification • Unification

  18. Prolog-Representation • The Prolog representation is based on work by Renear, Huitfeldt, Dubin and Sperberg-McQueen • Original representation for an XML element: node/2, i.e. the predicate node has two arguments • the position in the document tree • a value, e.g. element(corpus) • Extension: node/2 is replaced by node/5 • The 3 new arguments: • annotation layer • starting point of the annotated text • end point of the annotated text

  19. Conversion from XML to Prolog (xml2prolog) • Implemented in Python • Input: 1 or more XML-Documents • Result: Collection of Prolog facts • Example: • the element <Root> is represented as the fact: node(AnnotationLayer, 0, 42332, [1], element(Root)). • the attribute att=val of the Element <Root> is represented as the fact: attr(AnnotationLayer, 0, 42332, [1], 'att', 'val').
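
The conversion described on this slide can be illustrated with a minimal Python sketch. It is not the authors' xml2prolog program: the function name xml_to_facts, the numbering of child elements in the tree position, and the quoting of atoms are assumptions made for illustration. Only the shape of the node/5 and attr/6 facts follows the slide.

```python
import xml.etree.ElementTree as ET


def xml_to_facts(xml_text, layer):
    """Return node/5- and attr/6-style facts as strings (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    facts = []

    def quote(value):
        # Quote a Prolog atom, escaping backslashes and single quotes.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    def walk(elem, path, start):
        # 'start' is the character offset in the primary data where elem begins.
        pos = start + len(elem.text or "")
        for index, child in enumerate(elem, start=1):
            # Assumption: only element children are counted in the tree position.
            pos = walk(child, path + [index], pos)
            pos += len(child.tail or "")
        tree_pos = "[" + ",".join(map(str, path)) + "]"
        facts.append(
            f"node({quote(layer)}, {start}, {pos}, {tree_pos}, "
            f"element({quote(elem.tag)}))."
        )
        for name, value in elem.attrib.items():
            facts.append(
                f"attr({quote(layer)}, {start}, {pos}, {tree_pos}, "
                f"{quote(name)}, {quote(value)})."
            )
        return pos  # offset just after the end of elem's text content

    walk(root, [1], 0)
    return facts


for fact in xml_to_facts('<Root att="val">ab<w>cd</w>e</Root>', "demo"):
    print(fact)
```

Because start and end offsets are counted over the character data only, the same offsets are comparable across layers as long as all layers annotate identical primary data.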

  20. xml2prolog.py • Some options for the transformation process • compare: the primary data of the XML files are compared; if the primary data is not identical, the first difference is shown • pcdata/pcdatanodes: character data can be included • aggressive: whitespace is added or removed anywhere in the document if whitespace is the reason for differences in the primary data • filter: some elements in some files should be filtered out (including their textual content), e.g. <script> within HTML documents
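
A rough Python sketch of what a comparison of primary data could look like is given here. The helper names primary_data and first_difference are hypothetical and do not reproduce the actual option handling of xml2prolog.py; the sketch only shows the idea of stripping markup and reporting the first differing offset.

```python
import xml.etree.ElementTree as ET


def primary_data(xml_text):
    """Concatenate all character data of an XML document, dropping markup."""
    return "".join(ET.fromstring(xml_text).itertext())


def first_difference(text_a, text_b):
    """Return the offset of the first differing character, or None if identical."""
    for offset, (char_a, char_b) in enumerate(zip(text_a, text_b)):
        if char_a != char_b:
            return offset
    if len(text_a) != len(text_b):
        # One text is a proper prefix of the other.
        return min(len(text_a), len(text_b))
    return None


words = primary_data("<s><w>the</w> <w>tree</w></s>")
sylls = primary_data("<p><syll>the</syll> <syll>tree</syll></p>")
print(first_difference(words, sylls))  # None: both layers share the same text
```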

  21. Example: [Figure: the string "shucchou no keN" annotated on several layers, with the labels NP, NP.no, COMP, HD, VN, PGen, NF, joshi, meishi, bunsetsu, and bunsetsu[@type=dependent]]

  22. Example (Collection of Prolog-Facts)

  23. Example (Collection of Prolog-Facts), with callouts: • annotation layer • start- and endpoint • nodes in DOM-tree • element names • attribute-value-pair • data-contents

  24. Contents • Project description • Approaches to Multiple Annotations • multiple Levels • multiple Layers • Representation • Inferences • Context Specification • Unification

  25. Relations between annotation layers • Relations are inferred automatically • Special Prolog predicates have been implemented to compare the annotation layers • Example (Identity): <w>tree</w> <m>tree</m> <syll>tree</syll>

  26. Relations between Annotations Cf. Durusau & Brook O'Donnell (2002) and Durand (1999) 1. <a>....................</a> <b>......</b> 2. <a>....................</a> <b>.........</b> 3. <a>....................</a> <b>.....................</b> 4. <a>....................</a> <b>................................</b> 5. <a>....................</a> <b>..........................................</b> 6. <a>....................</a> <b>......</b> 7. <a>....................</a> <b>....................</b> 8. <a>....................</a> <b>...............................</b> etc.

  27. Relations between annotation layers (table of relations with their visualisations; legend: range of element a, range of element b) • identity • independence • inclusion • start point identity • end point identity • end point is starting point • overlap
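
The relations named on this slide can be made concrete as comparisons of character ranges. The following Python sketch is an illustration under assumptions, not the Prolog implementation: ranges are taken to be non-empty (start, end) offset pairs, and each pair of ranges receives exactly one, most specific label. The precise boundary conventions of the original predicates are not given in the slides.

```python
def relation(a, b):
    """Classify the relation between two non-empty (start, end) ranges."""
    a_start, a_end = a
    b_start, b_end = b
    if (a_start, a_end) == (b_start, b_end):
        return "identity"
    if a_start == b_start:
        return "start_point_identity"
    if a_end == b_end:
        return "end_point_identity"
    if a_end == b_start or b_end == a_start:
        return "end_point_is_starting_point"
    if a_end < b_start or b_end < a_start:
        return "independence"
    if (a_start < b_start and b_end < a_end) or (b_start < a_start and a_end < b_end):
        return "inclusion"
    return "overlap"


print(relation((0, 4), (0, 4)))  # identity
print(relation((0, 8), (2, 4)))  # inclusion
print(relation((0, 4), (2, 6)))  # overlap
```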

  28. Comparison of annotation layers • We distinguish two kinds of relations between elements: • relations between single instances of an element (relations) • relations between all occurrences of an element (meta-relations) • Prolog programs have been developed to infer both kinds of relations

  29. Prolog Implementation • Aims: • statistics on annotation layers • relations between occurrences of elements • meta-relations

  30. Example: deep annotation (HPSG)

  31. Statistics of the annotation according to HPSG
      ?- get_statistics.
      Please enter layer name or type "q" to exit, "h" for help :
      |: hpsg.
      Statistics for hpsg
      Number of Nodes : 14, Number of different Elements : 5
      Number of Attributes : 1, Number of different A/V-pairs : 4
      ------------------------------------------
      Different elements and their occurrences :
      hpsg 1
      nodesAndLabels 3
      nonannotated-text 4
      phrase 2
      punctuation 4
      ------------------------------------------
      Attribute # occurrences # different values
      type 5 4
      For information on occurrences of Attribute-Value-Pairs enter Attribute name or type q to quit.
      |: type.
      ( edgeCOMP,1 ) , ( edgeHD,2 ) , ( np,1 ) , ( np-no,1 )

  32. Relations between occurrences of elements • Query: How often does a certain relation between elements hold? • chk_relation(Relation,Element1,Layer1,Element2,Layer2,L). • Relation: a relation between elements (e.g. identity, overlap, or endA_is_starting_pointB) • Element1: element name of annotation Layer1 • Element2: element name of annotation Layer2 • L: result list • It is also possible to infer examples and counter-examples of a certain relation

  33. Example: Relations between elements of the HPSG annotation and the elements of a dialogue annotation

  34. Ex.: Relations between HPSG-phrases and X
      ?- chk_relation(Relation,phrase,hpsg,X,dialogue,L).
      Relation = identity
      X = _G160
      L = [] ;
      Relation = included_B_in_A
      X = _G160
      L = [] ;
      Relation = included_A_in_B
      X = _G160
      L = [[[phrase, dialogue, 2], [phrase, 2], [dialogue, 1]]] ;
      ...
      Relation = overlap_A
      X = _G160
      L = []
      Yes

  35. Meta-relations • If a certain relation holds for all instances of an element, we define a meta-relation: • identity: at every occurrence of an element A in Layer1, an element B in Layer2 exists which spans the same range of characters • inclusion: at every occurrence of an element A in Layer1, an element B in Layer2 exists which includes A or is identical to it, and the meta-relation identity does not hold • overlap: at every occurrence of an element A in Layer1, an element B in Layer2 exists which overlaps with A • mixed: no meta-relation exists

  36. Meta-relations (cntd.) • identity - For all occurrences, the following configuration can be found: <a>....................</a><b>....................</b> • inclusion - For all occurrences, one of the following configurations can be found: <a>....................</a> <b>................................</b> <a>....................</a> <b>..........................................</b> <a>....................</a><b>.......................................</b> <a>....................</a><b>....................</b> • overlap - For all occurrences, the following configuration can be found: <a>....................</a> <b>....................</b>
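
The definitions of the meta-relations can be sketched in Python as follows. This is an illustrative reading of the slides, not the authors' Prolog code: the function name meta_relation is hypothetical, occurrences are given as (start, end) character ranges, and an element without any occurrence is reported as "mixed" (an assumption, since the slides do not cover that case).

```python
def meta_relation(spans_a, spans_b):
    """Name the meta-relation between all occurrences of A and the B elements."""

    def identical(a, b):
        return a == b

    def included(a, b):
        # a lies within b; identical ranges count as included, as on the slide.
        return b[0] <= a[0] and a[1] <= b[1]

    def overlaps(a, b):
        # Partial overlap: the ranges intersect, but neither contains the other.
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

    def holds_everywhere(test):
        return all(any(test(a, b) for b in spans_b) for a in spans_a)

    if not spans_a:
        return "mixed"  # assumption: no occurrences, no meta-relation
    if holds_everywhere(identical):
        return "identity"
    if holds_everywhere(included):
        return "inclusion"  # reached only if identity does not hold
    if holds_everywhere(overlaps):
        return "overlap"
    return "mixed"


print(meta_relation([(0, 3), (4, 8)], [(0, 8)]))  # inclusion
```

Checking identity before inclusion implements the condition that the meta-relation inclusion only applies when identity does not hold.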

  37. Contents • Project description • Approaches to Multiple Annotations • multiple Levels • multiple Layers • Representation • Inferences • Context Specification • Unification

  38. Context specification 1: Motivation • Often, general meta-relations do not hold • In these cases, the elements can be classified according to structural properties within their layer • This allows specific meta-relations to be constructed • A format for expressing the structural properties, called "Context Specification Document" (CSD), has been developed

  39. Context specification 2: Realization • Subclassification of element nodes via tree walking automata (TWA) • Underlying path language for the construction of TWA: caterpillar expressions (cf. Brüggemann-Klein and Wood, 2000): • moves: up, right, left, firstChild, lastChild • tests: isFirst, isLast, isLeaf, isRoot • test for element names • Kleene-star operator '*'
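
How such a path language can classify nodes is sketched below in Python. The sketch covers only the moves, the tests, and the element-name test listed on the slide; the Kleene star is left out, and the Node class, the matches function, and the example tree are hypothetical. It is not the CSD format and not a full tree walking automaton.

```python
class Node:
    """A labelled tree node that knows its parent and its children."""

    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def sibling(self, offset):
        # Return the sibling 'offset' positions away, or None if there is none.
        if self.parent is None:
            return None
        siblings = self.parent.children
        index = siblings.index(self) + offset
        return siblings[index] if 0 <= index < len(siblings) else None


MOVES = {
    "up": lambda n: n.parent,
    "left": lambda n: n.sibling(-1),
    "right": lambda n: n.sibling(+1),
    "firstChild": lambda n: n.children[0] if n.children else None,
    "lastChild": lambda n: n.children[-1] if n.children else None,
}

TESTS = {
    "isRoot": lambda n: n.parent is None,
    "isLeaf": lambda n: not n.children,
    "isFirst": lambda n: n.sibling(-1) is None,
    "isLast": lambda n: n.sibling(+1) is None,
}


def matches(node, steps):
    """Walk 'steps' from 'node'; any other step is read as an element-name test."""
    current = node
    for step in steps:
        if step in MOVES:
            current = MOVES[step](current)
            if current is None:
                return False  # the move is not possible from this node
        elif step in TESTS:
            if not TESTS[step](current):
                return False
        elif current.label != step:
            return False
    return True


# Hypothetical mini tree: an NP with a COMP daughter followed by an HD daughter.
comp, hd = Node("COMP"), Node("HD")
np = Node("NP", [comp, hd])
print(matches(hd, ["left", "COMP"]))  # True: HD has a COMP to its left
print(matches(hd, ["up", "NP"]))      # True: HD is dominated by an NP
```

Nodes for which an expression such as left 'COMP' succeeds form a subclass of their element type, and specific meta-relations can then be checked for that subclass only.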

  40. Sample application • Caterpillar expressions: • caterpillarToComp: left 'Comp' • caterpillarToNP: up 'NP' • [Figure: annotation tree over "shucchou no keN" with the node labels HD, NP, COMP, NP.NO, NF, VN, PGen]

  41. Context specification 3: Subclassification • Caterpillar expressions: • caterpillarToComp: left 'Comp' • caterpillarToNP: up 'NP' • [Figure: annotation tree over "shucchou no keN" with the node labels HD, NP, COMP, NP.NO, NF, VN, PGen, and the callouts "Relation holds for all 'Comp' elements" and "Relation holds only for a subset"]

  42. Contents • Project description • Approaches to Multiple Annotations • multiple Levels • multiple Layers • Representation • Inferences • Context Specification • Unification

  43. Unification of annotation layers I • Two document layers can be merged • This process has also been implemented in Prolog • The predicate (semt) receives four arguments: • layer1 (to be unified) • layer2 (to be unified) • a list of elements which should be deleted in the process of unification • the file to which the result of the merger (again a collection of Prolog facts) is written • The new database contains a copy of all layers in the input database plus the result layer • If the unification results in a layer where the elements would not be properly nested, a second result layer (a difference list) is created

  44. Unification of annotation layers II • The result database is re-converted to XML using a Python program • If no difference list exists, the result of merging two layers can be linearised as an XML document straightforwardly • If the result fact base contains a difference list, two different linearisations can be generated: • the default processing uses milestone elements to mark the borders of incompatible elements • alternatively, the technique of fragmentation of elements can be invoked
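
The merging and the milestone linearisation can be sketched in Python under some simplifying assumptions. The sketch does not reproduce the semt predicate or the authors' Python program: layers are lists of non-empty (start, end, name) spans over the same text, the list of elements to delete is not modelled, and the milestone names (name_start, name_end) are invented for illustration.

```python
from xml.sax.saxutils import escape


def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def unify(layer1, layer2):
    """Merge two layers; return (merged, differences).

    'differences' collects the layer2 spans that would break proper nesting.
    """
    merged, differences = list(layer1), []
    for span in layer2:
        if any(crosses(span, kept) for kept in merged):
            differences.append(span)
        else:
            merged.append(span)
    return merged, differences


def linearise(text, merged, differences):
    """Serialise merged spans as elements and differences as milestones."""
    events = []
    for i, (start, end, name) in enumerate(merged):
        # Sort keys: closes before opens at one offset, outer opens first,
        # inner closes first.
        events.append(((start, 1, -end, i), f"<{name}>"))
        events.append(((end, 0, -start, -i), f"</{name}>"))
    for i, (start, end, name) in enumerate(differences):
        # Hypothetical milestone names marking the borders of the element.
        events.append(((start, 2, 0, i), f"<{name}_start/>"))
        events.append(((end, -1, 0, i), f"<{name}_end/>"))
    out, position = [], 0
    for (offset, *_), tag in sorted(events):
        out.append(escape(text[position:offset]))
        out.append(tag)
        position = offset
    out.append(escape(text[position:]))
    return "".join(out)


text = "the old tree"
merged, differences = unify([(0, 12, "s"), (0, 7, "np")], [(4, 12, "x")])
print(linearise(text, merged, differences))
# <s><np>the <x_start/>old</np> tree<x_end/></s>
```

In this example the span x crosses np, so it ends up in the difference list and is written as a pair of empty milestone elements; without a conflict, both layers are serialised as ordinary nested elements.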

  45. Architecture [Diagram labels: Prolog; Document-grammar (three times); XML-documents (input); via Python; Inference/Query; Unification of annotation levels; Rules; External information; Secondary level (next talk); Generation of XML from the fact base; via Python; XML-documents (output)]

  46. Contents • Project description • Approaches to Multiple Annotations • Representation • Inferences • Context Specification • Unification

  47. Relations between multiple annotations: Representation, Inferences, Context Specification, and Unification Andreas Witt Dieter Metzing Jens Pönninghaus Daniela Goecke www.text-technology.de
