Coherence-based strategies for text-to-hypertext-conversion

Angelika Storrer & Anke Holler, January 2004.


Presentation Transcript


  1. Coherence-based strategies for text-to-hypertext-conversion. Angelika Storrer & Anke Holler, January 2004. [Figure labels: term, definition]

  2. Overview The HyTex-Project: user scenario and approach Coherence-based text-to-hypertext-conversion: main types of strategies Focus: achieving cohesive closedness in hypertext nodes Annotation scheme for co-reference phenomena

  3. About the HyTex Project HyTex: »Hypertextualisierung auf textgrammatischer Grundlage« (Text-to-hypertext conversion on a text-grammatical basis) Tasks: • Segmentation: breaking down the document into nodes • Linking: connecting the nodes through intratextual, intertextual and extratextual hyperlinks • … on a text-grammatical basis: No simple 1:1-conversion; instead generation of hypertext-nodes through text-grammar-based annotations in the documents.

  4. Conversion Guidelines Guidelines: • Recoverability: generating hypertext views as additional layers while preserving the original document • coherence-based criteria for segmentation and linking

  5. User Scenario A user with some prior knowledge but no expert knowledge in a particular area (a semi-expert) must acquire knowledge from a pool of scientific/expert texts within a given time interval, e.g. within the framework of • interdisciplinary cooperation • scientific journalism • specialised lexicography. Our vision: make selective reading in this scenario more effective and convenient than is possible with print media.

  6. Form-based conversion strategies [Figure labels: reading path (author), reading path (user)]

  7. Problems of the form-based approach Problems on the Micro-Level: Solution: generating cohesive closedness based on text-grammatical annotation

  8. Problems of selective reading (Macro-level) [Figure labels: term occurrence, term definition, reading path (author), reading path (user)]

  9. Three-level architecture: User Model, Domain Knowledge Level, Document Level

  10. Operations for achieving cohesive closedness. Sequential text (sample passage, translated from German): "He further distinguishes, according to the number of anchors involved in a link, between 1:1 links, in which one source anchor is connected to exactly one target anchor; 1:n links, in which one source anchor is connected to several target anchors; and n:m links, in which several anchors are combined into a linking pattern independently of the traversal direction. The linking element of HTML provides only for 1:1 links; the specification above and the concept of the 'extended link' (in the sense of the XLink specification) also provide for links with several anchors." Strategies on the Micro-Level: cohesive closedness • anaphora resolution • linking • elision • expansion
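
The link types named in the sample passage (1:1, 1:n, n:m) can be illustrated with a small data structure. This is only a sketch: the `Link` class and the `link_arity` function are hypothetical names introduced here, and modelling a link as two tuples of anchor IDs is an assumption, since the passage describes n:m links as independent of traversal direction.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Link:
    """Hypothetical link model: anchor IDs on the source and target side."""
    sources: Tuple[str, ...]
    targets: Tuple[str, ...]


def link_arity(link: Link) -> str:
    """Classify a link by the number of anchors it involves."""
    if not link.sources or not link.targets:
        raise ValueError("a link needs at least one anchor on each side")
    if len(link.sources) == 1:
        # One source anchor: either exactly one target or several targets.
        return "1:1" if len(link.targets) == 1 else "1:n"
    # Several source anchors: treated here as the n:m pattern (assumption;
    # the passage names only three types, so n:1 is folded into n:m).
    return "n:m"
```

Under this model an HTML-style link always yields `"1:1"`, whereas an XLink-style extended link may fall into any of the three classes.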

  11. Strategies on the Micro-Level: cohesive closedness cohesively autonomous version combined with expansion of the visual field

  12. Prerequisites 1. Annotation of cohesive markers in the corpus documents: • co-reference and co-specification • connectives • text-deictic expressions 2. Rules for the automatic transformation of cohesive cues in order to achieve cohesive closedness

  13. Part II: Annotation of coreference phenomena • Objective: strict separation of the annotation of the coreference relation from the annotation of anaphoric relations. • Motivation: cross-document coreference (cf. Baldwin & Bagga 1998, Mitkov 2002)

  14. Why do existing annotation schemes need to be extended? • guidelines published by the Text Encoding Initiative (TEI) • task definition of the Message Understanding Conferences (MUC) • guidelines published by the project Multilevel Annotation Tools Engineering (MATE) All three formats are SGML- or XML-based.

  15. Text Encoding Initiative (TEI) Example 1: The show was not listed on <name id='nbc'>NBC</name>'s new schedule, although <seg id='network'>the network</seg> says it is still being considered. <linkGrp type='anaphoric link' targFunc='antecedent anaphor'> <link targType='name seg' targets='nbc network'/> </linkGrp> Problem: • No distinction between coreference and anaphora annotation.

  16. Message Understanding Conferences (MUC) Example 2: <COREF ID="9" TYPE="IDENT" REF="2" MIN="company">The New Orleans oil and gas exploration and diving operations company</COREF> added that <COREF ID="10" TYPE="IDENT" REF="9">it</COREF> doesn't expect any further adverse financial impact from the restructuring. (Hirschman/Chinchor 1997)

  17. Message Understanding Conferences (MUC) Problems (Van Deemter & Kibble 2001): • Elements of genuine coreference are mixed with elements of anaphora and predication. • Nonreferring expressions: (a) Whenever a solution emerged, we embraced it. • Bound anaphora: (b) Every TV network reported its profits. • Intensional contexts: (c) Henry Higgins, who was formerly sales director of Sudsy Soaps, became president of Dreamy Detergents.

  18. Multilevel Annotation Tools Engineering (MATE) Example 3: The show was not listed on <coref:de ID='de_01'>NBC</coref:de>'s new schedule, although <coref:de ID='de_02'>the network</coref:de> says it is still being considered. <coref:link type='ident' href='coref.xml#id(de_02)'> <coref:anchor href='coref.xml#id(de_01)'/> </coref:link> Problem: • Every coreference relation is marked as an anaphoric relation.

  19. Conclusion Result: • None of the presented markups is suitable to account for cross-document coreference phenomena. Alternative proposal: • An annotation scheme that encodes coreference as a relation between the document level and the domain knowledge level.

  20. Three-level Architecture: coreference annotation • User Model • Domain Knowledge Level (TermNet): terminological knowledge (concepts and technical terms); representation format: topic map • Document Level

  21. Annotation of coreference Markup: <corefLink deIDRef = Value tmIDRef = Value /> Example: Das von <discourseEntity deID="deID_1" deType="nom">Kuhlen 1991</discourseEntity> skizzierte Grundmodell eines <discourseEntity deID="deID_2" deType="nom">Hypertextsystems</discourseEntity> orientiert sich am Vorbild von Datenbankmanagementsystemen. (English: The basic model of a hypertext system outlined by Kuhlen 1991 is modelled on database management systems.) <semRel><corefLink deIDRef="deID_1" tmIDRef="unknown"/></semRel> <semRel><corefLink deIDRef="deID_2" tmIDRef="TermNet-inferiert.xtm#Hypertextsystem"/></semRel>
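
To make the relation between the document level and the domain knowledge level concrete, the annotated example can be read with Python's standard `xml.etree.ElementTree`. This is a minimal sketch, not the project's tooling: the wrapping `<doc>` root element and the function name `coref_table` are assumptions added so that the fragment parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# The slide's example, wrapped in a hypothetical <doc> root element.
SAMPLE = (
    '<doc>Das von <discourseEntity deID="deID_1" deType="nom">Kuhlen 1991'
    '</discourseEntity> skizzierte Grundmodell eines '
    '<discourseEntity deID="deID_2" deType="nom">Hypertextsystems'
    '</discourseEntity> orientiert sich am Vorbild von '
    'Datenbankmanagementsystemen.'
    '<semRel><corefLink deIDRef="deID_1" tmIDRef="unknown"/></semRel>'
    '<semRel><corefLink deIDRef="deID_2" '
    'tmIDRef="TermNet-inferiert.xtm#Hypertextsystem"/></semRel></doc>'
)


def coref_table(xml_text):
    """Return (deID, surface text, tmIDRef) for every corefLink, in document order."""
    root = ET.fromstring(xml_text)
    # Map each discourse entity ID to the text it spans.
    entities = {
        e.get("deID"): "".join(e.itertext()).strip()
        for e in root.iter("discourseEntity")
    }
    return [
        (link.get("deIDRef"), entities.get(link.get("deIDRef")), link.get("tmIDRef"))
        for link in root.iter("corefLink")
    ]
```

Each row pairs an expression in the text with a topic in the topic map, or with `unknown` where no topic is available.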

  22. Annotation of cospecification Markup: <cospecLink relType = Value phorIDRef = Value antecedentIDRefs = Value /> Example: <discourseEntity deID="deID_3" deType="nom">Ein Link</discourseEntity> ist im Text meist farbig markiert. <discourseEntity deID="deID_4" deType="nom">Er</discourseEntity> ist dadurch gut sichtbar. (English: A link is usually highlighted in colour in the text. It is therefore easy to see.) <cospecLink relType="substitution" phorIDRef="deID_4" antecedentIDRefs="deID_3"/>
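
A cospecification link of this kind is the sort of annotation an anaphora-resolution rule could use to make a node cohesively closed. The sketch below replaces the anaphor with the text of its first antecedent. It is a deliberately naive illustration under stated assumptions: the `<node>` wrapper and the function `resolve_anaphors` are hypothetical, and no grammatical adjustment (e.g. of definiteness or case) is attempted.

```python
import xml.etree.ElementTree as ET

# The slide's example, wrapped in a hypothetical <node> root element.
SAMPLE = (
    '<node>'
    '<discourseEntity deID="deID_3" deType="nom">Ein Link</discourseEntity>'
    ' ist im Text meist farbig markiert. '
    '<discourseEntity deID="deID_4" deType="nom">Er</discourseEntity>'
    ' ist dadurch gut sichtbar.'
    '<cospecLink relType="substitution" phorIDRef="deID_4" '
    'antecedentIDRefs="deID_3"/>'
    '</node>'
)


def resolve_anaphors(xml_text):
    """Return plain text in which each anaphor is replaced by its first antecedent."""
    root = ET.fromstring(xml_text)
    entities = {e.get("deID"): e for e in root.iter("discourseEntity")}
    # Record the original surface strings before anything is rewritten.
    surface = {k: "".join(e.itertext()).strip() for k, e in entities.items()}
    for link in list(root.iter("cospecLink")):
        phor = entities.get(link.get("phorIDRef"))
        antecedents = (link.get("antecedentIDRefs") or "").split()
        if phor is None or not antecedents or antecedents[0] not in surface:
            continue  # leave unresolved links untouched
        for child in list(phor):
            phor.remove(child)
        phor.text = surface[antecedents[0]]
    return "".join(root.itertext())
```

Applied to `SAMPLE`, the pronoun "Er" is replaced by "Ein Link", so the second sentence can be read without the first.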

  23. Cross-document phenomena (1) DOC 1: Das von Kuhlen 1991 skizzierte Grundmodell eines <discourseEntity deID="deID_1" deType="nom">Hypertextsystems</discourseEntity> orientiert sich am Vorbild von Datenbankmanagementsystemen. (English: The basic model of a hypertext system outlined by Kuhlen 1991 is modelled on database management systems.) <semRel><corefLink deIDRef="deID_1" tmIDRef="TermNet-inferiert.xtm#Hypertextsystem"/></semRel> (2) DOC 2: Die Verwaltung dieser Annotationen galt lange als ein wichtiges Desiderat von <discourseEntity deID="deID_2" deType="nom">Hypertextsystemen</discourseEntity>. (English: The management of these annotations was long regarded as an important desideratum of hypertext systems.) <semRel><corefLink deIDRef="deID_2" tmIDRef="TermNet-inferiert.xtm#Hypertextsystem"/></semRel>
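
Because both expressions point to the same topic, cross-document coreference can be recovered by grouping coreference links on their `tmIDRef` value. The following sketch assumes each document is available as a well-formed XML string; the `<doc>` wrapper, the shortened sample texts and the function `cross_document_chains` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TOPIC = "TermNet-inferiert.xtm#Hypertextsystem"

# Shortened versions of the two slide examples, each in a hypothetical <doc> root.
DOCS = {
    "DOC 1": (
        '<doc>Grundmodell eines <discourseEntity deID="deID_1" deType="nom">'
        'Hypertextsystems</discourseEntity>'
        '<semRel><corefLink deIDRef="deID_1" tmIDRef="' + TOPIC + '"/></semRel></doc>'
    ),
    "DOC 2": (
        '<doc>Desiderat von <discourseEntity deID="deID_2" deType="nom">'
        'Hypertextsystemen</discourseEntity>.'
        '<semRel><corefLink deIDRef="deID_2" tmIDRef="' + TOPIC + '"/></semRel></doc>'
    ),
}


def cross_document_chains(docs):
    """Group (document name, deID) pairs by the topic they are linked to."""
    chains = defaultdict(list)
    for name, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for link in root.iter("corefLink"):
            topic = link.get("tmIDRef")
            # Links without a known topic cannot be related across documents.
            if topic and topic != "unknown":
                chains[topic].append((name, link.get("deIDRef")))
    return dict(chains)
```

Expressions that share a surface form but denote different concepts end up in different groups as long as they are linked to different topics.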

  24. Cross-document phenomena (3) DOC 1 (Annotation = gloss): Unter <discourseEntity deID="deID_3" deType="nom">"Annotationen"</discourseEntity> werden in der Hypertextliteratur Anmerkungen und Notizen verstanden, die ein Hypertextnutzer während des Rezeptionsvorgangs zu den Inhalten eines Moduls anbringt. (English: In the hypertext literature, "annotations" are understood as comments and notes that a hypertext user attaches to the contents of a module during reception.) <semRel><corefLink deIDRef="deID_3" tmIDRef="TermNet-inferiert.xtm#Annotation1"/></semRel> (4) DOC 2 (Annotation = markup): In der SGML/XML-Terminologie wird der Ausdruck <discourseEntity deID="deID_4" deType="nom">"Annotation"</discourseEntity> allerdings meist in einem anderen Sinne verwendet, nämlich als Bezeichnung für die Auszeichnung von Dokumenten mittels Markup. (English: In SGML/XML terminology, however, the expression "annotation" is mostly used in a different sense, namely to designate the marking up of documents by means of markup.) <semRel><corefLink deIDRef="deID_4" tmIDRef="TermNet-inferiert.xtm#Annotation2"/></semRel>

  25. Summary • We have discussed the general approach to text-to-hypertext conversion pursued in our HyTex project. • We have presented strategies for a coherence-based conversion of sequential text into hypertext and for achieving cohesive closedness of hypertext nodes. • We have argued for a coreference annotation that relates expressions in the text to a WordNet-like model representing the terminological knowledge of the domain under investigation.
