Ted Sanders, Utrecht institute of Linguistics, Universiteit Utrecht

DiscAn: Towards a Discourse Annotation system for Dutch language corpora, or why and how we would want to annotate corpora on the discourse level. Ted Sanders, Utrecht institute of Linguistics, Universiteit Utrecht. Coherence in discourse.


Presentation Transcript



DiscAn: Towards a Discourse Annotation system for Dutch language corpora, or why and how we would want to annotate corpora on the discourse level

Ted Sanders

Utrecht institute of Linguistics

Universiteit Utrecht



Coherence in discourse

Many tourists come to Switzerland. They want to see the mountains.

Referential coherence

Many tourists come to Switzerland because they want to see the mountains.

Relational coherence

John was happy. It was a Saturday.

We do not need explicit linguistic indicators



Coherence in discourse, 2

Coherence is a cognitive phenomenon

Coherence relations are conceptual relations that constitute coherence between discourse segments (minimally clauses)

Connectives, Cue Phrases and other lexical markers can but need not make this coherence explicit.

Coherence relations are the building blocks of discourse structure (causal, contrastive, additive)



In annotated corpora?

The discourse level is largely lacking in annotated Dutch corpora

There is an international tendency towards discourse annotation:

  • The Penn Discourse Treebank (Prasad, Joshi, Webber et al.)

  • The Potsdam Corpus (Stede et al.)

At the same time, we do have a lot of data on Dutch:

  • on connectives

  • Mainly causal

  • Across media (various written genres, spoken, chat)

  • At various stages of annotation



Larger research issues in the field

  • To be answered on the basis of annotated corpora

  • The meaning and use of connectives varies across languages: omdat vs. parce que vs. weil

  • Semantic-pragmatic restrictions on use

  • Similarities and differences in acquisition

  • We will start discourse annotation with a study on the category of causals



Annotation

Some criteria:

Order: cause – consequence and vice versa

Subjectivity: want, puisque, since, denn vs. omdat, parce que, because, weil

Linguistic marking: yes/no, perspective etc.

Characteristics of the segments: propositional attitude, modality, tense, syntax…



Current situation: 15 studies...

Corpus conn fragmnr s1 s2 modality s1 modality s2 protag s1 s2 relation

7 omdat 2502 176 176 11 irrelevant want feit 6 1 1 1 Irrelevant want feit Irrelevant want feit 1

7 omdat 2502b 177 177 21 Spreker/auteur 6211 Expliciet aanwezig Irrelevant want feit 1

7 omdat 2509 707 707 11 irrelevant want feit 6111 Irrelevant want feit Irrelevant want feit 1

7 omdat 2539 3320 3320 11 irrelevant want feit 6111 Irrelevant want feit Irrelevant want feit 1

7 omdat 2546 3810 3810 12 irrelevant want feit 33231 Irrelevant want feit Impliciet 19

7 omdat 2551 4357 4357 12 irrelevant want feit 31211 Irrelevant want feit Expliciet aanwezig 1

7 omdat 2525 2547 2547 31 Spreker/auteur 6211 Expliciet aanwezig Irrelevant want feit 1
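
As an illustration of how such rows might be standardized, the sketch below splits the leading row identifier into its three parts. The layout (corpus number, connective, fragment number with an optional letter suffix) is an assumption read off the header fields Corpus, conn and fragmnr; the function name `split_row_id` is hypothetical and not part of the DiscAn project.

```python
import re

# Assumed layout: corpus number, connective (letters only),
# fragment number with an optional single-letter suffix (e.g. "2502b").
ROW_ID = re.compile(r"(\d+)\s*([A-Za-z]+)\s*(\d+[a-z]?)")


def split_row_id(text: str) -> tuple[str, str, str]:
    """Split a row identifier into (corpus, connective, fragment)."""
    match = ROW_ID.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized row identifier: {text!r}")
    return match.group(1), match.group(2), match.group(3)


# Works on both the fused and the space-separated form.
corpus, connective, fragment = split_row_id("7omdat2502b")
```

The remaining columns are left alone here, because the header does not say how the runs of digits map onto variables.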



The DiscAn project has five main goals:

  • standardize and open up an existing set of Dutch corpus analyses of coherence relations and discourse connectives;

  • develop the foundations for a discourse annotation system;

  • improve the metadata by investigating existing CMDI profiles or adding new profiles suited for this type of analysis;

  • inventory the required categories and investigate to what extent these could be included in ISOcat categories for discourse;

  • build an interdisciplinary discourse community of text, corpus and computational linguists to initiate further research in a European context.



A model of analysis

  • Var 1 Name of the coder (values: the names of the two authors)

  • Var 2 Number of the fragment (the values were present in the fragments)

  • Var 3 Utterance number(s) of the segment preceding want (S1)

  • Var 4 Utterance number(s) of the segment following want (S2)

  • Var 5 Propositional attitude of S1 (values: action, fact, opinion, observation, knowledge, experience)

  • Var 6 Propositional attitude of S2 (values: action, fact, opinion, observation, knowledge, experience)

  • Var 7 Identity of the conceptualizer in S1 (values: speaker/1st person, second person, third person (nominal or pronominal), generic person)

  • Var 8 Identity of the conceptualizer in S2 (values: speaker/1st person, second person, third person (nominal or pronominal), generic person)

  • Var 9 Type of relation expressed by want (values: non-volitional content, volitional content, explanation of a mental state, epistemic, textual, speech act)

  • Var 10 Syntactic modification of want (values: no modification, coordinating conjunction, intensifier, focus element)
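
The ten variables above can be read as a record type with closed value sets. The following is a minimal sketch of that reading, not the project's actual data format: the class and field names are hypothetical, treating "generic person" as a fourth conceptualizer value is one interpretation of the slide, and the example record uses invented values.

```python
from dataclasses import dataclass
from enum import Enum


class Attitude(Enum):  # Var 5 and Var 6
    ACTION = "action"
    FACT = "fact"
    OPINION = "opinion"
    OBSERVATION = "observation"
    KNOWLEDGE = "knowledge"
    EXPERIENCE = "experience"


class Conceptualizer(Enum):  # Var 7 and Var 8 (four-way split is an assumption)
    SPEAKER = "speaker/1st person"
    SECOND_PERSON = "second person"
    THIRD_PERSON = "third person"
    GENERIC_PERSON = "generic person"


class Relation(Enum):  # Var 9
    NON_VOLITIONAL_CONTENT = "non-volitional content"
    VOLITIONAL_CONTENT = "volitional content"
    MENTAL_STATE = "explanation of a mental state"
    EPISTEMIC = "epistemic"
    TEXTUAL = "textual"
    SPEECH_ACT = "speech act"


class Modification(Enum):  # Var 10
    NONE = "no modification"
    COORDINATING_CONJUNCTION = "coordinating conjunction"
    INTENSIFIER = "intensifier"
    FOCUS_ELEMENT = "focus element"


@dataclass(frozen=True)
class WantAnnotation:
    coder: str                      # Var 1
    fragment: str                   # Var 2
    s1_utterances: tuple[int, ...]  # Var 3: segment preceding "want"
    s2_utterances: tuple[int, ...]  # Var 4: segment following "want"
    s1_attitude: Attitude           # Var 5
    s2_attitude: Attitude           # Var 6
    s1_conceptualizer: Conceptualizer  # Var 7
    s2_conceptualizer: Conceptualizer  # Var 8
    relation: Relation              # Var 9
    modification: Modification      # Var 10


# Invented example record; looking a value up by its label
# (e.g. Attitude("opinion")) raises ValueError for unknown labels.
example = WantAnnotation(
    coder="coder A",
    fragment="0001",
    s1_utterances=(12,),
    s2_utterances=(13, 14),
    s1_attitude=Attitude("opinion"),
    s2_attitude=Attitude("fact"),
    s1_conceptualizer=Conceptualizer("speaker/1st person"),
    s2_conceptualizer=Conceptualizer("third person"),
    relation=Relation("epistemic"),
    modification=Modification("no modification"),
)
```

Using enumerations rather than free text means a misspelled or unlisted value is rejected when the record is built, which is one way to keep codings from different studies comparable.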

