LIN 3098 Corpus Linguistics – Lecture 4


Albert Gatt

In this lecture
  • Levels of annotation
  • Corpus typology
    • classification based on type and levels of annotation
    • multilingual corpora

LIN 3098 -- Corpus Linguistics


Part 1

Levels of corpus annotation (cont’d)

Levels of linguistic annotation
  • part-of-speech (word-level)
  • lemmatisation (word-level)
  • parsing (phrase & sentence-level)
  • semantics (multi-level)
    • semantic relationships between words and phrases
    • semantic features of words
  • discourse features (supra-sentence level)
  • phonetic transcription
  • prosody


Lemmatisation
  • Groups morphological variants of a word under the head word:
    • mexa’ (walk)
      • imxejt (I walked)
      • imxejna (we walked)
      • nimxu (we walk)
      • ...
    • Together, these form a lemma.
  • Increasingly common these days.

Lemmatisation example: the SUSANNE corpus
  • Format: word + tag + lemma

A05:0030.33 - VVDv said say

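  • Every word in the corpus is on a separate line.
  • In the example, A05:0030.33 is the reference (corpus file:sentence.word) and VVDv is the POS tag, followed by the word (said) and its lemma (say).
  • Extremely useful for lexicography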
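As a minimal sketch of how such a line could be read programmatically, the snippet below splits the example line into named fields. It assumes the fields are whitespace-separated and appear in the order shown in the example; the slide does not explain the second column ("-"), so the field name `status` is an assumption, as are the names `SusanneToken` and `parse_line`.

```python
from typing import NamedTuple


class SusanneToken(NamedTuple):
    reference: str  # corpus file:sentence.word, e.g. "A05:0030.33"
    status: str     # second column; its meaning is not given on the slide (assumed name)
    tag: str        # POS tag, e.g. "VVDv"
    word: str       # word form as it appears in the text
    lemma: str      # head word under which the form is grouped


def parse_line(line: str) -> SusanneToken:
    """Split one corpus line into its first five whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}: {line!r}")
    return SusanneToken(*fields[:5])


token = parse_line("A05:0030.33 - VVDv said say")
print(token.word, "->", token.lemma)
```

With one token per line, a lexicographer could collect every word form attested for a given lemma by grouping parsed tokens on the `lemma` field.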






Automatic morphological analysis
  • For some languages, there are reasonably good lemmatisers/morphological analysers.
  • Examples for English:
    • morpha: built at the University of Sussex
    • EngTwol: commercial, by Lingsoft.

EngTwol output
  • undeniable:
    • "undeniable" <DER:ble> A ABS
      • (derived with –ble suffix)
      • adjective (A)
      • absolute (ABS) form
  • This is a rule-based analyser. There are others which use corpus-derived statistical patterns.

Semantic annotation I: Two types
  • markup of semantic relations (e.g. predicate-argument structure)
    • currently used in parsed corpora, to mark up function-argument structures etc.
  • markup of features of word meaning (mainly, word senses)
    • has its origins in content analysis, which aims to establish how prominent particular concepts are in a text
    • Now used in a lot of work on word sense disambiguation

Example of type 1 semantic markup (Penn Treebank)

(S (NP-SBJ-1 Chris)
   (VP wants
       (S (NP-SBJ *-1)
          (VP to
              (VP throw
                  (NP the ball))))))

  • The empty embedded subject (*-1) is linked to NP subject no. 1.
  • Predicate Argument Structure:

wants(Chris, throw(Chris, ball))
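The link between the empty subject and its antecedent can be followed mechanically. The sketch below is an illustration only, not a Penn Treebank tool: it assumes the usual bracketed notation with hyphenated labels (NP-SBJ-1) and a trace written as *-1, and the function names are hypothetical. It parses the tree into nested lists and substitutes the antecedent's words for the trace.

```python
import re


def parse(text):
    """Parse a bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return [label] + children

    return node()


def leaves(tree):
    """Return the terminal symbols of a tree, left to right."""
    words = []
    for child in tree[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words


def antecedents(tree, found=None):
    """Map each numeric index to the words of the overt constituent that carries it."""
    if found is None:
        found = {}
    match = re.search(r"-(\d+)$", tree[0])
    words = leaves(tree)
    if match and words and not words[0].startswith("*"):
        found[match.group(1)] = words
    for child in tree[1:]:
        if isinstance(child, list):
            antecedents(child, found)
    return found


def resolve(tree):
    """Replace every trace *-N with the words of the constituent indexed N."""
    index = antecedents(tree)
    resolved = []
    for word in leaves(tree):
        match = re.fullmatch(r"\*-(\d+)", word)
        resolved.extend(index.get(match.group(1), [word]) if match else [word])
    return resolved


tree = parse("(S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) (VP to (VP throw (NP the ball))))))")
print(" ".join(resolve(tree)))
```

Resolving the trace makes explicit that Chris is the subject of both "wants" and "throw", which is the information the predicate-argument structure encodes.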

Semantic markup type 2: lexical features
  • Most common type:
    • word-sense tagged corpora
  • Main idea:
    • disambiguate a word in context by tagging its sense
  • Often uses WordNet (Miller et al 1993)
    • WordNet is a lexical taxonomy which represents lexical relations among a large number of words.
      • including hyponymy (IS-A) relations etc
      • For each entry, all the (supposed) senses of the word are given.
    • Main use: identify senses of words in context, mark them up with a pointer to a WordNet sense.

WordNet senses: Move (noun)

(377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")

(70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")

(57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")

(30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")

(5) move -- ((game) a player's turn to take some action permitted by the rules of the game)


WordNet senses: Move (verb)

(130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")

(60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")

(52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")

(20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")
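To make the idea of tagging a word with a sense pointer concrete, here is a toy sketch that picks a sense by counting the words a context shares with each gloss. This is a simplified Lesk-style overlap heuristic, which the lecture does not cover; it is used here only as an illustration. The glosses are copied from three of the noun senses of "move" listed above, while the sense labels (e.g. `move/n/game`), the stopword list and the function names are hypothetical.

```python
import re

# Glosses copied from the noun senses of "move" listed above.
# The sense labels are invented for this sketch; they are not WordNet identifiers.
SENSES = {
    "move/n/decision": "the act of deciding to do something",
    "move/n/relocation": "the act of changing your residence or place of business",
    "move/n/game": "a player's turn to take some action permitted by the rules of the game",
}

# A small ad hoc stopword list (an assumption, chosen for this example only).
STOPWORDS = {"a", "an", "the", "of", "to", "or", "by", "and", "in",
             "your", "some", "do", "was", "her", "it", "we"}


def content_words(text):
    """Lower-case the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS


def tag_sense(context, senses=SENSES):
    """Return the sense label whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    return max(senses, key=lambda label: len(context_words & content_words(senses[label])))


print(tag_sense("It was her turn, and her next move broke the rules of the game"))
print(tag_sense("After the move we needed a new place of business"))
```

A sense-tagged corpus stores the outcome of this kind of decision (usually made or checked by human annotators) alongside each word, as a pointer to the chosen sense.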

Check it out!
  • WordNet is freely available for download.

Word sense annotation: other uses
  • tagging words with their semantic field (Wilson 1996)
    • plant life
    • men’s clothing
  • tagging words with their “emotional” content (Campbell & Pennebaker 2002) based on a dictionary:
    • social processes
    • negative emotions
  • This approach underlies Pennebaker’s Linguistic Inquiry and Word Count (LIWC) system, which:
    • analyses a text and comes up with a profile of its personal/emotional content
    • relates this to some features of its author (gender, age…)
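The dictionary-based approach can be sketched in a few lines. The word list below is a hypothetical toy dictionary, not LIWC's actual dictionary, and the function name `profile` is an assumption; only the two category names come from the slide. The sketch reports the proportion of a text's words that fall into each category.

```python
import re
from collections import Counter

# Hypothetical toy dictionary mapping words to categories (not LIWC's real entries).
CATEGORY_DICTIONARY = {
    "friend": "social processes",
    "talk": "social processes",
    "family": "social processes",
    "sad": "negative emotions",
    "hate": "negative emotions",
    "afraid": "negative emotions",
}


def profile(text):
    """Return, for each category, the proportion of words in the text that belong to it."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return {}
    counts = Counter(CATEGORY_DICTIONARY[w] for w in words if w in CATEGORY_DICTIONARY)
    return {category: n / len(words) for category, n in counts.items()}


print(profile("I hate to talk when I am sad"))
```

Relating such profiles to features of the author would require a further step (for example, comparing profiles across groups of authors), which this sketch does not attempt.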

Discourse annotation
  • Most common:
    • text-level units such as paragraphs
  • Less common:
    • anaphoric NPs and reference (cf. example from lecture 3)
  • Even less common:
    • annotation of words which function as discourse cues (Stenstrom 1984):
      • apology (sorry), hedges (sort of), etc
    • annotation of rhetorical structure

Discourse: Annotating rhetorical structure (I)
  • Rhetorical Structure Theory (Mann and Thompson 1988):
    • views text as made up of “discourse units”
    • units stand in various rhetorical relations, which reflect their role in constructing an argument, a narrative, etc
      • [Although Mr. Freeman is retiring,] [he will continue to work as a consultant for American Express on a project basis].
      • Second unit is the main one (nucleus)
      • First unit (satellite) “concedes” that what the main unit is saying is contradicted by another fact.
  • A recent corpus (Marcu et al 2003) is annotated with this information.
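One way to picture such an annotation is as a small record that names the relation and identifies which unit is the nucleus and which the satellite. The class below is a minimal sketch under that assumption; the class and field names are hypothetical and do not reproduce the format of any annotated corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RhetoricalRelation:
    name: str                     # relation label, e.g. "concession"
    nucleus: str                  # the main discourse unit
    satellite: str                # the supporting discourse unit
    satellite_first: bool = True  # surface order of the two units in the text

    def bracketed(self) -> str:
        """Render the two units in text order, each in square brackets."""
        units = [self.satellite, self.nucleus]
        if not self.satellite_first:
            units.reverse()
        return " ".join(f"[{unit}]" for unit in units)


example = RhetoricalRelation(
    name="concession",
    nucleus="he will continue to work as a consultant for American Express on a project basis",
    satellite="Although Mr. Freeman is retiring,",
)
print(example.bracketed())
```

Keeping the nucleus and satellite as separate fields means the surface order can vary without changing which unit is treated as the main one.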

Phonetic transcription
  • Not many phonetically transcribed corpora.
    • MARSEC corpus is one of the best known. This is a version of the Lancaster/IBM Spoken English Corpus.
    • There are, however, several databases of transcribed speech, mostly used for statistical speech technology applications (e.g. text-to-speech synthesis).

Annotating suprasegmentals
  • Aims: capture suprasegmental features such as stress, intonation and pauses in speech.
  • Some transcription systems exist:
    • ToBI (American)
    • Tonic Stress Marker (TSM; British)
  • These define ways of annotating suprasegmentals such as the start/end of a tone group, simultaneous speech, rise-fall tone, falling tone, etc.

Problem-oriented tagging
  • If you’re interested in a particular problem, and no corpus exists, build your own!
  • Many corpora define problem-specific annotation schemes.

Example: the TUNA Corpus
  • Problem: How do people refer to objects using definite NPs?
    • Main interest: visual properties (colour, size etc)
    • Focus: semantics of definite NPs, i.e. what people choose to include in their description.
  • Method:
    • experiment to get people to describe objects, distinguishing them from other objects in the same visual “scene”
    • annotation of descriptions based on semantics

TUNA Corpus: description


Red sofa, bigger version.

<ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
<ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>

  • Features of the corpus:
    • represents the “target” referent
    • also represents the “distractors” (from which the target must be distinguished)
    • semantically transparent: annotation goes beyond language

Part 2

Multilingual corpora

Why multilingual corpora?
  • comparative studies
    • syntax
    • morphology
  • the cornerstone of most research in machine translation nowadays
    • most MT systems are statistical, trained on large repositories of parallel (e.g. English-Chinese) text.

Parallel corpora
  • Represent a text in its original language (L1), together with a translation in another language (L2)
    • long history: Medieval polyglot bibles were among the first “parallel” corpora
  • Alignment:
    • Many parallel corpora align L1 and L2 at sentence level, sometimes also at word level…
    • Sentence-level alignment can be achieved automatically with very high accuracy!
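The lecture does not say which alignment method is meant. As an illustration only, the sketch below aligns sentences by character length using dynamic programming, in the spirit of length-based aligners. The cost function (absolute length difference plus a fixed penalty per alignment type) and the penalty values are arbitrary assumptions made for this example, and `align` is a hypothetical name.

```python
# Allowed alignment types (source sentences, target sentences) and their penalties.
# The penalty values are arbitrary choices for this sketch.
BEADS = {(1, 1): 0, (1, 0): 30, (0, 1): 30, (2, 1): 10, (1, 2): 10}


def align(src, tgt):
    """Align two lists of sentences; return a list of (src_indices, tgt_indices) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(src_len - tgt_len) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the cheapest path.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


l1 = ["Sophie Amundsen was on her way home from school."]
l2 = ["Sofie Amundsen war auf dem Heimweg von der Schule."]
print(align(l1, l2))
```

The intuition is that a long sentence tends to be translated by a long sentence (or by two shorter ones), so length alone is often a strong cue; real aligners typically use a probabilistic cost and may add lexical information.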

Example: SMULTRON corpus
  • Developed and released in 2007-8
  • Relatively small
  • Aligned texts in English, Swedish and German
    • E.g. Sophie’s World is one of the texts
  • Annotated with syntax, POS, morphology
  • Comes with a tool to view parallel syntactic trees.

SMULTRON example: English (Sophie’s World)

<s id="s3">
<t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
<t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>
<t id="s3_3" word="was" pos="VBD" morph="--"/>
<t id="s3_4" word="on" pos="IN" morph="--"/>
<t id="s3_5" word="her" pos="PRP$" morph="--"/>
<t id="s3_6" word="way" pos="NN" morph="--"/>
<t id="s3_7" word="home" pos="RB" morph="--"/>
<t id="s3_8" word="from" pos="IN" morph="--"/>
<t id="s3_9" word="school" pos="NN" morph="--"/>
<t id="s3_10" word="." pos="." morph="--"/>



This shows terminal nodes only. The corpus also represents syntactic non-terminals (NP, VP, etc.).

SMULTRON: Same sentence in German

<s id="3">
<t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie" />
<t id="s3_2" word="Amundsen" pos="NE" morph="--" lemma="Amundsen" />
<t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
<t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf" />
<t id="s3_5" word="dem" pos="ART" morph="--" lemma="der" />
<t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg" />
<t id="s3_7" word="von" pos="APPR" morph="--" lemma="von" />
<t id="s3_8" word="der" pos="ART" morph="--" lemma="die" />
<t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e" />
<t id="s3_10" word="." pos="$." morph="--" lemma="--" />



Note: richer morphology, representation of lemmas, …
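The difference in annotation between the two languages can be checked directly from the markup. The sketch below parses a few of the terminal elements from each excerpt with the standard `xml.etree.ElementTree` module and compares which attributes each language carries. The closing `</s>` tags are an assumption added so that each fragment parses on its own; the excerpts do not show how the sentence element is closed.

```python
import xml.etree.ElementTree as ET

# A few terminals from each excerpt; the closing </s> tags are assumed.
ENGLISH = """<s id="s3">
<t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
<t id="s3_9" word="school" pos="NN" morph="--"/>
</s>"""

GERMAN = """<s id="3">
<t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie" />
<t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e" />
</s>"""


def terminals(xml_text):
    """Return the attributes of every terminal (<t>) element as a list of dicts."""
    return [dict(t.attrib) for t in ET.fromstring(xml_text).iter("t")]


def attribute_names(xml_text):
    """Return the set of attribute names used on the terminals."""
    names = set()
    for terminal in terminals(xml_text):
        names.update(terminal)
    return names


print(attribute_names(GERMAN) - attribute_names(ENGLISH))
print([(t["word"], t["morph"]) for t in terminals(GERMAN)])
```

In these excerpts the German terminals carry a `lemma` attribute and non-empty `morph` values, while the English terminals have neither.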

Translation corpora
  • Not parallel.
  • Have different texts in two or more different languages, of the same genre.
  • Examples:
    • PAROLE corpus is a translation corpus for EU languages

Why translation corpora?
  • Parallel corpora, by definition, contain translated text (L2)
    • can give rise to errors
    • artificiality and translation quality can be an issue
      • e.g. McEnery & Wilson report a study on an English-Polish corpus. The Polish text reads “like a translation”
      • Problem can be overcome if the texts used are professionally translated.
  • Translation corpora have texts in two or more languages, “in the original”.
    • Data is more natural.


  • We have now concluded our initial incursion into:
    • corpus construction
    • corpus annotation
    • corpus typology
  • Next up:
    • using corpora for linguistic research
