
LIN 3098 Corpus Linguistics Lecture 4

LIN 3098 Corpus Linguistics – Lecture 4. Albert Gatt. In this lecture. Levels of annotation Corpus typology classification based on type and levels of annotation multilingual corpora. Part 1. Levels of corpus annotation (cont/d). Levels of linguistic annotation.

LIN 3098 Corpus Linguistics – Lecture 4

Albert Gatt

In this lecture

  • Levels of annotation

  • Corpus typology

    • classification based on type and levels of annotation

    • multilingual corpora

LIN 3098 -- Corpus Linguistics

Part 1

Levels of corpus annotation (cont'd)

Levels of linguistic annotation

  • part-of-speech (word-level)

  • lemmatisation (word-level)

  • parsing (phrase & sentence-level)

  • semantics (multi-level)

    • semantic relationships between words and phrases

    • semantic features of words

  • discourse features (supra-sentence level)

  • phonetic transcription

  • prosody


  • Groups morphological variants of a word under the head word:

    • mexa’ (walk)

      • imxejt (I walked)

      • imxejna (we walked)

      • nimxu (we walk)

      • ...

  • Increasingly common these days.

Together, these form a lemma.
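
The grouping described above can be sketched in a few lines of Python. This is only an illustration: the lookup table `FORM_TO_LEMMA` and the function `group_by_lemma` are hypothetical names, and the table contains nothing beyond the forms listed on the slide. A real lemmatiser would rely on morphological rules or a full lexicon.

```python
from collections import defaultdict

# Hypothetical lookup table (inflected form -> head word), using only the
# forms given on the slide. The head word is written here without the
# trailing apostrophe.
FORM_TO_LEMMA = {
    "mexa": "mexa",
    "imxejt": "mexa",
    "imxejna": "mexa",
    "nimxu": "mexa",
}

def group_by_lemma(tokens):
    """Group tokens under their lemma; unknown forms are their own lemma."""
    groups = defaultdict(list)
    for token in tokens:
        lemma = FORM_TO_LEMMA.get(token.lower(), token.lower())
        groups[lemma].append(token)
    return dict(groups)

groups = group_by_lemma(["imxejt", "nimxu", "imxejna"])
print(groups)
```

Counting the entries under each lemma gives lemma frequencies rather than word-form frequencies, which is what makes lemmatised corpora convenient for lexicography.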

Lemmatisation example: the SUSANNE corpus

  • Format: word + tag + lemma

    A05:0030.33 - VVDv said say

  • Every word in the corpus is on a separate line.

  • Extremely useful for lexicography

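In the example line, A05:0030.33 is the location (corpus file:sentence.word), VVDv is the POS tag, "said" is the word and "say" is its lemma.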
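
Because every word sits on its own line, a line like the example can be split into fields very simply. The sketch below assumes whitespace-separated fields in the order shown in the example. The field names, including `status` for the "-" column, are labels chosen here for illustration and not official SUSANNE terminology.

```python
from typing import NamedTuple

class SusanneToken(NamedTuple):
    reference: str  # location in the corpus, e.g. "A05:0030.33"
    status: str     # the "-" column in the example (name assumed)
    tag: str        # POS tag, e.g. "VVDv"
    word: str       # word form as it appears in the text
    lemma: str      # head word under which the form is grouped

def parse_line(line: str) -> SusanneToken:
    """Split one corpus line into its five whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return SusanneToken(*fields)

token = parse_line("A05:0030.33 - VVDv said say")
print(token.word, token.tag, token.lemma)
```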






Automatic morphological analysis

  • For some languages, there are reasonably good lemmatisers/morphological analysers.

  • Examples for English:

    • morpha: built at the University of Sussex

    • EngTwol: commercial, by LingSoft.

Engtwol output

  • undeniable:

    • "undeniable" <DER:ble> A ABS

      • (derived with –ble suffix)

      • adjective (A)

      • absolute (ABS) form

  • This is a rule-based analyser. There are others which use corpus-derived statistical patterns.

Semantic annotation I: Two types

  • markup of semantic relations (e.g. predicate-argument structure)

    • currently used in parsed corpora, to mark up function-argument structures etc.

  • markup of features of word meaning (mainly, word senses)

    • has origins in content analysis to arrive at conclusions about how prominent particular concepts are

    • Now used in a lot of work on word sense disambiguation

Example of type 1 semantic markup (Penn Treebank)

(S (NP-SBJ-1 Chris)
   (VP wants
       (S (NP-SBJ *-1)
          (VP to
              (VP throw
                  (NP the ball))))))

  • Predicate Argument Structure:

    wants(Chris, throw(Chris, ball))

  • The empty embedded subject (*-1) is linked to NP subject no. 1 (Chris).
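
To make the co-indexing concrete, here is a minimal sketch that reads the bracketed tree and fills in the empty subject. It assumes Penn Treebank-style hyphenated labels (`NP-SBJ-1` for the indexed subject, `*-1` for the trace). The function names are illustrative and do not come from any treebank toolkit.

```python
import re

TREE = ("(S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) "
        "(VP to (VP throw (NP the ball))))))")

def parse(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        tree = [tokens[pos]]          # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                tree.append(node())
            else:
                tree.append(tokens[pos])  # a word or a trace
                pos += 1
        pos += 1                      # skip ")"
        return tree

    return node()

def indexed_phrases(tree, found=None):
    """Map each index (e.g. "1") to the words of the phrase carrying it."""
    found = {} if found is None else found
    match = re.fullmatch(r".+-(\d+)", tree[0])
    if match:
        found[match.group(1)] = [c for c in tree[1:] if isinstance(c, str)]
    for child in tree[1:]:
        if isinstance(child, list):
            indexed_phrases(child, found)
    return found

def resolve_traces(tree, antecedents):
    """Replace each trace such as "*-1" with the words of its antecedent."""
    resolved = [tree[0]]
    for child in tree[1:]:
        if isinstance(child, list):
            resolved.append(resolve_traces(child, antecedents))
        else:
            match = re.fullmatch(r"\*-(\d+)", child)
            resolved.extend(antecedents[match.group(1)] if match else [child])
    return resolved

tree = parse(TREE)
resolved = resolve_traces(tree, indexed_phrases(tree))
print(resolved)
```

After resolution the embedded clause has "Chris" as its subject, which is the information needed to read off throw(Chris, ball).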

Semantic markup type 2: lexical features

  • Most common type:

    • word-sense tagged corpora

  • Main idea:

    • disambiguate a word in context by tagging its sense

  • Often uses WordNet (Miller et al 1993)

    • WordNet is a lexical taxonomy which represents lexical relations among a large number of words.

      • including hyponymy (IS-A) relations etc

      • For each entry, all the (supposed) senses of the word are given.

    • Main use: identify senses of words in context, mark them up with a pointer to a WordNet sense.

WordNet senses: Move (noun)

(377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")

(70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")

(57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")

(30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")

(5) move -- ((game) a player's turn to take some action permitted by the rules of the game)

WordNet senses: Move (verb)

(130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")

(60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")

(52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")

(20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")
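
A sense inventory like this one can be used as a very simple disambiguation baseline: always pick the most frequent sense. The sketch below assumes that the bracketed numbers in the listing are frequency counts from sense-tagged text. It uses only the noun senses of "move" from the slide, with glosses shortened to their first clause. The names `MOVE_NOUN_SENSES`, `most_frequent_sense` and `relative_frequencies` are illustrative.

```python
# Noun senses of "move" with the counts shown on the slide
# (glosses shortened to their first clause).
MOVE_NOUN_SENSES = [
    (377, "the act of deciding to do something"),
    (70, "the act of changing your residence or place of business"),
    (57, "a change of position that does not entail a change of location"),
    (30, "the act of changing location from one place to another"),
    (5, "(game) a player's turn to take some action permitted by the rules"),
]

def most_frequent_sense(senses):
    """Return the (count, gloss) pair with the highest count."""
    return max(senses, key=lambda sense: sense[0])

def relative_frequencies(senses):
    """Turn raw counts into proportions of all tagged occurrences."""
    total = sum(count for count, _ in senses)
    return [(gloss, count / total) for count, gloss in senses]

count, gloss = most_frequent_sense(MOVE_NOUN_SENSES)
print(count, gloss)
for sense_gloss, share in relative_frequencies(MOVE_NOUN_SENSES):
    print(f"{share:.2f}  {sense_gloss}")
```

A sense-tagged corpus goes further than this baseline by recording, for each occurrence of the word in context, which of these senses applies.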

Check it out!

  • WordNet is freely available for download.


Word sense annotation: other uses

  • tagging words with their semantic field (Wilson 1996)

    • plant life

    • men’s clothing

  • tagging words with their “emotional” content (Campbell & Pennebaker 2002) based on a dictionary:

    • social processes

    • negative emotions

  • This approach underlies Pennebaker’s Linguistic Inquiry and Word Count (LIWC) system, which:

    • analyses a text and comes up with a profile of its personal/emotional content

    • relates this to some features of its author (gender, age…)
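
The dictionary-based idea can be sketched as follows. The word lists in `CATEGORY_WORDS` are invented for illustration and are not the LIWC dictionary, which is much larger and organised differently. Only the two category names come from the slide.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: category -> words that signal it.
CATEGORY_WORDS = {
    "social processes": {"friend", "talk", "family", "we"},
    "negative emotions": {"sad", "angry", "hate", "afraid"},
}

def profile(text):
    """Return, for each category, the proportion of words that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for category, vocabulary in CATEGORY_WORDS.items():
            if word in vocabulary:
                counts[category] += 1
    if not words:
        return {category: 0.0 for category in CATEGORY_WORDS}
    return {category: counts[category] / len(words) for category in CATEGORY_WORDS}

result = profile("We talk to a friend when we are sad or angry")
print(result)
```

The resulting proportions form the kind of profile that can then be compared across authors or groups of texts.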

Discourse annotation

  • Most common:

    • text-level things such as paragraphs

  • Less common:

    • anaphoric NPs and reference (cf. example from lecture 3)

  • Even less common:

    • annotation of words which function as discourse cues (Stenstrom 1984):

      • apology (sorry), hedges (sort of), etc

    • annotation of rhetorical structure

Discourse: Annotating rhetorical structure (I)

  • Rhetorical Structure Theory (Mann and Thompson 1988):

    • views text as made up of “discourse units”

    • units stand in various rhetorical relations, which reflect their role in constructing an argument, a narrative, etc


    • [Although Mr. Freeman is retiring,] [he will continue to work as a consultant for American Express on a project basis].

    • Second unit is the main one (nucleus)

    • First unit (satellite) “concedes” that what the main unit is saying is contradicted by another fact.

  • A recent corpus (Marcu et al 2003) is annotated with this information.
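
One way to picture such an annotation is as a record linking a nucleus and a satellite through a named relation. The sketch below is a simplification: the `Relation` class and the `nuclei` helper are hypothetical, the relation name "concession" is inferred from the description of the example, and real RST annotations are trees in which units can themselves contain relations.

```python
from dataclasses import dataclass

@dataclass
class Relation:
    name: str        # rhetorical relation holding between the two units
    nucleus: str     # the main unit
    satellite: str   # the supporting unit

# The example sentence from the slide, split into its two discourse units.
concession = Relation(
    name="concession",
    nucleus="he will continue to work as a consultant for American Express "
            "on a project basis",
    satellite="Although Mr. Freeman is retiring,",
)

def nuclei(relations):
    """Keep only the main units, e.g. as a crude outline of the text."""
    return [relation.nucleus for relation in relations]

print(nuclei([concession]))
```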

    Phonetic transcription

    • Not many phonetically transcribed corpora.

      • MARSEC corpus is one of the best known. This is a version of the Lancaster/IBM Spoken English Corpus.

      • There are, however, several databases of transcribed speech, mostly used for statistical speech technology applications (e.g. text-to-speech synthesis).

    Annotating suprasegmentals

    • Aim: capture suprasegmental features such as stress, intonation and pauses in speech.

    • Some transcription systems exist:

      • ToBI (American)

      • Tonic Stress Marker (TSM; British)

    • These define ways of annotating suprasegmentals such as the start/end of a tone group, simultaneous speech, rise-fall tone, falling tone, etc.

    Problem-oriented tagging

    • If you’re interested in a particular problem, and no corpus exists, build your own!

    • Many corpora define problem-specific annotation schemes.

    Example: the TUNA Corpus

    • Problem: How do people refer to objects using definite NPs?

      • Main interest: visual properties (colour, size etc)

      • Focus: semantics of definite NPs, i.e. what people choose to include in their description.

    • Method:

      • experiment to get people to describe objects, distinguishing them from other objects in the same visual “scene”

      • annotation of descriptions based on semantics

    TUNA Corpus: description


    <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>

    <ATTRIBUTE NAME="type" VALUE="sofa"> sofa</ATTRIBUTE>

    <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>


    Red sofa, bigger version.

    • Features of the corpus:

    • represents the “target” referent

    • also represents the “distractors” (from which the target must be distinguished)

    • semantically transparent: annotation goes beyond language
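
Because the annotation is XML, the semantic content of a description can be pulled out with a standard parser. In the sketch below the three ATTRIBUTE elements are those from the slide. The surrounding `DESCRIPTION` element is an assumption added only so that the fragment is well-formed, and `semantic_content` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# ATTRIBUTE elements copied from the slide; the DESCRIPTION wrapper is an
# assumption made so that the fragment parses as a single XML document.
ANNOTATED = """<DESCRIPTION>
<ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
<ATTRIBUTE NAME="type" VALUE="sofa"> sofa</ATTRIBUTE>
<ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>"""

def semantic_content(xml_text):
    """Map each attribute name to (normalised value, words the speaker used)."""
    root = ET.fromstring(xml_text)
    return {
        attribute.get("NAME"): (attribute.get("VALUE"), (attribute.text or "").strip())
        for attribute in root.iter("ATTRIBUTE")
    }

content = semantic_content(ANNOTATED)
print(content)
```

The "size" entry shows what semantic transparency means here: the speaker said "bigger version", but the annotation records the underlying value "large".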

    Part 2

    Multilingual corpora

    Why multilingual corpora?

    • comparative studies

      • syntax

      • morphology

    • the cornerstone of most current research in machine translation

      • most MT systems are statistical, trained on large repositories of parallel (e.g. English-Chinese) text.

    Parallel corpora

    • Represent a text in its original language (L1), together with a translation in another language (L2)

      • long history: Medieval polyglot bibles were among the first “parallel” corpora

    • Alignment:

      • Many parallel corpora align L1 and L2 at sentence level, sometimes also at word level…

      • Sentence-level alignment can be achieved automatically with very high accuracy!

    Example: SMULTRON corpus

    • Developed and released in 2007-8

    • Relatively small

    • Aligned texts in English, Swedish and German

      • E.g. Sophie’s World is one of the texts

    • Annotated with syntax, POS, morphology

    • Comes with a tool to view parallel syntactic trees.

    SMULTRON example: English (Sophie’s World)

    <s id="s3">


    <t id="s3_1" word="Sophie" pos="NNP" morph="--"/>

    <t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>

    <t id="s3_3" word="was" pos="VBD" morph="--"/>

    <t id="s3_4" word="on" pos="IN" morph="--"/>

    <t id="s3_5" word="her" pos="PRP$" morph="--"/>

    <t id="s3_6" word="way" pos="NN" morph="--"/>

    <t id="s3_7" word="home" pos="RB" morph="--"/>

    <t id="s3_8" word="from" pos="IN" morph="--"/>

    <t id="s3_9" word="school" pos="NN" morph="--"/>

    <t id="s3_10" word="." pos="." morph="--"/>



    This shows terminal nodes only. The corpus also represents syntactic non-terminals (NP, VP etc).

    SMULTRON: Same sentence in German

    <s id="3">


    <t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie " />

    <t id="s3_2" word="Amundsen" pos="NE" morph="--" lemma="Amundsen" />

    <t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>

    <t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf" />

    <t id="s3_5" word="dem" pos="ART" morph="--" lemma="der" />

    <t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg" />

    <t id="s3_7" word="von" pos="APPR" morph="--" lemma="von" />

    <t id="s3_8" word="der" pos="ART" morph="--" lemma="die" />

    <t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e" />

    <t id="s3_10" word="." pos="$." morph="--" lemma="--" />



    Note: richer morphology, representation of lemmas, …
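
Terminal nodes in this format can be read with a standard XML parser. The sketch below uses three of the German tokens shown above, wrapped in an `<s>` element with an explicit closing tag so that the fragment is well-formed. The function name `terminals` is illustrative and is not part of any SMULTRON tool.

```python
import xml.etree.ElementTree as ET

# Three terminals from the German sentence; the closing </s> tag is added
# here so that the fragment parses on its own.
SENTENCE = """<s id="s3">
<t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
<t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg"/>
<t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e"/>
</s>"""

def terminals(xml_text):
    """Return one dict of attributes (word, pos, morph, lemma) per token."""
    root = ET.fromstring(xml_text)
    return [dict(token.attrib) for token in root.iter("t")]

tokens = terminals(SENTENCE)
nouns = [(t["word"], t["morph"]) for t in tokens if t["pos"] == "NN"]
lemmas = {t["word"]: t["lemma"] for t in tokens}
print(nouns)
print(lemmas["war"])
```

Filtering on `pos` and reading `morph` and `lemma` is enough to compare, for example, noun gender marking in the German text with the unmarked English tokens.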

    Translation corpora

    • Not parallel.

    • Have different texts in two or more different languages, of the same genre.

    • Examples:

      • PAROLE corpus is a translation corpus for EU languages

    Why translation corpora?

    • Parallel corpora, by definition, contain translated text (L2)

      • can give rise to errors

      • artificiality and translation quality can be an issue

        • e.g. McEnery & Wilson report a study on an English-Polish corpus. The Polish text reads “like a translation”

        • Problem can be overcome if the texts used are professionally translated.

    • Translation corpora have texts in two or more languages, “in the original”.

      • Data is more natural.


    • We have now concluded our initial incursion into:

      • corpus construction

      • corpus annotation

      • corpus typology

    • Next up:

      • using corpora for linguistic research
