
LIN 3098 Corpus Linguistics – Lecture 4

LIN 3098 Corpus Linguistics – Lecture 4. Albert Gatt. In this lecture. Levels of annotation Corpus typology classification based on type and levels of annotation multilingual corpora. Part 1. Levels of corpus annotation (cont/d). Levels of linguistic annotation.


Presentation Transcript


  1. LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt

  2. In this lecture • Levels of annotation • Corpus typology • classification based on type and levels of annotation • multilingual corpora LIN 3098 -- Corpus Linguistics

  3. Part 1 Levels of corpus annotation (cont'd)

  4. Levels of linguistic annotation • part-of-speech (word-level) • lemmatisation (word-level) • parsing (phrase & sentence-level) • semantics (multi-level) • semantic relationships between words and phrases • semantic features of words • discourse features (supra-sentence level) • phonetic transcription • prosody

  5. Lemmatisation • Groups morphological variants of a word under the head word: • mexa’ (walk) • imxejt (I walked) • imxejna (we walked) • nimxu (we walk) • ... Together, these form a lemma. • Increasingly common these days.

  6. Lemmatisation example: the SUSANNE corpus • Format: word + tag + lemma
      A05:0030.33 - VVDv said say
      Field labels: A05:0030.33 = corpus file:sentence.word; VVDv = POS tag; said = actual word; say = head word (lemma)
      • Every word in the corpus is on a separate line. • Extremely useful for lexicography.
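The line format above can be read mechanically. The following is a minimal sketch, assuming whitespace-separated fields exactly as on the slide (the real SUSANNE files may carry further fields). The name `flag` for the unlabelled second field and the second input line are hypothetical, added only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    reference: str  # corpus file:sentence.word, e.g. "A05:0030.33"
    flag: str       # second field ("-" on the slide); not labelled there
    pos: str        # part-of-speech tag
    word: str       # the word as it appears in the text
    lemma: str      # head word (lemma)


def parse_line(line: str) -> Token:
    """Split one corpus line into its five fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return Token(*fields)


def group_by_lemma(tokens):
    """Collect every word form seen under its lemma."""
    groups = defaultdict(set)
    for token in tokens:
        groups[token.lemma].add(token.word)
    return dict(groups)


lines = [
    "A05:0030.33 - VVDv said say",  # the line shown on the slide
    "X00:0000.00 - VVZv says say",  # hypothetical line, for illustration only
]
tokens = [parse_line(line) for line in lines]
print(group_by_lemma(tokens))
```

Grouping by the lemma field is what makes such a corpus useful for lexicography: all forms of a head word can be retrieved with one lookup.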

  7. Automatic morphological analysis • For some languages, there are reasonably good lemmatisers/morphological analysers: • Examples for English: • morpha: built at the University of Sussex • EngTwol: commercial, by LingSoft.

  8. EngTwol output • undeniable: • "undeniable" <DER:ble> A ABS • (derived with -ble suffix) • adjective (A) • absolute (ABS) form • This is a rule-based analyser. There are others which use corpus-derived statistical patterns.

  9. Semantic annotation I: Two types • markup of semantic relations (e.g. predicate-argument structure) • currently used in parsed corpora, to mark up function-argument structures etc. • markup of features of word meaning (mainly, word senses) • has its origins in content analysis, which draws conclusions about how prominent particular concepts are • now used in a lot of work on word sense disambiguation

  10. Example of type 1 semantic markup (Penn Treebank)
      (S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) (VP to (VP throw (NP the ball))))))
      • Predicate-argument structure: wants(Chris, throw(Chris, ball)) • The empty embedded subject (*-1) is linked to NP subject no. 1.
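The link between the empty subject and its antecedent can be followed programmatically. This is a rough sketch, assuming the standard Penn Treebank notation in which an indexed constituent is written `NP-SBJ-1` and the co-indexed empty element `*-1`. The helper names (`parse`, `antecedents`, `subjects`) are made up for this example, and the code only recovers the subject of each clause, not the full predicate-argument structure.

```python
import re

TREE = "(S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) (VP to (VP throw (NP the ball))))))"


def parse(text):
    """Turn a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words (and empty elements) under a node, left to right."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def antecedents(tree, table=None):
    """Map each index (e.g. "1") to the words of the constituent bearing it."""
    table = {} if table is None else table
    match = re.search(r"-(\d+)$", tree[0])
    if match:
        table[match.group(1)] = leaves(tree)
    for child in tree[1]:
        if isinstance(child, tuple):
            antecedents(child, table)
    return table


def subjects(tree, table, found=None):
    """List the subject of every S node, outermost first, resolving *-n."""
    found = [] if found is None else found
    if tree[0] == "S":
        for child in tree[1]:
            if isinstance(child, tuple) and child[0].startswith("NP-SBJ"):
                words = []
                for word in leaves(child):
                    trace = re.fullmatch(r"\*-(\d+)", word)
                    words.extend(table[trace.group(1)] if trace else [word])
                found.append(" ".join(words))
    for child in tree[1]:
        if isinstance(child, tuple):
            subjects(child, table, found)
    return found


tree = parse(TREE)
print(subjects(tree, antecedents(tree)))
```

Both clauses end up with the subject "Chris", which is the information needed to write throw(Chris, ball) for the embedded clause.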

  11. Semantic markup type 2: lexical features • Most common type: • word-sense tagged corpora • Main idea: • disambiguate a word in context by tagging its sense • Often uses WordNet (Miller et al 1993) • WordNet is a lexical taxonomy which represents lexical relations among a large number of words. • including hyponymy (IS-A) relations etc • For each entry, all the (supposed) senses of the word are given. • Main use: identify senses of words in context, mark them up with a pointer to a WordNet sense.

  12. WordNet senses: Move (noun)
      (377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")
      (70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")
      (57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")
      (30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")
      (5) move -- ((game) a player's turn to take some action permitted by the rules of the game)

  13. WordNet senses: Move (verb)
      (130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")
      (60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")
      (52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")
      (20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")
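To make the idea of "a pointer to a WordNet sense" concrete, here is a small sketch built only from the noun senses of "move" listed above. It assumes the parenthesised numbers are frequency counts for each sense, and it uses a most-frequent-sense baseline, which is a deliberately naive stand-in for real word sense disambiguation. The function names and the (token, sense number) output format are invented for this example.

```python
# Noun senses of "move" as listed above: (count, synonyms, gloss).
NOUN_SENSES = [
    (377, "move", "the act of deciding to do something"),
    (70, "move, relocation",
     "the act of changing your residence or place of business"),
    (57, "motion, movement, move, motility",
     "a change of position that does not entail a change of location"),
    (30, "motion, movement, move",
     "the act of changing location from one place to another"),
    (5, "move",
     "(game) a player's turn to take some action permitted by the rules"),
]


def most_frequent_sense(senses):
    """Return the 1-based number of the sense with the highest count."""
    counts = [count for count, _, _ in senses]
    return counts.index(max(counts)) + 1


def tag_sense(tokens, target, sense_number):
    """Pair every token with a sense pointer (None for untagged words)."""
    return [(tok, sense_number if tok == target else None) for tok in tokens]


sentence = "his first move was to hire a lawyer".split()
tagged = tag_sense(sentence, "move", most_frequent_sense(NOUN_SENSES))
print(tagged)
```

A real sense tagger would choose the sense from the context rather than always picking the most frequent one; the baseline happens to be right here because the sentence is one of the examples given for sense 1.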

  14. Check it out! • WordNet is freely available for download: • http://wordnet.princeton.edu/

  15. Word sense annotation: other uses • tagging words with their semantic field (Wilson 1996) • plant life • men’s clothing • … • tagging words with their “emotional” content (Campbell & Pennebaker 2002) based on a dictionary: • social processes • negative emotions • This approach underlies Pennebaker’s Linguistic Inquiry and Word Count (LIWC) system, which: • analyses a text and comes up with a profile of its personal/emotional content • relates this to some features of its author (gender, age…)
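Dictionary-based tagging of this kind can be sketched in a few lines. The category names below come from the slide, but the word lists are invented for illustration and are not the LIWC dictionary; the function name `profile` is likewise hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: category -> words assigned to it.
CATEGORY_DICTIONARY = {
    "social processes": {"friend", "talk", "family", "we"},
    "negative emotions": {"sad", "angry", "hate", "afraid"},
}


def profile(text):
    """Return, per category, the proportion of words in the text it covers."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for category, vocabulary in CATEGORY_DICTIONARY.items():
            if word in vocabulary:
                counts[category] += 1
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {category: counts[category] / total
            for category in CATEGORY_DICTIONARY}


print(profile("We talk to family when we are sad"))
```

The resulting proportions form the kind of profile that can then be compared across texts or related to properties of their authors.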

  16. Discourse annotation • Most common: • text-level units such as paragraphs • Less common: • anaphoric NPs and reference (cf. example from lecture 3) • Even less common: • annotation of words which function as discourse cues (Stenstrom 1984): • apology (sorry), hedges (sort of), etc • annotation of rhetorical structure

  17. Discourse: Annotating rhetorical structure (I) • Rhetorical Structure Theory (Mann and Thompson 1988): • views text as made up of “discourse units” • units stand in various rhetorical relations, which reflect their role in constructing an argument, a narrative, etc • CONCESSION/CONTRAST relation: • [Although Mr. Freeman is retiring,] [he will continue to work as a consultant for American Express on a project basis]. • The second unit is the main one (nucleus) • The first unit (satellite) “concedes” that what the main unit is saying is contradicted by another fact. • A recent corpus (Marcu et al 2003) is annotated with this information.
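One way to picture such an annotation is as a record holding the relation name and its two units. The sketch below is only an illustration of the nucleus/satellite distinction; the class and function names are invented and do not reflect the format of any annotated corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str       # rhetorical relation, e.g. "CONCESSION"
    satellite: str  # the supporting unit
    nucleus: str    # the main unit


def core_message(relation: Relation) -> str:
    """Return the nucleus, i.e. the unit that carries the main point."""
    return relation.nucleus


example = Relation(
    name="CONCESSION",
    satellite="Although Mr. Freeman is retiring,",
    nucleus=("he will continue to work as a consultant "
             "for American Express on a project basis"),
)
print(example.name, "->", core_message(example))
```

Keeping the two roles in separate fields means the main point of a text span can be retrieved without re-reading the satellite.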

  18. Phonetic transcription • Not many phonetically transcribed corpora. • MARSEC corpus is one of the best known. This is a version of the Lancaster/IBM Spoken English Corpus. • There are, however, several databases of transcribed speech, mostly used for statistical speech technology applications (e.g. text-to-speech synthesis).

  19. Annotating suprasegmentals • Aims: capture suprasegmental features such as stress, intonation and pauses in speech. • Some transcription systems exist • ToBI (American) • Tonic Stress Marker (TSM; British) • These define ways of annotating suprasegmentals such as start/end of tone group, simultaneous speech, rise-fall tone, falling tone, etc…

  20. Problem-oriented tagging • If you’re interested in a particular problem, and no corpus exists, build your own! • Many corpora define problem-specific annotation schemes.

  21. Example: the TUNA Corpus • Problem: How do people refer to objects using definite NPs? • Main interest: visual properties (colour, size etc) • Focus: semantics of definite NPs, i.e. what people choose to include in their description. • Method: • experiment to get people to describe objects, distinguishing them from other objects in the same visual “scene” • annotation of descriptions based on semantics

  22. TUNA Corpus: description
      Description text: Red sofa, bigger version.
      <DESCRIPTION NUM="SINGULAR">
        <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
        <ATTRIBUTE NAME="type" VALUE="sofa"> sofa</ATTRIBUTE>
        <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
      </DESCRIPTION>
      • Features of the corpus: • represents the “target” referent • also represents the “distractors” (from which the target must be distinguished) • semantically transparent: annotation goes beyond language
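Because the annotation is plain XML, the semantic content of a description can be pulled out with a standard parser. This is a minimal sketch using Python's `xml.etree.ElementTree` on the snippet from the slide; the function name `semantic_content` is made up, and the full corpus files presumably contain more structure (targets, distractors) than is shown here.

```python
import xml.etree.ElementTree as ET

DESCRIPTION_XML = """<DESCRIPTION NUM="SINGULAR">
  <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
  <ATTRIBUTE NAME="type" VALUE="sofa"> sofa</ATTRIBUTE>
  <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>"""


def semantic_content(xml_text):
    """Return (attribute name, value, wording used) for each ATTRIBUTE."""
    root = ET.fromstring(xml_text)
    return [
        (attr.get("NAME"), attr.get("VALUE"), (attr.text or "").strip())
        for attr in root.iter("ATTRIBUTE")
    ]


for name, value, wording in semantic_content(DESCRIPTION_XML):
    print(f"{name} = {value!r}, expressed as {wording!r}")
```

The third row shows why the annotation "goes beyond language": the wording "bigger version" is mapped to the value `large` of the attribute `size`.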

  23. Part 2 Multilingual corpora

  24. Why multilingual corpora? • comparative studies • syntax • morphology • … • the cornerstone of most research in machine translation nowadays • most MT systems are statistical, trained on large repositories of parallel (e.g. English-Chinese) text.

  25. Parallel corpora • Represent a text in its original language (L1), with a translation in another language (L2) • long history: Medieval polyglot bibles were among the first “parallel” corpora • Alignment: • Many parallel corpora align L1 and L2 at sentence level, sometimes also at word level… • Sentence-level alignment can be achieved automatically with very high accuracy!
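To give a feel for how automatic sentence alignment can work, here is a toy sketch based on sentence length alone: a translated sentence tends to be about as long as its source. It is a much simplified relative of the well-known Gale and Church approach, using a crude absolute difference in character length instead of a probabilistic model. The sentences, the `MERGE_PENALTY` value and the function name `align` are all invented for this example.

```python
MERGE_PENALTY = 2          # hypothetical cost for any non one-to-one pairing
BEADS = [(1, 1), (1, 2), (2, 1)]  # allowed source:target sentence groupings


def align(source, target):
    """Pair up sentence groups so that total length mismatch is minimal."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEADS:
                if i + di > n or j + dj > m:
                    continue
                src_len = sum(len(s) for s in source[i:i + di])
                tgt_len = sum(len(t) for t in target[j:j + dj])
                step = abs(src_len - tgt_len) + MERGE_PENALTY * (di + dj - 2)
                if cost[i][j] + step < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = cost[i][j] + step
                    back[i + di][j + dj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment possible with the allowed groupings")
    # Walk back from the end to recover the chosen groupings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((source[pi:i], target[pj:j]))
        i, j = pi, pj
    return pairs[::-1]


english = ["It rained.", "We stayed at home and read a long book."]
german = ["Es regnete.", "Wir blieben zu Hause.", "Wir lasen ein langes Buch."]
for src, tgt in align(english, german):
    print(src, "<->", tgt)
```

Here the second English sentence is paired with two German sentences, since splitting it that way gives the smallest length mismatch. Real aligners add a statistical model of length ratios and also allow insertions and deletions.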

  26. Example: SMULTRON corpus • Developed and released in 2007-8 • Relatively small • Aligned texts in English, Swedish and German • E.g. Sophie’s World is one of the texts • Annotated with syntax, POS, morphology • Comes with a tool to view parallel syntactic trees.

  27. SMULTRON example: English (Sophie’s World)
      <s id="s3">
        <terminals>
          <t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
          <t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>
          <t id="s3_3" word="was" pos="VBD" morph="--"/>
          <t id="s3_4" word="on" pos="IN" morph="--"/>
          <t id="s3_5" word="her" pos="PRP$" morph="--"/>
          <t id="s3_6" word="way" pos="NN" morph="--"/>
          <t id="s3_7" word="home" pos="RB" morph="--"/>
          <t id="s3_8" word="from" pos="IN" morph="--"/>
          <t id="s3_9" word="school" pos="NN" morph="--"/>
          <t id="s3_10" word="." pos="." morph="--"/>
        </terminals>
      </s>
      This shows terminal nodes only. The corpus also represents syntactic non-terminals (NP, VP etc).
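The terminal nodes above can be turned into word/tag pairs with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` and covers only the `<terminals>` part shown on the slide, with straight quotes around attribute values; the function name `tagged_words` is made up for this example.

```python
import xml.etree.ElementTree as ET

SENTENCE_XML = """<s id="s3">
  <terminals>
    <t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
    <t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>
    <t id="s3_3" word="was" pos="VBD" morph="--"/>
    <t id="s3_4" word="on" pos="IN" morph="--"/>
    <t id="s3_5" word="her" pos="PRP$" morph="--"/>
    <t id="s3_6" word="way" pos="NN" morph="--"/>
    <t id="s3_7" word="home" pos="RB" morph="--"/>
    <t id="s3_8" word="from" pos="IN" morph="--"/>
    <t id="s3_9" word="school" pos="NN" morph="--"/>
    <t id="s3_10" word="." pos="." morph="--"/>
  </terminals>
</s>"""


def tagged_words(xml_text):
    """Return the sentence as a list of (word, POS tag) pairs."""
    root = ET.fromstring(xml_text)
    return [(t.get("word"), t.get("pos")) for t in root.iter("t")]


pairs = tagged_words(SENTENCE_XML)
print(" ".join(f"{word}/{pos}" for word, pos in pairs))
```

The same function would work on other attributes (for instance `morph`) by changing the names passed to `get`.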

  28. SMULTRON: Same sentence in German
      <s id="3">
        <terminals>
          <t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie " />
          <t id="s3_2" word="Amundsen" pos="NE" morph="--" lemma="Amundsen" />
          <t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
          <t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf" />
          <t id="s3_5" word="dem" pos="ART" morph="--" lemma="der" />
          <t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg" />
          <t id="s3_7" word="von" pos="APPR" morph="--" lemma="von" />
          <t id="s3_8" word="der" pos="ART" morph="--" lemma="die" />
          <t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e" />
          <t id="s3_10" word="." pos="$." morph="--" lemma="--" />
        </terminals>
      </s>
      Note: richer morphology, representation of lemmas, …

  29. Translation corpora • Not parallel. • Have different texts in two or more different languages, of the same genre. • Examples: • The PAROLE corpus is a translation corpus for EU languages

  30. Why translation corpora? • Parallel corpora, by definition, contain translations (L2) • can give rise to errors • artificiality and translation quality can be an issue • e.g. McEnery & Wilson report a study on an English-Polish corpus. The Polish text reads “like a translation” • The problem can be overcome if the texts used are professionally translated. • Translation corpora have texts in two or more languages, “in the original”. • Data is more natural.

  31. Summary • We have now concluded our initial incursion into: • corpus construction • corpus annotation • corpus typology • Next up: • using corpora for linguistic research
