Corpus annotation


Presentation Transcript
CORPUS ANNOTATION

  • Extralinguistic annotation (‘mark-up’) and linguistic annotation (‘tagging’, ‘parsing’, etc.)

  • Why is mark-up essential in corpus building?

  • What is TEI?

  • What are the advantages and the disadvantages of (linguistic) annotation?

  • What are the main methods of corpus annotation? What are the benefits and drawbacks of each one?

  • What are the main uses of tagging?

  • Other kinds of annotation


CORPUS ANNOTATION

McEnery, T., R. Xiao and Y. Tono (2006), "Corpus mark-up" and "Corpus annotation", in Corpus-Based Language Studies: An Advanced Resource Book. London: Routledge, 22-45.


Extralinguistic annotation (‘mark-up’) vs linguistic annotation (‘tagging’, ‘parsing’, etc.)

  • Extralinguistic annotation (‘mark-up’)

    • A system of standard codes inserted into an electronic document to provide information about the text.

    • Kinds of information provided by mark-up:

    • Internal organization of text: sections, paragraphs, sentences…

    • External (‘contextual’) information: source of text, authors, age, gender, textual category, number of speakers, etc.

  • Linguistic annotation (‘tagging’, ‘parsing’, etc.)

    • A system of standard codes inserted into an electronic document to provide linguistic information found in the text.


Examples of extralinguistic annotation (‘mark-up’)

<title>How we won the open: the caddies' stories. Sample containing about 36083 words from a book (domain: leisure)</title><!-- ASA-->
<title>Harlow Women's Institute committee meeting. Sample containing about 246 words speech recorded in public context</title>
<title>The Scotsman: Arts section. Sample containing about 48246 words from a periodical (domain: arts)</title>
<title>32 conversations recorded by `Frank' (PS09E) between 21 and 28 February 1992 with 9 interlocutors, totalling 3193 s-units, 20607 words, and 3 hours 22 minutes 23 seconds of recordings.</title>
<title>[Leaflets advertising goods and products]. Sample containing about 23409 words of miscellanea (domain: commerce)</title>

<person age="Ag0" dialect="XLO" xml:id="PS5A1" role="self" sex="m" soc="C2">
  <name>Terry</name>
  <age>14</age>
  <occupation>student</occupation>
  <dialect>London</dialect>
</person>
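Because this mark-up is regular XML, the contextual information it carries can be read by a program. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the function name speaker_profile and the keys of the returned dictionary are illustrative choices, not part of the BNC.

```python
import xml.etree.ElementTree as ET

# The <person> element from the example above, re-indented.
PERSON_XML = """
<person age="Ag0" dialect="XLO" xml:id="PS5A1" role="self" sex="m" soc="C2">
  <name>Terry</name>
  <age>14</age>
  <occupation>student</occupation>
  <dialect>London</dialect>
</person>
"""

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def speaker_profile(xml_text):
    """Return a flat dict built from the attributes and children of <person>.

    The key names ("id", "social_class", ...) are hypothetical labels
    chosen for this sketch.
    """
    person = ET.fromstring(xml_text.strip())
    profile = {
        "id": person.get(XML_NS + "id"),
        "sex": person.get("sex"),
        "social_class": person.get("soc"),
        "role": person.get("role"),
    }
    # Child elements carry the human-readable values (name, age, ...).
    for child in person:
        profile[child.tag] = (child.text or "").strip()
    return profile


profile = speaker_profile(PERSON_XML)
print(profile["id"], profile["name"], profile["age"], profile["dialect"])
```

With speaker metadata in this form, a corpus user could, for instance, restrict a search to speakers of a given sex or social class.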


Examples of linguistic annotation (‘tagging’, ‘parsing’, etc.)

Word-class tagging in the BNC:

apparently we eat more chocolate than any other country.

<w c5="AV0" hw="apparently" pos="ADV">apparently </w> <w c5="PNP" hw="we" pos="PRON">we </w> <w c5="VVB" hw="eat" pos="VERB">eat </w> <w c5="DT0" hw="more" pos="ADJ">more </w> <w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w> <w c5="CJS" hw="than" pos="CONJ">than </w> <w c5="DT0" hw="any" pos="ADJ">any </w> <w c5="AJ0" hw="other" pos="ADJ">other </w> <w c5="NN1" hw="country" pos="SUBST">country</w> <c c5="PUN">.</c>
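The tagged sentence above can be turned into a list of tokens with their word-class labels. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the wrapping <s> root element and the function name read_tokens are added here only so that the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# The tagged sentence from the example above, wrapped in an <s> root
# element (an assumption of this sketch) so that it is well-formed XML.
SENTENCE_XML = """<s>
<w c5="AV0" hw="apparently" pos="ADV">apparently </w>
<w c5="PNP" hw="we" pos="PRON">we </w>
<w c5="VVB" hw="eat" pos="VERB">eat </w>
<w c5="DT0" hw="more" pos="ADJ">more </w>
<w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w>
<w c5="CJS" hw="than" pos="CONJ">than </w>
<w c5="DT0" hw="any" pos="ADJ">any </w>
<w c5="AJ0" hw="other" pos="ADJ">other </w>
<w c5="NN1" hw="country" pos="SUBST">country</w>
<c c5="PUN">.</c>
</s>"""


def read_tokens(xml_text):
    """Return one dict per <w> or <c> element: form, C5 tag, pos, headword."""
    tokens = []
    for element in ET.fromstring(xml_text):
        tokens.append({
            "form": (element.text or "").strip(),
            "c5": element.get("c5"),
            "pos": element.get("pos"),  # None for punctuation (<c>)
            "hw": element.get("hw"),    # None for punctuation (<c>)
        })
    return tokens


tokens = read_tokens(SENTENCE_XML)
# All singular common nouns (C5 tag NN1) in the sentence.
print([t["form"] for t in tokens if t["c5"] == "NN1"])
```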


Examples of linguistic annotation (‘tagging’, ‘parsing’, etc.)

Syntactic/grammatical/formal tagging in ICE-GB:



Why is mark-up essential in corpus building?

  • ‘Contextualization’ of texts in a corpus

    “Contextual information is needed to restore the context and to enable us to relate the specimen [i.e. the text] to its original habitat [i.e. its context]”

  • ‘Enrichment’ of raw data with textual information (e.g. sentence boundaries) and extra-textual information (source, author, age, sex)

    “Mark-up adds value to a corpus and allows for a broader range of research questions to be addressed as a result.”

  • ‘Editing’ and ‘transcribing’: omissions (graphics, tables), foreign words, turn-taking, interruptions, overlaps, laughter


What is TEI?

  • Mark-up schemes: COCOA, DCMI, OLAC, IMDI, CES, TEI (Text Encoding Initiative)

  • Aim of TEI:

    “to facilitate data exchange by standardizing the mark-up or encoding of information stored in electronic form.”

  • Example of use of TEI mark-up system: BNC

    BNC Header
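As a rough illustration of what a TEI header contains, the sketch below builds a minimal header with Python's standard xml.etree.ElementTree module. It uses only the three components a TEI file description requires (titleStmt, publicationStmt, sourceDesc). It is not the BNC's actual header: the function name and the two placeholder notes are assumptions of this sketch, and only the title is taken from the mark-up examples earlier in these slides.

```python
import xml.etree.ElementTree as ET


def minimal_tei_header(title, publication_note, source_note):
    """Build a bare-bones <teiHeader> element (hypothetical helper)."""
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")

    # Title statement: what the text is called.
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    # Publication statement: who distributes the electronic text.
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = publication_note

    # Source description: where the text comes from.
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = source_note
    return header


header = minimal_tei_header(
    "Harlow Women's Institute committee meeting. Sample containing "
    "about 246 words speech recorded in public context",
    "Placeholder: publication details not given here.",  # assumption
    "Placeholder: source details not given here.",       # assumption
)
print(ET.tostring(header, encoding="unicode"))
```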


Linguistic annotation (tagging, parsing, etc.)

  • Linguistic information encoded within the corpus itself.

  • Like corpus mark-up, annotation adds value to a corpus:

    “Annotation is a crucial contribution to the benefit a corpus brings, since it enriches the corpus as a source of linguistic information for future research and development” (Leech, 1997, p.2)

  • As opposed to mark-up (which is ‘objective’), annotation is ‘interpretive’, i.e. implies a previous linguistic analysis or interpretation of text


ADVANTAGES of (linguistic) annotation

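  • Annotation facilitates the extraction of information from a corpus, e.g. homographs (left, light, play), word classes (N, V), and syntactic functions and structures (OD ‘direct object’, OI ‘indirect object’, PP, Rel Cl)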

  • Speed of data extraction

  • Reliability

  • Reusability

  • Multifunctionality

  • Explicitness

  • Reference resource


‘left’ in the BNC


DISADVANTAGES of (linguistic) annotation

  • Annotation ‘clutters’ corpora:

    “However much annotation is added to a text, it is important for the researcher to be able to see the plain text, uncluttered by annotational labels. The basic patterning of the words alone must be observable at all times.” (Hunston, 2002:94)

    <w c5="AV0" hw="apparently" pos="ADV">apparently </w> <w c5="PNP" hw="we" pos="PRON">we </w> <w c5="VVB" hw="eat" pos="VERB">eat </w> <w c5="DT0" hw="more" pos="ADJ">more </w> <w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w> <w c5="CJS" hw="than" pos="CONJ">than </w> <w c5="DT0" hw="any" pos="ADJ">any </w> <w c5="AJ0" hw="other" pos="ADJ">other </w> <w c5="NN1" hw="country" pos="SUBST">country</w> <c c5="PUN">.</c>

    …apparently we eat more chocolate than any other country.
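One common answer to the ‘clutter’ problem is to hide the annotation when the text is displayed. The following is a minimal sketch, assuming the annotation is inline XML as in the BNC example; the function name plain_text and the regular expressions are illustrative, and a real corpus tool would normally use an XML parser instead.

```python
import re

# The annotated sentence from the slide above.
ANNOTATED = """
<w c5="AV0" hw="apparently" pos="ADV">apparently </w>
<w c5="PNP" hw="we" pos="PRON">we </w>
<w c5="VVB" hw="eat" pos="VERB">eat </w>
<w c5="DT0" hw="more" pos="ADJ">more </w>
<w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w>
<w c5="CJS" hw="than" pos="CONJ">than </w>
<w c5="DT0" hw="any" pos="ADJ">any </w>
<w c5="AJ0" hw="other" pos="ADJ">other </w>
<w c5="NN1" hw="country" pos="SUBST">country</w>
<c c5="PUN">.</c>
"""

TAG_RE = re.compile(r"<[^>]+>")


def plain_text(annotated):
    """Strip inline tags and tidy the spacing that is left behind."""
    text = TAG_RE.sub("", annotated)              # drop <w ...>, </w>, <c ...>
    text = re.sub(r"\s+", " ", text).strip()      # collapse runs of whitespace
    return re.sub(r"\s+([.,;:!?])", r"\1", text)  # re-attach punctuation


print(plain_text(ANNOTATED))
# prints: apparently we eat more chocolate than any other country.
```

The annotation is not lost: it stays in the stored file and is only removed from the view presented to the researcher.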


DISADVANTAGES of (linguistic) annotation

  • Annotation imposes a linguistic analysis upon a corpus user:

    “Annotation should serve the needs of the corpus user, not determine the direction the investigation must take” (Hunston)

    E.g. the indirect object (OI) in ICE:

    We gave THEM[OI] some food vs We gave some food TO THEM[A]

    ‘Dimonotransitive’ (dimontr)

    Dimonotransitive verbs (dimontr) are complemented by an Indirect Object only. They include show, ask, assure, grant, inform, promise, reassure, and tell.

    When I asked her, she burst into tears V(dimontr,past)

    I’ll tell you tomorrow V(dimontr,infin)

    Show me V(dimontr,imp)


DISADVANTAGES of (linguistic) annotation

  • (Un)Reliability of annotation: accuracy / consistency


The press swung heavily to the left

  • Center for Sprogteknologi (University of Copenhagen) (http://cst.dk/online/pos_tagger/uk/index.html)

    the/DT press/NN swung/VBD heavily/RB to/TO the/DT left/VBN

  • CLAWS tagger (http://ucrel.lancs.ac.uk/claws/trial.html)

    The_AT0 press_NN1 swung_VVD heavily_AV0 to_PRP the_AT0 left_AJ0

  • Stanford parser (http://nlp.stanford.edu:8080/parser/)

    The/DT press/NN swung/VBD heavily/RB to/TO the/DT left/NN
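The three outputs above can be compared automatically. The sketch below is only an illustration: the taggers use different tagsets (Penn Treebank-style tags for CST and Stanford, C5 for CLAWS), so the tags are first reduced to coarse word classes through a small mapping table. That table (COARSE) is an assumption made for this one sentence, not an official tagset conversion; in particular, treating Penn TO as a preposition only holds for this example.

```python
import re

# Tagger outputs copied from the slide above.
OUTPUTS = {
    "CST": "the/DT press/NN swung/VBD heavily/RB to/TO the/DT left/VBN",
    "CLAWS": "The_AT0 press_NN1 swung_VVD heavily_AV0 to_PRP the_AT0 left_AJ0",
    "Stanford": "The/DT press/NN swung/VBD heavily/RB to/TO the/DT left/NN",
}

# Hypothetical coarse mapping, sufficient for this sentence only.
COARSE = {
    "DT": "DET", "AT0": "DET",
    "NN": "NOUN", "NN1": "NOUN",
    "VBD": "VERB", "VVD": "VERB", "VBN": "VERB",
    "RB": "ADV", "AV0": "ADV",
    "TO": "PREP", "PRP": "PREP",
    "AJ0": "ADJ",
}


def parse(line):
    """Split 'word/TAG' or 'word_TAG' items into (word, coarse class) pairs."""
    pairs = []
    for item in line.split():
        word, tag = re.split(r"[/_]", item, maxsplit=1)
        pairs.append((word.lower(), COARSE[tag]))
    return pairs


def disagreements(outputs):
    """Return (word, {tagger: class}) for every token the taggers disagree on."""
    parsed = {name: parse(line) for name, line in outputs.items()}
    result = []
    for column in zip(*parsed.values()):  # one token position at a time
        classes = [word_class for _, word_class in column]
        if len(set(classes)) > 1:
            result.append((column[0][0], dict(zip(parsed, classes))))
    return result


for word, classes in disagreements(OUTPUTS):
    print(word, classes)
```

Under this mapping the only point of disagreement is ‘left’, which the three taggers analyse as a verb form, an adjective and a noun respectively.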


Methods of corpus annotation

  • AUTOMATIC

  • COMPUTER-ASSISTED (‘SEMI-AUTOMATIC’)

  • MANUAL


Types of corpus annotation

  • Phonological: syllable boundaries, prosodic features (stress, tone, pitch)

  • Morphological: prefixes, suffixes, stems

  • Lexico-grammatical (‘tagging’): part of speech (N, V), grammatical features (Sing, Pl, Past), lemma

  • Syntactic (‘parsing’): phrases, clauses, syntactic functions

  • Semantic: semantic field

  • Textual-Discoursal: anaphoric relations, theme/rheme, given/new information

  • Pragmatic: speech acts

  • Stylistic

  • Etc.


Types of corpus annotation

  • Tagging (POS tags):

    • Annotation at UCREL;

    • CLAWS (the tagger used for the BNC, TIME, BYU American Corpus, etc.);

    • Tagging in BNC

  • Parsing: Annotation in the ICE-GB


What are the main uses of tagging?

  • Disambiguation and comparison of distribution/frequency/collocations of homographs: eg: left, light, play, deal

  • Distribution/Frequency of Word-Classes

  • Collocation of items with Word-classes (rather than with other individual items).

  • Sequences of word-classes

  • Etc.
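Several of the uses listed above reduce to simple counts over (word, tag) pairs. The following is a minimal sketch using Python's standard collections.Counter; the data is the single BNC sentence shown earlier in these slides, so the counts are only illustrative, and the three function names are hypothetical.

```python
from collections import Counter

# (word, C5 tag) pairs copied from the BNC example sentence.
TAGGED = [
    ("apparently", "AV0"), ("we", "PNP"), ("eat", "VVB"), ("more", "DT0"),
    ("chocolate", "NN1"), ("than", "CJS"), ("any", "DT0"), ("other", "AJ0"),
    ("country", "NN1"), (".", "PUN"),
]


def tag_frequencies(tagged):
    """Distribution/frequency of word classes."""
    return Counter(tag for _, tag in tagged)


def tag_bigrams(tagged):
    """Sequences of word classes: counts of adjacent tag pairs."""
    tags = [tag for _, tag in tagged]
    return Counter(zip(tags, tags[1:]))


def words_before_class(tagged, target_tag):
    """Collocation with a word class: words immediately before target_tag."""
    return [
        word
        for (word, _), (_, next_tag) in zip(tagged, tagged[1:])
        if next_tag == target_tag
    ]


print(tag_frequencies(TAGGED).most_common(2))
print(tag_bigrams(TAGGED)[("AJ0", "NN1")])
print(words_before_class(TAGGED, "NN1"))
```

On a full corpus the same counts would show, for example, which word classes typically precede a noun, without listing individual nouns.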

