
LELA 30922 Lecture 5

Corpus annotation and SGML

See especially:

R. Garside, G. Leech & A. McEnery (eds), Corpus Annotation, London: Longman (1997), ch. 1, “Introduction” by G. Leech; a similar text is available at http://llc.oxfordjournals.org/cgi/reprint/8/4/275.pdf

C. M. Sperberg-McQueen & L. Burnard (eds), Guidelines for Electronic Text Encoding and Interchange, ch. 2, “A Gentle Introduction to SGML”, available at

http://www-sul.stanford.edu/tools/tutorials/html2.0/gentle.html


Annotation

  • The difference between a corpus and a “mere collection of texts” lies mainly in the value added by annotation

  • Includes generic information about the text, usually stored in a “header”

  • But more significantly, annotations within the text itself


Why annotate?

  • Adds information

  • Reflects some analysis of text

    • Inasmuch as the analysis reflects a commitment to a particular theoretical approach, this can sometimes be a barrier (but see “Theory-neutrality” below)

  • Increases usefulness/reusability of text

  • Multi-functionality

    • May make corpus usable for something not originally foreseen by its compilers


Golden rules of annotation

  • Recoverability

    • It should always be possible to ignore the annotation and reconstruct the corpus in its raw form

  • Extricability

    • Correspondingly, annotations should be easily accessible so that they can be stored separately if necessary (“before and after” versions; both rules are illustrated in the sketch after this list)

  • Transparency: documentation

    • Purpose and meaning of annotations

    • How (eg manually or automatically), where and by whom annotations were done

      • If automatic, information about the programs used

    • Quality indication

      • Annotations almost inevitably include some errors or inconsistencies

      • To what extent have annotations been checked?

      • What is the measured accuracy rate, and against what benchmark?
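As a minimal sketch of the first two rules (Python, using the LOB-style word_TAG format that appears in the examples later in this lecture): the raw text can be reconstructed by stripping the annotation, and the annotations can be extracted into a separate structure.

import re

# LOB-style annotated text (word_TAG format)
annotated = "hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN"

# Recoverability: strip the annotation to reconstruct the raw text
raw = re.sub(r"_\S+", "", annotated)
print(raw)   # hospitality is an excellent virtue

# Extricability: pull the annotations out into a separate structure
tags = [tuple(token.split("_", 1)) for token in annotated.split()]
print(tags)  # [('hospitality', 'NN'), ('is', 'BEZ'), ('an', 'AT'), ...]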


Theory-neutrality

  • Schools of thought

    • Annotations may reflect a particular theoretical approach, and this should be acknowledged

  • Consensus

    • Corpus annotations which are more (rather than less) theory-neutral will be more widely used

    • Given the amount of work involved, it pays to be aware of the descriptive traditions of the relevant field

  • Standards

    • There are very few absolute standards, but some schemes can become de facto standards through widespread use

    • For example, the BNC’s designers were aware of the likely side effects of any annotation decisions they took


Types of annotation

  • Plain corpus: the text appears in its raw state, as plain text

  • Corpus marked up for formatting attributes e.g. page breaks, paragraphs, font sizes

  • Corpus annotated with identifying information, such as title, author, genre, register, edition date

  • Corpus annotated with linguistic information

  • Corpus annotated with additional interpretive information, eg error analysis in learner corpus


Levels of linguistic annotation

  • Paragraph and sentence-boundary disambiguation

    • The naive heuristic (full stop + space + capital letter) is unreliable for genuine texts (see the sketch after this list)

    • May also involve distinguishing titles/headings from running text

  • Tokenization: identification of lexical units

    • multi-word units, cliticised words (eg can’t)

  • Lemmatisation: identification of lemmas (or lexemes)

    • Makes variant forms of a lexeme accessible for more generic searches

    • May involve some disambiguation (eg rose, which may belong to the lemma RISE or ROSE)
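To see why the naive boundary rule fails, here is a short Python sketch (the example sentence is invented for illustration):

import re

# Naive rule: a sentence ends at full stop + space + capital letter
naive_boundary = re.compile(r"\.\s+(?=[A-Z])")

text = "Dr. Smith arrived at 10 a.m. Mr. Jones was already there."
print(naive_boundary.split(text))
# ['Dr', 'Smith arrived at 10 a.m', 'Mr', 'Jones was already there.']
# Abbreviations such as "Dr." and "a.m." are wrongly treated as boundaries.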


Levels of linguistic annotation

  • POS tagging (grammatical tagging)

    • assigning to each lexical unit a code indicating its part of speech

    • the most basic type of linguistic corpus annotation, and an essential foundation for further forms of analysis (a worked example follows this list)

  • Parsing (treebanking)

    • Identification of syntactic relationships between words

  • Semantic tagging

    • Marking of word senses (sense resolution)

    • Marking of semantic relationships eg agent, patient

    • Marking with semantic categories eg human, animate
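POS tagging is easy to try out with an off-the-shelf tagger. A sketch using Python's NLTK library; note that NLTK tags with the Penn Treebank tagset rather than the LOB tagset used in the examples later, and that the resource names below may differ across NLTK versions:

import nltk

# one-off downloads of the tokenizer and tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The lovers had comparatively little to sing.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('lovers', 'NNS'), ('had', 'VBD'),
#       ('comparatively', 'RB'), ('little', 'JJ'), ('to', 'TO'),
#       ('sing', 'VB'), ('.', '.')]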


Levels of linguistic annotation

  • Discourse annotation

    • especially for transcribed speech

    • Identifying discourse function of text eg apology, greeting

    • or other pragmatic aspects, eg politeness level

  • Anaphoric annotation

    • Identification of pronoun reference

    • and other anaphoric links (eg different references to the same entity)

  • Phonetic transcription (only in spoken language corpora)

    • Indication of details of pronunciation not otherwise reflected in transcription, eg weak forms

    • Explicit indication of accent/dialect features eg vowel qualities, allophonic variation

  • Prosodic annotation (only in spoken language corpora)

    • Suprasegmental information, eg stress, intonation, rhythm


Some examples

PROSODIC ANNOTATION, LONDON-LUND CORPUS:

well ^very nice of you to ((come and)) _spare the !t\/ime and #

^come and !t\alk # -

^tell me a’bout the - !pr\oblems#

And ^incidentally# .

^I [@:] ^do ^do t\ell me#

^anything you ‘want about the :college in "!g\eneral

Source: Leech chapter in Garside et al. 1997


EXAMPLE OF PART-OF-SPEECH TAGGING, LOB CORPUS:

hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_, but_CC not_XNOT when_WRB the_ATI guests_NNS have_HV to_TO sleep_VB in_IN rows_NNS in_IN the_ATI cellar_NN !_! the_ATI lovers_NNS ,_, whose_WP$ chief_JJB scene_NN was_BEDZ cut_VBN at_IN the_ATI last_AP moment_NN ,_, had_HVD comparatively_RB little_AP to_TO sing_VB '_' he_PP3A stole_VBD my_PP$ wallet_NN !_! '_' roared_VBD Rollinson_NP ._.

EXAMPLE OF SKELETON PARSING, FROM THE SPOKEN ENGLISH CORPUS:

[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S]

Source: http://ucrel.lancs.ac.uk/annotation.html


ANAPHORIC ANNOTATION OF AP NEWSWIRE

S.1 The state Supreme Court has refused to release Rahway State Prison inmate James Scott on bail.

S.2 The fighter is serving 30-40 years for a 1975 armed robbery conviction.

S.3 Scott had asked for freedom while he waits for an appeal decision.

S.4 Meanwhile, his promoter, Murad Muhammed, said Wednesday he netted only $15,250 for Scott's nationally televised light heavyweight fight against ranking contender Yaqui Lopez last Saturday.

S.5 The fight, in which Scott won a unanimous decision over Lopez, grossed $135,000 for Muhammed's firm, Triangle Productions of Newark, he said.

S.1 (0) The state Supreme Court has refused to release

{1 [2 Rahway State Prison 2] inmate 1} (1 James Scott 1) on bail .

S.2 (1 The fighter 1) is serving 30-40 years for a 1975 armed robbery conviction .

S.3 (1 Scott 1) had asked for freedom while <1 he waits for an appeal decision .

S.4 Meanwhile , [3 <1 his promoter 3] , {3 Murad Muhammed 3} , said Wednesday <3 he netted only $15,250 for (4 [1 Scott 1] 's nationally televised light heavyweight fight against {5 ranking contender 5} (5 Yaqui Lopez 5) last Saturday 4) .

S.5 (4 The fight , in which [1 Scott 1] won a unanimous decision over (5 Lopez 5) 4) , grossed $135,000 for [6 [3 Muhammed 3] 's firm 6] , {6 Triangle Productions of Newark 6} , <3 he said .

Source: http://ucrel.lancs.ac.uk/annotation.html


SGML

  • Although none of the examples just shown uses it, SGML is widely recommended and used for all but the simplest of mark-up schemes

  • SGML = Standard Generalized Mark-up Language

  • Actually suitable for all sorts of things, including web pages (HTML is an application of SGML)


What is a mark-up language?

  • Mark-up historically referred to printer’s marks on a manuscript to indicate typesetting requirements.

  • Now covers all sorts of codes inserted into electronic texts to govern formatting, printing, or other processing.

  • Mark-up, or (synonymously) encoding, is defined as any means of making explicit an interpretation of a text.

  • By “mark-up language” we mean a set of mark-up conventions used together for encoding texts. A mark-up language must specify:

    • what mark-up is allowed

    • what mark-up is required

    • how mark-up is to be distinguished from text

    • what the mark-up means

  • SGML provides the means for doing the first three

  • Separate documentation/software is required for the last

    • eg (1) the difference between identifying something as <emph> and how that appears in print; (2) why something may or may not be tagged as a “relative clause”


Rules of SGML

  • SGML allows us to define

    • Elements

    • Specific features of elements

    • Hierarchical/structural relations between elements

  • These are specified in a “document type definition” (DTD)

  • DTD allows software to be written to

    • Help annotators annotate consistently

    • Explore marked-up documents


Elements in SGML

  • Have a (unique) name

  • Semantics of name are application dependent

    • up to designer to choose appropriate name, but nothing automatically follows from the choice of any particular name

  • Each element must be explicitly marked or tagged in some way

    • Most usual is with <element> and </element> pairs, called start- and end-tags

    • Much SGML-compliant software seems to allow start-only tags

    • &element; (esp. useful for single words or characters)

    • _tag suffix


Attributes

  • Elements can have named attributes with associated values

  • When an attribute is declared, its default behaviour can be specified as one of (illustrated below):

    • #REQUIRED: a value must be specified

    • #IMPLIED: the value is optional

    • #CURRENT: inferred to be the same as the last specified value for that attribute

  • Values can be from a predefined list, or can be of a general type (string, integer, etc)
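A hypothetical attribute-list declaration illustrating the three keywords (the element and attribute names are invented for this example):

<!ATTLIST utterance
    id      ID     #REQUIRED  -- must be given on every utterance --
    dialect CDATA  #IMPLIED   -- may be omitted --
    speaker CDATA  #CURRENT   -- defaults to the last value given -->

Under this declaration every <utterance> needs an explicit id, dialect is optional, and once a speaker has been specified, later utterances are assumed to have the same speaker until a new value is given.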


DTD (Document type definition)

  • Helps to impose uniformity over the corpus

  • Defines the (expected or to-be-imposed) structure of the document

  • For each element, defines

    • How it is tagged (whether start- and end-tags may be omitted)

    • What its substructure is, ie which elements it contains, how many of them, and whether they are compulsory or not


Example of DTD

<!ELEMENT anthology - - (poem+)>

<!ELEMENT poem - - (title?, stanza+ | couplet+)>

<!ELEMENT title - O (#PCDATA) >

<!ELEMENT stanza - O (line+) >

<!ELEMENT couplet - O (cline, cline) >

<!ELEMENT (line | cline) O O (#PCDATA) >

  • Start- and end-tags necessary (-) or omissible (O); the first indicator is for the start-tag, the second for the end-tag

  • Anthology consists of 1 or more poems

  • Poem has an optional title, then 1 or more stanzas or 1 or more couplets

  • Title consists of “parsed character data”, ie normal text

  • Stanza has one or more lines; couplet has exactly two clines

  • Both lines and clines have the same definition: normal text
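The tag-minimization indicators are SGML-specific and were dropped in XML. As a sketch, an XML-style version of this DTD, with the indicators removed and the stanza/couplet alternatives explicitly grouped (as XML requires), can be checked mechanically with Python's lxml library:

from io import StringIO
from lxml import etree

# XML-style equivalent of the anthology DTD above
dtd = etree.DTD(StringIO("""
<!ELEMENT anthology (poem+)>
<!ELEMENT poem (title?, (stanza+ | couplet+))>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (line+)>
<!ELEMENT couplet (cline, cline)>
<!ELEMENT line (#PCDATA)>
<!ELEMENT cline (#PCDATA)>
"""))

doc = etree.fromstring(
    "<anthology><poem><title>T</title>"
    "<stanza><line>one line</line></stanza></poem></anthology>")
print(dtd.validate(doc))   # True: the document obeys the DTD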


Attributes

<!ATTLIST poem

id ID #IMPLIED

status (draft | revised | published) draft >

  • DTD defines the attributes expected/required for each element

  • A poem has an id and a status

  • Value of id is any identifier, and is optional

  • Status is one of three values, default draft


EXAMPLE DOCUMENT INSTANCE:

<anthology>

<poem id=12 status=revised>

<title>It’s a grand old team</title>

<stanza>

<line>It’s a grand old team to play for

<line>It’s a grand old team to support

<line>And if you know your history

<line>It’s enough to make your heart go

Whoooooah

</stanza>

</poem>

<poem id=13>

...

</poem>

</anthology>
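Notice that the <line> end-tags are omitted, as the DTD allows (- O), and that the attribute values are unquoted, which SGML also permits. A strict XML parser would reject this document, but the structure can still be recovered by a tolerant parser; a rough sketch using Python's built-in html.parser:

from html.parser import HTMLParser

class LineExtractor(HTMLParser):
    # Collect the text content of each <line> element
    def __init__(self):
        super().__init__()
        self.in_line = False
        self.lines = []
    def handle_starttag(self, tag, attrs):
        if tag == "line":
            self.in_line = True
            self.lines.append("")
        else:
            self.in_line = False  # a new element implicitly ends the line
    def handle_endtag(self, tag):
        self.in_line = False
    def handle_data(self, data):
        if self.in_line:
            self.lines[-1] += data

parser = LineExtractor()
parser.feed("<stanza><line>It's a grand old team to play for"
            "<line>It's a grand old team to support</stanza>")
print(parser.lines)
# ["It's a grand old team to play for", "It's a grand old team to support"]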


Mark-up exemplified

RAW TEXT:

Two men retained their marbles, and as luck would have it they're both roughie-toughie types as well as military scientists - a cross between Albert Einstein and Action Man!

TOKENIZED TEXT:

<w orth=CAP>Two</w> <w>men</w> <w>retained</w> <w>their</w> <w>marbles</w><c PUN>,</c> <w>and</w> <w>as</w> <w>luck</w> <w>would</w> <w>have</w> <w>it</w> <w>they</w><w>'re</w> <w>both</w>

<w>roughie-toughie</w> <w>types</w> <w>as</w> <w>well</w> <w>as</w> <w>military</w> <w>scientists</w> <c PUN>&mdash;</c> <w>a</w> <w>cross</w> <w>between</w> <w orth=CAP>Albert</w>

<w orth=CAP>Einstein</w> <w>and</w>

<w orth=CAP>Action</w> <w orth=CAP>Man</w><c PUN>!</c>
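A rough sketch of how such word-level mark-up might be generated (Python; the regular expression here is a deliberately crude approximation of real tokenization rules, and the spacing differs slightly from the example above):

import re

def mark_up(text):
    out = []
    # split into words, clitics like 're, and single punctuation marks
    for token in re.findall(r"'\w+|[\w-]+|[^\w\s]", text):
        if re.fullmatch(r"[^\w\s]", token):
            out.append(f"<c PUN>{token}</c>")
        elif token[0].isupper():
            out.append(f"<w orth=CAP>{token}</w>")
        else:
            out.append(f"<w>{token}</w>")
    return " ".join(out)

print(mark_up("Two men retained their marbles, and they're both tough."))
# <w orth=CAP>Two</w> <w>men</w> ... <w>they</w> <w>'re</w> ... <c PUN>.</c>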


LEMMATIZED TEXT:

<w orth=CAP>Two</w> <w lem=man>men</w>

<w lem=retain>retained</w> <w>their</w>

<w lem=marble>marbles</w><c PUN>,</c> <w>and</w> <w>as</w> <w>luck</w> <w>would</w> <w>have</w> <w>it</w> <w>they</w><w lem=be>'re</w> <w>both</w>

<w>roughie-toughie</w> <w lem=type>types</w> <w>as</w> <w>well</w> <w>as</w> <w>military</w> <w lem=scientist>scientists</w> <c PUN>&mdash;</c> <w>a</w> <w>cross</w> <w>between</w>

<w orth=CAP>Albert</w> <w orth=CAP>Einstein</w> <w>and</w> <w orth=CAP>Action</w>

<w orth=CAP>Man</w><c PUN>!</c>


POS TAGGED TEXT:

<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w>

<w VVD lem=retain>retained</w> <w DPS>their</w>

<w NN2 lem=marble>marbles</w><c PUN>,</c>

<w CJC>and</w> <w CJS>as</w> <w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w> <w PNP>it</w>

<w PNP>they</w><w VBB lem=be>'re</w>

<w AV0>both</w> <w AJ0>roughie-toughie</w>

<w NN2>types</w> <w AV0>as</w> <w AV0>well</w>

<w CJS>as</w> <w AJ0>military</w>

<w NN2>scientists</w> <c PUN>&mdash;</c>

<w AT0>a</w> <w NN1>cross</w> <w PRP>between</w> <w NP0>Albert</w> <w NP0>Einstein</w>

<w CJC>and</w> <w NN1>Action</w>

<w NN1-NP0>Man</w><c PUN>!</c>
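Once the text is marked up like this, simple corpus searches become straightforward; a sketch (Python) pulling every plural noun (NN2) out of a fragment of the tagged text above:

import re

tagged = ("<w NN2 lem=marble>marbles</w><c PUN>,</c> "
          "<w AJ0>military</w> <w NN2>scientists</w>")

# find every <w ...> element whose code list contains NN2
plural_nouns = re.findall(r"<w [^>]*\bNN2\b[^>]*>([^<]+)</w>", tagged)
print(plural_nouns)   # ['marbles', 'scientists']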


POS TAGGED TEXT with idioms and named entities:

<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w>

<phrase type=idiom><w VVD lem=retain>retained</w> <w DPS>their</w>

<w NN2 lem=marble>marbles</w></phrase><c PUN>,</c>

<w CJC>and</w> <phrase type=idiom><w CJS>as</w> <w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w> <w PNP>it</w></phrase>

<w PNP>they</w><w VBB lem=be>'re</w>

<w AV0>both</w> <w AJ0>roughie-toughie</w>

<w NN2>types</w>

<phrase type=compound pos=CJS><w AV0>as</w>

<w AV0>well</w> <w CJS>as</w></phrase>

<phrase type=compound pos=NN2><w AJ0>military</w> <w NN2>scientists</w></phrase> <c PUN>&mdash;</c>

<w AT0>a</w> <w NN1>cross</w> <w PRP>between</w> <phrase type=compound pos=NP0><w NP0>Albert</w> <w NP0>Einstein</w></phrase>

<w CJC>and</w>

<phrase type=compound pos=NP0><w NN1>Action</w>

<w NN1-NP0>Man</w></phrase><c PUN>!</c>

