
Corpus annotation and retrieval: an introduction

Paul Rayson

Computing Department, Lancaster University

Dawn Archer

School of Humanities, University of Central Lancashire


Session outline

  • What is a corpus?

  • What is corpus linguistics?

  • Applying these techniques to historical data

  • What research questions can we answer with CL techniques

    … in linguistics …?

    … in computing …?

    … in history …?

Text Mining for Historians

July 17-18 2007 Glasgow University


1. Background

Corpora, corpus linguistics, annotation, retrieval methods


Underlying assumption

  • Intuition is not enough to study language …

    • Reaction to Noam Chomsky’s focus on introspection in 1950s/60s

      • Empirical observation of naturally occurring data versus theory of how human language processing is actually undertaken

What is a corpus?

  • Old meaning = “body of text” (Latin)

  • Now = (any) “collection of texts or language examples” – usually in an electronic format

  • Demonstrates the extent to which the revival of corpus linguistics (CL) was led by advances in computing technology

A corpus tends to be “representative”

i.e. a balanced sample of a language or a particular variety of language (cf. national corpora: British, American, Czech, Polish …)

Reasoning?

  • Helps to remove intuitive bias

  • Helps us to find common/rare phenomena

    Exceptions …?

And large …

… because size helps us to:

  • Establish norms about the variety being studied

  • Reveal lots of cases of rare features of language

  • Zipf’s law
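
Zipf's law observes that a word's frequency is roughly inversely proportional to its rank in a frequency list, so most words in any corpus are rare. The following is a minimal sketch of a frequency profile, not part of the original slides; the regex tokeniser and the toy sentence are assumptions made purely for illustration.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, frequency) pairs sorted from most to least frequent."""
    # Naive tokeniser (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()

sample = "the cat sat on the mat and the dog sat on the log"
ranked = frequency_list(sample)
for rank, (word, freq) in enumerate(ranked, start=1):
    # Under Zipf's law, rank * freq stays roughly constant in a large corpus.
    print(rank, word, freq, rank * freq)
```

On a text this small the rank × frequency product is not informative; the pattern only emerges with corpora of the sizes discussed here.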

Size matters!

  • Web (present day): ? billion words

  • BNC (1990s): 100 million words

  • Brown/LOB (1960s): 1 million words

  • Web (future): ? billions of words

  • Collins Bank of English, Cambridge International Corpus, Oxford English Corpus (2006): 600 million – 1 billion words

  • Birmingham corpus (1980): 10 million words

So what is corpus linguistics?

= the “study of language using corpora”

= an empirical methodology

= a useful means of exploring:

  • Synchronic and diachronic variation

  • Syntax, semantics, pragmatics

  • Lexicography

  • Dialects, minority languages

  • Not just English

Corpus techniques we utilise

  • Retrieval

    • Frequency profiling

    • Concordancing

    • Collocations

    • Key words

    • Key domains

  • Annotation

    • POS tagging

    • Semantic tagging

  • Annotation

    • Part-of-speech tagging

    • Semantic field tagging

  • Retrieval

    • Frequency lists

    • Concordances
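
A concordance lists every occurrence of a search word together with its immediate context (often called key word in context, or KWIC). The sketch below is illustrative only: the tokeniser, the `width` parameter and the toy sentence are assumptions, not a description of any particular concordancer.

```python
import re

def concordance(text, keyword, width=3):
    """Return key-word-in-context lines: `width` tokens either side of each hit."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}")
    return lines

sample = "He came to Town at Twelve or One, and so to the Town gate."
for line in concordance(sample, "town"):
    print(line)
```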

Key words

What are “key words”?

And why are they so useful?

[Diagram: a text compared with another text or a reference corpus]

Key words

If we compare text A with text B, we can discover the most significant items within text A, and not only the most frequent ones.

Word Clouds
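
The comparison of text A with text B can be sketched as below. The slides do not name a statistic; log-likelihood is assumed here because it is a common choice for keyness. The function names and the two toy texts are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in c tokens (text A) and b times in d tokens (text B)."""
    e1 = c * (a + b) / (c + d)  # expected frequency in A if A and B did not differ
    e2 = d * (a + b) / (c + d)  # expected frequency in B
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def key_words(tokens_a, tokens_b):
    """Rank words that are relatively more frequent in A than in B."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    c, d = len(tokens_a), len(tokens_b)
    scores = {}
    for word, a in freq_a.items():
        b = freq_b[word]
        if a / c > b / d:  # overused in A relative to B
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text_a = "the court heard the witness and the witness answered the court".split()
text_b = "the dog saw the cat and the cat saw the dog".split()
ranked_keys = key_words(text_a, text_b)
print(ranked_keys)
```

Note that "the" is the most frequent word in both toy texts but is not key, because its relative frequency is the same in each.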

Collocations

  • Collocation = a relationship between words that tend to occur together in texts

    • Words that tend to occur near word X are the collocates of word X (consider “fish and XXXXX”)

    • Based on frequency (how frequent separate vs. how frequent together)

  • The company a word keeps: implicit associations or assumptions

    • Bachelor: eligible, flat, life, days

    • Spinster: elderly, widows, sisters, parish
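
The idea of comparing how often two words occur separately with how often they occur together can be sketched with pointwise mutual information (PMI). PMI is an assumption here, chosen as one common association measure; the slides do not say which measure is used. This simplified version ignores window-size normalisation, and the `window` and `min_count` parameters and the toy sentence are hypothetical.

```python
import math
from collections import Counter

def collocates(tokens, node, window=2, min_count=2):
    """Score words occurring within `window` tokens of `node` by simplified PMI."""
    n = len(tokens)
    freq = Counter(tokens)
    together = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(n, i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    together[tokens[j]] += 1
    scores = {}
    for word, joint in together.items():
        if joint >= min_count:
            # Observed co-occurrence compared with what chance alone would predict.
            scores[word] = math.log2(joint * n / (freq[node] * freq[word]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = "fish and chips are nice but fish and chips and peas are nicer than fish and rice".split()
fish_collocates = collocates(tokens, "fish")
print(fish_collocates)
```

Here "and" co-occurs with "fish" more often than "chips" does, but "chips" scores higher because it is rarer overall, which is the point of weighing joint against separate frequency.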

Corpus software

Modern methods in an historical setting (focussing on the EModE period)

  • Tools/methods don’t take account of spelling variation

    • Variant spelling detector (VARD)

  • The need to use historically valid taxonomies or thesauri, or revise our existing modern tagsets

    • Historical Thesaurus of English

    • Spevack (1993)

Using automated systems of annotation on historical texts is problematic …

EModE texts pose the following “problems”:

  • Archaic -eth and -(e)st verb suffixes, e.g. doth, hath, hast, sayeth, etc., which persist in specialised contexts: religious and poetic usage

  • Fused forms, e.g. ’Tis (It is)

  • Spellings that are variable even in modern-day usage, e.g. center/centre, skilful/skillful/skilfull, the suffixes -or/-our, -ise/-ize

  • Archaic forms like howbeit, betwixt, for which no obvious modern equivalent exists

  • Compound words written as separate words, e.g. it self, now adays, in stead

  • Proper names of Latin origin that are sometimes modernised, e.g. Galilaeo (Galileo)

  • Variation due to different conventions and compositing practices

Previous work in …

  • Corpus linguistics

  • Information Retrieval

  • Natural language processing

Approaches taken:

  • Fuzzy search engine

    • Aimed at successful retrieval for novice users without expertise in the text

    • Expand the search term using known letter replacements

  • Changing the dictionary built into corpus annotation software

    • Back-dating inbuilt dictionaries by adding historical variants
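
Expanding a search term with known letter replacements can be sketched as follows. This is an illustration under stated assumptions, not the fuzzy search engine the slides refer to: the `REPLACEMENTS` table is hypothetical and covers only a few interchangeable letters (u/v, i/y, j/i).

```python
from itertools import product

# Hypothetical replacement rules for illustration only; a real system would
# use a curated, period-specific list.
REPLACEMENTS = {"u": ["u", "v"], "v": ["v", "u"], "i": ["i", "y"], "j": ["j", "i"]}

def expand_term(term):
    """Return every spelling produced by applying the letter replacements."""
    options = [REPLACEMENTS.get(ch, [ch]) for ch in term]
    return sorted({"".join(combo) for combo in product(*options)})

print(expand_term("love"))
print(expand_term("give"))
```

Each expanded form would then be submitted as an alternative query, so a user searching for the modern spelling still retrieves the historical variants. The number of expansions grows quickly with term length, which is one reason a dictionary-based approach may be preferred.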

Our scenario

  • POSTAGGER

  • SEMTAGGER

  • VARD: detect variant spellings and insert modern equivalents

An important point about the VARD

Although the VARD allows for the detection and “normalisation” of variants to their modern equivalents, it should be noted that ...

  • The original variants are retained in the text

  • We’re not carrying out spell checking per se (there was no “correct” spelling in the EModE period) ...

  • Our ultimate aim is to develop a system that automatically regularises variants within a text to their modernised forms so that historical corpora become more amenable to further annotation and analysis.
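
The two points above (insert a modern equivalent, but retain the original variant) can be sketched as follows. This is not VARD itself: the `VARIANTS` lookup table and the `<reg orig="…">` markup are assumptions made for illustration, and the sketch does not preserve capitalisation.

```python
import re

# Hypothetical variant-to-modern lookup, for illustration only.
VARIANTS = {"loue": "love", "haue": "have", "vpon": "upon"}

def normalise(text):
    """Replace known variants with modern forms, keeping the original in the markup."""
    def swap(match):
        word = match.group(0)
        modern = VARIANTS.get(word.lower())
        if modern is None:
            return word  # not a known variant: leave untouched
        return f'<reg orig="{word}">{modern}</reg>'
    return re.sub(r"[A-Za-z]+", swap, text)

print(normalise("I haue great loue"))
```

Because the original spelling survives in the attribute, downstream taggers can work on the modern form while researchers can still recover exactly what the source text said.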

2. Historical data


Existing corpora

  • What is already available:

    • LOB family, Brown family (20th Century)

      • 15 genres, including: press, religion, skills & hobbies, biography, learned, fiction (detective, science, adventure), romance, humour

    • Lampeter (1640-1740)

      • Religion, Politics, Economy, Science, Law and Misc.

    • Corpus of English Dialogues (1560-1760)

      • Trial proceedings, depositions, drama, prose fiction

    • Helsinki (Old, Middle and Early Modern English)

    • ARCHER (1650-1990, sampled at 50-year periods)

      • Journals, letters, fiction, news, medicine, science

Other historical texts – not compiled for corpus linguistics

Book Search

Changing English Across the 20th Century: a corpus-based study

ucrel.lancs.ac.uk/20thCenturyEnglish/

Leverhulme Trust (2005-7)

  • Project outputs

    • Compile a new corpus of British English called Lancaster1901

    • Enhance the encoding and annotation of Lancaster1901 and the three existing corpora (Lancaster1931, LOB and FLOB)

    • 10 conference presentations

    • 1 book chapter

    • 1 book

    • 2 journal articles

  • Background:

    • Recent observations of significant shifts among expressions of obligation/necessity in the period 1961-1991, e.g.

      • a decline of the central modals MUST and NEED

      • a spread of the semi-modals HAVE TO, NEED TO

  • Research questions

    • Are these changes recent?

    • How do these changes compare to the development of the semantic field of OBLIGATION/NECESSITY as a whole?

Application 2: Historical CL

In particular, courtroom research (1640+), from a linguistic perspective

Utilise a specially designed corpus, the Sociopragmatic Corpus, which has been annotated for:

  • age, gender, status and role

  • speech acts such as questions, requests and commands

<P 37>

[$ (^Record.^) $] <u stfunc="fol-ini" force="q" q="qy" qtype="d" qform="dec" speaker="s" spid="s4tgiles001" spsex="m" sprole1="re" spstatus="1" spage="8" addressee="s" adid="s4tgiles027" adsex="f" adrole1="w" adstatus="5" adage="x">He did not go out of your Company at all? </u>

[$ (^Ann.^) $] <u stfunc="res" force="h" a="ca" a2="ela" speaker="s" spid="s4tgiles027" spsex="f" sprole1="w" spstatus="5" spage="8" addressee="s" adid="s4tgiles001" adsex="m" adrole1="m" adstatus="1" adage="x">Yes about Ten a Clock.</u>

[$ (^Record.^) $] <u stfunc="fol" force="h" speaker="s" spid="s4tgiles001" spsex="m" sprole1="re" spstatus="1" spage="8" addressee="s" adid="s4tgiles027" adsex="f" adrole1="w" adstatus="5" adage="x">Woman you must be mistaken, he came to Town at Twelve or One, and might be in thy company, but it is plain he went to a Brokers in (^Long-lane^) , and so to the (^Artillery-Ground^) at (^Cripple-Gate^) , for I guess it might be so: Then they went to (^Whetstones-Park^) , and spent Six-Pence, and after that they went into (^Drury-lane^).</u>

[$ (^Giles,^) $] <u stfunc="rep" force="h" speaker="s" spid="s4tgiles005" spsex="m" sprole1="d" spstatus="1" spage="x" addressee="s" adid="s4tgiles001" adsex="m" adrole1="re" adstatus="1" adage="8">My Lord, she don't say she was with us all the while, but we came to an House where she was, and several other People our Neighbours. </u>
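
Annotation of this kind makes utterances retrievable by speaker attributes and speech-act category. The sketch below shows one way to read the attributes of each `<u>` element; it assumes the markup uses straight double quotes, uses regular expressions instead of a full XML parser, and works on an abridged copy of the first two utterances (only a subset of the attributes is kept).

```python
import re
from collections import Counter

UTTERANCE = re.compile(r"<u\s+([^>]*)>(.*?)</u>", re.DOTALL)
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')

def parse_utterances(markup):
    """Return one dict per <u> element: its attributes plus the spoken text."""
    results = []
    for attrs, text in UTTERANCE.findall(markup):
        record = dict(ATTRIBUTE.findall(attrs))
        record["text"] = text.strip()
        results.append(record)
    return results

# Abridged from the excerpt above; attribute values are copied unchanged.
sample = (
    '<u stfunc="fol-ini" force="q" spsex="m" sprole1="re" adsex="f" adrole1="w">'
    "He did not go out of your Company at all? </u>"
    '<u stfunc="res" force="h" spsex="f" sprole1="w" adsex="m" adrole1="m">'
    "Yes about Ten a Clock.</u>"
)
utterances = parse_utterances(sample)
# Count utterances by speaker role and force, e.g. to compare roles.
print(Counter((u["sprole1"], u["force"]) for u in utterances))
```

Counting over (speaker role, force) pairs in this way is the kind of retrieval that supports questions about who uses which speech acts, and to whom.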

Some important findings

  • Historical courtroom discourse is not just made up of questions and answers (even during examination sequences)

  • The frequency with which questions and directives were used, the function that they served, and their ability to achieve their social and/or interactional goal depended (in large part) on a number of socio-pragmatic factors:

    • type and date of trial

    • position in discourse

    • role of user & addressee

    • ultimate aim of interaction

  • 1640-1760 was a period of emerging and changing roles

  • Now beginning to explore the nineteenth century, i.e. the period in which the courtroom adopted advocacy in its modern form (Cairns 1998)

    • Utilising full trials: emerging need to consider opening/closing statements

Historical text mining (HTM)

[Diagram: HTM shown in relation to historical theory, natural language processing & computational linguistics, corpus linguistics, and linguistic theory; further labels: “Corpus”, “Empirical evidence to inform theory”, “Statistical and rule-based language models”]

3. Over to you …


What research questions would you like to answer, but can’t?

  • Search engines for new text collections and digital libraries

  • Named entity extraction for GIS

  • Variant spellings

  • Historical text mining

  • New research methods in History
