Introduction to Corpus Linguistics

Xiaofei Lu

APLNG 482Y

November 11, 2008

Overview
  • What is a corpus
  • Corpus design and compilation
  • Corpus annotation
  • Corpus querying and analysis
  • Resources
What is a corpus?
  • Leech (1992):
    • an unexciting phenomenon, a helluva lot of text, stored on a computer
  • Francis (1982):
    • a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis
  • Sinclair (1991):
    • a collection of naturally-occurring language text, chosen to characterise a state or a variety of language
Types of corpora
  • General-purpose vs. specialized corpora
    • The British National Corpus
    • Michigan Corpus of Academic Spoken English
  • Native vs. learner corpora
    • International Corpus of Learner English
  • Monolingual vs. parallel & comparable corpora
    • The JRC-Acquis Multilingual Parallel Corpus
    • The English-Chinese Parallel Concordancer
  • Corpora representing one or diverse language varieties
    • International Corpus of English
  • Synchronic vs. diachronic corpora
  • Spoken vs. written corpora
Corpus design
  • Purpose, type
  • Content: mode, interaction, domain, medium
  • Structure, size: comparability, proportions
  • Data sources, sampling
  • Design of the BNC
Corpus annotation
  • Why annotate
  • Levels of corpus annotation
  • Difficulties for corpus annotation
Why annotate
  • For linguistic research
    • Allow more effective corpus searches
  • For natural language processing
    • Spelling and grammar checking
    • Machine translation
    • Question answering
Levels of corpus annotation
  • Sentence and word segmentation
  • Part-of-speech (POS) tagging
  • Syntactic parsing
  • Semantic, pragmatic and discourse annotation
  • Learner corpora: error annotation
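The first annotation level above, sentence and word segmentation, can be illustrated with a minimal sketch. It assumes English text and uses naive regular-expression rules; the function names and the second example sentence are hypothetical, and real segmenters handle abbreviations such as "Dr." that these rules would split incorrectly.

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive word segmentation: a word (optionally with internal apostrophes
    or hyphens) or a single punctuation mark."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

# Illustrative input only; the second sentence is made up for this sketch.
text = "I saw a pig with binoculars. Was it looking back?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sent, toks in zip(sentences, tokens):
    print(sent, "->", toks)
```

Later annotation levels (POS tagging, parsing) take token lists like these as input, so segmentation errors tend to propagate upward.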
Difficulties for corpus annotation
  • Ambiguity
    • I saw a pig with binoculars.
    • Problems for tagging, parsing, and word sense disambiguation (WSD)
  • Unknown words
    • Identification
    • POS tagging
    • Semantic annotation
Corpus querying and analysis
  • Using Windows- or web-based software
    • Good for processing raw corpora
    • Word frequency, concordances, lexical bundles, and keyword lists
    • Examples: AntConc and GOLD
  • Using natural language processing tools
    • Good for processing annotated corpora
    • Extracting occurrences of grammatical patterns
    • Examples: Stanford Parser and Tregex
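The raw-corpus operations listed above (word frequency, concordances, lexical bundles) can be sketched in a few lines. This is a rough illustration, not how AntConc or GOLD work internally; the function names and the sample text are hypothetical, and the n-gram counter only approximates lexical bundles, which are usually defined with additional frequency and range thresholds.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic words (with an optional apostrophe part)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def concordance(tokens, node, width=3):
    """Return key-word-in-context lines with `width` tokens on each side of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def ngrams(tokens, n):
    """Count contiguous n-token sequences (candidate lexical bundles)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Made-up sample text for illustration.
sample = "the cat sat on the mat and the cat slept on the mat"
tokens = tokenize(sample)
freqs = Counter(tokens)                      # word frequency list
kwic = concordance(tokens, "cat", width=2)   # concordance lines for "cat"
bundles = ngrams(tokens, 3)                  # trigram counts

print(freqs.most_common(3))
print(kwic)
print(bundles.most_common(1))
```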
Interpreting corpus data
  • Are frequency differences statistically significant?
    • w appears x times in an n-word corpus, and y times in an m-word corpus
    • Chi-square test
    • Fisher’s Exact Test
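The comparison above can be set up as a 2×2 table: w vs. all other words, in corpus 1 vs. corpus 2. The sketch below uses only the standard library and makes several assumptions: the chi-square statistic is computed without Yates' continuity correction, the Fisher p-value is two-sided (summing all tables no more probable than the observed one), and the counts in the example are hypothetical. Both functions assume w occurs at least once overall and that neither corpus consists only of w.

```python
from math import comb, erfc, sqrt

def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (1 degree of freedom, no continuity correction)
    for w occurring x times in n words and y times in m words."""
    a, b = x, n - x          # corpus 1: w, other words
    c, d = y, m - y          # corpus 2: w, other words
    total = n + m
    chi2 = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(chi2 / 2))  # upper-tail probability of chi-square with 1 df
    return chi2, p

def fisher_exact_2x2(x, n, y, m):
    """Two-sided Fisher's exact test via the hypergeometric distribution."""
    hits = x + y
    denom = comb(n + m, hits)

    def prob(k):
        # Probability that k of the `hits` occurrences fall in corpus 1.
        return comb(n, k) * comb(m, hits - k) / denom

    observed = prob(x)
    lo, hi = max(0, hits - m), min(n, hits)
    p = sum(pk for pk in (prob(k) for k in range(lo, hi + 1))
            if pk <= observed * (1 + 1e-9))
    return min(1.0, p)

# Hypothetical counts: w appears 30 times in 1,000 words vs. 10 times in 1,000 words.
chi2, p_chi = chi_square_2x2(30, 1000, 10, 1000)
p_fisher = fisher_exact_2x2(30, 1000, 10, 1000)
print(f"chi2 = {chi2:.3f}, p = {p_chi:.4f}; Fisher p = {p_fisher:.4f}")
```

Fisher's exact test is often preferred when expected cell counts are small, which is where the chi-square approximation becomes unreliable.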
Interpreting corpus data (cont.)
  • Collocation analysis
    • How strongly are X and Y associated?
    • Mutual information: compares the observed co-occurrence frequency of (X, Y) with the frequency expected if X and Y were independent
    • t-test: measures how confidently a strong association between X and Y can be claimed
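Both association measures can be computed from four counts. The sketch below assumes the formulations commonly used in corpus linguistics: expected frequency E = f(X)·f(Y)/N, MI = log2(O/E), and t = (O − E)/√O. Some tools also scale E by the collocation window size, which is ignored here. The function names and counts are hypothetical.

```python
from math import log2, sqrt

def expected_frequency(f_x, f_y, n_tokens):
    """Co-occurrence frequency expected if X and Y were independent."""
    return f_x * f_y / n_tokens

def mutual_information(f_xy, f_x, f_y, n_tokens):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    return log2(f_xy / expected_frequency(f_x, f_y, n_tokens))

def t_score(f_xy, f_x, f_y, n_tokens):
    """t-score: (observed - expected) / sqrt(observed)."""
    return (f_xy - expected_frequency(f_x, f_y, n_tokens)) / sqrt(f_xy)

# Hypothetical counts: X occurs 100 times, Y 50 times, and they co-occur
# 20 times in a 100,000-word corpus.
mi = mutual_information(20, 100, 50, 100_000)
t = t_score(20, 100, 50, 100_000)
print(f"MI = {mi:.3f}, t = {t:.3f}")
```

MI tends to reward rare but exclusive pairings, while the t-score favors frequent pairs, so the two measures are often reported together.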
Resources
  • Books
    • Hunston (2002): Corpora in Applied Linguistics
    • McEnery, Xiao & Tono (2006): Corpus-Based Language Studies
  • Journals
    • International Journal of Corpus Linguistics
    • Corpora
  • Websites and mailing lists
    • Bookmarks for corpus-based linguists
    • Linguistic Data Consortium
    • The Corpora List
  • Corpus annotation and analysis tools
    • Stanford Natural Language Processing Group
  • Places for exploration
    • MICASE
    • BNC Online
  • Courses on corpus linguistics
    • Computational and Statistical Methods for Corpus Analysis (Summer 2009)
    • Seminar on Applied Corpus Linguistics (Fall 2009)
Note on research project design
  • Purpose of project
  • Corpus compilation and annotation
  • Corpus analysis
    • Bottom-up: from observations of recurring patterns to hypotheses and generalizations
    • Top-down: start with given categories and search for evidence of use and variance
  • Caution on generalizability