
Introduction to Corpus Linguistics



Presentation Transcript


  1. Introduction to Corpus Linguistics Xiaofei Lu APLNG 482Y November 11, 2008

  2. Overview • What is a corpus • Corpus design and compilation • Corpus annotation • Corpus querying and analysis • Resources

  3. What is a corpus? • Leech (1992): • an unexciting phenomenon, a helluva lot of text, stored on a computer • Francis (1982): • a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis • Sinclair (1991): • a collection of naturally-occurring language text, chosen to characterise a state or a variety of language

  4. Types of corpora • General-purpose vs. specialized corpora • The British National Corpus • Michigan Corpus of Academic Spoken English • Native vs. learner corpora • International Corpus of Learner English • Monolingual vs. parallel & comparable corpora • The JRC-Acquis Multilingual Parallel Corpus • The English-Chinese Parallel Concordancer • Corpora representing one or diverse language varieties • International Corpus of English • Synchronic vs. diachronic corpora • Spoken vs. written corpora

  5. Corpus design • Purpose, type • Content: mode, interaction, domain, medium • Structure, size: comparability, proportions • Data sources, sampling • Design of the BNC

  6. Corpus annotation • Why annotate • Levels of corpus annotation • Difficulties for corpus annotation

  7. Why annotate • For linguistic research • Allow more effective corpus searches • For natural language processing • Spelling and grammar checking • Machine translation • Question answering

  8. Levels of corpus annotation • Sentence and word segmentation • Part-of-speech (POS) tagging • Syntactic parsing • Semantic, pragmatic and discourse annotation • Learner corpora: error annotation
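
The first annotation level on this slide, sentence and word segmentation, can be illustrated with a deliberately naive sketch. The regular expressions and the function names `split_sentences` and `split_words` are assumptions made for illustration, not a tool named in the slides; real segmenters have to handle abbreviations, numbers, and languages written without spaces.

```python
import re

def split_sentences(text):
    """Naively split after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "I saw a pig. It saw me! Did it run?"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]

# An abbreviation shows why the naive rule is not enough:
# "Dr." is wrongly treated as the end of a sentence.
tricky = split_sentences("Dr. Smith arrived. He sat down.")
```

The `tricky` case comes back as three pieces rather than two, which is the kind of error that motivates trained or rule-rich segmenters.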

  9. Difficulties for corpus annotation • Ambiguity • I saw a pig with binoculars. • Problems for tagging, parsing, & WSD • Unknown words • Identification • POS tagging • Semantic annotation

  10. Corpus querying and analysis • Using Windows- or web-based software • Good for processing raw corpora • Word frequency, concordances, lexical bundles, and keyword lists • Examples: AntConc and GOLD • Using natural language processing tools • Good for processing annotated corpora • Extracting occurrences of grammatical patterns • Examples: Stanford parser and Tregex
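
The raw-corpus operations listed on this slide (word frequency, concordances, lexical bundles) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the crude tokenizer, the toy sample text, and the function names are hypothetical, and it does not reproduce how AntConc or GOLD work internally.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (crude)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, window=3):
    """Keyword-in-context lines: up to `window` tokens on each side of a hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

def lexical_bundles(tokens, n=3, min_freq=2):
    """Recurring n-word sequences that occur at least `min_freq` times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_freq}

sample = "I saw a pig with binoculars. The pig saw me. I saw a pig again."
tokens = tokenize(sample)
freqs = Counter(tokens)                       # word frequency list
kwic = concordance(tokens, "pig", window=2)   # concordance lines for "pig"
bundles = lexical_bundles(tokens, n=3)        # repeated three-word sequences
```

On the toy sample, `freqs.most_common()` gives a frequency list, `kwic` holds one line per occurrence of "pig", and `bundles` keeps only the three-word sequences that recur.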

  11. Interpreting corpus data • Are frequency differences statistically significant? • w appears x times in an n-word corpus, and y times in an m-word corpus • Chi-square test • Fisher’s Exact Test
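
The comparison on this slide (w appears x times in an n-word corpus and y times in an m-word corpus) amounts to a 2×2 table of "w" versus "all other words" in each corpus. The sketch below computes both tests with the standard library only. The function names and the example counts are assumptions for illustration; the chi-square version applies no continuity correction, and the two-sided Fisher p-value sums all tables that are no more probable than the observed one, which is one common convention.

```python
import math

def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (1 degree of freedom, no Yates correction) and its p-value."""
    a, b = x, n - x          # corpus 1: occurrences of w, all other tokens
    c, d = y, m - y          # corpus 2: occurrences of w, all other tokens
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    chi2 = (n + m) * (a * d - b * c) ** 2 / denom
    p = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square with df = 1
    return chi2, p

def fisher_exact_2x2(x, n, y, m):
    """Two-sided Fisher's Exact Test p-value for the same 2x2 table."""
    hits = x + y
    denom = math.comb(n + m, hits)

    def prob(k):
        # Hypergeometric probability that k of the `hits` fall in corpus 1.
        return math.comb(n, k) * math.comb(m, hits - k) / denom

    p_obs = prob(x)
    lo, hi = max(0, hits - m), min(hits, n)
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: 120 hits in 100,000 words vs. 80 hits in 100,000 words.
chi2, p_chi = chi_square_2x2(120, 100_000, 80, 100_000)
p_fisher = fisher_exact_2x2(120, 100_000, 80, 100_000)
```

Fisher's Exact Test is usually preferred when expected counts are small, since the chi-square approximation becomes unreliable there.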

  12. Interpreting corpus data (cont.) • Collocation analysis • How strongly are x and y associated • Mutual information - Measures difference between observed and expected frequencies of (X,Y) • T-test - Measures confidence with which to claim strong association between X and Y
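
Both association measures on this slide compare the observed co-occurrence frequency of X and Y with the frequency expected if the two words were independent. The sketch below uses one common formulation from the collocation literature: MI = log2(O / E) and t = (O - E) / sqrt(O), with E = f(X) · f(Y) / N. The function names and the counts are hypothetical, and other variants of both scores exist.

```python
import math

def expected_cooccurrence(freq_x, freq_y, corpus_size):
    """Expected co-occurrence count if X and Y were independent."""
    return freq_x * freq_y / corpus_size

def mutual_information(observed, freq_x, freq_y, corpus_size):
    """Pointwise mutual information in bits: log2(observed / expected)."""
    expected = expected_cooccurrence(freq_x, freq_y, corpus_size)
    return math.log2(observed / expected)

def t_score(observed, freq_x, freq_y, corpus_size):
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = expected_cooccurrence(freq_x, freq_y, corpus_size)
    return (observed - expected) / math.sqrt(observed)

# Hypothetical counts: X occurs 1,000 times and Y 500 times in a
# 1,000,000-word corpus, and they co-occur 32 times.
mi = mutual_information(32, 1000, 500, 1_000_000)
t = t_score(32, 1000, 500, 1_000_000)
```

MI tends to reward rare but exclusive pairings, while the t-score favours pairs with high absolute frequency, so the two are often reported side by side.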

  13. Resources • Books • Hunston (2002): Corpora in Applied Linguistics • McEnery et al. (2006): Corpus-Based Language Studies • Journals • International Journal of Corpus Linguistics • Corpora • Websites and mailing lists • Bookmarks for Corpus-based Linguists • Linguistic Data Consortium • The Corpora List

  14. Resources • Corpus annotation and analysis tools • Stanford Natural Language Processing Group • Places for exploration • MICASE • BNC Online • Courses on corpus linguistics • Computational and Statistical Methods for Corpus Analysis (Summer 2009) • Seminar on Applied Corpus Linguistics (Fall 2009)

  15. Note on research project design • Purpose of project • Corpus compilation and annotation • Corpus analysis • Bottom-up: from observations of recurring patterns to hypothesis and generalizations • Top-down: start with given categories and search for evidence of use and variance • Caution on generalizability
