This overview delves into corpus linguistics research methods, covering corpus types, design, analysis, and future courses, with details on annotating corpora, obtaining linguistic data, annotation tools, and interpretative analysis levels.
Research methods in corpus linguistics • Xiaofei Lu
Overview • What is a corpus? • Types of corpora • Corpus design • Where to obtain corpora • Corpus annotation • Corpus analysis • Note on research project design • Exercises and demos in between • Future courses on corpus linguistics
What is a corpus? • Leech (1992): • an unexciting phenomenon, a helluva lot of text, stored on a computer • Francis (1982): • a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis • Sinclair (1991): • a collection of naturally-occurring language text, chosen to characterise a state or a variety of language
Types of corpora • General-purpose monolingual corpora • The British National Corpus • Specialized corpora • Lancaster Corpus of Academic Written English • Learner corpora • International Corpus of Learner English • Parallel & comparable corpora • The JRC-Acquis Multilingual Parallel Corpus • The English-Chinese Parallel Concordancer • Corpora and varieties • International Corpus of English • Synchronic and diachronic corpora
Corpus design • Purpose • Comparability • Type • Content: mode, interaction, domain, medium • Structure: proportions • Size • Sampling? • Design of the BNC
Where to obtain corpora • Linguistic Data Consortium • Bookmarks for corpus-based linguists • Ask on the corpora list • Compile your own corpora • Design your corpus • Getting permission • File format, metadata, and data markup • Text capture • Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX • Transcription tools, e.g., Transcriber • A Guide to Good Practice
Corpus annotation • Why annotate • Levels of corpus annotation • Difficulties for corpus annotation • Tools for corpus annotation
Why annotate • For linguistic research • Allow more effective corpus searches • For natural language processing • Spelling and grammar checking • Text summarization • Machine translation • Question answering
Levels of corpus annotation • Sentence segmentation • Word segmentation/tokenization • Part-of-speech (POS) tagging • Chunking/shallow parsing • Syntactic parsing • Semantic annotation • Pragmatic annotation • Parallel corpora: sentence alignment • Learner corpora: error annotation
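The two lowest annotation levels above, sentence segmentation and tokenization, can be sketched with regular expressions. This is a minimal illustration, not a production segmenter: the function names `split_sentences` and `tokenize` and both rules are assumptions made for this example, and the sentence rule will fail on abbreviations such as "Dr. Smith".

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at ., ! or ? when the
    # next non-space character is an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    # A token is either a word (allowing internal apostrophes or hyphens)
    # or a single punctuation character.
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

text = "I saw a pig with binoculars. Was it looking at me? It didn't say."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sent, toks in zip(sentences, tokens):
    print(sent, "->", toks)
```

Higher levels (POS tagging, parsing, semantic annotation) generally need trained tools rather than hand-written rules like these.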
Difficulties for corpus annotation • Ambiguity • I saw a pig with binoculars. • Problems for tagging, parsing, and word sense disambiguation (WSD) • Unknown words • Identification • POS tagging • Semantic annotation
Tools for corpus annotation • Bookmarks for corpus-based linguists • Corpora and Corpus Annotation Tools on the WWW • POS tagger demonstration • Sentence segmentation • POS tagging • Extracting NPs of the form DT NN NN • Dexter: Tools for analyzing language data
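The last demonstration step, extracting NPs of the form DT NN NN, can be sketched as a sliding window over POS-tagged tokens. The sketch assumes the text has already been tagged by some tagger and is available as (word, tag) pairs with Penn Treebank-style tags; the function name `extract_dt_nn_nn` and the sample sentence are made up for illustration.

```python
def extract_dt_nn_nn(tagged):
    """Return every three-word span whose tags are exactly DT, NN, NN."""
    pattern = ("DT", "NN", "NN")
    matches = []
    for i in range(len(tagged) - len(pattern) + 1):
        window = tagged[i:i + len(pattern)]
        if tuple(tag for _, tag in window) == pattern:
            matches.append(" ".join(word for word, _ in window))
    return matches

# Hand-tagged sample sentence (hypothetical tagger output).
tagged_sentence = [
    ("The", "DT"), ("computer", "NN"), ("program", "NN"),
    ("tagged", "VBD"), ("a", "DT"), ("newspaper", "NN"),
    ("article", "NN"), ("quickly", "RB"), (".", "."),
]
noun_phrases = extract_dt_nn_nn(tagged_sentence)
print(noun_phrases)
```

Matching on exact tags keeps the sketch simple; it will miss plural nouns (NNS) and proper nouns (NNP) unless the pattern is widened.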
Corpus analysis • Levels of corpus analysis • Tools for corpus analysis • Interpreting corpus data
Levels of corpus analysis • Word frequency lists • Concordances • Collocation (lexical patterning) • Colligation (syntactic patterning) • Keyword lists
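The first two levels, word frequency lists and concordances, are easy to sketch in a short script. The following is a minimal illustration under simple assumptions: lowercased alphabetic tokens, and a keyword-in-context (KWIC) display with a fixed number of words on each side. The function names are chosen for this example and do not correspond to any particular tool.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase alphabetic tokens, allowing one internal apostrophe.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens):
    # (word, count) pairs, most frequent first.
    return Counter(tokens).most_common()

def concordance(tokens, target, width=3):
    # One (left context, keyword, right context) triple per occurrence.
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = "The cat sat on the mat. The dog sat on the cat."
tokens = tokenize(text)
freqs = frequency_list(tokens)
kwic = concordance(tokens, "sat", width=2)

for left, key, right in kwic:
    print(f"{left:>12}  [{key}]  {right}")
```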
Tools for corpus analysis • Bookmarks for corpus-based linguists • Recommendations: • WordSmith Tools (not free) • AntConc (free) • TextStat (free) • Unix tools • Write your own scripts
Exercise (part 1) • Download and install AntConc • Download some text for processing • Project Gutenberg • Generate a word frequency list for your mini-corpus
Interpreting corpus data • Are frequency differences statistically significant? • w appears x times in an n-word corpus, and y times in an m-word corpus • Chi-square test (unreliable when expected cell counts are small) • Fisher’s Exact Test (exact for small counts; the standard form applies only to a 2×2 table)
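Both tests operate on a 2×2 table built from the four numbers above: the word occurs x times and does not occur n − x times in the first corpus, and likewise y and m − y in the second. The sketch below uses only the Python standard library. It computes Pearson's chi-square without a continuity correction and a two-sided Fisher's exact test; the function names and the example counts are assumptions for illustration, and a statistics package would normally be used instead.

```python
import math

def frequency_table(x, n, y, m):
    # Rows are corpora; columns are "is the word" / "is another word".
    return x, n - x, y, m - y

def chi_square_2x2(a, b, c, d):
    # Pearson chi-square for a 2x2 table; the p-value uses the
    # chi-square distribution with 1 degree of freedom.
    total = a + b + c + d
    chi2 = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def _log_comb(n, k):
    # log of "n choose k" via log-gamma, to cope with corpus-sized n.
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_2x2(a, b, c, d):
    # Two-sided p-value: sum the hypergeometric probabilities of all
    # tables (same margins) that are no more likely than the observed one.
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        return math.exp(_log_comb(row1, k) + _log_comb(row2, col1 - k)
                        - _log_comb(total, col1))

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-7))
    return min(1.0, p)

# Illustrative counts: 30 hits in 10,000 words vs. 10 hits in 10,000 words.
table = frequency_table(30, 10_000, 10, 10_000)
chi2, chi2_p = chi_square_2x2(*table)
fisher_p = fisher_exact_2x2(*table)
print(f"chi-square = {chi2:.2f}, p = {chi2_p:.4f}; Fisher p = {fisher_p:.4f}")
```

The Fisher routine loops over every possible count for the first cell, so its cost grows with x + y rather than with corpus size.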
Exercise (part 2) • Compare your word frequency list with that of the BNC • Anything interesting? • Run the chi-square test and Fisher’s Exact test on some interesting words
Interpreting corpus data (cont.) • Collocational analysis: How strongly are X and Y associated? • Mutual information (MI) • Compares the observed frequency of (X,Y) with the frequency expected by chance • Higher MI, stronger association • Unreliable for low frequencies, where it tends to overstate association • T-test • Measures the confidence with which to claim an association between X and Y • Higher t-score, greater confidence • Online calculations
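As a rough sketch, both scores can be computed from four counts: the corpus size N, the frequencies of X and Y, and the observed number of co-occurrences O. The formulas used here are the ones commonly given in corpus linguistics, with expected frequency E = f(X)·f(Y)/N, MI = log2(O/E), and t = (O − E)/√O. Some tools also scale E by the size of the collocation window; that adjustment is left out here, and the example counts are invented.

```python
import math

def expected_cooccurrences(freq_x, freq_y, corpus_size):
    # Co-occurrences expected if X and Y were independent (no window scaling).
    return freq_x * freq_y / corpus_size

def mutual_information(observed, freq_x, freq_y, corpus_size):
    # MI = log2(O / E); requires observed > 0.
    expected = expected_cooccurrences(freq_x, freq_y, corpus_size)
    return math.log2(observed / expected)

def t_score(observed, freq_x, freq_y, corpus_size):
    # t = (O - E) / sqrt(O); requires observed > 0.
    expected = expected_cooccurrences(freq_x, freq_y, corpus_size)
    return (observed - expected) / math.sqrt(observed)

# Invented counts: X occurs 1,000 times and Y 500 times in a
# 1,000,000-word corpus, and they co-occur 50 times.
mi = mutual_information(50, 1_000, 500, 1_000_000)
t = t_score(50, 1_000, 500, 1_000_000)
print(f"MI = {mi:.2f}, t = {t:.2f}")
```

Because O sits in the numerator of MI, a rare pair seen only once or twice can still receive a very high MI, while its t-score stays low; comparing the two scores is a simple check on low-frequency pairs.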
Exercise (part 3) • Generate a concordance for a target word • Find a word that co-occurs frequently with the target word • Test if the word is strongly associated with the target word
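For the second step of this exercise, one way to find candidate collocates without a concordancer is to count the words that fall within a fixed span around each occurrence of the target word. The sketch below is illustrative only: the function name `collocates`, the span of one word, and the toy sentence are all assumptions.

```python
from collections import Counter

def collocates(tokens, target, span=2):
    # Count every word within `span` positions of each target occurrence.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

tokens = "strong tea and strong coffee but powerful computers and strong tea".split()
counts = collocates(tokens, "strong", span=1)
print(counts.most_common(3))
```

Raw co-occurrence counts favor frequent function words such as "and", which is why the third step asks for an association measure rather than a count.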
Note on research project design • Purpose of project • Corpus compilation and annotation • Corpus analysis • Bottom-up: from observations of recurring patterns to hypotheses and generalizations • Top-down: start with given categories and search for evidence of use and variance • Caution on generalizability
Future courses on corpus linguistics • Spring 2007 • APLING 597E: Introduction to Corpus Linguistics • Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis • Spring 2008 • APLING 597: Seminar on Corpus Linguistics • Advanced seminar on using corpora for serious research projects