
Research methods in corpus linguistics

Research methods in corpus linguistics. Xiaofei Lu. Overview. What is a corpus? Types of corpora Corpus design Where to obtain corpora Corpus annotation Corpus analysis Note on research project design Exercises and demos in between Future courses on corpus linguistics.



Presentation Transcript


  1. Research methods in corpus linguistics Xiaofei Lu

  2. Overview • What is a corpus? • Types of corpora • Corpus design • Where to obtain corpora • Corpus annotation • Corpus analysis • Note on research project design • Exercises and demos in between • Future courses on corpus linguistics

  3. What is a corpus? • Leech (1992): • an unexciting phenomenon, a helluva lot of text, stored on a computer • Francis (1982): • a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis • Sinclair (1991): • a collection of naturally-occurring language text, chosen to characterise a state or a variety of language

  4. Types of corpora • General-purpose monolingual corpora • The British National Corpus • Specialized corpora • Lancaster Corpus of Academic Written English • Learner corpora • International Corpus of Learner English • Parallel & comparable corpora • The JRC-Acquis Multilingual Parallel Corpus • The English-Chinese Parallel Concordancer • Corpora and varieties • International Corpus of English • Synchronic and diachronic corpora

  5. Corpus design • Purpose • Comparability • Type • Content: mode, interaction, domain, medium • Structure: proportions • Size • Sampling? • Design of the BNC

  6. Where to obtain corpora • Linguistic Data Consortium • Bookmarks for corpus-based linguists • Ask on the corpora list • Compile your own corpora • Design your corpus • Getting permission • File format, metadata, and data markup • Text capture • Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX • Transcription tools, e.g., Transcriber • A Guide to Good Practice

  7. Corpus annotation • Why annotate • Levels of corpus annotation • Difficulties for corpus annotation • Tools for corpus annotation

  8. Why annotate • For linguistic research • Allow more effective corpus searches • For natural language processing • Spelling and grammar checking • Text summarization • Machine translation • Question answering

  9. Levels of corpus annotation • Sentence segmentation • Word segmentation/tokenization • Part-of-speech (POS) tagging • Chunking/shallow parsing • Syntactic parsing • Semantic annotation • Pragmatic annotation • Parallel corpora: sentence alignment • Learner corpora: error annotation
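The first two annotation levels on this slide can be illustrated with a minimal sketch. It is not the tooling used in the course: the two regular expressions and the function names `split_sentences` and `tokenize` are assumptions chosen for illustration, and real segmenters handle abbreviations, quotations, and other cases that this one ignores.

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenization: a word (with internal apostrophes or hyphens)
    or a single punctuation mark becomes one token."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

text = "I saw a pig with binoculars. It didn't see me!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sent, toks in zip(sentences, tokens):
    print(sent, "->", toks)
```

A rule this simple would wrongly split after "e.g." or "Dr.", which is one reason sentence segmentation counts as an annotation level in its own right.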

  10. Difficulties for corpus annotation • Ambiguity • I saw a pig with binoculars. • Problems for tagging, parsing, & word sense disambiguation (WSD) • Unknown words • Identification • POS tagging • Semantic annotation

  11. Tools for corpus annotation • Bookmarks for corpus-based linguists • Corpora and Corpus Annotation Tools on the WWW • POS tagger demonstration • Sentence segmentation • POS tagging • Extracting NPs of the form DT NN NN • Dexter: Tools for analyzing language data
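The "Extracting NPs of the form DT NN NN" demo can be sketched as a scan over POS-tagged tokens. The example sentence and its tags below are hand-assigned for illustration (Penn Treebank-style labels are assumed), and `extract_dt_nn_nn` is a hypothetical helper, not part of any tool named on the slide.

```python
def extract_dt_nn_nn(tagged):
    """Return every run of three consecutive tokens tagged DT, NN, NN."""
    pattern = ("DT", "NN", "NN")
    matches = []
    for i in range(len(tagged) - 2):
        window = tagged[i:i + 3]
        if tuple(tag for _, tag in window) == pattern:
            matches.append(" ".join(word for word, _ in window))
    return matches

# Hand-tagged example; in practice the tags would come from a POS tagger.
tagged_sentence = [
    ("The", "DT"), ("grammar", "NN"), ("checker", "NN"),
    ("flagged", "VBD"), ("a", "DT"), ("spelling", "NN"),
    ("error", "NN"), (".", "."),
]
noun_phrases = extract_dt_nn_nn(tagged_sentence)
print(noun_phrases)
```

Because the pattern is matched on tags rather than words, the quality of the result depends entirely on the accuracy of the tagging step that precedes it.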

  12. Corpus analysis • Levels of corpus analysis • Tools for corpus analysis • Interpreting corpus data

  13. Levels of corpus analysis • Word frequency lists • Concordances • Collocation (lexical patterning) • Colligation (syntactic patterning) • Keyword lists
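For readers who want to write their own scripts, the first two levels (word frequency lists and concordances) can be sketched with the Python standard library. The tokenization rule, the function names, and the bracketed keyword-in-context layout are assumptions made for this example; dedicated tools offer far more options.

```python
import re
from collections import Counter

def words(text):
    """Lower-case the text and keep alphabetic words (with an optional apostrophe part)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens, top=None):
    """Word frequency list, most frequent first."""
    return Counter(tokens).most_common(top)

def concordance(tokens, target, width=3):
    """Keyword-in-context lines: `width` tokens on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

sample = "The cat sat on the mat. The dog sat on the log."
tokens = words(sample)
print(frequency_list(tokens, top=3))
for line in concordance(tokens, "sat", width=2):
    print(line)
```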

  14. Tools for corpus analysis • Bookmarks for corpus-based linguists • Recommendations: • WordSmith Tools (not free) • AntConc (free) • TextStat (free) • Unix tools • Write your own scripts

  15. Exercise (part 1) • Download and install AntConc • Download some text for processing • Project Gutenberg • Generate a word frequency list for your mini-corpus

  16. Interpreting corpus data • Are frequency differences statistically significant? • w appears x times in an n-word corpus, and y times in an m-word corpus • Chi-square test (unreliable when expected frequencies are small) • Fisher’s Exact Test (in its standard form, limited to a 2×2 cross table)
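The comparison on this slide amounts to a 2×2 table: w versus all other words, in corpus 1 versus corpus 2. The sketch below is one way to compute both tests with the standard library only. The function names are hypothetical, the chi-square version applies no continuity correction, and the Fisher version uses the common two-sided rule of summing all tables that are no more probable than the observed one. Both assume w occurs at least once across the two corpora.

```python
import math

def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (1 degree of freedom, no continuity correction)
    for w occurring x times in n words and y times in m words."""
    a, b = x, n - x          # corpus 1: w, other words
    c, d = y, m - y          # corpus 2: w, other words
    total = n + m
    chi2 = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value

def fisher_exact_2x2(x, n, y, m):
    """Two-sided Fisher's exact test on the same 2x2 table.
    Exact binomial coefficients get slow for very large corpora."""
    col1 = x + y             # total occurrences of w
    total = n + m

    def prob(k):
        # Hypergeometric probability of k occurrences of w in corpus 1
        return math.comb(n, k) * math.comb(m, col1 - k) / math.comb(total, col1)

    observed = prob(x)
    lo, hi = max(0, col1 - m), min(n, col1)
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))

chi2, p_chi = chi_square_2x2(30, 100, 10, 100)
p_fisher = fisher_exact_2x2(30, 100, 10, 100)
print(chi2, p_chi, p_fisher)
```

With real corpora, n and m run to millions of words, so a statistics package is the more practical choice; the sketch is meant only to make the arithmetic visible.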

  17. Exercise (part 2) • Compare your word frequency list with that of the BNC • Anything interesting? • Run the chi-square test and Fisher’s Exact Test on some interesting words

  18. Interpreting corpus data (cont.) • Collocational analysis: How strongly are X and Y associated? • Mutual information • Compares the observed co-occurrence frequency of (X,Y) with the frequency expected by chance • Higher MI, stronger association • Doesn’t work well for low frequencies • T-test • Measures confidence with which to claim strong association between X and Y • Higher t-score, stronger association • Online calculations
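The slide does not spell out formulas, so the sketch below assumes the formulations commonly used in corpus linguistics: expected frequency E = f(X)·f(Y)/N, MI = log2(O/E), and t = (O − E)/√O, where O is the observed co-occurrence count and N the corpus size in tokens. The function and parameter names are illustrative only.

```python
import math

def expected_cooccurrence(f_x, f_y, n):
    """Co-occurrence count expected if X and Y were independent."""
    return f_x * f_y / n

def mutual_information(f_xy, f_x, f_y, n):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    return math.log2(f_xy / expected_cooccurrence(f_x, f_y, n))

def t_score(f_xy, f_x, f_y, n):
    """t-score: observed minus expected, scaled by the square root of observed."""
    return (f_xy - expected_cooccurrence(f_x, f_y, n)) / math.sqrt(f_xy)

# Invented counts: X and Y each occur 1,000 times in a 1,000,000-token corpus
# and co-occur 16 times.
mi = mutual_information(16, 1000, 1000, 1_000_000)
t = t_score(16, 1000, 1000, 1_000_000)

# A pair of words that each occur once, together: MI is very high, t is not.
rare_mi = mutual_information(1, 1, 1, 1_000_000)
rare_t = t_score(1, 1, 1, 1_000_000)
print(mi, t, rare_mi, rare_t)
```

The second pair shows why MI is unreliable at low frequencies: a single chance co-occurrence of two rare words yields a large MI, while the t-score stays below 1.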

  19. Exercise (part 3) • Generate a concordance for a target word • Find a word that co-occurs frequently with the target word • Test if the word is strongly associated with the target word
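The second step of this exercise, finding words that co-occur frequently with the target, can be approximated by counting neighbours within a fixed window. The window size, the toy sentence, and the helper name `collocates` are assumptions for illustration; the resulting counts would then feed an association measure such as MI or the t-score.

```python
import re
from collections import Counter

def collocates(tokens, target, window=2):
    """Count words appearing within `window` tokens of each occurrence of target."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

text = "strong tea with strong coffee or powerful computers and strong tea"
tokens = re.findall(r"[a-z]+", text.lower())
neighbour_counts = collocates(tokens, "strong", window=1)
print(neighbour_counts.most_common(3))
```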

  20. Note on research project design • Purpose of project • Corpus compilation and annotation • Corpus analysis • Bottom-up: from observations of recurring patterns to hypotheses and generalizations • Top-down: start with given categories and search for evidence of use and variance • Caution on generalizability

  21. Future courses on corpus linguistics • Spring 2007 • APLING 597E: Introduction to Corpus Linguistics • Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis • Spring 2008 • APLING 597: Seminar on Corpus Linguistics • Advanced seminar on using corpora for serious research projects
