Principles of corpus construction

Matthew Brook O’Donnell, University of Liverpool. Corpus Linguistics Summer Institute 2008. Aims: What is a corpus? What principles guide the construction, development and selection of a corpus? When and how to build a corpus.

Presentation Transcript


  1. Principles of corpus construction Matthew Brook O’Donnell University of Liverpool - Corpus Linguistics Summer Institute 2008

  2. Aims • What is a corpus? • What principles guide the construction, development and selection of a corpus? • When and How to build a corpus • Can the web be used for building corpora? • Workshop: Build a small corpus of web texts

  3. What is a corpus? • John Sinclair (1933-2007)

  4. What is a corpus? • John Sinclair (1933-2007) A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research. (Sinclair 2004)

  10. Corpus • Authentic language data • Electronic/machine readable form • Designed and collected according to sampling procedures • Representative of language • For linguistic investigation

  13. But first… • Let’s talk about food! • How could we compile a representative list (=CORPUS) of food/dishes from around the world?

  14. How can we group these foods? • Where they come from

  15. 1. Continental Food Corpus: Europe • Asia • America (North & South)

  16. How can we group these foods? • Where they come from • Their main component

  17. 2. Main Component Corpus: Fish • Meat • Vegetarian

  18. Where do you eat it?

  19. How can we group these foods? • Where they come from • Their main component • Where you usually eat it

  20. 3. ‘Fast food potential’ Corpus: Takeaway • Restaurant

  21. How can we group these foods? • Where they come from • Their main component • Where it is usually eaten • What you use to eat it

  22. 4. Consumption Implement Corpus: Knife & fork • Chopsticks • Hands

  23. How can we group these foods? • Where they come from • Their main component • Where it is usually eaten • What you use to eat it

  24. 1+2. Continental & Main Component Corpus: main component (Fish • Meat • Veg.) crossed with continent (Europe • Asia • America (North & South))
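
The cross-classification on slide 24 can be sketched as a grid of cells, each defined by two external criteria. The category labels come from the slides; the dish names and their assignments are illustrative assumptions, as are the helper names `build_grid` and `empty_cells`. Empty cells show where a collection is not yet representative of the design.

```python
from itertools import product

# Category labels taken from the slides.
CONTINENTS = ["Europe", "Asia", "America"]
COMPONENTS = ["Fish", "Meat", "Veg."]

# Hypothetical dishes, each labelled (continent, main component).
DISHES = {
    "fish and chips": ("Europe", "Fish"),
    "ratatouille": ("Europe", "Veg."),
    "sushi": ("Asia", "Fish"),
    "hamburger": ("America", "Meat"),
    "guacamole": ("America", "Veg."),
}


def build_grid(dishes):
    """Place every dish into its (continent, component) cell."""
    grid = {cell: [] for cell in product(CONTINENTS, COMPONENTS)}
    for name, cell in dishes.items():
        grid[cell].append(name)
    return grid


def empty_cells(grid):
    """Return the cells of the design that have no items yet."""
    return [cell for cell, items in grid.items() if not items]


grid = build_grid(DISHES)
gaps = empty_cells(grid)
for cell in gaps:
    print("No dish collected yet for:", cell)
```

The same idea carries over to texts: replace the two food criteria with any pair of external criteria (for example mode and setting) and check which cells still need material.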

  25. Corpus • Authentic language data • Electronic/machine readable form • Designed and collected according to sampling procedures • Representative of language • For linguistic investigation

  26. Language corpora • Different types of corpus • Corpus size • Sample size • Representativeness - sampling • Classification criteria

  27. Types of corpus • Sample Corpus: a fixed sample of text, often used as a reference corpus for comparing • Monitor Corpus: a corpus which develops and is added to or filtered depending on the researcher’s needs • Mini-corpus: a small corpus (e.g. to be compared with a reference corpus) • Multilingual Corpus: corpus in a variety of languages

  28. Types of corpus • Comparable Corpus: texts in 2 languages or 2 varieties but not matched up • Parallel Corpus: texts are translations of each other, e.g. Canadian Hansard, corpus of versions of Plato, Bible • Translation Corpus: 2 or more sets of texts classified as either originals or translations, the purpose being to identify features of translation (Manchester: Baker) • Diachronic Corpus: Helsinki, LOB v. FLOB • Learner Corpus: texts written by language learners

  29. Corpus Size – Is bigger better? • 1st Generation Corpora = 1 million words: BROWN, LOB, ICE corpora • 2nd Generation Sample Corpora: BNC, ANC = 100 million words • Monitor Corpora: Bank of English (450+ million and growing!) • Specialized corpora • Depends on source and scope of problem under investigation

  30. Sample Size ‘Personally I would like to see ‘whole text’ as a default condition, thus classifying sample corpora as one of the categories of special corpora... To me the use of small samples is just a remnant of the early restraints on corpus building, and the advantages of whole texts can be set out in powerful argument. The use of samples of constant size gains only a spurious air of scientific method, since it confers no benefit on the corpus, and is as practical as Genghis Khan’s fabled policy of having all his soldiers the same height.’ (Sinclair 1995: 27-28)
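
The contrast Sinclair draws can be made concrete with a toy sketch. First-generation corpora such as Brown used samples of roughly 2,000 words each; the tiny sample size, the two texts, and the function name `fixed_size_sample` below are assumptions chosen only to keep the example short.

```python
def fixed_size_sample(tokens, size):
    """Return the first `size` tokens, as a constant-size sampling policy would."""
    return tokens[:size]


# Two hypothetical texts of different lengths, already split into tokens.
texts = {
    "short_letter": "Sir I agree with the previous letter".split(),
    "long_report": ("word " * 12).split(),
}

SAMPLE_SIZE = 5  # toy value; real sample corpora used about 2,000 words

for name, tokens in texts.items():
    sample = fixed_size_sample(tokens, SAMPLE_SIZE)
    dropped = len(tokens) - len(sample)
    print(f"{name}: whole text {len(tokens)} tokens, "
          f"sample {len(sample)}, dropped {dropped}")
```

With whole texts nothing is dropped, so features that occur only late in a text (conclusions, sign-offs) remain available for study.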

  31. Sampling • Population • Production • Reception

  32. Classifying Texts • Internal Criteria • Topic (aboutness) • Register/Style

  33. Classifying Texts • Internal Criteria • Topic (aboutness) • Register/Style • External Criteria (situational parameters)

  34. Brits treat English with such disdain By Mr M. Rasheed Iqbal Published: June 30 2007 03:00 | Last updated: June 30 2007 03:00 From Mr M. Rasheed Iqbal. Sir, I agree with Henry von Blumenthal (Letters, June 23). It is very discouraging to hear news presenters saying "gonna" and "wanna" on the BBC news. We were brought up to speak English properly and it is disappointing to see Brits treat the language with such disdain. I hope the BBC will stem the tide and pull up its socks. M. Rasheed Iqbal, National Bank of Dubai, Deira, Dubai, UAE Copyright The Financial Times Limited 2007

  35. Mode • Primary Channel – Written • Format – Published (print & web) • Setting - Public

  36. Tenor • Addressee: Plurality – individual (editor) / plural (readers); Presence – absent; Interactiveness – written correspondence/response; Shared knowledge – readers of same publication • Addressor: Demographic – male, from Dubai, works in a bank, educated, at least bilingual?; Acknowledgement – self-identified in text

  37. Field • Factuality – responding to actual event (TV broadcast), expressing personal opinion • Purposes – complain, express viewpoint, condemn slipping standards, correct perceived decline • Topics – use of British English on BBC, value of language education in former era
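
The mode, tenor and field analysis of the letter on slides 34 to 37 could be stored as metadata alongside the text, so that texts can later be selected by external criteria. This is a minimal sketch: the class name `TextRecord`, the dictionary keys, and the helper `matches` are assumptions; the values are taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """One corpus text plus its situational (external) description."""
    title: str
    source: str
    mode: dict
    tenor: dict
    field: dict


letter = TextRecord(
    title="Brits treat English with such disdain",
    source="Financial Times, June 30 2007",
    mode={"channel": "written", "format": "published (print & web)",
          "setting": "public"},
    tenor={"addressee_plurality": "individual (editor) / plural (readers)",
           "addressee_presence": "absent",
           "addressor": "self-identified in text"},
    field={"factuality": "response to actual event, personal opinion",
           "purposes": ["complain", "express viewpoint"],
           "topics": ["use of British English on BBC"]},
)


def matches(record, dimension, key, value):
    """True if the record has `value` for `key` in the given dimension."""
    return getattr(record, dimension).get(key) == value


corpus = [letter]
written = [r for r in corpus if matches(r, "mode", "channel", "written")]
print(len(written), "written text(s) selected")
```

Keeping the description separate from the text itself means the selection criteria stay external, in the sense of Sinclair's definition.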

  38. When and How to build a corpus • DON’T! – use one of the available corpora • If interested in differences in conversational language in British English (age, sex, class etc. differences)… USE British National Corpus • Combine and subsample existing corpora to match your • Repurpose existing archive/collection • Any electronic texts available – results of surveys, DA/CA transcripts • Build your own! • OCR, download, extract from PDF • TYPE IT IN!!!!
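
For the "download" route, one step is turning a saved web page into plain text. The sketch below uses only Python's standard-library `html.parser`; the class name `TextExtractor`, the function `html_to_text`, and the sample page are assumptions for illustration, and real pages usually need further cleaning (menus, adverts, copyright notices).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and normalise all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page built around the letter from slide 34.
SAMPLE = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Brits treat English with such disdain</h1>"
    "<p>It is very discouraging to hear <b>news presenters</b> "
    "saying &quot;gonna&quot;.</p>"
    "<script>track();</script></body></html>"
)

plain = html_to_text(SAMPLE)
print(plain)
```

Chunks are joined with a space, which keeps words on either side of a tag apart but would also split a word that has markup in the middle of it; that trade-off is acceptable for a first pass.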

  39. Using the web as source for corpora

  40. Web as corpus: Advantages • Massive (and expanding) amounts of electronic text • Whole texts • Wide reach of text-types/topics/genres • Much in the public domain • Google (etc.) as corpus query tool
