Corpus design



  1. Corpus design • See G. Kennedy, Introduction to Corpus Linguistics, Ch. 2 • C. F. Meyer, English Corpus Linguistics, Ch. 2

  2. What is a corpus? • Corpus (pl. corpora) = ‘body’ • Collection of written text or transcribed speech • Usually but not necessarily purposefully collected • Usually but not necessarily structured • Usually but not necessarily annotated • (Usually stored on and accessible via computer) • Corpus ~ text archive

  3. Issues in corpus design • General purpose vs specialized • Dynamic (monitor) vs static • Representativeness and balance • Size • Storage and access • Permission • Text capture and markup • Organizations

  4. General purpose vs specialized • Probably obvious how to assemble specialized corpus: appropriateness of texts for inclusion is self-defined • General-purpose corpus implies very careful planning to ensure balance • Implies making some assumptions about the nature of language, even though (as corpus linguists) that may go against the grain

  5. Dynamic vs static • Static corpus will give a snapshot of language use at a given time • Easier to control balance of content • May limit usefulness, esp. as time passes (e.g. Brown corpus now of historical interest, in some respects BNC already out of date) • Dynamic corpus is ever-changing • Called “monitor” corpus because it allows us to monitor language change over time • But more or less impossible to ensure balance

  6. Planned balance: example of BNC • Sampling and representativeness very difficult to ensure • BNC designers very explicit about their assumptions • Acknowledge that many decisions are subjective in the end • 100m words of contemporary spoken and written British English • Representative of BrE “as a whole” • Balanced with regard to genre, subject matter and style • Also designed to be appropriate for a variety of uses: lexicography, education, research, commercial applications (computational tools)

  7. BNC • 4,124 texts: 90% written, 10% spoken • Largest collection of spoken English ever collected (10m words), but reflects typical imbalance in favour of written text (for understandable practical reasons) • Written portion: 75% informative, 25% imaginative • Amount of fiction is slightly disproportionately high compared to amount published during the sampling period, justified because of cultural importance of fiction and creative writing

  8. Subject coverage • Planned to reflect pattern of book publishing in UK over last 20 years
     Subject            Number of texts   % of total written
     Imaginative        625               22
     World affairs      453               18
     Social science     510               15
     Leisure            374               11
     Applied science    364               8
     Commerce           284               8
     Arts               259               8
     Natural science    144               4
     Belief & thought   146               3
     Unclassified       50                3

  9. Sources of written material • 60% books • 25% periodicals • 5% brochures and other ephemera (e.g. bus tickets, produce containers, junk mail) • 5% unpublished letters, essays, minutes • 5% plays, speeches (written to be spoken)
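Target shares like these can be turned into per-source quotas once a size is fixed for the written part of a corpus. The following is a minimal sketch, not the BNC designers' procedure: it assumes the shares apply to word counts, and the `allocate` helper and the one-million-word budget are hypothetical, chosen only for illustration.

```python
# Target shares for the written sources, taken from the slide (percent).
SOURCE_SHARES = {
    "books": 60,
    "periodicals": 25,
    "brochures and other ephemera": 5,
    "unpublished letters, essays, minutes": 5,
    "plays, speeches (written to be spoken)": 5,
}


def allocate(word_budget):
    """Split a word budget across sources in proportion to the shares."""
    assert sum(SOURCE_SHARES.values()) == 100
    return {src: word_budget * pct // 100 for src, pct in SOURCE_SHARES.items()}


# The budget below is a hypothetical figure, not a number from the slides.
for source, words in allocate(1_000_000).items():
    print(f"{source}: {words:,} words")
```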

  10. Register “levels” • 30% literary or technical “high” • 45% “middle” • 25% informal “low” • Obvious difficulty of how to judge levels a priori

  11. Spoken corpus • Context-governed material • Lectures, tutorials, classrooms • News reports • Product demonstrations, consultations, interviews • Sermons, political speeches, public meetings, parliamentary debates • Sports commentaries, phone-ins, chat shows • Samples from 12 different regions

  12. Spoken corpus • Ordinary conversation • 2000 hrs from 124 volunteers, 38 different regions • Four different socio-economic groupings • Equal male and female, age range 15 to 60+ • All conversations over a 2-day period recorded • No secret recording, and volunteers allowed to erase recordings • Systematic details kept of time, location, details of participants (sex, age, race, occupation, education, social group), topic, etc. • Transcription issues: • include false starts, hesitations, etc. • some paralinguistic features (shouting, whispering) • use of dialect words/grammar • but no phonetic information

  13. Another example: ICE • Collection of samples of English as spoken/written around the world • Common design (as well as common annotation scheme, and shared tools for exploitation) • 500 texts of approximately 2,000 words each • 60% spoken, 40% written • Specific domains and genres prescribed • Prescribing common design in this way makes the corpora comparable

  14. ICE text categories • Each sample should be 2,000 words

  15. Length of corpus • Resources available to create and manage corpus determine how long it can be • Funding, researchers, computing facilities • Speech is easy to capture, but much more time-consuming to process than written language • Transcription and annotation require 6 person-hours per 1 minute of speech (Santa Barbara Corpus of Spoken American English) • 4 person-hours per 1,000 words of written sample, but between 5 and 10 person-hours per 1,000 words of speech (more for dialogues due to overlapping speech) (International Corpus of English) • On this basis, American component of ICE would take one researcher working 40 hrs/week 3 years to complete • BNC is 100 times bigger than that
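The three-year figure can be roughly reproduced from the rates quoted on this slide. The sketch below is only a back-of-envelope check. It assumes the ICE design of 500 texts of 2,000 words, that the 60/40 spoken/written split applies to word counts, and that a working year is 52 weeks; the last two points are assumptions, not statements from the slides.

```python
# Back-of-envelope check of the ICE effort estimate.
# Assumptions (not stated on the slide): the 60/40 spoken/written split
# applies to word counts, and a working year is 52 weeks of 40 hours.

TEXTS = 500
WORDS_PER_TEXT = 2_000
SPOKEN_SHARE_PCT = 60
HOURS_PER_1000_WRITTEN = 4
HOURS_PER_1000_SPOKEN = (5, 10)  # lower and upper bound
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52


def effort_hours():
    """Return (low, high) person-hours for one ICE component."""
    total_words = TEXTS * WORDS_PER_TEXT
    spoken_words = total_words * SPOKEN_SHARE_PCT // 100
    written_words = total_words - spoken_words
    written_hours = written_words // 1000 * HOURS_PER_1000_WRITTEN
    low = written_hours + spoken_words // 1000 * HOURS_PER_1000_SPOKEN[0]
    high = written_hours + spoken_words // 1000 * HOURS_PER_1000_SPOKEN[1]
    return low, high


low, high = effort_hours()
hours_per_year = HOURS_PER_WEEK * WEEKS_PER_YEAR
print(f"{low}-{high} person-hours")
print(f"{low / hours_per_year:.1f}-{high / hours_per_year:.1f} years")
```

Under these assumptions the estimate comes to 4,600 to 7,600 person-hours, or about 2.2 to 3.7 years for one researcher, a range that brackets the three-year figure.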

  16. Length of corpus • Length is also determined by the use to which the corpus will be put • Corpora for lexicographic use need to be (much) bigger • Early corpora (1m words) seemed huge, mainly due to limitations of computers to process them • Sinclair (1991) described a 20m word corpus as “small but nevertheless useful” • Even in a billion-word corpus, data for some words/constructions would be sparse • How many tokens of a linguistic item are needed for descriptive adequacy? • Typically 40-50% of all word types occur only once in a given text (or corpus) • For polysemous words at least half of the possible meanings will occur only once (if at all)
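The proportion of word types that occur only once can be measured directly on any tokenized text. This is a minimal sketch: the `hapax_share` function name and the toy sentence are hypothetical, and the toy input is far too small to confirm or refute the 40-50% figure.

```python
from collections import Counter


def hapax_share(tokens):
    """Return the fraction of word types that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(counts)


# Toy input only; it says nothing about real corpora.
toy = "the cat sat on the mat and the dog sat too".split()
print(f"{hapax_share(toy):.2f}")
```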

  17. “Type” and “token” • “Token” means an individual occurrence of a word • “Type” means a distinct word, counted once however often it occurs • The man saw the girl with the telescope • 8 tokens, 6 types (“the” occurs three times) • “Type” may refer to lexeme, or individual word form • run, runs, ran, running: 1 or 4 types?
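The counts for the example sentence can be checked in a few lines. The sketch assumes whitespace tokenization and case folding, so that "The" and "the" are the same type. The `LEMMAS` table is a hand-made, hypothetical mapping used only to show the word-form versus lexeme distinction, not the output of a real lemmatizer.

```python
# Count tokens and types in the slide's example sentence.
# Assumption: tokens are whitespace-separated and case is ignored,
# so "The" and "the" count as the same type.

sentence = "The man saw the girl with the telescope"

tokens = sentence.lower().split()
types = set(tokens)

print(len(tokens), "tokens")
print(len(types), "types")

# Word forms vs lexemes: LEMMAS is a hand-made, hypothetical mapping,
# not the output of a real lemmatizer.
LEMMAS = {"run": "run", "runs": "run", "ran": "run", "running": "run"}
forms = ["run", "runs", "ran", "running"]

form_types = set(forms)
lexeme_types = {LEMMAS[form] for form in forms}

print(len(form_types), "word-form types")
print(len(lexeme_types), "lexeme type")
```

Counting by word form gives 4 types for run, runs, ran, running, while counting by lexeme gives 1, so the answer to the slide's question depends on which definition is adopted.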

  18. Some attempts to base corpus size on known statistics of existing corpora • Biber (1993): “reliable information” on frequently occurring linguistic items such as nouns can be obtained from a 120k-word sample, while an infrequently occurring construction such as the conditional clause would need 2.4m words • How are such figures arrived at? • Observe point at which measures stabilise • Also, how much data can a lexicographer absorb?
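One simple way to "observe the point at which measures stabilise" is to track the relative frequency of an item as the sample grows and note when it stops changing. The sketch below illustrates that idea only and is not Biber's procedure. The `stabilisation_point` function and its `step`, `tol` and `patience` parameters are hypothetical choices, and the input is synthetic.

```python
def stabilisation_point(tokens, item, step=1000, tol=0.001, patience=3):
    """Return the sample size at which the relative frequency of `item`
    has changed by less than `tol` at `patience` consecutive checkpoints,
    or None if that never happens."""
    count = 0
    previous = None
    stable_run = 0
    for i, token in enumerate(tokens, start=1):
        if token == item:
            count += 1
        if i % step == 0:  # checkpoint every `step` tokens
            freq = count / i
            if previous is not None and abs(freq - previous) < tol:
                stable_run += 1
                if stable_run == patience:
                    return i
            else:
                stable_run = 0
            previous = freq
    return None


# Synthetic data: a perfectly regular alternation, for illustration only.
sample = ["a", "b"] * 5000
print(stabilisation_point(sample, "a"))
```

Real corpus data is much noisier than this alternation, so the thresholds would need to be chosen with the item's expected frequency in mind; a rare construction needs far more text before its frequency settles.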
