LELA 30922 English Corpus Linguistics

Harold Somers

Professor of Language Engineering

Office: Lamb 1.15


Assessment

A practical project in which students will use the BNC (or other approved corpus material) to investigate some question of English language usage.

Suggestion: base your project (more or less closely) on some existing study.

Project write-up will include relevant background material and results and discussion of a corpus-based analysis.

In other words: summarize (and criticize) the chosen study, then do your own version, and compare the results.

Reading matter
  • Main recommendations:
    • Kennedy, G.D. (1998) An introduction to corpus linguistics. London: Longman.
    • McEnery, T. & A. Wilson (2001, 2nd ed) Corpus linguistics. Edinburgh: Edinburgh University Press.
    • Meyer, C. (2002) English corpus linguistics: An introduction. Cambridge: Cambridge University Press.
  • Lots of other books, focussing on particular aspects
  • Do not ignore journals (eg International Journal of Corpus Linguistics) and specialist conferences, especially when considering the practical assignment.
  • http://tinyurl.com/32abhb for list of resources available at UoM
What is a corpus?
  • Corpus (pl. corpora) = ‘body’
  • Collection of written text or transcribed speech
  • Usually but not necessarily purposefully collected
  • Usually but not necessarily structured
  • Usually but not necessarily annotated
  • (Usually stored on and accessible via computer)
  • Corpus ~ text archive
Computers and corpus linguistics
  • Historically, manual analysis of large bodies of text (esp. in literary and biblical studies)
    • Error-prone, time-consuming, not verifiable
  • Computers have introduced
    • Reliability, accuracy and replicability
    • increased speed and capacity means you can do more on a grander scale
    • new tools mean you can do things you might not have thought of doing
What is corpus linguistics?
  • Not a branch of linguistics, like socio~, psycho~, …
  • Not a theory of linguistics
  • A set of tools and methods (and a philosophy) to support linguistic investigation across all branches of the subject
Evidence in linguistics
  • Real attested usage as linguistic evidence
  • Contrasts with introspective approach previously typical
  • Relates to the competence~performance (langue~parole) distinction
  • Corpus linguists often more interested in trends than rules (probabilities rather than certainties)
  • Famous stories of corpus evidence contradicting widely-held assumptions about language use.
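
The point about trends rather than rules usually comes down to comparing relative frequencies across samples. A minimal sketch in Python, assuming two tiny made-up samples (the texts and the helper names `tokenize` and `per_million` are hypothetical and not drawn from any real corpus or tool):

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def per_million(word, tokens):
    """Frequency of `word`, normalized to occurrences per million tokens."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens) * 1_000_000

# Invented samples, for illustration only
sample_a = "I shall go now. We shall see. They will come."
sample_b = "I will go now. We will see. They will come."

tokens_a = tokenize(sample_a)
tokens_b = tokenize(sample_b)
for word in ("shall", "will"):
    print(word, round(per_million(word, tokens_a)), round(per_million(word, tokens_b)))
```

Normalizing to a common base (here, per million tokens) is what lets corpora of different sizes be compared; with samples this small the figures are of course meaningless beyond showing the arithmetic.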
Activities in corpus linguistics
  • Design and compilation of corpora
  • Development of tools for corpus analysis
  • Descriptive linguists using corpora to analyze lexical and grammatical behaviour of language, eg for lexicography
  • Exploiting corpora in applied linguistics – language teaching, translation.
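
One of the simplest corpus analysis tools is a keyword-in-context (KWIC) concordancer, which lists every occurrence of a word with a few words either side. The sketch below is only an illustration under stated assumptions: the function name `kwic`, the `width` parameter and the sample sentence are invented here, and real concordancers handle tokenization, annotation and sorting far more carefully.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

# Invented sample text, for illustration only
sample = ("The corpus was compiled in 1961. A corpus of spoken English "
          "followed later, and each corpus was tagged.")
for line in kwic(sample, "corpus"):
    print(line)
```

Aligning the keyword in a fixed column makes recurring patterns to its left and right easy to spot by eye, which is how lexicographers typically read concordance lines.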
History of Corpus Linguistics
www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/history.html
  • Textual study has always included an element of counting and cataloguing, despite impracticalities – notably concordances of Shakespeare, the Bible, etc.
  • Arrival of computers in 1950s of course changed everything
Brown corpus
  • First modern computer-readable corpus
  • W.N. Francis and H. Kucera, Brown University, Providence, RI
  • one million words of American English texts printed in 1961
  • sampled from 15 different text categories
  • used as model for other corpora, including …
LOB corpus
  • compiled by researchers in Lancaster, Oslo and Bergen
  • one million words of British English texts printed in 1961
  • sampled from same 15 text categories as Brown corpus
  • All texts ≤ 2,000 words long
  • Kolhapur corpus of Indian English compiled in 1978 to same specification
Chomsky’s criticisms
  • Chomsky’s ideas drove linguists away from empiricism (data) towards rationalism (introspection)
  • Chomsky switched focus onto abstract models of language competence
  • He was especially scathing about corpus-based approaches
    • Based on mistaken view that corpus linguists confused finiteness of data with finiteness of language
  • See McEnery & Wilson, chapter 1
The London-Lund Corpus of Spoken English (LLC)
  • First corpus of transcribed spoken language
  • Part of Survey of Spoken English at Lund University under the direction of J. Svartvik
  • 500,000 words of spoken British English recorded from 1953 to 1987
  • different categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration
COBUILD
  • 1m-word corpus too small for many applications
  • 1980: Collins instigated collection of a 20m-word corpus to support lexicographers writing the new COBUILD (Collins Birmingham University International Language Database) dictionary (John Sinclair)
  • Now expanded to Bank of English corpus, 320m words and growing
  • www.collins.co.uk/Corpus/CorpusSearch.aspx
  • www.collins.co.uk/books.aspx?group=153
BNC (1995)
  • http://www.natcorp.ox.ac.uk/
  • 100m-word collection of written and spoken text from 1975-93 (already dated in some respects!)
  • Carefully designed and balanced
  • Corpus is closed (finite, synchronic)
  • All text tagged to high quality
  • Lots of tools available for exploration
  • Many other corpus projects now underway, sometimes modelled on BNC or other well-known corpora
  • Various national projects
  • Specialized corpora
    • Historical texts
    • Learner English
    • International English
    • Translated English
    • Spoken dialogues for certain domains
  • When widely used, they become a kind of benchmark, eg Wall Street Journal corpus (treebank)
    • This can have pros and cons