
English Corpora and Language Learning



  1. English Corpora and Language Learning Tamás Váradi varadi@nytud.hu

  2. Outline • What is a Corpus? • Compiling a corpus • First generation of corpora: BROWN, LOB • The Age of Mega Corpora • British National Corpus • International Corpus of English • International Corpus of Learner English • The Web as a corpus? • Availability

  3. Corpora? • (1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse. (2) In linguistics and lexicography, a body of texts, utterances or other specimens considered more or less representative of a language and usually stored as an electronic database. (The Oxford Companion to the English Language, 1992) • A collection of naturally occurring language text chosen to characterize a state or variety of a language. (John Sinclair, Corpus, Concordance, Collocation, OUP, 1991)

  4. The pre-electronic era • Huge, painstaking manual effort • Covering a closed body of texts • Bible Concordance • Shakespeare Concordance • Attempt to capture the whole language

  5. Compiling a corpus • Aim • provide solid empirical evidence about language • Design • geographical and chronological bounds • speakers, genres • defined by future use • Representative corpora? • Annotation • Output

  6. Corpus Linguistics: the early phase • Early Sixties • BROWN Corpus: 500 texts of 2000 words each • LOB corpus: British counterpart • Classic reference works • Part-of-speech tagged

  7. Survey of English Usage • A major undertaking at UCL led by Sidney Greenbaum • 1 m word compilation • very careful annotation • 500,000 words of spoken material • LONDON-LUND Corpus

  8. Structure of SEU

  9. LOB corpus: a sample
     A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
     A01 3 ^ by_IN Trevor_NP Williams_NP ._.
     A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
     A01 4 nominating_VBG any_DTI more_AP labour_NN
     A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
     A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
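
The sample above pairs each word with a part-of-speech tag in the form word_TAG, prefixed by a text reference and line number. A minimal Python sketch of how such lines might be read is shown here. It assumes that each physical line starts with the reference (A01) and a line number, and that `^` marks the start of a sentence; the names `parse_line`, `tokens` and `tag_counts` are illustrative and not part of any LOB tooling.

```python
from collections import Counter

# Sample lines copied from the slide; a raw string keeps the backslashes literal.
SAMPLE = r"""A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN
A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
"""


def parse_line(line):
    """Split one line into (text id, line number, [(word, tag), ...])."""
    text_id, line_no, *items = line.split()
    tokens = []
    for item in items:
        if item == "^":  # assumed sentence-start marker, not a token
            continue
        # Split on the LAST underscore so punctuation such as "._." parses too.
        word, _, tag = item.rpartition("_")
        tokens.append((word, tag))
    return text_id, int(line_no), tokens


# Collect every (word, tag) pair in the sample.
tokens = []
for raw in SAMPLE.splitlines():
    _, _, line_tokens = parse_line(raw)
    tokens.extend(line_tokens)

# Frequency of each tag across the sample.
tag_counts = Counter(tag for _, tag in tokens)
print(len(tokens), "tokens")
print(tag_counts.most_common(5))
```

Once the tags are separated from the words, simple counts like these are the starting point for the frequency tables found in the classic reference works built on BROWN and LOB.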

  10. Concordance output
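
A concordance lists every occurrence of a search word with a fixed amount of context on either side, usually called keyword-in-context (KWIC) format. The Python sketch below illustrates the idea only; it is not the tool shown on the slide. The function name `kwic`, the `width` parameter and the sample sentence (the LOB sample with its tags removed) are assumptions made for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for each whole-word match of `keyword`."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad both contexts so the keyword lines up in one column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines


# Illustrative text: the LOB sample sentence with the tags stripped.
TEXT = ("A move to stop Mr Gaitskell from nominating any more labour life peers "
        "is to be made at a meeting of labour MPs tomorrow.")

for line in kwic(TEXT, "labour"):
    print(line)
```

Aligning the keyword in a single column is what makes recurring patterns to its left and right easy to spot by eye.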

  11. The age of Mega Corpora • COBUILD • John Sinclair at University of Birmingham • originally 20 m words • now the Bank of English, over 300 m words • the more the better • no fixed size: the idea of a Monitor corpus

  12. British National Corpus • A major undertaking in the mid-nineties • Birmingham, Lancaster – OUP, Longman, Chambers • 100 m words carefully compiled • 10 m words of spoken data! • up-to-date standard SGML encoding • still the paradigm example of a reference corpus

  13. Accessing the BNC

  14. BNC-Baby

  15. Searching LOB/BROWN

  16. International Corpus of English • A network of corpora covering regional varieties of English • Project organized by UCL London • Each containing c. 1 m words • GB, Hong Kong, Australia, East Africa; more in preparation

  17. ICE-HK

  18. ICE-GB: sociolinguistic variation

  19. ICE-GB: syntactic annotation

  20. Treebanks • Geoffrey Sampson • Meticulously hand-crafted syntactic annotation • SUSANNE • CHRISTINE • LUCY • Penn Treebank • University of Pennsylvania • Massive amounts of automatically annotated data intended for natural language processing work

  21. International Corpus of Learner English • International Centre of English Corpus Linguistics, Catholic University of Louvain, led by Sylviane Granger • collection of essays • student profiles • Hungarian-English in preparation

  22. SUSANNE Corpus • Aims of the Scheme • comprehensive: covering all features of surface and logical English grammar that are definite enough to be susceptible of formal annotation, and including all phenomena that occur in practice in modern English • explicit: if two researchers at separate sites are given the same sample of English and asked to annotate it according to the SUSANNE standards, their annotations should be identical • nonpartisan: where aspects of grammar are the subject of theoretical controversy, the SUSANNE scheme aims to embody a neutral analysis which rival theoreticians can interpret in their own preferred terms

  23. The Web as a corpus • Why sample when you can access the whole? • Huge and ever changing • The ultimate in authenticity? • Not necessarily …

  24. The Webcorp project

  25. http://devoted.to/corpora
