
ReadingCorp: a corpus-based approach to teaching Russian for Research

James Wilson

University of Leeds

[email protected]

Structure of presentation

  • Part 1: “The Problem” (How do we teach ab-initio students to read authentic Russian texts in a year?)

  • Part 2: “A potential corpus-based solution”

  • The use of corpora and corpus tools to train ab-initio students to read authentic academic texts

  • ReadingCorp project

  • Motivated by the demand for specialist PG language training in Russian and the findings of previous research (Russian for Research 2008)

Russian for Research project

  • 6-month project funded by the Centre for East European Language-Based Area Studies (CEELBAS) and carried out at the University of Sheffield in 2008

  • The project aimed to:

    • build up a profile of what PG language training was offered at CEELBAS institutions and to identify the methods of and problems in teaching languages for research;

    • identify the demand for language training for research purposes at member departments and to establish what such language training should include;

    • look at new modes of delivery such as distance- and computer-aided learning and the possibility of sharing of resources.

Background information

  • Departments of Russian and Slavonic Studies are attracting more PG students who do not know Russian and whose research is therefore restricted (the same situation is true of other languages)

  • Students are unable to read primary sources, use archives and work with some online packages without Russian

  • “You simply can’t do Russian-related economic research without Russian”; “Without language skills research is much impaired”


  • There is a “massive” demand for PG language training across CEELBAS institutions

  • Potentially good researchers are being lost due to the lack of adequate PG language training

  • Conventional PG-focused intensive courses are effective but impractical at most institutions; they are not financially sustainable at any institution in the long term

  • Other methods (“piggy-backing”, non-intensive reading modules, following UG programmes) do not work

  • It is not possible to offer specialist tuition to the individual student or to cover all research areas

  • Texts are out-dated and/or more suited to some disciplines than others; their content is determined subjectively by linguists

  • A cost-effective way of delivering shared PG language programmes is necessary

A corpus-based solution???

  • Corpora are well suited to LSP learning and teaching for several reasons:

    • they can inform us of key items of vocabulary and grammar points that require instruction in specific domains;

    • frequency data shape materials and syllabus design;

    • breadth of topics: a corpus can be created on any topic, no matter how specialist, for which there is enough available material;

    • needs of the individual: a corpus can be created from articles directly relevant to an individual student’s research topic;

    • there is no printing/publication lag: corpora can be created on current events, yesterday’s news stories, etc.;

    • they can be built within hours.
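As an illustration of how frequency data can feed into syllabus design, the sketch below builds a word-frequency list from a handful of texts and reports what share of all running words the most frequent forms account for. It is a minimal sketch, not the ReadingCorp tooling: the function names (`tokenize`, `frequency_list`, `coverage`) are hypothetical, the tokenizer is a simple regular expression, and words are counted as surface forms without lemmatisation.

```python
import re
from collections import Counter

# Assumption: a token is a run of lowercase Cyrillic letters, hyphenated words kept whole.
TOKEN_RE = re.compile(r"[а-яё]+(?:-[а-яё]+)*")


def tokenize(text):
    """Lowercase the text and return its Cyrillic word tokens."""
    return TOKEN_RE.findall(text.lower())


def frequency_list(texts):
    """Count surface forms across all texts (no lemmatisation)."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def coverage(counts, top_n):
    """Share of all running words covered by the top_n most frequent forms."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total


# Two sentences used purely as sample input.
sample_texts = [
    "Криминогенность можно рассматривать с двух позиций.",
    "Криминогенность можно рассматривать не только как результат, но и как процесс.",
]
counts = frequency_list(sample_texts)
print(counts.most_common(4))
print(round(coverage(counts, 4), 2))
```

On a real corpus the same coverage figure indicates how many high-frequency items a learner needs before most of a text becomes recognisable, which is one way frequency lists can prioritise what to teach first.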

A corpus-based solution??? (2)

  • Corpora can be used directly or indirectly

  • Corpora can be used in combination with traditional teaching practices (blended learning)

  • Corpora have been used successfully in earlier language-for-research projects: German for Chemists (Butler) and the Warwick course of Italian Language for PG students of Renaissance Studies

Project description

  • 2-year project funded by the AHRC (Collaborative Language Skills Training project)

  • Run at the Department of Russian and Slavonic Studies (Sheffield), GRASS and CTS (Leeds)

  • Combines knowledge and practice of PG language teaching methods (Sheffield / Leeds) with technological expertise in creating corpus tools for language learning purposes (Leeds)


  • To explore possibilities for using corpora to achieve reading competence in Russian

  • To create tools, reference materials (keyword lists, annotated readers, a grammar for researchers) and exercises to support the acquisition of vocabulary from specific and varied domains

  • To actively engage students in “vocabulary identification” exercises

Putting our goals into perspective

  • It may seem “ridiculous” to suggest that a complete beginner with no formal training in linguistics or experience in learning a foreign language can learn Russian in a year

  • We focus solely on reading skills

  • Our aim is for students to read authentic texts with the help of dictionaries and our tools and materials - we do not expect them to pick up a text and read it as someone with years of training would

  • Why within a year?

Corpora, tools and materials

  • Corpus

    • The Russian Academic Corpus (RAC)

  • Technology (additions to the IntelliText Interface)

    • Keyword list generator (single- and multi-words; POS-specific)

    • Grammar frequency

    • Advanced options for navigating texts

    • Vocabulary highlights (general academic, discipline-specific keywords)

    • Automatic grammar classification

  • Pedagogy

    • Readers from 13 academic disciplines

    • “Cleaned” keyword lists from 13 academic disciplines

    • Transferable teaching materials

    • A PG-focused grammar

The Russian Academic Corpus (RAC)

  • Contains approximately 5 million words

  • Used for compiling frequency lists and in teaching

  • Made up of 13 sub-corpora (art, criminology, culture, ecology, economics, geography, history, international relations, linguistics, medicine, politics, religion, sociology)

  • The sub-corpora are roughly equal in size and each contains 50 texts

  • The “main” corpus is freely available via the IntelliText Interface

  • Individual sub-corpora are available on demand

Keyword lists

  • “General academic” and “discipline-specific” keywords were extracted

  • Single words (discipline-specific) and multi-words (general academic and discipline-specific)

  • “Cleaned”: anomalies removed; lemmas changed to original form (то не менее > тем не менее, по отношение к > по отношению к)

  • 100 keywords for each subject area

  • Translations (all lists) and collocations (single words)
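The slides do not say which statistic was used to extract keywords. One common choice in corpus linguistics is log-likelihood keyness, which compares a word's frequency in a target corpus (for example, one discipline) against a reference corpus. The sketch below assumes that statistic; the function names and the toy counts are invented for illustration and are not project data.

```python
import math
from collections import Counter


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood (G2) keyness of one word, target corpus vs reference corpus."""
    combined_rate = (freq_target + freq_ref) / (size_target + size_ref)
    expected_target = size_target * combined_rate
    expected_ref = size_ref * combined_rate
    g2 = 0.0
    if freq_target > 0:
        g2 += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * g2


def keywords(target_counts, ref_counts, top_n=100):
    """Rank words that are relatively more frequent in the target corpus."""
    size_target = sum(target_counts.values())
    size_ref = sum(ref_counts.values())
    scored = []
    for word, freq in target_counts.items():
        ref_freq = ref_counts.get(word, 0)
        # Skip words that are not over-represented in the target corpus.
        if freq / size_target <= ref_freq / size_ref:
            continue
        scored.append((word, log_likelihood(freq, size_target, ref_freq, size_ref)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented toy counts: a small "criminology" sample against a larger reference.
target = Counter({"преступление": 30, "личность": 25, "и": 100, "в": 90})
reference = Counter({"преступление": 2, "личность": 10, "и": 1000, "в": 950, "экономика": 40})

for word, score in keywords(target, reference):
    print(word, round(score, 1))
```

Function words such as и and в drop out because they are no more frequent in the target than in the reference. The default `top_n=100` mirrors the 100 keywords per subject area mentioned above. Raw output of this kind would still need the manual “cleaning” step described on the slide.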


  • 10 readers from each of the 13 sub-corpora

  • Each text contains approximately 200 words

  • The readers may be used to train general academic vocabulary or discipline-specific vocabulary

  • Manually annotated

  • Freely available

Sample reader

  • Криминогенность личности представляет собой качественное выражение соотношения негативной и позитивной направленности личности. А преступление является объективным, реальным показателем криминогенности личности. Криминогенность можно рассматривать с двух позиций. Исходя из первой, «криминогенность рождается и умирает вместе с преступлением». Однако криминогенность можно рассматривать не только как результат, но и как процесс ее становления. Таким образом, можно выделить три стадии генезиса криминогенности личности преступника: Формирование криминогенности личности, которая в этот период совершает аморальные поступки и правонарушения неуголовного характера.


  • Focus on “receptive” not “productive” language skills

  • Grammar identification: our aim is for users to identify and understand the use of grammatical features, with our notes and tools, not to be able to construct them

  • Grammar forms were selected on the basis of their frequency in academic texts: participles, gerunds and passive constructions were introduced early; some points of grammar commonly covered in the first year of UG programmes were not included.

Grammar 2

  • The following information is included for each point of grammar:

    • an English-language commentary of how and for what purpose it is used;

    • information on what the form looks like (identification);

    • lists of other points of grammar that have the same form and notes on how to tell them apart (disambiguation);

    • an annotated list of common words within the category;

    • corpus examples and translations.

Example (imperfective gerunds)

  • Use: -ing forms: judging by his comments, I’d say that ...

  • Looks like: принимая, судя, опираясь

  • Common exceptions: будучи

  • Can be confused with: soft feminine nouns (Nom. Sing.) = неделя; hard feminine adjectives (Nom. Sing.) = интересная; soft masculine nouns (Gen. Sing.) = трамвая

  • Disambiguation: gerunds are very unlikely to be directly preceded by words ending in –ая or –ого; words ending in –а rarely follow gerunds (BUT принимая лекарства)
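The identification and disambiguation rules on this slide can be sketched as a small rule-based classifier. This is only an illustration of the heuristics as stated, not the project's automatic grammar classification: the function names and the three labels ("likely", "uncertain", "unlikely") are hypothetical, and the form test (ending in -я or -ясь, plus the listed exception будучи) is inferred from the slide's examples.

```python
import re

TOKEN_RE = re.compile(r"[а-яё]+")

# Exception listed on the slide that does not follow the usual form.
EXCEPTIONS = {"будучи"}


def looks_like_gerund(token):
    """Form test based on the slide's examples: принимая, судя, опираясь."""
    return token in EXCEPTIONS or token.endswith(("я", "ясь"))


def classify(sentence):
    """Label each gerund-shaped token using the slide's two context rules."""
    tokens = TOKEN_RE.findall(sentence.lower())
    results = []
    for i, token in enumerate(tokens):
        if not looks_like_gerund(token):
            continue
        previous = tokens[i - 1] if i > 0 else ""
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        if previous.endswith(("ая", "ого")):
            # Rule 1: a preceding word in -ая or -ого points to a noun.
            label = "unlikely"
        elif following.endswith("а"):
            # Rule 2: a following word in -а is rare after a gerund,
            # but not impossible (принимая лекарства).
            label = "uncertain"
        else:
            label = "likely"
        results.append((token, label))
    return results


print(classify("Судя по его словам, он прав."))
print(classify("Принимая лекарства, больной выздоравливает."))
print(classify("до первого трамвая"))
```

These two context rules alone do not filter out every look-alike: an adjective such as интересная at the start of a phrase would still be labelled "likely", so a learner (or a fuller tool) needs the other cues listed on the slide as well.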

Common forms (imperfective gerunds)

Reading texts with our tools

  • For texts that are available online or that have been digitised

  • The ReadingCorp tools allow users to annotate their texts according to vocabulary and grammar

  • Vocabulary highlights work for any text uploaded to the system, as the list of academic words is stable and our tools automatically classify texts and corpora according to keywords

  • Automatic grammar classification helps users identify or disambiguate parts of speech

  • Demo with “Space” corpus
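The idea behind vocabulary highlights can be sketched as follows: each word of an uploaded text is checked against a general academic list and a discipline-specific list, and matches are wrapped in markers. This is a minimal sketch under loud assumptions: the `highlight` function, the bracket markers and the two tiny word lists are invented for illustration, and matching is done on exact surface forms, whereas a real tool for Russian would need lemmatisation to catch inflected forms.

```python
import re

WORD_RE = re.compile(r"[А-Яа-яЁё]+(?:-[А-Яа-яЁё]+)*")


def highlight(text, general_academic, discipline_specific):
    """Wrap listed words in markers: [[...]] discipline-specific, {{...}} general academic."""

    def mark(match):
        word = match.group(0)
        key = word.lower()
        if key in discipline_specific:
            return "[[" + word + "]]"
        if key in general_academic:
            return "{{" + word + "}}"
        return word  # not on either list: leave untouched

    return WORD_RE.sub(mark, text)


# Hypothetical mini-lists, not the project's keyword lists.
general = {"является", "процесс"}
criminology = {"криминогенность", "преступление"}

sentence = "А преступление является объективным показателем."
print(highlight(sentence, general, criminology))
```

Checking the discipline-specific list first means a word that appears on both lists is marked as discipline-specific; the opposite priority would be equally easy to implement.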

Teaching methodology and materials

  • Initial corpus training (either one session over an afternoon or two shorter sessions)

  • Introduction to the Cyrillic alphabet (if necessary)

  • 1 class a week focusing on (1) guided reading and (2) hands-on vocabulary building exercises

  • Exercises are based around keywords

“Transferability” of resources

  • Tutors working with students whose research is in an area other than those covered by ReadingCorp may:

    • use our interface to create keyword lists and analyse texts

    • use the readers for general reading practice

    • access the RAC

    • use the grammar

    • use the keyword lists from the RAC

  • They will need to:

    • create keyword lists for the subject by building a small corpus

    • add their own examples to the material templates

How does a corpus approach address the CEELBAS issues?

  • Is/Does a corpus-based approach:

    • suitable for distance learning? 

    • cover contemporary research topics? 

    • cost-effective and sustainable? 

    • transferable to other languages and domains? 

    • cater for the needs of the individual student? 

    • help structure syllabi? 

    • allow ab-initio students to acquire the necessary reading skills to be able to effectively carry out their research?


  • Corpora go beyond the traditional course book and offer exciting possibilities for LSP learning and teaching

  • A corpus-based approach is particularly well-suited to training reading competence in specific domains

    • It makes the goal of reading and understanding authentic academic texts in Russian within a year a realistic objective

  • BUT will advances in machine translation and optical character recognition make specialised reading courses redundant? As machine translation becomes more reliable, as more material is digitised and made available online and as OCR technology becomes more accurate, will students need anything other than a scanner and Google Translate?