Tracking Language Development with Learner Corpora

Xiaofei Lu, CALPER 2010 Summer Workshop, July 12, 2010

Outline • Corpora and learner corpora • Graphic Online Language Diagnostic (GOLD)

Corpora and learner corpora • What is a corpus • Types of corpora • Corpus design and compilation • Corpus annotation • Corpus querying and analysis • Learner corpora and L2 development • Resources

What is a corpus? • Leech (1992): • an unexciting phenomenon, a helluva lot of text, stored on a computer • Sinclair (1991): • a collection of naturally-occurring language text, chosen to characterize a state or a variety of language • Sinclair (2004): • a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research

Types of corpora • General-purpose vs. specialized corpora • The British National Corpus • Michigan Corpus of Academic Spoken English • Native vs. learner corpora • International Corpus of Learner English • Monolingual vs. parallel & comparable corpora • The JRC-Acquis Multilingual Parallel Corpus • The English-Chinese Parallel Concordancer

Types of corpora (cont.) • Corpora representing one or diverse varieties • International Corpus of English • Synchronic vs. diachronic corpora • Spoken vs. written corpora

Corpus design • Purpose and type of corpus • Spoken/written; cross-sectional/longitudinal • External criteria for content selection • Communicative function of a text • Mode, medium, interaction, domain, topic • Representativeness, balance, size, sampling • Design of the BNC

Corpus design (cont.) • Encoding meaningful metadata information • Learner: L1, gender, program level, discipline … • Sample: date, mode, task, genre, rating … • Facilitates contrastive and longitudinal studies • MICASE speaker and transcript attributes
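As a rough illustration of how learner and sample metadata support contrastive and longitudinal studies, the sketch below stores a few attributes with each text and filters on them. The `Sample` class, the `select` helper, the field names, and the three toy records are all invented for this example; they are not taken from MICASE or any other corpus.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Sample:
    # Learner attributes (hypothetical field names)
    learner_id: str
    l1: str
    program_level: str
    # Sample attributes (hypothetical field names)
    collected: date
    mode: str
    task: str
    text: str


def select(samples, **conditions):
    """Return the samples whose attributes equal every given condition."""
    return [
        s for s in samples
        if all(getattr(s, key) == value for key, value in conditions.items())
    ]


# Toy records, invented for illustration only.
samples = [
    Sample("s01", "Chinese", "intermediate", date(2010, 5, 1), "written", "essay",
           "I like reading books."),
    Sample("s01", "Chinese", "intermediate", date(2010, 2, 1), "written", "essay",
           "I like read books."),
    Sample("s02", "Korean", "advanced", date(2010, 2, 1), "spoken", "interview",
           "I have read many books."),
]

# Longitudinal view: one learner's samples in chronological order.
longitudinal = sorted(select(samples, learner_id="s01"), key=lambda s: s.collected)

# Contrastive view: all written samples, regardless of learner.
written = select(samples, mode="written")
```

Keeping the metadata next to each text is what makes it possible to pull out subsets such as "all samples from one learner" or "all spoken samples" without touching the texts themselves.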

Corpus annotation • Why annotate • Levels of corpus annotation • Difficulties for corpus annotation • Standards and encoding

Why annotate • For linguistic research • Allow more effective corpus searches • For natural language processing • Spelling and grammar checking • Machine translation

Levels of corpus annotation • Sentence and word segmentation • Part-of-speech (POS) tagging and lemmatization • Syntactic parsing • Semantic, pragmatic, and discourse tagging • Learner corpora: error annotation • Project-specific annotation

Difficulties for corpus annotation • Ambiguity • I saw a pig with binoculars. • Problems for tagging, parsing, & WSD • Unknown words • Identification • POS tagging • Semantic annotation

Standards and encoding • Useful standards • Separable • Documentation • Linguistically consensual • Compatibility with existing standards • Encoding • Simple encoding: present_JJ • XML-style: <w type="JJ">present</w>
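The two encodings on this slide carry the same information, so one can be derived from the other. The sketch below converts the simple `word_TAG` form into the XML-style form; the function name `tagged_to_xml` is made up for this example, and it assumes every whitespace-separated token carries a tag after its last underscore.

```python
from xml.sax.saxutils import escape, quoteattr


def tagged_to_xml(tagged_text):
    """Convert 'word_TAG' tokens into <w type="TAG">word</w> elements."""
    elements = []
    for token in tagged_text.split():
        # Split on the last underscore so words that contain "_" stay whole.
        word, _, tag = token.rpartition("_")
        # escape/quoteattr protect characters such as & and < in the output.
        elements.append(f"<w type={quoteattr(tag)}>{escape(word)}</w>")
    return " ".join(elements)


print(tagged_to_xml("the_DT present_JJ situation_NN"))
```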

Corpus querying and analysis • Using windows- or web-based software • Good for processing raw corpora • Word frequency, concordances, lexical bundles, and keyword lists • Examples: AntConc and GOLD • Using natural language processing tools • Good for processing annotated corpora • Extracting occurrences of grammatical patterns • Examples: Stanford parser and Tregex

Resources • Books and journals • Hunston (2002): Corpora in Applied Linguistics • McEnery et al. (2006): Corpus-Based Language Studies • International Journal of Corpus Linguistics • Corpus Linguistics and Linguistic Theory • Corpora • Websites and mailing lists • Bookmarks for corpus-based linguists • Linguistic Data Consortium • The Corpora List • Stanford Natural Language Processing Group

Learner corpora and L2 development • Samples from the same students at different times • Did (targeted) language development take place? • Was a particular pedagogical intervention effective? • Samples from different students • In what areas do students show different levels of development? • What factors affect students’ language development?

Graphic Online Language Diagnostic • A free online tool for teachers to assess their students’ language development • Developed at CALPER, Penn State, funded by DOE • Project co-directors: Xiaofei Lu and Michael McCarthy • Teachers can use GOLD to • Compile, upload, and manage their own corpora • Share corpora with each other • Search and analyze corpora • Demonstration

Corpus compilation • A user can compile a corpus by • Directly compiling and uploading an XML file • Using the easy-to-use guided XML creation interface • An uploaded corpus can be easily managed • Documents can be added or deleted • The whole corpus can be deleted • Content and metadata of individual documents can be easily accessed

Corpus sharing • GOLD facilitates easy data sharing • A corpus may be set to be • Private, shared, or public • The corpus owner may give other users the right to • View, add, edit, or delete corpora • Demonstration

Basic corpus information • Word count • Alphabetic or numeric order • Can be downloaded as a text file • Corpus and document statistics • Mean sentence length • Mean word length • Type-token ratio • Demonstration
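The three statistics listed on this slide are simple ratios. The sketch below shows one way to compute them, under naive assumptions that are not GOLD's actual method: sentences end at `.`, `!`, or `?`, words are runs of word characters, and the text is non-empty. The function name `corpus_stats` is hypothetical.

```python
import re


def corpus_stats(text):
    """Compute mean sentence length, mean word length, and type-token ratio."""
    # Naive segmentation; assumes the text contains at least one word.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+(?:'\w+)?", text.lower())
    return {
        # Average number of words per sentence.
        "mean_sentence_length": len(words) / len(sentences),
        # Average number of characters per word.
        "mean_word_length": sum(len(w) for w in words) / len(words),
        # Distinct word forms (types) divided by running words (tokens).
        "type_token_ratio": len(set(words)) / len(words),
    }


stats = corpus_stats("The cat sat. The cat ran away!")
print(stats)
```

Type-token ratio is sensitive to text length, so it is most meaningful when the samples being compared are of similar size.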

Corpus search • Select one or more corpora to search • Specify key words or phrases • May use the wildcard character, e.g. book* • Specify contexts • Size of context window • Context words and their positions • Specify metadata conditions
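A wildcard search with a context window can be sketched as a small keyword-in-context (KWIC) routine. This is an illustration only, not GOLD's implementation: the function name `kwic` is made up, tokenization is naive, and matching relies on Python's `fnmatch`, which also treats `?` and `[...]` as special characters.

```python
import re
from fnmatch import fnmatchcase


def kwic(text, pattern, window=3):
    """Return (left context, keyword, right context) for each matching token."""
    tokens = re.findall(r"\w+(?:'\w+)?", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        # "*" in the pattern matches any run of characters, so "book*"
        # matches "book", "books", "booked", and so on.
        if fnmatchcase(token, pattern):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


sentence = "She bought two books and booked a table at the bookstore."
for left, key, right in kwic(sentence, "book*", window=2):
    print(f"{left:>15}  {key:<10} {right}")
```

Because each hit is a tuple, the results can be re-sorted by keyword or by either context, for example with `sorted(hits, key=lambda h: h[2])` to order by right context.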

Corpus search results • Display of search results • Sortable KWIC display of search results • Sortable graphic display of search results • Demonstration

Lexical bundle/collocation search • Procedure • Select one or more corpora to search • Specify search word • Specify contexts • Specify metadata conditions • Search results • Sortable list of n-grams found in selected corpora • Demonstration
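The idea behind a lexical bundle search can be sketched by counting the n-grams that contain a given search word. The function name `bundles` and the sample text are invented for this example, and the sketch lets n-grams run across sentence boundaries, which a real tool would probably avoid.

```python
import re
from collections import Counter


def bundles(text, search_word, n=3):
    """Count the n-grams that contain search_word, most frequent first."""
    tokens = re.findall(r"\w+(?:'\w+)?", text.lower())
    # Slide a window of n tokens across the text.
    ngrams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(gram for gram in ngrams if search_word in gram)
    return counts.most_common()


text = "On the other hand, it works. On the other side, it fails."
for gram, count in bundles(text, "other"):
    print(count, " ".join(gram))
```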

Summary of features • Differences from other online tools • Can create, share, and search multiple corpora • Can easily search subsets of data • Can work with any language • Summary of corpus analysis functions • Word list • Corpus and document statistics: mean sentence length, mean word length, type-token ratio • Corpus search and collocation search

Sample questions to ask • With data from an individual student, one can either describe or track development in • Patterns of usage of words and phrases – frequency, underuse, overuse, etc. • Lexical and syntactic complexity • Appropriate usage of words and phrases in context • Patterns of usage of lexical bundles

Sample questions to ask (cont.) • With data from different (groups of) students, one can compare similarities or differences among them in terms of • Patterns of usage of words and phrases – frequency, underuse, overuse, etc. • Lexical and syntactic complexity • Appropriate usage of words and phrases in context • Patterns of usage of lexical bundles
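Overuse and underuse are usually judged on normalized rather than raw frequencies, since the texts being compared differ in length. The sketch below compares the rate of one word per 1,000 tokens in two groups of texts. The function names and the two toy texts are invented for illustration, and the raw difference it reports says nothing about statistical significance.

```python
import re


def per_thousand(text, word):
    """Relative frequency of word per 1,000 tokens (assumes non-empty text)."""
    tokens = re.findall(r"\w+(?:'\w+)?", text.lower())
    return 1000 * tokens.count(word.lower()) / len(tokens)


def compare(group_a, group_b, word):
    """Compare the normalized frequency of word in two groups of texts."""
    rate_a = per_thousand(group_a, word)
    rate_b = per_thousand(group_b, word)
    # A positive difference means group A uses the word more often.
    return {"a": rate_a, "b": rate_b, "difference": rate_a - rate_b}


# Toy texts, invented for illustration only.
group_a = "It was very good and very nice and very big."
group_b = "It was a very good day with a nice dinner afterwards."
print(compare(group_a, group_b, "very"))
```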

Future enhancements • Corpora for benchmarking • Multilingual natural language processing • Suggestions on desirable functions welcome
