Investigating Chinese Learner English

Centre for Linguistics and Applied Linguistics,

Guangdong University of Foreign Studies

Gui Shichun


  • The corpus consists of one million words of written compositions by 5 types of learners: senior middle-school, tertiary college English (band 4), tertiary college English (band 6), tertiary majors in English (1st and 2nd years), tertiary majors in English (3rd and 4th years). The corpus is annotated with grammatical tags (automatically) and error tags (manually).

  • It is available at achievement/ Achievement1. htm.

Areas of Investigation

  • Leech (1998) raises two specific questions in connection with the study of learner language:

    • What are the particular areas of overuse, underuse and error which native speakers of language A are prone to in learning target language T, as contrasted with native speakers of languages B, C, D . . . ?

    • What, in general, is the proportion of non-native target language behaviour (overuse, underuse, error) peculiar to native speakers of language A, as opposed to such behaviour which is shared by all learners of the language, whatever their mother tongue?

  • Contrastive study must be very careful, because the corpora under investigation are based on different types of language performance. One of the key issues is to identify the context-free variables, e.g. functional words and some of the most frequently used notional words. We believe that an annotated learner corpus is useful in the following ways:

  • Identifying the words and structures that are typically underused or overused in the learner corpus;

  • Identifying the kinds of error learners at different levels are likely to commit;

  • Predicting the language proficiency of the learners;

  • Providing diagnostic information to both the teachers and the learners.

Comparison of grammatical tags

In our POS tagging program, we used the same 133×133 matrix of tag-transition frequencies, and had CLEC grammatically tagged automatically. Then we tried to compare the grammatical tags of CLEC, Brown, and LOB.
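As a rough illustration of what a matrix of tag-transition frequencies holds, the sketch below counts how often one tag is immediately followed by another. It is not the tagging program itself: the function names, the three-tag toy tagset and the toy sentences are all assumptions made for the example, whereas the scheme described above has 133 tags.

```python
from collections import Counter


def tag_transition_counts(tagged_sentences):
    """Count how often tag `a` is immediately followed by tag `b`."""
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        # Pair each tag with the tag that follows it in the same sentence.
        for prev_tag, next_tag in zip(tags, tags[1:]):
            counts[(prev_tag, next_tag)] += 1
    return counts


def to_matrix(counts, tagset):
    """Lay the counts out as a square table (row = first tag, column = second tag)."""
    return [[counts[(a, b)] for b in tagset] for a in tagset]


# Toy data with a hypothetical three-tag tagset (an assumption for illustration).
toy_sentences = [
    [("the", "AT"), ("student", "NN"), ("writes", "VBZ")],
    [("the", "AT"), ("teacher", "NN"), ("reads", "VBZ"), ("the", "AT"), ("essay", "NN")],
]
toy_tagset = ["AT", "NN", "VBZ"]

transition_counts = tag_transition_counts(toy_sentences)
transition_matrix = to_matrix(transition_counts, toy_tagset)
```

With a 133-tag scheme the same layout would give a 133×133 table, one row and one column per tag.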

  • The native corpora (Brown and LOB) are fairly consistent in terms of their grammatical tagging.

  • Chinese learners used more pronouns, but fewer determiners, prepositions, and numerals. The use of more pronouns and fewer numerals reflects the difference in subject matter between the learner corpus and the native speaker corpora, because what the learners have written is related to their personal and school life and activities. But the use of fewer determiners and prepositions may have something to do with the learners' problems in writing.

  Another step forward is to study each type in the tagging scheme in greater detail. Let's look at the use of determiners as an example.

Some observations of Chinese learners' use of determiners

  • Chinese learners used fewer determiners, but the total frequencies of ST6 learners were closer to those of the native speakers.

  • Chinese learners used fewer articles; the, no, a, and any were the most underused determiners.

  • A tendency can be observed: the more proficient the learners, the closer their use of some determiners is to that of the native speakers. Examples are quite, rather, half, all, both, these, those, many, much, next, former, and other. The last five are post-determiners, which were used much more often by native speakers. They can be considered discourse markers. We hypothesize that they can be used for text identification, such as the automatic scoring of learners’ compositions.

Under-use of the learner lexicon

The most frequently used lexical items are more or less context-free, and they are a suitable place to start our analysis. They include:

  • Use of most functional words like determiners and prepositions;

  • Some of the modal or auxiliary verbs;

  • Some of the polysemous words like go, make, take, great, risk, etc.;

  • Some pronouns, especially personal pronouns.

Overuse of Modal Verbs (can, may, must, should)

  As observed by Biber et al., can (ability or permission) and must (logical necessity) are much more common in conversation, so the overuse of can and must can be taken as an indication of Chinese learners' writing style: they make no distinction between the styles of spoken and written English. The materials of CET learners were collected mostly from CET test papers, yet they displayed the greatest number of uses of can, must, and should. Chinese learners tend to write down what they would say, though they may not be well versed in speaking, as is indicated by their underuse of could, have to, had better and have got to.

Comparing keyness of CLEC and FLOB

  By using the keyness programme of WordSmith, we are able to identify underuse in the learner corpus in terms of keyness, which is computed with the classic chi-square test of significance with Yates' correction for a 2 × 2 table. For better estimation of keyness, Ted Dunning's log-likelihood test is used when contrasting long texts or a whole genre against the reference corpus. The higher the chi-square value, the greater the difference between the frequencies in the two corpora under observation.
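The two statistics named above can be sketched for a single word as follows. This is a minimal illustration and not WordSmith's own code: the function names and the toy counts are assumptions, and the log-likelihood here sums over all four cells of the 2 × 2 table, which is one common formulation. The table has the word's frequency in the learner corpus (a) and in the reference corpus (b) in the first row, and the remaining tokens of each corpus (c, d) in the second row.

```python
import math


def chi_square_yates(a, b, c, d):
    """Chi-square with Yates' continuity correction for a 2 x 2 table."""
    n = a + b + c + d
    # Yates' correction subtracts n/2 from |ad - bc|, never going below zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    return n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def log_likelihood(a, b, c, d):
    """Log-likelihood ratio (G2) summed over the four cells of the table."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / expected)
    return 2 * g2


def keyness(freq_study, size_study, freq_ref, size_ref):
    """Compare one word's frequency in a study corpus against a reference corpus."""
    a, b = freq_study, freq_ref
    c, d = size_study - freq_study, size_ref - freq_ref
    return {
        "chi_square": chi_square_yates(a, b, c, d),
        "log_likelihood": log_likelihood(a, b, c, d),
        # Underuse: the word is relatively rarer in the study (learner) corpus.
        "underused": freq_study / size_study < freq_ref / size_ref,
    }


# Toy counts (hypothetical): 20 hits in 100 tokens versus 10 hits in 100 tokens.
example = keyness(20, 100, 10, 100)
```

Neither statistic carries a sign, so the direction of the difference (overuse or underuse) has to be read from the relative frequencies, as the "underused" flag does here.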

Fewer third person pronouns

This is the result of transfer from Chinese, because in modern colloquial Chinese the third person pronouns do not distinguish gender. There is no underuse of first and second person pronouns. ST3-4 learners show wider discrepancies, because their compositions are mainly thematic writing.

Fewer passive voice constructions

  • This is shown by the underuse of “been” and “by”, and partially by that of “was” and “were”. The ST3-4 group and the ST5-6 group seem to follow the same tendency.

Fewer relative clauses

  • Chinese learners tend to use fewer relative clauses, as is shown by the underuse of wh-words. The discrepancies for ST5-6 are smaller, showing that these learners are closer to the native speakers.

Contrastive analysis of risk and its synonyms across a few corpora

Using more danger than risk

  While the frequencies of risk, danger, threat, and hazard are fairly consistent in the native speaker corpora, the performance of Chinese learners is quite different.

    • Danger is a more generic term. The following errors are produced as a result of the generic use of “danger”:

      Fake furniture brings danger to people. (It is risky buying fake furniture.)

      Water is facing the danger of shortage. (We are facing the threat of water shortage.)

    • Their knowledge of “risk” is quite limited. They know how to use take the risk (8), at the risk (3) and to risk (6), whereas native speakers say: avoid/carry/eliminate/ignore/increase/involve/give/reduce/run/worth/lack of/the risk; conventional/maximum/no/some/suicide/own/unnecessary/hazard/with/without/risk.

    • Chinese learners do not know how to use high risk, which is used quite often in the native speaker corpora.
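Collocate lists like those above can be drawn up by counting the words that occur immediately to the left of a node word. The following is a minimal sketch under simple assumptions: the tokenizer is a plain regular expression, the function name is hypothetical, and the sample text is invented for the example rather than taken from any of the corpora.

```python
import re
from collections import Counter


def left_collocates(text, node, span=1):
    """Count the words found within `span` positions to the left of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Take up to `span` tokens that precede this occurrence of the node.
            for left_word in tokens[max(0, i - span):i]:
                counts[left_word] += 1
    return counts


# Invented sample text (an assumption for illustration only).
toy_text = (
    "They take the risk. We must reduce the risk and avoid high risk plans. "
    "He ran a high risk."
)
risk_left_1 = left_collocates(toy_text, "risk", span=1)
risk_left_2 = left_collocates(toy_text, "risk", span=2)
```

Running the same count over a learner corpus and a native speaker corpus would show which collocates (for example high) are missing from the learners' writing.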

Analyzing learner errors: The Cognitive Model

  • We use “error” as a cover term for all ways of being wrong as an FL learner. Errors are the results of “uncertainty” in language performance, and there are various kinds of uncertainty that can be traced back to cognition:

    • False analogy: books, news > knowledges, informations

    • Incomplete application of rules: development > advantagement

    • Redundancy: “这是一间三层高的建筑” (literally “this is a three-story-high building”) > it was a three-story-tall building

    • Overgeneralization: entered the classroom > returned the classroom

  • Verbal behaviour (errors as well as linguistic structures) can be considered as an emergence process, as a result of competition of cues.

  • To set up our cognitive framework of error analysis we make use of only those errors whose frequencies are well above 1% of the total. There are altogether 21 error types.

  • Errors can be divided into several levels that are equivalent to the processes of lexicalization → syntacticalization → relexicalization (Skehan, 1998).

  Lexical perceptual level, also known as "substance errors" (James, 1998), and defined by MacWhinney as the level that "involves the acquisition of basic lexical structures in small areas of cortex called 'local maps'". They are related to perceptual representations, especially to memory, such as "memory failure" or "memory distortion". Typically these errors can be identified at single-word level, as

    • spelling or number errors (great > graet; information > informations),

      or by looking at a word's close neighbors, as

    • absence of articles or prepositions (the moon is ∧ brightest; I dressed myself in ∧ hurry; I sat back ∧ my chair).

  Lexico-grammatical (or lexical grammatical) level. Misconception of target language system. When looking at the errors of our learners, it is very difficult to isolate grammar and lexis into separate categories, because grammar does not exist on its own. James defines it as "text-level errors"; whereas MacWhinney chooses to call it the level that "involves the interaction between lexical structures in terms of 'lexical groups'". Typically these errors can be identified at the inter-word level, by looking at the word and its neighbors.

    • using the wrong parts of speech (POS errors: It is not difficulty that we can find…)

    • wrong word (substitution errors: “如果您碰到难题” (“if you run into a difficult problem”) > If you match difficult problem; People take (= pay) more attention to it)

    • wrong collocates (They must listen to the lesson more carefully.);

    • verb agreement (People argues that euthanasia or mercy killing is humane.)

    • reference (My aunt came to my home with his son.).

  Syntactic level. Errors can be identified in a broader context, at the sentential level. James chooses to call them "discourse-level errors", but we propose to reserve the word "discourse" for another, higher level. L2 learners may often produce grammatical sentences that sound foreign (Pawley and Syder, 1983). MacWhinney defines it as the level that "involves the processing of syntactic information across longer neural distances in 'functional neural circuits'". Syntactic errors range from:

  • capitalization (…he learned English and Russian and Wrote the Civil War in France.),

  • punctuation (When playing football or basketball. You might be using 400 calories an hour.), to

  • run-on sentences (If I am not famous, it doesn’t matter, I don’t mind this.),

  • fragmentary sentences (As they do more exercises and often think deeply.) and

  • structural deficiency (During I spent my holidays in Beijing about ten years ago,…).

Figure 9 The Cognitive Model

  Confirmatory factor analysis was conducted using LISREL 8.50. It shows clearly that there are 3 factors, which correspond to the 3 categories defined above. Path analysis shows that all the parameters (values of λ) of the hypothesized paths are significant except the one for run-on sentences (λ = 0.28, not significant).

Correspondence Analysis (learner types by error types)

Some General Remarks

  On the whole, identifying errors at 3 levels seems to work well in our cognitive model. So far we have not covered errors at the discourse level, because:

    • It is difficult to set down the standards for “native-like selection” as defined by Pawley and Syder (1983);

    • It is even more difficult for Chinese markers of errors to observe the standards.

  The grouping of errors is not as clear-cut as we had thought. Very often the same type of error can be put into different categories, or can occur at 3 different levels, depending on the situation. We can only say that the grouping follows the main tendency.

  • At every level, language transfer seems to play an important role. This is because the adult learners have set up their L1 (more complete) linguistic system and are in the process of setting up another linguistic system (rather incomplete). As mature learners, when they want to express their complex thinking, they often fall back on using the linguistic system that is more familiar to them.

  • Occurrences of errors depend very much on the writing task and the learner’s certainty of fulfilling the task. They may not be an indication of the language proficiency of the learners. CET learners tend to commit more lexico-grammatical errors because their data were collected mainly from CET compositions.

  • This points to the necessity of inclusion of more learner data, so that we can have a more balanced collection of error types for further investigation.

