
Research on vocabulary assessment


Presentation Transcript


  1. Research on vocabulary assessment Presented by Fanny Chang g99120009 May 26th, 2011

  2. Introduction • The focus of this chapter is not so much on understanding the processes of vocabulary learning as on measuring the level of vocabulary knowledge and ability that learners have reached. • Language testing is concerned with the design of tests to assess learners for a variety of practical purposes that can be summarized under labels such as placement, diagnosis, achievement and proficiency.

  3. Cont. • However, in practice the distinction between second language acquisition research and language assessment is difficult to maintain consistently, because: • on the one hand, language testing researchers have paid relatively little attention to vocabulary tests; • on the other hand, second language acquisition researchers working on vocabulary acquisition have often needed to develop tests as an integral part of their research design.

  4. Objective testing • Objective tests are ones in which the learning material is divided into small units, each of which can be assessed by means of a test item with a single correct answer that can be specified in advance, as in the multiple-choice format. • The tests are objective in the sense that they can be scored without requiring any judgment by the scorer as to whether an answer is correct or not. • In his book Measured Words, Spolsky (1995) explains how psychometrics, the science of mental measurement, gave rise to objective testing after WWI. The new tests progressively displaced traditional essay examinations from the 1930s on.

  5. Cont. • How did vocabulary become popular as a component of objective language tests? • Words could be treated as independent linguistic units with a meaning expressed by a synonym, a short defining phrase or a translation equivalent. • There was a great deal of work in the 1920s and 1930s to prepare lists of the most frequent words in English, as well as other words that were useful for the needs of particular groups of students.

  6. Cont. • Multiple-choice vocabulary tests proved to have excellent technical characteristics, in relation to the requirements of psychometric theory. • Rather than simply measuring vocabulary knowledge, objective vocabulary tests seemed to be valid indicators of language ability in a broad sense. • As Anderson and Freebody noted, one of the most consistent findings in L1 reading has been the high correlation between tests of vocabulary and reading comprehension.

  7. Multiple-choice vocabulary items • Limitations • They are difficult to construct, and require laborious field-testing, analysis and refinement. • The learner may know another meaning for the word, but not the one sought. • The learner may choose the right word by a process of elimination, and has in any case a 25 per cent chance of guessing the correct answer in a four-alternative format. • Items may test students’ knowledge of distractors rather than their ability to identify an exact meaning of the target word. • The learner may miss an item either for lack of knowledge of words or for lack of understanding of syntax in the distractors. • This format permits only a very limited sampling of the learner’s total vocabulary.
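The 25 per cent guessing chance noted above is sometimes handled with the standard correction-for-guessing formula, R − W/(k − 1). This is a generic illustration of that arithmetic, not a procedure described in the chapter:

```python
def correct_for_guessing(right, wrong, n_options=4):
    """Standard correction-for-guessing: score = R - W/(k-1).

    A blind guesser on four-option items gets about one item right
    for every three wrong, so subtracting W/3 drives the expected
    score of pure guessing to zero. Omitted items are not counted
    as wrong.
    """
    return right - wrong / (n_options - 1)

# A pure guesser on 100 four-option items expects 25 right, 75 wrong:
# 25 - 75/3 = 0, i.e. no credit for chance performance.
```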

  8. Cont. • Can we identify variables that influence the difficulty of multiple-choice vocabulary items for second language learners? • Goodrich (1977) undertook a study that can be seen as related to Wesche and Paribakht’s (1996) fourth criticism. Goodrich focused on the relative effectiveness of eight types of distractor in multiple-choice items. • Goodrich compared a version of his test containing only effective distractor types with one loaded with ineffective types. • He found some evidence that the former version was a better measure of the learners’ proficiency.

  9. Cont. • What features of the target words influence the difficulty of multiple-choice vocabulary items? • From a number of separate analyses, Perkins and Linnville found that several features functioned as significant predictors. The most common ones were frequency, number of syllables and abstractness.

  10. Validating tests of vocabulary knowledge • Writers on first language reading research over the years have pointed out that, in addition to numerous variations of the multiple-choice format, a wide range of test items and methods have been used for measuring vocabulary knowledge. • However, as Schwartz puts it, ‘there does not appear to be any rationale for choosing one measurement technique rather than another’. • A number of early studies addressed these issues by administering two or more vocabulary tests to a group of students and then comparing the results by means of correlation.

  11. Measuring vocabulary size • Reading researchers have long been interested in estimating how many words are known by native speakers of English as they grow from childhood through the school years to adult life. Reliable estimates of the number of words acquired by children at different age levels would provide a better basis for decisions about how many new words should be introduced in each unit of a learning programme. • Estimates of native-speaker vocabulary size at different ages provide a target for the acquisition of vocabulary by children entering school with little knowledge of the language used as the medium of instruction. Cummins (1981) analyzed the vocabulary-test results of foreign-born students in the Toronto school system and found that those who arrived at the age of six or older took five to seven years to achieve scores that were comparable to those of native-born students at their grade level.

  12. Cont. • For international students, the focus of vocabulary research shifts to the question of what minimum number of words they need to know to cope with the language demands of their studies. Sutarsyah, Nation and Kennedy (1994) found that knowledge of 4000 to 5000 words would be a prerequisite for understanding an undergraduate economics textbook written in English. • In many countries where English is a foreign language, university students are taught through the medium of the national language but they need to read English texts related to their field of study (e.g. Indonesia and Thailand). Scholars work on the assumption that, in order to read independently, learners should know at least 95% of the running words in a text. Nation and Laufer argue that a vocabulary of at least 3000 word families is necessary to achieve this level of coverage.

  13. Cont. • If we accept that vocabulary size has significant uses as a concept, the question is how to measure it. • What counts as a word? • How do we choose which words to test? • How do we find out whether the selected words are known?

  14. What counts as a word? • The larger estimates of vocabulary size for native speakers tend to be calculated on the basis of individual word forms, whereas more conservative estimates take word families as the units to be measured. • The key question here is whether, having learned the meaning of a base word, the learner is able to work out what a derived form means when it is encountered in context. • Nagy and Anderson (1984) faced this difficulty in their study to estimate how many words American children are exposed to in the books that they read in school.

  15. Cont. • Bauer and Nation (1993) outline an alternative approach to defining membership of word families, using criteria such as the regularity, productivity and frequency of the prefixes and suffixes that are added to base words. • Therefore, the identification of the units to be counted is an important step in research on vocabulary size.

  16. How do we choose which words to test? • It is impossible to test all the words that the native speaker of a language might know. • Researchers have typically started with a large dictionary and then drawn a sample of words representing 1 per cent (1 in 100) of the total dictionary entries. • The next step is to test how many of the selected words are known by a group of subjects. • Finally, the test scores are multiplied by 100 to give an estimate of the total vocabulary size.
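The sampling procedure on this slide can be sketched in a few lines of Python. The function names and the checklist-style `test_knowledge` callback are illustrative assumptions, not part of the original studies; the random sampling sidesteps the every-sixth-page bias criticized on the next slide:

```python
import random

def estimate_vocabulary_size(dictionary_entries, test_knowledge,
                             sample_rate=0.01, seed=0):
    """Estimate total vocabulary size from a 1% dictionary sample.

    dictionary_entries: list of headwords, assumed already reduced to
    one entry per word family (otherwise the estimate is inflated).
    test_knowledge: callable word -> bool, e.g. a checklist-test result.
    """
    rng = random.Random(seed)
    n_sample = max(1, int(len(dictionary_entries) * sample_rate))
    # Random sampling, rather than taking the first word on every
    # sixth page, avoids overrepresenting very common words.
    sample = rng.sample(dictionary_entries, n_sample)
    known = sum(1 for w in sample if test_knowledge(w))
    # Scale the sample proportion back up to the whole dictionary
    # (with a 1-in-100 sample this is the "multiply by 100" step).
    return round(known / n_sample * len(dictionary_entries))
```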

  17. Cont. • As Nation (1993b) pointed out in some detail, there are numerous traps for unwary researchers. • Dictionary headwords are not the most suitable sampling units. A single word family may have multiple entries in the dictionary, so an estimate of vocabulary size based on headwords would be an inflated one. • Choosing the first word on every sixth page will produce a sample in which very common words are overrepresented. • There are technical questions concerning the size of sample required to make a reliable estimate of total vocabulary size.

  18. How do we find out whether the selected words are known? • The following test formats are commonly used to find out whether the selected words are known. • multiple-choice items of various kinds • matching of words with synonyms or definitions • supplying an L1 equivalent for each L2 target word • the checklist (or yes-no) test, in which test-takers simply indicate whether they know the word or not

  19. Assessing quality of vocabulary knowledge • Whatever the merits of vocabulary-size tests, one limitation is that they can give only a superficial indication of how well any particular word is known. • The test items could not show whether additional, derived or figurative meanings of the target word were known, and it was quite possible that the children had learned to associate the target word with its synonym without really understanding what either one meant. • Dolch and Leeds designed items that would measure what they called ‘depth of meaning’ of target words.

  20. Examples of ‘depth of meaning’ • A cow is an animal that a. is found in zoos b. is used for racing c. gives milk d. does not have calves • A disaster is ruin that happens a. suddenly b. within a year’s time c. to all people d. gradually

  21. How to conceptualize it? • Henriksen (1999) proposed that we should recognize three distinct dimensions of vocabulary knowledge: 1. partial-precise knowledge 2. depth of knowledge 3. receptive-productive knowledge

  22. How to measure it? • A common assessment procedure for measuring quality of vocabulary knowledge is an individual interview with each learner, probing how much they know about a set of target words. • Verhallen and Schoonen (1993) wanted to elicit all aspects of the target word meaning that the bilingual and monolingual Dutch children might know, so they asked a whole series of questions: • What does [book] mean? • What is a [book]? • How would you explain what a [book] is? • What do you see if you look at a [book]? • What kinds of [book] are there? • What kind of thing is a [book]? • What can you do with a [book]?

  23. The construct validity of vocabulary tests • Construct refers to the particular kind of knowledge or ability that a test is designed to measure. 1. After the heavy rain, many parts of the city were_____. a. flooded b. washed c. drowned d. watered 2. At last the climbers reached the s_____ of the mountain. 3. We could see the place ______ she had the accident. a. which b. where c. whether d. what 4. It was difficult to play on the wet field. Playing _______________________.

  24. Cont. • It is a well-established finding in testing research that the choice of test item to assess a particular skill or ability has an influence on the scores obtained. • Two major sources of influence on test scores: • the knowledge or ability represented by the construct • the testing task

  25. Cont. • Campbell and Fiske (1959) developed a methodology known as multitrait-multimethod (MTMM) construct validation, which provides a way of evaluating separately the contributions of traits and methods to test scores. • Two studies set out to investigate whether it was possible to distinguish statistically between knowledge of vocabulary and of grammar. • Both studies were unable to show that vocabulary or grammar existed as a separate construct (p. 97).

  26. The role of context • In the early years of objective testing, many vocabulary tests presented the target words in isolation, in lists or as the stems of multiple-choice items, treating them as pure measures of vocabulary knowledge. • One practical difficulty with testing vocabulary in isolation was recognized early on: a word can have different meanings and be used as more than one part of speech. • One approach which has generally been recommended in the handbooks on language testing was to present the word in a short phrase or sentence to cue the intended usage.

  27. Cont. • Stalnaker and Kurath (1935) compared two methods of testing knowledge of German vocabulary. • Words in isolation: 1. bekommen 1-become 2-arrive 3-accept 4-escape 5-receive 2. versuchen 1-attempt 2-search 3-request 4-conceal 5-visit • Words in context: Ein Mann hatte drei erwachsene Söhne. Diese arbeiteten fast nie, obgleich der Vater ihnen befohlen hatte, ihr eigenes Brot zu verdienen, und böse wurde, wenn sie nicht auf ihn achteten… (erwachsene ________; fast _______; obgleich _________; befohlen __________; verdienen _________; böse ________; achteten _________) • The researchers found that the two tests were equally valid measures of essentially the same ability. • The implication was that there was no real advantage in testing words in context.

  28. Cloze tests as vocabulary measures • A standardized cloze test consists of one or more reading passages from which words are deleted according to a fixed ratio (e.g. every seventh word). • One modified version is the selective-deletion (or rational) cloze, where the test-writer deliberately chooses the words to be deleted, preferably according to principled criteria. • A second modification is the multiple-choice cloze. In this case, each deleted word is incorporated into a multiple-choice item and, instead of writing in the word, the test-takers have to choose which of the three or four options is the one that fills the blank. • A third alternative is the C-test, in which a series of short texts are more radically mutilated by deleting the second half of every second word.
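The fixed-ratio cloze and the C-test deletion rules described above can be sketched as follows. This is a simplified illustration: real C-tests typically leave the first and last sentences of each text intact, and deletion conventions vary between test-writers:

```python
def fixed_ratio_cloze(text, n=7):
    """Standard cloze: delete every nth running word, leaving a blank."""
    words = text.split()
    out = []
    for i, w in enumerate(words, start=1):
        out.append("_____" if i % n == 0 else w)
    return " ".join(out)

def c_test(text):
    """C-test: delete the second half of every second word.

    Single-letter words are left intact; for words of odd length the
    larger half is kept (one common convention, assumed here).
    """
    words = text.split()
    out = []
    for i, w in enumerate(words, start=1):
        if i % 2 == 0 and len(w) > 1:
            keep = (len(w) + 1) // 2  # keep the first half, rounding up
            out.append(w[:keep] + "_" * (len(w) - keep))
        else:
            out.append(w)
    return " ".join(out)
```

The selective-deletion (rational) cloze differs only in that the test-writer hand-picks which words to blank out instead of applying a fixed ratio.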

  29. The standard cloze • Simple correlations are not an adequate means of establishing what a test is measuring, especially when the correlation is a moderate one (e.g. in the range of 0.50 to 0.80). • Tests differ not only in what they are designed to measure (the trait) but also in the task that they set the test-taker (the method).

  30. The rational cloze • Alderson (1979) found that a single text could produce quite different tests depending on which words were deleted (e.g. every eighth word rather than every sixth). • Alderson also obtained evidence that most cloze blanks could be filled by referring just to the clause or the sentence in which they occurred. • This led him to the view that ‘perhaps the principle of randomness needs to be abandoned in favor of the rational selection of deletions, based upon a theory of the nature of language and language processing’.

  31. The multiple-choice cloze • Porter (1976) and Ozete (1977) argued that the standard format requires writing ability, whereas the multiple-choice version makes it more a measure of reading comprehension. • Jonz (1976) pointed out that a multiple-choice cloze could be marked more objectively because it controlled the range of responses that the test-takers could give. • One further advantage, exploited by Bensoussan (1983) and Bensoussan and Ramraz (1984), was the opportunity offered by the multiple-choice format to create items where more than one word, and even as much as a whole sentence, was deleted from the original text.

  32. The C-test • The C-test, in which a series of short texts is prepared for testing by deleting the second half of every second word, may seem to be the version of the cloze procedure that is the least promising as a specific measure of vocabulary. • Its creators intended that it should assess general proficiency in the language. • The fact that just the second half of a word is deleted might suggest that knowledge of word structure is more important in this kind of test than, say, the semantic aspects of vocabulary knowledge.

  33. Cont. • Chapelle and Abraham (1990) found that their C-test correlated highly with their multiple-choice vocabulary test (r = 0.862). • Looking at it the other way, the vocabulary test had a stronger association with the C-test than with any of the other three versions of the cloze procedure. • The researchers interpreted this as evidence that the C-test was particularly good as a measure of what Alderson would call ‘lower-level’ knowledge of lexical and grammatical elements, while at the same time it also drew on ‘higher-level’ textual competence, as indicated by the substantial correlations with the reading and writing subtests.

  34. Conclusion • Vocabulary knowledge is assessed indirectly through the test-takers’ performance of integrative tasks that show how well they can draw on all their language resources to use the language for various communicative purposes. • Researchers and language-teaching specialists with a specific interest in vocabulary learning have a continuing need for assessment tools.
