
Effective Implementation of the International Test Commission Guidelines for Adapting Tests



  1. Effective Implementation of the International Test Commission Guidelines for Adapting Tests. Ronald K. Hambleton, Shuhong Li, University of Massachusetts, USA. ICP, Beijing, China, August 10, 2004

  2. Background • Interest in Test Translations and Test Adaptations Has Increased Tremendously in the Past 15 Years. --IQ and Personality Tests in 50+ Languages --Achievement Tests for Large-Scale Assessments (PISA, TIMSS) in 30+ Languages --International Use of Credentialing Exams Is Expanding.

  3. Background • Medical Health Researchers With Their Quality of Life Measures (Huge Field) • Marketing Research

  4. Problems • All Too Often, the Test Translation and Adaptation Process Is Not Understood (e.g., hiring only one translator) --Limited Technical Work (e.g., back-translation only) --Literal Translations Only --Validity Initiatives End With Judgmental Analyses

  5. Problems • The 22 ITC Guidelines for Test Adaptation Are Becoming Well Known, and Are Often Referred to in the Literature. • But It Is Not Always Clear to Practitioners How These Guidelines Might Be Applied. • The paper by van de Vijver and Tanzer (1997) is very useful but not directly linked to the guidelines, and there have been many new developments since 1997.

  6. Purposes of the Research • Provide Specific Ideas for Applying the ITC Test Adaptation Guidelines. [The paper includes many excellent examples of applications.] [Successful adaptation is a mixture of good designs, excellent translators, questionnaires, observations, good judgments, statistical analyses, validity studies, etc.]

  7. C.1 The amount of overlap in the constructs in the populations of interest should be assessed. • Is the meaning of the construct the same over language groups? • Exploratory factor analysis, and especially confirmatory factor analysis (SEM) and multidimensional scaling, can address this (a code sketch follows the next slide). --see work by Byrne with self-concept measures; van de Vijver (2004) with the CPI; Gregoire (2004) with the WAIS-III. [cont.]

  8. C.1 The amount of overlap in the constructs in the populations of interest should be assessed. • Judgment of the content suitability --Routinely done with the international assessments, via questionnaires and face-to-face committee meetings. • Assessing “nomological nets”: basically, investigating a pattern of test results that includes external factors (i.e., construct validity investigations)
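The factor-analytic check suggested under C.1 can be sketched in a few lines. The following is a minimal illustration, not the authors' procedure: it fits an exploratory factor model to each language group's item responses with scikit-learn and compares loading patterns with Tucker's congruence coefficient. The array names and placeholder data are hypothetical; in practice a rotation to a common target (and ideally a multi-group confirmatory model) would precede the comparison.

```python
# Sketch: compare factor structure across two language groups (C.1).
# `source_X` / `target_X` are (respondents x items) score matrices;
# the random data below are placeholders only.
import numpy as np
from sklearn.decomposition import FactorAnalysis

def loadings(X, n_factors=2):
    """Fit an exploratory factor model; return an items x factors loading matrix."""
    fa = FactorAnalysis(n_components=n_factors, random_state=0)
    fa.fit(X)
    return fa.components_.T

def tucker_phi(a, b):
    """Tucker's congruence coefficient between two loading vectors."""
    return a @ b / np.sqrt((a @ a) * (b @ b))

rng = np.random.default_rng(0)
source_X = rng.normal(size=(500, 10))
target_X = rng.normal(size=(400, 10))

L_src, L_tgt = loadings(source_X), loadings(target_X)
for f in range(L_src.shape[1]):
    # Values around .95 or higher are usually read as factor equivalence.
    print(f"Factor {f + 1}: congruence = {tucker_phi(L_src[:, f], L_tgt[:, f]):.2f}")
```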

  9. C.2 Effects of cultural differences which are not relevant to the purpose of the study should be minimized. • Across language and cultural groups: Same motivational level? Same understanding of directions? Same impact of speed? Common experience? If not, fix! --Questionnaires, observations, local experts, etc. can provide valuable evidence. [cont.]

  10. C.2 Effects of cultural differences which are not relevant to the purpose of the study should be minimized. • Assessment of “cultural difference.” --assess differences in language, family structures, religion, lifestyle, values, etc. (see van de Vijver & Leung, 1997) --methods such as ANCOVA may allow such differences to be removed statistically (see the sketch below).
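As a rough illustration of the ANCOVA idea above, the sketch below adjusts a cross-group score comparison for a background covariate using statsmodels. The column names (score, group, schooling) and the simulated data are purely illustrative.

```python
# Sketch: ANCOVA adjusting a cross-group score comparison for a covariate (C.2).
# Column names and simulated data are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "score":     rng.normal(50, 10, 300),
    "group":     np.repeat(["source", "target"], 150),
    "schooling": rng.normal(12, 2, 300),      # covariate, e.g. years of schooling
})

# The C(group) coefficient is the group difference after controlling for schooling.
model = smf.ols("score ~ C(group) + schooling", data=df).fit()
print(model.summary().tables[1])
```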

  11. D.1 Ensure the test adaptation takes account of linguistic and cultural differences. • This guideline is really about the translators and their qualifications: they must know the languages, the cultures, basic test development, and the subject matter/construct. --Evaluate the process used in selecting translators (avoid “convenience” as a criterion) --Use multiple translators [cont.]

  12. D.1 Ensure the test adaptation takes account of linguistic and cultural differences. • D.1 has been one of the most widely applied guidelines: agencies are evaluating translators more carefully and frequently using multiple translators. See Meara (2004), Grisay (2004).

  13. D.2 Provide evidence that directions, scoring rubrics, and formats are applicable. • If possible, begin in the source language to choose concepts, formats, etc. that will adapt easily. --Qualified translators can be very useful. --Develop checklists for translators to watch for unfamiliar words, lengths of sentences, culturally specific concepts, etc. Have them sign off. See Meara (2004). [cont.]

  14. D.2 Provide evidence that directions and scoring rubrics, item formats, and items are widely applicable. • Another successful guideline: checklists/rating scales have been developed to focus on content, conceptual, and linguistic equivalence. See Jeanrie and Bertrand (1999).

  15. D.3 Formats, instructions, and test itself should be developed to maximize utility in multiple groups. • The meaning of this guideline seems clear. --Compile evidence on target group—via questionnaires, observations, discussions with testing specialists, small tryout. (van de Vijver & Tanzer, 1997) [cont.]

  16. D.3 Formats, instructions, and test itself should be developed to maximize utility in multiple groups. --Are training materials available, in case the tests are unusual or new? --Were training materials evaluated for their success? --Consider balancing formats in the test

  17. D.4 Item content should be familiar to persons in the target languages. • Here, we mean “judgmental reviews.” --Develop checklists for reviewers in the target language. (As is done to detect gender and ethnic bias or to evaluate test items.) --If changes are made (e.g., dollars to pounds), be sure the changes are judged as psychologically equivalent.

  18. D.5 Linguistic and psychological evidence should be used to improve the test, and address equivalence. • Implement forward and backward translation designs for effective review. --Were multiple translators used? --Both designs? --Probes of target language/culture respondents? --Administration of source and back translated versions? [cont.]

  19. D.5 Linguistic and psychological evidence should be used to improve the test, and address equivalence. • Gregoire’s (2004) work to build equivalent scales and test dimensionality in French to match the English version of the WAIS-III

  20. D.6 Choose a data collection design to provide statistical evidence to establish item equivalence. • DIF, SEM, and IRT studies are valuable, but suitable designs are needed for effective analyses (bilingual designs; separate monolingual source and target samples). --Are sample sizes large enough? Representative of the populations? • See Muniz et al. (2001) for a small-sample study of Mantel-Haenszel (MH) and conditional p-values. (A sketch of the MH procedure follows.)
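A minimal sketch of the Mantel-Haenszel procedure mentioned above, written from the standard formulas rather than any particular package: it computes the MH common odds ratio across total-score strata and converts it to the ETS delta-MH scale. All data and names below are simulated placeholders.

```python
# Sketch: Mantel-Haenszel DIF for one dichotomous item, stratified on
# total score (D.6). Data are simulated; no DIF package is assumed.
import numpy as np

def mh_delta(item, group, total):
    """MH common odds ratio and ETS delta-MH for a 0/1 item."""
    num = den = 0.0
    for s in np.unique(total):
        m = total == s
        a = np.sum(m & (group == "ref") & (item == 1))  # reference, correct
        b = np.sum(m & (group == "ref") & (item == 0))  # reference, incorrect
        c = np.sum(m & (group == "foc") & (item == 1))  # focal, correct
        d = np.sum(m & (group == "foc") & (item == 0))  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    alpha = num / den
    return alpha, -2.35 * np.log(alpha)      # ETS delta-MH scale

rng = np.random.default_rng(2)
n = 1000
group = np.where(rng.random(n) < 0.5, "ref", "foc")
total = rng.integers(0, 21, n)               # matching variable
item = (rng.random(n) < 0.4 + 0.02 * total).astype(int)

alpha, dmh = mh_delta(item, group, total)
print(f"alpha_MH = {alpha:.2f}, delta-MH = {dmh:.2f}")   # |delta-MH| >= 1.5: large DIF
```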

  21. D.7 Use psychometric and statistical techniques to establish test equivalence and test shortcomings. • The primary concern is that statistical procedures are consistent with data assumptions. --No common scale: SEM --Common scale (unconditional): ANOVA, logistic regression (LR), delta plots --Common scale (conditional): IRT
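For the logistic regression branch above, a common approach (in the style of Swaminathan and Rogers, sketched here under assumed column names and simulated data) compares nested models: adding a group term tests uniform DIF, and adding a group-by-score interaction tests non-uniform DIF.

```python
# Sketch: logistic-regression DIF for one item (D.7), comparing nested
# models with likelihood-ratio tests. Simulated data; names illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({
    "total": rng.integers(0, 21, n).astype(float),   # matching score
    "group": rng.integers(0, 2, n),                  # 0 = reference, 1 = focal
})
p = 1 / (1 + np.exp(-(-3 + 0.3 * df["total"] + 0.5 * df["group"])))  # uniform DIF built in
df["item"] = (rng.random(n) < p).astype(int)

m0 = smf.logit("item ~ total", df).fit(disp=0)                        # matching only
m1 = smf.logit("item ~ total + group", df).fit(disp=0)                # + uniform DIF
m2 = smf.logit("item ~ total + group + total:group", df).fit(disp=0)  # + non-uniform DIF

for label, small, big in [("uniform", m0, m1), ("non-uniform", m1, m2)]:
    lr = 2 * (big.llf - small.llf)               # likelihood-ratio test, 1 df
    print(f"{label} DIF: LR = {lr:.2f}, p = {stats.chi2.sf(lr, 1):.4f}")
```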

  22. D.8 Provide technical evidence to support the validity of the test in its adapted form. • The validity of a translated/adapted test cannot be assumed; assuming it carries over is one of the most common myths! Many cultural factors can be present. The level of effort should be tied to the importance of the test and the cultural difference between source and target groups. [cont.]

  23. D.8 Provide technical evidence to support the validity of the test in its adapted form. --Item analysis, reliability, and validity studies (content, criterion-related, construct) in relation to stated purposes are needed on the translated/adapted test.

  24. D.9 Provide evidence of item equivalence in multiple languages. • A great deal of methodology is available here, much of it a simple extension of DIF methodology. --Delta plots, standardized p-differences, b-value plots, Mantel-Haenszel, logistic regression, and much more (sketches follow the next slide and Figure 3). • Van de Vijver (2004) gives an excellent example, and grapples with the issue of effect sizes (not all statistical differences are consequential) [cont.]

  25. D.9 Provide evidence of item equivalence in multiple languages. • Zumbo (2004) shows that SEM procedures do not necessarily spot item-level DIF, so these item-level analyses are very important.
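The standardized p-difference index named in slide 24 can be illustrated briefly. This sketch follows the Dorans and Kulick standardization: focal-minus-reference p-value differences at each total-score level, weighted by the focal group's score distribution. All data below are simulated.

```python
# Sketch: standardized p-difference (Dorans & Kulick) for one item (D.9).
# Focal-minus-reference p-value gaps, weighted by the focal score distribution.
import numpy as np

def std_p_dif(item, group, total):
    num = den = 0.0
    for s in np.unique(total):
        foc = (group == "foc") & (total == s)
        ref = (group == "ref") & (total == s)
        w = foc.sum()                            # focal-group weight at this level
        if w == 0 or ref.sum() == 0:
            continue
        num += w * (item[foc].mean() - item[ref].mean())
        den += w
    return num / den

rng = np.random.default_rng(4)
n = 1200
group = np.where(rng.random(n) < 0.5, "ref", "foc")
total = rng.integers(0, 21, n)
item = (rng.random(n) < 0.3 + 0.02 * total).astype(int)

print(f"STD P-DIF = {std_p_dif(item, group, total):+.3f}")  # beyond +/-.10: review item
```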

  26. D.10 Non-equivalent items should be eliminated from the linking process. • If linking of scales is in the plan, watch for non-functioning test items and eliminate them from the link; they can remain in the source-language version. --see the example in Figure 3. --items eliminated from the “link” may still be valuable in the source and target languages, with unique item statistics.

  27. [Figure 3: Delta plot with 40 anchor items. Linear equating line: y = 1.09x - 0.44; major axis of the ellipse: y = 1.11x - 0.66]
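The figure's equating line and major axis can be reproduced from first principles. The sketch below (with illustrative data, not the figure's) converts item p-values to ETS deltas and fits the major axis of the ellipse as in Angoff's delta-plot method; items far from the axis are candidates for removal under D.10.

```python
# Sketch: ETS deltas and the major axis of the ellipse, as in Figure 3
# (Angoff's delta-plot method). The 40 p-values below are simulated.
import numpy as np
from scipy.stats import norm

def delta(p):
    """ETS delta: harder items get larger values (scale mean 13, SD 4)."""
    return 4 * norm.ppf(1 - p) + 13

def major_axis(x, y):
    """Slope and intercept of the principal axis of the (x, y) scatter."""
    sx2, sy2 = np.var(x), np.var(y)
    sxy = np.cov(x, y, bias=True)[0, 1]
    slope = (sy2 - sx2 + np.sqrt((sy2 - sx2) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return slope, np.mean(y) - slope * np.mean(x)

rng = np.random.default_rng(5)
p_src = rng.uniform(0.3, 0.9, 40)                               # 40 anchor items
p_tgt = np.clip(p_src - rng.normal(0.05, 0.03, 40), 0.05, 0.95)

d_src, d_tgt = delta(p_src), delta(p_tgt)
slope, intercept = major_axis(d_src, d_tgt)
print(f"major axis: y = {slope:.2f}x + {intercept:+.2f}")

# Perpendicular distance from the axis; large values flag candidate DIF items.
dist = np.abs(d_tgt - (slope * d_src + intercept)) / np.sqrt(slope ** 2 + 1)
print("flagged items:", np.where(dist > 1.5)[0])
```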

  28. A.1 Try to anticipate test administration problems and eliminate them. • The next six test administration guidelines are clear. --Empirical evidence is needed to support the claim of equivalence: observations, interviews with respondents and administrators, local experts, small tryouts, analysis of response times, practice effects, well-trained administrators, etc.

  29. A.2 Be sensitive to problems with tests, administration, format, etc. that might lower validity. • This guideline is clear. --Work from checklists of common problems (e.g., hard words, format familiarity, rating scales, role of speed, separate answer sheets, using keyboards, availability of practice materials, etc.)

  30. A.3 Eliminate aspects of the environment that may affect performance. • Watch for environmental factors that may affect test performance and reduce validity. --Again, observations, interviews, and checklists can be helpful. Will respondents be honest? Will they try to maximize performance on achievement tests? Does the administrator have the skills to follow directions closely?

  31. A.4 Minimize problems with test administration directions. • Instructions can be problematic across cultural groups. --Use a checklist to watch for clarity in instructions to the respondents (e.g., simple words, avoidance of the passive voice, specific rather than general directions, use of examples to explain item formats, etc.)

  32. A.5 Identify in the manual administration details that need to be considered. • Test manuals need to reflect all the details of test administration. --Does the test manual describe administration procedures that are understandable and based on field-test experience? --Does the manual emphasize the need for standardization in administration?

  33. A.6 Administrators need to be unobtrusive, and examiner-examinee interaction minimized. • This guideline is about minimizing the role of the administrator (gender, ethnic background, age, etc.). --Were standardized procedures followed? --Was training effective? --Were local cultural norms respected? --Were pilot studies carried out?

  34. I.1 With adapted tests, document the changes that have been made, and evidence of equivalence. • It is important to keep records of the procedures used in the adaptation process, and of the changes made. --A record of the process can be valuable; it becomes part of the argument for validity. For example, how were qualified translators identified?

  35. I.2 Score differences should not be taken at face value. Compile validity data to substantiate the differences. • It is easy to interpret differences in terms of achievement, but why are there differences? --Look at educational policies, resources, etc. --Form a committee to interpret findings. (A diversity of opinions is needed.) --Be aware of other research that might help the interpretations. • Chung (2004) with psychological tests in China; the TIMSS and PISA studies.

  36. I.3 Comparisons across populations can be made only at the level of invariance that is established for the test. • Are scores in different populations linked to a common scale? If not, comparisons are problematic. (And “statistical equating” is a complicated process; a minimal linking sketch follows.) --Are scores being interpreted at the level of invariance that has been established?
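As a toy illustration of linking to a common scale, the sketch below applies a mean-sigma transformation to IRT difficulty estimates from two separately calibrated language versions, using common anchor items. The simulated b-values stand in for real calibrations.

```python
# Sketch: mean-sigma linking of difficulty estimates from two separately
# calibrated language versions onto one scale (I.3 / D.10). Simulated b-values.
import numpy as np

rng = np.random.default_rng(6)
b_source = rng.normal(0.0, 1.0, 40)                         # anchor items, source scale
b_target = 0.9 * b_source + 0.2 + rng.normal(0, 0.1, 40)    # same items, target scale

A = np.std(b_source) / np.std(b_target)        # slope of the linking transformation
B = np.mean(b_source) - A * np.mean(b_target)  # intercept
print(f"target-to-source transformation: b* = {A:.2f} * b + {B:+.2f}")
```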

  37. I.4 Specific suggestions for interpreting the results need to be offered. • The idea here is that it is not enough to produce the test results; a basis for interpreting them should be offered. --Are possible interpretations offered? --Are cautions against misinterpretations offered? --Are factors discussed that might impact the results? [cont.]

  38. I.4 Specific suggestions for interpreting the results need to be offered. --Hierarchical Linear Modeling (HLM) is being used to build causal models to explain results.
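A minimal sketch of the HLM idea, using statsmodels' mixed-effects routine with hypothetical names: students nested within countries, a student-level covariate, and a random country intercept.

```python
# Sketch: a two-level (students within countries) model of the kind used to
# help explain cross-national score differences (I.4). Names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
country = np.repeat(np.arange(20), 100)                 # 20 countries x 100 students
df = pd.DataFrame({
    "country": country,
    "ses": rng.normal(0, 1, 2000),                      # student-level covariate
})
df["score"] = (500 + 20 * df["ses"]                     # fixed effects
               + rng.normal(0, 5, 20)[country]          # random country intercept
               + rng.normal(0, 30, 2000))               # student residual

model = smf.mixedlm("score ~ ses", df, groups=df["country"]).fit()
print(model.summary())
```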

  39. Conclusions • Test adaptation practices should improve with methodology linked to the guidelines. • What is needed now are more comprehensive examples of how these guidelines are being applied. [e.g., Grisay (2004); Meara (2004); papers in this symposium]

  40. Follow-Up Reading • See the work being done by TIMSS and OECD/PISA: outstanding quality. • Language Testing (2004 special issue). • Hambleton et al. (2004), Adaptation of educational and psychological tests, Erlbaum.

  41. Paper Request • Please contact the first author at RKH@educ.umass.edu for a copy of the paper.
