One Ruler, Many Tests

  1. One Ruler, Many Tests A Primer on Test Equating

  2. The Problem • People in APEC economies need to communicate across national, cultural, and economic boundaries. • This requires fluency in an international language, such as English. • Each APEC economy has its own system for teaching and assessing English fluency. • There is no common language scale. • There are no common standards.

  3. Domain of Language Fluency

  4. Some Options • Use scores from national exams • Use one of the international English exams: • TOEFL (Test of English as a Foreign Language) • TOEIC (Test of English for International Communication) • IELTS (International English Language Testing System) • Develop a parallel to the European system (Common European Framework of Reference for Languages) • Develop new tests specifically for APEC • Equate existing APEC tests

  5. Limitations of Existing Exams • Limited supply of seats for international English fluency exams • Each test is on its own scale • The TOEFL and TOEIC are proprietary to Educational Testing Service (ETS) • Appropriate only for adults • National tests are not comparable across APEC • None of these scales seems appropriate for establishing a single set of standards

  6. Psychometric Solutions • Test Equating • International Item Banking • Computer Adaptive Testing • The Lexile Scale • These methodologies require some background knowledge about Item Response Theory (IRT)

  7. Classical Definition of Equating • Two tests X and Y are equated when from an examinee’s score on X it is possible to infer his score on Y, and vice versa.
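The classical definition above is usually put into practice with equipercentile equating: a score on X is mapped to the score on Y that has the same percentile rank. A minimal sketch, using hypothetical score distributions (the data and function name are illustrative, not from the paper):

```python
import numpy as np

def equipercentile_equate(scores_x, scores_y, x_score):
    """Map a score on test X to the score on test Y that holds the
    same percentile rank in the two reference samples."""
    # Percentile rank of x_score among the observed scores on X
    pr = np.mean(np.asarray(scores_x) <= x_score)
    # Score on Y at that same percentile (linear interpolation)
    return float(np.percentile(scores_y, 100 * pr))

# Hypothetical score samples from two forms of an exam
scores_x = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
scores_y = [30, 34, 38, 42, 46, 50, 54, 58, 62, 66]

print(equipercentile_equate(scores_x, scores_y, 60))  # 48.0
```

A score of 60 sits at the 50th percentile of X, so it maps to Y's 50th-percentile score.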

  8. Classical Test Theory • Same test for everyone • Representative sample of examinees • Representative sample of items • Equate tests through common persons who take both tests • Do not use item-level data much • Person measures depend on the items • Test and item measures depend on the persons • No way to handle missing data • No way to assess quality of responses

  9. Item Response Theory • Famous names: Georg Rasch, Ben Wright, Fred Lord, Allan Birnbaum, 1950s – 1980s • Probability of a Correct Response = function of (Person Ability – Item Difficulty) • IRT models predict values for missing responses • Competing probability models: • Rasch model, no extra parameters (Rasch, Wright) • 2-PL model adds an item discrimination parameter; 3-PL adds a guessing parameter as well (Lord, Birnbaum) • Rasch model has a special property when the data fit the model: • Objectivity = Invariance = Generalizability
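The "function of (Person Ability – Item Difficulty)" in the Rasch model is the logistic function. A minimal sketch (abilities and difficulties in logits; values are illustrative):

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch model: P(correct) = exp(b - d) / (1 + exp(b - d)),
    where b is person ability and d is item difficulty, both in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the probability is exactly 0.5
print(rasch_probability(1.0, 1.0))   # 0.5
# An easier item (lower difficulty) raises the probability
print(rasch_probability(1.0, 0.0))   # ~0.731
```

The 2-PL and 3-PL models multiply the exponent by a discrimination parameter and (for 3-PL) add a lower asymptote for guessing; the Rasch model keeps only the ability–difficulty difference.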

  10. Objectivity • “Objective measurement” is when the relative measures of persons are the same regardless of which items they take, and the relative measures of items are the same regardless of which persons take them. • The Rasch model does not produce objectivity. It requires it as a condition of fit to the data. The model is used to “edit” the data set until objectivity is achieved. • Classical Test Theory does not have this property. • The 2-PL and 3-PL models fit the data better than the Rasch model and do not require editing, but at the expense of objectivity.

  11. Missing Data is Key to Test Equating

  12. Equating Using IRT

  13. Three Rules of Equating • Valid Comparisons Examinees can only legitimately be compared when they can be said, either literally or theoretically, to have taken the same test. • IRT Definition of Equating Two tests X and Y are equated when from the responses on X and the responses on Y it is possible to infer the responses and total score that students would have received on a common test XY composed of the items from both tests. • Test Linking In order to infer responses on one test based on another, the tests must somehow be linked.

  14. Three Ways to Link Tests X and Y • Common Persons Tests X and Y must be administered, in their entirety, to a common sample of persons at more or less the same time. • Common Items Tests X and Y must have items in common. This is called common-item equating. • Common Objective Characteristics Tests X and Y must have “objective characteristics” in common. There must be some way to infer from the test questions themselves their likely difficulty without recourse to person responses.
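Common-item equating can be sketched with the simplest linking method, mean-mean linking: the constant that places test X's logit scale onto test Y's scale is the difference between the common items' average difficulties as calibrated on each scale. The item values below are hypothetical:

```python
import statistics

def mean_mean_link(common_on_x, common_on_y):
    """Mean-mean common-item linking: the shift that expresses
    X-scale measures on Y's scale is the difference between the
    shared items' mean difficulties on the two scales."""
    return statistics.mean(common_on_y) - statistics.mean(common_on_x)

# Hypothetical calibrations of five shared items on each test's own scale
common_on_x = [-1.0, -0.5, 0.0, 0.5, 1.0]
common_on_y = [-0.4,  0.1, 0.6, 1.1, 1.6]

shift = mean_mean_link(common_on_x, common_on_y)
print(shift)  # ~0.6: add 0.6 logits to any X-scale measure to put it on Y's scale
```

Operational equating studies typically use more robust variants (mean-sigma, Stocking-Lord), but the idea is the same: the common items anchor the two scales together.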

  15. Technical Clarifications • Estimation IRT algorithms do not bother calculating values for missing cells (unless requested). They use more efficient algorithms to calculate person and item measures that are equivalent to filling in the missing cells. • Measurement units IRT algorithms do not work in “percent correct.” They work in “log-odds units” or “logits,” which are a way of going back and forth between probabilities and a ruler-like linear metric. • Error My pseudo-data set has random error added to it. Error is intrinsic to real-world data and binary raw responses. That is why we work in probabilities.
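The "log-odds units" mentioned above can be made concrete: a probability p maps to the logit ln(p / (1 − p)), and the logistic function maps back. A quick sketch:

```python
import math

def prob_to_logit(p):
    """Log-odds: the linear, ruler-like metric IRT works in."""
    return math.log(p / (1 - p))

def logit_to_prob(x):
    """Inverse: the logistic function, back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

print(prob_to_logit(0.5))             # 0.0 logits -- even odds
print(round(prob_to_logit(0.75), 3))  # ~1.099 logits (odds of 3:1)
print(round(logit_to_prob(1.099), 3)) # back to ~0.75
```

Unlike percent correct, equal distances in logits mean equal changes in odds everywhere on the scale, which is what makes the metric behave like a ruler.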

  16. IRT Requires Unidimensionality • Unidimensionality means that all the items in a test are sensitive to the same construct, and only that construct. Items differ from each other only in their difficulty, nothing else. • An examinee is very good at math but poor at English. He is given an easy word problem and gets it wrong. • Question: Should we mark the word problem wrong or not?

  17. Multidimensional IRT • Models exist, but the field has not matured • Properties of a true MIRT model: • Fit multidimensional data • Predict missing values • Invariant person positions in n-space • Invariant item positions in the same n-space • Misfit when invariance is not achieved • Transferability of coordinates • Maximal use of information in data set • Standard errors for cell estimates and person/item parameters • Example: My own model called NOUS

  18. Tools Relevant to APEC • International Item Bank • Computer Adaptive Testing • Lexile scale

  19. International Item Bank

  20. International Item Bank Benefits • How it works • Participating countries “withdraw” items • Combine bank items with own items, administer • New items, with data, are deposited to bank • Items are edited and calibrated, made available • Benefits • Different tests, a common scale • Excellent test security • Test all grades and ability levels • Freedom of use, not tied to proprietary tests • Computer adaptive testing

  21. Computer Adaptive Testing (CAT) • How it works • At least 12,000 items spread across all ability levels • Unidimensionality rigorously enforced • Item difficulties pre-calibrated • Examinee is administered items that are matched to his continuously recalculated ability, until a desired precision is achieved. • Benefits • Fast, secure, computerized • All examinees are measured with equal precision • None of the problems of paper tests • Problematic with Writing and Speaking
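The CAT loop above can be sketched in a toy simulation. This is not a production algorithm: item selection is reduced to "nearest difficulty" (which maximizes information under the Rasch model), responses are simulated, and ability is re-estimated by Newton-Raphson. All numbers are illustrative:

```python
import math
import random

def p_correct(theta, d):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - d)))

def estimate_ability(responses, difficulties, theta=0.0):
    """Newton-Raphson maximum-likelihood ability estimate, clamped
    to a sane range so all-correct runs cannot diverge."""
    for _ in range(20):
        probs = [p_correct(theta, d) for d in difficulties]
        grad = sum(r - p for r, p in zip(responses, probs))
        info = sum(p * (1 - p) for p in probs)   # test information
        theta = max(-4.0, min(4.0, theta + grad / info))
    return theta, 1.0 / math.sqrt(info)          # (estimate, standard error)

def cat_session(true_theta, bank, se_target=0.5, max_items=20):
    """Administer the unused item nearest the current estimate,
    stopping once the standard error reaches the target precision."""
    random.seed(0)                                # reproducible simulation
    unused = sorted(bank)
    responses, given = [], []
    theta, se = 0.0, float("inf")
    while unused and len(given) < max_items and se > se_target:
        d = min(unused, key=lambda x: abs(x - theta))
        unused.remove(d)
        responses.append(1 if random.random() < p_correct(true_theta, d) else 0)
        given.append(d)
        theta, se = estimate_ability(responses, given, theta)
    return theta, se, len(given)

bank = [i / 10.0 for i in range(-30, 31)]  # 61 items from -3 to +3 logits
theta, se, n = cat_session(true_theta=1.0, bank=bank)
print(n, round(theta, 2), round(se, 2))
```

Because every examinee is tested to the same standard-error target rather than given the same items, all examinees are measured with comparable precision, which is the benefit claimed on the slide.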

  22. The Lexile Scale • How it works • Rasch-based software uses “objective characteristic” equating to measure the relative reading difficulty of text passages • Two “objective characteristics” used as predictors: • Semantic Difficulty (how frequently is each word in a passage used in general English) • Syntactic Complexity (how many words are in each sentence) • How it is used • Tests are scanned and Lexile difficulties calculated • Books and curricula are also scanned and “Lexiled” • Examinees receive Lexile reading ability scores • Teachers match examinees to materials
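The two Lexile predictors can be illustrated with a toy feature extractor. The actual Lexile calibration and word-frequency list are proprietary to MetaMetrics; the tiny frequency table and passages below are invented purely to show the mechanics:

```python
import re

# Hypothetical log-frequencies from a general-English corpus
# (real Lexile analysis uses MetaMetrics' proprietary frequency list)
WORD_LOG_FREQ = {"the": 6.0, "cat": 3.5, "sat": 3.2, "on": 5.5, "mat": 2.8,
                 "ephemeral": 1.1, "phenomena": 1.4, "perplex": 0.9,
                 "scholars": 2.0, "who": 5.0, "scrutinize": 0.8,
                 "transient": 1.2, "anomalies": 1.0}

def passage_features(text):
    """The two Lexile-style predictors: mean log word frequency
    (semantic difficulty) and mean sentence length (syntactic complexity)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    mean_log_freq = sum(WORD_LOG_FREQ.get(w, 1.0) for w in words) / len(words)
    mean_sent_len = len(words) / len(sentences)
    return mean_log_freq, mean_sent_len

easy = "The cat sat on the mat."
hard = "Ephemeral phenomena perplex scholars who scrutinize transient anomalies."
print(passage_features(easy))  # common words, short sentence
print(passage_features(hard))  # rare words, longer sentence
```

Lower mean log frequency (rarer words) and longer sentences both predict a harder passage; a regression of these two features onto calibrated reading difficulty is what produces a Lexile measure.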

  23. Benefits of the Lexile Scale • All APEC tests on one English fluency scale Scale is objective, rigorous, and already applied to many assessments and thousands of books • Practical, inexpensive, straightforward MetaMetrics scans tests, assigns Lexiles. No equating studies, no need for item banks. • Usable by teachers Lots of Lexiled curricular materials and books (See attached excerpt of the Lexile scale, from MetaMetrics) • Easy to set performance standards Each level of the scale has thousands of exemplars

  24. More Lexile Benefits • Transparency The scale is based on objective test characteristics that can be independently verified and replicated. If MetaMetrics disappears the Lexile scale can be recreated. Proprietary tests like TOEFL and TOEIC have scales completely dependent on their creators. • Stability A scale based on objective characteristics is less likely to change difficulty or distort over time. • Applicable to many kinds of language fluency • Academic • Business • Reading, Writing (not yet Listening and Speaking) • English and Spanish, extendable to Asian languages

  25. Conclusion • The problem of establishing a common scale and common language standards for the APEC economies – particularly as regards the learning, teaching, and usage of English – is eminently solvable. • Psychometric tools have been developed over the last 40 years in response to the need to equate tests. • Three such tools relevant to APEC needs are: • An International Item Bank • Computer Adaptive Testing • The Lexile framework

  26. Contact information • Personal Information Mark H. Moulton, Ph.D., Educational Data Systems markm@eddata.com • Complete “One Ruler…” paper, plus NOUS and multidimensional equating www.eddata.com/resources/publications/ • Rasch Models www.winsteps.com/ • Item banks and CAT www.nwea.org/ • The Lexile framework www.lexile.com/EntrancePageFlash.html
