stages • Test conceptualization • defining the test • Test construction • Selecting a measurement scale • Developing items • Test tryout • Item analysis • Revising the test
1. Test conceptualization • Defining the scope, purpose, and limits of the test.
Initial questions in test construction • Should the item content be similar or varied? Should the range of difficulty be narrow or broad? • ceiling effect vs. floor effect • How many items should be created?
Which domains should be tapped? • the test developer may specify content domains and cognitive skills that must be included on the test. • What kind of test item should be used?
Levels of measurement • Nominal • Ordinal • Interval • Ratio
Scaling methods • Most are rating scales that are summative • May be unidimensional or multi-dimensional
Method of paired comparisons • Aka forced choice • Test taker is forced to pick one of two items paired together
Comparative scaling • Test takers sort cards or rank items from “least” to “most”
Categorical scaling • Test takers sort cards into one of 2 or more categories. • Stimuli are thought to differ quantitatively not qualitatively
Likert type scales • Response choices are ordered on a continuum from one extreme to the other (e.g., strongly agree to strongly disagree). • Likert assumes an interval scale although this may not be realistically accurate.
Guttman scales • Response choices for each item are various statements that lie on a continuum. • Endorsing the most extreme statement reflects endorsement of milder statements as well.
Method of equal-appearing intervals • Presumed to be interval • For knowledge scale: • obtain T/F statements • Experts rate each item • For attitude scale • Judges rate each item on a Likert scale assuming equal intervals • For both • Total test score for the test taker is based on “weighted” items (determined by averaging the experts' ratings)
Method of absolute scaling • Way to determine the difficulty level of items. • Give items to several age groups, with one age group acting as the anchor. • Item difficulty is assessed by noting the performance of each age group on each item as compared to the anchor group.
Method of empirical keying • Based entirely on empirical findings. • Test developer comes up with several items and then gives these to a group of people who are known to possess the construct and a group who is known not to possess the construct. • Items are selected based on how well they distinguish one group from the other.
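The item-selection step in empirical keying can be sketched as follows. This is only an illustration: it assumes items are scored 1 (endorsed) or 0 (not endorsed), and the function names, the sample data, and the 0.3 cut-off for "distinguishes well" are all hypothetical choices, not part of the method as stated above.

```python
def endorsement_rate(group, item):
    """Proportion of people in the group who endorsed the item."""
    return sum(row[item] for row in group) / len(group)


def select_items(criterion_group, comparison_group, min_difference=0.3):
    """Keep items whose endorsement rates differ between the two groups.

    min_difference is an arbitrary illustrative threshold.
    """
    n_items = len(criterion_group[0])
    keep = []
    for item in range(n_items):
        diff = (endorsement_rate(criterion_group, item)
                - endorsement_rate(comparison_group, item))
        if abs(diff) >= min_difference:
            keep.append(item)
    return keep


# Hypothetical data: rows are people, columns are items (1 = endorsed).
has_construct = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1]]
lacks_construct = [[0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]

selected = select_items(has_construct, lacks_construct)
print(selected)  # indices of the items that separate the two groups
```

In this toy data only the first item separates the groups; the other two are endorsed equally often by both and would be dropped.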
Item format • Selected response • Constructed response
Multiple choice • Pros---- • Cons----
Matching • Pros---- • Cons----
True/False • Pros---- • Cons---- • Forced-choice methodology.
Fill in • Pros---- • Cons----
Short answer objective item • Pros--- • Cons---
Essay • Pros---- • Cons----
Scoring items • Cumulative model • Class/category • Ipsative • Correction for guessing
3. Test tryout • Should be on group that represents the ultimate group of test takers (who the test is intended for) • Good items • Reliable • Valid • Discriminate well
Before item analysis, look at the variability of scores within the test • Floor effect? • Ceiling effect?
4. Item analysis • helps determine which items should be kept, revised, deleted.
Item-difficulty index • proportion of examinees who get the item correct. • can get a mean item difficulty.
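A minimal sketch of the item-difficulty index, assuming each response is scored 1 (correct) or 0 (incorrect) with one row per examinee. The function names and the sample data are illustrative only.

```python
def item_difficulty(responses, item):
    """Proportion of examinees who answered the given item correctly."""
    scores = [row[item] for row in responses]
    return sum(scores) / len(scores)


def mean_item_difficulty(responses):
    """Average of the item-difficulty indices across all items."""
    n_items = len(responses[0])
    total = sum(item_difficulty(responses, i) for i in range(n_items))
    return total / n_items


# Hypothetical data: 4 examinees (rows) by 3 items (columns), 1 = correct.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

p_values = [item_difficulty(responses, i) for i in range(3)]
print(p_values)
print(mean_item_difficulty(responses))
```

Note that a higher value means an easier item, since more examinees answered it correctly.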
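Ideal item difficulty • when using multiple-choice items, try to account for the probability of answering correctly by chance. • Optimal item difficulty = (1 + g)/2, where g is the probability of a correct answer by guessing (the midpoint between chance and 1.0) • exception to choosing item difficulty around mid-range involves tests of extreme groups.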
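As a short worked example, reading the optimal difficulty as the midpoint between chance-level performance and a perfect score, (1 + g)/2. The sketch assumes that g is simply 1 divided by the number of response options, i.e. that a guessing examinee picks every option with equal probability; the function name is illustrative.

```python
def optimal_item_difficulty(n_options):
    """Midpoint between the chance success rate and 1.0.

    Assumes pure guessing picks each of the n_options equally often.
    """
    g = 1 / n_options
    return (1 + g) / 2


print(optimal_item_difficulty(2))  # true/false item
print(optimal_item_difficulty(4))  # four-option multiple-choice item
```

Under this assumption a true/false item has an optimal difficulty of 0.75 and a four-option item 0.625, so the target moves closer to 0.5 as the number of options grows.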
Item endorsement • proportion of examinees who endorsed the item.
Item reliability index • Indication of internal consistency • Product of the item SD and the correlation between the item and total scale • Items with low reliability can be eliminated
Item validity index • Correlate item with criterion (helps identify predictively useful test items) • Multiply the correlation between the item score and the criterion score by the SD of the item. • The usefulness of an item also depends on its dispersion or ability to discriminate
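The item reliability index and the item validity index have the same form: the item's standard deviation times a correlation, with the total scale score in one case and the criterion score in the other. A rough sketch follows, assuming the population SD is used and that neither variable is constant (a constant variable would make the correlation undefined). All names and data are hypothetical.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def item_reliability_index(item_scores, total_scores):
    """Item SD times the item-total correlation."""
    return pstdev(item_scores) * pearson_r(item_scores, total_scores)


def item_validity_index(item_scores, criterion_scores):
    """Item SD times the item-criterion correlation."""
    return pstdev(item_scores) * pearson_r(item_scores, criterion_scores)


# Hypothetical data for six examinees.
item = [1, 1, 0, 0, 1, 0]           # scores on one item (1 = correct)
total = [9, 8, 4, 3, 7, 5]          # total test scores
criterion = [30, 28, 20, 25, 27, 18]  # scores on an external criterion

reliability = item_reliability_index(item, total)
validity = item_validity_index(item, criterion)
print(reliability, validity)
```

For a dichotomous item the SD cannot exceed 0.5, so both indices are largest for items of middle difficulty that also correlate strongly with the total or the criterion.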
Item discrimination index • how well the item discriminates between high scorers and low scorers on the test. • For each item, compare the performance of those in the upper vs lower performance ranges. Formula: d = (U - L)/N • U = # of people in the upper range who got it right • L = # of people in the lower range who got it right • N = total # of people in the upper OR lower range.
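The formula d = (U - L)/N can be sketched as below. The notes do not say how large the upper and lower ranges should be, so the 27% default here is an assumption (it is a commonly used convention), and the function name and data are illustrative.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """d = (U - L) / N for one item.

    fraction is the share of examinees placed in each extreme group;
    0.27 is an assumed default, not a value given in the notes.
    """
    n_group = max(1, round(len(total_scores) * fraction))
    # Indices of examinees ordered from lowest to highest total score.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower, upper = order[:n_group], order[-n_group:]
    u = sum(item_scores[i] for i in upper)  # upper-range examinees who got it right
    l = sum(item_scores[i] for i in lower)  # lower-range examinees who got it right
    return (u - l) / n_group


# Hypothetical data for ten examinees.
totals = [10, 9, 9, 8, 7, 6, 5, 4, 3, 2]
item = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]

d = discrimination_index(item, totals)
print(d)
```

With ten examinees each extreme group holds three people; all three high scorers and one of the three low scorers answered correctly, giving d = (3 - 1)/3.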
Interpreting the IDI • can vary from –1 to +1. • A (–) number = • A 0 indicates = • The closer the IDI is to +1 • Can also use the IDI approach to examine the pattern of incorrect responses.
Item characteristic curves • “Graphic representation of item difficulty and discrimination” • horizontal line = ability • vertical line = probability of a correct response
plots the probability of a correct response relative to the position on the entire test. • If the curve is an incline slope or like an S, the item is doing a good job of separating low and high scorers.
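One way to picture such an S-shaped curve is with a two-parameter logistic function. The notes do not name a model, so this choice is an assumption made purely for illustration; the parameter names a (steepness, i.e. discrimination) and b (location, i.e. difficulty) and the values used are hypothetical.

```python
import math


def icc_2pl(theta, a, b):
    """Probability of a correct response at ability level theta.

    a controls how steep the curve is; b is the ability level at which
    the probability of a correct response is 0.5.
    """
    return 1 / (1 + math.exp(-a * (theta - b)))


abilities = [-2, -1, 0, 1, 2]
probabilities = [icc_2pl(t, a=1.5, b=0.0) for t in abilities]

for theta, p in zip(abilities, probabilities):
    print(theta, round(p, 3))
```

A larger a gives a steeper rise around b, which corresponds to an item that separates low and high scorers more sharply; a flat curve would indicate an item that barely discriminates.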
Item fairness • Items should measure the same thing across groups • Items should have similar ICC across groups • Items should have similar predictive validity across groups
Speed tests • Easy items, similar items – everyone gets correct. • Measuring response time • Traditional analyses of items do not apply
Qualitative item analysis • Test takers' descriptions of the test • Think-aloud administrations • Expert panels
5. Revising the test • based on the info we obtained from the item analysis. New items and additional testing of these items may be required.
Cross validation • Once you have your revised test, need to seek new, independent confirmation of the test’s validity. • The researcher uses a new sample to determine if the test predicts the criterion as well as it did in the original sample.
Validity shrinkage • Typically, with cross validation, you will find that the test is less accurate in predicting the criterion with this new sample.
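A rough sketch of how shrinkage might be quantified: compute the test-criterion correlation in the original sample and again in the new sample, then take the difference. The data below are invented purely to illustrate the calculation, and the function names are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def validity_shrinkage(r_original, r_new):
    """Drop in the validity coefficient from the original to the new sample."""
    return r_original - r_new


# Invented data: test scores and criterion scores in each sample.
original_test, original_criterion = [10, 12, 14, 16, 18], [20, 25, 27, 33, 35]
new_test, new_criterion = [11, 13, 15, 17, 19], [24, 22, 30, 27, 34]

r_original = pearson_r(original_test, original_criterion)
r_new = pearson_r(new_test, new_criterion)
print(round(r_original, 3), round(r_new, 3))
print(round(validity_shrinkage(r_original, r_new), 3))
```

A positive difference means the test predicted the criterion less well in the new sample than in the sample used to develop it.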
Co-validation • Validating two or more tests at the same time • Co-norming • Saves $ • Beneficial for tests that are used together
6. Publishing the test • final step that involves development of a test manual.
Production of testing materials • Testing materials that are user friendly will be more accepted. The layout of the materials should allow for smooth administration.
Technical manual • Summarizes the technical data and references. Item analyses, scale reliabilities, validation evidence, etc. can be found here.
User’s manual • provides instruction for administration, scoring, and interpretation. • The Standards for Educational and Psychological Testing recommend that manuals meet several goals (p 135). • two of the most important: • 1. describe the rationale and recommended uses of the test • 2. provide data on reliability and validity.