Test Development

  1. Test Development

  2. Stages • Test conceptualization • Defining the test • Test construction • Selecting a measurement scale • Developing items • Test tryout • Item analysis • Revising the test

  3. 1. Test conceptualization • Defining the scope, purpose, and limits of the test.

  4. Initial questions in test construction • Should the item content be similar or varied? Should the range of difficulty be narrow or broad? • ceiling effect vs. floor effect • How many items should be created?

  5. Which domains should be tapped? • The test developer may specify content domains and cognitive skills that must be included on the test. • What kind of test item should be used?

  6. 2. Test construction

  7. Selecting a scaling method

  8. Levels of measurement • Nominal • Ordinal • Interval • Ratio

  9. Scaling methods • Most are rating scales that are summative • May be unidimensional or multi-dimensional

  10. Method of paired comparisons • Aka forced choice • Test taker is forced to pick one of two items paired together

  11. Comparative scaling • Test takers sort cards or rank items from “least” to “most”

  12. Categorical scaling • Test takers sort cards into one of two or more categories. • Stimuli are thought to differ quantitatively, not qualitatively.

  13. Likert-type scales • Response choices are ordered on a continuum from one extreme to the other (e.g., strongly agree to strongly disagree). • Likert scaling assumes an interval scale, although this assumption may not hold in practice.

  14. Guttman scales • Response choices for each item are various statements that lie on a continuum. • Endorsing the most extreme statement reflects endorsement of milder statements as well.
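
The cumulative property described above can be checked mechanically. The following is a minimal sketch, assuming each test taker's responses are coded 0/1 and the statements are already ordered from mildest to most extreme; the function name `is_guttman_consistent` is hypothetical.

```python
def is_guttman_consistent(responses):
    """Return True if a 0/1 response pattern fits a perfect Guttman scale.

    `responses` must be ordered from the mildest to the most extreme statement.
    Endorsing an extreme statement implies endorsing every milder one, so once
    a statement is rejected, no more extreme statement may be endorsed.
    """
    seen_rejection = False
    for endorsed in responses:
        if not endorsed:
            seen_rejection = True
        elif seen_rejection:
            return False  # endorsed an extreme item after rejecting a milder one
    return True


print(is_guttman_consistent([1, 1, 1, 0]))  # consistent pattern
print(is_guttman_consistent([1, 0, 1, 0]))  # violates the cumulative ordering
```

Real response data rarely fit perfectly, so a check like this is better read as a count of deviating patterns than as a pass/fail rule.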

  15. Method of equal-appearing intervals • Presumed to be interval • For a knowledge scale: • Obtain T/F statements • Experts rate each item • For an attitude scale: • Judges rate each item on a Likert scale, assuming equal intervals • For both: • Total test score for the test taker is based on “weighted” items (weights determined by averaging the experts' ratings)
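
The weighting step can be sketched as follows. This is an illustration only: the judge ratings are made-up data, and scoring a test taker as the mean weight of the statements they endorse is one common variant that is assumed here, not something the slide specifies.

```python
from statistics import mean

# Hypothetical judge ratings per item (1 = least favorable, 11 = most favorable).
judge_ratings = {
    "item_a": [2, 3, 2, 3],
    "item_b": [6, 5, 7, 6],
    "item_c": [10, 9, 10, 11],
}

# Each item's weight (scale value) is the average of the judges' ratings.
scale_values = {item: mean(ratings) for item, ratings in judge_ratings.items()}


def attitude_score(endorsed_items):
    """Mean scale value of the statements the test taker endorsed (assumed rule)."""
    return mean(scale_values[item] for item in endorsed_items)


print(scale_values)
print(attitude_score(["item_b", "item_c"]))
```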

  16. Method of absolute scaling • Way to determine the difficulty level of items. • Give items to several age groups, with one age group acting as the anchor. • Item difficulty is assessed by noting the performance of each age group on each item as compared to the anchor group.

  17. Method of empirical keying • Based entirely on empirical findings. • Test developer comes up with several items and then gives these to a group of people who are known to possess the construct and a group who is known not to possess the construct. • Items are selected based on how well they distinguish one group from the other.
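
Item selection by empirical keying might look like the sketch below. It assumes 0/1 endorsement data and keeps an item when the endorsement rates of the two groups differ by at least a cutoff; the cutoff of 0.3, the function names, and the data are all hypothetical.

```python
def endorsement_rate(group, item_index):
    """Proportion of people in `group` who endorsed the item."""
    return sum(person[item_index] for person in group) / len(group)


def select_items(criterion_group, comparison_group, min_difference=0.3):
    """Return indices of items whose endorsement rates separate the two groups."""
    n_items = len(criterion_group[0])
    keep = []
    for item_index in range(n_items):
        diff = (endorsement_rate(criterion_group, item_index)
                - endorsement_rate(comparison_group, item_index))
        if abs(diff) >= min_difference:  # item distinguishes the groups
            keep.append(item_index)
    return keep


# Rows are people, columns are items (1 = endorsed). Made-up data.
criterion_group = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 1, 0]]   # known to possess the construct
comparison_group = [[0, 1, 0], [0, 1, 1], [1, 0, 0], [0, 1, 0]]  # known not to possess it

print(select_items(criterion_group, comparison_group))
```

Only the first item is kept here, because it is the only one the two groups endorse at clearly different rates.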

  18. Writing the items

  19. Item format • Selected response • Constructed response

  20. Multiple choice • Pros---- • Cons----

  21. Matching • Pros---- • Cons----

  22. True/False • Pros---- • Cons---- • Forced-choice methodology.

  23. Fill in • Pros---- • Cons----

  24. Short answer objective item • Pros--- • Cons---

  25. Essay • Pros---- • Cons----

  26. Scoring items • Cumulative model • Class/category • Ipsative • Correction for guessing
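
The slide does not give a formula for the correction for guessing. One widely used version subtracts a fraction of the wrong answers from the number right, R - W/(k - 1), where k is the number of response options and omitted items are ignored. A minimal sketch under that assumption:

```python
def corrected_score(num_right, num_wrong, num_options):
    """Correction for guessing: R - W / (k - 1). Omitted items are not counted."""
    if num_options < 2:
        raise ValueError("an item needs at least two response options")
    return num_right - num_wrong / (num_options - 1)


# 30 right and 8 wrong on five-option multiple-choice items.
print(corrected_score(30, 8, 5))  # 28.0
```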

  27. 3. Test tryout • Should be on group that represents the ultimate group of test takers (who the test is intended for) • Good items • Reliable • Valid • Discriminate well

  28. Before item analysis, look at the variability of scores within the test • Floor effect? • Ceiling effect?

  29. 4. Item analysis • Helps determine which items should be kept, revised, or deleted.

  30. Item-difficulty index • Proportion of examinees who answer the item correctly. • Averaging across items gives the mean item difficulty for the test.
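
As a small illustration, the index can be computed from a matrix of scored responses. The data below are made up; rows are examinees and columns are items (1 = correct).

```python
def item_difficulties(score_matrix):
    """Proportion of examinees answering each item correctly (one value per item)."""
    n_examinees = len(score_matrix)
    return [sum(column) / n_examinees for column in zip(*score_matrix)]


scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]

p_values = item_difficulties(scores)
mean_difficulty = sum(p_values) / len(p_values)

print(p_values)                   # [0.8, 0.8, 0.2, 0.8]
print(round(mean_difficulty, 2))  # 0.65
```

Note that a higher value means an easier item, since more examinees answered it correctly.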

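  31. Ideal item difficulty • When using multiple-choice items, account for the probability of answering correctly by chance. • Optimal item difficulty = (1 + g)/2, where g is the probability of a correct answer by guessing (the midpoint between chance and 1.00) • An exception to choosing item difficulty around the mid-range involves tests of extreme groups.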
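
A quick worked example, taking the optimal difficulty as the midpoint between the chance success rate g and 1.00, that is (1 + g)/2, and assuming g = 1/k for an item with k equally attractive options:

```python
def optimal_difficulty(num_options):
    """Midpoint between the chance success rate (1 / k) and a perfect 1.00."""
    chance = 1 / num_options
    return (1 + chance) / 2


print(optimal_difficulty(2))  # true/false item: 0.75
print(optimal_difficulty(4))  # four-option item: 0.625
print(optimal_difficulty(5))  # five-option item: 0.6
```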

  32. Item endorsement • proportion of examinees who endorsed the item.

  33. Item reliability index • Indication of internal consistency • Product of the item SD and the correlation between the item and total scale • Items with low reliability can be eliminated

  34. Item validity index • Correlate the item with the criterion (helps identify predictively useful test items). • Product of the item SD and the correlation between the item score and the criterion score. • The usefulness of an item also depends on its dispersion or ability to discriminate.
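
The item reliability and item validity indices share one form: the item's standard deviation multiplied by a correlation, with the total scale score in the first case and a criterion score in the second. The sketch below assumes the population SD and a Pearson correlation; the helper names and the data are hypothetical.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def item_reliability_index(item_scores, total_scores):
    """Item SD times the item-total correlation."""
    return pstdev(item_scores) * pearson_r(item_scores, total_scores)


def item_validity_index(item_scores, criterion_scores):
    """Item SD times the item-criterion correlation."""
    return pstdev(item_scores) * pearson_r(item_scores, criterion_scores)


item = [1, 1, 0, 1, 0]            # 1 = correct, one entry per examinee
total = [9, 8, 4, 7, 3]           # total test scores
criterion = [55, 60, 40, 52, 45]  # external criterion scores

print(item_reliability_index(item, total))
print(item_validity_index(item, criterion))
```

An item that everyone answers the same way has an SD of zero, so the correlation is undefined and this sketch would raise a division error; such items carry no information for either index.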

  35. Item discrimination index • How well the item discriminates between high scorers and low scorers on the test. • For each item, compare the performance of those in the upper vs. lower performance ranges. Formula: d = (U - L)/N • U = number of people in the upper range who got it right • L = number of people in the lower range who got it right • N = total number of people in the upper OR lower range.

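  36. Interpreting the IDI • Can vary from –1 to +1. • A negative (–) value = more test takers in the lower range than in the upper range answered the item correctly. • A value of 0 = the item does not discriminate between the two groups. • The closer the IDI is to +1, the better the item separates high scorers from low scorers. • Can also use the IDI approach to examine the pattern of incorrect responses.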
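
The index d = (U - L)/N can be computed directly from scored data. In this sketch the upper and lower ranges are taken as the top and bottom 27% of total scores; that fraction is an assumption (the slides do not fix one), and the function name and data are hypothetical.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """d = (U - L) / N using the top and bottom `fraction` of total scorers."""
    group_size = max(1, round(len(total_scores) * fraction))
    # Indices of examinees ordered from lowest to highest total score.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower, upper = order[:group_size], order[-group_size:]
    u = sum(item_scores[i] for i in upper)  # upper-range examinees who got it right
    l = sum(item_scores[i] for i in lower)  # lower-range examinees who got it right
    return (u - l) / group_size


totals = [12, 5, 9, 3, 15, 7, 11, 4, 14, 8]
good_item = [1, 0, 1, 0, 1, 1, 1, 0, 1, 0]   # high scorers tend to pass
flat_item = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # everyone passes
bad_item = [0, 1, 0, 1, 0, 0, 0, 1, 0, 1]    # low scorers tend to pass

print(discrimination_index(good_item, totals))  # 1.0
print(discrimination_index(flat_item, totals))  # 0.0
print(discrimination_index(bad_item, totals))   # -1.0
```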

  37. Item characteristic curves (ICC) • “Graphic representation of item difficulty and discrimination” • Horizontal axis = ability • Vertical axis = probability of a correct response

  38. The ICC plots the probability of a correct response against the test taker's standing on the test as a whole. • If the curve rises steadily or is S-shaped, the item is doing a good job of separating low and high scorers.
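
An S-shaped curve of this kind is often modeled with a logistic function. The sketch below uses a two-parameter logistic form as an assumption; the slides do not name a model, and the parameter names `difficulty` and `discrimination` are illustrative.

```python
import math


def icc_probability(ability, difficulty=0.0, discrimination=1.0):
    """Probability of a correct response under a two-parameter logistic curve.

    `difficulty` shifts the curve along the ability axis; `discrimination`
    controls how steeply the curve rises around that point.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Tabulate the curve for one hypothetical item across a range of ability levels.
for ability in (-2, -1, 0, 1, 2):
    print(ability, round(icc_probability(ability, difficulty=0.0, discrimination=1.5), 3))
```

A steeper curve (larger `discrimination`) means the probability of success changes quickly around the item's difficulty, which is what separating low and high scorers looks like graphically.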

  39. Item fairness • Items should measure the same thing across groups • Items should have similar ICC across groups • Items should have similar predictive validity across groups

  40. Speed tests • Easy items, similar items – everyone gets correct. • Measuring response time • Traditional analyses of items do not apply

  41. Qualitative item analysis • Test takers' descriptions of the test • Think-aloud administrations • Expert panels

  42. 5. Revising the test • Based on the information obtained from the item analysis. New items and additional testing of these items may be required.

  43. Cross-validation • Once you have your revised test, you need to seek new, independent confirmation of the test’s validity. • The researcher uses a new sample to determine whether the test predicts the criterion as well as it did in the original sample.

  44. Validity shrinkage • Typically, with cross-validation, you will find that the test is less accurate in predicting the criterion in the new sample.

  45. Co-validation • Validating two or more tests at the same time • Co-norming • Saves $ • Beneficial for tests that are used together

  46. 6. Publishing the test • Final step, which involves development of a test manual.

  47. Production of testing materials • Testing materials that are user-friendly will be more readily accepted. The layout of the materials should allow for smooth administration.

  48. Technical manual • Summarizes the technical data and references. Item analyses, scale reliabilities, validation evidence, etc., can be found here.

  49. User’s manual • Provides instructions for administration, scoring, and interpretation. • The Standards for Educational and Psychological Testing recommend that manuals meet several goals (p. 135). • Two of the most important: • 1. Describe the rationale and recommended uses of the test • 2. Provide data on reliability and validity.

  50. Testing is big business