
Multiple Perspectives on CAT for K-12 Assessments: Possibilities and Realities




  1. Multiple Perspectives on CAT for K-12 Assessments: Possibilities and Realities. Alan Nicewander, Pacific Metrics

  2. The following are some putative advantages of CAT relative to paper-based tests and to so-called linear tests delivered by computer, listed with comments: • CAT allows a pool of items to be used for on-demand testing of students and/or repeated testing of the same student and, at the same time, preserves the security of the item bank. • CAT presents test items at a level appropriate for a student’s ability. • CAT eliminates test booklets, which can be stolen or lost and thereby compromise test security.

  3. CATs are significantly shorter and have higher measurement efficiency than linear tests. They are shorter because time is not wasted by: • Presenting low-proficiency students with difficult items, which leads to many incorrect responses from which little is learned about student proficiency. • Presenting highly proficient students with items so easy that the extent of their knowledge is not revealed. Even though CATs are shorter than linear tests, they can increase the reliability of measurement in the extremes of the proficiency distribution relative to longer linear tests.
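
The efficiency argument on this slide can be made concrete with an item information function. The sketch below assumes the three-parameter logistic (3PL) model with a scaling constant D = 1.0; the function names and the example item (a = 1.6, b = 0, c = 0.15) are hypothetical and are not taken from the presentation.

```python
import math

def p_3pl(theta, a, b, c, D=1.0):
    """Probability of a correct response under the 3PL model.
    D is the scaling constant (1.0 here by assumption; 1.7 is also common)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c, D=1.0):
    """Fisher information contributed by one 3PL item at proficiency theta."""
    p = p_3pl(theta, a, b, c, D)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

# Hypothetical item of medium difficulty (b = 0).
item = {"a": 1.6, "b": 0.0, "c": 0.15}
for theta in (-2.0, 0.0, 2.0):
    # Information is largest for examinees near the item's difficulty.
    print(theta, info_3pl(theta, **item))
```

An item far too hard or far too easy for an examinee contributes very little information, which is why a linear test spends items that an adaptive test can skip.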

  4. Comments • Any type of on-demand or repeated testing carries the risk of item exposure. • A crucial variable in the risk of item exposure during on-demand, repeated testing is the degree to which the test is high-stakes; the higher the stakes, the greater the exposure pressure on the item pool. • The prime example here is the CAT-GRE, which was abandoned partly because of item security issues.

  5. To ensure reasonable levels of CAT security, two methods have been found to be most effective in simulations: • Stochastic exposure control using the Sympson-Hetter method (or a similar method). • Increasing the number of items in the pool. These findings are from initial R&D done for development of the CAT-ASVAB.
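
As a rough illustration of stochastic exposure control, the sketch below shows only the administration-time step of a Sympson-Hetter style filter: the best remaining item is administered with probability k, its exposure control parameter, and is otherwise set aside for this examinee. The function name, the item labels, the k values, and the fallback rule are assumptions made for this example. In practice the k values come from iterative simulation, which is not shown.

```python
import random

def administer_next_item(ranked_items, k, rng):
    """Return the item to administer.

    ranked_items: item ids ordered from most to least informative
                  for the current proficiency estimate.
    k: dict mapping item id -> exposure control parameter in (0, 1].
    rng: a random.Random instance.
    """
    for item in ranked_items:
        # Administer the selected item only with probability k[item].
        if rng.random() < k[item]:
            return item
    # Assumption of this sketch: if every candidate is rejected,
    # fall back to the last-ranked item.
    return ranked_items[-1]

# Hypothetical parameters: "A" is the most informative item but is
# throttled to roughly 30% of the occasions on which it is selected.
k = {"A": 0.3, "B": 0.6, "C": 1.0}
rng = random.Random(0)
draws = [administer_next_item(["A", "B", "C"], k, rng) for _ in range(10000)]
print({item: draws.count(item) / len(draws) for item in k})
```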

  6. Comments • It is true that CATs can be considerably shorter. For example, the CAT-ASVAB is about 1/3 shorter than the paper-based version (129 vs. 200 items), and the reliability coefficients run about 15% higher. • However, the CAT-ASVAB has moderate exposure control and very little content balancing imposed on optimum item selection. • Increasing the levels of exposure control and content balancing can lead to longer tests with lower reliability, and to BATs (barely adaptive tests).

  7. Existing test forms can be used to produce item pools for CAT.

  8. Comments • CAT item pool development can be a daunting task. As an illustration, suppose a current, paper-based testing program is administered with three forms of a 50-item test. [Note that the item exposure rate for the current procedure is 1/3 (each time a test is given, 1/3 of the total collection of items is exposed).] • If it is assumed that a CAT system can reduce test length to 35 items, how many items need to be developed to form the pools needed? • A general rule is to have a pool size five times the length of the CAT; this leads to 5 × 35 = 175 items in each pool in this example.

  9. Now, further assume that students will be allowed to take the CAT three times during a year. How many item pools are needed to attain the same exposure rate as the 50-item paper-based test being replaced? • Three pools will be needed to achieve the same theoretical exposure rate as the paper-based test. • Also, a statistical exposure control (such as Sympson-Hetter) will be needed to overcome the fact that, within a pool, certain items are selected very frequently by a procedure that maximizes test information.

  10. So, we are left with these numbers for item-pool development: 3 pools of 175 items each = 525 items, and using the general rule that one must write twice as many items as necessary, this means that 1,050 items must be written for this rather modest CAT project. • Or perhaps more realistically, (525 – 150)*2 = 750 new items will have to be written if all 150 paper-based items (3 forms of 50 items) are used in the item pools.
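
The item-pool arithmetic from the preceding slides can be checked in a few lines. All quantities come from the example in the slides; the function and variable names are hypothetical, and the number of pools is taken as given rather than derived.

```python
def item_writing_budget(forms=3, form_length=50, cat_length=35,
                        pool_multiplier=5, pools=3, write_ratio=2):
    """Reproduce the slide arithmetic for CAT item-pool development."""
    pool_size = pool_multiplier * cat_length          # 5 x 35 items per pool
    items_needed = pools * pool_size                  # items across all pools
    existing_items = forms * form_length              # paper-based items on hand
    return {
        "paper_exposure_rate": 1 / forms,             # one form in three is seen
        "pool_size": pool_size,
        "items_needed": items_needed,
        "written_if_all_new": write_ratio * items_needed,
        "written_if_reusing_paper_items":
            write_ratio * (items_needed - existing_items),
    }

budget = item_writing_budget()
for name, value in budget.items():
    print(name, value)
```

Changing `pool_multiplier` or `write_ratio` shows how sensitive the writing budget is to the two rules of thumb.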

  11. The bottom line is that CAT: • can provide tests at a level appropriate for a student’s ability. • can save testing time and increase test reliability. • is unlikely to save money because it can be a giant, item-eating machine. • offers the possibility of greater protection of the items from compromise than would be possible by the computer administration of a current paper-based test.

  12. Evaluating a CAT Item Pool using Optimal Adaptive Tests (OATs) • We are now going to construct some adaptive tests in an optimal way in order to illustrate some problems and to indicate an interesting possibility for implementing CAT. • If one knew a person’s standing on the latent trait, θ, it would be easy to choose a fixed number of items (from some item pool) that will maximize the test information. • We call such a test an “optimal adaptive test” (OAT) in that no other test from this item pool, and of the same length, could exceed this test’s measurement accuracy.

  13. The use of OATs for evaluating an item pool is now illustrated using an operational item pool for mathematics. • This item pool contains 84 items, and is used to construct 15-item adaptive tests for various values of the latent trait. • The items in the pool have an average a-value of 1.61 (S.D. = .51), an average b-value of -.06 (S.D. = 1.10), and an average c-value of .15 (S.D. = .07). • For its intended purpose, this is an excellent item bank.

  14. Using a grid of θ’s from -3 to 3 at intervals of .5, 13 OATs were constructed from the 84-item bank. • In order to illustrate the item overlap in this collection of OATs, three of these were designated as focal OATs. • These focal OATs were those at θ = -1.5, 0 and 1.5. • One might think of these as the optimal tests for three cut-scores. • In the next three slides (one for each of the focal OATs), the overlap with neighboring OATs is shown. • Accuracy of the OATs is indicated with information functions and reliability coefficients.
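
The overlap analysis described on this slide can be sketched as follows. The operational item parameters are not given in the transcript, so the pool here is synthetic: 84 items drawn from normal distributions with the means and standard deviations reported for the pool, which is an assumption of this example and will not reproduce the figures on the following slides. The 3PL model with a scaling constant of 1 is also assumed, and all names are hypothetical.

```python
import math
import random

def item_information(theta, a, b, c):
    """3PL item information at theta (scaling constant 1 by assumption)."""
    p_star = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    p = c + (1.0 - c) * p_star
    return a * a * ((1.0 - p) / p) * p_star ** 2

def build_oat(pool, theta, length):
    """Return the ids of the `length` most informative items at theta."""
    ranked = sorted(pool, key=lambda i: item_information(theta, *pool[i]),
                    reverse=True)
    return set(ranked[:length])

def synthetic_pool(n_items=84, seed=0):
    """Synthetic stand-in for the operational pool (not the real items)."""
    rng = random.Random(seed)
    pool = {}
    for item_id in range(n_items):
        a = max(0.3, rng.gauss(1.61, 0.51))             # keep a positive
        b = rng.gauss(-0.06, 1.10)
        c = min(0.4, max(0.0, rng.gauss(0.15, 0.07)))   # keep c in [0, 0.4]
        pool[item_id] = (a, b, c)
    return pool

pool = synthetic_pool()
grid = [-3.0 + 0.5 * step for step in range(13)]        # 13 points, -3 to 3
oats = {theta: build_oat(pool, theta, 15) for theta in grid}

focal = (-1.5, 0.0, 1.5)
# Number of items each focal OAT shares with every OAT on the grid.
overlap = {f: {t: len(oats[f] & oats[t]) for t in grid} for f in focal}
for f in focal:
    print(f, [overlap[f][t] for t in grid])
```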

  15. OAT at θ = -1.5 and Overlap with Neighboring OATs

  16. OAT for θ = 0 and Neighboring OATs

  17. OAT for θ = 1.5 and Neighboring OATs

  18. Focal OATS Derived Using the Rasch Model

  19. Conclusions • The previous slides indicate that there will be considerable overlap in the CATs constructed from this item bank, in spite of the considerable variability in the difficulty of the items. • Hence, many of the items will be overexposed and subject to compromise. • In the actual use of this item bank, the exposure of items is controlled using the Sympson-Hetter exposure control method.

  20. The previous slides also indicate that the three focal OATs, optimal for θ = -1.5, 0 and 1.5, do a rather remarkable job of providing accurate measurement across the θ-continuum even though they contain only 15 items each. • OATs (and, by implication, CATs in general) will differ depending on the IRT model used in development and implementation.

  21. This also suggests that a two-stage CAT procedure would work quite well with this item bank. • In a two-stage CAT, an initial Stage 1 test is administered either in a CAT mode or as a fixed, medium-difficulty test. • Scores on the Stage 1 test are used to assign examinees to one of several Stage 2 tests, which vary in overall difficulty from easy to difficult (for example, one of the three focal OATs described above). • In this case (and perhaps in most cases), a pure CAT, where items are selected “on the fly”, does not seem to have any advantages over the pre-selected, optimal Stage 2 tests.
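
The Stage 2 assignment step might look like the sketch below. The slides do not specify a routing rule, so two things are assumed here: that the Stage 1 score has already been converted to a provisional θ estimate, and that each examinee is routed to the focal OAT whose target θ is nearest to it. The function name and the example estimates are hypothetical.

```python
def route_to_stage2(theta_stage1, focal_thetas=(-1.5, 0.0, 1.5)):
    """Return the target theta of the Stage 2 test for this examinee.

    Assumed rule: pick the focal OAT whose target theta is closest to the
    provisional Stage 1 estimate. Ties go to the easier test.
    """
    return min(focal_thetas, key=lambda target: abs(target - theta_stage1))

# Hypothetical provisional estimates after Stage 1.
for estimate in (-2.2, -0.4, 0.9):
    print(estimate, "->", route_to_stage2(estimate))
```

With this rule the implied cut points fall midway between the focal values, at θ = -0.75 and θ = 0.75. An operational program would set its cut points from its own score scale and classification needs.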
