
The Effect of Anchor Test Construction on Scale Drift





Presentation Transcript


    1. The Effect of Anchor Test Construction on Scale Drift. Judit Antal, Gerald Melican, Thomas Proctor, and Andrew Wiley, The College Board

    2. Anchor Tests
Used in the Non-Equivalent Anchor Test (NEAT) design to link test forms to be equated, by measuring differences between populations when equivalent groups are not available. Common practice in building anchors: a miniature version of the total test, representative of the content and statistical specifications of the full test.

Anchor tests are used in the non-equivalent anchor test (NEAT) design to link test forms to be equated by measuring differences between populations when equivalent groups are not available. The NEAT design is one of the most commonly used and well-studied designs; it assumes that the groups of students taking the two tests to be equated are not equivalent in the knowledge and skills being assessed. Therefore, it depends on the presence of a common item set to control for any differences in the difficulty of the test forms. One of the most common recommendations for building an anchor test is to develop a set of items that represents a miniature form of the total test with respect to content coverage and statistical characteristics.

    3. Anchor Tests
Statistical representativeness: equal mean and equal standard deviation of item difficulties; cover the difficulty range of the full test. The Sinharay/Holland study (2007) suggests relaxing some statistical requirements for external anchors; it compared "mini" (same SD) and "midi" (reduced SD) anchor tests. Recommendation: the SD can be reduced for external anchors.

With respect to statistical characteristics, the recommendation has generally been that the mean and standard deviation of the item difficulties be approximately equal to the same statistics for the full test. For most tests, this requirement implies that the anchor test will need to cover the entire difficulty range, including items that are very difficult and very easy, which is a difficult task for item developers. This prompted Sinharay and Holland's 2007 research investigating what would happen if some of the requirements for building anchor tests were relaxed. They compared equating results based on a conventionally defined anchor test (which they called a "mini" anchor) and a "midi" anchor whose mean equaled the main test's mean but whose standard deviation was only half that of the main test. They found that the two types of anchors performed very similarly, which suggests that the SD requirement can be relaxed for external anchors, making test developers' lives much easier.

    4. Study Goals
Link multiple forms to mimic a multi-year testing program with multiple forms administered over a longer time span, not only two forms. Evaluate scale drift across the chain using the "mini" vs. "midi" anchor concept proposed by Sinharay/Holland (2007): scale drift across equating designs (NEAT vs. RG) within each type of anchor, and scale drift across anchor types within equating designs. The current study establishes a baseline for later comparisons.

Most testing programs, however, do not need to equate just one test form to another. Instead, they have equating chains that run across multiple test forms and can stretch over several administrations and years. Therefore, we decided to investigate whether building anchor tests using the midi-test concept makes a testing program more or less vulnerable to scale drift. We wanted to investigate (1) scale drift across equating designs (NEAT vs. RG) within each type of anchor and (2) scale drift across anchor types within equating designs. This research represents an attempt to establish a baseline for later comparisons, in that the difficulty levels of the tests and the ability levels of the populations were held equal. Later research based on varying difficulty and ability will be performed.

    5. Design and Data Simulation
The Mathematics Test of a college admission exam was used to obtain initial parameter estimates. Main test: 54 items. External anchor tests: 20 items each. All items are dichotomous, and the 2PL IRT model was used in the simulation (with Parscale). Two series of six tests were generated with external anchors, and two anchor tests were embedded in each test form to facilitate common-item equating.

A Mathematics Test of a college admission exam was used to obtain initial parameter estimates for the simulation. The main test consists of 54 items; the two external anchor tests consist of 20 items each. Two series of six tests were generated so that the properties of chain equating could be evaluated: one chain included tests with "mini" anchors and a second chain included tests with "midi" anchors. Each test form included two embedded anchor tests to facilitate common-item equating.
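The response-generation step described above can be sketched as follows. Only the 54 + 20 + 20 item layout, the 4,000 simulees per form, and the 2PL model come from the slides; the parameter ranges, the ability distribution, and the seed are illustrative assumptions, not the study's actual Parscale estimates.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_2pl(a, b, theta, rng):
    """Simulate dichotomous responses under the 2PL IRT model.

    a, b  : arrays of item discrimination and difficulty parameters
    theta : array of simulee abilities
    """
    # P(correct) for every simulee x item pair (logistic 2PL)
    z = a[None, :] * (theta[:, None] - b[None, :])
    p = 1.0 / (1.0 + np.exp(-z))
    # Compare to uniform draws to produce 0/1 responses
    return (rng.random(p.shape) < p).astype(int)

# Illustrative parameters: 54-item main test plus two 20-item anchors
# (94 items per form) and 4,000 simulees, as described on the slide.
a = rng.uniform(0.5, 2.0, size=94)          # assumed discrimination range
b = rng.normal(-0.59, 1.0, size=94)         # mean matches slide 9; SD assumed
theta = rng.normal(0.0, 1.0, size=4000)     # assumed ability distribution

responses = simulate_2pl(a, b, theta, rng)
print(responses.shape)  # (4000, 94)
```

In a chain simulation, this generation step would be repeated for each of the six forms in each condition.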

    6. Equating Design
Here is the graphical representation of the design. Each pair of forms is linked with a different anchor set; we wanted to mimic a real assessment, and it seemed unrealistic to use one anchor test across the whole chain. Anchor #1 links to the previous form; Anchor #2 links to the future form. Six test forms (for each anchor scenario) needed to be created, where each test contains a main section (54 items) and two anchor sections (20 items each). The three sections of each test needed to be parallel, with matching means and distributional properties. The anchor tests that link consecutive forms not only needed to be parallel, but their item parameters had to be the same, as they represent the same item set. We made the means of the item difficulties for all six main forms the same as the old main test form to create equivalent groups. For the main test across the two chains, item parameters were fixed so that drift could be compared across the two conditions.

    7. Simulation
Built two series of tests: a "mini" (equal SD) chain and a "midi" (0.5 SD) chain. The anchor tests that link consecutive forms not only needed to be parallel, but their item parameters had to be the same, as they represent the same item set. The means of the item difficulties for all six main forms needed to be the same as the old main test form to create equivalent groups, and item parameters needed to be fixed across the two scenarios so that drift could be compared across the two anchor types.

Six test forms (for each anchor scenario) needed to be created, where each test contains a main section (54 items) and two anchor sections (20 items each). The three sections of each test needed to be parallel, with matching means and distributional properties.

    8. Data Simulation
100 replications for each test (600 across the chain). Baseline condition: equal form difficulty across the six test forms; no difference in ability from year to year; fixed test sizes (54, 20, 20); 4,000 simulees for each form. Simulation condition: anchor type ("mini" vs. "midi").
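The two anchor conditions can be sketched as draws of anchor-item difficulties, assuming normally distributed difficulties. The mean of -0.59 matches the generating difficulty mean quoted for the study; the main-test SD of 1.0, the seed, and the function name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Main-test difficulty statistics (mean from the study; SD is an assumption)
main_mean, main_sd = -0.59, 1.0

def draw_anchor_difficulties(n_items, condition, rng):
    """Draw anchor-item difficulties for the two study conditions.

    'mini' matches both the mean and SD of the main test;
    'midi' matches the mean but halves the SD (Sinharay & Holland, 2007).
    """
    sd = main_sd if condition == "mini" else 0.5 * main_sd
    return rng.normal(main_mean, sd, size=n_items)

mini_b = draw_anchor_difficulties(20, "mini", rng)   # 20-item anchor
midi_b = draw_anchor_difficulties(20, "midi", rng)
print(mini_b.std(), midi_b.std())  # sample SDs under each condition
```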

    9. Simulation: item difficulty statistics
This slide displays the means and standard deviations of the generating item difficulties for both chains. The first table includes the estimates when "mini" anchors were used; the second includes those for the "midi" tests. The item difficulty means are centered at the mean of the old form's difficulty parameters and are equal across all sections of all forms (mean = -0.59). For mini anchors, the SDs were also made approximately equal to the SD of the main test. According to our linking design, each test form is linked back to a previous form with anchor 1 (except Form X0) and linked forward to a future form with anchor 2 (except Form X5). Consecutive forms are linked through the same set of items; therefore, their SDs are much more similar than those of the two anchors embedded in the same test.

    10. Simulation: parameter recovery for "a" - equal SD (mini)
A recovery study was also conducted on the generated response sets, evaluating estimation quality with the root mean square error (RMSE), the square root of the average squared difference between the generating and estimated values. This involved running Parscale on the 600 generated response files and comparing the parameter estimates.
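The recovery criterion just described can be computed as below; the item parameter values are toy numbers, not the study's Parscale output.

```python
import numpy as np

def rmse(generating, estimated):
    """Root mean square error between generating and estimated parameters:
    the square root of the mean squared difference."""
    generating = np.asarray(generating, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return np.sqrt(np.mean((estimated - generating) ** 2))

# Toy difficulty values for illustration
true_b = np.array([-1.2, -0.5, 0.0, 0.4, 1.1])
est_b  = np.array([-1.1, -0.6, 0.1, 0.4, 1.0])
print(rmse(true_b, est_b))
```

In the study, this comparison would be made per parameter type ("a" and "b") across the replicated estimates for each form.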

    11. Simulation: parameter recovery for "b" - equal SD (mini)

    12. Simulation: parameter recovery for "a" - reduced SD (midi)

    13. Simulation: parameter recovery for "b" - reduced SD (midi)

    14. Evaluation Criteria
In assessing the impact of using mini vs. midi anchor tests on equating, we used random-groups equating as the basis for comparison. Each form is equated back to Form 0 (the base form) either through a chained equipercentile method using an external anchor or directly using the random-groups equipercentile method. These random-groups equated scores served as the "true" equated scores. Three metrics were used to examine the discrepancies between the "true" equated scores and the 100 replicated equating results, where f_i is the frequency of the i-th raw score, x_i.AN is the raw score equivalent for the NEAT design (A denotes the anchor type), and x_i.R is the raw score equivalent for the RG design:

Root mean squared difference: RMSD = sqrt( sum_i f_i (x_i.AN - x_i.R)^2 / sum_i f_i )
Mean absolute difference: MAD = sum_i f_i |x_i.AN - x_i.R| / sum_i f_i
Mean signed difference: MSD = sum_i f_i (x_i.AN - x_i.R) / sum_i f_i
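The three discrepancy indices can be computed directly from the score frequencies and the two sets of equivalents; the function name and the toy inputs below are illustrative.

```python
import numpy as np

def equating_discrepancy(x_neat, x_rg, f):
    """Frequency-weighted discrepancy indices between NEAT and RG equivalents.

    x_neat : raw-score equivalents from the NEAT design
    x_rg   : criterion ("true") equivalents from random-groups equating
    f      : frequency of each raw score
    """
    x_neat, x_rg, f = (np.asarray(v, dtype=float) for v in (x_neat, x_rg, f))
    w = f / f.sum()                       # normalize frequencies to weights
    d = x_neat - x_rg
    rmsd = np.sqrt(np.sum(w * d ** 2))    # root mean squared difference
    mad = np.sum(w * np.abs(d))           # mean absolute difference
    msd = np.sum(w * d)                   # mean signed difference
    return rmsd, mad, msd

# Toy example: three raw-score points
rmsd, mad, msd = equating_discrepancy(
    [10.2, 20.1, 29.8], [10.0, 20.0, 30.0], [5, 10, 5])
```

The signed index reveals systematic drift direction, while RMSD and MAD capture overall magnitude.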

    15. Results
This table displays the raw score means and standard deviations for each of the six main test forms and the external anchor tests associated with each form. On average, the means and SDs of the three tests within each form were stable across forms within each condition.

    16. Results
The correlations between the main test and either anchor test within a form were higher than the correlation between the two anchor tests within a form, a result that holds across all forms and both conditions.

    17. Results
This table contains the average weighted and equally weighted root mean square difference (RMSD), mean absolute difference, and mean signed difference for unrounded raw score equivalents, relative to the criterion equating. When comparing conditions, neither the mini nor the midi condition appears to produce larger errors relative to the criterion. For both conditions, error clearly accumulates the further up the chain of forms one moves relative to the criterion equating.

    18. Results: NEAT vs. RG for "mini" chain
The following two plots show the differences in unrounded raw score equivalents between the NEAT equivalent scores and the random groups (RG) equivalent scores for both conditions. This plot depicts the unrounded raw score differences between the NEAT and RG equivalents for the mini anchor condition. A useful criterion for difference plots is the difference that matters (DTM; Dorans & Feigenbaum, 1994), which in raw score points is 0.5 and would lead to different rounded equated raw scores. For both conditions, no score point exceeds the DTM, indicating that the mini and midi conditions, relative to the criterion, produce the same equated scores.
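A DTM check of the kind described above can be sketched as follows; the equivalent scores are toy values, and only the 0.5-point threshold comes from the slide.

```python
import numpy as np

DTM = 0.5  # difference that matters, in raw-score points (Dorans & Feigenbaum, 1994)

def exceeds_dtm(neat_equivalents, rg_equivalents, dtm=DTM):
    """Return the indices of raw-score points whose NEAT-vs-RG difference
    reaches or exceeds the DTM threshold."""
    diff = np.asarray(neat_equivalents) - np.asarray(rg_equivalents)
    return np.flatnonzero(np.abs(diff) >= dtm)

# Toy equivalents: every difference is below 0.5, so no point is flagged
flagged = exceeds_dtm([12.1, 25.3, 40.2], [12.0, 25.0, 40.5])
print(flagged)  # empty array
```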

    19. Results: NEAT vs. RG for "midi" chain

    20. Results
This table contains the discrepancy statistics calculated across the two chains (between the mini and midi conditions). For the RG equating method, the indices suggest that differences in equivalent raw scores across conditions are stable. For the NEAT method, there was less error between conditions than for the RG method for Form 1, but for the other four forms the error increased with progression up the equating chain, such that the errors between conditions are slightly greater than the errors within conditions.

    21. Results: RG across chains ("mini" vs. "midi")
The next two slides show the differences in unrounded raw score equivalents between conditions for the two equating methods.

    22. Results: NEAT across chains ("mini" vs. "midi")
It is evident from these plots that the differences in equivalent scores between conditions are larger for the NEAT method, and span a larger range of raw scores, than for the RG equating method. However, for both equating methods no score point exceeds the DTM, indicating that across conditions the same equivalent score would be obtained.

    23. Conclusions &amp; Future Research
Results support the Sinharay/Holland recommendation: the SD can be reduced when building external anchor tests. Continuing research efforts: various ability levels and form difficulties; total/anchor length ratios; small-sample equating; correlations between anchor and total test; effects on IRT equating methods; real data applications.

The 2007 Sinharay and Holland study suggested that it may be possible to relax the rules regarding the distribution of item difficulties when constructing an anchor test. This study extended that work by evaluating whether the accumulation of error across multiple test forms could be a concern with the adjustments they recommended. Overall, the results of this study support the Sinharay and Holland proposal under a very controlled item and simulee generation method. Moving forward, multiple other studies will be required before a complete evaluation of the proposal is possible. Key characteristics that will need to be investigated include the ability levels of the populations, the difficulty levels of the total test forms, and the correlations between the anchor and total test scores. The effects of sample size and of different types of equating (e.g., IRT-based methods) also need to be investigated. Lastly, and likely most importantly, a future study should evaluate the impact of the anchor test type using actual test results rather than simulated data.

    24. Thank You!
