Scaling and Equating

Joe Willhoft, Assistant Superintendent of Assessment and Student Information
Yoonsun Lee, Director of Assessment and Psychometrics
Office of Superintendent of Public Instruction

Presentation Transcript


  1. Scaling and Equating Joe Willhoft, Assistant Superintendent of Assessment and Student Information; Yoonsun Lee, Director of Assessment and Psychometrics; Office of Superintendent of Public Instruction

  2. Overview • Scaling • Definition • Purposes • Equating • Definition • Purposes • Designs • Procedures • Vertical Scale

  3. What is Scaling? • Scaling is the process of associating numbers with the performance of examinees • What does a score of 400 mean on the WASL? It is not a raw score but a scaled score.

  4. Primary Score Scale • Many educational tests use one primary score scale for reporting scores • Raw scores, scaled scores, percentiles • WASL and WLPT-II use scaled scores

  5. Activity Grade 3 Mathematics Items

  6. G3 Math Items

  7. Why Use a Scaled Score? • Minimizing misinterpretations, e.g.: Emmy got 30 points last year and met the standard. I got 31 points this year but did not meet the standard. Why? The cut score last year was 30 points and the cut score this year is 32 points. Did you raise the standard?

  8. Why Use a Scale Score? • Facilitate meaningful interpretation • Comparison of examinees’ performance on different forms • Tracking of trends in group performance over time • Comparison of examinees’ performance on different difficulty levels of a test

  9. Raw Score and Scaled Score • Monotonically related (linear in theta) • Based on the Item Response Theory ability scale • Each observed performance corresponds to an ability value (theta) • Scaled score = a + b*(theta)

  10. Linear Transformation Simple linear transformation: Scaled Score = a + b*(ability) Two parameters, a and b, describe that relationship. We obtain some sample data and find the values of a and b that best fit the data to the linear model.

  11. WASL 400 = a + b*(theta 1) 375 = a + b*(theta 2) • Theta 1 and theta 2 are established by the standard setting committees. • a and b are determined by solving the equations above.

  12. WLPT-II • Min Scaled Score = 300 • Max Scaled Score = 900 300 = a + b*(theta 1) 900 = a + b*(theta 2)
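The pairs of equations on slides 11 and 12 can be solved directly as a two-point linear system. A minimal sketch in Python; the theta values below are hypothetical placeholders, since the actual committee-set cuts are not given on these slides:

```python
def solve_scale_constants(theta1, ss1, theta2, ss2):
    """Solve ss = a + b*theta through two (theta, scaled score) points."""
    b = (ss2 - ss1) / (theta2 - theta1)   # slope
    a = ss1 - b * theta1                  # intercept
    return a, b

# Hypothetical theta cuts mapped to the WLPT-II endpoints 300 and 900
a, b = solve_scale_constants(-4.0, 300, 4.0, 900)
print(a, b)  # 600.0 75.0 under these placeholder thetas
```

The same function applies to the WASL case, with 375 and 400 in place of the 300/900 endpoints.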

  13. WASL Scaling • 375 is the cut between level 1 and level 2 for all grade levels and content areas • 400 is the cut between level 2 and level 3 for all grade levels and content areas. • Each grade/content has a separate scale (WASL) • All grade levels are in the same scale (WLPT-II) - vertically linked

  14. WASL: diagram showing the common 375 and 400 cut scores applied separately across Grades 3-8 and high school

  15. WLPT-II (Vertical Scale): diagram showing a single vertical scale from 300 to 900 spanning Grades K-12

  16. Equating

  17. Purpose of Equating • Large scale testing programs use multiple forms of the same test • Differences in item and test difficulties across forms must be controlled • Equating is used to ensure that scale scores are equivalent across tests

  18. Requirements of Equating Four necessary conditions for equating (Lord, 1980): • Ability - equated tests must measure the same construct (ability) • Equity - after transformation, the conditional frequency distributions for each test are the same • Population invariance • Symmetry

  19. Ability - Equated Tests Must Measure the Same Construct (Ability) • Item and test specifications are based on definitions of the abilities to be assessed • Item specifications define how the abilities are shown • Test specifications ensure representation of all aspects of the construct • Tests to be equated should measure the same abilities in the same ways

  20. Equity • Scales on the tests to be equated should be strictly parallel after equating • Frequency distributions should be roughly equivalent after transformation

  21. Population Invariance • The outcome of the transformation must be the same regardless of which group is used as the anchor • If score Y1 on Y is equated to score X1 on X, the result should be the same as if score X1 is equated to score Y1 • If a score of 10 on 2007 Mathematics is equivalent to a score of 11 on 2006 Mathematics (when 2006 is used as the anchor), then a score of 11 on 2006 Mathematics should be equivalent to a score of 10 on 2007 Mathematics (when 2007 is used as the anchor)

  22. Symmetry • The function used to transform the Y scale to the X scale is the inverse of the function used to transform the X scale to the Y scale • If the 2007 Mathematics scale is equated to 2006 Mathematics scale, the function used to do the equating should be the inverse of the function used when the 2006 Mathematics scale is equated to the 2007 Mathematics scale

  23. Equating Design Used in WASL • Common-Item Nonequivalent Groups Design (Kolen & Brennan, 1995) • A set of items in common (anchor items) • Different groups of examinees (in different years)

  24. Equating Method • Item Response Theory Equating uses a transformation from one scale to the other • to make score scales comparable • to make item parameters comparable

  25. Equating of WASL • The items on a WASL test differ from year-to-year (within grade and content area) • Some items on the WASL have appeared in earlier forms of the test, and item calibrations (“b” difficulty/step values) were established. These are called “Anchor Items”. • Each year’s WASL is equated to the previous year’s scale using these anchor items.

  26. Equating Procedure 1. Identify anchor item difficulties from the bank. 2. Calibrate all items on the current test form without fixing anchor item difficulties. 3. Calculate the mean of the anchor items using bank difficulties. 4. Calculate the mean of the anchor items using the calibrated difficulties from the current test. 5. Add a constant to the current test difficulties so the mean equals the mean from the bank values.

  27. Equating Procedure 6. For each anchor item, subtract the current difficulty from the bank difficulty (after adding the constant). 7. Drop the item with the largest absolute difference greater than 0.3 from consideration as an anchor item. 8. Repeat steps 3-7 using the remaining anchor items.
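The procedure on slides 26-27 amounts to a mean-shift on anchor difficulties with iterative screening of misfitting anchors. A hypothetical sketch, not OSPI's actual implementation; the item IDs and logit values are invented for illustration:

```python
def equate_to_bank(bank, current, tol=0.3):
    """Mean-shift equating with iterative anchor screening.

    bank, current: dicts mapping anchor item id -> difficulty (logits).
    Returns the additive equating constant and the surviving anchor set.
    """
    anchors = set(bank) & set(current)
    while True:
        # Constant that matches the current-form anchor mean to the bank mean
        const = (sum(bank[i] for i in anchors) / len(anchors)
                 - sum(current[i] for i in anchors) / len(anchors))
        # Bank-minus-current difference for each anchor, after the shift
        diffs = {i: bank[i] - (current[i] + const) for i in anchors}
        worst = max(diffs, key=lambda i: abs(diffs[i]))
        if abs(diffs[worst]) <= tol:
            return const, anchors
        anchors = anchors - {worst}   # drop the most discrepant anchor, re-equate

# Invented difficulties: item 4 has drifted and should be screened out
bank = {1: 0.0, 2: 1.0, 3: -1.0, 4: 0.5}
current = {1: 0.2, 2: 1.2, 3: -0.8, 4: 1.5}
const, kept = equate_to_bank(bank, current)
print(round(const, 6), sorted(kept))  # -0.2 [1, 2, 3]
```

With all four anchors the shift would be -0.4 and item 4 would sit 0.6 logits off the bank value; after it is dropped, the remaining three anchors agree and the constant settles at -0.2.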

  28. Equating Example • Item calibrations before equating (anchor items flagged on the right with “Y”)

  29. Equating Example • Item #17 was removed as an anchor item; other anchors were kept.

  30. Equating Example • Item calibrations after equating (anchor items fixed, marked with “A” in the Measure column)

  31. Transformed Scores: Raw-to-Theta-to-Scale Procedures • Calibration software provides a Raw-to-Theta look-up table. • A Theta-to-Scale-Score transformation is applied, derived from the theta values at the three cut-points set by the Standard Setting committee: theta(L2) -> 375, theta(L3) -> 400, and theta(L4) -> SS(L4), obtained by solving SS = m*theta + b at the (L4) theta, where m and b are derived from the (L2) and (L3) points.

  32. Transformed Scores Example • In Grade 4 Mathematics, the Standard Setting Committee established theta cut-scores for (L2), (L3), and (L4). • Setting SS(L2) = 375 and SS(L3) = 400 establishes this Theta-to-SS formula: SS = 37.76435*theta + 378.3988 • Solving at the (L4) theta: SS(L4) = 427.115
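The slide's constants can be reproduced by fitting the line through the two committee cut-points. The theta values used below (-0.090 for L2, 0.572 for L3, 1.290 for L4) are back-solved from the reported constants, so they are inferred, not quoted from the slide:

```python
def theta_to_ss(theta_l2, theta_l3, ss_l2=375.0, ss_l3=400.0):
    """Derive SS = m*theta + b from the (L2) and (L3) cut-points."""
    m = (ss_l3 - ss_l2) / (theta_l3 - theta_l2)
    b = ss_l2 - m * theta_l2
    return m, b

m, b = theta_to_ss(-0.090, 0.572)     # inferred theta cuts
print(round(m, 5), round(b, 4))       # 37.76435 378.3988
print(round(m * 1.290 + b, 3))        # 427.115, the SS at the inferred (L4) theta
```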

  33. Theta-to-SS Transformations • The current Theta-to-SS transformations:

  34. Transformed Scores • Raw-to-Scale Score table from equating report

  35. How to Determine the Cut Score (Until 2006) • If a scale score of 400 exists, the cut score is 400 • If 400 does not exist, the attainable score nearest to 400 becomes the cut score e.g. - 397, 400, 402: 400 is the cut score - 398, 401, 403: 401 is the cut score - 399, 402, 405: 399 is the cut score

  36. How to Determine the Cut Score (2007) • If a scale score of 400 exists, the cut score is 400 • If 400 does not exist, the nearest attainable score below 400 becomes the cut score e.g. - 397, 400, 402: 400 is the cut score - 398, 401, 403: 398 is the cut score - 399, 402, 405: 399 is the cut score
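The two selection rules on slides 35-36 can be written out as small functions. An illustrative sketch; tie-breaking for the pre-2007 "nearest score" rule is not specified on the slide:

```python
def cut_score_2006(scores):
    """Pre-2007 rule: 400 if attainable, else the attainable score nearest 400."""
    if 400 in scores:
        return 400
    return min(scores, key=lambda s: abs(s - 400))  # slide does not define ties

def cut_score_2007(scores):
    """2007 rule: 400 if attainable, else the nearest attainable score below 400."""
    if 400 in scores:
        return 400
    return max(s for s in scores if s < 400)

print(cut_score_2006([398, 401, 403]))  # 401 (nearest to 400)
print(cut_score_2007([398, 401, 403]))  # 398 (nearest below 400)
```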

  37. Vertical Scaling

  38. Vertical Scale • Places examinee performance across grade levels on a single scale • Measures individual student growth • Locates all items across grade levels on a single scale • Places proficiency standards from different grade levels on a single scale

  39. Vertical Scaling vs. Equating • Equating: allows scores on different test forms to be used interchangeably within a grade level • Vertical scaling: • Places performance across all grade levels on the same scale • Measures students’ growth • Is not equating

  40. Data Collection Design • Common item design • Common items between adjacent grade levels • Appropriate-level items are selected for each grade • Equivalent groups design • The same examinees • Each takes an on-grade test or an off-grade test (usually the lower-grade test)

  41. Common Item Design (WASL)

  42. Previous Vertical Linking Study • Math in Grades 3, 4, and 5 • Purpose of the study • How much are students growing over time? • What is the precision of these estimates?

  43. Data • The data consist of items used in the pilot tests for Grades 3 and 5 in 2004 and 2005 • Operational data for Grade 4 in 2005

  44. Linking Design • Items across all forms in three grades • Each form within grade includes a common block of items • Common item non-equivalent groups design

  45. Common Item Design (WASL)

  46. Item Review (Item Means)

  47. Item Review

  48. Results • Comparing the p-values for the linking items across grades suggests some instability • Growth is larger from grades 3 to 4 than grades 4 to 5 • Pilot data vs. operational data • Motivation factor (G4 to G5) • Backward Equating

  49. Future Plan • A vertical linking study will be conducted in January 2008 using the 2007 Reading WASL. • The results will be presented next year.
