
CRT Dependability



Presentation Transcript


  1. CRT Dependability: Consistency for criterion-referenced decisions

  2. Challenges for CRT dependability • Raw scores may not show much variation (skewed distributions) • CRT decisions are based on acceptable performance rather than relative position • A measure of the dependability of the classification (i.e., master / non-master) is needed

  3. Approaches using cut-score • Threshold loss agreement • In a test-retest situation, how consistently are the students classified as master / non-master • All misclassifications are considered equally serious • Squared error loss agreement • How consistent are the classifications • The consequences of misclassifying students far above or far below cut-point are considered more serious Berk, R. A. (1984). Selecting the index of reliability. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 231-266). Baltimore, MD: The Johns Hopkins University Press.

  4. Issues with cut-scores • “The validity of the final classification decisions will depend as much upon the validity of the standard as upon the validity of the test content” (Shepard, 1984, p. 169) • “Just because excellence can be distinguished from incompetence at the extremes does not mean excellence and incompetence can be unambiguously separated at the cut-off.” (p. 171) Shepard, L. A. (1984). Setting performance standards. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 169-198). Baltimore, MD: The Johns Hopkins University Press.

  5. Methods for determining cut-scores • Method 1: expert judgments about performance of hypothetical students on test • Method 2: test performance of actual students

  6. Setting cut-scores (Brown, 1996, p. 257)

  7. Institutional decisions (Brown, 1996, p. 260)

  8. Agreement coefficient (Po) and kappa
     Master / non-master classification table (test-retest), cells A, B / C, D:

                    Master   Non-master   Total
       Master           77            6      83
       Non-master        6           21      27
       Total            83           27     110

     Po = (A + D) / N
     Pchance = [(A + B)(A + C) + (C + D)(B + D)] / N²
     K = (Po - Pchance) / (1 - Pchance)

     Po = (77 + 21) / 110 = .89
     Pchance = [(83 × 83) + (27 × 27)] / 110² = .63
     K = (.89 - .63) / (1 - .63) = .70
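A minimal Python sketch of this slide's calculation (the function name and cell labels are mine; A, B, C, D follow the table above, reading across the rows):

```python
def agreement_and_kappa(A, B, C, D):
    """Agreement coefficient (Po) and kappa for a 2 x 2 master/non-master
    classification table from two administrations of a CRT."""
    N = A + B + C + D
    po = (A + D) / N                                              # observed agreement
    p_chance = ((A + B) * (A + C) + (C + D) * (B + D)) / N ** 2  # chance agreement
    kappa = (po - p_chance) / (1 - p_chance)                      # agreement beyond chance
    return po, p_chance, kappa

# Values from the slide: A = 77, B = 6, C = 6, D = 21 (N = 110)
po, p_chance, kappa = agreement_and_kappa(77, 6, 6, 21)
print(round(po, 2), round(p_chance, 2), round(kappa, 2))
# 0.89 0.63 0.71 -- the slide's .70 comes from rounding Po and Pchance before dividing
```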

  9. Short-cut methods for one administration • Calculate an NRT reliability coefficient • Split-half, KR-20, Cronbach alpha • Convert the cut-score to a standardized score • z = (cut-score - .5 - mean) / SD • Use Table 7.9 to estimate the agreement coefficient • Use Table 7.10 to estimate kappa
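A quick sketch of the standardization step in this short-cut (the mean and SD below are hypothetical; substitute the actual descriptive statistics, then read the estimate from the appropriate table):

```python
def standardized_cut_score(cut_score, mean, sd):
    """z = (cut-score - .5 - mean) / SD; the .5 is the continuity correction
    used in the short-cut formula."""
    return (cut_score - 0.5 - mean) / sd

# Hypothetical values: raw cut-score 27, test mean 30.5, SD 9.7
print(round(standardized_cut_score(27, 30.5, 9.7), 2))  # -0.41
```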

  10. Estimate the dependability for the HELP Reading test • Assume a cut-point of 60%. What is the raw score? (27) • z = -0.36 • Look at Table 9.1. What is the approximate value of the agreement coefficient? • Look at Table 9.2. What is the approximate value of the kappa coefficient?

  11. Squared-error loss agreement • Sensitive to degrees of mastery / non-mastery • Short-cut form of a generalizability study • Classical Test Theory • OS = TS + E • Generalizability Theory • OS = TS + (E1 + E2 + … + Ek) Brennan, R. (1995). Handout from generalizability theory workshop.

  12. Phi (lambda) dependability index
     Φ(λ) = 1 - [1 / (k - 1)] × [Mp(1 - Mp) - Sp²] / [(Mp - λ)² + Sp²]
     where k = number of items, λ = cut-point (expressed as a proportion), Mp = mean of the proportion scores, and Sp = standard deviation of the proportion scores
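A minimal sketch of this short-cut computation (the function and argument names are mine; the example values are hypothetical):

```python
def phi_lambda(k, lam, mean_p, sd_p):
    """Short-cut estimate of the phi(lambda) dependability index.
    k      = number of items
    lam    = cut-point expressed as a proportion (e.g., 0.60)
    mean_p = mean of the examinees' proportion scores
    sd_p   = standard deviation of the proportion scores
    """
    numerator = mean_p * (1 - mean_p) - sd_p ** 2
    denominator = (mean_p - lam) ** 2 + sd_p ** 2
    return 1 - (1 / (k - 1)) * (numerator / denominator)

# Hypothetical example: 30 items, cut-point .60, mean proportion score .72, SD .14
print(round(phi_lambda(30, 0.60, 0.72, 0.14), 3))  # ~0.815
```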

  13. Domain score dependability • Does not depend on cut-point for calculation • “estimates the stability of an individual’s score or proportion correct in the item domain, independent of any mastery standard” (Berk, 1984, p. 252) • Assumes a well-defined domain of behaviors

  14. Phi dependability index
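The formula image for this slide is not in the transcript. For reference, in generalizability theory the phi index is the proportion of score variance that is universe-score (person) variance when absolute decisions are intended: Φ = σ²(p) / [σ²(p) + (σ²(i) + σ²(pi,e)) / n_items]. The sketch below estimates it for a crossed persons × items design from a 0/1 score matrix; the function name and example data are mine, and this is the general G-theory estimate rather than necessarily the specific short-cut shown on the slide.

```python
import numpy as np

def phi_dependability(scores):
    """Estimate the phi dependability index for a crossed persons x items design.
    `scores` is a persons-by-items array of 0/1 item scores.
    Variance components come from the usual expected-mean-square equations;
    negative estimates are conventionally set to zero."""
    scores = np.asarray(scores, dtype=float)
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    ss_p = n_i * ((person_means - grand) ** 2).sum()
    ss_i = n_p * ((item_means - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    var_pi_e = ms_res                            # residual (pi,e) component
    var_p = max(0.0, (ms_p - ms_res) / n_i)      # person (universe-score) component
    var_i = max(0.0, (ms_i - ms_res) / n_p)      # item component
    return var_p / (var_p + (var_i + var_pi_e) / n_i)

# Hypothetical 5-person x 4-item score matrix
data = [[1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0]]
print(round(phi_dependability(data), 2))  # 0.75
```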

  15. Confidence intervals • Analogous to the SEM for NRTs • Interpreted in terms of proportion-correct scores rather than raw scores
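As an illustration of the idea (this is a generic binomial-error sketch, not necessarily the specific confidence-interval formula in the course materials), an interval around an examinee's proportion-correct domain score can be formed from the standard error of a proportion:

```python
import math

def proportion_score_interval(p_hat, k, z=1.96):
    """Approximate interval for a proportion-correct (domain) score using a
    simple binomial error model: SE = sqrt(p(1 - p) / k).
    p_hat = observed proportion correct, k = number of items."""
    se = math.sqrt(p_hat * (1 - p_hat) / k)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)

# Hypothetical examinee: 21 of 30 items correct
low, high = proportion_score_interval(21 / 30, 30)
print(round(low, 2), round(high, 2))  # 0.54 0.86
```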

  16. Reliability Recap • Longer tests are better than short tests • Well-written items are better than poorly written items • Items with high discrimination (ID for NRT, B-index for CRT) are better • A test made up of similar items is better • CRTs – a test that is related to the objectives is better • NRTs – a test that is well-centered and spreads out students is better
