  1. Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time
Yeow Meng Thum, Hye Sook Shin
UCLA Graduate School of Education & Information Studies
National Center for Research on Evaluation, Standards, and Student Testing (CRESST)
CRESST Conference 2004, Los Angeles

  2. Rationale
• Research shows that cut-scores vary as a function of many factors: raters, procedures, and time.
• How does one defend a particular cut-score? Averaging several values and drawing on collateral information are the current options.
• High-stakes accountability hinges on the comparability of performance standards over time.
• Some method is required to monitor cut-scores for consistency across groups and over time (Green et al.).

  3. Purpose of Study
• An approach for estimating the impact of procedural factors, rater characteristics, and time on cut-scores.
• A means of monitoring the consistency of cut-scores across several groups.

  4. Transforming Judgments into Scale Scores • Figure 1: Working with the Grade 3 SAT-9 mathematics scale

  5. Performance Distribution for Four Urban Schools • Figure 2: Grade 3 SAT-9 mathematics scale score distribution for four schools

  6. Potential Impact of Revising a Cut-score • Table 1: Potential impact on school performance when the cut-score changes
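The calculation behind a table like Table 1 is easy to sketch. Below is a minimal Python illustration using a simulated score distribution (the real SAT-9 distributions are shown in Figure 2); the distribution parameters and all cut-score values except 619, which appears in the results slide, are assumptions.

```python
import numpy as np

# Hypothetical illustration of Table 1: how the share of students classified
# as passing shifts when the cut-score moves. The score distribution here is
# simulated; the actual SAT-9 distributions appear in Figure 2.
rng = np.random.default_rng(0)
scores = rng.normal(loc=615, scale=25, size=500)   # one school's scale scores

for cut in (610, 615, 619, 625):
    pct = 100 * np.mean(scores >= cut)
    print(f"cut-score {cut}: {pct:.1f}% of students at or above")
```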

  7. Data & Model
• Simulate data for a standard-setting study design: a randomized block confounded factorial design (Kirk, 1995); see the sketch after this list.
• Factors of the standard-setting study:
• Rater dimensions (Teacher, Non-Teacher, etc.)
• Procedural factors/treatments:
• Type of feedback (outcome or impact feedback, "yes" or "no", etc.)
• Item sampling in booklet (number of items, etc.)
• Type of task (a modified Angoff, a contrasting-groups approach, the Bookmark method, etc.)
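A minimal sketch of how such a data set could be simulated. The rotation used to assign factor levels below is a simple stand-in, not Kirk's (1995) confounding scheme, and every effect size is an assumption made up for the illustration.

```python
import numpy as np
import pandas as pd

# Sketch of a simulated standard-setting data set in the spirit of this
# slide's design; factor assignment and effect sizes are assumptions.
rng = np.random.default_rng(1)
n_raters, n_items, n_rounds = 150, 20, 4
b = rng.normal(0, 1, n_items)                     # fixed item difficulties

rows = []
for j in range(n_raters):
    teacher = j % 2                               # rater dimension
    for t in range(n_rounds):
        feedback = (j + t) % 2                    # session factor 1
        targeting = (j // 2 + t) % 2              # session factor 2
        task = (j // 4 + t) % 3                   # session factor 3
        # rater-by-round cut-score on the logit scale (illustrative effects)
        theta = (0.3 * feedback - 0.2 * targeting + 0.1 * task
                 + 0.4 * teacher + 0.1 * t + rng.normal(0, 0.5))
        for i in range(n_items):
            p = 1 / (1 + np.exp(-(theta - b[i])))
            rows.append((j, t, i, teacher, feedback, targeting, task,
                         rng.binomial(1, p)))

data = pd.DataFrame(rows, columns=["rater", "round", "item", "teacher",
                                   "feedback", "targeting", "task", "y"])
print(data.head())
```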

  8. Treating Binary Outcomes
• Binary outcome:
(1) $y_{ijt} = 1$ if rater $j$ judges that a passing candidate has a good chance of getting item $i$ right in session $t$, and $y_{ijt} = 0$ otherwise.
• Logit link function:
(2) $\mathrm{logit}(p_{ijt}) = \log\!\left(\frac{p_{ijt}}{1 - p_{ijt}}\right) = \eta_{ijt}$
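A small sketch of the link in (2), mapping a rater's pass probability to the linear-predictor scale and back; the 0.75 probability is an arbitrary example value.

```python
import numpy as np

# The logit link in (2) and its inverse; a minimal sketch, not the
# paper's code.
def logit(p):
    return np.log(p / (1 - p))

def inv_logit(eta):
    return 1 / (1 + np.exp(-eta))

p = 0.75                     # rater thinks a passing candidate has a 75% chance
eta = logit(p)               # linear-predictor scale used by the model
print(eta, inv_logit(eta))   # 1.0986..., 0.75
```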

  9. IRT Model for Cut-score - I
• Item response model (IRT), in Rasch form:
(3) $\eta_{ijt} = \theta_{jt} - \beta_i$, where $\theta_{jt}$ is rater $j$'s cut-score in session $t$ and $\beta_i$ is the difficulty of item $i$.
• Procedural factors impacting a rater's cut-scores:
(4) $\theta_{jt} = \sum_s \gamma_s S_{sjt} + \delta_{jt}$
• where $\gamma_s$ is the fixed effect due to session characteristic $s$, and $\delta_{jt}$ is a random effect that evolves over time ($\mathrm{ROUND}_{jt}$) as a function of rater characteristics $X_{pj}$.
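A hypothetical evaluation of (3)-(4) in Python. The names (gamma, lam_teacher, delta_j) and all numeric values are assumptions made for the sketch, not estimates from the study.

```python
import numpy as np

# Illustrative evaluation of (3)-(4): the pass probability is a Rasch-type
# function of the gap between a rater's cut-score and the item difficulty,
# and the cut-score combines session fixed effects with a rater random
# effect that drifts over rounds. All names and values are assumptions.
gamma = {"feedback": 0.3, "targeting": -0.2, "task": 0.1}  # session effects
lam_teacher = 0.4          # rater-characteristic effect on the round slope
delta_j = (0.2, 0.05)      # rater j's random intercept and round slope

def cut_score(session, teacher, rnd):
    fixed = sum(gamma[k] * session[k] for k in gamma)       # equation (4)
    return fixed + delta_j[0] + (delta_j[1] + lam_teacher * teacher) * rnd

def p_pass(theta, b_i):
    return 1 / (1 + np.exp(-(theta - b_i)))                 # equation (3)

theta = cut_score({"feedback": 1, "targeting": 0, "task": 1}, teacher=1, rnd=2)
print(p_pass(theta, b_i=0.5))
```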

  10. IRT Model for Cut-score - II
• Estimating factors impacting a rater's cut-scores:
(5) $\delta_{jt} = \delta_{0j} + \delta_{1j}\,\mathrm{ROUND}_{jt}$, where $(\delta_{0j}, \delta_{1j})$ are distributed bivariate normal with means $(0, 0)$ and variance-covariance matrix $T = \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix}$.
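A sketch of the random-effects distribution in (5): draws of (intercept, round slope) pairs from a bivariate normal with mean (0, 0). The entries of T below are illustrative, not estimates.

```python
import numpy as np

# Drawing the rater random effects (intercept, round slope) from a
# bivariate normal with mean (0, 0) and covariance T, as described on
# this slide. The entries of T are illustrative.
T = np.array([[0.25, 0.05],
              [0.05, 0.04]])
rng = np.random.default_rng(2)
delta = rng.multivariate_normal(mean=[0.0, 0.0], cov=T, size=150)
print(delta.mean(axis=0))            # sample means, near (0, 0)
print(np.cov(delta, rowvar=False))   # sample covariance, near T
```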

  11. Likelihood
• Conditional on $\delta_j$, $y_j$ has probability $p(y_j \mid \delta_j) = \prod_{i,t} p_{ijt}^{\,y_{ijt}} (1 - p_{ijt})^{1 - y_{ijt}}$.
• Prior distribution of $\delta_j$:
(6) $\delta_j \sim N(0, T)$
• Conditional posterior of the rater random effects $\delta_j$:
(7) $p(\delta_j \mid y_j) = \dfrac{p(y_j \mid \delta_j)\, p(\delta_j)}{\int p(y_j \mid \delta_j)\, p(\delta_j)\, d\delta_j}$
• Joint marginal likelihood:
(8) $L = \prod_j \int p(y_j \mid \delta_j)\, p(\delta_j)\, d\delta_j$
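One way to compute the marginal likelihood in (8) is Gauss-Hermite quadrature, the same idea Proc NLMixed uses (adaptively) by default. A minimal sketch, assuming a single scalar random intercept per rater rather than the bivariate effect above:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Sketch of the joint marginal likelihood (8): condition on the rater
# random effect, multiply the Bernoulli probabilities, then integrate
# the random effect out with Gauss-Hermite quadrature.
def marginal_loglik(y, b, beta, tau, n_q=20):
    """y: (raters, items) 0/1 judgments; b: (items,) item difficulties."""
    nodes, w = hermegauss(n_q)
    w = w / np.sqrt(2 * np.pi)               # weights for a N(0, 1) density
    ll = 0.0
    for yj in y:                             # one rater at a time
        lik_j = 0.0
        for u, wk in zip(nodes, w):
            eta = beta + np.sqrt(tau) * u - b   # theta_j = beta + delta_j
            p = 1 / (1 + np.exp(-eta))
            lik_j += wk * np.prod(p**yj * (1 - p)**(1 - yj))
        ll += np.log(lik_j)
    return ll

rng = np.random.default_rng(3)
b = rng.normal(0, 1, 10)                     # item difficulties
y = rng.integers(0, 2, size=(5, 10))         # toy judgments for 5 raters
print(marginal_loglik(y, b, beta=0.0, tau=0.25))
```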

  12. Multiple Studies: Consistency & Stability
• Procedural factors impacting a rater's cut-scores for separate study $g$ ($g = 1, 2, 3, \ldots, G$):
(9) $\theta_{jtg} = \sum_s \gamma_{sg} S_{sjtg} + \delta_{jtg}$
• where $\gamma_{sg}$ is the fixed effect due to session characteristic $s$, and $\delta_{jtg}$ is a random effect that evolves over time ($\mathrm{SESSION}_{jt}$) as a function of rater characteristics $X_{pj}$.
• Group factors impacting a rater's severity:
(10)
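Continuing the hypothetical slide-9 sketch (it reuses cut_score defined there): study-specific effects in the spirit of (9) and a group-level severity shift in the spirit of (10) could be layered on as follows; the shift values are made up.

```python
# Continuing the slide-9 sketch (reusing cut_score): a hypothetical
# study-level severity shift added on top of the rater's cut-score.
group_shift = {1: 0.0, 2: 0.10, 3: -0.05}

def cut_score_g(session, teacher, rnd, g):
    return cut_score(session, teacher, rnd) + group_shift[g]

print(cut_score_g({"feedback": 1, "targeting": 0, "task": 1},
                  teacher=1, rnd=2, g=2))
```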

  13. Simulation
• SAS Proc NLMixed
• 150 raters randomly exposed, over 4 rounds, to a standard-setting exercise varying on 3 session factors.
• Session Factor 1: Feedback type
• Session Factor 2: Item targeting in booklet
• Session Factor 3: Type of standard-setting task
• Rater characteristics: Teacher, Non-Teacher
• Change over rounds (time)
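For readers without SAS, a rough open-source analogue of the NLMixed fit: maximize the quadrature log-likelihood from the slide-11 sketch (reusing marginal_loglik, y, and b defined there). This is a toy stand-in for the paper's estimation, not its actual code.

```python
import numpy as np
from scipy.optimize import minimize

# Maximize the marginal log-likelihood over (beta, log tau); working on
# log tau keeps the variance positive during the search.
def neg_ll(params, y, b):
    beta, log_tau = params
    return -marginal_loglik(y, b, beta, np.exp(log_tau))

res = minimize(neg_ll, x0=[0.0, np.log(0.5)], args=(y, b),
               method="Nelder-Mead")
beta_hat, tau_hat = res.x[0], np.exp(res.x[1])
print(f"beta = {beta_hat:.3f}, tau = {tau_hat:.3f}")
```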

  14. Selected Results
• The model recovers parameters, within sampling uncertainty, reasonably well across the 3 studies.
• The average cut-score (all Teachers) for each rater group at the last round is not significantly different from 619, while the first-round results were significantly different.
• Results from the model for multiple studies are similarly encouraging.

  15. Suggestions
• Large-scale testing programs should monitor their cut-score estimates for consistency and stability.
• For a stable performance scale, estimates of cut-scores and factor effects should be replicable to a reasonable degree across groups and over time.
• The model in this paper can be adapted to actual data in order to verify, and adjust for, variation due to the relevant factors of the study.
