
Modeling Student Growth Using Multilevel Mixture Item Response Theory


Presentation Transcript


  1. Modeling Student Growth Using Multilevel Mixture Item Response Theory. Hong Jiao & Robert Lissitz, University of Maryland. Presentation at the 2012 MARCES Conference, October 2012.

  2. Thanks to Yong Luo, Chao Xie, and Ming Li for feedback

  3. Outline of presentation • Value-added modeling • Multilevel IRT models • Mixture IRT models • Direct modeling of students’ growth parameters in multilevel mixture IRT models • Simulation for direct modeling of growth in IRT models • Future explorations

  4. Value-added modeling • VAM intends to estimate the effect of educational inputs on student outcomes or student achievement as measured by standardized tests (McCaffrey et al., 2003) • Accurate estimation of students’ achievement is very important because high-stakes decisions are attached to the use of such scores • All value-added models estimate the growth associated with schools and/or teachers • To measure growth, some models control for students’ prior achievement (FL commissioned paper by AIR)

  5. Complexity of the VAM • How prior achievement is accounted for • How value-added scores of school and teacher effects are estimated • Assumptions about the sustainability of school and teacher effects • Value-added models can be grouped into two major classes (AIR): • Typical learning path models • Covariate adjustment models

  6. Typical learning path models: longitudinal mixed-effects models • Each student is assumed to have a typical learning path • Schools and teachers can alter this learning path relative to the state mean, a conditional average • No direct control of prior achievement • With more data points, a student’s propensity to achieve can be estimated more accurately • With each passing year, a student’s typical learning path can be estimated with increased precision • Different learning path models make different assumptions about how teachers and schools can impact a student’s propensity to achieve

  7. Different learning path models • In Sanders’ Tennessee Value-Added Assessment System (TVAAS) model, teacher effects are assumed to have a permanent impact on students • McCaffrey and Lockwood (2008) relaxed this assumption and let the data dictate the extent to which teacher effects decay over time • Kane et al. (2008) found that teacher effects appear to dissipate over the course of about two years in an experiment in Los Angeles

  8. Covariate adjustment models • Direct control of prior student scores: prior test scores are included as predictors in the model • Teacher effects can be treated as either fixed or random • To obtain unbiased estimates, covariate adjustment models must account for the measurement error introduced by including students’ prior achievement as a model predictor

  9. Covariate adjustment models • Two frequently used methods for accounting for measurement error in regression analysis are: • Direct modeling of the error, as in structural equation models or errors-in-variables regression • An instrumental variable approach, using one or more variables that are assumed to influence the current-year score but not the prior-year score, to statistically purge the measurement error from the prior-year score (a minimal sketch of this idea follows)
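As a minimal sketch of the instrumental-variable idea, assuming a second, independently error-prone measure of prior achievement is available to serve as the instrument (all variable names and generating values here are illustrative, not taken from the presentation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Generating model: posttest depends on TRUE prior achievement
true_prior = rng.normal(0.0, 1.0, n)
y_post = 0.8 * true_prior + 0.3 + rng.normal(0.0, 0.5, n)

# Two error-prone measures of prior achievement:
# x is the covariate entered in the regression, z is the instrument.
x = true_prior + rng.normal(0.0, 0.6, n)
z = true_prior + rng.normal(0.0, 0.6, n)

# Naive OLS slope on the noisy covariate: attenuated by unreliability
b_ols = np.cov(x, y_post)[0, 1] / np.var(x, ddof=1)

# IV slope: cov(instrument, outcome) / cov(instrument, covariate)
b_iv = np.cov(z, y_post)[0, 1] / np.cov(z, x)[0, 1]

print(f"true slope 0.80 | OLS {b_ols:.2f} | IV {b_iv:.2f}")
```

OLS on the noisy pretest is attenuated toward zero by the pretest’s unreliability, while the IV estimate recovers the generating coefficient because the instrument’s error is independent of the covariate’s error.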

  10. Statistical controls for contextual factors • Students are not randomly assigned to districts, schools, and classes • Parents select schools and teachers; teachers select schools, subjects, and sections; principals exercise discretion in assigning certain students to certain teachers • These selection factors can cause significant biases • Unbiased estimation of teacher value-added requires controlling for the factors that influence both the selection of students into particular classes and current-year test scores

  11. Statistical controls for contextual factors • Many value-added models assume that only students’ prior test scores are relevant to their posttest scores • Other models incorporate controls for additional variables that might influence selection and outcomes

  12. Statistical controls for contextual factors • Empirical evidence is mixed on the extent to which student characteristics other than score histories remain correlated with test scores after controlling for prior test scores • Some studies have found that controlling for student-level characteristics makes little if any difference in model estimates (Ballou, Sanders, & Wright, 2004; McCaffrey et al., 2004) • This is consistent with the view that durable student characteristics associated with race, income, and other factors are already reflected in prior test scores, so that controlling for prior test scores controls for any relevant impact of the factors proxied by the measured characteristics

  13. Statistical controls for contextual factors • In contrast, when student factors are aggregated to the school or classroom level, they sometimes reveal a significant residual effect (Raudenbush, 2004; Ballou, Sanders, & Wright, 2004): school or classroom characteristics may explain variance in students’ posttest scores beyond the individual characteristics already accounted for by prior test scores • An alternative interpretation is that true teacher effectiveness varies with student characteristics, so that the correlated variation in estimated teacher value-added reflects these true differences rather than uncontrolled selection bias

  14. Durability of teacher effects • Typical learning path models require an assumption about the durability of the impact of teachers on a student’s learning path • Sanders’ Tennessee Value-Added Assessment System assumes that teacher effects have a permanent impact on students • McCaffrey and Lockwood (2008) let the data dictate the extent to which teacher effects decay over time • Kane et al. (2008) found that teacher effects appeared to dissipate over the course of about two years in an experiment in Los Angeles

  15. Durability of teacher effects • Covariate models make no assumption about the durability of teacher effects: they establish expectations explicitly from prior achievement by including prior test scores as covariates, rather than relying on the abstract ‘propensity to achieve’ estimated in learning path models

  16. Unit of measurement for student achievement • The Colorado growth model (Betebenner, 2008) uses entirely normative in-state percentile ranks • It does not rely on a potentially flawed vertical scale • But it provides only normative criteria • Students’ growth is examined relative to their peers rather than as absolute growth in their own learning

  17. Dependent variable in growth modeling • The majority of models use an interval measure of student achievement: the scaled test score • Student percentile rank within the student’s grade has also been used as the dependent variable in some models

  18. Correction of biased estimates of teacher effects in VAM • Selection effects include parent selection of schools and teachers; teacher selection of schools, subjects, and sections; and principal discretion in assigning certain students to certain teachers • Selection effects can be mitigated when the model includes factors that are not accounted for by pretest scores but are associated with posttest scores after controlling for pretest scores

  19. Issues arising from the use of achievement test scores as an outcome measure • Testing is infrequent: once a year • Tests can only sample from the topics related to achievement • The scale for measuring achievement is not predetermined by the nature of achievement but is chosen by the test developer • Changes to the timing of tests, the weight given to alternative topics, or the scaling of the test could change our conclusions about the relative achievement, or growth in achievement, across classes of students

  20. Potential problems in value-added models • Linking errors could be conflated with teacher effects • The equal-interval property of the scale across grades is questionable • Ceiling effects at higher grades may lead to smaller learning gains than in the middle of the scale • Measurement error causes estimated treatment effects to be confounded with group means of prior achievement (Lockwood, 2012)

  21. Covariate Adjusted Models (McCaffrey et al., 2003): $y_{it} = \mu_i + b\,y_{i,t-1} + \theta_{j(i,t)} + \varepsilon_{it}$, where $y_{it}$ is the student’s score at time $t$, $\mu_i$ is a student-specific mean, $y_{i,t-1}$ is the student’s score at time $t-1$, $\theta_{j(i,t)}$ is the teacher effect, and $\varepsilon_{it}$ is the error term, assumed to be normally distributed and independent of $\theta_{j(i,t)}$.

  22. Gain Score Models (McCaffrey et al., 2003): $y_{it} - y_{i,t-1} = \mu_i + \theta_{j(i,t)} + \varepsilon_{it}$, where $y_{it}$ is the student’s score at time $t$, $\mu_i$ is a student-specific mean, $y_{i,t-1}$ is the student’s score at time $t-1$, $\theta_{j(i,t)}$ is the teacher effect, and $\varepsilon_{it}$ is the error term, assumed to be normally distributed and independent of $\theta_{j(i,t)}$. The gain score model can be viewed as a special case of the covariate adjusted model in which $b$, the coefficient of prior achievement, is equal to 1.

  23. Teacher Effect Estimate in VAM (Luo, Jiao, & Van Wie, 2012) • Two-step process: • In most value-added modeling, student achievement scores are estimated before entering the model that estimates teacher or school effects • Students’ achievement scores are first estimated with a particular item response theory (IRT) model, most often a unidimensional IRT model

  24. Issues with Two-Step Process (Luo, Jiao, & Van Wie, 2012) • Standard IRT models are used operationally to measure students’ achievement • Non-random assignment of students into schools and classes causes local person dependence due to the nesting structure (Reckase, 2009; Jiao et al., 2012) • Measurement precision might be affected • Parameter estimates may be biased due to the reduced effective sample size (Cochran, 1977; Cyr & Davies, 2005; Kish, 1965); the standard design-effect formula is shown below • Ultimately, the accuracy of teacher and school effect estimates may also be affected
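The effective-sample-size point can be made concrete with Kish’s (1965) design effect, a standard formula rather than one shown on the slide; for clusters of size $m$ with intraclass correlation $\rho$:

```latex
\mathrm{DEFF} = 1 + (m - 1)\rho,
\qquad
n_{\mathrm{eff}} = \frac{n}{\mathrm{DEFF}} = \frac{n}{1 + (m - 1)\rho}
```

For example, with classes of $m = 25$ students and $\rho = 0.2$, $\mathrm{DEFF} = 5.8$, so 1,000 sampled students carry roughly the information of only 172 independent ones.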

  25. Outcome variables in VAM • Standardized test scores • Test scores contain intrinsic measurement error • A possible solution is to use multilevel item response theory (IRT) models • Simultaneous modeling of students’ achievement, teacher effects, and school effects, using item response data as the input; the latent ability is estimated simultaneously with the other model parameters, such as item parameters and the teacher and school random effects (Van Wie, Luo, & Jiao, 2012; Luo, Jiao, & Van Wie, 2012)

  26. Four-level model in the traditional Rasch model format (Van Wie, Luo, & Jiao, 2012)
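The slide’s equation is not reproduced in the transcript. A plausible four-level Rasch formulation, with all notation assumed here (items $i$, students $j$, teachers $k$, schools $l$), is:

```latex
\operatorname{logit} P(y_{ijkl} = 1) = \theta_{jkl} - b_i,
\qquad
\theta_{jkl} = \gamma_{kl} + u_{jkl},
\quad
\gamma_{kl} = \delta_l + v_{kl},
\quad
\delta_l = \mu + w_l,
```

with $u_{jkl}$, $v_{kl}$, and $w_l$ as normal random effects whose variances capture the student, teacher, and school levels, respectively.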

  27. Four-level model in the traditional 3PL IRT model format (Luo, Jiao, & Van Wie, 2012)
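Again the equation itself is missing from the transcript; a plausible 3PL analogue, under the same assumed notation, replaces the Rasch response function with:

```latex
P(y_{ijkl} = 1) = c_i + (1 - c_i)\,
\frac{\exp\!\bigl[a_i(\theta_{jkl} - b_i)\bigr]}
     {1 + \exp\!\bigl[a_i(\theta_{jkl} - b_i)\bigr]},
```

where $a_i$, $b_i$, and $c_i$ are the discrimination, difficulty, and pseudo-guessing parameters, and $\theta_{jkl}$ is decomposed over teacher and school levels as above.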

  28. Multilevel IRT Framework • Model parameter estimation for the 4-level IRT models: item parameters, student ability, teacher effect, and school effect • Rasch: HLM7, PROC GLIMMIX, MCMC • 2PL: MCMC • 3PL: MCMC

  29. Teacher Effect and School Effect Computation • In the two-step process, the teacher effect is computed as the average of the scores of the nested students, and the school effect is computed as the average of the teacher effects within the school; this is analogous to the status model (a minimal sketch follows) • In the 4-level IRT model, the student ability, teacher effect, and school effect are simultaneously estimated
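A minimal sketch of the two-step “status model” aggregation just described, with hypothetical scores and nesting labels (nothing here comes from the study’s data):

```python
import pandas as pd

# Hypothetical IRT-estimated student scores with nesting labels
df = pd.DataFrame({
    "score":   [0.4, 1.1, -0.2, 0.9, 0.3, -0.5],
    "teacher": ["t1", "t1", "t2", "t2", "t3", "t3"],
    "school":  ["s1", "s1", "s1", "s1", "s2", "s2"],
})

# Step 1: teacher effect = mean score of the teacher's students
teacher_effect = df.groupby("teacher")["score"].mean()

# Step 2: school effect = mean of the teacher effects within each school
school_of_teacher = df.drop_duplicates("teacher").set_index("teacher")["school"]
school_effect = teacher_effect.groupby(school_of_teacher).mean()

print(teacher_effect)
print(school_effect)
```

Note that this averaging treats each estimated student score as error-free, which is precisely the limitation the simultaneous 4-level estimation is meant to avoid.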

  30. Findings • Except for RMSE in teacher effect parameter estimation, the 4-level 3PL IRT model performs significantly better than the 2-level 3PL IRT model • Especially noticeable is the considerable improvement in teacher effect parameter estimation

  31. Improvement of Teacher Effect Estimates • The improvement is especially noticeable when teacher effects and school effects are of medium size • The improvement diminishes as teacher effects and school effects decrease

  32. Further improvement • As the change score is ultimately what several value-added models use to evaluate teacher and school effects, we explored direct estimation of the change score by including prior achievement scores in the IRT model • An IRT model formulation for the growth score is presented and model parameter estimation is explored • A multilevel formulation is presented • A mixture IRT version including the growth score is presented and model parameter estimation is discussed

  33. Possible models • Rasch model with direct modeling of a growth parameter • Multilevel Rasch model with direct modeling of a growth parameter (one possible parameterization is sketched below)
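The formulation is not shown on the slide; one plausible parameterization, with all symbols assumed here, adds a person-specific growth parameter $\delta_j$ that shifts ability at the second measurement occasion:

```latex
\operatorname{logit} P(y_{ijt} = 1) = \theta_j + d_t\,\delta_j - b_i,
\qquad d_1 = 0,\; d_2 = 1,
```

so $\theta_j$ is ability at time 1, $\theta_j + \delta_j$ is ability at time 2, and $\delta_j$ is the directly modeled growth (change) score; a multilevel version would further decompose $\theta_j$ and $\delta_j$ over teacher and school levels.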

  34. Possible models • Multilevel Rasch mixture model with direct modeling of a growth parameter, with no latent classes at the teacher and school levels • Multilevel Rasch mixture model with direct modeling of a growth parameter, with latent classes at the teacher and school levels (a mixture extension is sketched below)
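A plausible mixture extension of the growth parameterization above (again, notation assumed rather than taken from the slides) lets item difficulties and growth differ across latent classes $g$ with mixing proportions $\pi_g$:

```latex
P(y_{ijt} = 1) = \sum_{g=1}^{G} \pi_g\,
\frac{\exp\bigl(\theta_{jg} + d_t\,\delta_{jg} - b_{ig}\bigr)}
     {1 + \exp\bigl(\theta_{jg} + d_t\,\delta_{jg} - b_{ig}\bigr)},
```

and the final model listed on the slide would additionally place latent classes at the teacher and school levels.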

  35. Simulation Study • 30 items and 1,000 examinees were simulated (a generating sketch follows)
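A minimal numpy sketch of generating data at this design size under the Rasch growth formulation above; the generating values are illustrative, not those of the study:

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, n_persons = 30, 1000

b = rng.normal(0, 1, n_items)            # item difficulties
theta = rng.normal(0, 1, n_persons)      # ability at time 1
delta = rng.normal(0.5, 0.4, n_persons)  # person-specific growth

def simulate(ability):
    """Rasch response matrix: persons in rows, items in columns."""
    logits = ability[:, None] - b[None, :]
    p = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random((n_persons, n_items)) < p).astype(int)

y_time1 = simulate(theta)          # 1000 x 30 responses at time 1
y_time2 = simulate(theta + delta)  # same persons after growth
print(y_time1.mean(), y_time2.mean())  # time-2 proportion correct is higher
```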

  36. Model Parameter Estimation • Using the Markov chain Monte Carlo (MCMC) method implemented in OpenBUGS 3.0.7 • Priors: (not reproduced in the transcript)
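The OpenBUGS model code is not shown in the transcript. Purely to illustrate the MCMC machinery, here is a toy random-walk Metropolis sampler for the person parameters of the simulated data above (it reuses `b`, `n_persons`, `rng`, and `y_time1` from that sketch), holding item difficulties fixed at their generating values; this is a simplification, since the study estimates all parameters jointly:

```python
def log_post(ability, y):
    """Per-person Rasch log-likelihood plus a N(0, 1) prior on ability."""
    logits = ability[:, None] - b[None, :]
    loglik = (y * logits - np.log1p(np.exp(logits))).sum(axis=1)
    return loglik - 0.5 * ability ** 2

def metropolis(y, n_iter=2000, step=0.3):
    """Random-walk Metropolis; persons are independent given fixed items."""
    ability = np.zeros(n_persons)
    lp = log_post(ability, y)
    draws = np.empty((n_iter, n_persons))
    for it in range(n_iter):
        proposal = ability + step * rng.normal(size=n_persons)
        lp_prop = log_post(proposal, y)
        accept = np.log(rng.random(n_persons)) < lp_prop - lp
        ability = np.where(accept, proposal, ability)
        lp = np.where(accept, lp_prop, lp)
        draws[it] = ability
    return draws

chain = metropolis(y_time1)  # shape: (2000 draws, 1000 persons)
```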

  37. MCMC runs • Two chains were used • Initial values were generated by the program

  38. Convergence Check • Multiple criteria were used to check convergence • The required number of iterations to reach equilibrium varied across models • Burn-in: 40,000 iterations • Model parameter inferences were based on 10,000 monitoring iterations per chain, for a total of 20,000 samples
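One common convergence criterion is the Gelman-Rubin potential scale reduction factor; below is a minimal implementation for chains of equal length, continuing the toy sampler above (the study’s own checks would have relied on OpenBUGS’s built-in diagnostics):

```python
def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.
    chains: array of shape (n_chains, n_draws)."""
    n = chains.shape[1]
    w = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    b_var = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    var_hat = (n - 1) / n * w + b_var / n
    return np.sqrt(var_hat / w)

# Two chains for person 0, discarding the first half as burn-in
chain_a = metropolis(y_time1)[1000:, 0]
chain_b = metropolis(y_time1)[1000:, 0]
rhat = gelman_rubin(np.vstack([chain_a, chain_b]))
print(f"R-hat for person 0: {rhat:.3f}")  # values near 1.0 suggest convergence
```

For a real check, the chains would start from dispersed initial values, as the slide’s “initial values were generated by the program” implies, rather than from a common point.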

  39. Growth parameter estimates (results table not reproduced in the transcript)

  40. Correlations: growth parameter estimates (results table not reproduced in the transcript)

  41. Future Research • A multilevel IRT model for direct estimation of the growth change scores • A mixture multilevel IRT model for direct estimation of the growth change scores • A constrained version of the model is possible by restricting the growth change scores to non-negative values (two common parameterizations are sketched below) • Extensions to other IRT models such as 2PL, 3PL-c, 3PL-d, and 4P IRT models, and the mixture versions of those models • Replications and additional simulated study conditions • Model fit indices for selecting among competing models should be investigated under more extensive study conditions
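Neither constraint mechanism appears on the slide; two common ways to enforce non-negative growth, written in the assumed notation above, are a log-normal reparameterization or a truncated-normal prior:

```latex
\delta_j = \exp(\eta_j), \quad \eta_j \sim N(\mu_\eta, \sigma_\eta^2)
\qquad \text{or} \qquad
\delta_j \sim N^{+}\!\bigl(\mu_\delta, \sigma_\delta^2\bigr),
```

where $N^{+}$ denotes a normal distribution truncated below at zero.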

  42. Thank you!
