Plausible values, plausible transformations. Or, “some of my best friends are economists!” Discussion of Rothstein & von Davier. Andrew Ho, Harvard Graduate School of Education. PIAAC Methodological Seminar, Organisation for Economic Co-operation and Development, Paris, France, June 14, 2019.
Jacob & Rothstein (2016) vs. Braun & von Davier (2017)
Why are these LSAS issues important, now?
• LSAS are Large Scale. In particular, LSAS target population-level inferences and comparisons (across subgroups, states, and countries).
• LSAS are Low Stakes. They are held in high esteem by researchers and the public, and they are not natural targets for political opposition or score inflation.
• LSAS are assessments, not evaluations. They are designed for measurement, but they are (or would be!) natural tools for policy evaluations using current statistical and econometric techniques.
• LSAS are oracular. Few understand how they work or what to do with the available secondary data (plausible values).
Three Essential Questions
• Are currently released plausible values useful for answering causal questions?
  • We can’t always tell, so, no.
• Test score scales are not equal-interval. Is this a problem?
  • No more than many other scales, so, no.
• What should we do about this?
  • For 1, allow select researchers access to item-level data.
  • For 2, assess plausible transformations as a specification check.
Source: https://nces.ed.gov/nationsreportcard/tdw/analysis/summary_proced_biases.aspx
From Braun & von Davier (2017), with my comments:
“Note that the secondary analysis model is typically a subset of the latent regression model used to generate the PVs.” I’m not sure we can assume this for some folks, especially policy analysts.
“However, if variables beyond those in the latent regression are used in a secondary analysis, then biased estimates may result (Mislevy, 1991; Meng, 1994).” Yes, well known.
“On the other hand, since the PV-generating model typically includes as many factors as are available (the ‘kitchen-sink approach’: Graham, 2012), even these additional variables may be effectively included by proxy, to the extent that they are correlated with the variables incorporated in the latent regression.” But to what extent?
From Braun & von Davier (2017), with my comments:
“JR also offer examples of situations where certain school-level characteristics are of interest but were not included in the conditioning model.” Yes, this is the concern.
“In actual practice, this may not be a problem. Such characteristics are either drawn directly from items incorporated in the school questionnaire and are part of the conditioning, or indirectly, through inclusion of a dummy-coded school identifier.” It may not be a problem.
“If particular characteristics that become subsequently available are of interest, then supplementary latent regression models can be run to generate new PVs so as to ensure unbiased estimation.” Yes, why not allow select folks to do that themselves?
Wait, IRT does provide an equal-interval scale, if the model fits the data!
[Figure: item characteristic curves plotting Probability (Correct) against the latent scale (θ), annotated with each item’s slope (discrimination) and threshold (difficulty).]
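The item characteristic curves on this slide can be sketched with the two-parameter logistic (2PL) item response function; the function name and parameter values below are illustrative, not from the talk.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct response
    given latent trait theta, slope (discrimination) a, and
    threshold (difficulty) b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An examinee located exactly at the item's threshold answers
# correctly half the time, regardless of the slope.
print(p_correct(theta=0.5, a=1.2, b=0.5))  # 0.5

# The curve is monotone increasing in theta.
print(p_correct(2.0, a=1.2, b=0.5) > p_correct(-2.0, a=1.2, b=0.5))  # True
```

The slope a controls how sharply the curve rises near the threshold b, which is why the slide labels these two quantities on each curve.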
The θ scale is linear in the probits of correct responses to items. The scale renders normal the underlying response processes of respondents. A logit (log of the odds) approximates a probit up to a scaling constant: Logit ≈ 1.7 × Probit.
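The logit-probit approximation on this slide is easy to check numerically: the logistic curve with scaling constant 1.7 stays within about 0.01 of the normal ogive everywhere. A minimal sketch (function names are mine):

```python
import math

def probit_curve(z):
    """Standard normal CDF (the normal-ogive / probit curve)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_curve(z, D=1.7):
    """Logistic curve with scaling constant D; D = 1.7 makes it
    closely track the normal ogive."""
    return 1.0 / (1.0 + math.exp(-D * z))

# Largest absolute gap between the two curves over a grid of z values.
max_gap = max(abs(probit_curve(z) - logistic_curve(z))
              for z in (i / 100.0 for i in range(-400, 401)))
print(max_gap < 0.01)  # True: the curves never differ by more than 0.01
```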
So IRT’s θ is a scale that is linear in the log of the odds of correct responses to items.*
• This does NOT imply that the resulting scale has universal equal-interval properties.
• The consensus view in educational measurement is that the θ scale from well-fit IRT models is convenient, not cardinal (Ho, 2009; Lord, 1980; Yen, 1986; Zwick, 1992).
• Monotone transformations of the θ scale fit the data equally well.
• But this is true of many equal-interval scales, as Braun & von Davier note.
• The limited equal-interval properties of IRT’s θ make it a good starting point from which to evaluate sensitivity to transformations (e.g., Reardon & Ho, 2015).
*3PL model interpretations are less elegant.
And we know which analyses are scale-sensitive (Ho, 2008; Ho & Haertel, 2006).
• A/B comparisons, whether treatment/control or focal/reference gaps, are not generally scale-sensitive.
• A/B differences, whether interactions, gap trends, or differences-in-differences, are often scale-sensitive.
Gap trends are almost always transformation-reversible (Ho & Haertel, 2006; see Bond’s talk)
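The reversibility point can be shown with a toy example: a monotone transformation can flip the sign of a gap trend (a difference-in-differences) while leaving every within-time A-versus-B comparison intact. The cell values below are assumed numbers chosen purely for illustration.

```python
import math

# Cell scores on the original (theta) scale: group A vs. group B
# at times 1 and 2. Toy numbers for illustration only.
a1, b1 = 1.0, 0.0
a2, b2 = 1.1, 0.2

def gap_trend(f):
    """Difference-in-differences of the A-B gap under score transform f."""
    return (f(a2) - f(b2)) - (f(a1) - f(b1))

identity = lambda x: x
print(gap_trend(identity) < 0)   # True: the gap narrows on the original scale
print(gap_trend(math.exp) > 0)   # True: the gap widens after a monotone transform

# Within-time orderings (A above B) survive any monotone transform:
print(math.exp(a1) > math.exp(b1) and math.exp(a2) > math.exp(b2))  # True
```

This is why A/B comparisons on the previous slide are not generally scale-sensitive, while gap trends often are.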