Statistical Considerations for Educational Screening & Diagnostic Assessments A discussion of methodological approaches that have long existed in the literature and are well used in other disciplines, but are only now emerging in education Yaacov Petscher, Ph.D. Florida Center for Reading Research Florida State University
Discussion Points • Assessment Assumptions • Contexts of Assessments • Statistical Considerations • Reliability • Validity • Benchmarking • “Disclaimer” • Focusing on Breadth not Depth • Based on applied contract and grant research • One slide of equations
Assumptions of Assessment - Researcher • Constructs exist, but we can't see them • Constructs can be measured • Although we can measure constructs, our measurement is not perfect • There are different ways to measure any given construct • All assessment procedures have strengths and limitations
Assumptions of Assessment - Practitioner • Multiple sources of information should be part of the assessment process • Performance on tests can be generalized to non-test behaviors • Assessment can provide information that helps educators make better educational decisions • Assessment can be conducted in a fair manner • Testing and assessment can benefit our educational institutions and society as a whole
Contexts of Assessments • Instructional • Formative • Interim • Summative • Research • Individual Differences • Group Differences (RCT) • Growth • Legislative Initiatives • NCLB • Reading First • Race to the Top • Common Core
Within Common Core • USDOE • PARCC Assessments • Smarter Balanced Assessments • Reading for Understanding Assessments • I3 Assessments • Private Sector
Statistical Considerations - Reliability • Stability, accuracy, or consistency of test scores • Many types • Internal consistency • Retest • Parallel-form • Split-half • Should not be viewed as interchangeable • One could have very high stability but very poor internal consistency (e.g., a "test" asking date of birth, height, and SSN is perfectly stable over time yet internally inconsistent as a scale); a minimal internal-consistency computation is sketched below
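As a concrete illustration of one of these reliability types, here is a minimal sketch of Cronbach's alpha (internal consistency) computed from an item-score matrix in Python; the data, seed, and item count are invented for illustration.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    k = items.shape[1]                            # number of items
    item_vars = items.var(axis=0, ddof=1)         # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented data: 100 examinees answering 5 items driven by one common ability
rng = np.random.default_rng(7)
ability = rng.normal(size=(100, 1))
scores = (ability + rng.normal(size=(100, 5)) > 0).astype(float)
print(f"alpha = {cronbach_alpha(scores):.2f}")
```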
Statistical Considerations - Reliability • Most frequently used framework is classical test theory (CTT), which decomposes the observed score as X = T + e (true score plus error) • What does this assume?
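For reference, the CTT decomposition and its definition of reliability; the standard CTT assumptions (errors have mean zero, are uncorrelated with true scores, and have constant variance across persons) are what the question above points at:

```latex
X = T + e, \qquad
\rho_{XX'} \;=\; \frac{\sigma_T^2}{\sigma_X^2}
           \;=\; \frac{\sigma_T^2}{\sigma_T^2 + \sigma_e^2}
```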
Benefits of IRT • Puts persons and items on the same scale • CTT looks at total score by p-value (difficulty) • Can result in shorter tests • CTT reliability increases only with more items • Can estimate the precision of scores at the individual level • CTT assumes error is the same for every examinee
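A minimal sketch of the 1PL (Rasch) model behind these points: person ability and item difficulty sit on the same theta scale, and test information (hence the standard error) varies by person, unlike CTT's constant error. The difficulties and ability values below are invented for illustration.

```python
import numpy as np

def rasch_prob(theta: float, b: np.ndarray) -> np.ndarray:
    """P(correct) under the 1PL model: persons and items share the theta scale."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def test_information(theta: float, b: np.ndarray) -> float:
    """Test information at theta; SE(theta) = 1 / sqrt(information)."""
    p = rasch_prob(theta, b)
    return float(np.sum(p * (1 - p)))

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])   # hypothetical item difficulties
for theta in (-2.0, 0.0, 2.0):
    info = test_information(theta, b)
    print(f"theta={theta:+.1f}  info={info:.2f}  SE={1/np.sqrt(info):.2f}")
```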
Statistical Considerations - Reliability • While precision improves on the idea of reliability, can precision itself be improved further? • Account for context effects (Wainer et al., 2000) • Petscher & Foorman, 2011 • Account for time (Verhelst, Verstralen, & Jansen, 1997) • Prindle, Petscher, & Mitchell, 2013
Statistical Considerations - Reliability • Context effects • Any influence or interpretation that an item may acquire as a result of its relationship to other items • A greater problem in computer-adaptive testing (CAT), because each examinee receives a unique sequence of items • Emerges at both the item and the passage level
Statistical Considerations - Reliability [Figure: several items linked to a common stimulus, e.g., a shared reading passage]
Statistical Considerations - Reliability “If several questions within a test a test are experimentally linked so that the reaction to one question influences the reaction to another, the entire group of questions should be treated preferably as an ‘item’ when the data arising from application of split-half or appropriate analysis-of-variance methods are reported in the test manual” APA Standards of Educational and Psychological Testing (1966)
Simulations are all well and good… How does accounting for item dependency improve testing in the real world?
RCT • N ≈ 800, randomly assigned to testing condition • Control condition used the current 2PL scoring • Experimental condition used an unrestricted bifactor model (a toy loading structure is sketched below) • Evaluate • Precision • # of passages • Prediction of state achievement
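To make "unrestricted bifactor" concrete: every item loads on a general factor, and items sharing a passage also load on a passage-specific factor, with all factors orthogonal. A minimal sketch of that loading structure and the implied covariance; the six items, two passages, and loading values are all invented for illustration.

```python
import numpy as np

# 6 hypothetical items, 2 passages of 3 items each.
# Column 0: general reading factor; columns 1-2: passage-specific factors.
Lambda = np.array([
    [0.7, 0.4, 0.0],
    [0.6, 0.5, 0.0],
    [0.8, 0.3, 0.0],
    [0.7, 0.0, 0.4],
    [0.6, 0.0, 0.5],
    [0.8, 0.0, 0.3],
])
# Bifactor factors are orthogonal, so Phi = I and drops out.
Theta = np.diag(1 - (Lambda ** 2).sum(axis=1))   # unique variances (unit-variance items)
Sigma = Lambda @ Lambda.T + Theta                # model-implied covariance
print(np.round(Sigma, 2))
```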
What this suggests • “Newer” models help us to more appropriately model the data • Precision/reliability are improved just by modeling the context effect • Modeling the item dependency improves the efficiency and precision of a computer-adaptive test
Accounting for Time • Somewhat similar to the item dependency model • IRT models are concerned with accuracy • What about fluency? • CBM (DIBELS, AIMSweb, easyCBM) • Brief assessments (TOWRE, TOSREC, etc.) • Prindle, Petscher, & Mitchell (2013) • N = 200 • Word knowledge test • Limited to 60 sec • Compared a 1PL model with a 1PL response-time model
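A minimal sketch of what a 1PL response-time model can look like, pairing 1PL accuracy with a lognormal response-time component in the spirit of van der Linden's hierarchical speed-accuracy framework; the deck does not specify the exact parameterization used by Verhelst et al. (1997) or Prindle et al. (2013), and all parameter values below are invented.

```python
import numpy as np

def loglik_1pl_rt(theta, tau, b, beta, y, log_t, sigma=0.5):
    """
    Joint log-likelihood for one examinee under a 1PL accuracy model plus a
    lognormal response-time model (one common parameterization; the specific
    model in the talk may differ).
      theta: ability      tau: speed
      b: item difficulties   beta: item time intensities
      y: 0/1 responses       log_t: log response times
    """
    p = 1.0 / (1.0 + np.exp(-(theta - b)))           # 1PL success probability
    ll_acc = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    mu = beta - tau                                  # expected log response time
    ll_rt = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (log_t - mu) ** 2 / (2 * sigma**2))
    return ll_acc + ll_rt

# Hypothetical 3-item illustration
b = np.array([-0.5, 0.0, 0.8]); beta = np.array([1.0, 1.2, 1.5])
y = np.array([1, 1, 0]); log_t = np.log(np.array([2.5, 3.0, 5.0]))
print(round(loglik_1pl_rt(theta=0.4, tau=0.2, b=b, beta=beta, y=y, log_t=log_t), 3))
```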
Results • 1PL marginal α = .80 • 1PL-RT marginal α = .87
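To see why this gain matters in practice, convert each reliability to a CTT standard error of measurement, SEM = SD √(1 − α); with a hypothetical score SD of 15 (the actual SD is not reported here):

```latex
\text{SEM}_{\alpha=.80} = 15\sqrt{1 - .80} \approx 6.7,
\qquad
\text{SEM}_{\alpha=.87} = 15\sqrt{1 - .87} \approx 5.4
```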
What this suggests • Accounting for the response time of items can improve precision for most participants • Limitations • More difficult to do with younger children • Requires computer delivery to record accuracy and time • Cannot be done with connected text
Statistical Considerations – Factor Validity • Assessments are measures of hypothetical constructs • Assessments are measured with error • Use latent variables to leverage the common variance across measures • How is this modeled? • Unidimensional • Multidimensional • Three illustrations • Petscher & Foorman, 2012 (Syntactic Awareness) • Kieffer & Petscher, 2013 (Morphology/Vocabulary) • Justice, Petscher, & Pentimonti, 2013 (Early Literacy)
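In common-factor terms, the unidimensional/multidimensional choice is a choice about the loading matrix Λ: one column for a unidimensional model, several columns (with factor covariances Φ) for a multidimensional one. The standard measurement model and its implied covariance:

```latex
\mathbf{x} = \boldsymbol{\Lambda}\,\boldsymbol{\eta} + \boldsymbol{\varepsilon},
\qquad
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\,\boldsymbol{\Phi}\,\boldsymbol{\Lambda}^{\top}
                      + \boldsymbol{\Theta}_{\varepsilon}
```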
Morphological Awareness (MA) predicts Reading Comprehension (RC) We have long known that MA is correlated with reading comprehension (e.g., Carlisle, 2000; Freyd & Baron, 1982; Tyler & Nagy, 1990) [Path diagram: MA → RC]