AEA Meeting, October 17, 2013 Assessment in Higher Education TIG Do your results really say what you think they say? Issues of reliability and validity in evaluation measuring instruments Krista S. Schumacher, PhD student & Program Evaluator Oklahoma State University | JCCI Resource Development Services
Key Issue “Unfortunately, many readers and researchers fail to realize that no matter how profound the theoretical formulations, how sophisticated the design, and how elegant the analytic techniques, they cannot compensate for poor measures” (Pedhazur & Pedhazur Schmelkin, 1991).
The Problem Review of 52 educational evaluation studies: 1971 to 1999 (Brandon & Singh, 2009) • None adequately addressed measurement • Lacking in research on practice of evaluation • Literature on validity in evaluation studies ≠ measurement validity (Chen, 2010; Mark, 2011)
The Problem (cont.) Federal emphasis on “scientifically based research” • Experimental design • Quasi-experimental design • Regression discontinuity design, etc. Where is measurement validity? How can programs be compared? How can we justify requests for continued funding?
Program Evaluation Standards: Accuracy Standard A2: Valid Information • “Evaluation information should serve the intended purposes and support valid interpretation” (p. 171). Standard A3: Reliable Information • “Evaluation procedures should yield sufficiently dependable and consistent information for the intended users” (p. 179). (Yarbrough, Shulha, Hopson, & Caruthers, 2011)
Measurement Validity & Reliability Defined Valid Inferences = • Validity • Instrument measures intended construct • Reliability • Instrument consistently measures a construct • But perhaps not the construct • Reliability ≠ Validity • Consistent scores across administrations
Validity Types (basic for evaluation) Face • On its face, instrument seems to measure intended construct • Assessment: Subject matter expert (SME) ratings Content • Items representative of domain of interest • Assessment: SME ratings • Provides no information for validity of inferences about scores Construct • Instrument content reflects intended construct • Assessment: Exploratory factor analysis (EFA), principal components analysis (PCA)
Understanding Construct Validity Pumpkin Pie Example Construct Pie Factors Crust and filling Variables (items) Individual ingredients (Nassif & Khalil, 2006)
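As a rough illustration of the construct-validity check named above, the sketch below extracts the first principal component of an item correlation matrix. In pie terms, it asks how strongly each ingredient (item) loads on one shared factor. The response data, the function names, and the use of power iteration are assumptions made for illustration. They are not taken from the presentation, and a real analysis would normally run a full PCA or EFA in a statistics package.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_component(responses, iterations=200):
    """Return (eigenvalue, loadings) for the first principal component.

    responses: one row per respondent, one column per item.
    Uses power iteration on the item correlation matrix. Starting from a
    uniform vector assumes the items correlate positively with each other.
    """
    items = list(zip(*responses))  # transpose to one tuple per item
    k = len(items)
    corr = [[pearson_r(items[i], items[j]) for j in range(k)] for i in range(k)]

    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Rayleigh quotient gives the eigenvalue for the unit-length vector v.
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)
    )
    # Loadings are eigenvector entries scaled by the square root of the eigenvalue.
    loadings = [math.sqrt(eigenvalue) * x for x in v]
    return eigenvalue, loadings


# Hypothetical responses: six respondents, four Likert-type items.
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [4, 4, 4, 5],
]

eigenvalue, loadings = first_component(responses)
share = eigenvalue / len(loadings)  # proportion of total item variance
print(f"First eigenvalue: {eigenvalue:.2f} ({share:.0%} of variance)")
print("Item loadings:", [round(value, 2) for value in loadings])
```

If all items load strongly on one component, that is consistent with a single underlying construct. Several components with eigenvalues above 1 would suggest multiple factors, like crust and filling.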
Validity Types (more advanced) Criterion • Establishes relationship or discrimination • Assessment: Correlation of scores with other test or with outcome variable • Types of criterion validity evidence • Concurrent validity • Positive correlation with scores from another instrument measuring same construct • Discriminant validity • Negative correlation with scores from another instrument measuring opposite construct; comparing scores from different groups • Predictive validity • Positive correlation of scores with the criterion variable the test is intended to predict • E.g., SAT scores and undergraduate GPA
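The correlation-based checks listed above can be sketched in a few lines. All scores below are invented for illustration, and the variable names (`new_scale`, `established_scale`, `later_outcome`) are hypothetical. The sketch shows only the arithmetic; it is not how any cited study did its analysis.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for eight respondents.
new_scale = [12, 15, 9, 20, 17, 11, 14, 18]          # evaluator-developed instrument
established_scale = [30, 34, 25, 41, 38, 27, 33, 40]  # tested instrument, same construct
later_outcome = [2.6, 3.0, 2.5, 3.8, 3.1, 2.4, 2.9, 3.6]  # criterion measured later

# Concurrent evidence: correlation with an instrument measuring the same construct.
concurrent_r = pearson_r(new_scale, established_scale)

# Predictive evidence: correlation with the outcome the scores should predict.
predictive_r = pearson_r(new_scale, later_outcome)

print(f"Concurrent r = {concurrent_r:.2f}")
print(f"Predictive r = {predictive_r:.2f}")
```

A strong positive coefficient in either case counts as supporting evidence, not proof. With samples this small, the estimate would be too unstable to report.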
Reliability (basic for evaluation) Measure of error (or results due to chance) Internal Consistency Reliability (one type of reliability) • Cronbach’s coefficient alpha (most common) • Correlation coefficient: • +1 = high reliability, no error • 0 = no reliability, high error • ≥ .70 desired (Nunnally, 1978) • Not a measure of dimensionality • If multiple scales (or factors), compute alpha for each scale
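For readers who want to see how coefficient alpha is computed, here is a minimal sketch using the standard formula: alpha = k / (k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The response data and the function name `cronbach_alpha` are assumptions for illustration only.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's coefficient alpha for one scale.

    responses: one row per respondent, one column per item, with every
    item scored in the same direction.
    """
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: six respondents, four Likert-type items on one scale.
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [4, 4, 4, 5],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
print("Meets the .70 guideline" if alpha >= 0.70 else "Below the .70 guideline")
```

When an instrument has several scales, call the function once per scale using only that scale's item columns. Computed this way, alpha can fall below 0 when items covary negatively, which typically points to a reverse-scored item that was not recoded.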
Psychometrically Tested Instrument in Evaluation: Example Middle Schoolers Out to Save the World (Tyler-Wood, Knezek, & Christensen, 2010) • $1.6 million • NSF Innovative Technology Experiences for Students and Teachers (ITEST) • STEM attitudes & career interest surveys Process • Adapted existing psychometrically tested instruments • Instrument development discussed • Validity and reliability evidence included • Instruments published in article
Middle Schoolers Out to Save the World: Validity & Reliability Content validity • Subject matter experts • Teachers; advisory board members Construct validity • Principal components analysis Criterion-related validity • Concurrent: Correlated scores with other instruments tested for validity and reliability • Discriminant: Compared scores among varying groups (e.g., 6th graders vs. ITEST PIs)
Middle Schoolers Out to Save the World: Reliability Internal Consistency Reliabilities for Career Interest Scales
Evaluations Lacking Instrument Validity & Reliability • Six evaluations reviewed • Approx. $9 million in federal funding • NSF programs: • STEM Talent Expansion Program (STEP) • Innovative Technology Experiences for Students and Teachers (ITEST) • Research in Disabilities Education • All used evaluator-developed instruments
Purpose of Sample Evaluation Instruments Instruments intended to measure: • Attitudes toward science, technology, engineering & math (STEM) • Anxiety related to STEM education • Interest in STEM careers • Confidence regarding success in STEM major • Program satisfaction
Measurement Fatal Flaws in Sample Evaluations Failed to: • Discuss process of instrument development • How were items developed? • Were they reviewed by anyone other than evaluators? • Report reliability or validity information • Evaluations that included existing instruments did not report results of psychometric testing • One used different instruments for pre/post tests • How can claims of increases or decreases be made when different items are used?
Reported Findings of Sample Evaluations • IEP students less likely than non-IEP peers to be interested in STEM fields (Lam et al., 2008) • Freshman seminar increased perceived readiness for following semester (Raines, 2012) • Residential program increased STEM attitudes and career interests (Lenaburg et al., 2012) • Participants satisfied with program (Russomanno et al., 2010) • Increased perceived self-competence re: information technology (IT) (Hayden et al., 2011) • Improved perceptions of IT professionals among high school faculty (Forssen et al., 2011)
Implications for Evaluation • Funding and other program decisions • Findings based on valid and reliable data provide strong justifications • Use existing (tested) instruments when possible • Assessment Tools in Informal Science • http://www.pearweb.org/atis/dashboard/index • Buros Center for Testing (Mental Measurements Yearbook) • http://buros.org/ • For newly created instruments • Discuss process of instrument creation • Report evidence of validity and reliability
Conclusion No more missing pieces • Measurement deserves a place of priority Continually ask... • Are the data trustworthy? • Are my conclusions justifiable? • How do we know these results really say what we think they say?
References
Brandon, P. R., & Singh, J. M. (2009). The strength of the methodological warrants for the findings of research on program evaluation use. American Journal of Evaluation, 30(2), 123-157.
Chen, H. T. (2010). The bottom-up approach to integrative validity: A new perspective for program evaluation. Evaluation and Program Planning, 33, 205-214.
Forssen, A., Lauriski-Karriker, T., Harriger, A., & Moskal, B. (2011). Surprising Possibilities Imagined and Realized through Information Technology: Encouraging high school girls' interests in information technology. Journal of STEM Education: Innovations and Research, 12(5/6), 46-57.
Hayden, K., Ouyang, Y., Scinski, L., Olszewski, B., & Bielefeldt, T. (2011). Increasing student interest and attitudes in STEM: Professional development and activities to engage and inspire learners. Contemporary Issues in Technology and Teacher Education, 11(1), 47-69.
Lam, P., Doverspike, D., Zhao, J., Zhe, J., & Menzemer, C. (2008). An evaluation of a STEM program for middle school students on learning disability related IEPs. Journal of STEM Education: Innovations and Research, 9(1/2), 21-29.
Lenaburg, L., Aguirre, O., Goodchild, F., & Kuhn, J.-U. (2012). Expanding Pathways: A Summer Bridge Program for Community College STEM Students. Community College Journal of Research and Practice, 36(3), 153-168.
Mark, M. M. (2011). New (and old) directions for validity concerning generalizability. New Directions for Evaluation, 2011(130), 31-42.
Nassif, N., & Khalil, Y. (2006). Making a pie as a metaphor for teaching scale validity and reliability. American Journal of Evaluation, 27(3), 393-398.
Nunnally, J. (1978). Psychometric theory. New York, NY: McGraw-Hill.
Pedhazur, E. J., & Pedhazur Schmelkin, L. (1991). Measurement, design, and analysis: An integrated approach. New York, NY: Psychology Press.
Raines, J. M. (2012). FirstSTEP: A preliminary review of the effects of a summer bridge program on pre-college STEM majors. Journal of STEM Education: Innovations and Research, 13(1).
Russomanno, D., Best, R., Ivey, S., Haddock, J. R., Franceschetti, D., & Hairston, R. J. (2010). MemphiSTEP: A STEM Talent Expansion Program at the University of Memphis. Journal of STEM Education: Innovations and Research, 11(1/2), 69-81.
Tyler-Wood, T., Knezek, G., & Christensen, R. (2010). Instruments for assessing interest in STEM content and careers. Journal of Technology and Teacher Education, 18(2), 341-363.
Yarbrough, D. B., Shulha, L. M., Hopson, R. K., & Caruthers, F. A. (Eds.). (2011). The program evaluation standards: A guide for evaluators and evaluation users (3rd ed.). Thousand Oaks, CA: Sage.
Contact Information JCCI Resource Development Services http://www.jccionline.com BECO Building West 5410 Edson Lane - Suite 210B Rockville, MD 20852 Jennifer Kerns, President 301-468-1851 | email@example.com Krista S. Schumacher, Associate 918-284-7276 | firstname.lastname@example.org