
Beyond p-Values: Characterizing Education Intervention Effects in Meaningful Ways

This presentation discusses the limitations of using p-values to characterize the effects of education interventions and reviews alternative approaches for describing those effects in meaningful ways.

Presentation Transcript


  1. Beyond p-Values: Characterizing Education Intervention Effects in Meaningful Ways. Mark Lipsey, Kelly Puzio, Cathy Yun, Michael Hebert, Kasia Steinka-Fry, Mikel Cole, Megan Roberts, Karen Anthony, Matthew Busick (Vanderbilt University), with Howard Bloom, Carolyn Hill, & Alison Black. IES Research Conference, Washington, DC, June 2010

  2. Intervention research model. Compare a treatment (T) sample with a control (C) sample on an education outcome measure. Description of the intervention effect that results from this comparison: means on the outcome measure for the T and C samples; the difference between those means; a p-value for the statistical significance of the difference between means

  3. Problem to be addressed. The native statistical findings that represent the effect of an intervention on an education outcome often provide little insight into the nature, magnitude, or practical significance of the effect. Practitioners, policymakers, and even researchers have difficulty knowing whether the effects are meaningful.

  4. Example. Intervention: vocabulary-building program. Samples: fifth graders receiving (T) and not receiving (C) the program. Outcome: CAT5 reading achievement test. Mean score for T: 718. Mean score for C: 703. Difference between T and C means: 15 points. p-value: <.05 [Note: not an indicator of the magnitude of the effect!] Questions: Is this a big effect or a trivial one? Do the students read a lot better now, or just a little better? If they were poor readers before, is this a big enough effect to now make them proficient readers? If they were behind their peers, have they now caught up? Someone intimately familiar with CAT5 scoring may be able to look at the means and answer such questions, but most of us haven't a clue.

  5. Two approaches to review here. Descriptive representations of intervention effects: translations of the native statistical results into forms that are more readily understood. Practical significance: assessing the magnitude of intervention effects in relationship to criteria that have recognized value in the context of application.

  6. Useful Descriptive Representations of Intervention Effects

  7. Representation in terms of the original metric. Often inherently meaningful, e.g.: proportion of days a student was absent; number of suspensions or expulsions; proportion of assignments completed. Covariate-adjusted means (to account for baseline differences and attrition). Pretest baselines and differential pre-post change (example on next slide).

  8. Fuller picture with pretest baseline. Middle school students, conflict resolution intervention; surveys at the beginning and end of the school year (self-reported interpersonal aggression). [Figure: Pre-Post Change Differentials that Result in the Same Posttest Difference]

  9. [Figure, continued: Pre-Post Change Differentials that Result in the Same Posttest Difference]

  10. Effect size. Typically the standardized mean difference: ES_d = Δ/σ, i.e., the difference between the T and C means divided by the standard deviation of the outcome.
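To make the definition concrete, here is a minimal sketch (not from the slides) of the standardized mean difference, assuming the SD pooled across the T and C samples as the standardizer; as the next slide notes, the choice of σ is itself a decision.

```python
import numpy as np

def standardized_mean_difference(treatment, control):
    """ES_d = (mean_T - mean_C) / SD, with SD pooled across the two samples."""
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    n_t, n_c = t.size, c.size
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_t - 1) * t.var(ddof=1) + (n_c - 1) * c.var(ddof=1)) / (n_t + n_c - 2)
    return (t.mean() - c.mean()) / np.sqrt(pooled_var)
```

For the slide-4 example (means 718 vs. 703), a hypothetical SD of 40 would give ES_d = 15/40 = 0.375.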

  11. Utility of effect size. Useful for comparing effects across studies with the 'same' outcome measured differently. Somewhat meaningful to researchers, but not very intuitive; provides little insight into the nature and magnitude of an effect, especially for nonresearchers. Often reported in relation to Cohen's guidelines for 'small,' 'medium,' and 'large': a BAD IDEA (see slide 23).

  12. Notes and quirks about ESs. Better with covariate-adjusted means. Don't adjust the variance/SD (that would undercut the concept of standardization). Issue of the variance on which to standardize. Effect sizes standardized on a variance/SD other than that between individuals. Effect sizes from multilevel analysis results (see the sketch below).
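To illustrate the standardizer issue, here is a small sketch; the variance components and treatment effect below are hypothetical, not from the slides. With a two-level model of students within schools, the same treatment effect yields different effect sizes depending on the SD chosen.

```python
import math

# Hypothetical variance components from a two-level (students within schools)
# model, plus a covariate-adjusted treatment effect on the outcome scale
within_var, between_var, delta = 80.0, 20.0, 5.0

d_within = delta / math.sqrt(within_var)                # standardized on within-school SD
d_total = delta / math.sqrt(within_var + between_var)   # standardized on total SD
print(round(d_within, 2), round(d_total, 2))            # 0.56 vs. 0.5
```

Reporting which SD was used is therefore essential when comparing such effect sizes across studies.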

  13. Proportions of T and C samples above or below a threshold score

  14. Cohen U3 overlap index. [Figure: overlapping T and C distributions with means .73 σ apart; 50% of the C sample scores above the C mean vs. 77% of the T sample. Adapted from Redfield & Rousseau, 1981]
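Under the figure's assumptions (normal outcome distributions with equal SDs), U3 is simply the standard normal CDF evaluated at the effect size, as this one-line sketch shows:

```python
from scipy.stats import norm

def u3(d):
    """Cohen's U3: proportion of the T distribution scoring above the C mean,
    assuming normal outcomes with equal SDs in T and C."""
    return norm.cdf(d)

print(u3(0.73))  # ~0.77, the 77% shown on the slide
```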

  15. Rosenthal & Rubin BESD (binomial effect size display). [Figure: BESD table for d = .80]
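A sketch of the BESD arithmetic; the formulas below are the standard Rosenthal & Rubin conversion for equal-sized groups, which the slide's display implies but does not show:

```python
import math

def besd(d):
    """Rosenthal & Rubin's binomial effect size display: convert d to a
    point-biserial r, then to 'success' rates for T and C centered on 50%."""
    r = d / math.sqrt(d ** 2 + 4)
    return 0.5 + r / 2, 0.5 - r / 2

print(besd(0.80))  # ~(0.69, 0.31): 69% of T vs. 31% of C classified as 'successes'
```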

  16. Proportion reaching or exceeding a performance threshold

  17. [Figure, continued: proportion reaching or exceeding a performance threshold]

  18. Options for threshold values • Mean of the control sample (U3) • Grand mean of the combined T and C samples (BESD) • Predefined performance threshold (e.g., NAEP) • Other possibilities: mean of a norming sample (e.g., standard score of 100 on the PPVT); mean of a reference group with a 'gap' (e.g., students who don't qualify for FRPL, majority students); a study-determined threshold (e.g., the score at which teachers see behavior as problematic); a target value (e.g., the achievement gain needed for AYP); any other identifiable score on the measure that has interpretable meaning within the context of the intervention study
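Whatever threshold is chosen, the proportions of the T and C samples at or above it can be estimated from the group means and SD under a normality assumption. A sketch using the slide-4 CAT5 means and a purely hypothetical SD of 40:

```python
from scipy.stats import norm

def pct_at_or_above(mean, sd, threshold):
    """Proportion of a normal(mean, sd) outcome distribution at or above a threshold."""
    return norm.sf((threshold - mean) / sd)

print(pct_at_or_above(718, 40, 703))  # T sample above the C mean: ~0.65
print(pct_at_or_above(703, 40, 703))  # C sample above its own mean: 0.50
```

With real data, one can of course simply count the students at or above the threshold rather than rely on normality.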

  19. Conversion to grade equivalent (and age equivalent) scores. [Figure: Mean Reading Grade Equivalent (GE) Scores of Success for All (SFA) and Control Samples, from Slavin et al., 1996]

  20. Characteristics and quirks of grade equivalent scores. Provided (or not) by the test developer [note: could be constructed by the researcher for the context of an intervention study]. Vary from X.0 to X.9 over the 9-month school year. Not criterion-referenced; estimated from an empirical norming sample. Imputed where norming data are thin, especially for students outside the grade range. Nonlinear relationship to test scores: a given GE difference corresponds to a larger score difference in early grades than in later grades, but there is greater within-grade variation in later grades.
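As an illustration of the bracketed note above (a researcher constructing GE scores), one can interpolate observed scale scores against a norming table. The table below is hypothetical; its shrinking score gaps across grades reflect the nonlinearity the last point describes.

```python
import numpy as np

# Hypothetical norming table: median scale score at each grade placement
grade_placement = np.array([3.0, 4.0, 5.0, 6.0])
median_score = np.array([680.0, 700.0, 712.0, 720.0])

def grade_equivalent(score):
    """Linearly interpolate a scale score to a grade-equivalent score."""
    return float(np.interp(score, median_score, grade_placement))

print(grade_equivalent(706.0))  # 4.5
```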

  21. Practical Significance: Criterion Frameworks for Assessing the Magnitude of Intervention Effects

  22. Practical significance must be judged in reference to some external standard relevant to the intervention context • E.g., compare effect found in study with: • Effects others have found on similar measures with similar interventions • Normative expectations for change • Policy-relevant performance gaps • Intervention costs (not discussed here)

  23. Cohen's rules of thumb for interpreting effect size: normative but overly broad. • Cohen: small = 0.20 σ, medium = 0.50 σ, large = 0.80 σ [Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Hillsdale, NJ: Lawrence Erlbaum] • Lipsey: small = 0.15 σ, medium = 0.45 σ, large = 0.90 σ [Lipsey, Mark W. (1990). Design Sensitivity: Statistical Power for Experimental Research. Newbury Park, CA: Sage Publications]

  24. Effect sizes for achievement from random assignment studies of education interventions • 124 random assignment studies • 181 independent subject samples • 831 effect size estimates

  25. Achievement effect sizes by grade level and type of achievement test

  26. [Figure, continued: achievement effect sizes by grade level and type of achievement test]

  27. Achievement effect sizes by target recipients

  28. Normative expectations for change: estimating annual gains in effect size from national norming samples for standardized tests • Up to seven tests were used for reading, math, science, and social science • The mean and standard deviation of scale scores for each grade were obtained from test manuals • The standardized mean difference across succeeding grades was computed • These results were averaged across tests and weighted according to Hedges (1982)
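A minimal sketch of the per-transition computation; the scale-score norms below are hypothetical, and the actual work averaged results across up to seven tests with Hedges-style weighting.

```python
import math

def transition_effect_size(mean_lo, sd_lo, mean_hi, sd_hi):
    """Annual-gain effect size: the mean difference between adjacent grades'
    norming samples, standardized on their pooled SD."""
    pooled_sd = math.sqrt((sd_lo ** 2 + sd_hi ** 2) / 2)
    return (mean_hi - mean_lo) / pooled_sd

# Hypothetical grade-4 and grade-5 reading scale-score norms
print(transition_effect_size(695, 38, 710, 40))  # ~0.38, near the 4-5 entry on the next slide
```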

  29. Annual reading growth

  Grade transition    Annual growth (effect size)
  K - 1               1.52
  1 - 2               0.97
  2 - 3               0.60
  3 - 4               0.36
  4 - 5               0.40
  5 - 6               0.32
  6 - 7               0.23
  7 - 8               0.26
  8 - 9               0.24
  9 - 10              0.19
  10 - 11             0.19
  11 - 12             0.06

  Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates MacGinitie, MAT8, Terra Nova CAT, and SAT10.

  30. Policy-relevant demographic performance gaps • Effectiveness of interventions can be judged relative to the sizes of existing gaps across demographic groups • Effect size gaps for groups may vary across grades, years, tests, and districts

  31. Policy-relevant performance gaps between “average” and “weak” schools Main idea: • What is the performance gap (in effect size) for the same types of students in different schools? Approach: • Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status. • Infer performance gap (in effect size) between schools at different percentiles of the performance distribution
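A hedged sketch of that approach in Python; the column names, model specification, and choice of percentiles below are illustrative assumptions, not necessarily the authors' exact model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def school_gap_effect_size(df: pd.DataFrame) -> float:
    """Regress the outcome on the student characteristics listed on the slide
    plus school indicators, then express the gap between the 90th- and
    50th-percentile schools (after adjustment) in SD units of the outcome."""
    fit = smf.ols(
        "score ~ prior_score + female + overage + frpl + C(race) + C(school_id)",
        data=df,
    ).fit()
    school_effects = fit.params.filter(like="C(school_id)")  # adjusted school contrasts
    gap = np.percentile(school_effects, 90) - np.percentile(school_effects, 50)
    return gap / df["score"].std(ddof=1)
```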

  32. In conclusion … • The native statistical form for intervention effects provides little understanding of their nature or magnitude • Translating the effects into a more descriptive and intuitive form makes them easier for practitioners, policymakers, and researchers to understand and assess • There are a number of easily applied translations that could be routinely used in reporting intervention effects • The practical significance of those effects, however, requires that they be compared with some criterion meaningful in the intervention context • Assessing practical significance is more difficult, but there are a number of approaches that may be appropriate depending on the intervention and outcome construct
