The Types of Program Evaluation
(1) Process evaluation
• Audit and monitoring
• Did the intended policy actually happen?
(2) Impact evaluation
• What effect (if any) did the policy have?
Why Impact Evaluation?
• Knowledge is a global public good
• Long-term credibility
• Helps in choosing the best projects: builds long-term support for development
The evaluation problem and alternative solutions
• Impact is the difference between the relevant outcome indicator with the program and that without it.
• However, we can never simultaneously observe someone in two different states of nature.
• So, while a post-intervention indicator is observed, its value in the absence of the program is not; it is a counterfactual.
Problems when Evaluation is not Built in Ex-Ante
• Need a reliable comparison group
• Before/after: other things may happen over time
• Units with/without the policy: may differ for reasons other than the policy (e.g., because the policy is placed in specific areas)
[Figure sequence: outcome indicator over time, with the intervention marked]
We observe an outcome indicator, and its value rises after the program. However, we need to identify the counterfactual, since only then can we determine the impact of the intervention.
How can we fill in the missing data on the counterfactual?
• Randomization
• Matching
• Propensity-score matching
• Difference-in-difference
• Matched double difference
• Regression Discontinuity Design
• Instrumental variables
1. Randomization
“Randomized out” group reveals the counterfactual.
• Only a random sample participates.
• As long as the assignment is genuinely random, impact is revealed in expectation.
• Randomization is the theoretical ideal, and the benchmark for non-experimental methods. Identification issues are more transparent than with other evaluation techniques.
• But there are problems in practice:
• internal validity: selective non-compliance
• external validity: difficult to extrapolate results from a pilot experiment to the whole population
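A minimal sketch of why randomization works, using simulated data (all variable names and numbers are assumptions, not from the lecture): with genuinely random assignment, the mean outcome of the randomized-out group estimates the counterfactual, so impact is just a difference in means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: random assignment d (1 = program, 0 = randomized out)
# and an outcome y with an assumed true program effect of 2.0.
n = 1000
d = rng.integers(0, 2, size=n)
y = 5.0 + 2.0 * d + rng.normal(0.0, 1.0, size=n)

# Under genuine randomization, the randomized-out group reveals the
# counterfactual, so mean impact is the difference in group means.
impact = y[d == 1].mean() - y[d == 0].mean()

# Standard error of the difference in means (unequal variances).
se = np.sqrt(y[d == 1].var(ddof=1) / (d == 1).sum()
             + y[d == 0].var(ddof=1) / (d == 0).sum())
print(f"estimated impact: {impact:.2f} (s.e. {se:.2f})")
```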
2. Matching
Matched comparators identify the counterfactual.
• Propensity-score matching: match on the basis of the probability of participation.
• Match participants to non-participants from a larger survey.
• The matches are chosen on the basis of similarities in observed characteristics.
• This assumes no selection bias based on unobservable heterogeneity.
• Validity of matching methods depends heavily on data quality.
3. Propensity-score matching (PSM)
Match on the probability of participation.
• Ideally we would match on the entire vector X of observed characteristics. However, this is practically impossible; X could be huge.
• Rosenbaum and Rubin: match on the basis of the propensity score P(X) = Prob(D = 1 | X).
• This assumes that participation is independent of outcomes given X. If there is no bias given X, then there is no bias given P(X).
Steps in score matching:
1: Representative, highly comparable surveys of the non-participants and participants.
2: Pool the two samples and estimate a logit (or probit) model of program participation. Predicted values are the “propensity scores”.
3: Restrict samples to assure common support. Failure of common support is an important source of bias in observational studies (Heckman et al.).
Steps in score matching:
4: For each participant, find a sample of non-participants that have similar propensity scores.
5: Compare the outcome indicators. The difference is the estimate of the gain due to the program for that observation.
6: Calculate the mean of these individual gains to obtain the average overall gain.
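A compact sketch of steps 2-6, assuming a pooled sample with outcome y, participation dummy d, and covariate matrix X (the function name and arrays are hypothetical): it fits a logit for the scores, trims to the common support, and uses one-nearest-neighbour matching with replacement for simplicity.

```python
import numpy as np
import statsmodels.api as sm

def psm_att(y, d, X):
    """Average gain for participants via propensity-score matching (sketch)."""
    # Step 2: logit of participation on covariates; the predicted
    # probabilities are the propensity scores.
    scores = sm.Logit(d, sm.add_constant(X)).fit(disp=0).predict()

    p1, p0 = scores[d == 1], scores[d == 0]
    y1, y0 = y[d == 1], y[d == 0]

    # Step 3: restrict to the common support of the two score distributions.
    lo, hi = max(p1.min(), p0.min()), min(p1.max(), p0.max())
    keep = (p1 >= lo) & (p1 <= hi)

    # Steps 4-5: for each participant, find the non-participant with the
    # nearest score; the outcome difference is that observation's gain.
    gains = [y1[i] - y0[np.argmin(np.abs(p0 - p1[i]))]
             for i in np.where(keep)[0]]

    # Step 6: the mean of the individual gains is the average overall gain.
    return np.mean(gains)
```

In practice one would match on several neighbours or use a kernel, and check balance of X after matching; this sketch keeps only the core logic.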
4. Difference-in-difference (double difference)
Observed changes over time for non-participants provide the counterfactual for participants.
• Collect baseline data on non-participants and (probable) participants before the program.
• Compare with data after the program.
• Subtract the two differences, or use a regression with a dummy variable for participants.
• This allows for selection bias, but the bias must be time-invariant and additive.
[Figure: selection bias]
Diff-in-diff requires that the bias is additive and time-invariant.
Diff-in-diff: if (i) the change over time for the comparison group reveals the counterfactual, and (ii) the baseline is uncontaminated by the program, then the double difference identifies the program's impact.
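A sketch of the regression form of the double difference, using simulated panel data (all names and effect sizes are assumptions): the coefficient on the participant × post interaction is the diff-in-diff impact estimate.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500

# Hypothetical two-round panel: participants start 1.0 higher (time-invariant
# selection bias), everyone gains 0.5 over time, and the program adds 2.0.
part = np.repeat(rng.integers(0, 2, size=n), 2)   # participant dummy
post = np.tile([0, 1], n)                         # survey-round dummy
y = 3.0 + 1.0 * part + 0.5 * post + 2.0 * part * post \
    + rng.normal(0.0, 1.0, 2 * n)

# Regression with participant, period, and their interaction: the
# interaction coefficient is the double-difference estimate, and the
# time-invariant additive bias (the coefficient on part) is differenced out.
X = sm.add_constant(np.column_stack([part, post, part * post]))
fit = sm.OLS(y, X).fit()
print(f"diff-in-diff estimate: {fit.params[3]:.2f}")
```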
5. Matched double difference
Matching helps control for bias in diff-in-diff.
• Score-match participants and non-participants based on observed characteristics in the baseline.
• Then do a double difference.
• This deals with observable heterogeneity in initial conditions that can influence subsequent changes over time.
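A minimal sketch of the combination, under the same hypothetical setup as the PSM sketch above: match on baseline propensity scores, then take the double difference over the matched pairs (the function name and arguments are assumptions).

```python
import numpy as np

def matched_dd(dy1, dy0, p1, p0):
    """Matched double difference (sketch).

    dy1, dy0: outcome *changes* over time for participants / non-participants.
    p1, p0:   baseline propensity scores for the two groups.
    """
    # For each participant, difference its change against the change of the
    # non-participant with the nearest baseline score.
    gains = [dy1[i] - dy0[np.argmin(np.abs(p0 - p1[i]))]
             for i in range(len(p1))]
    return np.mean(gains)   # average double-difference gain
```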
6. Regression Discontinuity Design
• The selection function is discontinuous in an observed score.
• UPP in Indonesia: two similar kecamatan in the same kabupaten with scores within the neighborhood of the cutoff score can be treated differently.
[Figure: selection (0/1) as a step function of the kecamatan score; control below the cutoff, treatment above]
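A minimal sketch of the idea with simulated data (variable names, cutoff, and bandwidth are assumptions): units within a narrow bandwidth on either side of the cutoff are treated as comparable, so the impact estimate is a local difference in means.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical scores: treatment switches on discontinuously at the cutoff.
score = rng.uniform(0.0, 100.0, n)
cutoff = 50.0
treated = (score >= cutoff).astype(float)
y = 1.0 + 0.03 * score + 2.0 * treated + rng.normal(0.0, 1.0, n)

# Compare units within a narrow bandwidth on either side of the cutoff.
h = 5.0
near = np.abs(score - cutoff) <= h
impact = y[near & (treated == 1)].mean() - y[near & (treated == 0)].mean()
print(f"RDD estimate near the cutoff: {impact:.2f}")
```

In applied work one would fit a local regression on each side of the cutoff rather than raw means, to remove the trend in the score within the bandwidth.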
7. Instrumental variables
Identifying exogenous variation using a third variable.
Outcome regression: Y = α + βD + ε, where D = 0,1 indicates our program, and participation is not random.
• The “instrument” (Z) influences participation, but does not affect outcomes given participation (the “exclusion restriction”).
• This identifies the exogenous variation in outcomes due to the program.
Treatment regression: D = γ + δZ + ν.
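A sketch of the two-stage logic with simulated data (all names and numbers are assumptions): the first stage is the treatment regression of D on Z; the second stage reruns the outcome regression with predicted participation, isolating the exogenous variation.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000

# Hypothetical data with selection: an unobservable u raises both
# participation and outcomes, so naive OLS of y on d is biased upward.
z = rng.integers(0, 2, size=n).astype(float)   # instrument (e.g., random encouragement)
u = rng.normal(0.0, 1.0, n)
d = ((0.8 * z + u + rng.normal(0.0, 1.0, n)) > 0.8).astype(float)
y = 1.0 + 2.0 * d + u + rng.normal(0.0, 1.0, n)

# First stage (treatment regression): D on Z gives predicted participation.
d_hat = sm.OLS(d, sm.add_constant(z)).fit().predict()

# Second stage (outcome regression): use only the variation in D driven by Z.
fit = sm.OLS(y, sm.add_constant(d_hat)).fit()
print(f"2SLS estimate of the program effect: {fit.params[1]:.2f}")
```

A proper 2SLS routine would also correct the second-stage standard errors; this sketch only recovers the point estimate.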
Randomization: An example from Mexico
• PROGRESA: grants to poor families, conditional on preventive health care and school attendance for children; paid to women.
• The Mexican government wanted an evaluation; the order of community phase-in was random.
• Results: child illness down 23%; height increased 1-4 cm; 3.4% increase in enrollment.
• After the evaluation: PROGRESA was expanded within Mexico, and similar programs were adopted in other Latin American countries.
Randomization: An example from Kenya
• School-based deworming: treat with a single pill every 6 months, at a cost of 49 cents per student per year.
• 27% of treated students had moderate-to-heavy infection, versus 52% of the comparison group.
• Treatment reduced school absenteeism by 25%, or 7 percentage points.
• Costs only $3 per additional year of school participation.
Lessons from randomized experiments
• Randomized evaluations are often feasible
• They have been conducted successfully
• They are labor-intensive and costly, but no more so than other data collection activities
• Results from randomized evaluations can be quite different from those drawn from retrospective evaluations
• NGOs are well suited to conduct randomized evaluations in collaboration with academics and external funders
Lessons from randomized experiments
While randomization is a powerful tool:
• Internal validity can be questionable if we do not properly allow for selective compliance with the randomized assignment.
• It is not always feasible beyond pilot projects, which raises concerns about external validity.
• Contextual factors influence outcomes; a scaled-up program may work differently.
Matching Method Example: Piped water and child health in rural India
• Is a child less vulnerable to diarrhea if he/she lives in a household with piped water?
• Do children in poor, or poorly educated, households realize the same health gains from piped water as others?
• Does income matter independently of parental education?
The evaluation problem
• There are observable differences between those households with piped water and those without it.
• And these differences probably also matter to child health.
Naïve comparisons can be deceptive
• Common practice: compare villages with piped water, or some other infrastructure facility, and those without.
• Failure to control for differences in village characteristics that influence infrastructure placement can severely bias such comparisons.
Model for the propensity scores for piped water placement in India
• Village variables: agricultural modernization, educational and social infrastructure.
• Household variables: demographics, education, religion, ethnicity, assets, housing conditions, and state dummy variables.
More likely to have piped water if:
• the household lives in a larger village, with a high school, a pucca road, a bus stop, a telephone, a bank, and a market;
• it is not a member of a scheduled tribe;
• it is a Christian household;
• it rents rather than owns the home; this is not a perverse wealth effect, but reflects the fact that rental housing tends to be better equipped;
• it is female-headed;
• it owns more land.
Impacts of piped water on child health
• The results for mean impact indicate that access to piped water significantly reduces diarrhea incidence and duration.
• Disease incidence amongst those with piped water would be 21% higher without it; illness duration would be 29% higher.
Stratifying by income per capita:
• No significant child-health gains amongst the poorest 40% (roughly corresponding to the poor in India).
• Very significant impacts for the upper 60%.
• Without piped water, there would be no difference in infant diarrhea incidence between the poorest quintile and the richest.
When we stratify by both income and education:
• For the poor, the education of female members matters greatly to achieving the child-health benefits from piped water.
• Even in the poorest 40%, women’s schooling is associated with lower incidence and duration of diarrhea among children in households with piped water.
• Women’s education matters much less for upper-income groups.
Lessons on matching methods
• When neither randomization nor a baseline survey is feasible, careful matching to control for observable heterogeneity is crucial.
• This requires good data to capture the factors relevant to participation.
• Look for heterogeneity in impact; the average impact may hide important differences in the characteristics of those who gain or lose from the intervention.
Tracking participants and non-participants over time
1. Single-difference matching can still be contaminated by selection bias: latent heterogeneity in factors relevant to participation.
2. Tracking individuals over time allows a double difference: this eliminates all time-invariant additive selection bias.
3. Combining double difference with matching: this allows us to eliminate observable heterogeneity in factors relevant to subsequent changes over time.
Improving Evaluation Practice
When there is an impact evaluation:
• Build in evaluation ex-ante.
• Make a quality evaluation a primary responsibility of the program's manager.
• Allocate the necessary resources.
• Encourage randomization whenever feasible (education, health, micro-finance, governance; not monetary policy…).
Practical suggestions
• Not every project needs an impact evaluation: select projects in priority areas, where knowledge is needed.
• Take advantage of budget constraints and phase-in.
• Require a pilot project before a large-scale project.
• Finance pilot projects and evaluations with grants.
• Collaborate with others:
• Academics (e.g., the Evaluation Based Policy Fund in the UK)
• NGOs
Evaluation: An Opportunity
Creating hard evidence of success will help:
• spend future resources more effectively
• influence other policymakers
• build public support