Steps in the Implementation of an Impact Evaluation. Craig McIntosh, IRPS/UCSD Prepared for The Asia Foundation Evaluation Workshop Singapore, March 25, 2010
A Shift in Best Practice in Evaluation: • Impact evaluation useful for • Validating program design • Adjusting program structure • Communicating to finance ministry & civil society • Cultural shift • From retrospective evaluation • Look back and judge • To prospective evaluation • Decide what you need to learn • Experiment with alternatives • Measure and inform • Adopt better alternatives over time
Prospective Evaluation: • The evaluation is designed in parallel with the assignment of the program • Baseline data can be gathered • The implementation of the program is structured across time and across potential beneficiaries with research in mind. • Example: • Progresa/Oportunidades (México) The point of this way of working is to generate counterfactuals.
Frontloading complexity Randomization & other robust research designs have rapidly been gaining ground in policy circles. • If the goal of policy research is to influence policymakers, the evidence from randomized trials is very straightforward and transparent. Experiments like Progresa in Mexico have had huge policy effects. • While the econometric analysis of randomized trials is completely straightforward, their use front-loads all the complexity onto the research design. Implementation is key! • The difficulty in such evaluations is not in seeing what you’d like to randomize, but in understanding what you will be able to successfully randomize in a given setting, and in building a research design around this. Prospective designs make implementation complicated and analysis simple. Careful thought is needed before you get started.
Getting Underway with a Prospective Evaluation: This presentation focuses on the following issues: • Defining the research question. • Identifying the target population for the intervention. • Thinking through how to select the study sample. • Pulling in all pre-existing data that may be useful to the evaluation. • Building the research team; division of labor & relationships with the field staff. • Defining the features of the program that are amenable to experimental or quasi-experimental investigation. • A general timeline of the baseline, intervention, follow-up, and results dissemination.
Operational Guidance: How to define useful questions • “Interesting” is what the government/implementer wants to know from a policy perspective. • Often much more important to learn how to improve a program than to answer a ‘yes/no’ impact question. • Don’t evaluate programs that you know have flaws; use small-scale pilots to arrive at a ‘best practice’ intervention and evaluate that. • Choose carefully; don’t try to evaluate everything. • Questioning the program helps strengthen its design • Provide answers to questions early • Utilize results, change programs, learn & improve • Serve clients better!
First practical issue: What is the unit of intervention? How many conceptually separate units will be a part of the total operation of the program? • ‘Micro’ (village or individual-level) interventions are much easier to evaluate than ‘macro’ (national government, supply chains, etc.). However: • Is there a component of a macro program that is implemented at the individual level (outreach, training, sensitization, etc)? • Do you influence the way in which individuals interact with macro structures (firms with markets, individuals with politicians, etc.)?
Some of the core issues to think through: • Are the counterfactuals program beneficiaries (as in a program improvement), in which case institutional data on them exists, or not? Data collection on non-beneficiaries is a big extra cost. • Is the program eventually going to be offered to every type of community (in which case we need a random sample for the evaluation)? Or is the beneficiary pool so well defined that we can focus at the outset on one type? • Are we trying to estimate the impact on recipients of the program, or the impact on a population that is offered the program? • This gets at the difference between the Intention to Treat Effect (ITE) and the Treatment Effect on the Treated (TET).
Types of Treatment Effects: There are three types of treatment effects one can estimate. • If the program in question is universal or mandatory: • Average Treatment Effect (ATE): What is the expected effect of treating the average individual? • If the program has eligibility requirements, or is voluntary with <100% uptake, not all individuals will receive the program in its actual implementation. Then, we won’t see the average individual treated and so don’t care about the ATE. We care about: • Intention to Treat Effect (ITE): What is the expected effect on a whole population of extending treatment to the subset that actually get treated? • Treatment Effect on the Treated (TET): What is the expected individual treatment effect on the type of individual who takes the treatment?
How effect type determines research design: • Average Treatment Effect can be directly randomized. • Intention to Treat Effect: • Take a random sample of the entire population. • Conduct the normal selection process within a randomly selected subset of the sample. • ITE given by the difference in outcomes between ‘offered’ and ‘not offered’ random samples of the population. • Treatment Effect on Treated: • Use only the subset of the sample who would have been offered and would have taken the treatment. These are the ‘compliers’. • Randomize treatment within the compliers. • TET given by the difference in outcomes between the treated and untreated compliers. • TET not easily estimated in practice because it requires pre-enrollment of treated units and then randomization. In practice, this means offering access to a lottery.
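The ITE design above can be illustrated with a minimal simulation sketch (all numbers are hypothetical: a 30% complier share and a per-beneficiary effect of 2.0). Randomizing the offer over the whole population and differencing mean outcomes recovers the ITE:

```python
import random

random.seed(0)

# Hypothetical population: 30% are 'compliers' who would take the
# program if offered; the true effect on a treated complier is 2.0.
N = 20000
TRUE_TET = 2.0
UPTAKE = 0.3

def outcome(offered):
    """Outcome for one person: baseline noise plus the treatment
    effect if they are offered AND would actually take the program."""
    complier = random.random() < UPTAKE
    treated = offered and complier
    return random.gauss(0, 1) + (TRUE_TET if treated else 0.0)

# Randomize the offer over the whole population (ITE design),
# then difference the mean outcomes of the two arms.
offered = [outcome(True) for _ in range(N // 2)]
not_offered = [outcome(False) for _ in range(N // 2)]

ite_hat = sum(offered) / len(offered) - sum(not_offered) / len(not_offered)
print(f"estimated ITE: {ite_hat:.2f} (theory: uptake x TET = {UPTAKE * TRUE_TET})")
```

Note how the estimate lands near uptake × effect (0.6 here), not near the per-beneficiary effect of 2.0; this dilution is exactly what distinguishes the ITE from the TET.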
ITE and TET: Divide the population into A (offered non-compliers), B (offered compliers, who take the treatment), C (control-group non-compliers), and D (control-group would-be compliers). • ITE compares A & B to C & D. • Comparison of B to C & D contains selection bias even if the Treatment & Control offering is randomized. • TET compares B to D, but how to establish compliance in the control?
Relationship between ITE and TET: Let β_ITE be the correct ITE, β_TET be the correct TET, and π be the fraction of the population that are compliers. • In the absence of spillover effects (meaning that the non-compliers receive no indirect impact of the treatment), then β_ITE = π · β_TET. • If there are non-zero average spillover effects S on the non-compliers, then β_ITE = π · β_TET + (1 − π) · S. Assuming that spillover effects are smaller than direct treatment effects, it becomes harder to detect the ITE as the uptake rate of the treatment falls. This makes direct randomization of highly selective programs a challenge; the ITE is too small to detect and the TET requires pre-selection.
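A quick numeric check of this relationship, using hypothetical values: with complier share π, the no-spillover identity is ITE = π × TET, and with an average spillover S on non-compliers it becomes ITE = π × TET + (1 − π) × S.

```python
# Plug illustrative (hypothetical) numbers into the ITE/TET identity.
pi = 0.25        # complier share
tet = 2.0        # effect on a treated complier
s = 0.1          # average spillover onto non-compliers

ite_no_spill = pi * tet                    # ITE = pi * TET
ite_with_spill = pi * tet + (1 - pi) * s   # ITE = pi * TET + (1 - pi) * S

print(ite_no_spill, ite_with_spill)
```

With only a quarter of the population taking up, a sizeable per-beneficiary effect of 2.0 shows up as a population-level ITE of just 0.5, which is why low uptake makes direct randomization hard to power.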
Why it is important to think through the type of TE: • A design that seeks to estimate the ITE should locate a beneficiary population that is of interest, and then offer the program. • Takeup rates are a major problem for estimating ITEs: the lower the takeup, the larger the impact per beneficiary must be in order to detect impact. • Measuring ITEs requires gathering lots of data on people who never take the program. • ITEs usually have a highly aggregated unit (like the village) as the unit of analysis; therefore studies are big. On the other hand: • A design that seeks to estimate the TET must pre-select beneficiaries before randomization. • In general, this requires enrolling people who believe they will benefit and then randomly selecting only some of them as beneficiaries. • Complicated, not always ethical, can generate ill-will towards the implementer. In practice, we usually estimate the ITE with randomized designs.
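The takeup point can be made concrete with a rough power sketch (sample size and outcome variance are hypothetical; the constant 2.8 is the usual multiplier for 5% significance and 80% power): since the ITE is diluted by uptake, the per-beneficiary effect required for detection scales like 1/uptake.

```python
import math

# Rough minimum-detectable-effect (MDE) calculation: with 5% significance
# and 80% power, the smallest detectable mean difference is roughly
# 2.8 * sigma * sqrt(2 / n_per_arm).
def mde(sigma, n_per_arm):
    return 2.8 * sigma * math.sqrt(2.0 / n_per_arm)

sigma, n = 1.0, 1000  # hypothetical outcome SD and sample size per arm
for uptake in (1.0, 0.5, 0.2):
    # The ITE is diluted by uptake, so the per-beneficiary (TET) effect
    # needed for detection grows as 1 / uptake.
    needed_tet = mde(sigma, n) / uptake
    print(f"uptake {uptake:.0%}: detectable per-beneficiary effect >= {needed_tet:.2f}")
```

At 20% uptake, the program must move each actual beneficiary five times as much as at full uptake before the evaluation can see anything, which is the practical force behind "takeup rates are a major problem."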
Manage trade-offs: Methods & Data • Methods: • Experimental: least costly; best estimates; not always politically feasible. • Non-experimental: more expensive; good second-best estimates; always feasible. • Data: • Administrative data (expanded to controls): most cost effective. • Piggy-back on ongoing data collection: cost effective; coordination intensive. • Collect special surveys: tailored and timely; expensive.
Using Pre-existing data: • MIS data from the implementer is incredibly rich, but may not exist for the controls. • Utilize national statistics offices to locate pre-existing sources of information from the area, such as: • Censuses • World Bank Living Standards Measurement Surveys, Poverty Maps • World Bank Doing Business Surveys • Demographic & Health Surveys • GIS data often provides an extremely convenient way of overlaying sources of data that were not explicitly designed to relate to each other. • When you design your survey instrument, it is very important to know whether pre-existing questions have been asked in a specific way, so that you can design your survey questions to work as panels.
What is easily randomized? • Information: • Trainings • Political message dissemination • Dissemination of information about politician quality, corruption • Mailers offering product variation • Promotion of the treatment. • The problem with all of these is that they may be peripheral to the key variation that you really care about. • This has led to a great deal of research that studies what can be randomized, rather than what we are most interested in. • Decentralized, individual-level treatments: • Makes evaluation of many of the central questions in Political Science difficult. • Voting systems, national policies, representative-level effects, international agreements are not easily tractable. • Voter outreach, message framing, redistricting, audits are much more straightforward.
Ways to make good research design feasible: • In deciding who gets access to a program when resources are scarce (particularly in corrupt places), a lottery can be the fairest way to go. Public lotteries may be even better in many cases. • The same argument applies to rollouts: if rollout was to be staggered anyway, randomization may be the fairest way to determine the order. • There may be units that ‘must’ or ‘must not’ be treated; these are not interesting from an evaluation perspective anyway. Therefore designate the group of ‘marginal’ units and conduct the research within this group only. • If there is no way to randomize but the program has a clear screening technique, quantify this screening and then analyze impacts using a regression discontinuity design. • If enrollment in the program is voluntary but you have a tool that strongly affects uptake (pricing, promotion, marketing), then randomize this and estimate a LATE (Local Average Treatment Effect).
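The last bullet can be sketched as follows (hypothetical numbers: uptake of 60% with a randomized encouragement vs. 20% without, and a true effect of 1.5 on takers): randomize the encouragement, then divide the outcome difference by the uptake difference (the Wald estimator) to recover the LATE.

```python
import random

random.seed(1)

# Randomized 'encouragement' (e.g. a promotion) shifts uptake; the Wald
# ratio recovers the LATE for those moved into the program by it.
N = 20000
TRUE_LATE = 1.5

data = []
for _ in range(N):
    encouraged = random.random() < 0.5
    # Hypothetical uptake rates: 60% if encouraged, 20% if not.
    takes = random.random() < (0.6 if encouraged else 0.2)
    y = random.gauss(0, 1) + (TRUE_LATE if takes else 0.0)
    data.append((encouraged, takes, y))

def mean(xs):
    return sum(xs) / len(xs)

enc = [(t, y) for e, t, y in data if e]
ctl = [(t, y) for e, t, y in data if not e]

# Wald estimator: (difference in mean outcomes) / (difference in uptake).
late_hat = (mean([y for _, y in enc]) - mean([y for _, y in ctl])) / \
           (mean([float(t) for t, _ in enc]) - mean([float(t) for t, _ in ctl]))
print(f"estimated LATE: {late_hat:.2f}")
```

Note that the raw outcome difference between the two arms is small (the encouragement only moves 40 percentage points of uptake); scaling by the uptake difference is what recovers the per-beneficiary effect.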
Practical Issues in the Design of Field Trials: 1. Do you control the implementation directly? • If so, you can be more ambitious in research design. • If not, you need to be brutally realistic about the strategic interests of the agency which will be doing the actual implementation. Keep it simple. • Has the implementing agency placed any field staff on the ground whose primary responsibility is to guard the sanctity of the research design? If not, you MUST do this.
Practical Issues in the Design of Field Trials: 2. Does the program have a complex selection process? • If so, you must build the evaluation around this process. • Either pre-select and estimate the TET, or go for the ITE. • If uptake is low, you need to pre-select a sample with high uptake rates in order to detect the ITE.
Practical Issues in the Design of Field Trials: 3. Is there a natural constraint to the implementation of the program? • If so, use it to identify: • ‘Oversubscription’ method • If rollout will be staggered anyway, then you can often motivate the rationale behind a randomized order to the implementer.
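A minimal sketch of the ‘oversubscription’ method under hypothetical numbers (500 applicants, 200 slots): when demand exceeds capacity, a lottery rations access fairly and simultaneously produces a randomized control group.

```python
import random

random.seed(2)

# 'Oversubscription': more eligible applicants than program slots, so a
# lottery both rations access fairly and creates a valid control group.
applicants = [f"applicant_{i:03d}" for i in range(500)]  # hypothetical IDs
slots = 200

winners = set(random.sample(applicants, slots))        # offered the program
control = [a for a in applicants if a not in winners]  # lottery losers

print(len(winners), len(control))  # 200 treated, 300 controls to survey
```

Because every applicant already sought the program, comparing winners to losers estimates the effect on the self-selected applicant pool rather than on the population at large.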
Division of Labor in Research Projects: Questions: • Does the implementer have a conflict of interest in evaluating itself? • Does the implementer have the necessary in-house skills to do whole evaluation? • Is it reasonable to ask an implementer to use project funds to pay for evaluation? If answer to these questions is yes, no, and no, then: Write evaluation as separate line item into grants (clearly broken out from operations). Form a ‘tripartite alliance’ between funders, implementers, and researchers. Each party clearly responsible for one activity upon which their own reputation rests.
Funding of Research Projects: A number of very large-scale funders now exist that are specifically looking for projects that have: • Solid research design. • Good data collection strategy • Close collaboration between researchers and implementers Funders include: • 3ie • Gates • IGC • USAID Academics can help to raise money quickly for a large-scale, promising field evaluation.
Surveys: Should the implementer conduct them? Obvious reason TO do this is cost savings. However: • Good survey work takes a lot of skill. • Implementer can create numerous conflicts of interest by conducting surveys. • Where interventions have a normative dimension (gender issues, governance) then you can get an ‘amen choir’ effect by having the implementer ask attitudinal questions. • Blinding: hard to avoid some bias when the surveyor knows treatment status and has an interest in results. Therefore if the budget for it exists, independent surveying may produce more credible results.
Creating a Team that Works Well Together: • Field staff MUST have buy-in to the project and the research design. • Facilitate communication between researchers & field staff at beginning of project. • Recognize that the incentives placed on field staff are usually somewhat antithetical to research design; need 1 staffer who just oversees research & design. • Survey teams separate from field staff? (blinding, conflict of interest, skills). • Flow of information between research team and field staff must be maintained.
Operational Guidance: Key Steps. 1. Identify policy questions 2. Design evaluation 3. Prepare data collection instruments 4. Collect baseline 5. Implement program in treatment areas 6. Collect follow-up 7. Analyze and feedback 8. Improve program and implement anew
Conclusions: • Choose the programs for full-scale evaluation carefully: • Soon to see a rapid expansion • Program has some micro component • Some tool exists to create variation in selection into program • Institution has pressing need to understand that program better. • Remember that a randomized trial does not necessarily need a full-blown baseline study! • Define research questions to maximize learning. • Identify members of a research team by thinking through comparative advantage and conflict of interest. • Plan ahead, rigorous control over research design. • Rapid & effective dissemination of results to funders, implementers, and field staff.