
Field Research Methods


Presentation Transcript


  1. Field Research Methods Lecture 9: Randomized Field Experiments/Controlled Trials Edmund Malesky, Ph.D.

2. Last Chance • Do you have any questions about your final projects?

3. All the Rage • In recent years, the use of randomized experiments has exploded in the social sciences, particularly in applied microeconomics and development economics. • Recently, randomized experiments have begun to gain currency among development practitioners and political scientists as well.

  4. All of These Types of Experiments are Increasing in Political Science

5. The Gold Standard • Randomized Controlled Trials (RCTs), while not universally accepted, are beginning to gain currency as the "gold standard" in policy evaluation. • At the forefront of this debate has been the Jameel Poverty Action Lab (J-PAL) at MIT. • J-PAL famously commented that only 2% of World Bank projects are properly evaluated using RCTs, setting off an intense debate. • The debate still continues.

6. A Great Culmination for the Class • Pulling off an RCT involves nearly all of the tools we have been adding to our toolkit this quarter: • Research design/asking the right question • Survey design • Sampling techniques • Case selection

7. Brief History • Not really new in the social sciences. • Psychologists were performing experiments in the 1800s. • Harold Gosnell used experiments in his work on Chicago machine politics: "Does canvassing increase voter turnout?" • While political science had a behavioralist revolution in the 1950s and 1960s, it tended to rely on surveys rather than experiments.

8. Why Experiments Were Delayed • Political science and economics study a specific "real world" behavioral domain (unlike psychology). • Researchers traditionally, and reasonably, worried about artificiality. • Experimentation, by introducing artificiality, is suspect. • Control and manipulation are not always possible in research.

9. Organization of Today's Lecture • Why Randomization? • 3 Ingredients for Randomization • Conducting Randomized Controlled Trial (RCT) Experiments • Issues in RCT design • Summary of Today's Readings • A close look at the Olken study • Field vs. Laboratory

  10. Why Randomized Experiments? • A Refresher: Throughout this course, we have referred to the two major issues which plague causal inference in research projects. What are they? • Selection Bias • Endogeneity

11. Motivation: selection equation notation • The motivation for randomization in policy evaluation is by now well established. • Regression (OLS) and matching estimators require a very similar assumption of selection on observables. If the selection equation is $T_i = X_i \gamma + u_i$ and the outcome equation is $Y_i = X_i \beta + \tau T_i + \varepsilon_i$, then the only way in which the impact estimator $\hat{\tau}$ can be unbiased is if $\mathrm{Cov}(u_i, \varepsilon_i) = 0$, i.e., if the unobservables in the selection and outcome equations are uncorrelated. • With random assignment of treatment in a large sample, $\mathrm{Cov}(T_i, \varepsilon_i) = 0$, and so $\hat{\tau}$ is unbiased by construction.
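To make the contrast concrete, here is a minimal simulation sketch in Python (not from the slides; the variable names and numbers are illustrative): a single unobservable drives both selection and outcomes, so the naive comparison of means is biased, while random assignment recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_effect = 100_000, 2.0

# Unobservable that raises both the chance of taking the
# treatment and the outcome itself (Cov(u, eps) != 0).
ability = rng.normal(size=n)

# Self-selection: higher-ability units opt in more often.
t_selected = (ability + rng.normal(size=n) > 0).astype(float)
y_selected = true_effect * t_selected + ability + rng.normal(size=n)

# Random assignment: treatment independent of ability.
t_random = rng.integers(0, 2, size=n).astype(float)
y_random = true_effect * t_random + ability + rng.normal(size=n)

naive = y_selected[t_selected == 1].mean() - y_selected[t_selected == 0].mean()
rct = y_random[t_random == 1].mean() - y_random[t_random == 0].mean()
print(f"self-selected estimate: {naive:.2f}")  # biased upward, ~3.1
print(f"randomized estimate:    {rct:.2f}")    # ~2.0, the true effect
```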

12. Motivation: potential outcomes notation • $Y_i^C$ is the outcome for a given individual without the treatment, and $Y_i^T$ is the outcome for the same individual if treated. • The Average Treatment Effect conditional on $X$ is $E[Y_i^T - Y_i^C \mid X]$, but we do not observe the same individuals in both treated and control states. Therefore we must estimate $E[Y_i^T \mid T] - E[Y_i^C \mid C]$. • The bias is given by $E[Y_i^C \mid T] - E[Y_i^C \mid C]$. • With a clean randomization, the counterfactual outcomes of the treated and control groups would have been the same, and so the bias term goes to zero.

13. An Example of How Randomization Addresses Selection Bias • The observed difference is $E[Y_i^T \mid T] - E[Y_i^C \mid C]$. From each element, subtract the term $E[Y_i^C \mid T]$, the expected outcome in the treatment group had its members not been treated: $$E[Y_i^T \mid T] - E[Y_i^C \mid C] = \underbrace{\left(E[Y_i^T \mid T] - E[Y_i^C \mid T]\right)}_{\text{Treatment Effect: ``Effect of Treatment on the Treated''}} + \underbrace{\left(E[Y_i^C \mid T] - E[Y_i^C \mid C]\right)}_{\text{Selection Bias}}$$ • Selection bias is the difference in untreated outcomes between the treated and comparison groups. With randomization it is zero, and the Effect of Treatment on the Treated equals the Average Treatment Effect.

14. Why Randomized Experiments? • To these general problems, we can add two problems specific to field research: • We seldom find empirical situations where ONLY the variables of theoretical interest are active. • It is hard to measure the variables of theoretical interest (utilities, beliefs) in natural settings. • We must assume they are functions of what can be measured (i.e., voting choices and survey answers vs. true voter preferences).

15. Policy Evaluation introduces a 5th, but related, dilemma • The unobserved counterfactual. • Any discussion of the impact of a policy intervention essentially boils down to answering a counterfactual question: how would the individual or group have fared in the absence of the intervention? • But at a given point in time, an individual is either exposed to the program or not. • Comparing over time is not an adequate solution, because other factors may be changing over time as well.

16. Why Randomized Experiments? • If performed correctly, randomization helps a researcher resolve all 5 of these dilemmas. • Selection Bias: When individuals or groups are randomly assigned to treatment and comparison groups, we are able to isolate the impact of the treatment for the specific population from which the groups are drawn. • Endogeneity: Randomization helps us solve this problem because it allows the researcher to introduce the shock and follow its effect forward, rather than simply observing the relationship at a single point in time.

17. Why Randomized Experiments? • Inability to hold other factors constant. • The Duflo et al. reading has a very interesting point to make here that is worth exploring in depth; see the worked equations below. • Y is actually the overall impact of a particular treatment, allowing other factors to change in response to the treatment. A researcher may be interested in two different things: • How do changes in t (one element of the vector of inputs) affect Y when all other elements are held constant? The partial derivative. • How does Y change when other factors are allowed to vary in response to a change in t? The total derivative. • The researcher can track the impact of a change on an individual's behavior regarding the other elements (substitution effects, etc.).
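A minimal sketch of the distinction, in generic notation (the symbols here are illustrative, not Duflo et al.'s exact ones): the outcome depends on the treated input and on other inputs that themselves respond to the treatment.

```latex
% Outcome depends on treatment t and other inputs z, which respond to t:
Y = f\bigl(t, z(t)\bigr)

% Partial derivative: effect of t with the other inputs held fixed
\frac{\partial Y}{\partial t} = f_t(t, z)

% Total derivative: overall effect, letting the other inputs adjust to t
\frac{dY}{dt} = f_t(t, z) + f_z(t, z)\,\frac{dz}{dt}
```

An experiment that lets subjects reallocate their behavior measures the total derivative; holding everything else fixed would instead isolate the partial derivative.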

  18. Why Randomized Experiments? • Measurement Difficulties • Rather than measuring perceptions, we can measure how individuals actually respond to changes in incentives (revealed preferences). • Unobserved Counterfactual • We still don’t have the perfect counterfactual, but we have the next best thing – a comparison to individuals or groups that are similar in every way except the treatment.

19. Why Randomized Experiments? • Other solutions to selection bias without an RCT: • Controlling for observables in regression analysis. • Only works if the matrix of control variables captures all the differences between the control and treatment groups. • This omitted variable bias is unfortunately not testable. • Regression Discontinuity Design. • Takes advantage of the fact that entry into the program is a discontinuous function of one or more variables. • Compare treated cases to those who just missed the threshold. • The problem is that the discontinuity is not always easy to prove (and its assumptions are easily violated). • Difference-in-Differences and Fixed Effects. • Within a panel framework, look at changes pre- and post-treatment using panel fixed effects.

20. Frontloading complexity • So, randomization has rapidly been gaining ground in policy circles. • If the goal of policy research is to influence policymakers, the evidence from randomized trials is very straightforward and transparent. Experiments like Progresa in Mexico have had huge policy effects. • While the econometric analysis of randomized trials is completely straightforward, their use front-loads all the complexity onto the research design. Implementation is key! • The difficulty in such evaluations is not deciding what you would like to randomize, but understanding what you will be able to successfully randomize in a given setting, and building a research design around this.

  21. Randomization increasingly common in PoliSci: Direct Field Experiments: • Bjorkman & Svensson ‘Power to the People: Evidence from a Randomized Experiment of a Citizen Report Card in Uganda’ • Green, ‘Mobilizing African-American Voters using Direct Mail and Commercial Phone Banks: A Field Experiment’ • Horiuchi, Imai, & Taniguchi, ‘Designing and Analyzing Randomized Experiments: Application to a Japanese Election Survey Experiment’. • Humphreys, Masters, & Sandbu, ‘The Role of Leaders in Democratic Deliberations: Results from a Field Experiment in Sao Tome & Principe’ • Olken ‘Monitoring Corruption: Evidence from a Field Experiment in Indonesia’ • Vicente ‘Is Vote Buying Effective? Evidence from a Randomized Experiment in West Africa’ • Wantchekon ‘Clientelism and Voting Behavior: Evidence from a Field Experiment in Benin’ Use of Incidental Randomizations: • Chattopadhyay & Duflo ‘Women as Policy Makers: Evidence from a Randomized Policy Experiment in India’ • Finan & Ferraz, ‘Exposing Corrupt Politicians: The Effect of Brazil’s Publicly Released Audits on Electoral Outcomes’

  22. Three Critical Experiment Ingredients • Manipulation • Type of Treatment Effects • Control (baseline, comparisons) • Random Assignment

23. Conducting An Experiment • Partners • Government: Take advantage of government willingness to pilot a public policy program. • Working with government requires high-level consensus and may face resistance from officials whose constituents are upset about being excluded from the treatment. • NGOs: Not subject to these discrimination problems, because their programs are always isolated and individualized to some extent. • But are the results dependent on an impossible-to-replicate organizational culture? • Multilaterals • For-profit firms: Especially in the world of micro-credit.

24. Methods of Randomization • Classical Clinical Design • Oversubscription Method • Resources limit selection into a program; use a lottery to determine selection. The incentives of applicants are similar, allowing for comparison. • Randomized Order of Phase-In • The program will be phased in over time; compare early groups to later groups (Miguel and Kremer). • Beware that groups not yet selected may change behavior out of the knowledge that they will be selected next, with both positive and negative effects. • Cannot study long-term effects. • Within-Group Randomization • i.e., different grades within a school. • High risk of contamination. • Encouragement Design • Evaluate the impact of a treatment that is available to all, but whose take-up is not universal. • Information provision.

25. Types of Treatment Effects: There are three types of treatment effects one can estimate. • If the program in question is universal or mandatory: • Average Treatment Effect (ATE): What is the expected effect of treating the average individual? • If the program has eligibility requirements, or is voluntary with <100% uptake, not all individuals will receive the program in its actual implementation. Then we won't see the average individual treated, and so we don't care about the ATE. We care about: • Intention to Treat Effect (ITE): What is the expected effect on a whole population of extending treatment to the subset that actually gets treated? • Treatment Effect on the Treated (TET): What is the expected individual treatment effect on the type of individual who takes the treatment?

26. How effect type determines research design: • The Average Treatment Effect can be directly randomized. • Intention to Treat Effect: • Take a random sample of the entire population. • Conduct the normal selection process within a randomly selected subset of the sample. • The ITE is given by the difference in outcomes between the 'offered' and 'not offered' random samples of the population. • Treatment Effect on the Treated: • Use only the subset of the sample who would have been offered and would have taken the treatment. These are the 'compliers'. • Randomize treatment within the compliers. • The TET is given by the difference in outcomes between the treated and untreated compliers. • The TET is not easily estimated in practice because it requires pre-enrollment of treated units and then randomization. In practice, this means offering access to a lottery.

27. ITE and TET: • ITE compares A & B to C & D. • A comparison of B to C & D contains selection bias even if the treatment and control offerings are randomized. • TET compares B to D, but how do you establish compliance in the control group? • (The labels refer to a diagram on the original slide; in context, A & B appear to be the offered group, with B the compliers who take the treatment, and C & D the non-offered group, with D the would-be compliers.)

28. Relationship between ITE and TET: Let $\beta_{ITE}$ be the correct ITE, $\beta_{TET}$ be the correct TET, and let $\pi$ be the fraction of the population that are compliers. • In the absence of spillover effects (meaning that the non-compliers receive no indirect impact of the treatment), $\beta_{ITE} = \pi \, \beta_{TET}$. • If there are non-zero average spillover effects $\bar{s}$ on the non-compliers, then $\beta_{ITE} = \pi \, \beta_{TET} + (1 - \pi)\,\bar{s}$. • Assuming that spillover effects are smaller than direct treatment effects, it becomes harder to detect the ITE as the uptake rate of the treatment falls. This makes direct randomization of highly selective programs a challenge; the ITE is too small to detect and the TET requires pre-selection.
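A quick worked example (the numbers are illustrative, not from the slides): with a 10% uptake rate and no spillovers, even a large TET shrinks to a small ITE.

```latex
\pi = 0.10, \qquad \beta_{TET} = 0.50
\quad\Rightarrow\quad
\beta_{ITE} = \pi\,\beta_{TET} = 0.10 \times 0.50 = 0.05
```

Detecting an effect of 0.05 requires a far larger sample than detecting 0.50, which is why low-uptake programs are hard to evaluate with a population-wide randomization.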

29. Statistical Power • Power is the probability of rejecting the null hypothesis when a specific alternative hypothesis is true. • In a study comparing two groups, power is the chance of rejecting the null hypothesis that the two groups share a common population mean, and therefore claiming that there is a difference between the population means, when in fact there is a difference of a given magnitude. • It is thus the chance of making the correct decision that the two groups are different from each other.

30. Power & Significance: The left-hand curve is the distribution of $\hat{\beta}$ under the null that it is zero; the right-hand curve is the distribution of $\hat{\beta}$ if the true effect size is $\beta$. Significance comes from the right tail of the left-hand curve; power comes from the area of the right-hand curve beyond the same critical value. (Source: Duflo & Kremer, 'Toolkit'.)

31. Power & Significance. How many observations is 'enough'? • Not a straightforward question to answer. • Even the simplest power calculation requires that you know the expected treatment effect, the variance of outcomes, and the treatment uptake percentage. • From here, you need to pick a 'power' $\kappa$ (the probability that you reject when you should reject, and thus avoid Type II error), typically $\kappa = 0.80$, giving $t_{1-\kappa} \approx 0.84$ (one-tailed). • Then, pick a 'significance' level $\alpha$ (the probability that you falsely reject when you should accept, and thus commit Type I error), typically $\alpha = 0.05$, giving $t_{\alpha/2} \approx 1.96$ (two-tailed). • With these you can calculate the minimum sample size as a function of the desired test power.

32. Minimum Sample Size: The minimum detectable effect for total sample size $N$, treated share $p$, and outcome variance $\sigma^2$ is $$MDE = (t_{\alpha/2} + t_{1-\kappa}) \sqrt{\frac{1}{p(1-p)}} \sqrt{\frac{\sigma^2}{N}},$$ and inverting gives the minimum sample size $$N = \frac{(t_{\alpha/2} + t_{1-\kappa})^2 \, \sigma^2}{p(1-p)\, MDE^2}.$$ And so you can get away with a smaller sample size if you have: • High expected treatment effects • Low-variance outcomes • Treatment and control groups of similar sizes ($p = 0.5$) • A willingness to accept low significance and power.
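A minimal Python sketch of this calculation (the function and parameter names are my own; the formula is the reconstructed one above):

```python
from scipy.stats import norm

def min_sample_size(mde, sigma, p=0.5, alpha=0.05, power=0.80):
    """Minimum total N to detect an effect `mde` given outcome sd `sigma`,
    treated share `p`, two-tailed significance `alpha`, and desired power."""
    t_alpha = norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    t_kappa = norm.ppf(power)          # ~0.84 for power = 0.80
    return (t_alpha + t_kappa) ** 2 * sigma ** 2 / (p * (1 - p) * mde ** 2)

# e.g., detect a 0.2-sd effect with a 50/50 treatment/control split:
print(round(min_sample_size(mde=0.2, sigma=1.0)))  # ~785 total observations
```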

  33. Levels of Randomization • Which level does one randomize at? • Issues: • The larger the groups, the larger the sample size needed to achieve a given power. • If spillover bias is a threat, randomization should occur at a high enough level to capture these effects. • Randomization at the group level can be easier to implement. • Individual-level randomization may create substantial resentment towards the implementing organization.

  34. Clustered Treatment Designs: It is often natural to implement randomization at a unit more aggregated than the one at which data is available. • Examples: • School- or village-level randomization of programs using students • Market- or city-level tests of political messages using voters • Hospital-level changes in medical practice using patients The impact of this ‘design effect’ on the power of the test being conducted is analogous to the clustering of standard errors in the significance of regression estimators. Bottom-line: the power of the test has more to do with the number of units over which you randomize than it does with the number of units in the study.

35. Clustered Treatment Designs: Look at the difference in the 'minimum detectable effect', the smallest true impact that an experiment has a good chance of detecting. • Without a clustered design: $$MDE = (t_{\alpha/2} + t_{1-\kappa}) \sqrt{\frac{1}{p(1-p)N}} \; \sigma$$ • With a clustered design: $$MDE = (t_{\alpha/2} + t_{1-\kappa}) \sqrt{\frac{1}{p(1-p)J}} \sqrt{\rho + \frac{1-\rho}{n}} \; \sigma$$ ($J$ is the number of equally-sized clusters, $\rho$ is the intra-cluster correlation, and $n$ is the number of observations per cluster.)
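A short Python sketch of how clustering inflates the detectable effect, using the reconstructed formulas above (names and numbers are illustrative):

```python
from scipy.stats import norm

def mde_clustered(J, n, rho, sigma=1.0, p=0.5, alpha=0.05, power=0.80):
    """Minimum detectable effect with J clusters of n units each and
    intra-cluster correlation rho (rho=0, n=1 reproduces the unclustered
    formula with N=J)."""
    m = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return m * (1 / (p * (1 - p) * J)) ** 0.5 * (rho + (1 - rho) / n) ** 0.5 * sigma

# 10,000 individuals either way, but very different power:
print(f"{mde_clustered(J=10_000, n=1, rho=0.0):.3f}")  # individual randomization, ~0.056
print(f"{mde_clustered(J=100, n=100, rho=0.1):.3f}")   # 100 clusters of 100, ~0.185
```

The detectable effect roughly triples in the clustered case, illustrating the bottom line above: power depends far more on the number of units you randomize over than on the number of units in the study.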

36. Power calculations in practice: • Use software! Numerous free programs exist on the internet: • 'Optimal Design' • http://sitemaker.umich.edu/group-based/optimal_design_software • 'Edgar' • http://www.edgarweb.org.uk/ • 'G*Power' • http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/ • Many of these use medical rather than social-science descriptions of statistical parameters and can be confusing to use. • Make sure to use a power calculator that can handle clustered designs if your study units are not the same as the intervention units. • Reality check: you very frequently face a sample-size constraint for logistical reasons, and then power calculations are relevant only to tell you the probability that you will detect anything.

37. Conducting the Actual Randomization. In many ways simpler than it seems. It is easy to use the random-number generator in Excel or Stata. • For straight one-shot randomizations: • List the IDs of the units in the randomization frame. • Generate a random number for each unit in the frame. • Sort the list by the random number. • Flip a coin for whether the first or the second unit on the list goes into the treatment. • Then assign every other unit down the list to the treatment. • Enjoy! (See the sketch below.)
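The same procedure in a few lines of Python (a sketch; the unit IDs are hypothetical):

```python
import random

random.seed(42)  # make the assignment reproducible
ids = [f"unit-{i:03d}" for i in range(1, 201)]  # randomization frame

# Attach a random number to each unit and sort by it.
frame = sorted(ids, key=lambda _: random.random())

# Coin flip: does the first or the second unit start the treatment group?
offset = random.randint(0, 1)
treatment = set(frame[offset::2])  # every other unit thereafter

assignment = {u: ("T" if u in treatment else "C") for u in frame}
```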

38. Stratification & Blocking. Why might you not want to do a single-shot randomization? Imagine that you have a pre-observable continuous covariate X which you know to be highly correlated with outcomes. • Why rely on chance to make the treatment orthogonal to this X? You can stratify on X to generate an incidence of treatment which is orthogonal to this variable by construction. • What if you have a pre-observable discrete covariate which you know to be highly correlated with outcomes, or if you want to be sure to be able to analyze treatment effects within cells of this discrete covariate? • You can block on this covariate to guarantee that each subgroup has the same treatment percentage as the sample as a whole. • The expected variance of a stratified or blocked randomized estimator cannot be higher than the expected variance of the one-shot randomized estimator.

39. Conducting Stratified or Blocked Randomizations. Again, not very hard to do. • For stratified or blocked randomizations: • Take a list of the units in the randomization frame. • Generate a random number for each unit in the frame. • Sort the list by the stratification or blocking criterion first, then by the random number. • Flip a coin for whether you assign the first unit on the list to the treatment or the control. • Then alternate treatment status for every unit down the list; this generates p = 0.5. • For multiple strata or blocks: • Sort by all strata or blocks and then by the random number last, then proceed as above. (See the sketch below.)
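A sketch of the blocked version in Python (the `region` blocking variable is made up for illustration):

```python
import random

random.seed(7)
units = [{"id": i, "region": random.choice(["north", "south"])}
         for i in range(200)]

# Sort by the blocking criterion first, then by a random number.
for u in units:
    u["r"] = random.random()
units.sort(key=lambda u: (u["region"], u["r"]))

# Alternate treatment status down the sorted list: within each block
# (up to one unit at each block boundary), half the units are treated.
start = random.randint(0, 1)
for pos, u in enumerate(units):
    u["arm"] = "T" if (pos + start) % 2 == 0 else "C"
```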

40. Post-randomization tests: It is quite common for researchers to write loops which conduct the randomization many times, check for balance across numerous characteristics, and keep re-running until balance across a pre-specified set of statistics has been achieved. Debate exists over this practice. It certainly generates a good-looking t-table of baseline outcomes, and it functions like a multi-dimensional stratification criterion. However: • t-tests of difference based on a single comparison of means are no longer correct, and • It is not easy to see how to correct for the research design structure in the estimation of treatment effects (Bruhn & McKenzie, 2008).
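For concreteness, here is what such a re-randomization loop looks like in Python (a sketch with simulated covariates; the caveats above apply to exactly this kind of procedure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))  # three baseline covariates

def assign(n):
    """Randomly assign half the units to treatment."""
    arm = np.zeros(n, dtype=bool)
    arm[rng.permutation(n)[: n // 2]] = True
    return arm

# Re-draw until every covariate passes a naive balance t-test.
for _ in range(1000):
    t = assign(len(X))
    pvals = [stats.ttest_ind(X[t, j], X[~t, j]).pvalue for j in range(X.shape[1])]
    if min(pvals) > 0.30:  # pre-specified balance criterion
        break
# Note: after this selection step, the usual t-tests and naive standard
# errors for the treatment effect are no longer strictly valid.
```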

  41. Why wouldn’t you always block or stratify? Bruhn & McKenzie demonstrate that the research design structure must be reflected by the treatment of standard errors in the estimation equation. • For example, if you block on discrete values then you need to include fixed effects for these values in the estimation equation. This uses up DOF. Is this worth it? • Answer: it is worthwhile to block if you are blocking on a characteristic which has powerful effects on the outcome. Otherwise, the blocked design eats up degrees of freedom and results in a lower power final test than otherwise. • Differences between blocked or stratified & one-shot randomizations tend to disappear with a number of units greater than 300.

42. What is easily randomized? • Information: • Trainings • Political message dissemination • Dissemination of information about politician quality and corruption • Mailers offering product variation • Promotion of the treatment. • The problem with all of these is that they may be peripheral to the key variation that you really care about. • This has led to a great deal of research which studies that which can be randomized, rather than that which we are interested in. • Decentralized, individual-level treatments: • This makes evaluation of many of the central questions in political science difficult. • Voting systems, national policies, representative-level effects, and international agreements are not easily tractable. • Voter outreach, message framing, redistricting, and audits are much more straightforward.

  43. Practical Issues in the Design of Field Trials: • Do you control the implementation directly? • If so, you can be more ambitious in research design. • If not, you need to be brutally realistic about the strategic interests of the agency which will be doing the actual implementation. Keep it simple. • Has the implementing agency placed any field staff on the ground whose primary responsibility is to guard the sanctity of the research design? If not, you MUST do this. • Does the program have a complex selection process? • If so, you must build the evaluation around this process. • Either pre-select and estimate TET, or go for ITE. • If uptake is low, you need to pre-select a sample with high uptake rates in order to detect the ITE • Is there a natural constraint to the implementation of the program? • If so, use it to identify: • ‘Oversubscription’ method • If rollout will be staggered anyway, then you can often motivate the rationale behind a randomized order to the implementer.

44. Imperfect Randomization • Local Average Treatment Effect (LATE) • Partial Compliance • Try to choose the design with the highest level of compliance. • Externalities • Spillover effects both within and outside groups. • If spillover is likely, design the program to address it (Miguel and Kremer). • Attrition • Random attrition reduces the sample size and thus inflates standard errors. • Systematic attrition will actually bias results. • Make sure to record and track attrition.

45. Experimental Estimation of Spillover Effects: In the presence of spillover effects, the simple treatment-control difference no longer gives the correct treatment effect. Spillovers cause trouble for designs where the treatment saturation is blocked, but there are a couple of easy ways to use or create variation to measure them directly. • Miguel & Kremer, 'Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities'. • A deworming program randomized at the school level. • Controlling for the number of pupils within a given distance of an untreated unit, they look at how outcomes change as a function of the number of those pupils that were randomly assigned to treatment. • Because treatment is randomized, the localized intensity of treatment is incidentally randomized. • Baird, McIntosh, & Özler, 'Schooling, Income, & HIV Risk in Malawi'. • A conditional cash transfer program run at the village level. • The saturation of treatment in each village is directly randomized, allowing untreated girls in treatment villages to be compared to the control as a function of the share of girls in their village that were treated.

46. Randomizing things that alter the probability of treatment: Local Average Treatment Effects. Angrist, Imbens, and Rubin, 'Identification of Causal Effects Using Instrumental Variables', JASA 1996: • The paper tests for the health impact of serving in Vietnam using the Vietnam draft lottery number as an instrument for the probability of serving. • The identification problem is the endogeneity of the decision to serve in Vietnam. • The draft lottery number is a classic instrument in the sense that it has no plausible connection with subsequent health outcomes other than through military service and is not itself caused by military service (exclusion), and it strongly drives participation (significance in the first stage). • The only atypical element is that the first-stage outcome is binary. • LATE estimates a quantity similar to the TET. • Rather than giving a treatment effect for all compliers, it gives the treatment effect on those units induced to take the treatment as a result of the instrument. • LATE can't capture the impact of serving on those who would have enlisted no matter what, nor on those who dodged the draft. • It therefore provides a very clean treatment effect on a very unclear subset of the sample. You do not know which units identify the treatment.
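A minimal simulated sketch of the Wald/IV estimator behind LATE (all names and numbers are illustrative, not from the paper): a random instrument shifts take-up, and the ratio of the reduced-form effect to the first-stage effect recovers the effect on compliers.

```python
import numpy as np

rng = np.random.default_rng(3)
n, effect = 200_000, -1.0

z = rng.integers(0, 2, size=n)  # randomized instrument (e.g., draft lottery)
u = rng.random(size=n)          # latent compliance type
complier = u < 0.25             # serve only if drafted
always = u > 0.90               # volunteer regardless of the lottery
d = np.where(complier, z, np.where(always, 1, 0)).astype(float)

health = effect * d + rng.normal(size=n)

# Wald estimator: reduced form / first stage gives the effect on compliers.
reduced_form = health[z == 1].mean() - health[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
print(f"LATE estimate: {reduced_form / first_stage:.2f}")  # ~ -1.0
```

Note how the always-takers never contribute to the first stage, which is why the estimate is local to the units the instrument actually moves.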

47. Application of LATE in the field: In principle, a very attractive method: • You don't have to randomize access to the treatment, just something which alters the probability of taking the treatment. • This opens up the idea that promotions, publicity, and pricing could be randomized in simple ways to gain experimental identification. • The problem in practice: • It turns out to be pretty tough to promote programs well enough to gain first-stage significance. • Many field researchers have tried LATE promotions in recent years; very few papers are coming out because the promotions have not been sufficiently effective. • Even if they work, you need to ask whether this newly promoted group represents a sample over which you want treatment effects. Sometimes it does, sometimes it doesn't.

48. Internal vs. External Validity in Randomized Trials: Randomized field researchers tend to be meticulous about internal validity and somewhat dismissive of external validity. • To some extent, what can you say? You want to know whether these results would hold elsewhere? Then replicate them there! • However, this is an attitude much excoriated by policymakers: just tell us what works and stop asking for additional research money! • So, given that it is always difficult to make claims about external validity from randomized trials, what can you do? • The more 'representative' the study sample is of a broader population, the better. • The more heterogeneous the study sample is, the more ability you have to (for example) reweight the sample to look like the population and therefore use the internal variation to project the external variation. • Beware of controlling the treatment in a randomized trial to such a perfect extent that it stops resembling what the project would actually look like when implemented in the field. Remember that your job is to analyze the actual program, 'warts and all'.

49. External Validity - Generalizability • Hawthorne Effects: The treated group changes its behavior simply because it knows it is being observed. • John Henry Effects: The non-treated group works harder out of spite.
