
The ethics of evaluation

More thought has been given to the validity of the conclusions drawn from development impact evaluations than to the ethical validity of how the evaluations were done. This is not an issue for all evaluations. Sometimes an impact evaluation is built into an existing program such that nothing changes about how the program works. The evaluation takes as given the way the program assigns its benefits. So if the program is deemed to be ethically acceptable then this can be presumed to also hold for the method of evaluation. (I leave aside ethical issues in how evaluations are reported and publication biases.) We can dub these “ethically benign evaluations.”

Another type of evaluation deliberately alters the program’s (known or likely) assignment mechanism—who gets the program and who does not—for the purpose of the evaluation. Then the ethical acceptability of the intervention does not imply that the evaluation is ethically acceptable. Call these “ethically contestable evaluations.” The main examples in practice are randomized control trials (RCTs). Scaled-up programs almost never use randomized assignment, so the RCT has a different assignment mechanism, and this may be contested ethically even when the full program is fine.

A debate has emerged about the ethical validity of RCTs. This has been brewing for some time but there has been a recent flurry of attention to the issue, stimulated by a New York Times post last week by Casey Mulligan and various comments including an extended reply by Jessica Goldberg. Mulligan essentially dismisses RCTs as ethically unacceptable on the grounds that some of those to whom a program is assigned for the purpose of evaluation—the “treatment group”—will almost certainly not need it, or benefit little, while some in the control group will. As an example, he endorses Jeff Sachs’s arguments as to why the Millennium Villages project was not set up as an RCT. Goldberg defends the ethical validity of RCTs against Mulligan’s critique. On the one hand, she argues that randomization can be defended as ethically fair given limited resources; on the other hand, even if one still objects, the gains from new knowledge can outweigh the objections.

I have worried about the ethical validity of some RCTs, and I don’t think development specialists have given the ethical issues enough attention. But nor do I think the issues are straightforward. So this post is my effort to make sense of the debate.

Ethics is a poor excuse for lack of evaluative effort. For one thing, there are ethically benign evaluations. But even focusing on RCTs, I doubt if there are many “deontological purists” out there who would argue that good ends can never justify bad means and so side with Mulligan, Sachs and others in rejecting all RCTs on ethical grounds. That is surely a rather extreme position (and not one often associated with economists). It is ethically defensible to judge processes in part by their outcomes; indeed, there is a long tradition of doing so in moral philosophy, with utilitarianism as the leading example. It is not inherently “unethical” to do a pilot intervention that knowingly withholds a treatment from some people in genuine need, and gives it to some people who are not, as long as this is deemed to be justified by the expected welfare benefits from new knowledge.

Far more problematic is either of the following:

  • Any presumption that an RCT is the only way we can reliably learn. That is plainly not the case, as anyone familiar with the full range of (quantitative and qualitative) tools available for evaluation will know.
  • Any evaluation for which the expected gains from new knowledge cannot reasonably justify an ethically-contestable methodology.

The latter situation is clearly objectionable if it is seen to hold. But it is often hard to verify in development settings. Ethics has been much discussed in medical research. In that context, the principle of equipoise requires that there should be no decisive prior case for believing that the treatment has impact sufficient to justify its cost. (This is David McKenzie’s sensible modification to clinical equipoise to fit the types of programs in discussion here.) By this reasoning, only if we are sufficiently ignorant about the likely gains relative to costs should we evaluate further. Implementation of such an ethical principle may not be easy, however. In the context of antipoverty or other public programs, a priori (theoretical and/or empirical) arguments can often be made both for and against believing ex ante that impact is likely. A clever researcher can often create a convincing straw man to suggest that some form of equipoise holds and that the evaluation is worth doing. While this cannot be prevented, we should at least demand that the case is made and that it stands up to scholarly public scrutiny. That is clearly not the norm at present.

It has often been argued that whenever rationing is required—when there is not enough money to cover everyone—randomized assignment is a fair solution. (Goldberg makes this claim, though I have heard it often. Indeed, I have made this argument a few times with government counterparts in attempting to convince them of the merits of randomization.) In practice, this is clearly not the main reason that randomistas randomize. But should it convince the unbelievers? It can be accepted when information is very poor, or allocative processes are skewed against those in need. In some development applications we may know very little ex ante about how best to assign participation to maximize impact. But when alternative allocations are feasible (and if randomization is possible then that condition is evidently met) and one does have information about who is likely to benefit, then surely it is fairer to use that information, and not randomize, at least unconditionally.

Conditional randomization can help relieve ethical concerns. One first selects eligible types of participants based on prior knowledge about likely gains, and only then randomly assigns the intervention, given that not all can be covered. For example, if one is evaluating a training program, or a program that requires skills for maximum impact, one would reasonably assume (backed up by some evidence) that prior education and/or experience will enhance impact, and design the evaluation accordingly. This has ethical advantages over simple randomization when there are priors about likely impacts.
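To make the two-step logic concrete, here is a minimal sketch of conditional randomization under stated assumptions. The eligibility rule (at least eight years of schooling), the field names, and the number of program slots are all hypothetical and chosen only for illustration; a real design would draw its eligibility criteria from prior evidence on likely gains.

```python
import random


def conditional_randomize(candidates, is_eligible, n_slots, seed=0):
    """Randomize treatment only among candidates judged eligible.

    Step 1: screen on prior knowledge (the is_eligible rule).
    Step 2: draw the treatment group at random from the eligible pool,
            since not everyone eligible can be covered.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be audited
    eligible = [c for c in candidates if is_eligible(c)]
    treated = rng.sample(eligible, min(n_slots, len(eligible)))
    treated_ids = {c["id"] for c in treated}

    assignment = {}
    for c in candidates:
        if c["id"] in treated_ids:
            assignment[c["id"]] = "treatment"
        elif is_eligible(c):
            assignment[c["id"]] = "control"
        else:
            # Screened out before the lottery; not part of the comparison.
            assignment[c["id"]] = "not_eligible"
    return assignment


# Hypothetical applicant pool: 40 people with made-up schooling levels.
applicants = [{"id": i, "years_schooling": i % 12} for i in range(40)]

assignment = conditional_randomize(
    applicants,
    is_eligible=lambda c: c["years_schooling"] >= 8,  # assumed prior
    n_slots=5,
    seed=42,
)
```

Note that the comparison is then between treated and control units within the eligible pool only, so the estimated impact applies to that pool rather than to the whole population.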

But there is a catch. The set of things observable to the evaluator is typically only a subset of what is observable on the ground (such information asymmetry is, after all, the reason for randomizing in the first place). At the local level, there will typically be more information—revealing that the program is being assigned to some who do not need it, and withheld from some who do. The RCT may be ethically unacceptable at (say) village level. But then whose information should decide the matter? It may be seen as quite lame for the evaluator to plead “I did not know” when others do in fact know very well who is in need and who is not.

Goldberg reminds us of another defense often heard, namely that RCTs can use what are called “encouragement designs.” The idea here is that nobody is prevented from accessing the primary service of interest (such as schooling) but the experiment instead randomizes access to some form of incentive or information. This may help relieve ethical concerns for some observers, but it clearly does not remove them—it merely displaces them from the primary service of interest to a secondary space. Ethical validity still looms as a concern when any “encouragement” is being deliberately withheld from some people who would benefit and given to some who would not.

While ethical validity is a legitimate concern in its own right, it also holds implications for other aspects of evaluation validity. There is heterogeneity in the ethical acceptability of RCTs. That will vary from one setting to another. One can get away with an RCT more easily with NGOs than governments, and with small interventions, preferably in out-of-the-way places. (By contrast, imagine a government trying to justify why some of its under-served rural citizens were randomly chosen to not get new roads or grid connections on the grounds that this will allow it to figure out the benefits to those that do get them.) An exclusive reliance on randomization for identifying impacts will likely create a bias in our knowledge in favor of the settings and types of interventions for which randomization is feasible; we will know nothing about a wide range of development interventions for which randomization is not an option. (I discuss this bias for inferences about development impact further in “Should the Randomistas Rule?”.) Given that evaluations are supposed to fill our knowledge gaps, this must be a concern even for those who think that consequences trump concerns about processes.

If evaluators take ethical validity seriously there will be implications for RCTs. Some RCTs may have to be ruled out as simply unacceptable. For example, I surely cannot be the only person who is troubled on ethical grounds by the (innovative) study done in Delhi, India, by Marianne Bertrand et al. that randomized an encouragement to obtain a driver’s license quickly, on the explicit presumption that this would entail the payment of a bribe to obtain a license without knowing how to drive. (This study was conducted and funded by the World Bank’s International Finance Corporation. And it was published in a prestigious economics journal.) The study confirmed that the process of testing and licensing was not working well even for the control group. But the RCT put even more drivers on Delhi roads who did not know how to drive, adding to the risk of accidents. The gain from doing so was a clean verification of the claim that corruption is possible in India and has real effects, though I was not aware of any prior doubt about the truth of that claim.

There may well be design changes to many RCTs that could assure their ethical validity, as judged by review boards. One might randomly withhold the option of treatment for some period of time, after which it would become available, but this would need to be known by all in advance, and one might reasonably argue that some form of compensation would be justified by the delay. Adaptive randomizations are getting serious attention in biomedical research; for example, one might adapt the assignment to treatment of new arrivals along the way, in the light of evidence collected on covariates of impact. (The U.S. Food and Drug Administration issued guidelines a few years ago.)
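The delayed-treatment idea can be sketched as a randomized phase-in: every unit eventually receives the program, and only the timing is decided by lottery. This is an illustrative sketch only; the function name, the use of villages as units, and the two-wave rollout are assumptions made for the example, not a description of any particular study.

```python
import random


def phase_in_schedule(unit_ids, n_waves, seed=0):
    """Randomly order units, then split them into rollout waves.

    Every unit is treated in some wave; units in later waves serve as
    the comparison group until their own wave begins.
    """
    rng = random.Random(seed)  # fixed seed so the schedule can be published
    order = list(unit_ids)
    rng.shuffle(order)
    n = len(order)
    # Position i in the shuffled order maps to wave 1..n_waves,
    # with waves as equal in size as integer division allows.
    return {unit: 1 + (i * n_waves) // n for i, unit in enumerate(order)}


# Hypothetical example: 10 villages rolled out in 2 waves.
schedule = phase_in_schedule(range(10), n_waves=2, seed=1)
```

Because the schedule is fixed by the seed before rollout, it can be announced to all participants in advance, which is the condition the text places on this design.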

The experiment might not then be as clean as in the classic RCT—the prized internal validity of the RCT in large samples may be compromised. But if that is always judged to be too high a price then the evaluator is probably not taking ethical validity seriously.

Martin Ravallion

(First posted on the World Bank’s Development Impact blog.)
