Megan T. Stevenson is an active researcher in the criminal-justice-and-economics literature. She has also noted a disconcerting fact: among published studies that use randomized control trial methods to evaluate ways of reducing crime, most don’t show a meaningful effect, and when a study does show one, the effect often isn’t replicated in follow-up studies. She mulls over this finding in “Cause, Effect, and the Structure of the Social World” (forthcoming in the Boston University Law Review when they get around to finalizing the later issues of 2023, pp. 2001-2027, but already available at the Review’s website).

(For those not familiar with the idea of a “randomized control trial,” the basic idea is that a group of people is randomly divided: some get access to the program or intervention, or are treated in a certain way, while others do not. Because the group was divided at random–and a researcher can check in various ways whether the division appears to be random–the outcomes of the treated and untreated groups can then be compared. The method is of course similar to a drug trial, in which a group is randomly divided and some receive the medication while others get a placebo. This approach is sometimes called a “gold standard” methodology, because it’s straightforward and persuasive. But of course, no method is infallible. One can always ask questions like: “Was it really random?” “Was some charismatic person involved in the treatment in a way that won’t carry over to future projects?” “Was the sample size big enough to draw a reliable result?” “Did the researcher study a bunch of treatments, on a number of groups, but then only publish the few results that looked statistically significant?”)
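To make that logic concrete, here is a minimal simulation sketch in Python. Everything in it is invented for illustration–the sample size, the outcome measure, and the assumed treatment effect have no connection to Stevenson’s data or any real program.

```python
# Minimal sketch of the randomized-control-trial logic described above.
# All numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

n = 500  # hypothetical number of participants
# Random assignment: each person is flipped into treatment or control.
treated = rng.random(n) < 0.5

# Hypothetical outcome (say, a count of re-arrests plus noise), with an
# assumed treatment effect of -0.2 built into the simulation.
baseline = rng.poisson(2.0, size=n).astype(float)
outcome = baseline - 0.2 * treated + rng.normal(0.0, 1.0, size=n)

# Because assignment was random, a simple difference in means estimates
# the causal effect of the intervention.
diff = outcome[treated].mean() - outcome[~treated].mean()
t_stat, p_value = stats.ttest_ind(outcome[treated], outcome[~treated])
print(f"estimated effect: {diff:.3f}, p-value: {p_value:.3f}")
```

With a small assumed effect and a sample of this size, the estimate will often fail to reach statistical significance, which is one flavor of the sample-size question above.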

As one example of the evidence on interventions to reduce crime, Stevenson writes (footnotes omitted):

In 2006, two criminologists published a survey article of every RCT over the previous fifty years in which: (1) there were at least 100 participants, (2) the study included a measure of offending as an outcome, and (3) the study was written in English. The authors uncovered 122 studies, evaluating interventions such as:

  • Counseling/therapy programs;
  • Criminal legal supervision, including intensive probation;
  • Scared-straight programs;
  • Work/job-training programs;
  • Drug testing, substance abuse counseling, and drug court;
  • Juvenile diversion;
  • Policing “hot spots”; and
  • Boot camps.

Note that these interventions include those associated with a tough-on-crime framework (e.g., scared-straight programs and boot camps) as well as those that provide support and resources (e.g., work/job training programs and counseling). Note further that inclusion in this analysis required that the study was written up and disseminated so it could be discovered by the survey authors—a filter that is likely to have eliminated many of the nonstatistically significant results already. Nonetheless, only 29 of the 122 studies (24%) found statistically significant impacts in the desired direction.
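The publication filter Stevenson mentions can be illustrated with a toy simulation–again invented numbers, not her analysis. Suppose many small trials are run on interventions with no true effect at all, and only the statistically significant ones get written up:

```python
# Toy illustration of publication bias: every simulated trial has a true
# effect of exactly zero, yet some clear the significance bar by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_trials, n_per_arm = 1000, 100  # hypothetical numbers

significant = 0
for _ in range(n_trials):
    treat = rng.normal(0.0, 1.0, n_per_arm)
    control = rng.normal(0.0, 1.0, n_per_arm)
    _, p = stats.ttest_ind(treat, control)
    if p < 0.05:
        significant += 1

print(f"{significant} of {n_trials} zero-effect trials looked 'significant'")
```

By construction, roughly 5% of these zero-effect trials clear the 0.05 bar by chance alone; if those are the ones most likely to be disseminated, the published record looks rosier than the underlying reality.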

Stevenson reviews a number of more recent studies as well. But the likelihood of successful results remains low, and worse, the chance that a successful result will not be replicated by a future study seems high.

As Stevenson points out, this finding is reminiscent of what Peter Rossi, several decades ago, called “The Iron Law of Evaluation: The expected value of any net impact assessment of any large scale social program is zero.” Here, I don’t want to quarrel over whether there might be a few strong counterexamples to Stevenson’s pessimistic evaluation. Instead, what does Stevenson suggest should be learned from this discouraging pattern of findings? I’d paraphrase her arguments this way.

While it’s an attractive idea that a relatively small treatment (say, a job-training program or “hot-spot” policing) will fundamentally alter an unpleasant outcome like crime, there are often underlying reasons why people make the decisions they do. Stevenson writes: “That doesn’t mean that human actions never have an impact, but rather that the type of discrete, limited scope interventions that are the primary domain of empirical causal inference research generally have limited or nonreplicable impact.”

The positive effects of some policies may be so obvious that they don’t get studied by a randomized trial. For example, feeding the hungry accomplishes a goal of feeding the hungry. One might study other possible effects of such a policy on crime or labor force participation or family dynamics, and that’s where the randomized control trial doesn’t reliably find positive effects. But the hungry did get fed. Stevenson writes:

There is an old cliché that if you give a man a fish, he will eat for a day; if you teach him how to fish, he will eat for a lifetime. Such sentiments form the basis of many of the interventions discussed in this study. These interventions, designed to give people the resources to thrive on their own, rarely have large or lasting impact. The cliché is wrong, at least when it comes to the limited-scope, systems-conserving interventions. However, there remains a straightforward and obvious way to ameliorate harm: simply give people what they need. If they are hungry, give them food. If they need shelter, give them a home. If they need work, give them a job.

The effects of certain policy choices may never get studied by a randomized control trial, because the policies are so sweeping. Perhaps changing people’s lives requires a group of policies sustained over a long period of time, and then evaluated after an even longer period. When people call for “systemic” change, they presumably have in mind a set of changes that can’t be captured by dividing up a group at random and treating one part of the group in a specific but limited way. But of course, systemic change can be very hard to evaluate in advance, and can have either good or bad outcomes.

Finally, Stevenson is asking the social science research community whether it is overemphasizing the “gold standard” method of randomized control trials, rather than perhaps seeking out evidence from real-world experience. Her sense is that researchers may tend to follow the randomized control trial methodology because they think it is more likely to result in published papers, rather than because it’s the best way to get a persuasive answer. To put it another way, persuasive evidence for a policy can come from a variety of methods, and randomized control trials are only one of those methods.

Stevenson’s paper made me think of a recent wave of research on some of the social programs implemented several decades ago. For example, the food stamp program was rolled out, county-by-county, over the period from 1961 to 1974. The order in which counties were selected was determined by practical and political considerations, but it can for analytical purposes be viewed as largely random (that is, no particular group was systematically overrepresented in being covered earlier by the food stamp program). This is sometimes called a “quasi-experiment,” referring to the idea that some families were randomly eligible for food stamps and others were not, even though that pattern wasn’t designed by anyone. However, a researcher can come along later and take advantage of the randomization. In this case, it turns out that children under the age of five in counties that got food stamps earlier had better long-term outcomes as adults, including improved health, higher earnings, and lower crime rates, among other effects.
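For readers who want to see the quasi-experimental comparison in miniature, here is a stylized sketch with entirely invented numbers–it is not the actual food stamp data, and the assumed benefit of early coverage is built in by hand:

```python
# Stylized sketch of a quasi-experiment: counties covered earlier vs. later
# by a program, compared on a later-life outcome. All numbers invented.
import numpy as np

rng = np.random.default_rng(seed=2)

n_counties = 200
# Rollout year is treated as as-good-as-random across counties.
rollout_year = rng.integers(1961, 1975, size=n_counties)  # 1961-1974
early = rollout_year <= 1967

# Hypothetical adult outcome (say, an earnings index), with an assumed
# +3-point benefit of early childhood coverage built into the simulation.
outcome = rng.normal(100.0, 10.0, size=n_counties) + 3.0 * early

# Because rollout timing is as-good-as-random, the early-vs-late difference
# estimates the program's long-run effect.
diff = outcome[early].mean() - outcome[~early].mean()
print(f"early-rollout counties outperform by {diff:.1f} points")
```

The actual studies are far more careful about controls and timing, but the core idea is the same: rollout variation that no researcher designed can stand in for random assignment.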