One basic standard of economic research is surely that someone else should be able to reproduce what you have done. They don't have to agree with what you've done. They may think your data is terrible and your methodology is worse. But as a minimal standard, they should be able to reproduce your result, so that the follow-up research can then be in a position to think about what might have been done differently or better. This standard may seem obvious, but during the last 30 years or so, the methods for reproducibility have been transformed.
Lars Vilhuber describes the shift in "Reproducibility and Replicability in Economics" in the Harvard Data Science Review (Fall 2020 issue, published December 21, 2020). Vilhuber is the Data Editor for the journals published by the American Economic Association (including the Journal of Economic Perspectives, where I work as Managing Editor). Thus, he heads the group which oversees posting of data and code for new empirical results in AEA journals, including making sure that an outsider can use the data and code to reproduce the actual results reported in the paper.
To jump to the bottom line, Vilhuber writes: "Still, after 30 years, the results of reproducibility studies consistently show problems with about a third of reproduction attempts, and the increasing share of restricted-access data in economic research requires new tools, procedures, and methods to enable greater visibility into the reproducibility of such studies."
Thus, one change in recent years is what are called "restricted-access data environments," where accredited researchers can get access to detailed data, but in ways that protect individual privacy. For example, there are now 30 Federal Statistical Research Data Centers around the country, mostly located close to big universities. Vilhuber writes (citations omitted):
It is worth pointing out the increase in the past 2 decades of formal restricted-access data environments (RADEs), sponsored or funded by national statistical offices and funding agencies. RADE networks, with formal, nondiscriminatory, albeit often lengthy access protocols, have been set up in the United States (FSRDC), France, and many other countries. Often, these networks have been initiated by economists, though widespread use is made by other social scientists and in some cases health researchers. RADE are less common for private-sector data, although several initiatives have made progress and are frequently used by researchers: Institute for Research on Innovation and Science, Health Care Cost Institute, Private Capital Research Institute (PCRI). When such nondiscriminatory agreements are implemented at scale, a significant number of researchers can obtain access to these data under strict security protocols. As of 2018, the FSRDC hosted more than 750 researchers on over 300 projects, of which 140 had started within the last 12 months. The IAB FDZ [a source of German employment data] lists over 500 projects active as of September 2019, most with multiple authors. In these and other networks, many researchers share access to the same data sets, and could potentially conduct reproducibility studies. Typically, access is via a network of secure rooms (FSRDC, Canada, Germany), but in some cases, remote access via 'thin clients' (France) or virtual desktop infrastructure (some Scandinavian countries, data from the Economic Research Service of the United States Department of Agriculture [USDA] via NORC) is allowed.
A common situation is that this kind of data often cannot be put into the public domain; instead, you would need to apply to gain access to the "restricted-access data environment," and access the data in that way.
Another issue is that in some of these data sources, researchers are not given access to all of the data; instead, to protect privacy, they are given an extract of the overall data. As a result, two researchers who go to the data center and make the same data request will not get the same data. The overall patterns in the data should be pretty close, if random samples are used, but they won't be the same. Vilhuber writes:
Some widely used data sets are accessible by any researcher, but the license they are subject to prevents their redistribution and thus their inclusion as part of data deposits. This includes nonconfidential data sets from the Health and Retirement Study (HRS) and the Panel Study of Income Dynamics (PSID) at the University of Michigan and data provided by IPUMS at the Minnesota Population Center. All of these data can be freely downloaded, subject to agreement to a license. IPUMS lists 963 publications for 2015 alone that use one of its data sources. The typical user will create a custom extract of the PSID and IPUMS databases through a data query system, not download specific data sets. Thus, each extract is essentially unique. Yet that same extract cannot be redistributed, or deposited at a journal or any other archive. [Footnote: For IPUMS, extracts from population samples (e.g., the 5% sample of the U.S. population census) rather than full population censuses (the 100% file) can be provided to journals for the purpose of replication.] In 2018, the PSID, in collaboration with ICPSR, has addressed this issue with the PSID Repository, which allows researchers to deposit their custom extracts in full compliance with the PSID Conditions of Use.
Yet another issue arises with data from commercial sources, which often require a fee to access:
Commercial (‘proprietary’) data is typically subject to licenses that also prohibit redistribution. Larger companies may have data provision as part of their service, but providing it to academic researchers is only a small part of the overall business. Dun and Bradstreet’s Compustat, Bureau van Dijk’s Orbis, Nielsen Scanner data via the Kilts Center at Chicago Booth (Kilts Center, n.d.), or Twitter data are all used frequently by economists and other social scientists. But providing robust and curated archives of data as used by clients over 5 or more years is typically not part of their service.
Research using social media data can also pose particular problems for someone who wants to reproduce a study using the same underlying data.
Finally, there is the problem of "cleaning" data. "Raw" data always has errors. Sometimes data isn't filled in. Other times it may show a nonsensical finding, like someone having a negative level of income in a year, or an entry where it looks as if several zeros were added to a number by accident. Thus, the data needs to be "cleaned" before it's used. For well-known data sets, there are archives of documentation for how the data has been cleaned, and why. But for lots of data, the documentation for how it has been cleaned isn't available. Vilhuber writes:
While in theory, researchers are able to at least informally describe the data extraction and cleaning processes when run on third-party–controlled systems that are typical of big data, in practice, this does not happen. An informal analysis of various Twitter-related economics articles shows very little or no description of the data extraction and cleaning process. The problem, however, is not unique to big-data articles—most articles provide little if any input data cleaning code in reproducibility archives, in large part because provision of the code that manipulates the input data is only suggested, but not required by most data deposit policies.
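To make the idea of documented cleaning concrete, here is a minimal sketch of what it might look like in practice. The field names, thresholds, and recoding rules below are illustrative assumptions, not drawn from Vilhuber's article or any real survey; the point is simply that each cleaning decision is recorded so that someone else could reproduce it.

```python
# Hypothetical raw survey records: names and thresholds are
# illustrative assumptions, not from any real data set.
raw = [
    {"person_id": 1, "income": 52000},
    {"person_id": 2, "income": -300},        # negative income: likely an error
    {"person_id": 3, "income": 480000000},   # implausibly large: extra zeros?
    {"person_id": 4, "income": None},        # not filled in
]

def clean_income(value, upper_bound=10_000_000):
    """Recode impossible values to None and return the reason,
    so every cleaning decision is documented and reproducible."""
    if value is None:
        return None, "missing in raw data"
    if value < 0:
        return None, "negative income recoded to missing"
    if value > upper_bound:
        return None, "implausibly large income recoded to missing"
    return value, None

cleaned, log = [], []
for row in raw:
    value, reason = clean_income(row["income"])
    cleaned.append({**row, "income": value})
    if reason:
        log.append(f"person {row['person_id']}: {reason}")

print(cleaned)
print(log)
```

The log of recoding decisions is the key step: without it, two researchers starting from the same raw file can easily end up with different analysis samples and different results.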
As a final thought, I'll point out that academic researchers have mixed incentives when it comes to data. They always want access to new data, because new data is often a reliable pathway to published papers that can build a reputation and a paycheck. They often want access to the data used by rival researchers, to understand and to critique their results. But making the details of their own data available doesn't necessarily help them much.
For example, imagine that you write a prominent academic paper, and all the data is widely available. The chances are good that for years to come, your paper will become target practice for economics students and younger faculty members, who want to critique your work and make you justify all the choices you made in the research. And you may have a reasonable dislike of spending large chunks of the rest of your career going over the same ground, again and again.
From this standpoint, it's perhaps not surprising that while many leading journals of economics now do require that authors publish their computer code and as much of their data as they are allowed to share, the number of papers granted "exceptions" to publishing their data is rising. Moreover, authors are not required to supply data and computer code when submitting a paper, nor while the publication decision is being made (although professors refereeing the paper can request to see the data and code, if they wish).
It's also maybe not a surprise that a study of one prominent journal, covering papers published from 2009 to 2013, found that among the papers whose data was not posted online, the data was reasonably straightforward for others to obtain in only about one-third of cases.
And it's also maybe not a surprise that more and more papers are published with data that can be accessed only by accredited researchers through a restricted-access data center, which presents hurdles for those not well-connected in the research community.
Access to the data and computer code behind economic research has improved, and improved a lot, since the pre-internet age. But in many cases, it's still far from easy.