The measurement of a quantity is reproducible when multiple, mutually independent measurements of it yield mutually consistent results, that is, when the measured values, after due allowance for their associated uncertainties, do not differ significantly from one another. Interlaboratory comparisons organized deliberately for the purpose, and meta-analyses structured so as to be fit for the same purpose, are the procedures of choice for ascertaining measurement reproducibility.
The realistic evaluation of measurement uncertainty is a key preliminary to the assessment of reproducibility because lack of reproducibility manifests itself as dispersion or variability of measured values in excess of what their associated uncertainties suggest that they should exhibit. For this reason, we review the distinctive traits of measurement in the physical sciences and technologies, including medicine, and discuss the meaning and expression of measurement uncertainty.
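One common way to make "dispersion in excess of what the uncertainties suggest" concrete is Cochran's Q statistic together with the Birge ratio. The sketch below is only an illustration of that general idea, not necessarily the procedure used in this contribution; the function name `consistency_check` and the numerical values are hypothetical, and the calculation assumes independent measurements with reliably evaluated standard uncertainties.

```python
import math


def consistency_check(values, uncertainties):
    """Return (weighted mean, Cochran's Q, Birge ratio) for independent results.

    Assumes each value comes with a trustworthy standard uncertainty.
    """
    # Inverse-variance weights: more precise results count more.
    weights = [1.0 / u**2 for u in uncertainties]
    wmean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    # Q sums the squared deviations from the weighted mean,
    # each scaled by the corresponding squared uncertainty.
    q = sum(w * (x - wmean) ** 2 for w, x in zip(weights, values))
    # For mutually consistent Gaussian results, Q is expected to be
    # close to n - 1, so the Birge ratio is expected to be close to 1.
    birge = math.sqrt(q / (len(values) - 1))
    return wmean, q, birge


# Hypothetical measured values and standard uncertainties (illustration only).
values = [10.1, 9.8, 10.4, 11.9]
uncertainties = [0.2, 0.3, 0.25, 0.3]
wmean, q, birge = consistency_check(values, uncertainties)
print(f"weighted mean = {wmean:.3f}, Q = {q:.2f}, Birge ratio = {birge:.2f}")
```

A Birge ratio well above 1 suggests that the measured values are more dispersed than their stated uncertainties can explain, which is the signature of lack of reproducibility described above.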
This contribution illustrates the application of statistical models and methods to quantify measurement uncertainty and to assess reproducibility in four concrete, real-life examples, in the process revealing that lack of reproducibility can be a consequence of one or more of the following: intrinsic differences between the laboratories making the measurements; the choice of statistical model and of procedure for data reduction; or causes yet to be identified.
Despite the instances of lack of reproducibility that we review, and many others like them, the outlook is optimistic. First, because “lack of reproducibility is not necessarily bad news; it may herald new discoveries and signal scientific progress” (Nat. Phys. 16 (2020) 117–119). Second, and as the example about the measurement of the Newtonian constant of gravitation, G, illustrates, when faced with a reproducibility crisis the scientific community often engages in cooperative efforts to understand the root causes of the lack of reproducibility, leading to advances in scientific knowledge.
The author is immensely grateful to Stefan Schlamminger (NIST) for all that he has taught him over the years about the measurement of G. The author is also much indebted to Olha Bodnar (Örebro University, Sweden), David Newton (NIST) and Mikela Waldman (NIST and Georgetown University, Washington, DC) for their most valuable and extensive suggestions for improvement of a draft of this contribution. The author thanks David Woods (Univ. of Southampton, UK) for an exchange of emails about the measurement of the reproduction number of COVID-19 in the United Kingdom.
The author thanks the organizers of the special issue of Statistical Science dedicated to the issue of reproducibility for the invitation to contribute to it, and acknowledges the very helpful criticism and guidance that the guest editors, the journal’s editor and a referee provided throughout the revision process, which led to considerable improvements.
Some specific commercial entities, equipment or materials may be identified in this document in order to describe or illustrate an experimental or statistical procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor is it intended to imply that the entities, equipment or materials mentioned are necessarily the best available for the purpose.
"Tracking Truth Through Measurement and the Spyglass of Statistics." Statist. Sci. 38 (4) 655–671, November 2023. https://doi.org/10.1214/23-STS899