Statistics Surveys

Pitfalls of significance testing and $p$-value variability: An econometrics perspective

Norbert Hirschauer, Sven Grüner, Oliver Mußhoff, and Claudia Becker

Full-text: Open access

Abstract

Data on how many scientific findings are reproducible paint a generally bleak picture, and a wealth of papers in recent years have warned against misuses of the $p$-value and the resulting false findings. This paper discusses the question of what we can (and cannot) learn from the $p$-value, which is still widely considered the gold standard of statistical validity. We aim to provide a non-technical and easily accessible resource for statistical practitioners who wish to spot and avoid misinterpretations and misuses of statistical significance tests. For this purpose, we first classify and describe the most widely discussed (“classical”) pitfalls of significance testing, and review published work on these misuses with a focus on regression-based “confirmatory” studies. This includes a description of the single-study bias and a simulation-based illustration of how proper meta-analysis compares to misleading significance counts (“vote counting”). Going beyond the classical pitfalls, we also use simulation to provide intuition that relying on the statistical estimate “$p$-value” as a measure of evidence, without considering its sample-to-sample variability, falls short of the mark even within an otherwise appropriate interpretation. We conclude with a discussion of the exigencies of informed approaches to statistical inference and corresponding institutional reforms.
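The sample-to-sample variability of the $p$-value that the abstract refers to is easy to reproduce. The sketch below is not the authors' simulation design; it is a minimal illustration under assumed settings (a two-sample $t$-test, a true standardized mean difference of 0.5, and $n=30$ per group), showing that replications of the identical experiment produce wildly different $p$-values.

```python
# Hypothetical illustration (not the authors' code) of p-value
# sample-to-sample variability: repeat the same experiment many times
# and record the p-value each replication would report.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, true_effect, reps = 30, 0.5, 1000  # assumed settings, not from the paper

p_values = []
for _ in range(reps):
    control = rng.normal(0.0, 1.0, n)           # control group, mean 0
    treated = rng.normal(true_effect, 1.0, n)   # treated group, mean 0.5
    p_values.append(stats.ttest_ind(treated, control).pvalue)
p_values = np.array(p_values)

# Despite an identical true effect in every replication, the p-values
# scatter from "highly significant" to "clearly non-significant":
print(f"share significant at 5%: {np.mean(p_values < 0.05):.2f}")
print(f"p-value range across replications: "
      f"{p_values.min():.4f} to {p_values.max():.2f}")
```

Under these settings roughly half of the replications cross the 5% threshold, so a naive significance count ("vote counting") across such studies would suggest conflicting evidence even though every study estimates the same effect.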

Article information

Source
Statist. Surv., Volume 12 (2018), 136-172.

Dates
Received: November 2017
First available in Project Euclid: 4 October 2018

Permanent link to this document
https://projecteuclid.org/euclid.ssu/1538618436

Digital Object Identifier
doi:10.1214/18-SS122

Mathematical Reviews number (MathSciNet)
MR3860867

Zentralblatt MATH identifier
06976331

Keywords
Meta-analysis; multiple testing; $p$-hacking; publication bias; $p$-value misinterpretations; $p$-value sample-to-sample variability; statistical inference; statistical significance

Rights
Creative Commons Attribution 4.0 International License.

Citation

Hirschauer, Norbert; Grüner, Sven; Mußhoff, Oliver; Becker, Claudia. Pitfalls of significance testing and $p$-value variability: An econometrics perspective. Statist. Surv. 12 (2018), 136–172. doi:10.1214/18-SS122. https://projecteuclid.org/euclid.ssu/1538618436

References

  • Altman, N., Krzywinski, M. (2017): Points of significance: P values and the search for significance. Nature Methods 14(1): 3–4.
  • Amrhein, V., Korner-Nievergelt, F., Roth, T. (2017): The earth is flat ($p>0.05$): significance thresholds and the crisis of unreplicable research. PeerJ, doi: 10.7717/peerj.3544.
  • Armstrong, J.S. (2007): Significance tests harm progress in forecasting. International Journal of Forecasting 23(2): 321–327.
  • Auspurg, K., Hinz, T. (2011): What Fuels Publication Bias? Theoretical and Empirical Analyses of Risk Factors Using the Caliper Test. Journal of Economics and Statistics 231(5-6): 636–660.
  • Baker, M. (2016): Statisticians issue warning on $P$ values. Nature 531(7593): 151.
  • Becker, B.J., Wu, M-J. (2007): The Synthesis of Regression Slopes in Meta-Analysis. Statistical Science 22(3): 414–429.
  • Benjamini, Y., Hochberg, Y. (1995): Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57(1): 289–300.
  • Bennett, D.A., Latham, N.K., Stretton, C., Anderson, C.S. (2004): Capture-recapture is a potentially useful method for assessing publication bias. Journal of Clinical Epidemiology 57(4): 349–357.
  • Berning, C., Weiß, B. (2016): Publication Bias in the German Social Sciences: An Application of the Caliper Test to Three Top-Tier German Social Science Journals. Quality & Quantity 50(2): 901–917.
  • Berry, D.A. (2016): P-Values Are Not What They’re Cracked Up to Be. Online Discussion: ASA Statement on Statistical Significance and P-values. The American Statistician 70(2): 1–2.
  • Berry, D. (2017): A p-Value to Die For. Journal of the American Statistical Association 112(519): 895–897.
  • Boos, D.D., Stefanski, L.A. (2011): P-Value Precision and Reproducibility. The American Statistician 65(4): 213–221.
  • Borenstein, M., Hedges, L.V., Higgins, J.P.T., Rothstein, H.R. (2009): Introduction to Meta-Analysis. Chichester: John Wiley & Sons.
  • Bretz, F., Hothorn, T., Westfall, P. (2010): Multiple comparisons using R. Boca Raton: CRC Press.
  • Brodeur, A., Lé, M., Sangnier, M., Zylberberg, Y. (2016): Star Wars: The Empirics Strike Back. American Economic Journal: Applied Economics 8(1): 1–32.
  • Bruns, S.B. (2017): Meta-Regression Models and Observational Research. Oxford Bulletin of Economics and Statistics, doi: 10.1111/obes.12172.
  • Card, D., Krueger, A. B. (1995): Time-series minimum-wage studies: A meta-analysis. American Economic Review (AEA Papers and Proceedings) 85: 238–243.
  • Card, N. A. (2012): Applied meta-analysis for social science research. New York: Guilford Press.
  • Cohen, J. (1994): The earth is round ($p<0.05$). American Psychologist 49(12): 997–1003.
  • Cooper, D.J., Dutcher, E.G. (2011): The dynamics of responder behavior in ultimatum games: a meta-study. Experimental Economics 14(4): 519–546.
  • Cooper, H., Hedges, L., Valentine, J. (eds.) (2009): The handbook of research synthesis and meta-analysis. 2nd ed., Russell Sage Foundation, New York.
  • Crouch, G.I. (1995): A meta-analysis of tourism demand. Annals of Tourism Research 22(1): 103–118.
  • Cumming, G. (2008): Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science 3(4): 286–300.
  • Denton, F.T. (1985): Data Mining as an Industry. Review of Economics and Statistics 67(1): 124–127.
  • Denton, F.T. (1988): The significance of significance: Rhetorical aspects of statistical hypothesis testing in economics. In: Klamer, A., McCloskey, D.N., Solow, R.M. (eds.): The consequences of economic rhetoric. Cambridge: Cambridge University Press: 163–193.
  • Didelez, V., Pigeot, I., Walter, P. (2006): Modifications of the Bonferroni-Holm procedure for a multi-way ANOVA. Statistical Papers 47: 181–209.
  • Duvendack, M., Palmer-Jones, R., Reed, W.R. (2015): Replications in Economics: A Progress Report. Econ Journal Watch 12(2): 164–191.
  • Duvendack, M., Palmer-Jones, R., Reed, W.R. (2017): What Is Meant by “Replication” and Why Does It Encounter Resistance in Economics? American Economic Review: Papers & Proceedings 2017: 107(5): 46–51.
  • Egger, M., Smith, G.D., Schneider, M., Minder, C. (1997): Bias in meta-analysis detected by a simple, graphical test. British Medical Journal 315 (7109): 629–634.
  • Engel, C. (2011): Dictator games: a meta study. Experimental Economics 14(4): 583–610.
  • Evanschitzky, H., Armstrong, J.S. (2010): Replications of forecasting research. International Journal of Forecasting 26: 4–8.
  • Fanelli, D. (2010): “Positive” results increase down the hierarchy of the sciences. PLoS One 5(4): e10068.
  • Fanelli, D. (2011): Negative results are disappearing from most disciplines and countries. Scientometrics 90(3): 891–904.
  • Fisher, R.A. (1925): Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
  • Fisher, R.A. (1935): The design of experiments. Edinburgh: Oliver & Boyd.
  • Fitzpatrick, L., Parmeter, C.F., Agar, J. (2017): Threshold Effects in Meta-Analyses With Application to Benefit Transfer for Coral Reef Valuation. Ecological Economics 133: 74–85.
  • Gelman, A., Carlin, J. (2017): Some natural solutions to the p-value communication problem-and why they won’t work. Blogsite: Statistical Modeling, Causal Inference, and Social Science.
  • Gerber, A. S., N. Malhotra (2008): Publication Bias in Empirical Sociological Research. Do Arbitrary Significance Levels Distort Published Results? Sociological Methods & Research 37(1): 3–30.
  • Gerber, A.S., Malhotra, N., Dowling, C.M., Doherty, D. (2010): Publication Bias in Two Political Behavior Literatures. American Politics Research 38(4): 591–613.
  • Gigerenzer, G., Krauss, S., Vitouch, O. (2004): The null ritual: what you always wanted to know about significance testing but were afraid to ask. In: Kaplan, D. (ed.): The SAGE handbook of quantitative methodology for the social sciences (Chapter 21). Thousand Oaks: Sage.
  • Gigerenzer, G., Marewski, J.N. (2015): Surrogate Science: The Idol of a Universal Method for Statistical Inference. Bayesian Probability and Statistics in Management Research, Special Issue of the Journal of Management 41(2): 421–440.
  • Goodman, S. (2008): A dirty dozen: Twelve $p$-value Misconceptions. Seminars in Hematology 45: 135–140.
  • Goodman, S.N. (1992): A Comment on Replication, P-Values and Evidence. Statistics in Medicine 11: 875–879.
  • Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N., Altman, D.G. (2016): Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31(4): 337–350.
  • Greenland, S. (2017): Invited Commentary: the Need for Cognitive Science in Methodology. American Journal of Epidemiology 186(6): 639–645.
  • Haller, H., Krauss, S. (2002): Misinterpretations of Significance: A Problem Students Share with Their Teachers? Methods of Psychological Research Online 7(1): 1–20.
  • Halsey, L.G., Curran-Everett, D., Vowler, S.L., Drummond, B. (2015): The fickle P value generates irreproducible results. Nature Methods 12(3): 179–185.
  • Hartung, J., Knapp, G., Sinha, B.K. (2008): Statistical Meta-Analysis with Applications. Hoboken: John Wiley & Sons.
  • Head, M.L, Holman, L., Lanfear, R., Kahn, A.T., Jennions, M.D. (2015): The Extent and Consequences of P-Hacking in Science. PLoS Biology 13(3): e1002106, doi: 10.1371/journal.pbio.1002106.
  • Hirschauer, N., Mußhoff, O., Grüner, S., Frey, U., Theesfeld, I., Wagner, P. (2016): Inferential misconceptions and replication crisis. Journal of Epidemiology, Biostatistics, and Public Health 13(4): e12066-1–e12066-16.
  • Hochberg, Y., Tamhane, A.C. (1987). Multiple comparison procedures. New York: Wiley.
  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2): 65–70.
  • Howard, G.S., Maxwell, S.E., Fleming, K.J. (2000): The proof of the pudding: An illustration of the relative strengths of null hypothesis, meta-analysis, and Bayesian analysis. Psychological Methods 5: 315–332.
  • Ioannidis, J., Doucouliagos, C. (2013): What’s to know about the credibility of empirical economics? Journal of Economic Surveys 27(5): 997–1004.
  • Ioannidis, J.P.A. (2005): Why Most Published Research Findings are False. PLoS Medicine 2(8): e124, 0696–0701.
  • Joober, R., Schmitz, N., Dipstat, L.A., Boksa, P. (2012): Publication bias: What are the challenges and can they be overcome? Journal of Psychiatry & Neuroscience 37(3): 149–152.
  • Kerr, N.L. (1998): HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review 2(3): 196–217.
  • Kicinski, M, Springate, D.A., Kontopantelis, E. (2015): Publication bias in meta-analyses from the Cochrane Database of Systematic Reviews. Statistics in Medicine 34: 2781–2793.
  • Kline, R.B. (2013): Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. Washington: American Psychological Association.
  • Krämer, W. (2011): The Cult of Statistical Significance – What Economists Should and Should Not Do to Make their Data Talk. Schmollers Jahrbuch 131(3): 455–468.
  • Lange, T. (2016): Discrimination in the laboratory: A meta-analysis of economics experiments. European Economic Review 90: 375–402.
  • Leamer, E.E. (1978): Specification Searches: Ad Hoc Inference with Nonexperimental Data. New York: Wiley.
  • Lecoutre, B., Poitevineau, J. (2014): The Significance Test Controversy Revisited. The Fiducial Bayesian Alternative. Heidelberg: Springer.
  • Light, R.J., Pillemer, D.B. (1984): Summing Up: The Science of Reviewing Research. Cambridge: Harvard University Press.
  • List, J.A., Shaikh, A.M., Xu, Y. (2016): Multiple Hypothesis Testing in Experimental Economics. National Bureau of Economic Research, Working Paper No. 21875.
  • Loomis, J.B., White, D.S. (1996): Economic benefits of rare and endangered species: summary and meta-analysis. Ecological Economics 18(3): 197–206.
  • Lovell, M.C. (1983): Data Mining. Review of Economics and Statistics 65(1): 1–12.
  • McCloskey, D.N., Ziliak, S.T. (1996): The Standard Error of Regressions. Journal of Economic Literature 34(1): 97–114.
  • McShane, B., Gal, D., Gelman, A., Robert, C., Tackett, J.L. (2017): Abandon Statistical Significance. http://www.stat.columbia.edu/~gelman/research/unpublished/abandon.pdf
  • Motulsky, H.J. (2014): Common Misconceptions about Data Analysis and Statistics. The Journal of Pharmacology and Experimental Therapeutics 351(8): 200–205.
  • Munafò, M.R., Nosek, B.A., Bishop, D.V.M., Button, K.S., Chambers, C.D., du Sert, N.P., Simonsohn, U., Wagenmakers, E-J., Ware, J.J., Ioannidis, J.P.A. (2017): A manifesto for reproducible science. Nature Human Behaviour 1(0021): 1–8.
  • Nickerson, R.S. (2000): Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods 5(2): 241–301.
  • Nosek, B.A., Ebersole, C.R., DeHaven, A.C., Mellor, D.T. (2018): The preregistration revolution. Proceedings of the National Academy of Sciences of the United States of America 115(11): 2600–2606.
  • Nuzzo, R. (2014): Statistical Errors. $P$-values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume. Nature 506(7487): 150–152.
  • Oakes, M. (1986): Statistical inference: A commentary for the social and behavioural sciences. New York: Wiley.
  • Pigeot, I. (2000): Basic concepts of multiple tests – A survey. Invited paper. Statistical Papers 41: 3–36.
  • Pitchforth, J.O., Mengersen, K.L. (2013): Bayesian Meta-Analysis. In: Alston, C.L., Mengersen, K.L., Pettitt, A.N. (eds.): Case Studies in Bayesian Statistical Modelling and Analysis. Chichester: John Wiley & Sons, Ltd.: 118–140.
  • Poorolajal, J., Haghdoost, A.A., Mahmoodi, M., Majdzadeh, R., Nasseri-Moghaddam, S., Fotouhi, A. (2010): Capture-recapture method for assessing publication bias. Journal of Research in Medical Sciences: The Official Journal of Isfahan University of Medical Sciences 15(2): 107–115.
  • Roberts, C.J. (2005): Issues in meta-regression analysis: An overview. Journal of Economic Surveys 19(3): 295–298.
  • Romano, J.P., Shaikh, A.M., Wolf, M. (2010): Multiple Testing. In: Palgrave Macmillan (eds.) The New Palgrave Dictionary of Economics. London: Palgrave Macmillan, doi: 10.1057/978-1-349-95121-5_2914-1.
  • Rosenberg, M.S. (2005): The File-drawer Problem Revisited: A General Weighted Method for Calculating Fail-Safe Numbers in Meta-Analysis. Evolution 59(2): 464–468.
  • Rosenthal, R. (1979): The file drawer problem and tolerance for null results. Psychological Bulletin 86(3): 638–641.
  • Rothstein, H., Sutton, A.J., Borenstein, M. (2005): Publication Bias in Meta-Analysis. Prevention, Assessment and Adjustments. Sussex: Wiley.
  • Schmidt, F.L., Hunter, J.E. (2014): Methods of meta-analysis: Correcting error and bias in research findings. Los Angeles: Sage Publications.
  • Silliman, N. (1997): Hierarchical selection models with applications in meta-analysis. Journal of the American Statistical Association 92(439): 926–936.
  • Simmons, J.P., Nelson, L.D., Simonsohn U. (2011): False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science 22(11): 1359–1366.
  • Simonsohn, U., Nelson, L.D., Simmons, J.P. (2014): $P$-Curve: A Key to the File-Drawer. Journal of Experimental Psychology 143(2): 534–547.
  • Smith, M.L. (1980): Publication bias and meta-analysis. Evaluation in Education 4: 22–24.
  • Song, F., Eastwood, A.J., Gilbody, S., Duley, L., Sutton, A.J. (2000): Publication and related biases. Southampton: The National Coordinating Centre for Health Technology Assessment.
  • Song, F., Hooper, L., Loke, Y.K. (2013): Publication bias: what is it? How do we measure it? How do we avoid it? Open Access Journal of Clinical Trials 5: 71–81.
  • Stanley, T.D., Jarrell, S. B. (1989): Meta-regression analysis: A quantitative method of literature surveys. Journal of Economic Surveys 3(2): 161–170.
  • Stanley, T.D., Doucouliagos, H. (2012): Meta-Regression Analysis in Economics and Business. London: Routledge.
  • Sterling, T.D. (1959): Publication Decisions and their Possible Effects on Inferences Drawn from Tests of Significance – Or Vice Versa. Journal of the American Statistical Association 54(285): 30–34.
  • Sterne, J.A.C., Egger, M. (2005): Regression Methods to Detect Publication and Other Bias in Meta-Analysis. In: Rothstein, H.R., Sutton, A.J., Borenstein, M. (eds.): Publication Bias in Meta-Analysis. Prevention, Assessment and Adjustments. Chichester: Wiley: 99–110.
  • Sterne, J.A.C., Egger, M., Moher, D. (2008): Addressing reporting biases. In: Higgins, J.P.T., Green, S. (eds.): Cochrane handbook for systematic reviews of interventions: 297–333. Chichester: Wiley.
  • Trafimow, D. et al. (2018): Manipulating the alpha level cannot cure significance testing. Frontiers in Psychology 9: 699, doi: 10.3389/fpsyg.2018.00699.
  • Van Houtven, G.L., Pattanayak, S.K., Usmani, F., Yang, J.C. (2017): What are Households Willing to Pay for Improved Water Access? Results from a Meta-Analysis. Ecological Economics 136: 126–135.
  • Vogt, W.P., Vogt, E.R., Gardner, D.C., Haeffele, L.M. (2014): Selecting the right analyses for your data: quantitative, qualitative, and mixed methods. New York: The Guilford Press.
  • Wasserstein, R.L., Lazar N.A. (2016): The ASA’s statement on p-values: context, process, and purpose, The American Statistician 70(2): 129–133.
  • Weiß, B., Wagner, M. (2011): The identification and prevention of publication bias in the social sciences and economics. Jahrbücher für Nationalökonomie und Statistik 231(5-6): 661–684.
  • Westfall, P., Tobias, R., Wolfinger, R. (2011): Multiple comparisons and multiple testing using SAS. Cary: SAS Institute.
  • Zelmer, J. (2003): Linear public goods experiments: A meta-analysis. Experimental Economics 6(3): 299–310.
  • Ziliak, S.T., McCloskey, D.N. (2008): The Cult of Statistical Significance. How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: The University of Michigan Press.
  • Zyphur, M.J., Oswald, F.L. (2015): Bayesian Estimation and Inference: A User’s Guide. Bayesian Probability and Statistics in Management Research, Special Issue of the Journal of Management 41(2): 390–420.