Electronic Journal of Statistics

Categorizing a continuous predictor subject to measurement error

Betsabé G. Blas Achic, Tianying Wang, Ya Su, Victor Kipnis, Kevin Dodd, and Raymond J. Carroll

Full-text: Open access


Epidemiologists often categorize a continuous risk predictor, even when the true risk model is not a categorical one. Nonetheless, such categorization is thought to be more robust and interpretable, and thus their goal is to fit the categorical model and interpret the categorical parameters. We address the question: with measurement error and categorization, how can we do what epidemiologists want, namely to estimate the parameters of the categorical model that would have been estimated if the true predictor were observed? We develop a general methodology for such an analysis, and illustrate it in linear and logistic regression. Simulation studies are presented and the methodology is applied to a nutrition data set. Discussion of alternative approaches is also included.
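The problem the abstract poses can be illustrated with a small simulation. The sketch below is not the authors' methodology; it only shows why a correction is needed. Everything in it is an assumption made for illustration: a linear outcome model, a standard normal true predictor, additive normal measurement error with the same variance, quartile cutpoints, and the helper name `category_means`. The target is the set of category means obtained by categorizing the true predictor; the naive fit categorizes the error-prone version instead.

```python
import numpy as np


def category_means(y, x, cutpoints):
    """Fit y on category indicators of x by least squares.

    With one indicator per category and no intercept, the least-squares
    coefficients are simply the mean of y within each category.
    """
    labels = np.digitize(x, cutpoints)  # category index 0..len(cutpoints)
    return np.array([y[labels == k].mean() for k in range(len(cutpoints) + 1)])


rng = np.random.default_rng(0)
n = 200_000

# Assumed data-generating process (illustrative only).
x = rng.normal(size=n)                    # true continuous predictor
w = x + rng.normal(scale=1.0, size=n)     # error-prone measurement of x
y = 1.0 + 2.0 * x + rng.normal(size=n)    # true risk model is continuous in x

probs = [0.25, 0.50, 0.75]

# Target: categorical parameters that would be estimated if x were observed.
target = category_means(y, x, np.quantile(x, probs))

# Naive: categorize the error-prone w at its own quartiles.
naive = category_means(y, w, np.quantile(w, probs))

# Contrast between the highest and lowest categories.
target_contrast = target[-1] - target[0]
naive_contrast = naive[-1] - naive[0]

print("target category means:", np.round(target, 2))
print("naive category means: ", np.round(naive, 2))
print("top-vs-bottom contrast, target vs naive:",
      round(target_contrast, 2), round(naive_contrast, 2))
```

Under these assumptions the naive top-versus-bottom contrast is noticeably smaller than the target contrast, because categories formed from the error-prone measurement mix subjects whose true predictor values belong to different categories.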

Article information

Electron. J. Statist., Volume 12, Number 2 (2018), 4032-4056.

Received: January 2018
First available in Project Euclid: 11 December 2018

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1544518836

Digital Object Identifier
doi:10.1214/18-EJS1489


Keywords: Categorization; differential misclassification; epidemiology practice; inverse problems; measurement error

Creative Commons Attribution 4.0 International License.


Blas Achic, Betsabé G.; Wang, Tianying; Su, Ya; Kipnis, Victor; Dodd, Kevin; Carroll, Raymond J. Categorizing a continuous predictor subject to measurement error. Electron. J. Statist. 12 (2018), no. 2, 4032--4056. doi:10.1214/18-EJS1489. https://projecteuclid.org/euclid.ejs/1544518836


