Annals of Statistics

Statistical inference for model parameters in stochastic gradient descent

Xi Chen, Jason D. Lee, Xin T. Tong, and Yichen Zhang

Abstract

The stochastic gradient descent (SGD) algorithm has been widely used in statistical estimation for large-scale data due to its computational and memory efficiency. While most existing works focus on the convergence of the objective function or the error of the obtained solution, we investigate the problem of statistical inference of true model parameters based on SGD when the population loss function is strongly convex and satisfies certain smoothness conditions.
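For orientation, here is a standard formulation of the averaged-SGD setting the abstract refers to, following Ruppert (1988) and Polyak and Juditsky (1992), both cited below; the notation is ours and not necessarily that of the article. With step sizes $\eta_t$ and i.i.d. samples $\zeta_t$, SGD iterates
$$\theta_t = \theta_{t-1} - \eta_t \nabla f(\theta_{t-1}; \zeta_t), \qquad \bar\theta_n = \frac{1}{n}\sum_{t=1}^{n}\theta_t,$$
and, when the population loss $F(\theta) = \mathrm{E}[f(\theta;\zeta)]$ is strongly convex and smooth,
$$\sqrt{n}\,(\bar\theta_n - \theta^*) \Rightarrow N\bigl(0,\; A^{-1} S A^{-1}\bigr), \qquad A = \nabla^2 F(\theta^*), \quad S = \mathrm{E}\bigl[\nabla f(\theta^*;\zeta)\,\nabla f(\theta^*;\zeta)^\top\bigr].$$
Inference then reduces to estimating the sandwich covariance $A^{-1} S A^{-1}$, which is the subject of the first contribution below.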

Our main contributions are twofold. First, in the fixed dimension setup, we propose two consistent estimators of the asymptotic covariance of the average iterate from SGD: (1) a plug-in estimator, and (2) a batch-means estimator, which is computationally more efficient and only uses the iterates from SGD. Both proposed estimators allow us to construct asymptotically exact confidence intervals and hypothesis tests.
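As a rough illustration of the batch-means idea only (the paper's estimator uses a particular increasing batch-size scheme suited to the time-inhomogeneous SGD trajectory; this equal-batch sketch is ours), one can estimate the asymptotic covariance of the averaged iterate from the spread of batch averages of the stored iterates:

    import numpy as np

    def batch_means_covariance(iterates, n_batches=20):
        """Equal-batch batch-means covariance sketch (illustrative only).

        iterates: (n, d) array of SGD iterates theta_1, ..., theta_n.
        Returns an estimate Sigma_hat of the asymptotic covariance, so that
        Sigma_hat / n approximates the covariance of the averaged iterate.
        """
        n, d = iterates.shape
        b = n // n_batches                      # common batch length
        used = iterates[: n_batches * b]        # drop the leftover tail
        overall_mean = used.mean(axis=0)
        cov = np.zeros((d, d))
        for k in range(n_batches):
            batch_mean = used[k * b : (k + 1) * b].mean(axis=0)
            diff = batch_mean - overall_mean
            cov += b * np.outer(diff, diff)     # scale by batch length
        return cov / (n_batches - 1)

Plugging the diagonal of Sigma_hat / n into a normal approximation gives per-coordinate confidence intervals; the article's guarantees, of course, apply to its own estimators rather than to this sketch.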

Second, for high-dimensional linear regression, using a variant of the SGD algorithm, we construct a debiased estimator of each regression coefficient that is asymptotically normal. This gives a one-pass algorithm for computing both the sparse regression coefficients and confidence intervals, which is computationally attractive and applicable to online data.
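For context, this is the generic debiasing step from the high-dimensional inference literature cited below (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014); the article's novelty is computing both the sparse estimate and a correction of this type in a single pass of an SGD variant, and its exact construction differs in detail. Writing $\hat\theta$ for a sparse (e.g., $\ell_1$-regularized) estimate from data $(x_i, y_i)_{i=1}^{n}$ and $w_j$ for an estimate of the $j$th row of the inverse covariance of the design,
$$\hat\theta^{\mathrm{d}}_j = \hat\theta_j + \frac{1}{n}\sum_{i=1}^{n} w_j^\top x_i\,\bigl(y_i - x_i^\top\hat\theta\bigr),$$
and under suitable sparsity and design conditions $\sqrt{n}\,(\hat\theta^{\mathrm{d}}_j - \theta^*_j)$ is asymptotically normal, which yields coefficient-wise confidence intervals.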

Article information

Source
Ann. Statist., Volume 48, Number 1 (2020), 251-273.

Dates
Received: October 2017
Revised: July 2018
First available in Project Euclid: 17 February 2020

Permanent link to this document
https://projecteuclid.org/euclid.aos/1581930134

Digital Object Identifier
doi:10.1214/18-AOS1801

Mathematical Reviews number (MathSciNet)
MR4065161

Subjects
Primary: 62J10: Analysis of variance and covariance; 62M02: Markov processes: hypothesis testing
Secondary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]

Keywords
Stochastic gradient descent; asymptotic variance; batch-means estimator; high-dimensional inference; time-inhomogeneous Markov chain

Citation

Chen, Xi; Lee, Jason D.; Tong, Xin T.; Zhang, Yichen. Statistical inference for model parameters in stochastic gradient descent. Ann. Statist. 48 (2020), no. 1, 251--273. doi:10.1214/18-AOS1801. https://projecteuclid.org/euclid.aos/1581930134


References

  • Agarwal, A., Negahban, S. and Wainwright, M. J. (2012). Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In Proceedings of the Advances in Neural Information Processing Systems.
  • Bach, F. and Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Proceedings of the Advances in Neural Information Processing Systems.
  • Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate $O(1/n)$. In Proceedings of the Advances in Neural Information Processing Systems.
  • Belloni, A. and Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli 19 521–547.
  • Bühlmann, P. and Mandozzi, J. (2014). High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput. Statist. 29 407–430.
  • Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, Heidelberg.
  • Buja, A., Berk, R., Brown, L., George, E., Traskin, M., Zhang, K. and Zhao, L. (2013). A conspiracy of random $X$ and model violation against classical inference in linear regression. Technical Report, Dept. Statistics, The Wharton School, Univ. Pennsylvania, Philadelphia, PA.
  • Chen, X., Lee, J. D., Tong, X. T. and Zhang, Y. (2020). Supplement to “Statistical inference for model parameters in stochastic gradient descent.” https://doi.org/10.1214/18-AOS1801SUPP.
  • Damerdji, H. (1991). Strong consistency and other properties of the spectral variance estimator. Manage. Sci. 37 1424–1440.
  • Fabian, V. (1968). On asymptotic normality in stochastic approximation. Ann. Math. Stat. 39 1327–1332.
  • Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 849–911.
  • Fishman, G. S. (1996). Monte Carlo: Concepts, Algorithms, and Applications. Springer Series in Operations Research. Springer, New York.
  • Flegal, J. M. and Jones, G. L. (2010). Batch means and spectral variance estimators in Markov chain Monte Carlo. Ann. Statist. 38 1034–1070.
  • Geyer, C. (1992). Practical Markov chain Monte Carlo. Statist. Sci. 7 473–483.
  • Ghadimi, S. and Lan, G. (2012). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM J. Optim. 22 1469–1492.
  • Glynn, P. W. and Iglehart, D. L. (1990). Simulation output analysis using standardized time series. Math. Oper. Res. 15 1–16.
  • Glynn, P. W. and Whitt, W. (1991). Estimating the asymptotic variance with batch means. Oper. Res. Lett. 10 431–435.
  • Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909.
  • Jones, G. L., Haran, M., Caffo, B. S. and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. J. Amer. Statist. Assoc. 101 1537–1547.
  • Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
  • Meinshausen, N., Meier, L. and Bühlmann, P. (2009). $p$-values for high-dimensional regression. J. Amer. Statist. Assoc. 104 1671–1681.
  • Nemirovski, A., Juditsky, A., Lan, G. and Shapiro, A. (2008). Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19 1574–1609.
  • Nesterov, Yu. and Vial, J.-Ph. (2008). Confidence level solutions for stochastic programming. Automatica J. IFAC 44 1559–1568.
  • Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist. 45 158–195.
  • Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30 838–855.
  • Rakhlin, A., Shamir, O. and Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the International Conference on Machine Learning.
  • Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Stat. 22 400–407.
  • Roux, N. L., Schmidt, M. and Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Proceedings of the Advances in Neural Information Processing Systems.
  • Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins–Monro process. Technical Report, Dept. Operations Research and Industrial Engineering, Cornell Univ.
  • Srebro, N. and Tewari, A. (2010). Stochastic optimization for machine learning. Tutorial at International Conference on Machine Learning.
  • Sullivan, T. J. (2015). Introduction to Uncertainty Quantification. Texts in Applied Mathematics 63. Springer, Cham.
  • Toulis, P. and Airoldi, E. M. (2017). Asymptotic and finite-sample properties of estimators based on stochastic gradients. Ann. Statist. 45 1694–1727.
  • van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
  • Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202.
  • Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11 2543–2596.
  • Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24 2057–2075.
  • Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the International Conference on Machine Learning.
  • Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242.

Supplemental materials

  • Supplement to “Statistical inference for model parameters in stochastic gradient descent”. We provide the proofs of all the theoretical results as well as additional simulation studies.