Electronic Journal of Statistics

A variational Bayes approach to variable selection

John T. Ormerod, Chong You, and Samuel Müller

Full-text: Open access


We develop methodology and theory for a mean field variational Bayes approximation to a linear model with a spike and slab prior on the regression coefficients. In particular, we show how our method forces a subset of regression coefficients to be numerically indistinguishable from zero; that, under mild regularity conditions, estimators based on our method consistently estimate the model parameters, with easily obtainable and (asymptotically) appropriately sized standard error estimates; and that they select the true model at an exponential rate in the sample size. We also develop a practical method for simultaneously choosing reasonable initial parameter values and tuning the main tuning parameter of our algorithms. This method is computationally efficient and empirically performs as well as or better than some popular variable selection approaches. Compared with MCMC, our method is faster and remains highly accurate.
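
As an illustration only, the coordinate-ascent updates for a mean field approximation of this kind can be sketched in Python with NumPy. The sketch assumes the model y = X diag(γ) β + ε with priors β_j ~ N(0, σ_β²), σ² ~ Inverse-Gamma(A, B) and γ_j ~ Bernoulli(ρ), and the factorisation q(β) q(σ²) ∏_j q(γ_j). The function name `vb_spike_slab`, the hyperparameter defaults, the initial values and the simulated data are hypothetical choices made for this example. It is not the authors' implementation, and it does not reproduce the paper's procedure for choosing initial values and tuning.

```python
import numpy as np


def vb_spike_slab(X, y, rho=1e-3, sigma2_beta=10.0, A=0.01, B=0.01,
                  max_iter=200, tol=1e-8):
    """Coordinate-ascent mean field VB for y = X diag(gamma) beta + noise.

    Assumed priors (illustrative choices, not taken from the paper):
      beta_j ~ N(0, sigma2_beta), sigma^2 ~ InvGamma(A, B),
      gamma_j ~ Bernoulli(rho).
    Returns the q-mean mu and covariance Sigma of beta, the inclusion
    probabilities w, and tau = E_q[1 / sigma^2].
    """
    n, p = X.shape
    XtX, Xty, yty = X.T @ X, X.T @ y, float(y @ y)
    lam = np.log(rho) - np.log1p(-rho)   # logit of the prior inclusion probability
    w = np.ones(p)                       # start with every variable included
    tau = 1.0 / np.var(y)                # crude initial value for E_q[1/sigma^2]

    for _ in range(max_iter):
        w_old = w.copy()

        # q(beta) = N(mu, Sigma); Omega holds E_q[gamma_j * gamma_k]
        Omega = np.outer(w, w) + np.diag(w * (1.0 - w))
        Sigma = np.linalg.inv(tau * XtX * Omega + np.eye(p) / sigma2_beta)
        mu = tau * Sigma @ (w * Xty)

        # q(sigma^2) = InvGamma(A + n/2, s)
        second_moment = np.outer(mu, mu) + Sigma   # E_q[beta beta^T]
        s = B + 0.5 * (yty - 2.0 * (w * mu) @ Xty
                       + np.sum(XtX * Omega * second_moment))
        tau = (A + 0.5 * n) / s

        # q(gamma_j) = Bernoulli(w_j), updated one coordinate at a time
        for j in range(p):
            cross = XtX[j] * w * second_moment[j]  # X_j'X_k w_k E_q[beta_j beta_k]
            cross[j] = 0.0                         # drop the k == j term
            eta = (lam - 0.5 * tau * second_moment[j, j] * XtX[j, j]
                   + tau * (mu[j] * Xty[j] - cross.sum()))
            # logistic function, clipped to avoid overflow in exp
            w[j] = 1.0 / (1.0 + np.exp(-np.clip(eta, -500.0, 500.0)))

        if np.max(np.abs(w - w_old)) < tol:
            break

    return mu, Sigma, w, tau


# Hypothetical simulated example: three active and five inactive predictors.
rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)

mu, Sigma, w, tau = vb_spike_slab(X, y)
print(np.round(w, 3))        # approximate posterior inclusion probabilities
print(np.round(w * mu, 3))   # approximate posterior means of gamma_j * beta_j
```

In this sketch, once an inclusion probability w_j becomes small, the variance of the corresponding coefficient under q grows toward the prior variance, which makes the next update of w_j smaller still. This feedback is one way to see how a subset of coefficients can end up numerically indistinguishable from zero.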

Article information

Electron. J. Statist., Volume 11, Number 2 (2017), 3549-3594.

Received: June 2017
First available in Project Euclid: 6 October 2017


Keywords: Mean field variational Bayes; Bernoulli-Gaussian model; Markov chain Monte Carlo

Creative Commons Attribution 4.0 International License.


Ormerod, John T.; You, Chong; Müller, Samuel. A variational Bayes approach to variable selection. Electron. J. Statist. 11 (2017), no. 2, 3549–3594. doi:10.1214/17-EJS1332. https://projecteuclid.org/euclid.ejs/1507255614



  • [1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In, Proceedings of the 2nd International Symposium on Information Theory 267–281. Akadémiai Kiadó, Budapest.
  • [2] Andrews, D. F. and Mallows, C. L. (1974). Scale mixtures of normal distributions., Journal of the Royal Statistical Society. Series B (Methodological) 36 99–102.
  • [3] Arias-Castro, E. and Lounici, K. (2014). Estimation and variable selection with exponential weights., Electronic Journal of Statistics 8 328–354.
  • [4] Bartlett, M. (1957). A Comment on D. V. Lindley’s statistical paradox., Biometrika 44 533–534.
  • [5] Bishop, C. M. (2006)., Pattern Recognition and Machine Learning. Springer, New York.
  • [6] Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (2007)., Discrete multivariate analysis: Theory and Practice. Springer.
  • [7] Bottolo, L. and Richardson, S. (2010). Evolutionary stochastic search for Bayesian model exploration., Bayesian Analysis 5 583–618.
  • [8] Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection., The Annals of Applied Statistics 5 232–253.
  • [9] Bühlmann, P. and van de Geer, S. (2011)., Statistics for High Dimensional Data. Springer.
  • [10] Carbonetto, P. (2012). varbvs 1.10. Variational inference for Bayesian variable selection. R package., http://cran.r-project.org.
  • [11] Carbonetto, P. and Stephens, M. (2011). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies., Bayesian Analysis 6 1–42.
  • [12] Casella, G., Girón, F. J., Martínez, M. L. and Moreno, E. (2009). Consistency of Bayesian procedures for variable selection., The Annals of Statistics 37 1207–1228.
  • [13] Castillo, I., Schmidt-Hieber, J. and van der Vaart, A. W. (2014). Bayesian linear regression with sparse priors., The Annals of Statistics 43 1986–2018.
  • [14] Castillo, I. and van der Vaart, A. W. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences., The Annals of Statistics 40 2069–2101.
  • [15] Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces., Biometrika 95 759–771.
  • [16] Faes, C., Ormerod, J. T. and Wand, M. P. (2011). Variational Bayesian inference for parametric and nonparametric regression with missing data., Journal of the American Statistical Association 106 959–971.
  • [17] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties., Journal of the American Statistical Association 96 1348–1360.
  • [18] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion)., Journal of the Royal Statistical Society, Series B 70 849–911.
  • [19] Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space., Statistica Sinica 20 101–148.
  • [20] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters., The Annals of Statistics 32 928–961.
  • [21] Feldkircher, M. and Zeugner, S. (2009). Benchmark priors revisited: on adaptive shrinkage and the supermodel effect in Bayesian model averaging., IMF Working Paper 09/202.
  • [22] Feldkircher, M. and Zeugner, S. (2013). BMS 03.3. Bayesian Model Averaging Library. R package., http://cran.r-project.org.
  • [23] Flandin, G. and Penny, W. D. (2007). Bayesian fMRI data analysis with sparse spatial basis function priors., NeuroImage 34 1108–1125.
  • [24] Friedman, J., Hastie, T. and Tibshirani, R. (2001)., The Elements of Statistical Learning. Springer.
  • [25] Garcia, T. P., Müller, S., Carroll, R. J., Dunn, T. N., Thomas, A. P., Adams, S. H., Pillai, S. D. and Walzem, R. L. (2013). Structured variable selection with q-values., Biostatistics 14 695–707.
  • [26] Hall, P., Ormerod, J. T. and Wand, M. P. (2011). Theory of Gaussian variational approximation for a Poisson mixed model., Statistica Sinica 21 369–389.
  • [27] Hall, P., Pham, T., Wand, M. P. and Wang, S. S. J. (2011). Asymptotic normality and valid inference for Gaussian variational approximation., The Annals of Statistics 39 2502–2532.
  • [28] Hans, C., Dobra, A. and West, M. (2007). Shotgun stochastic search for “large $p$” regression., Journal of the American Statistical Association 102 507–516.
  • [29] Hastie, T. and Efron, B. (2013). lars 1.2. Least angle regression, lasso and forward stagewise regression. R package., http://cran.r-project.org.
  • [30] Horn, R. A. and Johnson, C. R. (2012)., Matrix Analysis. Cambridge University Press.
  • [31] Hsu, D., Kakade, S. and Zhang, T. (2014). Random design analysis of ridge regression., Foundations of Computational Mathematics 14 569–600.
  • [32] Huang, J. C., Morris, Q. D. and Frey, B. J. (2007). Bayesian inference of MicroRNA targets from sequence and expression data., Journal of Computational Biology 14 550–563.
  • [33] Johnson, V. E. and Rossell, D. (2012). Bayesian model selection in high-dimensional settings., Journal of the American Statistical Association 107 649–660.
  • [34] Johnstone, I. M. and Titterington, D. M. (2009). Statistical challenges of high-dimensional data., Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367 4237–4253.
  • [35] Jordan, M. I. (2004). Graphical models., Statistical Science 19 140–155.
  • [36] Lai, R. C. S., Hannig, J. and Lee, T. C. M. (2015). Generalized fiducial inference for ultrahigh dimensional regression., Journal of the American Statistical Association 110 760–772.
  • [37] Li, Z. and Sillanpää, M. J. (2012). Estimation of quantitative trait locus effects with epistasis by variational Bayes algorithms., Genetics 190 231–249.
  • [38] Li, F. and Zhang, N. R. (2010). Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics., Journal of the American Statistical Association 105 1202–1214.
  • [39] Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008). Mixtures of $g$ priors for Bayesian variable selection., Journal of the American Statistical Association 103 410–423.
  • [40] Logsdon, B. A., Hoffman, G. E. and Mezey, J. G. (2010). A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis., BMC Bioinformatics 11 1–13.
  • [41] Luenberger, D. G. and Ye, Y. (2008)., Linear and Nonlinear Programming, 3rd ed. Springer, New York.
  • [42] Luts, J. and Ormerod, J. T. (2014). Mean field variational Bayesian inference for support vector machine classification., Computational Statistics and Data Analysis 73 163–176.
  • [43] Mallows, C. L. (1973). Some comments on $C_p$., Technometrics 15 661–675.
  • [44] Martin, R., Mess, R. and Walker, S. G. Empirical Bayes posterior concentration in sparse high-dimensional linear models., Bernoulli 23.
  • [45] Martin, R. and Walker, S. G. (2014). Asymptotically minimax empirical Bayes estimation of a sparse normal mean vector., Electronic Journal of Statistics 8 2188–2206.
  • [46] Maruyama, Y. and George, E. I. (2011). Fully Bayes factors with a generalized $g$-prior., The Annals of Statistics 39 2740–2765.
  • [47] Müller, S. and Welsh, A. H. (2010). On model selection curves., International Statistical Review 78 240–256.
  • [48] Murphy, K. P. (2012)., Machine Learning: A Probabilistic Perspective. The MIT Press, London.
  • [49] Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors., The Annals of Statistics 42 789–817.
  • [50] Nathoo, F. S., Babul, A., Moiseev, A., Virji-Babul, N. and Beg, M. F. (2014). A variational Bayes spatiotemporal model for electromagnetic brain mapping., Biometrics 70 132–143.
  • [51] Nott, D. J. and Kohn, R. (2005). Adaptive sampling for Bayesian variable selection., Biometrika 92 747–763.
  • [52] O’Hara, R. B. and Sillanpää, M. J. (2009). A review of Bayesian variable selection methods: what, how and which., Bayesian Analysis 4 85–117.
  • [53] Ormerod, J. T. and Wand, M. P. (2010). Explaining variational approximations., The American Statistician 64 140–153.
  • [54] Pham, T. H., Ormerod, J. T. and Wand, M. P. (2013). Mean field variational Bayesian inference for nonparametric regression with measurement error., Computational Statistics and Data Analysis 68 375–387.
  • [55] Rattray, M., Stegle, O., Sharp, K. and Winn, J. (2009). Inference algorithms and learning theory for Bayesian sparse factor analysis. In, Journal of Physics: Conference Series 197 012002.
  • [56] Redmond, M. and Baveja, A. (2002). A data-driven software tool for enabling cooperative information sharing among police departments., European Journal of Operational Research 141 660–678.
  • [57] Ročková, V. and George, E. I. (2014). EMVS: The EM approach to Bayesian variable selection., Journal of the American Statistical Association 109 828–846.
  • [58] Rue, H., Martino, S. and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations., Journal of the Royal Statistical Society, Series B 71 319–392.
  • [59] Schwarz, G. (1978). Estimating the dimension of a model., The Annals of Statistics 6 461–464.
  • [60] Soussen, C., Idier, J., Brie, D. and Duan, J. (2011). From Bernoulli–Gaussian deconvolution to sparse signal restoration., Signal Processing, IEEE Transactions on 59 4572–4584.
  • [61] Stamey, T. A., Kabalin, J. N., McNeal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A. and Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. radical prostatectomy treated patients., Journal of Urology 141 1076–1083.
  • [62] Stingo, F. C. and Vannucci, M. (2011). Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data., Bioinformatics 27 495–501.
  • [63] Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D. and Caldas, C. (2005). A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data., Bioinformatics 21 3025–3033.
  • [64] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso., Journal of the Royal Statistical Society, Series B 58 267–288.
  • [65] Ueda, N. and Nakano, R. (1998). Deterministic annealing EM algorithm., Neural Networks 11 271–282.
  • [66] Van Rijsbergen, C. J. (1979)., Information Retrieval (2nd ed.). Butterworth.
  • [67] Wand, M. P. and Ormerod, J. T. (2011). Penalized wavelets: Embedding wavelets into semiparametric regression., Electronic Journal of Statistics 5 1654–1717.
  • [68] Wand, M. P., Ormerod, J. T., Padoan, S. A. and Frühwirth, R. (2011). Mean field variational Bayes for elaborate distributions., Bayesian Analysis 6 847–900.
  • [69] Wang, H. (2009). Forward regression for ultra-high dimensional variable screening., Journal of the American Statistical Association 104 1512–1524.
  • [70] Wang, X. and Leng, C. (2016). High dimensional ordinary least squares projection for screening variables., Journal of the Royal Statistical Society, Series B 78 589–611.
  • [71] Wang, B. and Titterington, D. M. (2006). Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model., Bayesian Analysis 1 625–650.
  • [72] Xu, S. (2007). An empirical Bayes method for estimating epistatic effects of quantitative trait loci., Biometrics 63 513–521.
  • [73] You, C., Ormerod, J. T. and Müller, S. (2014). On variational Bayes estimation and variational information criteria for linear regression models., Australian and New Zealand Journal of Statistics 56 83–87.
  • [74] Zellner, A. (1986). On Assessing Prior Distributions and Bayesian Regression Analysis With g-Prior Distributions. In, Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (P. K. Goel and A. Zellner, eds.) 233–243. North-Holland/Elsevier.