## The Annals of Statistics

### Bayesian fractional posteriors

#### Abstract

We consider the fractional posterior distribution obtained by updating a prior distribution via Bayes' theorem with a fractional likelihood function: a usual likelihood function raised to a fractional power. First, we analyze the contraction property of the fractional posterior in a general misspecified framework. Our contraction results require only a prior mass condition on a certain Kullback–Leibler (KL) neighborhood of the true parameter (or the KL divergence minimizer in the misspecified case), and obviate the constructions of test functions and sieves commonly used in the literature for analyzing the contraction property of a regular posterior. We show through a counterexample that some condition controlling the complexity of the parameter space is necessary for the regular posterior to contract, which affords additional flexibility in the choice of prior for the fractional posterior. Second, we derive a novel Bayesian oracle inequality based on a PAC-Bayes inequality in misspecified models. Our derivation reveals several advantages of averaging-based Bayesian procedures over optimization-based frequentist procedures. As an application of the Bayesian oracle inequality, we derive a sharp oracle inequality in multivariate convex regression problems. We also illustrate the theory in Gaussian process regression and density estimation problems.
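To make the central object concrete: the fractional posterior replaces the likelihood in the Bayes update with the likelihood raised to a power $\alpha \in (0, 1]$, so $\pi_\alpha(\theta \mid x^n) \propto L_n(\theta)^\alpha \pi(\theta)$. The sketch below (not from the paper; a minimal illustration on a hypothetical coin-flip example with a uniform prior) evaluates both the regular posterior ($\alpha = 1$) and a fractional posterior ($\alpha = 1/2$) on a grid.

```python
import numpy as np

def fractional_posterior(theta_grid, data, alpha, prior):
    """Grid evaluation of pi_alpha(theta | data) ∝ L(theta)^alpha * pi(theta)
    for Bernoulli data. Tempering the likelihood by alpha flattens the
    update while leaving the posterior mode (under a flat prior) unchanged."""
    n_heads = data.sum()
    n_tails = len(data) - n_heads
    log_lik = n_heads * np.log(theta_grid) + n_tails * np.log1p(-theta_grid)
    log_post = alpha * log_lik + np.log(prior)
    log_post -= log_post.max()                 # stabilize before exponentiating
    post = np.exp(log_post)
    return post / np.trapz(post, theta_grid)   # normalize to a density on the grid

theta = np.linspace(0.001, 0.999, 999)
data = np.array([1, 1, 1, 0, 1, 0, 1, 1])      # 6 heads, 2 tails
prior = np.ones_like(theta)                    # uniform prior

full = fractional_posterior(theta, data, alpha=1.0, prior=prior)  # regular posterior
frac = fractional_posterior(theta, data, alpha=0.5, prior=prior)  # fractional posterior
```

With a uniform prior, `full` is (a grid approximation of) a Beta(7, 3) density while `frac` matches Beta(4, 2): same mode at 0.75, but the fractional posterior is flatter, reflecting the tempered likelihood. The contraction theory in the paper quantifies how such $\alpha$-posteriors concentrate under a prior mass condition alone.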

#### Article information

Source
Ann. Statist., Volume 47, Number 1 (2019), 39–66.

Dates
Revised: April 2018
First available in Project Euclid: 30 November 2018

https://projecteuclid.org/euclid.aos/1543568581

Digital Object Identifier
doi:10.1214/18-AOS1712

Mathematical Reviews number (MathSciNet)
MR3909926

Zentralblatt MATH identifier
07036194

#### Citation

Bhattacharya, Anirban; Pati, Debdeep; Yang, Yun. Bayesian fractional posteriors. Ann. Statist. 47 (2019), no. 1, 39–66. doi:10.1214/18-AOS1712. https://projecteuclid.org/euclid.aos/1543568581

#### References

• [1] Alquier, P., Ridgway, J. and Chopin, N. (2016). On the properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res. 17 Paper No. 239, 41.
• [2] Balázs, G., György, A. and Szepesvári, C. (2015). Near-optimal max-affine estimators for convex regression. In AISTATS.
• [3] Barron, A., Schervish, M. J. and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27 536–561.
• [4] Bartlett, P. L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
• [5] Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3 463–482.
• [6] Bartlett, P. L., Mendelson, S. and Philips, P. (2004). Local complexities for empirical risk minimization. In Learning Theory. Lecture Notes in Computer Science 3120 270–284. Springer, Berlin.
• [7] Bellec, P. C. and Tsybakov, A. B. (2015). Sharp oracle bounds for monotone and convex regression through aggregation. J. Mach. Learn. Res. 16 1879–1892.
• [8] Bhattacharya, A., Pati, D. and Yang, Y. (2019). Supplement to “Bayesian fractional posteriors.” DOI:10.1214/18-AOS1712SUPP.
• [9] Birgé, L. (1984). Sur un théorème de minimax et son application aux tests. Probab. Math. Statist. 3 259–282.
• [10] Bissiri, P. G., Holmes, C. C. and Walker, S. G. (2016). A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B. Stat. Methodol. 78 1103–1130.
• [11] Catoni, O. (2007). PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics Lecture Notes—Monograph Series 56. IMS, Beachwood, OH.
• [12] Catoni, O. and Picard, J. (2004). Statistical Learning Theory and Stochastic Optimization: École d'Été de Probabilités de Saint-Flour XXXI—2001. Springer, Berlin.
• [13] Chatterjee, S. (2014). A new perspective on least squares under convex constraint. Ann. Statist. 42 2340–2381.
• [14] Chernozhukov, V. and Hong, H. (2003). An MCMC approach to classical estimation. J. Econometrics 115 293–346.
• [15] Dalalyan, A. and Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach. Learn. 72 39–61.
• [16] Dalalyan, A. S. and Salmon, J. (2012). Sharp oracle inequalities for aggregation of affine estimators. Ann. Statist. 40 2327–2355.
• [17] De Blasi, P. and Walker, S. G. (2013). Bayesian asymptotics with misspecified models. Statist. Sinica 23 169–187.
• [18] Friel, N. and Pettitt, A. N. (2008). Marginal likelihood estimation via power posteriors. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 589–607.
• [19] Gelman, A. and Meng, X.-L. (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statist. Sci. 13 163–185.
• [20] Germain, P., Lacasse, A., Laviolette, F. and Marchand, M. (2009). PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning 353–360. ACM, New York.
• [21] Geyer, C. J. and Thompson, E. A. (1995). Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Amer. Statist. Assoc. 90 909–920.
• [22] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28 500–531.
• [23] Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for non-i.i.d. observations. Ann. Statist. 35 192–223.
• [24] Ghosal, S. and van der Vaart, A. (2007). Posterior convergence rates of Dirichlet mixtures at smooth densities. Ann. Statist. 35 697–723.
• [25] Grünwald, P. (2012). The safe Bayesian: Learning the learning rate via the mixability gap. In Algorithmic Learning Theory. Lecture Notes in Computer Science 7568 169–183. Springer, Heidelberg.
• [26] Grünwald, P. and van Ommen, T. (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. 12 1069–1103.
• [27] Guedj, B. and Alquier, P. (2013). PAC-Bayesian estimation and prediction in sparse additive models. Electron. J. Stat. 7 264–291.
• [28] Guntuboyina, A. and Sen, B. (2013). Covering numbers for convex functions. IEEE Trans. Inform. Theory 59 1957–1965.
• [29] Guntuboyina, A. and Sen, B. (2015). Global risk bounds and adaptation in univariate convex regression. Probab. Theory Related Fields 163 379–411.
• [30] Hannah, L. A. and Dunson, D. B. (2013). Multivariate convex regression with adaptive partitioning. J. Mach. Learn. Res. 14 3261–3294.
• [31] Holmes, C. C. and Walker, S. G. (2017). Assigning a value to a power likelihood in a general Bayesian model. Biometrika 104 497–503.
• [32] Jiang, W. and Tanner, M. A. (2008). Gibbs posterior for variable selection in high-dimensional classification and data mining. Ann. Statist. 36 2207–2231.
• [33] Kleijn, B. J. K. and van der Vaart, A. W. (2006). Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist. 34 837–877.
• [34] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
• [35] Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Math. 2033. Springer, Heidelberg.
• [36] Koltchinskii, V. and Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In High Dimensional Probability, II (Seattle, WA, 1999). Progress in Probability 47 443–457. Birkhäuser, Boston, MA.
• [37] Kruijer, W., Rousseau, J. and van der Vaart, A. (2010). Adaptive Bayesian density estimation with location-scale mixtures. Electron. J. Stat. 4 1225–1257.
• [38] Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York.
• [39] Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38–53.
• [40] Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory 52 3396–3410.
• [41] Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. Ann. Statist. 32 1679–1697.
• [42] Martin, R., Mess, R. and Walker, S. G. (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli 23 1822–1847.
• [43] Martin, R. and Walker, S. G. (2014). Asymptotically minimax empirical Bayes estimation of a sparse normal mean vector. Electron. J. Stat. 8 2188–2206.
• [44] Martin, R. and Walker, S. G. (2016). Optimal Bayesian posterior concentration rates with empirical priors. Preprint. Available at arXiv:1604.05734.
• [45] McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, 1998) 230–234. ACM, New York.
• [46] Miller, J. W. and Dunson, D. B. (2015). Robust Bayesian inference via coarsening. Preprint. Available at arXiv:1506.06101.
• [47] Miller, J. W. and Harrison, M. T. (2018). Mixture models with a prior on the number of components. J. Amer. Statist. Assoc. 113 340–356.
• [48] O’Hagan, A. (1995). Fractional Bayes factors for model comparison. J. Roy. Statist. Soc. Ser. B 57 99–138.
• [49] Rakhlin, A., Sridharan, K. and Tsybakov, A. B. (2017). Empirical entropy, minimax regret and minimax risk. Bernoulli 23 789–824.
• [50] Ramamoorthi, R. V., Sriram, K. and Martin, R. (2015). On posterior concentration in misspecified models. Bayesian Anal. 10 759–789.
• [51] Rockafellar, R. T. (1997). Convex Analysis. Princeton Univ. Press, Princeton, NJ.
• [52] Shawe-Taylor, J. and Williamson, R. C. (1997). A PAC analysis of a Bayesian estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory 2–9. ACM, New York.
• [53] Shen, W., Tokdar, S. T. and Ghosal, S. (2013). Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika 100 623–640.
• [54] Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Ann. Statist. 29 687–714.
• [55] Stephens, M. (2016). False discovery rates: A new deal. Biostatistics 18 275–294.
• [56] van der Vaart, A. W. and van Zanten, J. H. (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. Ann. Statist. 37 2655–2675.
• [57] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.
• [58] van Erven, T. and Harremoës, P. (2014). Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inform. Theory 60 3797–3820.
• [59] van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge Univ. Press, Cambridge.
• [60] Vapnik, V. N. and Chervonenkis, A. J. (1974). Theory of Pattern Recognition. Nauka, Moscow (in Russian).
• [61] Walker, S. and Hjort, N. L. (2001). On Bayesian consistency. J. R. Stat. Soc. Ser. B. Stat. Methodol. 63 811–821.
• [62] Zhang, T. (2006). From $\epsilon$-entropy to KL-entropy: Analysis of minimum information complexity density estimation. Ann. Statist. 34 2180–2210.

#### Supplemental materials

• Proofs of main results. All proofs and additional details pertaining to Sections 4 and 5 are provided in the supplementary document.