## The Annals of Statistics

### Optimal cross-validation in density estimation with the $L^{2}$-loss

Alain Celisse

#### Abstract

We analyze the performance of cross-validation (CV) in the density estimation framework for two purposes: (i) risk estimation and (ii) model selection. The main focus is on the so-called leave-$p$-out CV procedure (Lpo), where $p$ denotes the cardinality of the test set. Closed-form expressions are derived for the Lpo estimator of the risk of projection estimators. These expressions substantially improve upon $V$-fold cross-validation in terms of variability and computational complexity.

From a theoretical point of view, closed-form expressions also make it possible to study the performance of Lpo in terms of risk estimation. The optimality of leave-one-out (Loo), that is, Lpo with $p=1$, is proved among CV procedures used for risk estimation. Two model selection frameworks are also considered: estimation and identification. For estimation with finite sample size $n$, optimality is achieved for $p$ large enough [with $p/n=o(1)$] to balance the overfitting resulting from the structure of the model collection. For identification, model selection consistency is established for Lpo as long as $p/n$ is suitably related to the rate of convergence of the best estimator in the collection: (i) $p/n\to1$ as $n\to+\infty$ with a parametric rate, and (ii) $p/n=o(1)$ with some nonparametric estimators. These theoretical results are validated by simulation experiments.
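To illustrate what a closed-form Lpo risk estimator can look like, the following minimal sketch treats histograms, one simple example of a projection estimator. The formula is derived here from the hypergeometric moments of the training-set bin counts and is not quoted from the paper. The function names `lpo_risk_closed_form` and `lpo_risk_brute_force` are hypothetical, and the sketch assumes that every observation falls inside the binning range. The brute-force version averages the $L^{2}$ contrast over all $\binom{n}{p}$ splits and is included only to check the formula on a tiny sample.

```python
from itertools import combinations

import numpy as np


def lpo_risk_closed_form(counts, widths, p):
    """Leave-p-out L2 risk estimate of a histogram, computed from bin counts only.

    Illustrative formula (derived for this sketch):
    c1 * sum_k (N_k/n)/w_k - c2 * sum_k (N_k/n)^2 / w_k,
    with c1 = (2n-p)/((n-1)(n-p)) and c2 = n(n-p+1)/((n-1)(n-p)).
    """
    counts = np.asarray(counts, dtype=float)
    widths = np.asarray(widths, dtype=float)
    n = counts.sum()
    if not 1 <= p <= n - 1:
        raise ValueError("p must satisfy 1 <= p <= n - 1")
    freq = counts / n
    c1 = (2 * n - p) / ((n - 1) * (n - p))
    c2 = n * (n - p + 1) / ((n - 1) * (n - p))
    return float(c1 * np.sum(freq / widths) - c2 * np.sum(freq**2 / widths))


def lpo_risk_brute_force(x, edges, p):
    """Same quantity by enumerating every test set of size p (tiny n only)."""
    x = np.asarray(x, dtype=float)
    edges = np.asarray(edges, dtype=float)
    n = x.size
    widths = np.diff(edges)
    # Bin index of each observation; assumes edges[0] <= x <= edges[-1].
    bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, widths.size - 1)
    total, n_splits = 0.0, 0
    for test_idx in combinations(range(n), p):
        test_idx = list(test_idx)
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        train_counts = np.bincount(bins[train_mask], minlength=widths.size)
        density = train_counts / ((n - p) * widths)  # histogram fitted on training data
        # L2 contrast: integral of density^2 minus twice the mean density at test points.
        total += np.sum(density**2 * widths) - 2.0 * np.mean(density[bins[test_idx]])
        n_splits += 1
    return total / n_splits


if __name__ == "__main__":
    x = np.array([0.05, 0.12, 0.31, 0.33, 0.48, 0.52, 0.77, 0.91])
    edges = np.linspace(0.0, 1.0, 5)
    counts, _ = np.histogram(x, bins=edges)
    for p in (1, 2, 3):
        closed = lpo_risk_closed_form(counts, np.diff(edges), p)
        brute = lpo_risk_brute_force(x, edges, p)
        print(p, closed, brute)
```

The closed form needs only the bin counts, so its cost does not depend on $p$, whereas the enumeration grows like $\binom{n}{p}$. For model selection, one would evaluate `lpo_risk_closed_form` over a collection of candidate partitions and keep the minimizer.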

#### Article information

Source
Ann. Statist., Volume 42, Number 5 (2014), 1879-1910.

Dates
First available in Project Euclid: 11 September 2014

Permanent link to this document
https://projecteuclid.org/euclid.aos/1410440628

Digital Object Identifier
doi:10.1214/14-AOS1240

Mathematical Reviews number (MathSciNet)
MR3262471

Zentralblatt MATH identifier
1305.62179

Subjects
Primary: 62G09: Resampling methods
Secondary: 62G07: Density estimation; 62E17: Approximations to distributions (nonasymptotic)

#### Citation

Celisse, Alain. Optimal cross-validation in density estimation with the $L^{2}$-loss. Ann. Statist. 42 (2014), no. 5, 1879–1910. doi:10.1214/14-AOS1240. https://projecteuclid.org/euclid.aos/1410440628

#### References

• Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971) 267–281. Akadémiai Kiadó, Budapest.
• Arlot, S. (2007). $V$-fold penalization: An alternative to $v$-fold cross-validation. In Oberwolfach Reports, Volume 4 of Mathematisches Forschungsinstitut. EMS, Zürich.
• Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Stat. Surv. 4 40–79.
• Arlot, S. and Celisse, A. (2011). Segmentation of the mean of heteroscedastic data via cross-validation. Stat. Comput. 21 613–632.
• Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. 10 245–279.
• Baraud, Y., Giraud, C. and Huet, S. (2009). Gaussian model selection with an unknown variance. Ann. Statist. 37 630–672.
• Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413.
• Barron, A. R. and Cover, T. M. (1991). Minimum complexity density estimation. IEEE Trans. Inform. Theory 37 1034–1054.
• Bartlett, P., Boucheron, S. and Lugosi, G. (2002). Model selection and error estimation. Machine Learning 48 85–113.
• Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam (D. Pollard, E. Torgersen and G. Yang, eds.) 55–87. Springer, New York.
• Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3 203–268.
• Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Related Fields 138 33–73.
• Birgé, L. and Rozenholc, Y. (2006). How many bins should be put in a regular histogram. ESAIM Probab. Stat. 10 24–45 (electronic).
• Blanchard, G. and Massart, P. (2006). Discussion: “Local Rademacher complexities and oracle inequalities in risk minimization” [Ann. Statist. 34 (2006) 2593–2656] by V. Koltchinskii. Ann. Statist. 34 2664–2671.
• Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71 353–360.
• Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
• Burman, P. (1989). A comparative study of ordinary cross-validation, $v$-fold cross-validation and the repeated learning-testing methods. Biometrika 76 503–514.
• Burman, P. (1990). Estimation of optimal transformations using $v$-fold cross validation and repeated learning-testing methods. Sankhyā Ser. A 52 314–345.
• Castellan, G. (1999). Modified Akaike’s criterion for histogram density estimation. Technical Report 99.61, Univ. Paris-Sud.
• Castellan, G. (2003). Density estimation via exponential model selection. IEEE Trans. Inform. Theory 49 2052–2060.
• Celisse, A. (2014). Supplement to “Optimal cross-validation in density estimation with the $L^2$-loss.” DOI:10.1214/14-AOS1240SUPP.
• Celisse, A. and Robin, S. (2008). Nonparametric density estimation by exact leave-$p$-out cross-validation. Comput. Statist. Data Anal. 52 2350–2368.
• DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation. Grundlehren der Mathematischen Wissenschaften 303. Springer, Berlin.
• Geisser, S. (1974). A predictive approach to the random effect model. Biometrika 61 101–107.
• Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc. 70 320–328.
• Ibragimov, I. A. and Has’minskiĭ, R. Z. (1981). Statistical Estimation: Asymptotic Theory. Applications of Mathematics 16. Springer, Berlin.
• Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. J. Educ. Psychol. 22 45–55.
• Ledoux, M. (2001). The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs 89. Amer. Math. Soc., Providence, RI.
• Li, K.-C. (1987). Asymptotic optimality for $C_p$, $C_L$, cross-validation and generalized cross-validation: Discrete index set. Ann. Statist. 15 958–975.
• Lugosi, G. and Nobel, A. B. (1999). Adaptive model selection using empirical complexities. Ann. Statist. 27 1830–1864.
• Mallows, C. L. (1973). Some comments on $C_p$. Technometrics 15 661–675.
• Mosteller, F. and Tukey, J. W. (1968). Data analysis, including statistics. In Handbook of Social Psychology, Vol. 2 (G. Lindzey and E. Aronson, eds.). Addison-Wesley, New York.
• Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scand. J. Stat. 9 65–78.
• Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
• Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 486–494.
• Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sinica 7 221–264.
• Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B Stat. Methodol. 36 111–147.
• Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. J. R. Stat. Soc. Ser. B Stat. Methodol. 39 44–47.
• Stone, C. J. (1984). An asymptotically optimal window selection rule for kernel density estimates. Ann. Statist. 12 1285–1297.
• Stone, C. J. (1985). An asymptotically optimal histogram selection rule. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, CA, 1983). Wadsworth Statist./Probab. Ser. 513–520. Wadsworth, Belmont, CA.
• Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505–563.
• Wegkamp, M. (2003). Model selection in nonparametric regression. Ann. Statist. 31 252–273.
• Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model indentification and regression estimation. Biometrika 92 937–950.
• Yang, Y. (2006). Comparing learning methods for classification. Statist. Sinica 16 635–657.
• Yang, Y. (2007). Consistency of cross validation for comparing regression procedures. Ann. Statist. 35 2450–2473.
• Yang, Y. and Barron, A. R. (1998). An asymptotic property of model selection criteria. IEEE Trans. Inform. Theory 44 95–116.
• Zhang, P. (1993). Model selection via multifold cross validation. Ann. Statist. 21 299–313.

#### Supplemental materials

• Supplementary material: Supplement to “Optimal cross-validation in density estimation with the $L^{2}$-loss”: Technical proofs and details. Owing to space constraints, we have moved technical proofs to a supplementary document [Celisse (2014)].