## The Annals of Statistics

### Detection and feature selection in sparse mixture models

#### Abstract

We consider Gaussian mixture models in high dimensions, focusing on the twin tasks of detection and feature selection. Under sparsity assumptions on the difference in means, we derive minimax rates for the problems of testing and of variable selection. We find these rates to depend crucially on the knowledge of the covariance matrices and on whether the mixture is symmetric or not. We establish the performance of various procedures, including the top sparse eigenvalue of the sample covariance matrix (popular in the context of Sparse PCA), as well as new tests inspired by the normality tests of Malkovich and Afifi [J. Amer. Statist. Assoc. 68 (1973) 176–179].
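As a rough illustration of the top sparse eigenvalue statistic mentioned in the abstract, the sketch below computes, by exhaustive search, the largest eigenvalue over all $s \times s$ principal submatrices of the sample covariance matrix. It is a minimal sketch under stated assumptions, not the authors' procedure: the function name `top_sparse_eigenvalue`, the simulated data, and the choices of `n`, `p`, `s` and `mu` are all hypothetical, and the search is exponential in `s`, so it is only usable for tiny problems. Calibrating a rejection threshold is not covered.

```python
from itertools import combinations

import numpy as np


def top_sparse_eigenvalue(X, s):
    """Brute-force s-sparse top eigenvalue of the sample covariance of X (n x p).

    Returns the largest eigenvalue found over all s x s principal
    submatrices, together with the coordinate subset achieving it.
    """
    p = X.shape[1]
    S = np.cov(X, rowvar=False)  # p x p sample covariance matrix
    best_val, best_set = -np.inf, None
    for subset in combinations(range(p), s):
        idx = np.array(subset)
        sub = S[np.ix_(idx, idx)]          # principal submatrix on `subset`
        val = np.linalg.eigvalsh(sub)[-1]  # eigvalsh sorts ascending
        if val > best_val:
            best_val, best_set = val, subset
    return best_val, best_set


# Hypothetical toy data: a symmetric two-component mixture with identity
# covariance whose means differ only on the first two coordinates.
rng = np.random.default_rng(0)
n, p, s = 500, 8, 2
mu = np.zeros(p)
mu[:2] = 1.5
signs = rng.choice([-1.0, 1.0], size=n)

X_null = rng.standard_normal((n, p))                      # no mixture
X_alt = signs[:, None] * mu + rng.standard_normal((n, p))  # sparse mixture

stat_null, _ = top_sparse_eigenvalue(X_null, s)
stat_alt, support_alt = top_sparse_eigenvalue(X_alt, s)
print(stat_null, stat_alt, support_alt)
```

In this toy setting the population covariance under the mixture is the identity plus `mu mu^T`, so the statistic is inflated on the coordinates where the means differ, and the maximizing subset doubles as a crude feature-selection estimate.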

#### Article information

Source
Ann. Statist., Volume 45, Number 5 (2017), 1920–1950.

Dates
Revised: December 2015
First available in Project Euclid: 31 October 2017

Permanent link to this document
https://projecteuclid.org/euclid.aos/1509436823

Digital Object Identifier
doi:10.1214/16-AOS1513

Mathematical Reviews number (MathSciNet)
MR3718157

Zentralblatt MATH identifier
06821114

#### Citation

Verzelen, Nicolas; Arias-Castro, Ery. Detection and feature selection in sparse mixture models. Ann. Statist. 45 (2017), no. 5, 1920–1950. doi:10.1214/16-AOS1513. https://projecteuclid.org/euclid.aos/1509436823

#### References

• Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control AC-19 716–723.
• Amini, A. A. and Wainwright, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 37 2877–2921.
• Azizyan, M., Singh, A. and Wasserman, L. (2013). Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. Neural Information Processing Systems (NIPS).
• Azizyan, M., Singh, A. and Wasserman, L. (2015). Efficient sparse clustering of high-dimensional non-spherical Gaussian mixtures. In AISTATS.
• Belkin, M. and Sinha, K. (2010). Polynomial learning of distribution families. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science FOCS 2010 103–112. IEEE Computer Soc., Los Alamitos, CA.
• Berthet, Q. and Rigollet, P. (2013a). Optimal detection of sparse principal components in high dimension. Ann. Statist. 41 1780–1815.
• Berthet, Q. and Rigollet, P. (2013b). Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory (COLT) 1046–1066.
• Bickel, P. J. and Levina, E. (2004). Some theory of Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli 10 989–1010.
• Birnbaum, A., Johnstone, I. M., Nadler, B. and Paul, D. (2013). Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Statist. 41 1055–1084.
• Boucheron, S., Bousquet, O., Lugosi, G. and Massart, P. (2005). Moment inequalities for functions of independent random variables. Ann. Probab. 33 514–560.
• Brubaker, S. C. and Vempala, S. S. (2008). Isotropic PCA and affine-invariant clustering. In Building Bridges. Bolyai Soc. Math. Stud. 19 241–281. Springer, Berlin.
• Cai, T. T., Ma, Z. and Wu, Y. (2013). Sparse PCA: Optimal rates and adaptive estimation. Ann. Statist. 41 3074–3110.
• Cai, T., Ma, Z. and Wu, Y. (2015). Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields 161 781–815.
• Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
• Chan, Y. and Hall, P. (2010). Using evidence of mixed populations to select variables for clustering very high-dimensional data. J. Amer. Statist. Assoc. 105 798–809.
• Chang, W.-C. (1983). On using principal components before separating a mixture of two multivariate normal distributions. J. Roy. Statist. Soc. Ser. C 32 267–275.
• Chaudhuri, K., Dasgupta, S. and Vattani, A. (2009). Learning mixtures of Gaussians using the k-means algorithm. Preprint. Available at arXiv:0912.0086.
• Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61.
• d’Aspremont, A., El Ghaoui, L., Jordan, M. I. and Lanckriet, G. R. G. (2007). A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49 434–448.
• Donoho, D. and Jin, J. (2009). Feature selection by higher criticism thresholding achieves the optimal phase diagram. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 4449–4470.
• Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928–961.
• Friedman, J. H. and Meulman, J. J. (2004). Clustering objects on subsets of attributes. J. R. Stat. Soc. Ser. B Stat. Methodol. 66 815–849.
• Hardt, M. and Price, E. (2015). Tight bounds for learning a mixture of two Gaussians [extended abstract]. In STOC’15—Proceedings of the 2015 ACM Symposium on Theory of Computing 753–760. ACM, New York.
• Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
• Hsu, D. and Kakade, S. M. (2013). Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In ITCS’13—Proceedings of the 2013 ACM Conference on Innovations in Theoretical Computer Science 11–19. ACM, New York.
• Ingster, Y. I., Pouet, C. and Tsybakov, A. B. (2009). Classification of sparse high-dimensional vectors. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 4427–4448.
• Ji, P. and Jin, J. (2012). UPS delivers optimal phase diagram in high-dimensional variable selection. Ann. Statist. 40 73–103.
• Jin, J. (2009). Impossibility of successful classification when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 106 8859–8864.
• Jin, J., Ke, Z. T. and Wang, W. (2015). Phase transitions for high dimensional clustering and related problems. Preprint. Available at arXiv:1502.06952.
• Jin, J. and Wang, W. (2014). Important feature PCA for high dimensional clustering. Preprint. Available at arXiv:1407.5241.
• Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Statist. Assoc. 104 682–693.
• Kalai, A. T., Moitra, A. and Valiant, G. (2012). Disentangling Gaussians. Commun. ACM 55 113–120.
• Malkovich, J. F. and Afifi, A. (1973). On tests for multivariate normality. J. Amer. Statist. Assoc. 68 176–179.
• Mallat, S. and Zhang, Z. (1993). Matching pursuit with time-frequency dictionaries. IEEE Trans. Signal Process. 41 3397–3415.
• Mallows, C. (1973). Some comments on $C_p$. Technometrics 15 661–675.
• Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika 57 519–530.
• Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Springer, Berlin.
• Maugis, C. and Michel, B. (2011). A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM Probab. Stat. 15 41–68.
• Pan, W. and Shen, X. (2007). Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 8 1145–1164.
• Raftery, A. E. and Dean, N. (2006). Variable selection for model-based clustering. J. Amer. Statist. Assoc. 101 168–178.
• Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
• Srivastava, M. S. (1984). A measure of skewness and kurtosis and a graphical method for assessing multivariate normality. Statist. Probab. Lett. 2 263–267.
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inform. Theory 50 2231–2242.
• Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York.
• van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes with Applications to Statistics. Springer, New York.
• Verzelen, N. (2012). Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. Electron. J. Stat. 6 38–90.
• Verzelen, N. and Arias-Castro, E. (2016). Supplement to “Detection and feature selection in sparse mixture models.” DOI:10.1214/16-AOS1513SUPP.
• Vu, V. Q. and Lei, J. (2012). Minimax rates of estimation for sparse PCA in high dimensions. In International Conference on Artificial Intelligence and Statistics 1278–1286.
• Vu, V. Q. and Lei, J. (2013). Minimax sparse principal subspace estimation in high dimensions. Ann. Statist. 41 2905–2947.
• Wang, S. and Zhu, J. (2008). Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64 440–448, 666.
• Witten, D. M. and Tibshirani, R. (2010). A framework for feature selection in clustering. J. Amer. Statist. Assoc. 105 713–726.
• Xie, B., Pan, W. and Shen, X. (2008). Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electron. J. Stat. 2 168–212.
• Zhu, J. and Hastie, T. (2004). Classification of gene microarrays by penalized logistic regression. Biostatistics 5 427–443.

#### Supplemental materials

• Supplement to “Detection and feature selection in sparse mixture models”. This supplement contains the proofs of the results.