## Statistical Science

### A Unified Framework for High-Dimensional Analysis of $M$-Estimators with Decomposable Regularizers

#### Abstract

High-dimensional statistical inference deals with models in which the number of parameters $p$ is comparable to or larger than the sample size $n$. Since it is usually impossible to obtain consistent procedures unless $p/n\rightarrow0$, a line of recent work has studied models with various types of low-dimensional structure, including sparse vectors, sparse and structured matrices, low-rank matrices and combinations thereof. In such settings, a general approach to estimation is to solve a regularized optimization problem, which combines a loss function measuring how well the model fits the data with some regularization function that encourages the assumed structure. This paper provides a unified framework for establishing consistency and convergence rates for such regularized $M$-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive some existing results, and also to obtain a number of new results on consistency and convergence rates, in both $\ell_{2}$-error and related norms. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure corresponding regularized $M$-estimators have fast convergence rates and which are optimal in many well-studied cases.
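The regularized $M$-estimation recipe described in the abstract can be made concrete with its best-known instance: a least-squares loss combined with an $\ell_{1}$ regularizer (the Lasso of Tibshirani [67]), solved here by iterative soft-thresholding. This is a minimal illustrative sketch, not code from the paper; all function names, the step-size choice, and the simulated data are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1, the l1 regularizer.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize the regularized objective
        (1/2n) * ||y - X beta||_2^2  +  lam * ||beta||_1
    (loss + decomposable regularizer) by proximal gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    # Step size 1/L, where L is a Lipschitz constant of the loss gradient.
    L = np.linalg.norm(X, 2) ** 2 / n
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n        # gradient of the squared loss
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

# Sparse ground truth in the p > n regime: s nonzero coordinates out of p.
rng = np.random.default_rng(0)
n, p, s = 100, 200, 5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 1.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)

beta_hat = lasso_ista(X, y, lam=0.05)
```

Even though $p > n$, the $\ell_{1}$ penalty exploits the sparsity of `beta_star`, and `beta_hat` lands close to it in $\ell_{2}$-norm; this is exactly the kind of guarantee the paper's main theorem delivers, with the error scale governed by $\sqrt{s \log p / n}$.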

#### Article information

Source
Statist. Sci., Volume 27, Number 4 (2012), 538–557.

Dates
First available in Project Euclid: 21 December 2012

https://projecteuclid.org/euclid.ss/1356098555

Digital Object Identifier
doi:10.1214/12-STS400

Mathematical Reviews number (MathSciNet)
MR3025133

Zentralblatt MATH identifier
1331.62350

#### Citation

Negahban, Sahand N.; Ravikumar, Pradeep; Wainwright, Martin J.; Yu, Bin. A Unified Framework for High-Dimensional Analysis of $M$-Estimators with Decomposable Regularizers. Statist. Sci. 27 (2012), no. 4, 538--557. doi:10.1214/12-STS400. https://projecteuclid.org/euclid.ss/1356098555

#### References

• [1] Agarwal, A., Negahban, S. and Wainwright, M. J. (2011). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Ann. Statist. 40 1171–1197.
• [2] Bach, F. (2010). Self-concordant analysis for logistic regression. Electron. J. Stat. 4 384–414.
• [3] Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res. 9 1179–1225.
• [4] Bach, F. R. (2008). Consistency of trace norm minimization. J. Mach. Learn. Res. 9 1019–1048.
• [5] Baraniuk, R. G., Cevher, V., Duarte, M. F. and Hegde, C. (2008). Model-based compressive sensing. Technical report, Rice Univ. Available at arXiv:0808.3572.
• [6] Bickel, P. J., Brown, J. B., Huang, H. and Li, Q. (2009). An overview of recent developments in genomics and associated statistical methods. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 4313–4337.
• [7] Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. Ann. Statist. 36 2577–2604.
• [8] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
• [9] Bunea, F. (2008). Honest variable selection in linear and logistic regression models via $l_{1}$ and $l_{1}+l_{2}$ penalization. Electron. J. Stat. 2 1153–1194.
• [10] Bunea, F., She, Y. and Wegkamp, M. (2010). Adaptive rank penalized estimators in multivariate regression. Technical report, Florida State. Available at arXiv:1004.2995.
• [11] Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1 169–194.
• [12] Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
• [13] Cai, T. and Zhou, H. (2010). Optimal rates of convergence for sparse covariance matrix estimation. Technical report, Wharton School of Business, Univ. Pennsylvania. Available at http://www-stat.wharton.upenn.edu/~tcai/paper/html/Sparse-Covariance-Matrix.html.
• [14] Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
• [15] Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. Comput. Math. 9 717–772.
• [16] Candes, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51 4203–4215.
• [17] Candès, E. J., Li, X., Ma, Y. and Wright, J. (2010). Stable principal component pursuit. In IEEE International Symposium on Information Theory, Austin, TX.
• [18] Chandrasekaran, V., Sanghavi, S., Parrilo, P. A. and Willsky, A. S. (2011). Rank-sparsity incoherence for matrix decomposition. SIAM J. Optimiz. 21 572–596.
• [19] Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61.
• [20] Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inform. Theory 52 1289–1306.
• [21] Donoho, D. L. and Tanner, J. (2005). Neighborliness of randomly projected simplices in high dimensions. Proc. Natl. Acad. Sci. USA 102 9452–9457 (electronic).
• [22] El Karoui, N. (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 36 2717–2756.
• [23] Fazel, M. (2002). Matrix rank minimization with applications. Ph.D. thesis, Stanford. Available at http://faculty.washington.edu/mfazel/thesis-final.pdf.
• [24] Girko, V. L. (1995). Statistical Analysis of Observations of Increasing Dimension. Theory and Decision Library. Series B: Mathematical and Statistical Methods 28. Kluwer Academic, Dordrecht. Translated from the Russian.
• [25] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971–988.
• [26] Hsu, D., Kakade, S. M. and Zhang, T. (2011). Robust matrix decomposition with sparse corruptions. IEEE Trans. Inform. Theory 57 7221–7234.
• [27] Huang, J. and Zhang, T. (2010). The benefit of group sparsity. Ann. Statist. 38 1978–2004.
• [28] Jacob, L., Obozinski, G. and Vert, J. P. (2009). Group Lasso with overlap and graph Lasso. In International Conference on Machine Learning (ICML) 433–440, Haifa, Israel.
• [29] Jenatton, R., Mairal, J., Obozinski, G. and Bach, F. (2011). Proximal methods for hierarchical sparse coding. J. Mach. Learn. Res. 12 2297–2334.
• [30] Kakade, S. M., Shamir, O., Sridharan, K. and Tewari, A. (2010). Learning exponential families in high-dimensions: Strong convexity and sparsity. In AISTATS, Sardinia, Italy.
• [31] Keshavan, R. H., Montanari, A. and Oh, S. (2010). Matrix completion from noisy entries. J. Mach. Learn. Res. 11 2057–2078.
• [32] Kim, Y., Kim, J. and Kim, Y. (2006). Blockwise sparse regression. Statist. Sinica 16 375–390.
• [33] Koltchinskii, V. and Yuan, M. (2008). Sparse recovery in large ensembles of kernel machines. In Proceedings of COLT, Helsinki, Finland.
• [34] Koltchinskii, V. and Yuan, M. (2010). Sparsity in multiple kernel learning. Ann. Statist. 38 3660–3695.
• [35] Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. 37 4254–4278.
• [36] Landgrebe, D. (2002). Hyperspectral image data analysis as a high-dimensional signal processing problem. IEEE Signal Processing Magazine 19 17–28.
• [37] Lee, K. and Bresler, Y. (2009). Guaranteed minimum rank approximation from linear observations by nuclear norm minimization with an ellipsoidal constraint. Technical report, UIUC. Available at arXiv:0903.4742.
• [38] Liu, Z. and Vandenberghe, L. (2009). Interior-point method for nuclear norm approximation with application to system identification. SIAM J. Matrix Anal. Appl. 31 1235–1256.
• [39] Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. (2009). Taking advantage of sparsity in multi-task learning. Technical report, ETH Zurich. Available at arXiv:0903.1468.
• [40] Lustig, M., Donoho, D., Santos, J. and Pauly, J. (2008). Compressed sensing MRI. IEEE Signal Processing Magazine 25 72–82.
• [41] McCoy, M. and Tropp, J. (2011). Two proposals for robust PCA using semidefinite programming. Electron. J. Stat. 5 1123–1160.
• [42] Mehta, M. L. (1991). Random Matrices, 2nd ed. Academic Press, Boston, MA.
• [43] Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive modeling. Ann. Statist. 37 3779–3821.
• [44] Meinshausen, N. (2008). A note on the Lasso for Gaussian graphical model selection. Statist. Probab. Lett. 78 880–884.
• [45] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
• [46] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 37 246–270.
• [47] Nardi, Y. and Rinaldo, A. (2008). On the asymptotic properties of the group lasso estimator for linear models. Electron. J. Stat. 2 605–633.
• [48] Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2009). A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. In NIPS Conference, Vancouver, Canada.
• [49] Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). Supplement to “A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers.” DOI:10.1214/12-STS400SUPP.
• [50] Negahban, S. and Wainwright, M. J. (2011). Simultaneous support recovery in high-dimensional regression: Benefits and perils of $\ell _{1,\infty}$-regularization. IEEE Trans. Inform. Theory 57 3841–3863.
• [51] Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist. 39 1069–1097.
• [52] Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. J. Mach. Learn. Res. 13 1665–1697.
• [53] Obozinski, G., Wainwright, M. J. and Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Ann. Statist. 39 1–47.
• [54] Pastur, L. A. (1972). The spectrum of random matrices. Teoret. Mat. Fiz. 10 102–112.
• [55] Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11 2241–2259.
• [56] Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_{q}$-balls. IEEE Trans. Inform. Theory 57 6976–6994.
• [57] Raskutti, G., Wainwright, M. J. and Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res. 13 389–427.
• [58] Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 1009–1030.
• [59] Ravikumar, P., Wainwright, M. J. and Lafferty, J. D. (2010). High-dimensional Ising model selection using $\ell_{1}$-regularized logistic regression. Ann. Statist. 38 1287–1319.
• [60] Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2011). High-dimensional covariance estimation by minimizing $\ell_{1}$-penalized log-determinant divergence. Electron. J. Stat. 5 935–980.
• [61] Recht, B. (2011). A simpler approach to matrix completion. J. Mach. Learn. Res. 12 3413–3430.
• [62] Recht, B., Fazel, M. and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52 471–501.
• [63] Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. Ann. Statist. 39 887–930.
• [64] Rothman, A. J., Bickel, P. J., Levina, E. and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Stat. 2 494–515.
• [65] Rudelson, M. and Zhou, S. (2011). Reconstruction from anisotropic random measurements. Technical report, Univ. Michigan.
• [66] Stojnic, M., Parvaresh, F. and Hassibi, B. (2009). On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Trans. Signal Process. 57 3075–3085.
• [67] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 267–288.
• [68] Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 91–108.
• [69] Tropp, J. A., Gilbert, A. C. and Strauss, M. J. (2006). Algorithms for simultaneous sparse approximation. Signal Process. 86 572–602. Special issue on “Sparse approximations in signal and image processing”.
• [70] Turlach, B. A., Venables, W. N. and Wright, S. J. (2005). Simultaneous variable selection. Technometrics 47 349–363.
• [71] van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645.
• [72] van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3 1360–1392.
• [73] Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202.
• [74] Wainwright, M. J. (2009). Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inform. Theory 55 5728–5741.
• [75] Wigner, E. P. (1955). Characteristic vectors of bordered matrices with infinite dimensions. Ann. of Math. (2) 62 548–564.
• [76] Xu, H., Caramanis, C. and Sanghavi, S. (2012). Robust PCA via outlier pursuit. IEEE Trans. Inform. Theory 58 3047–3064.
• [77] Yuan, M., Ekici, A., Lu, Z. and Monteiro, R. (2007). Dimension reduction and coefficient estimation in multivariate linear regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 329–346.
• [78] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67.
• [79] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594.
• [80] Zhao, P., Rocha, G. and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Statist. 37 3468–3497.
• [81] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563.
• [82] Zhou, S., Lafferty, J. and Wasserman, L. (2008). Time-varying undirected graphs. In 21st Annual Conference on Learning Theory (COLT), Helsinki, Finland.

#### Supplemental materials

• Supplementary material: Supplementary material for “A unified framework for high-dimensional analysis of $\boldsymbol{M}$-estimators with decomposable regularizers”. Due to space constraints, the proofs and technical details have been given in the supplementary document by Negahban et al. [49].