## The Annals of Statistics

### Fast global convergence of gradient methods for high-dimensional statistical recovery

#### Abstract

Many statistical $M$-estimators are based on convex optimization problems formed by the combination of a data-dependent loss function with a norm-based regularizer. We analyze the convergence rates of projected gradient and composite gradient methods for solving such problems, working within a high-dimensional framework that allows the ambient dimension $d$ to grow with (and possibly exceed) the sample size $n$. Our theory identifies conditions under which projected gradient descent enjoys globally linear convergence up to the statistical precision of the model, meaning the typical distance between the true unknown parameter $\theta^{*}$ and an optimal solution $\widehat{\theta}$. By establishing these conditions with high probability for numerous statistical models, our analysis applies to a wide range of $M$-estimators, including sparse linear regression using Lasso; group Lasso for block sparsity; log-linear models with regularization; low-rank matrix recovery using nuclear norm regularization; and matrix decomposition using a combination of the nuclear and $\ell_{1}$ norms. Overall, our analysis reveals interesting connections between statistical and computational efficiency in high-dimensional estimation.
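The projected gradient method analyzed in the abstract can be illustrated on the simplest instance it covers, the $\ell_1$-constrained Lasso. The sketch below is not the authors' code; it is a minimal illustration assuming a least-squares loss $\|y - X\theta\|_2^2/(2n)$ with projection onto the $\ell_1$-ball computed by the sorting method of Duchi et al. (reference [14]). The function names and the fixed step size are illustrative choices.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection onto the l1-ball {x : ||x||_1 <= radius},
    via the sorting-based algorithm of Duchi et al. (reference [14])."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]          # magnitudes in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    # largest index rho with u_rho > (cumulative sum - radius) / rho
    rho = np.nonzero(u * idx > (css - radius))[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)  # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def projected_gradient_lasso(X, y, radius, step, n_iters=500):
    """Projected gradient descent for the l1-constrained least-squares
    problem: minimize ||y - X theta||^2 / (2n) subject to ||theta||_1 <= radius."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n      # gradient of the quadratic loss
        theta = project_l1_ball(theta - step * grad, radius)
    return theta
```

Under the restricted strong convexity/smoothness conditions established in the paper, iterates of this scheme contract geometrically toward $\widehat{\theta}$ until they reach the statistical precision $\|\widehat{\theta} - \theta^{*}\|$, after which further iterations yield no statistically meaningful gain.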

#### Article information

Source
Ann. Statist., Volume 40, Number 5 (2012), 2452-2482.

Dates
First available in Project Euclid: 4 February 2013

https://projecteuclid.org/euclid.aos/1359987527

Digital Object Identifier
doi:10.1214/12-AOS1032

Mathematical Reviews number (MathSciNet)
MR3097609

Zentralblatt MATH identifier
1373.62244

Subjects
Primary: 62F30: Inference under constraints
Secondary: 62H12: Estimation

#### Citation

Agarwal, Alekh; Negahban, Sahand; Wainwright, Martin J. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Statist. 40 (2012), no. 5, 2452--2482. doi:10.1214/12-AOS1032. https://projecteuclid.org/euclid.aos/1359987527

#### References

• [1] Agarwal, A., Negahban, S. and Wainwright, M. J. (2012). Supplement to “Fast global convergence of gradient methods for high-dimensional statistical recovery.” DOI:10.1214/12-AOS1032SUPP.
• [2] Agarwal, A., Negahban, S. and Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Ann. Statist. 40 1171–1197.
• [3] Amini, A. A. and Wainwright, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 37 2877–2921.
• [4] Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2 183–202.
• [5] Becker, S., Bobin, J. and Candès, E. J. (2011). NESTA: A fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci. 4 1–39.
• [6] Bertsekas, D. P. (1995). Nonlinear Programming. Athena Scientific, Belmont, MA.
• [7] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
• [8] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge Univ. Press, Cambridge.
• [9] Bredies, K. and Lorenz, D. A. (2008). Linear convergence of iterative soft-thresholding. J. Fourier Anal. Appl. 14 813–837.
• [10] Candès, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. 11, 37.
• [11] Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. Comput. Math. 9 717–772.
• [12] Chandrasekaran, V., Sanghavi, S., Parrilo, P. A. and Willsky, A. S. (2011). Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21 572–596.
• [13] Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61.
• [14] Duchi, J., Shalev-Shwartz, S., Singer, Y. and Chandra, T. (2008). Efficient projections onto the $\ell _1$-ball for learning in high dimensions. In ICML. Omnipress, Helsinki, Finland.
• [15] Fazel, M. (2002). Matrix rank minimization with applications. Ph.D. thesis, Stanford. Available at http://faculty.washington.edu/mfazel/thesis-final.pdf.
• [16] Garg, R. and Khandekar, R. (2009). Gradient descent with sparsification: An iterative algorithm for sparse recovery with restricted isometry property. In ICML. Omnipress, Montreal, Canada.
• [17] Hale, E. T., Yin, W. and Zhang, Y. (2008). Fixed-point continuation for $\ell_1$-minimization: Methodology and convergence. SIAM J. Optim. 19 1107–1130.
• [18] Hsu, D., Kakade, S. M. and Zhang, T. (2011). Robust matrix decomposition with sparse corruptions. IEEE Trans. Inform. Theory 57 7221–7234.
• [19] Huang, J. and Zhang, T. (2010). The benefit of group sparsity. Ann. Statist. 38 1978–2004.
• [20] Ji, S. and Ye, J. (2009). An accelerated gradient method for trace norm minimization. In ICML. Omnipress, Montreal, Canada.
• [21] Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist. 39 2302–2329.
• [22] Lee, K. and Bresler, Y. (2009). Guaranteed minimum rank approximation from linear observations by nuclear norm minimization with an ellipsoidal constraint. Technical report, UIUC. Available at arXiv:0903.4742.
• [23] Loh, P. and Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Ann. Statist. 40 1637–1664.
• [24] Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. (2009). Taking advantage of sparsity in multi-task learning. In COLT. Omnipress, Montreal, Canada.
• [25] Luo, Z.-Q. and Tseng, P. (1993). Error bounds and convergence analysis of feasible descent methods: A general approach. Ann. Oper. Res. 46/47 157–178.
• [26] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
• [27] Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2009). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS. MIT Press, Vancouver, Canada.
• [28] Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist. 39 1069–1097.
• [29] Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. J. Mach. Learn. Res. 13 1665–1697.
• [30] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization 87. Kluwer Academic, Boston, MA.
• [31] Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic Univ. Louvain (UCL).
• [32] Ngai, H. V. and Penot, J.-P. (2008). Paraconvex functions and paraconvex sets. Studia Math. 184 1–29.
• [33] Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11 2241–2259.
• [34] Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Trans. Inform. Theory 57 6976–6994.
• [35] Recht, B. (2011). A simpler approach to matrix completion. J. Mach. Learn. Res. 12 3413–3430.
• [36] Recht, B., Fazel, M. and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52 471–501.
• [37] Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. Ann. Statist. 39 887–930.
• [38] Rudelson, M. and Zhou, S. (2011). Reconstruction from anisotropic random measurements. Technical report, Univ. Michigan.
• [39] Srebro, N., Alon, N. and Jaakkola, T. S. (2005). Generalization error bounds for collaborative prediction with low-rank matrices. In NIPS. MIT Press, Vancouver, Canada.
• [40] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• [41] Tropp, J. A. and Gilbert, A. C. (2007). Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inform. Theory 53 4655–4666.
• [42] van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3 1360–1392.
• [43] Xu, H., Caramanis, C. and Sanghavi, S. (2012). Robust PCA via outlier pursuit. IEEE Trans. Inform. Theory 58 3047–3064.
• [44] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594.
• [45] Zhao, P., Rocha, G. and Yu, B. (2009). Grouped and hierarchical model selection through composite absolute penalties. Ann. Statist. 37 3468–3497.