## The Annals of Statistics

### Randomized sketches for kernels: Fast and optimal nonparametric regression

#### Abstract

Kernel ridge regression (KRR) is a standard method for performing nonparametric regression over reproducing kernel Hilbert spaces. Given $n$ samples, the time and space complexities of computing the KRR estimate scale as $\mathcal{O}(n^{3})$ and $\mathcal{O}(n^{2})$, respectively, which is prohibitive in many cases. We propose approximations of KRR based on $m$-dimensional randomized sketches of the kernel matrix, and study how small the projection dimension $m$ can be chosen while still preserving minimax optimality of the approximate KRR estimate. For various classes of randomized sketches, including those based on Gaussian and randomized Hadamard matrices, we prove that it suffices to choose the sketch dimension $m$ proportional to the statistical dimension (modulo logarithmic factors). Thus, we obtain fast and minimax optimal approximations to the KRR estimate for nonparametric regression. In doing so, we prove a novel lower bound on the minimax risk of kernel regression in terms of the localized Rademacher complexity.
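As a rough illustration of the idea in the abstract, the following is a minimal NumPy sketch of KRR restricted to an $m$-dimensional Gaussian sketch. It is not the authors' code. It assumes the objective $\frac{1}{n}\|y - K S^{\top}\alpha\|_2^2 + \lambda\,\alpha^{\top} S K S^{\top} \alpha$ with an unnormalized kernel matrix $K$, which may differ from the paper's normalization. The Gaussian kernel, the bandwidth, the values of `n`, `m` and `lam`, and the helper names `gaussian_kernel` and `sketched_krr_fit` are all illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=0.2):
    """Gaussian (RBF) kernel matrix between 1-D sample arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def sketched_krr_fit(K, y, lam, m, rng):
    """Solve the m-dimensional sketched KRR problem; return (S, alpha)."""
    n = K.shape[0]
    # Gaussian sketch: i.i.d. N(0, 1/m) entries, shape (m, n).
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    KS = K @ S.T                      # n x m product, O(n^2 m) time
    # Normal equations of (1/n)||y - K S^T a||^2 + lam * a^T S K S^T a:
    #   (S K^2 S^T + n*lam * S K S^T) a = S K y
    A = KS.T @ KS + n * lam * (S @ KS)
    b = KS.T @ y
    # lstsq tolerates a numerically rank-deficient m x m system.
    alpha = np.linalg.lstsq(A, b, rcond=None)[0]
    return S, alpha

# Synthetic data (assumed for illustration): noisy sine on [0, 1].
rng = np.random.default_rng(0)
n, m, lam = 200, 30, 1e-3
x = np.sort(rng.uniform(0.0, 1.0, n))
f_true = np.sin(2 * np.pi * x)
y = f_true + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(x, x)
S, alpha = sketched_krr_fit(K, y, lam, m, rng)
fitted = K @ (S.T @ alpha)            # in-sample predictions
mse = float(np.mean((fitted - f_true) ** 2))
print(f"sketched KRR in-sample MSE vs. truth: {mse:.4f}")
```

With a dense Gaussian sketch, forming $KS^{\top}$ still costs $\mathcal{O}(n^{2}m)$, but the linear system to solve is only $m \times m$ rather than $n \times n$. Structured sketches such as the randomized Hadamard matrices mentioned in the abstract are meant to make the product itself cheaper; they are not implemented here.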

#### Article information

Source
Ann. Statist., Volume 45, Number 3 (2017), 991-1023.

Dates
Revised: April 2016
First available in Project Euclid: 13 June 2017

https://projecteuclid.org/euclid.aos/1497319686

Digital Object Identifier
doi:10.1214/16-AOS1472

Mathematical Reviews number (MathSciNet)
MR3662446

Zentralblatt MATH identifier
1371.62039

Subjects
Primary: 62G08: Nonparametric regression
Secondary: 68W20: Randomized algorithms

#### Citation

Yang, Yun; Pilanci, Mert; Wainwright, Martin J. Randomized sketches for kernels: Fast and optimal nonparametric regression. Ann. Statist. 45 (2017), no. 3, 991–1023. doi:10.1214/16-AOS1472. https://projecteuclid.org/euclid.aos/1497319686
