## The Annals of Statistics

### Partial least squares prediction in high-dimensional regression

#### Abstract

We study the asymptotic behavior of predictions from partial least squares (PLS) regression as the sample size and number of predictors diverge in various alignments. We show that there is a range of regression scenarios where PLS predictions have the usual root-$n$ convergence rate, even when the sample size is substantially smaller than the number of predictors, and an even wider range where the rate is slower but may still produce practically useful results. We show also that PLS predictions achieve their best asymptotic behavior in abundant regressions where many predictors contribute information about the response. Their asymptotic behavior tends to be undesirable in sparse regressions where few predictors contribute information about the response.

#### Article information

Source
Ann. Statist., Volume 47, Number 2 (2019), 884–908.

Dates
Revised: December 2017
First available in Project Euclid: 11 January 2019

Permanent link to this document
https://projecteuclid.org/euclid.aos/1547197242

Digital Object Identifier
doi:10.1214/18-AOS1681

Mathematical Reviews number (MathSciNet)
MR3909954

Zentralblatt MATH identifier
07033155

Subjects
Primary: 62J05: Linear regression
Secondary: 62F12: Asymptotic properties of estimators

#### Citation

Cook, R. Dennis; Forzani, Liliana. Partial least squares prediction in high-dimensional regression. Ann. Statist. 47 (2019), no. 2, 884–908. doi:10.1214/18-AOS1681. https://projecteuclid.org/euclid.aos/1547197242

#### Supplemental materials

• Supplement to “Partial least squares prediction in high-dimensional regression”. Proofs for all lemmas, propositions and theorems are provided in the online supplement to this article.