The Annals of Applied Statistics

Bi-cross-validation of the SVD and the nonnegative matrix factorization

Art B. Owen and Patrick O. Perry

Source: Ann. Appl. Stat. Volume 3, Number 2 (2009), 564-594.

Abstract

This article presents a form of bi-cross-validation (BCV) for choosing the rank in outer product models, especially the singular value decomposition (SVD) and the nonnegative matrix factorization (NMF). Instead of leaving out a set of rows of the data matrix, we leave out a set of rows and a set of columns, and then predict the left-out entries by low-rank operations on the retained data. We prove a self-consistency result expressing the prediction error as a residual from a low-rank approximation. Random matrix theory and some empirical results suggest that smaller hold-out sets lead to more over-fitting, while larger ones are more prone to under-fitting. In simulated examples we find that a method leaving out half the rows and half the columns performs well.
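
The hold-out scheme described above can be illustrated with a minimal Python sketch. The abstract does not spell out the estimator, so the code assumes one common reading: partition the data matrix into blocks A (held-out rows and columns), B (held-out rows, retained columns), C (retained rows, held-out columns) and D (fully retained), then predict A as B D_k^+ C, where D_k^+ is the pseudo-inverse of the rank-k truncated SVD of D. The function names `truncated_pinv` and `bcv_error` are hypothetical, and this is a sketch for the SVD case only, not the authors' implementation.

```python
import numpy as np

def truncated_pinv(D, k):
    """Pseudo-inverse of the best rank-k approximation of D (via the SVD)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    # V_k diag(1/s_k) U_k^T; for k = 0 this is an all-zero matrix.
    return (Vt[:k].T / s[:k]) @ U[:, :k].T

def bcv_error(X, rows, cols, k):
    """Squared Frobenius error when predicting the held-out block.

    rows, cols: index arrays of the held-out rows and columns (assumed layout).
    """
    row_mask = np.zeros(X.shape[0], dtype=bool)
    col_mask = np.zeros(X.shape[1], dtype=bool)
    row_mask[rows] = True
    col_mask[cols] = True

    A = X[np.ix_(row_mask, col_mask)]      # held out entirely
    B = X[np.ix_(row_mask, ~col_mask)]     # held-out rows, retained columns
    C = X[np.ix_(~row_mask, col_mask)]     # retained rows, held-out columns
    D = X[np.ix_(~row_mask, ~col_mask)]    # retained data

    A_hat = B @ truncated_pinv(D, k) @ C   # low-rank prediction of A
    return float(np.sum((A - A_hat) ** 2))
```

In this sketch one would evaluate `bcv_error` over a range of candidate ranks k (and, typically, over several row and column hold-out sets) and prefer the rank with the smallest total error. Holding out half the rows and half the columns corresponds to passing index arrays that each cover half of their dimension.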

Keywords: Cross-validation; principal components; random matrix theory; sample reuse; weak factor model

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1245676186
Digital Object Identifier: doi:10.1214/08-AOAS227
Zentralblatt MATH identifier: 1166.62047

2009 © Institute of Mathematical Statistics