This article presents a form of bi-cross-validation (BCV) for choosing the rank in outer product models, especially the singular value decomposition (SVD) and the nonnegative matrix factorization (NMF). Instead of leaving out a set of rows of the data matrix, we leave out a set of rows and a set of columns, and then predict the left-out entries by low-rank operations on the retained data. We prove a self-consistency result expressing the prediction error as a residual from a low-rank approximation. Random matrix theory and some empirical results suggest that smaller hold-out sets lead to more over-fitting, while larger ones are more prone to under-fitting. In simulated examples we find that a method leaving out half the rows and half the columns performs well.
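The hold-out scheme described above can be illustrated with a short NumPy sketch. It is not taken from the article: it assumes the data matrix is partitioned into blocks A (held-out rows and columns), B (held-out rows, retained columns), C (retained rows, held-out columns) and D (fully retained), and that the held-out block is predicted as B D_k^+ C, where D_k^+ is the pseudo-inverse of the rank-k truncated SVD of D. The function names `bcv_error` and `bcv_curve` are hypothetical, and the half/half split mirrors the setting the abstract reports as performing well.

```python
import numpy as np

def bcv_error(X, row_idx, col_idx, k):
    """Squared error of predicting the held-out block X[row_idx, col_idx] at rank k.

    Assumed predictor (not stated in the abstract): A_hat = B @ pinv(D_k) @ C.
    Requires k <= rank(D); a zero singular value among the first k would
    cause a division by zero.
    """
    rows = np.zeros(X.shape[0], dtype=bool)
    cols = np.zeros(X.shape[1], dtype=bool)
    rows[row_idx] = True
    cols[col_idx] = True
    A = X[np.ix_(rows, cols)]      # held-out rows and held-out columns
    B = X[np.ix_(rows, ~cols)]     # held-out rows, retained columns
    C = X[np.ix_(~rows, cols)]     # retained rows, held-out columns
    D = X[np.ix_(~rows, ~cols)]    # fully retained block
    if k == 0:
        return float(np.sum(A ** 2))   # rank 0 predicts all zeros
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    # Pseudo-inverse of the rank-k truncation of D: V_k diag(1/s_k) U_k^T.
    Dk_pinv = (Vt[:k].T / s[:k]) @ U[:, :k].T
    A_hat = B @ Dk_pinv @ C
    return float(np.sum((A - A_hat) ** 2))

def bcv_curve(X, ranks, rng):
    """Total held-out error per candidate rank for one random half/half split.

    Rows and columns are each split into two halves; each of the four
    (row half, column half) blocks is held out in turn.
    """
    m, n = X.shape
    rperm, cperm = rng.permutation(m), rng.permutation(n)
    row_groups = [rperm[: m // 2], rperm[m // 2:]]
    col_groups = [cperm[: n // 2], cperm[n // 2:]]
    return [
        sum(bcv_error(X, r, c, k) for r in row_groups for c in col_groups)
        for k in ranks
    ]

# Usage on a synthetic matrix: a rank-2 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
signal = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 16))
noisy = signal + 0.1 * rng.standard_normal(signal.shape)
ranks = [0, 1, 2, 3, 4]
errors = bcv_curve(noisy, ranks, rng)
chosen_rank = ranks[int(np.argmin(errors))]
```

In this sketch the selected rank is simply the minimizer of the summed held-out error. When the matrix is exactly of rank k and the retained block D also has rank k, the prediction B D_k^+ C reproduces the held-out block, so the error at that rank vanishes up to rounding.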