## The Annals of Statistics

### Statistical inference based on robust low-rank data matrix approximation

#### Abstract

The singular value decomposition is widely used to approximate data matrices with lower rank matrices. Feng and He [Ann. Appl. Stat. 3 (2009) 1634–1654] developed tests on dimensionality of the mean structure of a data matrix based on the singular value decomposition. However, the first singular values and vectors can be driven by a small number of outlying measurements. In this paper, we consider a robust alternative that moderates the effect of outliers in low-rank approximations. Under the assumption of random row effects, we provide the asymptotic representations of the robust low-rank approximation. These representations may be used in testing the adequacy of a low-rank approximation. We use oligonucleotide gene microarray data to demonstrate how robust singular value decomposition compares with its traditional counterparts. Examples show that the robust methods often lead to a more meaningful assessment of the dimensionality of gene intensity data matrices.

#### Article information

Source
Ann. Statist., Volume 42, Number 1 (2014), 190–210.

Dates
First available in Project Euclid: 18 February 2014

Permanent link to this document
https://projecteuclid.org/euclid.aos/1392733185

Digital Object Identifier
doi:10.1214/13-AOS1186

Mathematical Reviews number (MathSciNet)
MR3178461

Zentralblatt MATH identifier
1302.62068

#### Citation

Feng, Xingdong; He, Xuming. Statistical inference based on robust low-rank data matrix approximation. Ann. Statist. 42 (2014), no. 1, 190–210. doi:10.1214/13-AOS1186. https://projecteuclid.org/euclid.aos/1392733185

#### References

• Agarwal, A., Negahban, S. and Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Ann. Statist. 40 1171–1197.
• Ammann, L. P. (1993). Robust singular value decompositions: A new approach to projection pursuit. J. Amer. Statist. Assoc. 88 505–514.
• Bolstad, B. M., Irizarry, R. A., Astrand, M. and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19 185–193.
• Candès, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. 11, 37.
• Chen, C., He, X. and Wei, Y. (2008). Lower rank approximation of matrices based on fast and robust alternating regression. J. Comput. Graph. Statist. 17 186–200.
• Feng, X. and He, X. (2009). Inference on low-rank data matrices with applications to microarray data. Ann. Appl. Stat. 3 1634–1654.
• Feng, X. and He, X. (2014). Supplement to “Statistical inference based on robust low-rank data matrix approximation.” DOI:10.1214/13-AOS1186SUPP.
• Gabriel, K. R. and Zamir, S. (1979). Lower rank approximation of matrices by least squares with any choice of weights. Technometrics 21 489–498.
• Gervini, D. and Yohai, V. J. (2002). A class of robust and fully efficient regression estimators. Ann. Statist. 30 583–616.
• Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions, 1st ed. Wiley, New York.
• He, X. and Shao, Q.-M. (1996). A general Bahadur representation of $M$-estimators and its application to linear regression with nonstochastic designs. Ann. Statist. 24 2608–2630.
• Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics, 2nd ed. Wiley, Hoboken, NJ.
• Irizarry, R., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B. and Speed, T. P. (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31 e15.
• Li, C. and Wong, W. H. (2001). Model-based analysis of oligonucleotide arrays: Expression index and outlier detection. Proc. Natl. Acad. Sci. USA 98 31–36.
• Lin, G., He, X., Ji, H., Shi, L., Davis, R. W. and Zhong, S. (2006). Reproducibility probability score–incorporating measurement variability across laboratories for gene selection. Nat. Biotechnol. 24 1476–1477.
• Mammen, E. (1992). When Does Bootstrap Work? Asymptotic Results and Simulations, 1st ed. Springer, New York.
• Nandakumar, R., Yu, F., Li, H. and Stout, W. (1998). Assessing unidimensionality of polytomous data. Appl. Psychol. Meas. 22 99–115.
• Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. J. Amer. Statist. Assoc. 75 828–838.
• Shi, L., Reid, L. H., Jones, W. D. et al. (2006). The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 24 1151–1161.
• Zhou, Z., Li, X., Wright, J., Candès, E. and Ma, Y. (2010). Stable principal component pursuit. In International Symposium on Information Theory, June 2010.

#### Supplemental materials

• Supplementary material: Additional details of case study and technical proofs. We provide details of the case study in Section 4.3 and complete the proofs of technical lemmas, as well as Theorems 3.1–3.2 and 4.2 of this paper.