The Annals of Applied Statistics

Nonparametric Bayesian sparse factor models with application to gene expression modeling

David Knowles and Zoubin Ghahramani

Full-text: Open access


A nonparametric Bayesian extension of Factor Analysis (FA) is proposed in which the observed data Y are modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred. The model's utility for modeling gene expression data is investigated using randomly generated data sets based on a known sparse connectivity matrix for E. coli, and on three biological data sets of increasing complexity.
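As a rough illustration of the generative model the abstract describes, the sketch below samples a binary sparsity pattern from the IBP via its sequential restaurant-style construction and uses it to mask a Gaussian loading matrix. The function and variable names, the Gaussian choices for loadings, factors, and noise, and the parameter settings are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def sample_ibp(num_rows, alpha, rng):
    """Sample a binary matrix from the Indian Buffet Process (IBP).

    Rows are "customers" (observed dimensions), columns are "dishes"
    (latent features); the number of columns is unbounded a priori.
    """
    Z = np.zeros((num_rows, 0), dtype=int)
    dish_counts = np.zeros(0, dtype=int)
    for i in range(num_rows):
        # Take each existing dish k with probability m_k / (i + 1),
        # where m_k is the number of earlier customers who took it.
        take = rng.random(dish_counts.shape[0]) < dish_counts / (i + 1)
        # Sample Poisson(alpha / (i + 1)) brand-new dishes.
        k_new = rng.poisson(alpha / (i + 1))
        if k_new > 0:
            Z = np.hstack([Z, np.zeros((num_rows, k_new), dtype=int)])
            dish_counts = np.concatenate([dish_counts, np.zeros(k_new, dtype=int)])
            take = np.concatenate([take, np.ones(k_new, dtype=bool)])
        Z[i, take] = 1
        dish_counts = dish_counts + Z[i]
    return Z

rng = np.random.default_rng(0)
D, N, alpha = 50, 100, 2.0            # genes, samples, IBP concentration
Z = sample_ibp(D, alpha, rng)         # sparsity pattern of the loadings
K = Z.shape[1]                        # number of factors under this draw
G = Z * rng.normal(size=(D, K))       # sparse loading matrix
X = rng.normal(size=(K, N))           # latent factor activities
Y = G @ X + 0.1 * rng.normal(size=(D, N))  # observed expression matrix
```

Note that the number of columns of Z (and hence the number of factors K) is not fixed in advance but determined by the IBP draw, which is the sense in which the number of latent features can be inferred.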

Article information

Ann. Appl. Stat., Volume 5, Number 2B (2011), 1534-1552.

First available in Project Euclid: 13 July 2011


Keywords: Nonparametric Bayes; sparsity; factor analysis; Markov chain Monte Carlo; Indian buffet process


Knowles, David; Ghahramani, Zoubin. Nonparametric Bayesian sparse factor models with application to gene expression modeling. Ann. Appl. Stat. 5 (2011), no. 2B, 1534--1552. doi:10.1214/10-AOAS435.



References

  • Archambeau, C. and Bach, F. (2009). Sparse probabilistic projections. In Proceedings of the Conference on Neural Information Processing Systems (NIPS) (D. Koller, D. Schuurmans, Y. Bengio and L. Bottou, eds.) 73–80. MIT Press, Cambridge, MA.
  • Courville, A. C., Eck, D. and Bengio, Y. (2009). An infinite factor model hierarchy via a noisy-or mechanism. In Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA.
  • Doshi-Velez, F. and Ghahramani, Z. (2009). Correlated non-parametric latent feature models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence 143–150. AUAI Press, Arlington, VA.
  • Févotte, C. and Godsill, S. J. (2006). A Bayesian approach for blind separation of sparse sources. IEEE Transactions on Audio, Speech, and Language Processing 14 2174–2188.
  • Fokoue, E. (2004). Stochastic determination of the intrinsic structure in Bayesian factor analysis. Technical Report No. 17, Statistical and Applied Mathematical Sciences Institute.
  • Griffiths, T. L. and Ghahramani, Z. (2006). Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA.
  • Kao, K. C., Yang, Y.-L., Boscolo, R., Sabatti, C., Roychowdhury, V. and Liao, J. C. (2004). Transcriptome-based determination of multiple transcription regulator activities in Escherichia coli by using network component analysis. Proc. Natl. Acad. Sci. USA 101 641–646.
  • Kaufman, G. M. and Press, S. J. (1973). Bayesian factor analysis. Technical Report No. 662-73, Sloan School of Management, MIT, Cambridge, MA.
  • Knowles, D. and Ghahramani, Z. (2007). Infinite sparse factor analysis and infinite independent components analysis. In 7th International Conference on Independent Component Analysis and Signal Separation 381–388. Springer, Berlin.
  • Meeds, E., Ghahramani, Z., Neal, R. and Roweis, S. (2006). Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA.
  • Rai, P. and Daumé III, H. (2008). The infinite hierarchical factor regression model. In Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA.
  • Rowe, D. B. and Press, S. J. (1998). Gibbs sampling and hill climbing in Bayesian factor analysis. Technical Report No. 255, Dept. Statistics, Univ. California, Riverside.
  • West, M., Chang, J., Lucas, J., Nevins, J. R., Wang, Q. and Carvalho, C. (2007). High-dimensional sparse factor modelling: Applications in gene expression genomics. Technical report, ISDS, Duke Univ.
  • Witten, D. M., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10 515–534.
  • Yu, Y. P., Landsittel, D., Jing, L., Nelson, J., Ren, B., Liu, L., McDonald, C., Thomas, R., Dhir, R., Finkelstein, S., Michalopoulos, G., Becich, M. and Luo, J.-H. (2004). Gene expression alterations in prostate cancer predicting tumor aggression and preceding development of malignancy. Journal of Clinical Oncology 22 2790–2799.
  • Zhang, Z., Chan, K. L., Kwok, J. T. and Yeung, D.-Y. (2004). Bayesian inference on principal component analysis using reversible jump Markov chain Monte Carlo. In Proceedings of the 19th National Conference on Artificial Intelligence, San Jose, California 372–377. AAAI Press, Menlo Park, CA.
  • Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265–286.

Supplemental materials

  • Supplementary material: Graphs of precision and recall for the synthetic data experiment. The precision and recall of active elements of the Z matrix achieved by each algorithm (after thresholding for the nonsparse algorithms) on the synthetic data experiment, described in Section 5.1. The results are consistent with the reconstruction error.
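The thresholding-based evaluation the supplement describes can be made concrete with a small sketch; the function name, argument conventions, and the 0.1 default threshold are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def sparsity_precision_recall(Z_true, G_est, threshold=0.1):
    """Precision and recall of recovered active loadings.

    Z_true: binary ground-truth connectivity (sparsity) matrix.
    G_est: estimated loading matrix; entries with magnitude above
    `threshold` are declared active (as for the nonsparse baselines).
    """
    Z_est = (np.abs(G_est) > threshold).astype(int)
    true_pos = np.sum((Z_est == 1) & (Z_true == 1))
    precision = true_pos / max(Z_est.sum(), 1)   # guard empty estimate
    recall = true_pos / max(Z_true.sum(), 1)     # guard empty truth
    return precision, recall

# Toy example: two factors, one spurious active loading in G_est.
Z_true = np.array([[1, 0], [0, 1]])
G_est = np.array([[0.5, 0.0], [0.2, 0.9]])
precision, recall = sparsity_precision_recall(Z_true, G_est)
```

High precision with low recall here would indicate an overly sparse estimate, while the reverse indicates spurious active loadings, which is why the supplement reports both alongside reconstruction error.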