The Annals of Applied Statistics

Latent protein trees

Ricardo Henao, J. Will Thompson, M. Arthur Moseley, Geoffrey S. Ginsburg, Lawrence Carin, and Joseph E. Lucas



Unbiased, label-free proteomics is becoming a powerful technique for measuring protein expression in almost any biological sample. The output of these measurements after preprocessing is a collection of features and their associated intensities for each sample. Subsets of features within the data are from the same peptide, subsets of peptides are from the same protein, and subsets of proteins are in the same biological pathways; therefore, there is the potential for very complex and informative correlational structure inherent in these data. Recent attempts to utilize these data often focus on the identification of single features that are associated with a particular phenotype that is relevant to the experiment. However, to date, there have been no published approaches that directly model what we know to be multiple different levels of correlation structure. Here we present a hierarchical Bayesian model that is specifically designed to model such correlation structure in unbiased, label-free proteomics. This model utilizes partial identification information from peptide sequencing and database lookup, as well as the observed correlation in the data, to appropriately compress features into latent proteins and to estimate their correlation structure. We demonstrate the effectiveness of the model using artificial/benchmark data and in the context of a series of proteomics measurements of blood plasma from a collection of volunteers who were infected with two different strains of viral influenza.

Article information

Ann. Appl. Stat., Volume 7, Number 2 (2013), 691-713.

First available in Project Euclid: 27 June 2013


Keywords: proteomics data; hierarchical factor model; coalescent


Henao, Ricardo; Thompson, J. Will; Moseley, M. Arthur; Ginsburg, Geoffrey S.; Carin, Lawrence; Lucas, Joseph E. Latent protein trees. Ann. Appl. Stat. 7 (2013), no. 2, 691–713. doi:10.1214/13-AOAS639.



