The Annals of Applied Statistics

A nested mixture model for protein identification using mass spectrometry

Qunhua Li, Michael J. MacCoss, and Matthew Stephens

Mass spectrometry provides a high-throughput way to identify proteins in biological samples. In a typical experiment, proteins in a sample are first broken into their constituent peptides. The resulting mixture of peptides is then subjected to mass spectrometry, which generates thousands of spectra, each characteristic of its generating peptide. Here we consider the problem of inferring, from these spectra, which proteins and peptides are present in the sample. We develop a statistical approach to the problem, based on a nested mixture model. In contrast to commonly used two-stage approaches, this model provides a one-stage solution that simultaneously identifies which proteins are present, and which peptides are correctly identified. In this way our model incorporates the evidence feedback between proteins and their constituent peptides. Using simulated data and a yeast data set, we compare and contrast our method with existing widely used approaches (PeptideProphet/ProteinProphet) and with a recently published new approach, HSM. For peptide identification, our single-stage approach yields consistently more accurate results. For protein identification the methods have similar accuracy in most settings, although we exhibit some scenarios in which the existing methods perform poorly.
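
The evidence feedback described in the abstract can be illustrated with a toy two-level mixture for a single protein. This is only a minimal sketch under stated assumptions, not the model or code from the paper. It assumes Gaussian score densities with fixed, known parameters (the paper's keywords indicate parameters are estimated with an EM algorithm, which is not shown here), and it assumes every peptide of an absent protein is incorrectly identified. The function name `nested_posteriors` and all numeric values are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def nested_posteriors(scores, prior_protein, prior_peptide,
                      f0=(0.0, 1.0), f1=(3.0, 1.0)):
    """Posterior probabilities for one protein in a toy nested mixture.

    Assumptions (illustrative only):
      - scores: one identification score per candidate peptide
      - prior_protein: prior probability that the protein is present
      - prior_peptide: probability a peptide is correctly identified,
        given that its protein is present
      - f0, f1: (mean, sd) of the score density for incorrect and
        correct identifications
    """
    dens0 = [normal_pdf(s, *f0) for s in scores]
    dens1 = [normal_pdf(s, *f1) for s in scores]

    # Protein absent: every peptide score comes from the "incorrect" density.
    lik_absent = math.prod(dens0)

    # Protein present: each peptide score is a two-component mixture.
    mix = [prior_peptide * d1 + (1.0 - prior_peptide) * d0
           for d0, d1 in zip(dens0, dens1)]
    lik_present = math.prod(mix)

    # Posterior that the protein is present, given all peptide scores.
    numerator = prior_protein * lik_present
    post_protein = numerator / (numerator + (1.0 - prior_protein) * lik_absent)

    # A peptide can only be correct if the protein is present, so its
    # posterior is scaled by the protein-level posterior (the feedback).
    post_peptides = [post_protein * prior_peptide * d1 / m
                     for d1, m in zip(dens1, mix)]
    return post_protein, post_peptides


if __name__ == "__main__":
    # The same borderline peptide score (1.5) with strong vs. weak siblings.
    strong = nested_posteriors([1.5, 3.0, 3.2], 0.5, 0.5)
    weak = nested_posteriors([1.5, 0.0, -0.2], 0.5, 0.5)
    print("borderline peptide, strong siblings:", round(strong[1][0], 3))
    print("borderline peptide, weak siblings:  ", round(weak[1][0], 3))
```

In this sketch the borderline peptide receives a higher posterior when the other peptides of the same protein score well, because the protein-level posterior multiplies every peptide-level posterior. A two-stage procedure that scores peptides first and proteins afterwards would not pass information in that direction.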

Article information

Ann. Appl. Stat., Volume 4, Number 2 (2010), 962-987.

First available in Project Euclid: 3 August 2010

Keywords: Mixture model; nested structure; EM algorithm; protein identification; peptide identification; mass spectrometry; proteomics


Li, Qunhua; MacCoss, Michael J.; Stephens, Matthew. A nested mixture model for protein identification using mass spectrometry. Ann. Appl. Stat. 4 (2010), no. 2, 962--987. doi:10.1214/09-AOAS316.

  • Blei, D., Griffiths, T., Jordan, M. and Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems 18. MIT Press.
  • Choi, H. and Nesvizhskii, A. I. (2008). Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. J. Proteome Res. 7 254–265.
  • Coon, J. J., Syka, J. E., Shabanowitz, J. and Hunt, D. (2005). Tandem mass spectrometry for peptide and protein sequence analysis. BioTechniques 38 519–521.
  • Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38.
  • Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. G. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151–1160.
  • Elias, J., Faherty, B. and Gygi, S. (2005). Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nature Methods 2 667–675.
  • Elias, J. and Gygi, S. (2007). Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods 4 207–214.
  • Eng, J., McCormack, A. and Yates, J. I. (1994). An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5 976–989.
  • Feng, J., Naiman, Q. and Cooper, B. (2007). Probability model for assessing protein assembled from peptide sequences inferred from tandem mass spectrometry data. Anal. Chem. 79 3901–3911.
  • Kall, L., Canterbury, J., Weston, J., Noble, W. S. and MacCoss, M. J. (2007). A semi-supervised machine learning technique for peptide identification from shotgun proteomics datasets. Nature Methods 4 923–925.
  • Keller, A., Purvine, S., Nesvizhskii, A. I., Stolyar, S., Goodlett, D. R. and Kolker, E. (2002). Experimental protein mixture for validating tandem mass spectral analysis. Omics 6 207–212.
  • Keller, A., Nesvizhskii, A., Kolker, E. and Aebersold, R. (2002). Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74 5383–5392.
  • Kinter, M. and Sherman, N. E. (2003). Protein Sequencing and Identification Using Tandem Mass Spectrometry. Wiley, New York.
  • Li, Q. (2008). Statistical methods for peptide and protein identification in mass spectrometry. Ph.D. thesis, Univ. Washington, Seattle, WA.
  • Nesvizhskii, A. I. and Aebersold, R. (2004). Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS. Drug Discovery Today 9 173–181.
  • Nesvizhskii, A. I., Keller, A., Kolker, E. and Aebersold, R. (2003). A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 75 4646–4653.
  • Newton, M. A., Noueiry, A., Sarkar, D. and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5 155–176.
  • Price, T. S., Lucitt, M. B., Wu, W., Austin, D. J., Pizarro, A., Yocum, A. K., Blair, I. A., FitzGerald, G. A. and Grosser, T. (2007). EBP, a program for protein identification using multiple tandem mass spectrometry data sets. Mol. Cell. Proteomics 6 527–536.
  • Purvine, S., Picone, A. F. and Kolker, E. (2004). Standard mixtures for proteome studies. Omics 8 79–92.
  • Sadygov, R., Cociorva, D. and Yates, J. (2004). Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book. Nature Methods 1 195–202.
  • Sadygov, R., Liu, H. and Yates, J. (2004). Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases. Anal. Chem. 76 1664–1671.
  • Sadygov, R. and Yates, J. (2003). A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal. Chem. 75 3792–3798.
  • Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
  • Shen, C., Wang, Z., Shankar, G., Zhang, X. and Li, L. (2008). A hierarchical statistical model to assess the confidence of peptides and proteins inferred from tandem mass spectrometry. Bioinformatics 24 202–208.
  • Steen, H. and Mann, M. (2004). The abc’s (and xyz’s) of peptide sequencing. Nature Reviews 5 699–712.
  • Tabb, D., McDonald, H. and Yates, J. I. (2002). Dtaselect and contrast: Tools for assembling and comparing protein identifications from shotgun proteomics. J. Proteome Res. 1 21–36.
  • Vermunt, J. K. (2003). Multilevel latent class models. Sociological Methodology 33 213–239.