The Annals of Applied Statistics

Liquid chromatography mass spectrometry-based proteomics: Biological and technological aspects

Yuliya V. Karpievitch, Ashoka D. Polpitiya, Gordon A. Anderson, Richard D. Smith, and Alan R. Dabney

Full-text: Open access


Mass spectrometry-based proteomics has become the tool of choice for identifying and quantifying the proteome of an organism. Though recent years have seen tremendous improvements in instrument performance and computational tools, significant challenges remain, and there are many opportunities for statisticians to make important contributions. In the most widely used “bottom-up” approach to proteomics, complex mixtures of proteins are first subjected to enzymatic cleavage; the resulting peptide products are then separated based on chemical or physical properties and analyzed using a mass spectrometer. The two fundamental challenges in the analysis of bottom-up MS-based proteomics data are (1) identifying the proteins that are present in a sample and (2) quantifying the abundance levels of the identified proteins. Both challenges require knowledge of the biological and technological context that gives rise to the observed data, as well as the application of sound statistical principles for estimation and inference. We present an overview of bottom-up proteomics and outline the key statistical issues that arise in protein identification and quantification.

Article information

Ann. Appl. Stat., Volume 4, Number 4 (2010), 1797–1823.

First available in Project Euclid: 4 January 2011


Keywords: LC-MS proteomics, statistics


Karpievitch, Yuliya V.; Polpitiya, Ashoka D.; Anderson, Gordon A.; Smith, Richard D.; Dabney, Alan R. Liquid chromatography mass spectrometry-based proteomics: Biological and technological aspects. Ann. Appl. Stat. 4 (2010), no. 4, 1797–1823. doi:10.1214/10-AOAS341.


