The Annals of Applied Statistics

Feature extraction for proteomics imaging mass spectrometry data

Lyron J. Winderbaum, Inge Koch, Ove J. R. Gustafsson, Stephan Meding, and Peter Hoffmann

Full-text: Open access


Imaging mass spectrometry (IMS) has transformed proteomics by providing an avenue for collecting spatially distributed molecular data. Mass spectrometry data acquired with matrix-assisted laser desorption ionization (MALDI) IMS consist of tens of thousands of spectra, measured at regular grid points across the surface of a tissue section. Unlike the more standard liquid chromatography mass spectrometry, MALDI-IMS preserves the spatial information inherent in the tissue.

Motivated by the need to differentiate cell populations and tissue types in MALDI-IMS data accurately and efficiently, we propose an integrated cluster and feature extraction approach for such data. We work with the derived binary data representing presence/absence of ions, as this is the essential information in the data. Our approach takes advantage of the spatial structure of the data in a noise removal and initial dimension reduction step and applies $k$-means clustering with the cosine distance to the high-dimensional binary data. The combined smoothing-clustering yields spatially localized clusters that clearly show the correspondence with cancer and various noncancerous tissue types.
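As a rough illustration of the smoothing-then-clustering idea, the following Python sketch applies a neighbour-vote smoother to a binary presence/absence cube and then runs $k$-means with the cosine distance on the resulting spectra. The smoothing rule, its threshold `min_neighbours`, the farthest-first initialisation and all function names are assumptions made for illustration only; the procedure used in the paper may differ.

```python
import numpy as np


def smooth_binary(cube, min_neighbours=3):
    """Neighbour-vote smoothing of a binary cube of shape (rows, cols, n_ions).

    Assumed rule (not from the paper): an ion is kept at a pixel only if it
    is present in at least `min_neighbours` of the 8 surrounding pixels.
    """
    rows, cols, _ = cube.shape
    padded = np.pad(cube.astype(int), ((1, 1), (1, 1), (0, 0)))
    counts = np.zeros(cube.shape, dtype=int)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the pixel itself
            counts += padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols, :]
    return (counts >= min_neighbours).astype(np.uint8)


def cosine_kmeans(X, k, n_iter=100, seed=0):
    """k-means with the cosine distance on the rows of X (spectra x ions)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.where(norms == 0, 1.0, norms)  # unit rows; zero rows stay zero

    # Farthest-first initialisation (an illustrative choice).
    centroids = [U[rng.integers(len(U))]]
    for _ in range(1, k):
        best_sim = (U @ np.array(centroids).T).max(axis=1)
        centroids.append(U[best_sim.argmin()])
    centroids = np.array(centroids)

    labels = np.full(len(U), -1)
    for _ in range(n_iter):
        # Smallest cosine distance is the same as largest cosine similarity.
        new_labels = (U @ centroids.T).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = U[labels == j]
            if len(members) > 0:
                mean = members.mean(axis=0)
                norm = np.linalg.norm(mean)
                if norm > 0:
                    centroids[j] = mean / norm
    return labels, centroids
```

In use, a smoothed cube would be reshaped to a spectra-by-ions matrix, for example `smooth_binary(cube).reshape(-1, cube.shape[2])`, before being passed to `cosine_kmeans`; the labels can then be reshaped back to the pixel grid to view the clusters spatially.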

Feature extraction of the high-dimensional binary data is accomplished with our difference in proportions of occurrence (DIPPS) approach, which ranks the variables and selects a set of variables in a data-driven manner. We summarize the best variables in a single image that has a natural interpretation. Application of our method to data from patients with ovarian cancer shows good separation of tissue types and close agreement of our results with tissue types identified by pathologists.
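The abstract does not spell out the DIPPS statistic. Reading the name literally, a variable's score for a given cluster would be the proportion of spectra inside the cluster in which the ion occurs minus the proportion outside the cluster. The Python sketch below follows that reading. The cutoff `n_keep` is a hypothetical stand-in for the paper's data-driven selection rule, which is not described here, and the count-based summary image is only one plausible way of combining the selected variables; all function names are likewise assumptions.

```python
import numpy as np


def dipps_scores(X, in_cluster):
    """Difference in proportions of occurrence for each variable.

    X          : binary matrix, spectra x ions (1 = ion present).
    in_cluster : boolean vector marking the spectra in the cluster of interest.
    """
    X = np.asarray(X, dtype=float)
    in_cluster = np.asarray(in_cluster, dtype=bool)
    if in_cluster.all() or not in_cluster.any():
        raise ValueError("need spectra both inside and outside the cluster")
    p_in = X[in_cluster].mean(axis=0)    # proportion of occurrence inside
    p_out = X[~in_cluster].mean(axis=0)  # proportion of occurrence outside
    return p_in - p_out


def top_variables(scores, n_keep):
    """Indices of the n_keep highest-scoring variables (hypothetical cutoff)."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return order[:n_keep]


def summary_image(cube, selected):
    """Per-pixel count of selected ions present, for a (rows, cols, n_ions) cube.

    One plausible single-image summary, assumed for illustration.
    """
    return np.asarray(cube)[:, :, selected].sum(axis=2)
```

Scores close to 1 mark ions that occur in nearly every spectrum of the cluster and almost nowhere else, which is what makes a ranking by this difference easy to interpret.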

Article information

Ann. Appl. Stat., Volume 9, Number 4 (2015), 1973–1996.

Received: June 2014
Revised: August 2015
First available in Project Euclid: 28 January 2016

Keywords

Proteomics; mass spectrometry data; high-dimensional binary data; MALDI-IMS; unsupervised feature extraction


Winderbaum, Lyron J.; Koch, Inge; Gustafsson, Ove J. R.; Meding, Stephan; Hoffmann, Peter. Feature extraction for proteomics imaging mass spectrometry data. Ann. Appl. Stat. 9 (2015), no. 4, 1973–1996. doi:10.1214/15-AOAS870.

References


  • Aebersold, R. and Mann, M. (2003). Mass spectrometry-based proteomics. Nature 422 198–207.
  • Alexandrov, T. and Bartels, A. (2013). Testing for presence of known and unknown molecules in imaging mass spectrometry. Bioinformatics 29 2335–2342.
  • Alexandrov, T. and Kobarg, J. H. (2011). Efficient spatial segmentation of large imaging mass spectrometry datasets with spatially aware clustering. Bioinformatics 27 i230–i238.
  • Alexandrov, T., Becker, M., Deininger, S.-O., Ernst, G., Wehder, L., Grasmair, M., von Eggeling, F., Thiele, H. and Maass, P. (2010). Spatial segmentation of imaging mass spectrometry data with edge-preserving image denoising and clustering. J. Proteome Res. 9 6535–6546.
  • Alexandrov, T., Chernyavsky, I., Becker, M., von Eggeling, F. and Nikolenko, S. (2013). Analysis and interpretation of imaging mass spectrometry data by clustering mass-to-charge images according to their spatial similarity. Analytical Chemistry 85 11189–11195.
  • America, A. H. and Cordewener, J. H. (2008). Comparative LC-MS: A landscape of peaks and valleys. Proteomics 8 731–749.
  • Aoki, Y., Toyama, A., Shimada, T., Sugita, T., Aoki, C., Umino, Y., Suzuki, A., Aoki, D., Daigo, Y., Nakamura, Y. et al. (2007). A novel method for analyzing formalin-fixed paraffin embedded (FFPE) tissue sections by mass spectrometry imaging. Proceedings of the Japan Academy. Series B, Physical and Biological Sciences 83 205–214.
  • Bonnel, D., Longuespee, R., Franck, J., Roudbaraki, M., Gosset, P., Day, R., Salzet, M. and Fournier, I. (2011). Multivariate analyses for biomarkers hunting and validation through on-tissue bottom-up or in-source decay in MALDI-MSI: Application to prostate cancer. Anal. Bioanal. Chem. 401 149–165.
  • Casadonte, R. and Caprioli, R. M. (2011). Proteomic analysis of formalin-fixed paraffin-embedded tissue by MALDI imaging mass spectrometry. Nat. Protoc. 6 1695–1709.
  • Cornett, D. S., Reyzer, M. L., Chaurand, P. and Caprioli, R. M. (2007). MALDI imaging mass spectrometry: Molecular snapshots of biochemical systems. Nat. Methods 4 828–833.
  • Deininger, S.-O., Ebert, M. P., Fütterer, A., Gerhard, M. and Röcken, C. (2008). MALDI imaging combined with hierarchical clustering as a new tool for the interpretation of complex human cancers. J. Proteome Res. 7 5230–5236. PMID: 19367705.
  • Deininger, S.-O., Cornett, D. S., Paape, R., Becker, M., Pineau, C., Rauser, S., Walch, A. and Wolski, E. (2011). Normalization in MALDI-TOF imaging datasets of proteins: Practical considerations. Anal. Bioanal. Chem. 401 167–181.
  • Deutskens, F., Yang, J. and Caprioli, R. M. (2011). High spatial resolution imaging mass spectrometry and classical histology on a single tissue section. J. Mass Spectrom. 46 568–571.
  • Du, P., Kibbe, W. A. and Lin, S. M. (2006). Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics 22 2059–2065.
  • Garden, R. W. and Sweedler, J. V. (2000). Heterogeneity within MALDI samples as revealed by mass spectrometric imaging. Analytical Chemistry 72 30–36.
  • Gardner, M. (1970). Mathematical games: The fantastic combinations of John Conway’s new solitaire game “life”. Scientific American 223 120–123.
  • Gessel, M. M., Norris, J. L. and Caprioli, R. M. (2014). MALDI imaging mass spectrometry: Spatial molecular analysis to enable a new age of discovery. Journal of Proteomics 107 71–82. Special Issue: 20 Years of Proteomics in memory of Vitaliano Pallini.
  • Gorzolka, K. and Walch, A. (2014). MALDI mass spectrometry imaging of formalin-fixed paraffin-embedded tissues in clinical research. Histology and Histopathology 29 1365–1376.
  • Gray, L. (2003). A mathematician looks at S. Wolfram’s new kind of science. Notices Amer. Math. Soc. 50 200–211.
  • Groseclose, M. R., Andersson, M., Hardesty, W. M. and Caprioli, R. M. (2006). Identification of proteins directly from tissue: In situ tryptic digestions coupled with imaging mass spectrometry. J. Mass Spectrom. 42 254–262.
  • Groseclose, M. R., Massion, P. P., Chaurand, P. and Caprioli, R. M. (2008). High-throughput proteomic analysis of formalin-fixed paraffin-embedded tissue microarrays using MALDI imaging mass spectrometry. Proteomics 8 3715–3724.
  • Gustafsson, O. J. R. (2012). Molecular characterization of metastatic ovarian cancer by MALDI imaging mass spectrometry. Ph.D. thesis, School of Molecular and Biomedical Science, Univ. Adelaide.
  • Gustafsson, J. O. R., Oehler, M. K., Ruszkiewicz, A., McColl, S. R. and Hoffmann, P. (2011). MALDI imaging mass spectrometry (MALDI-IMS)—application of spatial proteomics for ovarian cancer classification and diagnosis. Int. J. Mol. Sci. 12 773–794.
  • Gustafsson, J. O., Eddes, J. S., Meding, S., Koudelka, T., Oehler, M. K., McColl, S. R. and Hoffmann, P. (2012). Internal calibrants allow high accuracy peptide matching between MALDI imaging MS and LC-MS/MS. Journal of Proteomics 75 5093–5105. Special Issue: Imaging Mass Spectrometry: A User’s Guide to a New Technique for Biological and Biomedical Research.
  • Gygi, S. P., Corthals, G. L., Zhang, Y., Rochon, Y. and Aebersold, R. (2000). Evaluation of two-dimensional gel electrophoresis-based proteome analysis technology. Proc. Natl. Acad. Sci. USA 97 9390–9395.
  • Jaccard, P. (1901). Distribution de la Flore Alpine: Dans le Bassin des Dranses et dans quelques régions voisines. Rouge.
  • Jemal, A., Bray, F., Center, M. M., Ferlay, J., Ward, E. and Forman, D. (2011). Global cancer statistics. CA: A Cancer Journal for Clinicians 61 69–90.
  • Jones, E. A., van Remoortere, A., van Zeijl, R. J., Hogendoorn, P. C., Bovée, J. V., Deelder, A. M. and McDonnell, L. A. (2011). Multiple statistical analysis techniques corroborate intratumor heterogeneity in imaging mass spectrometry datasets of myxofibrosarcoma. PloS One 6 e24913.
  • Jones, E. A., Deininger, S.-O., Hogendoorn, P. C., Deelder, A. M. and McDonnell, L. A. (2012). Imaging mass spectrometry statistical analysis. Journal of Proteomics 75 4962–4989. Special Issue: Imaging Mass Spectrometry: A User’s Guide to a New Technique for Biological and Biomedical Research.
  • Karpievitch, Y. V., Polpitiya, A. D., Anderson, G. A., Smith, R. D. and Dabney, A. R. (2010). Liquid chromatography mass spectrometry-based proteomics: Biological and technological aspects. Ann. Appl. Stat. 4 1797–1823.
  • Koch, I. (2013). Analysis of Multivariate and High-Dimensional Data. Cambridge Univ. Press, New York.
  • Koenig, T., Menze, B. H., Kirchner, M., Monigatti, F., Parker, K. C., Patterson, T., Steen, J. J., Hamprecht, F. A. and Steen, H. (2008). Robust prediction of the mascot score for an improved quality assessment in mass spectrometric proteomics. J. Proteome Res. 7 3708–3717.
  • Meding, S., Martin, K., Gustafsson, O. J., Eddes, J. S., Hack, S., Oehler, M. K. and Hoffmann, P. (2012). Tryptic peptide reference data sets for MALDI imaging mass spectrometry on formalin-fixed ovarian cancer tissues. J. Proteome Res. 12 308–315.
  • Morris, J. S. (2012). Statistical methods for proteomic biomarker discovery based on feature extraction or functional modeling approaches. Stat. Interface 5 117–135.
  • Morris, J. S. and Carroll, R. J. (2006). Wavelet-based functional mixed models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 68 179–199.
  • Morris, J. S., Coombes, K. R., Koomen, J., Baggerly, K. A. and Kobayashi, R. (2005). Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics 21 1764–1775.
  • Norris, J. L., Cornett, D. S., Mobley, J. A., Andersson, M., Seeley, E. H., Chaurand, P. and Caprioli, R. M. (2007). Processing MALDI mass spectra to improve mass spectral direct tissue analysis. Int. J. Mass Spectrom. Ion Phys. 260 212–221.
  • Ong, S.-E. and Mann, M. (2005). Mass spectrometry-based proteomics turns quantitative. Nature Chemical Biology 1 252–262.
  • Ricciardelli, C. and Oehler, M. K. (2009). Diverse molecular pathways in ovarian cancer and their clinical significance. Maturitas 62 270–275.
  • Rogowska-Wrzesinska, A., Le Bihan, M.-C., Thaysen-Andersen, M. and Roepstorff, P. (2013). 2D gels still have a niche in proteomics. Journal of Proteomics 88 4–13.
  • Schober, Y., Guenther, S., Spengler, B. and Römpp, A. (2012). Single cell matrix-assisted laser desorption/ionization mass spectrometry imaging. Analytical Chemistry 84 6293–6297.
  • Steurer, S., Borkowski, C., Odinga, S., Buchholz, M., Koop, C., Huland, H., Becker, M., Witt, M., Trede, D., Omidi, M. et al. (2013). MALDI mass spectrometric imaging based identification of clinically relevant signals in prostate cancer using large-scale tissue microarrays. Int. J. Cancer 133 920–928.
  • Stone, G., Clifford, D., Gustafsson, J. O., McColl, S. R. and Hoffmann, P. (2012). Visualisation in imaging mass spectrometry using the minimum noise fraction transform. BMC Research Notes 5 419.
  • Tekwe, C. D., Carroll, R. J. and Dabney, A. R. (2012). Application of survival analysis methodology to the quantitative analysis of LC-MS proteomics data. Bioinformatics 28 1998–2003.
  • Tomasi, C. and Manduchi, R. (1998). Bilateral filtering for gray and color images. In Sixth International Conference on Computer Vision 839–846. IEEE.
  • Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman & Hall, London.
  • Wasinger, V. C., Cordwell, S. J., Cerpa-Poljak, A., Yan, J. X., Gooley, A. A., Wilkins, M. R., Duncan, M. W., Harris, R., Williams, K. L. and Humphery-Smith, I. (1995). Progress with gene-product mapping of the mollicutes: Mycoplasma genitalium. Electrophoresis 16 1090–1094.
  • Wilkins, M. R., Pasquali, C., Appel, R. D., Ou, K., Golaz, O., Sanchez, J.-C., Yan, J. X., Gooley, A. A., Hughes, G., Humphery-Smith, I. et al. (1996). From proteins to proteomes: Large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Nature Biotechnology 14 61–65.
  • Winderbaum, L. J., Koch, I., Gustafsson, O., Meding, S. and Hoffmann, P. (2015a). Supplement to “Feature extraction for proteomics imaging mass spectrometry data.” DOI:10.1214/15-AOAS870SUPPA.
  • Winderbaum, L. J., Koch, I., Gustafsson, O., Meding, S. and Hoffmann, P. (2015b). Supplement to “Feature extraction for proteomics imaging mass spectrometry data.” DOI:10.1214/15-AOAS870SUPPB.
  • Winderbaum, L. J., Koch, I., Gustafsson, O., Meding, S. and Hoffmann, P. (2015c). Supplement to “Feature extraction for proteomics imaging mass spectrometry data.” DOI:10.1214/15-AOAS870SUPPC.
  • Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K. and Zhao, H. (2003). Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19 1636–1643.
  • Yu, W., Wu, B., Huang, T., Li, X., Williams, K. and Zhao, H. (2006). Statistical methods in proteomics. In Springer Handbook of Engineering Statistics 623–638. Springer, Berlin.

Supplemental materials

  • Supplement A: Immunohistochemical Validation. Optical images of immunohistochemical (IHC) tissue stains, validating three proteins as cancer-specific, including the two inferred parent proteins of Table 1. The top row shows patient A replicates, the bottom row patient C replicates.
  • Supplement B: Source Code. Source code, including cache and intermediate data files, capable of reproducing all analyses up to and including compiling this document. Computations were done in MATLAB, and results were compiled in LaTeX using the R package knitr.
  • Supplement C: Peaklist Data. Raw peaklist data, used to generate the intermediate data files in Supplement B [Winderbaum et al. (2015b)].