The Annals of Applied Statistics
- Ann. Appl. Stat.
- Volume 9, Number 4 (2015), 1973-1996.
Feature extraction for proteomics imaging mass spectrometry data
Imaging mass spectrometry (IMS) has transformed proteomics by providing an avenue for collecting spatially distributed molecular data. Mass spectrometry data acquired with matrix assisted laser desorption ionization (MALDI) IMS consist of tens of thousands of spectra, measured at regular grid points across the surface of a tissue section. Unlike the more standard liquid chromatography mass spectrometry, MALDI-IMS preserves the spatial information inherent in the tissue.
Motivated by the need to differentiate cell populations and tissue types in MALDI-IMS data accurately and efficiently, we propose an integrated cluster and feature extraction approach for such data. We work with the derived binary data representing presence/absence of ions, as this is the essential information in the data. Our approach takes advantage of the spatial structure of the data in a noise removal and initial dimension reduction step and applies $k$-means clustering with the cosine distance to the high-dimensional binary data. The combined smoothing-clustering yields spatially localized clusters that clearly show the correspondence with cancer and various noncancerous tissue types.
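The clustering step can be illustrated with a minimal sketch, assuming a spectra-by-ions binary matrix and NumPy. This is not the authors' MATLAB implementation: the spatial smoothing and dimension reduction step is omitted, and the function name `cosine_kmeans` and its `init` parameter are hypothetical. It shows one common way to run $k$-means under the cosine distance: scale each spectrum to unit length, assign it to the centroid with the highest cosine similarity, and renormalize each centroid after averaging.

```python
import numpy as np

def cosine_kmeans(X, k, init=None, n_iter=50, seed=0):
    """Cluster rows of a binary matrix X with k-means under the cosine distance.

    X    : (n_spectra, n_ions) array of 0/1 presence/absence values.
    init : optional row indices used as starting centroids (hypothetical
           convenience parameter; random rows are used when omitted).
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # leave all-zero spectra at the origin
    U = X / norms                      # unit-length rows: dot product = cosine similarity

    if init is None:
        rng = np.random.default_rng(seed)
        init = rng.choice(len(U), size=k, replace=False)
    centroids = U[np.asarray(init)].copy()

    labels = np.full(len(U), -1)
    for _ in range(n_iter):
        # Highest cosine similarity is the same as smallest cosine distance.
        new_labels = (U @ centroids.T).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # assignments are stable
        labels = new_labels
        for c in range(k):
            members = U[labels == c]
            if len(members) == 0:
                continue               # keep the old centroid for an empty cluster
            mean = members.mean(axis=0)
            length = np.linalg.norm(mean)
            if length > 0:
                centroids[c] = mean / length
    return labels

# Toy data: three spectra dominated by ions 0-1, three by ions 2-3.
X_toy = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])
toy_labels = cosine_kmeans(X_toy, k=2, init=[0, 3])
```

Reshaping the returned labels back onto the acquisition grid would give a cluster image of the tissue section. With random starting centroids the result depends on the seed, so several restarts are advisable in practice.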
Feature extraction of the high-dimensional binary data is accomplished with our difference in proportions of occurrence (DIPPS) approach which ranks the variables and selects a set of variables in a data-driven manner. We summarize the best variables in a single image that has a natural interpretation. Application of our method to data from patients with ovarian cancer shows good separation of tissue types and close agreement of our results with tissue types identified by pathologists.
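The ranking part of DIPPS can be sketched as follows, under the assumption that the statistic for each ion is its proportion of occurrence inside a cluster of interest minus its proportion of occurrence in the remaining spectra. The function name `dipps_ranking` is hypothetical, and the data-driven choice of how many top-ranked variables to keep is not reproduced here.

```python
import numpy as np

def dipps_ranking(X, in_cluster):
    """Rank variables by difference in proportions of occurrence.

    X          : (n_spectra, n_ions) binary presence/absence matrix.
    in_cluster : boolean mask, True for spectra in the cluster of interest.
    Returns the per-ion scores and the ion indices sorted from highest
    to lowest score.
    """
    X = np.asarray(X, dtype=bool)
    in_cluster = np.asarray(in_cluster, dtype=bool)
    p_in = X[in_cluster].mean(axis=0)     # proportion of occurrence inside the cluster
    p_out = X[~in_cluster].mean(axis=0)   # proportion of occurrence elsewhere
    scores = p_in - p_out
    order = np.argsort(-scores, kind="stable")
    return scores, order

# Toy data: ion 0 occurs only in the cluster, ion 1 only outside it.
X_toy = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
])
mask_toy = np.array([True, True, False, False])
scores_toy, order_toy = dipps_ranking(X_toy, mask_toy)
```

Scores close to 1 mark ions that occur almost exclusively inside the cluster, and scores close to -1 mark ions that occur almost exclusively outside it.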
Received: June 2014
Revised: August 2015
First available in Project Euclid: 28 January 2016
Winderbaum, Lyron J.; Koch, Inge; Gustafsson, Ove J. R.; Meding, Stephan; Hoffmann, Peter. Feature extraction for proteomics imaging mass spectrometry data. Ann. Appl. Stat. 9 (2015), no. 4, 1973--1996. doi:10.1214/15-AOAS870. https://projecteuclid.org/euclid.aoas/1453994187
- Supplement A: Immunohistochemical Validation. Optical images of immunohistochemical (IHC) tissue stains, validating three proteins as cancer-specific, including the two inferred parent proteins of Table 1. The top row shows patient A replicates; the bottom row shows patient C replicates.
- Supplement B: Source Code. Source code, including cache and intermediate data files, capable of reproducing all analyses up to and including compiling this document. Computations were done in MATLAB, and results were compiled in LaTeX using the R package knitr.
- Supplement C: Peaklist Data. Raw peaklist data, used to generate the intermediate data files in Supplement B [Winderbaum et al. (2015b)].