Open Access
March 2016
Variable selection and prediction with incomplete high-dimensional data
Ying Liu, Yuanjia Wang, Yang Feng, Melanie M. Wall
Ann. Appl. Stat. 10(1): 418-450 (March 2016). DOI: 10.1214/15-AOAS899


We propose a Multiple Imputation Random Lasso (MIRL) method to select important variables and to predict the outcome in an epidemiological study of Eating and Activity in Teens. In this study, $80\%$ of individuals have at least one variable missing, so applying variable selection methods developed for complete data after listwise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. MIRL combines penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and its advantage is greater when the correlation among variables is high and the proportion of missing data is high. When applied to the study of Eating and Activity in Teens, MIRL shows improved performance compared with other applicable methods, both for boys and girls analyzed separately and for a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity.
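The general idea of combining multiple imputation, resampling, and a penalized fit can be illustrated with a small sketch. This is not the authors' MIRL algorithm: it is a simplified illustration under stated assumptions, using only the Python standard library. The imputation step is a crude stand-in (random draws from each column's observed distribution) rather than a proper multiple imputation model, a plain lasso replaces the random lasso, and all function names, the penalty `lam`, the toy data, and the 0.8 frequency threshold are hypothetical choices made for illustration.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=50, tol=1e-6):
    """Lasso by coordinate descent: minimizes (1/2n)||y - Xb||^2 + lam*||b||_1.
    No intercept is fitted, so X columns and y are assumed to be centered."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals for beta = 0
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def impute_once(X, rng):
    """Assumption: crude stochastic imputation. Each missing entry (None) is
    replaced by a normal draw using the column's observed mean and sd."""
    filled = [list(row) for row in X]
    for j in range(len(X[0])):
        obs = [row[j] for row in X if row[j] is not None]
        mean = sum(obs) / len(obs)
        sd = (sum((v - mean) ** 2 for v in obs) / max(len(obs) - 1, 1)) ** 0.5
        for row in filled:
            if row[j] is None:
                row[j] = rng.gauss(mean, sd)
    return filled

def center(X, y):
    """Center each column of X and the outcome y."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    return [[row[j] - means[j] for j in range(p)] for row in X], [v - y_mean for v in y]

def selection_frequencies(X, y, lam, n_imputations=5, n_bootstrap=20, seed=0):
    """Fraction of (imputation, bootstrap) fits in which each variable
    receives a nonzero lasso coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_imputations):
        X_imp = impute_once(X, rng)
        for _ in range(n_bootstrap):
            idx = [rng.randrange(n) for _ in range(n)]  # bootstrap rows
            Xb, yb = center([X_imp[i] for i in idx], [y[i] for i in idx])
            beta = lasso_cd(Xb, yb, lam)
            for j in range(p):
                if beta[j] != 0.0:
                    counts[j] += 1
    total = n_imputations * n_bootstrap
    return [c / total for c in counts]

# Toy data (hypothetical): only variables 0 and 1 affect the outcome,
# and each entry of X is missing independently with probability 0.2.
rng = random.Random(1)
n, p = 200, 6
X_full = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2 * r[0] - 2 * r[1] + rng.gauss(0, 0.5) for r in X_full]
X_miss = [[None if rng.random() < 0.2 else v for v in r] for r in X_full]

freq = selection_frequencies(X_miss, y, lam=0.5)
selected = [j for j, f in enumerate(freq) if f >= 0.8]
```

In this toy setup most rows are expected to have at least one missing entry, so listwise deletion would discard most of the sample, whereas the sketch uses every row. Ranking variables by `freq` and keeping those above a threshold mirrors the stability-selection idea named in the abstract, in simplified form.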


Ying Liu, Yuanjia Wang, Yang Feng, Melanie M. Wall. "Variable selection and prediction with incomplete high-dimensional data." Ann. Appl. Stat. 10(1): 418-450, March 2016.


Received: 1 July 2014; Revised: 1 December 2015; Published: March 2016
First available in Project Euclid: 25 March 2016

zbMATH: 06586151
MathSciNet: MR3480502
Digital Object Identifier: 10.1214/15-AOAS899

Keywords: missing data, multiple imputation, random lasso, stability selection, variable ranking, variable selection

Rights: Copyright © 2016 Institute of Mathematical Statistics
