The Annals of Applied Statistics

Using linear predictors to impute allele frequencies from summary or pooled genotype data

Xiaoquan Wen and Matthew Stephens

Full-text: Open access


Recently developed genotype imputation methods are a powerful tool for detecting untyped genetic variants that affect disease susceptibility in genetic association studies. However, existing imputation methods require individual-level genotype data, whereas, in practice, it is often the case that only summary data are available. For example, this may occur because, for reasons of privacy or politics, only summary data are made available to the research community at large; or because only summary data are collected, as in DNA pooling experiments. In this article we introduce a new statistical method that can accurately infer the frequencies of untyped genetic variants in these settings, and indeed substantially improve frequency estimates at typed variants in pooling experiments where observations are noisy. Our approach, which predicts each allele frequency using a linear combination of observed frequencies, is statistically straightforward, and related to a long history of the use of linear methods for estimating missing values (e.g., Kriging). The main statistical novelty is our approach to regularizing the covariance matrix estimates, and the resulting linear predictors, which is based on methods from population genetics. We find that these linear methods, besides being both fast and flexible (they allow new problems to be tackled that cannot be handled by existing imputation approaches purpose-built for the genetic context), are also very accurate. Indeed, imputation accuracy using this approach is similar to that obtained by state-of-the-art imputation methods that use individual-level data, but at a fraction of the computational cost.
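The idea of predicting an untyped allele frequency as a linear combination of observed frequencies can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a reference panel of 0/1 haplotypes is available, and it uses a simple shrinkage of the off-diagonal covariances toward zero as a stand-in for the population-genetics-based regularization described in the abstract. The function name `linear_impute` and the parameters `shrink` and `ridge` are hypothetical.

```python
import numpy as np


def linear_impute(panel, typed_idx, untyped_idx, observed_freq,
                  shrink=0.1, ridge=1e-8):
    """Predict untyped allele frequencies from observed typed frequencies.

    panel         : (n_haplotypes, n_snps) array of 0/1 reference haplotypes
    typed_idx     : column indices of SNPs with observed frequencies
    untyped_idx   : column indices of SNPs to impute
    observed_freq : observed allele frequencies at the typed SNPs
    shrink, ridge : hypothetical regularization knobs (assumptions, not the
                    paper's population-genetics-based regularization)
    """
    panel = np.asarray(panel, dtype=float)
    mu = panel.mean(axis=0)                         # panel allele frequencies
    sigma = np.cov(panel, rowvar=False, bias=True)  # panel covariance between SNPs

    # Stand-in regularization: shrink off-diagonal covariances toward zero.
    sigma_reg = (1.0 - shrink) * sigma + shrink * np.diag(np.diag(sigma))

    # Typed-typed block, with a small ridge so the solve stays well defined.
    s_tt = sigma_reg[np.ix_(typed_idx, typed_idx)] + ridge * np.eye(len(typed_idx))
    # Untyped-typed block.
    s_ut = sigma_reg[np.ix_(untyped_idx, typed_idx)]

    # Best linear predictor: mu_u + S_ut S_tt^{-1} (observed - mu_t).
    resid = np.asarray(observed_freq, dtype=float) - mu[typed_idx]
    pred = mu[untyped_idx] + s_ut @ np.linalg.solve(s_tt, resid)
    return np.clip(pred, 0.0, 1.0)                  # frequencies lie in [0, 1]


# Toy usage: SNP 2 is a copy of SNP 0 in the panel, SNP 1 is only weakly related.
toy_panel = np.array([[0, 0, 0],
                      [1, 0, 1],
                      [1, 1, 1],
                      [0, 1, 0],
                      [1, 1, 1],
                      [0, 0, 0]])
toy_pred = linear_impute(toy_panel, [0, 1], [2], [0.7, 0.4], shrink=0.0)
```

With no shrinkage, the toy prediction for SNP 2 tracks the observed frequency at its perfectly correlated neighbor SNP 0. Raising `shrink` pulls the prediction back toward the panel frequency, which is the usual trade-off when the panel covariance is estimated from few haplotypes.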

Article information

Ann. Appl. Stat., Volume 4, Number 3 (2010), 1158-1182.

First available in Project Euclid: 18 October 2010


Keywords: regularized linear predictor, shrinkage estimation, genotype imputation, genetic association study


Wen, Xiaoquan; Stephens, Matthew. Using linear predictors to impute allele frequencies from summary or pooled genotype data. Ann. Appl. Stat. 4 (2010), no. 3, 1158–1182. doi:10.1214/10-AOAS338.


