The Annals of Applied Statistics

Bayesian large-scale multiple regression with summary statistics from genome-wide association studies

Xiang Zhu and Matthew Stephens



Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors, they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a “Regression with Summary Statistics” (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which can also be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations, RSS performs similarly to analyses using the individual data, both for estimating heritability and for detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at
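To make the idea of the RSS likelihood concrete, the following is a minimal sketch rather than the authors' software. It assumes the form of the likelihood given in the full paper, in which the vector of univariate effect estimates is modeled as multivariate normal with mean S R S⁻¹ β and covariance S R S, where S is the diagonal matrix of univariate standard errors and R is the SNP correlation (LD) matrix. The function name `rss_loglik` and its argument names are hypothetical, and R is assumed to be positive definite.

```python
import numpy as np

def rss_loglik(beta, betahat, se, R):
    """Log-density of betahat under N(S R S^{-1} beta, S R S), with S = diag(se).

    beta    : multiple-regression coefficients (length p)
    betahat : univariate (single-SNP) effect estimates (length p)
    se      : standard errors of the univariate estimates (length p)
    R       : p x p SNP correlation matrix, assumed positive definite
    """
    beta = np.asarray(beta, dtype=float)
    betahat = np.asarray(betahat, dtype=float)
    se = np.asarray(se, dtype=float)
    R = np.asarray(R, dtype=float)
    p = betahat.size

    mean = se * (R @ (beta / se))           # S R S^{-1} beta
    cov = (se[:, None] * R) * se[None, :]   # S R S

    # Evaluate the Gaussian log-density through a Cholesky factor of cov.
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, betahat - mean)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + z @ z)
```

When R is the identity matrix (no correlation among SNPs), this expression reduces to a product of independent normal densities, one per SNP, centered at each multiple-regression coefficient. In a Bayesian analysis such a function would be combined with a prior on `beta` and evaluated inside a sampler; the dense Cholesky step here is for illustration only and would not scale to a million SNPs.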

Article information

Ann. Appl. Stat., Volume 11, Number 3 (2017), 1561-1592.

Received: March 2016
Revised: April 2017
First available in Project Euclid: 5 October 2017


Keywords: Summary statistics; Bayesian regression; genome-wide association study; multiple-SNP analysis; variable selection; heritability; explained variation; Markov chain Monte Carlo


Zhu, Xiang; Stephens, Matthew. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Ann. Appl. Stat. 11 (2017), no. 3, 1561–1592. doi:10.1214/17-AOAS1046.




Supplemental materials

  • Supplement to “Bayesian large-scale multiple regression with summary statistics from genome-wide association studies”. This file contains all the Appendices, Supplementary Tables and Figures referenced in the main text.