The Annals of Applied Statistics

Gene-proximity models for genome-wide association studies

Ian Johnston, Timothy Hancock, Hiroshi Mamitsuka, and Luis Carvalho

Full-text: Open access


Motivated by the important problem of detecting association between genetic markers and binary traits in genome-wide association studies, we present a novel Bayesian model that establishes a hierarchy between markers and genes by defining weights according to gene lengths and distances from genes to markers. The proposed hierarchical model uses these weights to define unique prior probabilities of association for markers based on their proximities to genes that are believed to be relevant to the trait of interest. We use an expectation-maximization algorithm in a filtering step to first reduce the dimensionality of the data and then sample from the posterior distribution of the model parameters to estimate posterior probabilities of association for the markers. We offer practical and meaningful guidelines for the selection of the model tuning parameters and propose a pipeline that exploits a singular value decomposition on the raw data to make our model run efficiently on large data sets. We demonstrate the performance of the model in simulation studies and conclude by discussing the results of a case study using a real-world data set provided by the Wellcome Trust Case Control Consortium.
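The abstract does not state how the weights or prior probabilities are computed, so the following is only a minimal sketch of the general idea, not the authors' formula. It assumes a hypothetical exponential kernel that decays with marker-to-gene distance (more slowly for longer genes) and a hypothetical logistic link with made-up coefficients `xi0` and `xi1`; every function name, parameter, and data value here is illustrative.

```python
import math

def distance_to_gene(pos, start, end):
    """Base-pair distance from a marker to a gene; 0 if the marker lies inside it."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0

def proximity_weight(pos, start, end, phi=1.0):
    """Hypothetical kernel (an assumption): decays with distance, scaled by gene length."""
    length = end - start
    d = distance_to_gene(pos, start, end)
    return math.exp(-d / (phi * length))

def prior_probability(pos, genes, xi0=-4.0, xi1=3.0):
    """Hypothetical logistic link (an assumption) from weighted gene relevance to a prior."""
    score = sum(r * proximity_weight(pos, s, e) for (s, e, r) in genes)
    return 1.0 / (1.0 + math.exp(-(xi0 + xi1 * score)))

# Toy genes as (start, end, relevance); all values are made up for illustration.
genes = [(1_000, 5_000, 1.0), (50_000, 52_000, 0.0)]
for pos in (3_000, 6_000, 51_000):
    print(pos, round(prior_probability(pos, genes), 4))
```

Under these assumptions, a marker inside a relevant gene receives the largest prior probability, a marker just outside it a smaller one, and a marker near only an irrelevant gene falls back to roughly the baseline set by `xi0`.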

Article information

Ann. Appl. Stat., Volume 10, Number 3 (2016), 1217-1244.

Received: October 2013
Revised: September 2015
First available in Project Euclid: 28 September 2016

Keywords

Large $p$ small $n$; hierarchical Bayes; Pólya–Gamma latent variable


Citation

Johnston, Ian; Hancock, Timothy; Mamitsuka, Hiroshi; Carvalho, Luis. Gene-proximity models for genome-wide association studies. Ann. Appl. Stat. 10 (2016), no. 3, 1217–1244. doi:10.1214/16-AOAS907.




Supplemental materials

  • Extended results tables and figures. We provide figures and tables to summarize the results of additional simulation studies with less stringent effect sizes as well as the findings in our case study.