The Annals of Applied Statistics

Detection boundary and Higher Criticism approach for rare and weak genetic effects

Zheyang Wu, Yiming Sun, Shiquan He, Judy Cho, Hongyu Zhao, and Jiashun Jin

Full-text: Open access

Abstract

Genome-wide association studies (GWAS) have identified many genetic factors underlying complex human traits. However, these factors explain only a small fraction of the traits’ genetic heritability, and it is argued that many more genetic factors remain undiscovered. These undiscovered factors are likely to be weakly associated at the population level and sparsely distributed across the genome. In this paper, we adapt recent innovations on Tukey’s Higher Criticism (Tukey [The Higher Criticism (1976) Princeton Univ.]; Donoho and Jin [Ann. Statist. 32 (2004) 962–994]) to SNP-set analysis of GWAS, and develop a new theoretical framework in large-scale inference to assess the joint significance of such rare and weak effects for a quantitative trait. At the core of our theory is the so-called detection boundary, a curve in the two-dimensional phase space that quantifies the rarity and strength of genetic effects. Above the detection boundary, the overall effects of the genetic factors are strong enough for reliable detection. Below the detection boundary, the genetic factors are simply too rare and too weak for reliable detection. We show that HC-type methods are optimal in that they reliably yield detection once the parameters of the genetic effects fall above the detection boundary, and that many commonly used SNP-set methods are suboptimal. The superior performance of the HC-type approach is demonstrated through simulations and the analysis of a GWAS data set of Crohn’s disease.
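As a rough illustration of the two objects the abstract names, the following sketch computes the Higher Criticism statistic in the form given by Donoho and Jin (2004) and evaluates their detection boundary for the sparse Gaussian mixture model. It is not the SNP-set procedure or the boundary derived in this paper, which are not reproduced on this page. The function names, the parameter `alpha0`, and the symbols `beta` (sparsity) and the returned signal-strength threshold follow the Donoho and Jin setting and are assumptions introduced here.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """HC statistic from a list of p-values (Donoho and Jin 2004 form).

    HC = max over i <= alpha0 * n of
         sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    where p_(1) <= ... <= p_(n) are the sorted p-values.
    alpha0 is a tuning fraction (hypothetical default of 0.5).
    """
    n = len(p_values)
    p_sorted = sorted(p_values)
    best = -math.inf
    for i, p in enumerate(p_sorted, start=1):
        if i > alpha0 * n:
            break
        # Skip degenerate p-values to avoid dividing by zero.
        if p <= 0.0 or p >= 1.0:
            continue
        hc_i = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, hc_i)
    return best


def donoho_jin_boundary(beta):
    """Detection boundary for the sparse Gaussian mixture, 1/2 < beta < 1.

    A fraction n**(-beta) of the n means equals sqrt(2 * r * log(n));
    detection is possible asymptotically when r exceeds the returned value.
    """
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0.5 and 1")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2
```

A large value of `higher_criticism` on a set of marginal p-values suggests that at least a few effects are non-null, while `donoho_jin_boundary(beta)` gives the signal strength below which, in that idealized model, no test can detect the signal reliably.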

Article information

Source
Ann. Appl. Stat., Volume 8, Number 2 (2014), 824–851.

Dates
First available in Project Euclid: 1 July 2014

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1404229516

Digital Object Identifier
doi:10.1214/14-AOAS724

Mathematical Reviews number (MathSciNet)
MR3262536

Zentralblatt MATH identifier
06333778

Keywords
Multiple hypotheses testing; large-scale inference; detection boundary; Higher Criticism; rare and weak effects; statistical power; genome-wide association studies; SNP-set methods

Citation

Wu, Zheyang; Sun, Yiming; He, Shiquan; Cho, Judy; Zhao, Hongyu; Jin, Jiashun. Detection boundary and Higher Criticism approach for rare and weak genetic effects. Ann. Appl. Stat. 8 (2014), no. 2, 824–851. doi:10.1214/14-AOAS724. https://projecteuclid.org/euclid.aoas/1404229516



References

  • Ansorge, W. J. (2009). Next-generation DNA sequencing techniques. N. Biotechnol. 25 195–203.
  • Arias-Castro, E., Candès, E. J. and Plan, Y. (2011). Global testing under sparse alternatives: ANOVA, multiple comparisons and the Higher Criticism. Ann. Statist. 39 2533–2556.
  • Arnon, T. I., Xu, Y., Lo, C., Pham, T., An, J., Coughlin, S., Dorn, G. W. and Cyster, J. G. (2011). GRK2-dependent S1PR1 desensitization is required for lymphocytes to overcome their attraction to blood. Science 333 1898.
  • Ayers, K. L. and Cordell, H. J. (2010). SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet. Epidemiol. 34 879–891.
  • Ballard, D. H., Cho, J. and Zhao, H. (2010). Comparisons of multi-marker association methods to detect association between a candidate region and disease. Genet. Epidemiol. 34 201–212.
  • Baumgart, D. C. and Sandborn, W. J. (2007). Inflammatory bowel disease: Clinical aspects and established and evolving therapies. Lancet 369 1641–1657.
  • Baumgart, D. C. and Sandborn, W. J. (2012). Crohn’s disease. Lancet 380 1590–1605.
  • Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57 289–300.
  • Binns, D., Dimmer, E., Huntley, R., Barrell, D., O’Donovan, C. and Apweiler, R. (2009). QuickGO: A web-based tool for Gene Ontology searching. Bioinformatics 25 3045–3046.
  • Brandtzaeg, P. and Pabst, R. (2004). Let’s go mucosal: Communication on slippery ground. Trends Immunol. 25 570–577.
  • By, K. and Qaqish, B. (2011). mvtBinaryEP: Generates correlated binary data (R package).
  • Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32 962–994.
  • Donoho, D. and Jin, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 105 14790–14795.
  • Duerr, R. H., Taylor, K. D., Brant, S. R., Rioux, J. D., Silverberg, M. S., Daly, M. J., Steinhart, A. H., Abraham, C., Regueiro, M., Griffiths, A. et al. (2006). A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 314 1461.
  • Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96–104.
  • Efron, B. (2007a). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102 93–103.
  • Efron, B. (2007b). Size, power and false discovery rates. Ann. Statist. 35 1351–1377.
  • Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151–1160.
  • Emrich, L. J. and Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate binary variates. Amer. Statist. 45 302–304.
  • Falconer, D. S., Mackay, T. F. C. and Frankham, R. (1996). Introduction to quantitative genetics (4th edition). Trends in Genetics 12 280.
  • Franke, A., McGovern, D. P. B., Barrett, J. C., Wang, K., Radford-Smith, G. L., Ahmad, T., Lees, C. W., Balschun, T., Lee, J., Roberts, R. et al. (2010). Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat. Genet. 42 1118–1125.
  • Genovese, C., Jin, J. and Wasserman, L. (2009). Revisiting marginal regression. Preprint. Available at arXiv:0911.4080v1.
  • Goldstein, D. B. (2009). Common genetic variation and human traits. N. Engl. J. Med. 360 1696–1698.
  • Guan, Y. and Stephens, M. (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl. Stat. 5 1780–1815.
  • Hall, P. and Jin, J. (2008). Properties of higher criticism under strong dependence. Ann. Statist. 36 381–402.
  • Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. Ann. Statist. 38 1686–1732.
  • Hall, P., Jin, J. and Miller, H. (2009). Feature selection when there are many influential features. Preprint. Available at arXiv:0911.4076.
  • He, S. and Wu, Z. (2011). Gene-based Higher Criticism methods for large-scale exonic single-nucleotide polymorphism data. BMC Proceedings 5 S65.
  • Hoggart, C. J., Whittaker, J. C., Iorio, M. D. and Balding, D. J. (2008). Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 4 e1000130.
  • Hoh, J. and Ott, J. (2003). Mathematical multi-locus approaches to localizing complex human trait genes. Nat. Rev. Genet. 4 701–709.
  • Hoh, J., Wille, A. and Ott, J. (2001). Trimming, weighting, and grouping SNPs in human case–control association studies. Genome Res. 11 2115–2119.
  • Ingster, Y. I. (2002). Adaptive detection of a signal of growing dimension. II. Math. Methods Statist. 11 37–68.
  • Ingster, Y. I., Tsybakov, A. B. and Verzelen, N. (2010). Detection boundary in sparse regression. Electron. J. Stat. 4 1476–1526.
  • Jin, J. and Wang, L. (2013). Spectral clustering by Higher Criticism Thresholding. Unpublished manuscript.
  • Kraft, P. and Hunter, D. J. (2009). Genetic risk prediction—Are we there yet? N. Engl. J. Med. 360 1701.
  • Li, M., Wang, K., Grant, S. F. A., Hakonarson, H. and Li, C. (2009). ATOM: A powerful gene-based association test by combining optimally weighted markers. Bioinformatics 25 497–503.
  • Liu, D., Lin, X. and Ghosh, D. (2007). Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics 63 1079–1088, 1311.
  • Loftus, E. V., Schoenfeld, P. and Sandborn, W. J. (2002). The epidemiology and natural history of Crohn’s disease in population-based patient cohorts from North America: A systematic review. Alimentary Pharmacology & Therapeutics 16 51–60.
  • Luo, L., Peng, G., Zhu, Y., Dong, H., Amos, C. I. and Xiong, M. (2010). Genome-wide gene and pathway analysis. Eur. J. Hum. Genet. 18 1045–1053.
  • Mardis, E. R. (2008). Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9 387–402.
  • McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P. A. and Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: Consensus, uncertainty and challenges. Nat. Rev. Genet. 9 356–369.
  • Mendel, G. (1866). Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschenden Vereines in Brünn, Bd. IV für das Jahr 1865, Abhandlungen, 3–47. Genetic Theory 295 3–47.
  • Metzker, M. L. (2010). Sequencing technologies—The next generation. Nat. Rev. Genet. 11 31–46.
  • Mukhopadhyay, I., Feingold, E., Weeks, D. E. and Thalamuthu, A. (2010). Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genet. Epidemiol. 34 213–221.
  • Pearson, K. (1904). Mathematical contributions to the theory of evolution. XII. On a generalised theory of alternative inheritance, with special reference to Mendel’s laws. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 203 53–86.
  • Peng, G., Luo, L., Siu, H., Zhu, Y., Hu, P., Hong, S., Zhao, J., Zhou, X., Reveille, J. D. and Jin, L. (2009). Gene and pathway-based second-wave analysis of genome-wide association studies. Eur. J. Hum. Genet. 18 111–117.
  • The UniProt Consortium (2012). Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 40 D71–D75.
  • Tukey, J. W. (1976). The higher criticism. Course Notes, Statistics 411, Princeton Univ.
  • Wade, N. (2009). Genes show limited value in predicting diseases. New York Times April 16.
  • Wallukat, G., Homuth, V., Fischer, T., Lindschau, C., Horstkamp, B., Jüpner, A., Baur, E., Nissen, E., Vetter, K., Neichel, D. et al. (1999). Patients with preeclampsia develop agonistic autoantibodies against the angiotensin AT1 receptor. J. Clin. Invest. 103 945–952.
  • Wang, K. and Abbott, D. (2008). A principal components regression approach to multilocus genetic association studies. Genet. Epidemiol. 32 108–118.
  • Wang, K., Li, M. and Bucan, M. (2007). Pathway-based approaches for analysis of genomewide association studies. Am. J. Hum. Genet. 81 1278–1283.
  • Wellner, J. A. (1978). Limit theorems for the ratio of the empirical distribution function to the true distribution function. Z. Wahrsch. Verw. Gebiete 45 73–88.
  • Wu, Z. and Zhao, H. (2009). Statistical power of model selection strategies for genome-wide association studies. PLoS Genet. 5 e1000582.
  • Wu, Z. and Zhao, H. (2012). On model selection strategies to identify genes underlying binary traits using genome-wide association data. Statist. Sinica 22 1041–1074.
  • Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E. and Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25 714–721.
  • Wu, M. C., Kraft, P., Epstein, M. P., Taylor, D. M., Chanock, S. J., Hunter, D. J. and Lin, X. (2010). Powerful SNP-set analysis for case–control genome-wide association studies. Am. J. Hum. Genet. 86 929–942.
  • Wu, Z., Sun, Y., He, S., Cho, J. H., Zhao, H. and Jin, J. (2014). Supplement to “Detection boundary and Higher Criticism approach for rare and weak genetic effects.” DOI:10.1214/14-AOAS724SUPP.
  • Xie, J., Cai, T. T. and Li, H. (2011). Sample size and power analysis for sparse signal recovery in genome-wide association studies. Biometrika 98 273–290.
  • Yang, H. C., Hsieh, H. Y. and Fann, C. S. J. (2008). Kernel-based association test. Genetics 179 1057–1068.
  • Yu, K., Li, Q., Bergen, A. W., Pfeiffer, R. M., Rosenberg, P. S., Caporaso, N., Kraft, P. and Chatterjee, N. (2009). Pathway analysis by adaptive combination of P-values. Genet. Epidemiol. 33 700–709.
  • Yule, G. U. (1902). Mendel’s laws and their probable relations to intra-racial heredity. The New Phytologist 1 193–207.
  • Zhang, D. and Lin, X. (2003). Hypothesis testing in semiparametric additive mixed models. Biostatistics 4 57–74.
  • Zuo, Y., Zou, G. and Zhao, H. (2006). Two-stage designs in case–control association analysis. Genetics 173 1747–1760.

Supplemental materials

  • Supplementary material: Supplement to “Detection boundary and Higher Criticism approach for rare and weak genetic effects.” We provide the proofs of the main theoretical results, the fundamental lemmas and their proofs, and additional figures and tables that show the performance of Higher Criticism in comparison with other methods under a variety of setups.