The Annals of Applied Statistics

Refining genetically inferred relationships using treelet covariance smoothing

Andrew Crossett, Ann B. Lee, Lambertus Klei, Bernie Devlin, and Kathryn Roeder


Recent technological advances coupled with large sample sets have uncovered many factors underlying the genetic basis of traits and the predisposition to complex disease, but much is left to discover. A common thread to most genetic investigations is familial relationships. Close relatives can be identified from family records, and more distant relatives can be inferred from large panels of genetic markers. Unfortunately, these empirical estimates can be noisy, especially regarding distant relatives. We propose a new method for denoising genetically inferred relationship matrices by exploiting the underlying structure due to hierarchical groupings of correlated individuals. The approach, which we call Treelet Covariance Smoothing, employs a multiscale decomposition of covariance matrices to improve estimates of pairwise relationships. On both simulated and real data, we show that smoothing leads to better estimates of the relatedness amongst distantly related individuals. We illustrate our method with a large genome-wide association study and estimate the “heritability” of body mass index quite accurately. Traditionally, heritability, defined as the fraction of the total trait variance attributable to additive genetic effects, is estimated from samples of closely related individuals using random effects models. We show that by using smoothed relationship matrices we can estimate heritability using population-based samples. Finally, while our methods have been developed for refining genetic relationship matrices and improving estimates of heritability, they have much broader potential application in statistics, most notably for error-in-variables random effects models and for settings that require regularization of matrices with block or hierarchical structure.
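
The definition of heritability given in the abstract is commonly written in terms of a variance components (random effects) model. The sketch below uses assumed notation that does not appear in the abstract: y is the trait vector, X the fixed-effect covariates with coefficients \beta, g the additive genetic random effect, e the residual, A a relationship matrix among individuals (the kind of matrix the smoothing method is meant to denoise), and \sigma_g^2 and \sigma_e^2 the genetic and residual variance components. It is a standard textbook formulation, not necessarily the exact model used in the paper.

```latex
% Assumed notation (illustrative only; not taken from the abstract):
%   y: trait vector, X: covariates, \beta: fixed effects,
%   g: additive genetic effect, e: residual, A: relationship matrix.
\begin{align}
  y &= X\beta + g + e, \qquad
  g \sim N\!\left(0,\ \sigma_g^2 A\right), \qquad
  e \sim N\!\left(0,\ \sigma_e^2 I\right), \\
  \operatorname{Var}(y) &= \sigma_g^2 A + \sigma_e^2 I, \\
  h^2 &= \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.
\end{align}
```

Under this formulation, noise in the off-diagonal entries of A (the estimated relatedness between distant relatives) carries over directly into the estimates of \sigma_g^2 and hence of h^2, which is why a denoised relationship matrix matters for population-based samples.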

Article information

Ann. Appl. Stat., Volume 7, Number 2 (2013), 669-690.

First available in Project Euclid: 27 June 2013

Keywords: Covariance estimation; cryptic relatedness; genome-wide association; heritability; kinship


Crossett, Andrew; Lee, Ann B.; Klei, Lambertus; Devlin, Bernie; Roeder, Kathryn. Refining genetically inferred relationships using treelet covariance smoothing. Ann. Appl. Stat. 7 (2013), no. 2, 669--690. doi:10.1214/12-AOAS598.


