## The Annals of Applied Statistics

### Improving sequence-based genotype calls with linkage disequilibrium and pedigree information

#### Abstract

Whole and targeted sequencing of human genomes is a promising, increasingly feasible tool for discovering genetic contributions to risk of complex diseases. A key step is calling an individual’s genotype from the multiple aligned short read sequences of that individual’s DNA, each of which is subject to nucleotide read error. Current methods are designed to call genotypes separately at each locus from the sequence data of unrelated individuals. Here we propose likelihood-based methods that improve calling accuracy by exploiting two features of sequence data. The first is the linkage disequilibrium (LD) between nearby SNPs. The second is the Mendelian pedigree information available when related individuals are sequenced. In both cases the likelihood involves the probabilities of read variant counts given genotypes, summed over the unobserved genotypes. Parameters governing the prior genotype distribution and the read error rates can be estimated either from the sequence data themselves or from external reference data. We use simulations and synthetic read data based on the 1000 Genomes Project to evaluate the performance of the proposed methods. An R program to apply the methods to small families is freely available at http://med.stanford.edu/epidemiology/PHGC/.
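
To make the phrase "probabilities of read variant counts given genotypes, summed over the unobserved genotypes" concrete, the following is a minimal single-locus sketch. It is not the authors' model: it assumes a binomial read-count likelihood with a symmetric per-read error rate and a Hardy-Weinberg genotype prior, and it ignores both LD and pedigree information. The function names, the default `error_rate`, and the default `variant_freq` are all hypothetical choices made for illustration.

```python
from math import comb


def genotype_posterior(n_reads, n_variant, error_rate=0.01, variant_freq=0.2):
    """Posterior over genotypes 0, 1, 2 (number of variant alleles).

    Assumptions (illustrative only): binomial read counts, symmetric
    read error, Hardy-Weinberg prior with the given variant frequency.
    """
    q = variant_freq
    # Prior genotype distribution under Hardy-Weinberg equilibrium.
    prior = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]
    # Probability that one read shows the variant allele, given genotype.
    # For a heterozygote, symmetric error leaves this at exactly 0.5.
    p_variant = [error_rate, 0.5, 1 - error_rate]
    # Binomial likelihood of the observed variant read count.
    lik = [
        comb(n_reads, n_variant) * p ** n_variant * (1 - p) ** (n_reads - n_variant)
        for p in p_variant
    ]
    joint = [pr * l for pr, l in zip(prior, lik)]
    # Marginal likelihood: the sum over the unobserved genotypes.
    total = sum(joint)
    return [j / total for j in joint]


def call_genotype(n_reads, n_variant, **kwargs):
    """Return the genotype with the highest posterior probability."""
    post = genotype_posterior(n_reads, n_variant, **kwargs)
    return max(range(3), key=lambda g: post[g])
```

In this sketch, `call_genotype(10, 5)` favors the heterozygous genotype, while ten reads that all show the same allele favor the corresponding homozygote. The methods described in the abstract go further by replacing the independent per-locus prior with one informed by neighboring SNPs or by relatives' genotypes.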

#### Article information

Source
Ann. Appl. Stat., Volume 6, Number 2 (2012), 457-475.

Dates
First available in Project Euclid: 11 June 2012

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1339419603

Digital Object Identifier
doi:10.1214/11-AOAS527

Mathematical Reviews number (MathSciNet)
MR2976478

Zentralblatt MATH identifier
1243.62138

#### Citation

Zhou, Baiyu; Whittemore, Alice S. Improving sequence-based genotype calls with linkage disequilibrium and pedigree information. Ann. Appl. Stat. 6 (2012), no. 2, 457–475. doi:10.1214/11-AOAS527. https://projecteuclid.org/euclid.aoas/1339419603
