## The Annals of Applied Statistics

### The screening and ranking algorithm to detect DNA copy number variations

#### Abstract

DNA copy number variation (CNV) has recently gained considerable interest as a source of genetic variation that likely influences phenotypic differences. Many statistical and computational methods have been proposed and applied to detect CNVs based on data generated by genome analysis platforms. However, most algorithms are computationally intensive, with complexity at least $O(n^{2})$, where $n$ is the number of probes in the experiments. Moreover, the theoretical properties of those existing methods are not well understood. A faster and better characterized algorithm is desirable for ultra-high-throughput data. In this study, we propose the Screening and Ranking algorithm (SaRa), which can detect CNVs quickly and accurately with complexity down to $O(n)$. In addition, we characterize its theoretical properties and present numerical analysis of the algorithm.
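
The abstract does not spell out the procedure, so the following is only an illustrative sketch of how a screen-then-rank change-point scan can run in roughly linear time. It assumes a generic local mean-difference statistic computed from cumulative sums; the function names `local_diagnostic` and `screen_and_rank`, the window half-width `h`, and the cutoff `threshold` are all hypothetical choices made for this example and should not be read as the authors' implementation.

```python
import numpy as np

def local_diagnostic(y, h):
    """Mean of the h points from index i onward minus mean of the h points before i.

    Cumulative sums make the whole scan O(n). Positions without a full
    window on both sides are left at zero.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    c = np.concatenate(([0.0], np.cumsum(y)))  # c[k] = sum(y[:k])
    d = np.zeros(n)
    i = np.arange(h, n - h + 1)                # indices with full windows
    d[i] = (c[i + h] - 2.0 * c[i] + c[i - h]) / h
    return d

def screen_and_rank(y, h, threshold):
    """Screening: keep positions where |D| is the maximum within distance h.
    Ranking: order survivors by |D| (largest first) and drop those below
    the threshold (an assumed cutoff).

    The explicit window maximum here costs O(n*h); it is kept simple for
    readability (a sliding-window maximum would bring it back to O(n)).
    """
    d = np.abs(local_diagnostic(y, h))
    n = d.size
    candidates = []
    for i in range(h, n - h + 1):
        lo, hi = max(0, i - h), min(n, i + h + 1)
        if d[i] >= threshold and d[i] == d[lo:hi].max():
            candidates.append(i)
    return sorted(candidates, key=lambda i: -d[i])

# Toy signal: baseline 0 with a raised segment on probes 50..79.
signal = np.concatenate((np.zeros(50), np.ones(30), np.zeros(50)))
print(screen_and_rank(signal, h=10, threshold=0.5))
```

On this noiseless toy signal the two reported indices mark the start of the raised segment and the first probe after it. With noisy data, the choice of `h` and `threshold` would govern the trade-off between missed and spurious change points.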

#### Article information

Source
Ann. Appl. Stat., Volume 6, Number 3 (2012), 1306-1326.

Dates
First available in Project Euclid: 31 August 2012

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1346418584

Digital Object Identifier
doi:10.1214/12-AOAS539

Mathematical Reviews number (MathSciNet)
MR3012531

Zentralblatt MATH identifier
06096532

#### Citation

Niu, Yue S.; Zhang, Heping. The screening and ranking algorithm to detect DNA copy number variations. Ann. Appl. Stat. 6 (2012), no. 3, 1306-1326. doi:10.1214/12-AOAS539. https://projecteuclid.org/euclid.aoas/1346418584

#### Supplemental materials

• Supplementary material: A description of general weight functions and technical proofs. The PDF file describes general weight functions and contains the proof of Theorem 1.