The Annals of Applied Statistics

Modeling association between DNA copy number and gene expression with constrained piecewise linear regression splines

Gwenaël G. R. Leday, Aad W. van der Vaart, Wessel N. van Wieringen, and Mark A. van de Wiel

Full-text: Open access


DNA copy number and mRNA expression are widely used data types in cancer studies; combined, they provide more insight than either does separately. Whereas the existing literature fixes the form of the relationship between these two types of markers a priori, in this paper we model their association. We employ piecewise linear regression splines (PLRS), which combine good interpretability with sufficient flexibility to identify any plausible type of relationship. The specification of the model leads to estimation and model selection in a constrained, nonstandard setting. We provide methodology for testing the effect of DNA on mRNA and for choosing the appropriate model. Furthermore, we present a novel approach to obtaining reliable confidence bands for constrained PLRS, which incorporates model uncertainty. The procedures are applied to colorectal and breast cancer data. Common assumptions are found to be potentially misleading for biologically relevant genes. More flexible models may bring more insight into the interaction between the two markers.
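To make the idea of a constrained piecewise linear fit concrete, the following is a minimal sketch, not the authors' implementation. It assumes a continuous piecewise linear mean with user-supplied knots and, as an illustrative constraint, a non-negative slope in every segment; the abstract does not spell out the actual constraints. The function names `segment_basis` and `fit_nonneg_slopes`, the knot at 0 and the simulated data are all hypothetical. Because only a handful of slopes are involved, the constrained least-squares problem is solved exactly by enumerating which slopes are fixed at zero.

```python
import itertools
import numpy as np

def segment_basis(x, knots):
    """Design matrix: intercept plus one column per segment.

    Each non-intercept coefficient is the slope within one segment,
    so 'non-negative slopes' becomes 'non-negative coefficients'.
    """
    x = np.asarray(x, dtype=float)
    edges = np.concatenate(([-np.inf], np.sort(knots), [np.inf]))
    cols = [np.ones_like(x)]
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Portion of x inside [lo, hi], measured from the segment start
        # (from 0 for the first, unbounded segment).
        start = 0.0 if np.isinf(lo) else lo
        cols.append(np.clip(x, lo, hi) - start)
    return np.column_stack(cols)

def fit_nonneg_slopes(x, y, knots):
    """Least squares with a free intercept and slopes constrained to be >= 0.

    Illustrative assumption: the optimum equals the ordinary least-squares
    fit on some subset of slopes with the others set to zero, so we try
    every subset and keep the feasible fit with the smallest RSS.
    """
    X = segment_basis(x, knots)
    y = np.asarray(y, dtype=float)
    n_slopes = X.shape[1] - 1
    best_rss, best_coef = np.inf, None
    for mask in itertools.product([False, True], repeat=n_slopes):
        keep = np.concatenate(([True], mask))  # intercept is always kept
        coef = np.zeros(X.shape[1])
        coef[keep] = np.linalg.lstsq(X[:, keep], y, rcond=None)[0]
        if np.all(coef[1:] >= -1e-12):  # feasible: no negative slope
            rss = float(np.sum((y - X @ coef) ** 2))
            if rss < best_rss:
                best_rss, best_coef = rss, coef
    return best_coef, best_rss

# Hypothetical example: expression is flat below the knot and rises above it.
rng = np.random.default_rng(0)
copy_number = rng.uniform(-1.0, 1.0, size=200)
expression = 1.0 + 2.0 * np.maximum(copy_number, 0.0) + rng.normal(0.0, 0.1, 200)
coef, rss = fit_nonneg_slopes(copy_number, expression, knots=[0.0])
print("intercept and segment slopes:", coef)
```

The enumeration grows as 2^k in the number of segments k, which is acceptable for the few segments considered here; for larger problems a bounded least-squares or quadratic programming solver would be the more natural choice.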

Article information

Ann. Appl. Stat., Volume 7, Number 2 (2013), 823-845.

First available in Project Euclid: 27 June 2013

Keywords: DNA copy number; mRNA expression; regression splines; constrained inference; model selection; confidence bands


Leday, Gwenaël G. R.; van der Vaart, Aad W.; van Wieringen, Wessel N.; van de Wiel, Mark A. Modeling association between DNA copy number and gene expression with constrained piecewise linear regression splines. Ann. Appl. Stat. 7 (2013), no. 2, 823–845. doi:10.1214/12-AOAS605.


Supplemental materials

  • Supplementary material: Complementary results and simulations. We present a simulation study which compares the performance of the PLRS testing procedure in detecting associations of various functional shapes with that of other procedures. Additionally, we provide an overlap comparison of model selection procedures, complementary results for the simulation on point estimation and a description of the simulation on the precision of knots.