Electronic Journal of Statistics

Cluster analysis of longitudinal profiles with subgroups

Xiaolu Zhu and Annie Qu

Full-text: Open access


In this paper, we cluster profiles of longitudinal data using a penalized regression method. Specifically, we allow heterogeneous variation of longitudinal patterns for each subject, and apply a pairwise-grouping penalty to the coefficients of nonparametric B-spline models to form subgroups. Clusters are then identified from the distinct patterns of the predicted longitudinal curves. One advantage of the proposed method is that the number of clusters need not be pre-specified; instead, it is selected automatically through a model selection criterion. The method also applies to unbalanced data, where different subjects may have measurements at different time points. To implement the proposed method, we develop an alternating direction method of multipliers (ADMM) algorithm with a desirable convergence property. In theory, we establish consistency of the approximated nonparametric function estimator and of the subgroup memberships. In addition, our method outperforms existing competing approaches in our simulation studies and a real data example.
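The abstract does not spell out the penalty or the ADMM updates, so the following is only a minimal sketch under stated assumptions. It uses the standard definition of the minimax concave penalty named in the keywords, together with the closed-form group-thresholding step that is commonly used for pairwise-fusion penalties inside ADMM. The function names and the symbols `lam` (tuning parameter), `gamma` (concavity parameter), and `theta` (ADMM penalty parameter) are illustrative choices, not taken from the paper, and `zeta` stands in for a difference of two subjects' B-spline coefficient vectors plus a scaled dual variable.

```python
import numpy as np


def mcp_penalty(t, lam, gamma):
    """Minimax concave penalty at |t|: quadratic up to gamma*lam, then flat."""
    t = np.abs(t)
    return np.where(
        t <= gamma * lam,
        lam * t - t ** 2 / (2.0 * gamma),
        0.5 * gamma * lam ** 2,
    )


def group_soft_threshold(z, t):
    """Shrink the vector z toward zero by t in Euclidean norm."""
    norm = np.linalg.norm(z)
    if norm == 0.0:
        return np.zeros_like(z, dtype=float)
    return max(0.0, 1.0 - t / norm) * z


def mcp_fusion_update(zeta, lam, gamma, theta):
    """Closed-form minimizer of (theta/2)*||zeta - d||^2 + MCP(||d||).

    Assumes gamma * theta > 1 so that the subproblem is convex.
    """
    if gamma * theta <= 1.0:
        raise ValueError("need gamma * theta > 1")
    zeta = np.asarray(zeta, dtype=float)
    if np.linalg.norm(zeta) <= gamma * lam:
        shrunk = group_soft_threshold(zeta, lam / theta)
        return shrunk / (1.0 - 1.0 / (gamma * theta))
    # Large differences are left unpenalized, which limits estimation bias.
    return zeta.copy()


if __name__ == "__main__":
    # A small difference is fused to zero (the two subjects share a cluster).
    print(mcp_fusion_update([0.1, 0.1], lam=1.0, gamma=3.0, theta=1.0))
    # A large difference is returned unchanged (the subjects stay separate).
    print(mcp_fusion_update([3.0, 4.0], lam=1.0, gamma=3.0, theta=1.0))
```

In a sketch like this, two subjects would be assigned to the same subgroup whenever the updated difference of their coefficient vectors is exactly zero, which is how a pairwise penalty can determine the number of clusters without fixing it in advance.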

Article information

Electron. J. Statist., Volume 12, Number 1 (2018), 171-193.

Received: August 2016
First available in Project Euclid: 31 January 2018

Permanent link to this document: https://projecteuclid.org/euclid.ejs/1517367715

Digital Object Identifier: doi:10.1214/17-EJS1389

Keywords: ADMM, longitudinal data, minimax concave penalty, model selection, nonparametric spline method

Creative Commons Attribution 4.0 International License.


Zhu, Xiaolu; Qu, Annie. Cluster analysis of longitudinal profiles with subgroups. Electron. J. Statist. 12 (2018), no. 1, 171–193. doi:10.1214/17-EJS1389. https://projecteuclid.org/euclid.ejs/1517367715

