The Annals of Statistics

Change-point detection in multinomial data with a large number of categories

Guanghui Wang, Changliang Zou, and Guosheng Yin

Abstract

We consider a sequence of multinomial data for which the probabilities associated with the categories are subject to abrupt changes of unknown magnitudes at unknown locations. When the number of categories is comparable to or even larger than the number of subjects allocated to these categories, conventional methods such as the classical Pearson’s chi-squared test and the deviance test may not work well. Motivated by high-dimensional homogeneity tests, we propose a novel change-point detection procedure that allows the number of categories to tend to infinity. The null distribution of our test statistic is asymptotically normal and the test performs well with finite samples. The number of change-points is determined by minimizing a penalized objective function based on segmentation, and the locations of the change-points are estimated by minimizing the objective function with the dynamic programming algorithm. Under some mild conditions, the consistency of the estimators of multiple change-points is established. Simulation studies show that the proposed method performs satisfactorily for identifying change-points in terms of power and estimation accuracy, and it is illustrated with an analysis of a real data set.
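The segmentation step described in the abstract (dynamic programming over candidate locations, followed by a penalized choice of the number of change-points) can be illustrated with a minimal sketch. The abstract does not state the authors' objective function, so the cost used here is an assumption: the ordinary multinomial negative log-likelihood of each segment under its pooled maximum likelihood estimate. The names `segment_cost` and `detect_change_points`, and the tuning constant `penalty`, are hypothetical and do not come from the paper.

```python
import math


def segment_cost(prefix, start, end):
    """Negative multinomial log-likelihood of rows[start:end] under pooled MLE.

    ASSUMPTION: this is a stand-in cost, not the objective used in the paper.
    """
    counts = [hi - lo for hi, lo in zip(prefix[end], prefix[start])]
    total = sum(counts)
    return -sum(c * math.log(c / total) for c in counts if c > 0)


def detect_change_points(rows, max_changes, penalty):
    """Return sorted change-point indices for a list of count vectors.

    A change-point t means rows[:t] and rows[t:] belong to different segments.
    `penalty` is a hypothetical per-change-point cost chosen by the user.
    """
    n, k = len(rows), len(rows[0])

    # Prefix sums of category counts make every segment cost O(k) to evaluate.
    prefix = [[0] * k]
    for row in rows:
        prefix.append([p + x for p, x in zip(prefix[-1], row)])

    inf = float("inf")
    # best[m][t]: minimal total cost of splitting rows[:t] into m + 1 segments.
    best = [[inf] * (n + 1) for _ in range(max_changes + 1)]
    back = [[0] * (n + 1) for _ in range(max_changes + 1)]

    for t in range(1, n + 1):
        best[0][t] = segment_cost(prefix, 0, t)

    for m in range(1, max_changes + 1):
        for t in range(m + 1, n + 1):
            for s in range(m, t):  # s is the position of the last change-point
                candidate = best[m - 1][s] + segment_cost(prefix, s, t)
                if candidate < best[m][t]:
                    best[m][t] = candidate
                    back[m][t] = s

    # Choose the number of change-points by minimizing the penalized objective.
    m_hat = min(range(max_changes + 1), key=lambda m: best[m][n] + penalty * m)

    # Backtrack to recover the change-point locations.
    change_points, t = [], n
    for m in range(m_hat, 0, -1):
        t = back[m][t]
        change_points.append(t)
    return sorted(change_points)


if __name__ == "__main__":
    # Toy data: 10 observations concentrated on category 0, then 10 on category 2.
    rows = [[8, 1, 1]] * 10 + [[1, 1, 8]] * 10
    print(detect_change_points(rows, max_changes=3, penalty=5.0))
```

This sketch runs in time proportional to `max_changes * n**2 * k` for `n` observations and `k` categories. It is meant only to show the structure of the search; when the number of categories is large relative to the counts, the abstract indicates that likelihood-type and chi-squared-type criteria may not behave well, which is the situation the paper's own procedure is designed for.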

Article information

Source
Ann. Statist., Volume 46, Number 5 (2018), 2020-2044.

Dates
Received: December 2016
Revised: July 2017
First available in Project Euclid: 17 August 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1534492827

Digital Object Identifier
doi:10.1214/17-AOS1610

Mathematical Reviews number (MathSciNet)
MR3845009

Zentralblatt MATH identifier
06964324

Subjects
Primary: 62H15: Hypothesis testing
Secondary: 62H12: Estimation

Keywords
Asymptotic normality; categorical data; high-dimensional homogeneity test; multiple change-point detection; sparse contingency table

Citation

Wang, Guanghui; Zou, Changliang; Yin, Guosheng. Change-point detection in multinomial data with a large number of categories. Ann. Statist. 46 (2018), no. 5, 2020–2044. doi:10.1214/17-AOS1610. https://projecteuclid.org/euclid.aos/1534492827


Supplemental materials

  • Supplement to “Change-point detection in multinomial data with a large number of categories”. The Supplementary Material contains all theoretical proofs of Theorems 1–5, Proposition 1 and Corollary 1, together with additional simulation results.