## The Annals of Statistics

- Ann. Statist.
- Volume 45, Number 5 (2017), 2151-2189.

### Phase transitions for high dimensional clustering and related problems

Jiashun Jin, Zheng Tracy Ke, and Wanjie Wang

#### Abstract

Consider a two-class clustering problem where we observe $X_{i}=\ell_{i}\mu+Z_{i}$, $Z_{i}\stackrel{\mathit{i.i.d.}}{\sim}N(0,I_{p})$, $1\leq i\leq n$. The feature vector $\mu\in\mathbb{R}^{p}$ is unknown but is presumably sparse. The class labels $\ell_{i}\in\{-1,1\}$ are also unknown and the main interest is to estimate them.

We are interested in the statistical limits. In the two-dimensional phase space calibrating the rarity and strengths of useful features, we find the precise demarcation for the *Region of Impossibility* and *Region of Possibility*. In the former, useful features are too rare/weak for successful clustering. In the latter, useful features are strong enough to allow successful clustering. The results are extended to the case of colored noise using Le Cam’s idea on comparison of experiments.

We also extend the study on statistical limits for clustering to that for signal recovery and that for global testing. We compare the statistical limits for three problems and expose some interesting insight.

We propose classical PCA and Important Features PCA (IF-PCA) for clustering. For a threshold $t>0$, IF-PCA clusters by applying classical PCA to all columns of $X$ with an $L^{2}$-norm larger than $t$. We also propose two aggregation methods. For any parameter in the Region of Possibility, some of these methods yield successful clustering.
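The IF-PCA step described above can be sketched in NumPy. This is only an illustration under stated assumptions: rows of $X$ are samples and columns are features; the function name `if_pca`, the simulated sizes, the signal strength, and the threshold $t=\sqrt{1.5n}$ are hypothetical choices, not values from the paper; and the labels are read off the signs of the first left singular vector, which is one natural way to turn that vector into two clusters.

```python
import numpy as np

def if_pca(X, t):
    """Hypothetical helper: keep columns of X (n x p) whose L2-norm exceeds t,
    then cluster the rows by the sign of the first left singular vector."""
    col_norms = np.linalg.norm(X, axis=0)      # one L2-norm per feature (column)
    keep = col_norms > t                       # feature-selection step
    if not keep.any():
        raise ValueError("threshold t removes every column")
    U, _, _ = np.linalg.svd(X[:, keep], full_matrices=False)
    xi = U[:, 0]                               # first left singular vector
    labels = np.where(xi >= 0, 1, -1)          # assumed rule: cluster by sign
    return labels, keep

# Simulated data from the model X_i = ell_i * mu + Z_i (all sizes are assumptions).
rng = np.random.default_rng(0)
n, p, k, tau = 200, 2000, 40, 1.0              # k useful features of strength tau
mu = np.zeros(p)
mu[:k] = tau
ell = rng.choice([-1, 1], size=n)
X = np.outer(ell, mu) + rng.standard_normal((n, p))

t = np.sqrt(1.5 * n)                           # illustrative threshold only
labels, keep = if_pca(X, t)

# Labels are identifiable only up to a global sign flip.
err = min(np.mean(labels != ell), np.mean(labels != -ell))
print(f"kept {keep.sum()} of {p} columns, clustering error {err:.3f}")
```

Setting `t = 0` keeps every column, so the same function reduces to classical PCA on the full data matrix.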

We discover a phase transition for IF-PCA. For any threshold $t>0$, let $\xi^{(t)}$ be the first left singular vector of the post-selection data matrix. The phase space partitions into two different regions. In one region, there is a $t$ such that $\cos(\xi^{(t)},\ell)\rightarrow 1$ and IF-PCA yields successful clustering. In the other, $\cos(\xi^{(t)},\ell)\leq c_{0}<1$ for all $t>0$.
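To make the quantity $\cos(\xi^{(t)},\ell)$ concrete, the sketch below evaluates it on simulated data for a few thresholds. It is a hedged illustration, not the paper's experiment: the helper name `cos_with_labels`, the problem sizes, and the threshold grid are assumptions, and the cosine is taken in absolute value because a singular vector is determined only up to sign.

```python
import numpy as np

def cos_with_labels(X, ell, t):
    """Hypothetical helper: |cos| of the angle between the true labels ell and
    the first left singular vector of the columns of X with L2-norm > t."""
    keep = np.linalg.norm(X, axis=0) > t
    if not keep.any():
        return float("nan")                    # no columns survive selection
    U, _, _ = np.linalg.svd(X[:, keep], full_matrices=False)
    xi = U[:, 0]                               # unit-norm by construction
    return abs(xi @ ell) / np.linalg.norm(ell)

# Simulated data from X_i = ell_i * mu + Z_i (sizes and strengths are assumptions).
rng = np.random.default_rng(1)
n, p, k, tau = 200, 2000, 40, 1.0
mu = np.zeros(p)
mu[:k] = tau
ell = rng.choice([-1, 1], size=n)
X = np.outer(ell, mu) + rng.standard_normal((n, p))

# Illustrative grid: t = 0 corresponds to classical PCA on all columns.
thresholds = [0.0] + [np.sqrt(c * n) for c in (1.2, 1.5, 1.8)]
cosines = [cos_with_labels(X, ell, t) for t in thresholds]
for t, c in zip(thresholds, cosines):
    print(f"t = {t:6.2f}   |cos(xi, ell)| = {c:.3f}")
```

In a single finite-sample run like this, a cosine near 1 for some `t` only hints at the favorable regime; the dichotomy stated in the abstract is an asymptotic result.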

Our results require delicate analysis, especially on post-selection random matrix theory and on lower bound arguments.

#### Article information

**Source**

Ann. Statist., Volume 45, Number 5 (2017), 2151-2189.

**Dates**

Received: March 2015

Revised: June 2016

First available in Project Euclid: 31 October 2017

**Permanent link to this document**

https://projecteuclid.org/euclid.aos/1509436831

**Digital Object Identifier**

doi:10.1214/16-AOS1522

**Mathematical Reviews number (MathSciNet)**

MR3718165

**Zentralblatt MATH identifier**

06821122

**Subjects**

Primary:

- 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]
- 62H25: Factor analysis and principal components; correspondence analysis

Secondary:

- 62G05: Estimation
- 62G10: Hypothesis testing

**Keywords**

Clustering; comparison of experiments; feature selection; hypothesis testing; $L^{1}$-distance; lower bound; low-rank matrix recovery; phase transition

#### Citation

Jin, Jiashun; Ke, Zheng Tracy; Wang, Wanjie. Phase transitions for high dimensional clustering and related problems. Ann. Statist. 45 (2017), no. 5, 2151-2189. doi:10.1214/16-AOS1522. https://projecteuclid.org/euclid.aos/1509436831

#### Supplemental materials

- Supplementary Material for “Phase transitions for high dimensional clustering and related problems”. Owing to space constraints, some technical proofs and discussion are relegated to a supplementary document [27]. It contains proofs of Lemmas 2.1–2.4 and 3.1–3.3, and discusses an extension of the ARW model. Digital Object Identifier: doi:10.1214/16-AOS1522SUPP