## Annals of Statistics

### Partial correlation screening for estimating large precision matrices, with applications to classification

#### Abstract

Given $n$ samples $X_{1},X_{2},\ldots,X_{n}$ from $N(0,\Sigma)$, we are interested in estimating the $p\times p$ precision matrix $\Omega=\Sigma^{-1}$; we assume $\Omega$ is sparse in that each row has relatively few nonzeros.

We propose Partial Correlation Screening (PCS) as a new row-by-row approach. To estimate the $i$th row of $\Omega$, $1\leq i\leq p$, PCS uses a Screen step and a Clean step. In the Screen step, PCS recruits a (small) subset of indices using a stage-wise algorithm, where in each stage, the algorithm updates the set of recruited indices by adding the index $j$ that has the largest empirical partial correlation (in magnitude) with $i$, given the set of indices recruited so far. In the Clean step, PCS reinvestigates all recruited indices, removes false positives and uses the resultant set of indices to reconstruct the $i$th row.

PCS is computationally efficient and modest in memory use: to estimate a row of $\Omega$, it needs only a few rows (determined sequentially) of the empirical covariance matrix. PCS can estimate a large $\Omega$ (e.g., $p=10K$) in a few minutes.
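A minimal sketch of the Screen step described above, not the authors' implementation: the fixed step cap and stopping threshold are illustrative assumptions, and for clarity it takes the full empirical covariance rather than fetching rows sequentially.

```python
import numpy as np

def pcs_screen(S, i, max_steps=10, threshold=0.1):
    """Illustrative sketch of the PCS Screen step for row i.

    S          : p x p empirical covariance matrix.
    max_steps  : cap on the number of stages (assumption, for illustration).
    threshold  : stop once the best partial correlation falls below this
                 (assumption; the paper's stopping rule differs).

    At each stage, recruit the index j with the largest empirical partial
    correlation (in magnitude) with i, given the indices recruited so far.
    """
    p = S.shape[0]
    recruited = []
    candidates = set(range(p)) - {i}
    for _ in range(max_steps):
        best_j, best_pc = None, 0.0
        for j in candidates:
            idx = [i, j] + recruited
            # Partial correlation of i and j given `recruited`, read off
            # the inverse of the covariance submatrix over {i, j} u recruited.
            Om = np.linalg.inv(S[np.ix_(idx, idx)])
            pc = -Om[0, 1] / np.sqrt(Om[0, 0] * Om[1, 1])
            if abs(pc) > abs(best_pc):
                best_j, best_pc = j, pc
        if best_j is None or abs(best_pc) < threshold:
            break
        recruited.append(best_j)
        candidates.discard(best_j)
    return recruited
```

On a population covariance whose precision matrix is tridiagonal, the sketch recruits exactly the two neighbors of a given node and then stops, since the remaining partial correlations vanish by the Markov property.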

Higher Criticism Thresholding (HCT) is a recent classifier that enjoys optimality, but to exploit its full potential, we need a good estimate of $\Omega$. Note that given an estimate of $\Omega$, we can always combine it with HCT to build a classifier (e.g., HCT-PCS, HCT-glasso).
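To make the HCT idea concrete, here is a hedged sketch of Higher Criticism threshold selection from per-feature p-values; the `alpha0` search cap and the small regularizer in the denominator are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def hc_threshold(pvals, alpha0=0.5):
    """Illustrative Higher Criticism threshold selection.

    Sort the per-feature p-values and evaluate the standardized
    discrepancy between the expected uniform quantile i/p and the
    observed sorted p-value; the p-value at the peak serves as the
    feature-selection threshold.
    """
    p = len(pvals)
    sp = np.sort(pvals)
    i = np.arange(1, p + 1)
    # 1e-12 guards against division by zero when sp is 0 or 1 (assumption)
    hc = np.sqrt(p) * (i / p - sp) / np.sqrt(sp * (1 - sp) + 1e-12)
    # restrict the search to the smaller p-values, a common convention
    k = max(1, int(alpha0 * p))
    imax = np.argmax(hc[:k])
    return sp[imax]
```

A classifier in this spirit would keep only the features whose p-values fall at or below the returned threshold, then apply a linear rule on that subset; combining such a rule with a PCS-estimated $\Omega$ gives the HCT-PCS classifier discussed above.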

We have applied HCT-PCS to two microarray data sets ($p=8K$ and $10K$) for classification, where it not only significantly outperforms HCT-glasso but is also competitive with the Support Vector Machine (SVM) and Random Forest (RF). These results suggest that PCS gives more useful estimates of $\Omega$ than the glasso; we study this carefully and have gained some interesting insight.

We show that in a broad context, PCS fully recovers the support of $\Omega$ and HCT-PCS is optimal in classification. Our theoretical study sheds interesting light on the behavior of stage-wise procedures.

#### Article information

**Source**
Ann. Statist., Volume 44, Number 5 (2016), 2018–2057.

**Dates**
First available in Project Euclid: 12 September 2016

https://projecteuclid.org/euclid.aos/1473685267

**Digital Object Identifier**
doi:10.1214/15-AOS1392

**Mathematical Reviews number (MathSciNet)**
MR3546442

**Zentralblatt MATH identifier**
1349.62269

#### Citation

Huang, Shiqiong; Jin, Jiashun; Yao, Zhigang. Partial correlation screening for estimating large precision matrices, with applications to classification. Ann. Statist. 44 (2016), no. 5, 2018--2057. doi:10.1214/15-AOS1392. https://projecteuclid.org/euclid.aos/1473685267

#### References

• [1] An, H., Huang, D., Yao, Q. and Zhang, C. H. (2014). Stepwise searching for feature variables in high-dimensional linear regression. Unpublished manuscript.
• [2] Anonymous (retrieved 2006). Elephant and the blind men. Jain Stories.
• [3] Arias-Castro, E., Candès, E. J. and Plan, Y. (2011). Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism. Ann. Statist. 39 2533–2556.
• [4] Bi, J., Bennett, K., Embrechts, M., Breneman, C. and Song, M. (2003). Dimensionality reduction via sparse support vector machines. J. Mach. Learn. Res. 3 1229–1243.
• [5] Bickel, P. J. and Levina, E. (2004). Some theory for Fisher’s linear discriminant function, “naive Bayes,” and some alternatives when there are many more variables than observations. Bernoulli 10 989–1010.
• [6] Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36 199–227.
• [7] Breiman, L. (2001). Random forests. Mach. Learn. 45 5–32.
• [8] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
• [9] Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2 121–167.
• [10] Cai, T., Liu, W. and Luo, X. (2011). A constrained $\ell_{1}$ minimization approach to sparse precision matrix estimation. J. Amer. Statist. Assoc. 106 594–607.
• [11] Cawley, G. C. and Talbot, N. L. C. (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11 2079–2107.
• [12] Dettling, M. and Bühlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics 19 1061–1069.
• [13] Donoho, D. and Jin, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 105 14790–14795.
• [14] Donoho, D. and Jin, J. (2015). Higher criticism for large-scale inference, especially for rare and weak effects. Statist. Sci. 30 1–25.
• [15] Donoho, D. L., Tsaig, Y., Drori, I. and Starck, J.-L. (2012). Sparse solution of underdetermined systems of linear equations by stagewise orthogonal matching pursuit. IEEE Trans. Inform. Theory 58 1094–1121.
• [16] Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96–104.
• [17] Efron, B. (2009). Empirical Bayes estimates for large-scale prediction problems. J. Amer. Statist. Assoc. 104 1015–1028.
• [18] Fan, J. and Fan, Y. (2008). High-dimensional classification using features annealed independence rules. Ann. Statist. 36 2605–2637.
• [19] Fan, J., Feng, Y. and Tong, X. (2012). A road to classification in high dimensional space: The regularized optimal affine discriminant. J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 745–771.
• [20] Fan, J., Ke, Z. T., Liu, H. and Xia, L. (2015). QUADRO: A supervised dimension reduction method via Rayleigh quotient optimization. Ann. Statist. 43 1498–1534.
• [21] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 849–911.
• [22] Fan, Y., Jin, J. and Yao, Z. (2013). Optimal classification in sparse Gaussian graphic model. Ann. Statist. 41 2537–2571.
• [23] Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432–441.
• [24] Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. Ann. Statist. 38 1686–1732.
• [25] Hsieh, C.-J., Dhillon, I. S., Ravikumar, P. K. and Sustik, M. A. (2011). Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems 2330–2338. Neural Information Processing Systems Foundation.
• [26] Huang, S., Jin, J. and Yao, Z. (2015). Supplement to “Partial correlation screening for estimating large precision matrices, with applications to classification.” DOI:10.1214/15-AOS1392SUPP.
• [27] Ingster, Y. I., Pouet, C. and Tsybakov, A. B. (2009). Classification of sparse high-dimensional vectors. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 4427–4448.
• [28] Jager, L. and Wellner, J. A. (2007). Goodness-of-fit tests via phi-divergences. Ann. Statist. 35 2018–2053.
• [29] Jin, J. and Ke, Z. (2016). Rare and weak effects in large-scale inference: Methods and phase diagrams. Statist. Sinica. To appear.
• [30] Jin, J., Zhang, C.-H. and Zhang, Q. (2014). Optimality of graphlet screening in high dimensional variable selection. J. Mach. Learn. Res. 15 2723–2772.
• [31] Ke, Z. T., Jin, J. and Fan, J. (2014). Covariate assisted screening and estimation. Ann. Statist. 42 2202–2242.
• [32] Li, R., Zhong, W. and Zhu, L. (2012). Feature screening via distance correlation learning. J. Amer. Statist. Assoc. 107 1129–1139.
• [33] Mazumder, R. and Hastie, T. (2012). Exact covariance thresholding into connected components for large-scale graphical lasso. J. Mach. Learn. Res. 13 781–794.
• [34] Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2011). High-dimensional covariance estimation by minimizing $\ell_{1}$-penalized log-determinant divergence. Electron. J. Stat. 5 935–980.
• [35] Seber, G. A. F. and Lee, A. J. (2003). Linear Regression Analysis, 2nd ed. Wiley, Hoboken, NJ.
• [36] Spiegelhalter, D. J. (2014). Statistics. The future lies in uncertainty. Science 345 264–265.
• [37] Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika 99 879–898.
• [38] Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing 210–268. Cambridge Univ. Press, Cambridge.
• [39] Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. Ann. Statist. 37 2178–2201.
• [40] Witten, D. M., Friedman, J. H. and Simon, N. (2011). New insights and faster computations for the graphical lasso. J. Comput. Graph. Statist. 20 892–900.
• [41] Yousefi, M. R., Hua, J., Sima, C. and Dougherty, E. R. (2010). Reporting bias when using real data sets to analyze classification performance. Bioinformatics 26 68–76.
• [42] Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 11 2261–2286.
• [43] Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19–35.
• [44] Zhang, T. (2009). On the consistency of feature selection using greedy least squares regression. J. Mach. Learn. Res. 10 555–568.
• [45] Zhang, T. (2011). Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Trans. Inform. Theory 57 4689–4708.
• [46] Zhao, T., Liu, H., Roeder, K., Lafferty, J. and Wasserman, L. (2012). The huge package for high-dimensional undirected graph estimation in R. J. Mach. Learn. Res. 13 1059–1062.
• [47] Zhong, P.-S., Chen, S. X. and Xu, M. (2013). Tests alternative to higher criticism for high-dimensional means under sparsity and column-wise dependence. Ann. Statist. 41 2820–2851.

#### Supplemental materials

• Supplementary material for “Partial correlation screening for estimating large precision matrices, with applications to classification”. Owing to space constraints, some technical proofs are relegated to a supplementary document.