## The Annals of Statistics

### Robust and computationally feasible community detection in the presence of arbitrary outlier nodes

#### Abstract

Community detection, which aims to cluster $N$ nodes in a given graph into $r$ distinct groups based on the observed undirected edges, is an important problem in network data analysis. In this paper, the popular stochastic block model (SBM) is extended to the generalized stochastic block model (GSBM) that allows for adversarial outlier nodes, which are connected with the other nodes in the graph in an arbitrary way. Under this model, we introduce a procedure using convex optimization followed by $k$-means algorithm with $k=r$.

Both theoretical and numerical properties of the method are analyzed. A theoretical guarantee is given for the procedure to accurately detect the communities with small misclassification rate under the setting where the number of clusters can grow with $N$. This theoretical result admits to the best-known result in the literature of computationally feasible community detection in SBM without outliers. Numerical results show that our method is both computationally fast and robust to different kinds of outliers, while some popular computationally fast community detection algorithms, such as spectral clustering applied to adjacency matrices or graph Laplacians, may fail to retrieve the major clusters due to a small portion of outliers. We apply a slight modification of our method to a political blogs data set, showing that our method is competent in practice and comparable to existing computationally feasible methods in the literature. To the best of the authors’ knowledge, our result is the first in the literature in terms of clustering communities with fast growing numbers under the GSBM where a portion of arbitrary outlier nodes exist.

#### Article information

Source
Ann. Statist., Volume 43, Number 3 (2015), 1027-1059.

Dates
Revised: November 2014
First available in Project Euclid: 15 May 2015

https://projecteuclid.org/euclid.aos/1431695637

Digital Object Identifier
doi:10.1214/14-AOS1290

Mathematical Reviews number (MathSciNet)
MR3346696

Zentralblatt MATH identifier
1328.62381

#### Citation

Cai, T. Tony; Li, Xiaodong. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. Ann. Statist. 43 (2015), no. 3, 1027--1059. doi:10.1214/14-AOS1290. https://projecteuclid.org/euclid.aos/1431695637

#### References

• Adamic, A. and Glance, N. (2005). The political blogosphere and the 2004 US election: Divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery 36–43. ACM, New York.
• Ahlswede, R. and Winter, A. (2002). Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory 48 569–579.
• Airoldi, E., Blei, M., Fienberg, S. and Xing, E. (2008). Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9 1981–2014.
• Ames, B. P. W. (2014). Guaranteed clustering and biclustering via semidefinite programming. Math. Program. 147 429–465.
• Ames, B. P. W. and Vavasis, S. A. (2014). Convex optimization for the planted $k$-disjoint-clique problem. Math. Program. 143 299–337.
• Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 41 2097–2122.
• Balakrishnan, S., Xu, M., Krishnamurthy, A. and Singh, A. (2011). Noise thresholds for spectral clustering (NIPS 2011). Adv. Neural Inf. Process. Syst. 25 954–962.
• Bhattacharyya, S. and Bickel, P. J. (2014). Community detection in networks using graph distance. Available at arXiv:1401.3915.
• Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. USA 106 21068–21073.
• Bickel, P., Choi, D., Chang, X. and Zhang, H. (2013). Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist. 41 1922–1943.
• Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Faund. Trends Mach. Learn. 3 1–122.
• Cai, T. and Li, X. (2015). Supplement to “Robust and computationally feasible community detection in the presence of arbitrary outlier nodes.” DOI:10.1214/14-AOS1290SUPP.
• Candès, E. J., Strohmer, T. and Voroninski, V. (2013). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66 1241–1274.
• Candès, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. 11, 37.
• Celisse, A., Daudin, J.-J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 6 1847–1899.
• Chaudhuri, K., Chung, F. and Tsiatas, A. (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. J. Mach. Learn. Res. 23 35.1–35.23.
• Chen, Y., Sanghavi, S. and Xu, H. (2012). Clustering sparse graphs. Adv. Neural Inf. Process. Syst. 25 2213–2221.
• Chernoff, H. (1981). A note on an inequality involving the normal distribution. Ann. Probab. 9 533–535.
• Clauset, A., Newman, M. and Moore, C. (2004). Finding community structure in very large networks. Phys. Rev. E 70 066111.
• Coja-Oghlan, A. and Lanka, A. (2009/10). Finding planted partitions in random graphs with general degree distributions. SIAM J. Discrete Math. 23 1682–1714.
• Decelle, A., Krzakala, F., Moore, C. and Zdeborová, L. (2011). Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84 066106.
• Deshpande, Y. and Montanari, A. (2015). Finding hidden cliques of size $\sqrt{N/e}$ in nearly linear time. Found. Comput. Math. DOI:10.1007/s10208-014-9125-y. To appear.
• Fienberg, S. E. (2010). Introduction to papers on the modeling and analysis of network data. Ann. Appl. Stat. 4 1–4.
• Fienberg, S. E. (2012). A brief history of statistical models for network analysis and open challenges. J. Comput. Graph. Statist. 21 825–839.
• Fishkind, D. E., Sussman, D. L., Tang, M., Vogelstein, J. T. and Priebe, C. E. (2013). Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM J. Matrix Anal. Appl. 34 23–39.
• Füredi, Z. and Komlós, J. (1981). The eigenvalues of random symmetric matrices. Combinatorica 1 233–241.
• Giesen, J. and Mitsche, D. (2005). Reconstructing many partitions using spectral techniques. In Fundamentals of Computation Theory. Lecture Notes in Computer Science 3623 433–444. Springer, Berlin.
• Goldenberg, A., Zheng, A. X., Fienberg, S. E. and Airoldi, E. M. (2010). A survey of statistical network models. Foundations and Trends in Machine Learning 2 129–233.
• Handcock, M. S., Raftery, A. E. and Tantrum, J. M. (2007). Model-based clustering for social networks. J. Roy. Statist. Soc. Ser. A 170 301–354.
• Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Soc. Netw. 5 109–137.
• Horn, R. A. and Johnson, C. R. (2013). Matrix Analysis, 2nd ed. Cambridge Univ. Press, Cambridge.
• Jalali, A., Chen, Y., Sanghavi, S. and Xu, H. (2014). Clustering partially observed graphs via convex optimization. J. Mach. Learn. Res. 15 2213–2238.
• Jin, J. (2015). Fast network community detection by SCORE. Ann. Statist. 43 57–89.
• Joseph, A. and Yu, B. (2013). Impact of regularization on spectral clustering. Available at arXiv:1312.1733.
• Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E (3) 83 016107, 10.
• Krzakala, F., Moore, C., Mossel, E., Neeman, J., Sly, A., Zdeborová, L. and Zhang, P. (2013). Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. USA 110 20935–20940.
• Kumar, A., Sabharwal, Y. and Sen, S. (2011). A simple linear time $(1+\epsilon)$-approximation algorithm for $k$-means clustering in any dimensions. J. ACM 58 11.
• Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. Ann. Statist. 43 215–237.
• Li, X. and Voroninski, V. (2013). Sparse signal recovery from quadratic measurements via convex programming. SIAM J. Math. Anal. 45 3019–3033.
• Lin, Z., Liu, R. and Su, Z. (2011). Linearized alternating direction method with adaptive penalty for low rank representation. In Advances in Neural Information Processing Systems (NIPS) 612–620.
• Mathieu, C. and Schudy, W. (2010). Correlation clustering with noisy input. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms 712–728. SIAM, Philadelphia, PA.
• McSherry, F. (2001). Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001) 529–537. IEEE Computer Soc., Los Alamitos, CA.
• Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69 026113.
• Newman, M. and Leicht, E. (2007). Mixture models and exploratory analysis in networks. Proc. Natl. Acad. Sci. USA 104 9564–9569.
• Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc. 96 1077–1087.
• Oymak, S. and Hassibi, B. (2011). Finding dense clusters via low rank $+$ sparse decomposition. Available at arXiv:1104.5186.
• Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist. 39 1878–1915.
• Sarkar, P. and Bickel, P. J. (2013). Role of normalization in spectral clustering for stochastic blockmodels. Available at arXiv:1310.1495.
• Shamir, R. and Tsur, D. (2007). Improved algorithms for the random cluster graph model. Random Structures Algorithms 31 418–449.
• Snijders, T. A. B. and Nowicki, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classification 14 75–100.
• Sussman, D. L., Tang, M., Fishkind, D. E. and Priebe, C. E. (2012). A consistent adjacency spectral embedding for stochastic blockmodel graphs. J. Amer. Statist. Assoc. 107 1119–1128.
• Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12 389–434.
• Vu, V. H. (2007). Spectral norm of random matrices. Combinatorica 27 721–736.
• Vu, V. (2014). A simple SVD algorithm for finding hidden partitions. Available at arXiv:1404.3918.
• Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist. 40 2266–2292.

#### Supplemental materials

• Supplemental materials to “Robust and computationally feasible community detection in the presence of arbitrary outliers nodes”. We give in the supplement proofs to Lemmas 6.6, 6.7, 6.8, 6.9, 6.10 and 6.11.