## The Annals of Statistics

### A partially linear framework for massive heterogeneous data

#### Abstract

We consider a partially linear framework for modeling massive heterogeneous data. The major goal is to extract common features across all subpopulations while exploring the heterogeneity of each subpopulation. In particular, we propose an aggregation type estimator for the commonality parameter that possesses the (nonasymptotic) minimax optimal bound and the same asymptotic distribution as if there were no heterogeneity. This oracle result holds when the number of subpopulations does not grow too fast. A plug-in estimator for the heterogeneity parameter is further constructed, and shown to possess the same asymptotic distribution as if the commonality information were available. We also test the heterogeneity among a large number of subpopulations. All of the above results require regularizing each subestimation as though it had the entire sample. Our general theory applies to the divide-and-conquer approach that is often used to deal with massive homogeneous data. A technical by-product of this paper is statistical inference for general kernel ridge regression. Thorough numerical results are also provided to back up our theory.
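The aggregation and plug-in steps described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: it assumes the usual partially linear model $Y = X^{T}\beta^{(j)} + f(Z) + \varepsilon$ in subpopulation $j$, with a subpopulation-specific coefficient $\beta^{(j)}$ (heterogeneity) and a shared function $f$ (commonality). The Gaussian kernel, its bandwidth, the rate used for `lam`, the simulated data, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def gauss_kernel(a, b, h=0.2):
    # Gaussian kernel matrix between 1-D point sets a and b (bandwidth h is an assumption).
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h ** 2))

def fit_subpopulation(X, z, y, lam):
    # Partially linear kernel ridge fit on one subpopulation:
    # minimize (1/n) * ||y - X beta - K alpha||^2 + lam * alpha' K alpha.
    n = len(y)
    K = gauss_kernel(z, z)
    A = K + n * lam * np.eye(n)
    # With smoother S = K A^{-1}, the residual operator is I - S = n * lam * A^{-1}.
    RX = n * lam * np.linalg.solve(A, X)
    Ry = n * lam * np.linalg.solve(A, y)
    beta = np.linalg.solve(X.T @ RX, X.T @ Ry)   # profiled estimate of beta
    alpha = np.linalg.solve(A, y - X @ beta)     # kernel coefficients of f
    return beta, alpha

rng = np.random.default_rng(0)
s, n, p = 5, 200, 2            # subpopulations, sample size per subpopulation, dim of X
N = s * n
lam = N ** (-0.8)              # tuned to the entire sample size N, not n (illustrative rate)

def f0(z):
    # Common nonparametric component used in this simulation (an assumption).
    return np.sin(2 * np.pi * z)

betas_true = rng.normal(size=(s, p))
data = []
for j in range(s):
    X = rng.normal(size=(n, p))
    z = rng.uniform(size=n)
    y = X @ betas_true[j] + f0(z) + 0.5 * rng.normal(size=n)
    data.append((X, z, y))

# Step 1: fit each subpopulation separately.
fits = [fit_subpopulation(X, z, y, lam) for X, z, y in data]

def f_bar(z_new):
    # Step 2: aggregate by averaging the s sub-estimates of the common function.
    preds = [gauss_kernel(z_new, z) @ alpha
             for (_, z, _), (_, alpha) in zip(data, fits)]
    return np.mean(preds, axis=0)

# Step 3: plug-in estimate of each beta^(j) by least squares on y - f_bar(z).
betas_plugin = np.array([
    np.linalg.lstsq(X, y - f_bar(z), rcond=None)[0] for X, z, y in data
])

grid = np.linspace(0, 1, 101)
beta_err = np.max(np.abs(betas_plugin - betas_true))
f_err = np.mean(np.abs(f_bar(grid) - f0(grid)))
print("max |beta error|:", beta_err)
print("mean |f error|:", f_err)
```

Setting `lam` from the pooled size `N` rather than the per-subpopulation size `n` mirrors the abstract's remark that each subestimation is regularized as though it had the entire sample; each individual fit is then undersmoothed, and the averaging step is what reduces the variance.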

#### Article information

Source
Ann. Statist., Volume 44, Number 4 (2016), 1400-1437.

Dates
Revised: October 2015
First available in Project Euclid: 7 July 2016

Permanent link to this document
https://projecteuclid.org/euclid.aos/1467894703

Digital Object Identifier
doi:10.1214/15-AOS1410

Mathematical Reviews number (MathSciNet)
MR3519928

Zentralblatt MATH identifier
1358.62050

#### Citation

Zhao, Tianqi; Cheng, Guang; Liu, Han. A partially linear framework for massive heterogeneous data. Ann. Statist. 44 (2016), no. 4, 1400–1437. doi:10.1214/15-AOS1410. https://projecteuclid.org/euclid.aos/1467894703
