The Annals of Statistics

Distributed inference for quantile regression processes

Stanislav Volgushev, Shih-Kang Chao, and Guang Cheng


Abstract

The increased availability of massive data sets provides a unique opportunity to discover subtle patterns in their distributions, but also imposes overwhelming computational challenges. To fully utilize the information contained in big data, we propose a two-step procedure: (i) estimate conditional quantile functions at different levels in a parallel computing environment; (ii) construct a conditional quantile regression process through projection based on these estimated quantile curves. Our general quantile regression framework covers both linear models with fixed or growing dimension and series approximation models. We prove that the proposed procedure does not sacrifice any statistical inferential accuracy provided that the number of distributed computing units and the number of quantile levels are chosen properly. In particular, a sharp upper bound for the former and a sharp lower bound for the latter are derived to capture the minimal computational cost from a statistical perspective. As an important application, we consider statistical inference on conditional distribution functions. Moreover, we propose computationally efficient approaches to conducting inference in the distributed estimation setting described above. These approaches directly utilize the availability of estimators from subsamples and can be carried out at almost no additional computational cost. Simulations confirm our statistical inferential theory.
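
The two-step procedure lends itself to a compact illustration. The Python sketch below assumes the simplest case covered by the framework, a linear quantile regression model with fixed dimension, and uses statsmodels' QuantReg for the subsample fits and scipy's LSQUnivariateSpline for the projection step; the function name distributed_qr_process, the argument n_machines, and the knot placement are illustrative choices, not the paper's implementation.

    import numpy as np
    import statsmodels.api as sm
    from scipy.interpolate import LSQUnivariateSpline

    def distributed_qr_process(X, y, taus, n_machines, n_knots=5, degree=3):
        """Two-step divide-and-conquer estimate of the quantile regression
        process (a sketch, assuming a linear model)."""
        n, p = X.shape
        # Step (i): split the data into subsamples (one per computing unit),
        # fit quantile regression at each level in `taus` on every subsample,
        # and average the coefficient vectors across subsamples.
        blocks = np.array_split(np.arange(n), n_machines)
        coefs = np.zeros((len(taus), p))
        for rows in blocks:  # embarrassingly parallel in a real deployment
            model = sm.QuantReg(y[rows], X[rows])
            for j, tau in enumerate(taus):
                coefs[j] += model.fit(q=tau).params
        coefs /= n_machines
        # Step (ii): project each pooled coefficient curve onto a B-spline
        # basis in tau (least squares over a coarser knot grid), yielding a
        # smooth estimate of the quantile regression process.
        interior = np.linspace(taus[0], taus[-1], n_knots + 2)[1:-1]
        return [LSQUnivariateSpline(taus, coefs[:, d], t=interior, k=degree)
                for d in range(p)]

    # Illustration on synthetic heteroscedastic data.
    rng = np.random.default_rng(0)
    n = 20000
    X = np.column_stack([np.ones(n), rng.uniform(size=n)])
    y = 1.0 + 2.0 * X[:, 1] + (0.5 + X[:, 1]) * rng.standard_normal(n)
    taus = np.linspace(0.1, 0.9, 17)
    splines = distributed_qr_process(X, y, taus, n_machines=20)
    print(np.array([s(0.5) for s in splines]))  # coefficients at tau = 0.5

In an actual distributed deployment the loop over subsamples would run on separate machines, and only the p-dimensional coefficient vectors at each quantile level would be communicated back, which is what keeps the procedure computationally cheap.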

Article information

Source
Ann. Statist., Volume 47, Number 3 (2019), 1634-1662.

Dates
Received: February 2017
Revised: March 2018
First available in Project Euclid: 13 February 2019

Permanent link to this document
https://projecteuclid.org/euclid.aos/1550026852

Digital Object Identifier
doi:10.1214/18-AOS1730

Mathematical Reviews number (MathSciNet)
MR3911125

Zentralblatt MATH identifier
07053521

Subjects
Primary: 62F12 (Asymptotic properties of estimators); 62G15 (Tolerance and confidence regions); 62G20 (Asymptotic properties)

Keywords
B-spline estimation; conditional distribution function; distributed computing; divide-and-conquer; quantile regression process

Citation

Volgushev, Stanislav; Chao, Shih-Kang; Cheng, Guang. Distributed inference for quantile regression processes. Ann. Statist. 47 (2019), no. 3, 1634–1662. doi:10.1214/18-AOS1730. https://projecteuclid.org/euclid.aos/1550026852



References

  • Banerjee, M., Durot, C. and Sen, B. (2019). Divide and conquer in non-standard problems and the super-efficiency phenomenon. Ann. Statist. To appear.
  • Belloni, A., Chernozhukov, V., Chetverikov, D. and Fernández-Val, I. (2017). Conditional quantile processes based on series or many regressors. Preprint. Available at arXiv:1105.6154.
  • Chao, S.-K., Volgushev, S. and Cheng, G. (2017). Quantile processes for semi and nonparametric regression. Electron. J. Stat. 11 3272–3331.
  • Chernozhukov, V., Fernández-Val, I. and Galichon, A. (2010). Quantile and probability curves without crossing. Econometrica 78 1093–1125.
  • Hagemann, A. (2017). Cluster-robust bootstrap inference in quantile regression models. J. Amer. Statist. Assoc. 112 446–456.
  • Jordan, M. I. (2013). On statistics, computation and scalability. Bernoulli 19 1378–1390.
  • Kato, K. (2012). Asymptotic normality of Powell’s kernel estimator. Ann. Inst. Statist. Math. 64 255–273.
  • Kleiner, A., Talwalkar, A., Sarkar, P. and Jordan, M. I. (2014). A scalable bootstrap for massive data. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 795–816.
  • Koenker, R. (2005). Quantile Regression. Econometric Society Monographs 38. Cambridge Univ. Press, Cambridge.
  • Li, R., Lin, D. K. J. and Li, B. (2013). Statistical inference in massive data sets. Appl. Stoch. Models Bus. Ind. 29 399–409.
  • Powell, J. L. (1986). Censored regression quantiles. J. Econometrics 32 143–155.
  • Schumaker, L. L. (1981). Spline Functions: Basic Theory. Wiley, New York.
  • Sengupta, S., Volgushev, S. and Shao, X. (2016). A subsampled double bootstrap for massive data. J. Amer. Statist. Assoc. 111 1222–1232.
  • Shang, Z. and Cheng, G. (2017). Computational limits of a distributed algorithm for smoothing spline. J. Mach. Learn. Res. 18 Paper No. 108.
  • Shi, C., Lu, W. and Song, R. (2017). A massive data framework for M-estimators with cubic-rate. J. Amer. Statist. Assoc. To appear. DOI:10.1080/01621459.2017.1360779.
  • Volgushev, S. (2013). Smoothed quantile regression processes for binary response models. Preprint. Available at arXiv:1302.5644.
  • Volgushev, S., Chao, S.-K. and Cheng, G. (2019). Supplement to “Distributed inference for quantile regression processes.” DOI:10.1214/18-AOS1730SUPP.
  • White, T. (2012). Hadoop: The Definitive Guide. O’Reilly Media/Yahoo Press.
  • Xu, G., Shang, Z. and Cheng, G. (2016). Optimal tuning for divide-and-conquer kernel ridge regression with massive data. Preprint. Available at arXiv:1612.05907.
  • Zhang, Y., Duchi, J. C. and Wainwright, M. J. (2013). Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 14 3321–3363.
  • Zhang, Y., Duchi, J. and Wainwright, M. (2015). Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res. 16 3299–3340.
  • Zhao, T., Cheng, G. and Liu, H. (2016). A partially linear framework for massive heterogeneous data. Ann. Statist. 44 1400–1437.
  • Zhou, S., Shen, X. and Wolfe, D. A. (1998). Local asymptotics for regression splines and confidence regions. Ann. Statist. 26 1760–1782.

Supplemental materials

  • Supplement to “Distributed inference for quantile regression processes”. The supplement contains additional technical remarks, simulation results and all proofs.