When statistical decision theory was emerging as a promising new paradigm, Charles Stein was to play a major role in the development of minimax theory for invariant statistical problems. In some of his earliest work with Gil Hunt, he set out to prove that, in problems where invariant procedures have constant risk, any best invariant test would be minimax among all tests. Although this turned out not to be quite true in general, the effort led to the legendary Hunt–Stein theorem, which established the result under restrictive conditions on the underlying group of transformations. In decision problems invariant under such suitable groups, an overall minimax test was guaranteed to reside within the class of invariant procedures, where it would typically be much easier to find. But when it did not seem possible to establish this result for invariance under the full linear group, he instead turned to proving its impossibility with counterexamples such as the nonminimaxity of the usual sample covariance estimator, where the full linear group was just too big for the Hunt–Stein theorem to apply. Further explorations of invariance, such as the sometimes problematic inference under a fiducial distribution, or the characterization of a best invariant procedure as a formal Bayes procedure under a right Haar prior, are further examples of the far-reaching influence of Stein’s contributions to invariance theory.
Charles Stein made fundamental contributions to admissibility and inadmissibility in estimation and testing. This paper surveys some of the more important ones. Particular attention will be paid to his monumentally important, and at the time, incredibly surprising discovery of the inadmissibility of the usual estimator of the mean in three and higher dimensions. His result on admissibility of Pitman’s estimator of a mean in one and two dimensions, and his results on estimation of a mean matrix and a covariance matrix are also discussed. His work on testing is briefly covered.
This paper is a short exposition of Stein’s method of normal approximation from my personal perspective. It focuses mainly on the characterization of the normal distribution and the construction of Stein identities. Through examples, it provides glimpses into the many approaches to constructing Stein identities and the diverse applications of Stein’s method to mathematical problems. It also includes anecdotes of historical interest, including how Stein discovered his method and how I found his unpublished proof of the Berry–Esseen theorem.
Stein’s formula states that a random variable of the form $z^\top f(z) - \operatorname{div} f(z)$ is mean-zero for all functions f with integrable gradient. Here, $\operatorname{div} f$ is the divergence of the function f and z is a standard normal vector. This paper aims to propose a second-order Stein formula to characterize the variance of such random variables for all functions with square integrable gradient, and to demonstrate the usefulness of this second-order Stein formula in various applications.
In the Gaussian sequence model, a remarkable consequence of Stein’s formula is Stein’s Unbiased Risk Estimate (SURE), an unbiased estimate of the mean squared risk for almost any given estimator of the unknown mean vector. A first application of the second-order Stein formula is an Unbiased Risk Estimate for SURE itself (SURE for SURE): an unbiased estimate providing information about the squared distance between SURE and the squared estimation error of the estimator. SURE for SURE has a simple form as a function of the data and is applicable to all estimators with square integrable gradient, for example, the Lasso and the Elastic Net.
In addition to SURE for SURE, the following statistical applications are developed: (1) upper bounds on the risk of SURE when the estimation target is the mean squared error; (2) confidence regions based on SURE and using the second-order Stein formula; (3) oracle inequalities satisfied by SURE-tuned estimates under a mild Lipschitz assumption; (4) an upper bound on the variance of the size of the model selected by the Lasso, and more generally an upper bound on the variance of the empirical degrees-of-freedom of convex penalized estimators; (5) explicit expressions of SURE for SURE for the Lasso and the Elastic Net; (6) in the linear model, a general semiparametric scheme to de-bias a differentiable initial estimator for the statistical inference of a low-dimensional projection of the unknown regression coefficient vector, with a characterization of the variance after debiasing; and (7) an accuracy analysis of a Gaussian Monte Carlo scheme to approximate the divergence of such functions.
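As a quick numerical sanity check of the first-order formula in this abstract, the sketch below estimates the mean of $z^\top f(z) - \operatorname{div} f(z)$ by Monte Carlo. It assumes the classical form of Stein's formula. The test function (componentwise tanh), the sample size, and the names `f`, `div_f` and `stein_residual` are illustrative choices and are not taken from the paper.

```python
import numpy as np

def f(z):
    # Hypothetical smooth test function, applied componentwise.
    return np.tanh(z)

def div_f(z):
    # Divergence of f: sum over coordinates of d/dz_i tanh(z_i) = 1 - tanh(z_i)^2.
    return np.sum(1.0 - np.tanh(z) ** 2, axis=1)

def stein_residual(z):
    # z^T f(z) - div f(z), evaluated row by row; mean zero under Stein's formula.
    return np.sum(z * f(z), axis=1) - div_f(z)

rng = np.random.default_rng(0)
z = rng.standard_normal((200_000, 5))  # each row is a draw of a standard normal vector
r = stein_residual(z)
mc_mean = r.mean()
mc_stderr = r.std(ddof=1) / np.sqrt(r.size)
print(f"Monte Carlo mean {mc_mean:.4f}, standard error {mc_stderr:.4f}")
```

The empirical mean should sit within a few standard errors of zero. The sample variance of `r` is the kind of quantity the second-order formula is meant to characterize.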
We study graphons as a nonparametric generalization of stochastic block models, and show how to obtain compactly represented estimators for sparse networks in this framework. In contrast to previous work, we relax the usual boundedness assumption for the generating graphon and instead assume only integrability, so that we can handle networks that have long tails in their degree distributions. We also relax the usual assumption that the graphon is defined on the unit interval, to allow latent position graphs based on more general spaces.
We analyze three algorithms. The first is a least squares algorithm, which gives a consistent estimator for all square-integrable graphons, with errors expressed in terms of the best possible stochastic block model approximation. Next, we analyze an algorithm based on the cut norm, which works for all integrable graphons. Finally, we show that clustering based on degrees works whenever the underlying degree distribution is atomless.
Many popular random partition models, such as the Chinese restaurant process and its two-parameter extension, fall in the class of exchangeable random partitions, and have found wide applicability in various fields. While the exchangeability assumption is sensible in many cases, it implies that the size of the clusters necessarily grows linearly with the sample size, and such a feature may be undesirable for some applications. We present here a flexible class of nonexchangeable random partition models, which are able to generate partitions whose cluster sizes grow sublinearly with the sample size, and where the growth rate is controlled by one parameter. Along with this result, we provide the asymptotic behaviour of the number of clusters of a given size, and show that the model can exhibit a power-law behaviour, controlled by another parameter. The construction is based on completely random measures and a Poisson embedding of the random partition, and inference is performed using a Sequential Monte Carlo algorithm. Experiments on real data sets emphasise the usefulness of the approach compared to a two-parameter Chinese restaurant process.
Historically, time-reversibility of the transitions or processes underpinning Markov chain Monte Carlo methods (MCMC) has played a key role in their development, while the self-adjointness of associated operators, together with the use of classical functional analysis techniques on Hilbert spaces, has led to powerful and practically successful tools to characterise and compare their performance. Similar results for algorithms relying on nonreversible Markov processes are scarce. We show that for a type of nonreversible Monte Carlo Markov chains and processes, of current or renewed interest in the physics and statistical literatures, it is possible to develop comparison results which closely mirror those available in the reversible scenario. We show that these results shed light on earlier literature, proving some conjectures and strengthening some earlier results.
This paper provides a strong approximation, or coupling, theory for spot volatility estimators formed using high-frequency data. We show that the t-statistic process associated with the nonparametric spot volatility estimator can be strongly approximated by a growing-dimensional vector of independent variables defined as functions of Brownian increments. We use this coupling theory to study the uniform inference for the volatility process in an infill asymptotic setting. Specifically, we propose uniform confidence bands for spot volatility, beta, idiosyncratic variance processes, and their nonlinear transforms. The theory is also applied to address an open question concerning the inference of monotone nonsmooth integrated volatility functionals such as the occupation time and its quantiles.
Ann. Statist. 49 (4), 1999-2020, (August 2021) DOI: 10.1214/20-AOS2024
KEYWORDS: nonparametric inference, high dimensionality, Distance correlation, test of independence, nonlinear dependence detection, central limit theorem, rate of convergence, power, blockchain, 62E20, 62H20, 62G10, 62G20
Distance correlation has become an increasingly popular tool for detecting the nonlinear dependence between a pair of potentially high-dimensional random vectors. Most existing works have explored its asymptotic distributions under the null hypothesis of independence between the two random vectors when only the sample size or the dimensionality diverges. Yet its asymptotic null distribution for the more realistic setting when both sample size and dimensionality diverge in the full range remains largely underdeveloped. In this paper, we fill such a gap and develop central limit theorems and associated rates of convergence for a rescaled test statistic based on the bias-corrected distance correlation in high dimensions under some mild regularity conditions and the null hypothesis. Our new theoretical results reveal an interesting phenomenon of blessing of dimensionality for high-dimensional distance correlation inference in the sense that the accuracy of normal approximation can increase with dimensionality. Moreover, we provide a general theory on the power analysis under the alternative hypothesis of dependence, and further justify the capability of the rescaled distance correlation in capturing the pure nonlinear dependency under moderately high dimensionality for a certain type of alternative hypothesis. The theoretical results and finite-sample performance of the rescaled statistic are illustrated with several simulation examples and a blockchain application.
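To make the central quantity concrete, here is a minimal sketch of a bias-corrected squared sample distance correlation. It assumes the U-centering construction of Székely and Rizzo for the bias correction; the rescaled test statistic and its normalization studied in the paper are not reproduced. The function names are hypothetical.

```python
import numpy as np

def pairwise_dist(x):
    # Euclidean distance matrix between the rows of x (n observations).
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def u_center(d):
    # U-centering of a distance matrix (assumed form of the bias correction).
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True) / (n - 2)
    col = d.sum(axis=0, keepdims=True) / (n - 2)
    total = d.sum() / ((n - 1) * (n - 2))
    u = d - row - col + total
    np.fill_diagonal(u, 0.0)
    return u

def bias_corrected_dcor2(x, y):
    # Ratio of the unbiased squared distance covariance to the geometric
    # mean of the two unbiased squared distance variances. Requires n > 3.
    a = u_center(pairwise_dist(x))
    b = u_center(pairwise_dist(y))
    n = a.shape[0]
    scale = n * (n - 3)
    dcov2 = (a * b).sum() / scale
    dvar_x = (a * a).sum() / scale
    dvar_y = (b * b).sum() / scale
    return dcov2 / np.sqrt(dvar_x * dvar_y)
```

For example, with `x` drawn from a standard normal distribution, `bias_corrected_dcor2(x, x ** 2)` is clearly positive although the linear correlation between `x` and `x ** 2` is zero, whereas for independent samples the statistic fluctuates around zero and can be slightly negative.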
We consider the problem of constructing pointwise confidence intervals in the multiple isotonic regression model. Recently, Han and Zhang (2020) obtained a pointwise limit distribution theory for the so-called block max–min and min–max estimators (Fokianos, Leucht and Neumann (2020); Deng and Zhang (2020)) in this model, but inference remains a difficult problem due to the nuisance parameter in the limit distribution that involves multiple unknown partial derivatives of the true regression function.
In this paper, we show that this difficult nuisance parameter can be effectively eliminated by taking advantage of information beyond point estimates in the block max–min and min–max estimators. Formally, let (resp. ) be the maximizing lower-left (resp. minimizing upper-right) vertex in the block max–min (resp. min–max) estimator, and be the average of the block max–min and min–max estimators. If all (first-order) partial derivatives of are nonvanishing at , then the following pivotal limit distribution theory holds:
Here is the number of design points in the block , σ is the standard deviation of the errors, and is a universal limit distribution free of nuisance parameters. This immediately yields confidence intervals for with asymptotically exact confidence level and oracle length. Notably, the construction of the confidence intervals, even new in the univariate setting, requires no more efforts than performing an isotonic regression once using the block max–min and min–max estimators, and can be easily adapted to other common monotone models including, for example, (i) monotone density estimation, (ii) interval censoring model with current status data, (iii) counting process model with panel count data, and (iv) generalized linear models. Extensive simulations are carried out to support our theory.
For finite parameter spaces, among decision procedures with finite risk functions, a decision procedure is extended admissible if and only if it is Bayes. Various relaxations of this classical equivalence have been established for infinite parameter spaces, but these extensions are each subject to technical conditions that limit their applicability, especially to modern (semi and nonparametric) statistical problems. Using results in mathematical logic and nonstandard analysis, we extend this equivalence to arbitrary statistical decision problems: informally, we show that, among decision procedures with finite risk functions, a decision procedure is extended admissible if and only if it has infinitesimal excess Bayes risk. In contrast to existing results, our equivalence holds in complete generality, that is, without regularity conditions or restrictions on the model or loss function. We also derive a nonstandard analogue of Blyth’s method that yields sufficient conditions for admissibility, and apply the nonstandard theory to derive a purely standard theorem: when risk functions are continuous on a compact Hausdorff parameter space, a procedure is extended admissible if and only if it is Bayes.
Mendelian randomization (MR) has become a popular approach to study the effect of a modifiable exposure on an outcome by using genetic variants as instrumental variables. A challenge in MR is that each genetic variant explains a relatively small proportion of variance in the exposure and there are many such variants, a setting known as many weak instruments. To this end, we provide a theoretical characterization of the statistical properties of two popular estimators in MR: the inverse-variance weighted (IVW) estimator and the IVW estimator with screened instruments using an independent selection dataset, under many weak instruments. We then propose a debiased IVW estimator, a simple modification of the IVW estimator, that is robust to many weak instruments and does not require screening. Additionally, we present two instrument selection methods to improve the efficiency of the new estimator when a selection dataset is available. An extension of the debiased IVW estimator to handle balanced horizontal pleiotropy is also discussed. We conclude by demonstrating our results in simulated and real datasets.
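The attenuation of the IVW estimator under many weak instruments can be illustrated with a small simulation. The abstract does not state the estimators explicitly, so the formulas below are assumptions: the usual summary-data IVW ratio, and a debiased variant that subtracts the exposure-side sampling variance from each squared effect estimate in the denominator. The simulation settings and all names are hypothetical.

```python
import numpy as np

def ivw(gamma_hat, Gamma_hat, se_y):
    # Inverse-variance weighted ratio of outcome effects to exposure effects.
    w = 1.0 / se_y ** 2
    return np.sum(w * gamma_hat * Gamma_hat) / np.sum(w * gamma_hat ** 2)

def debiased_ivw(gamma_hat, Gamma_hat, se_x, se_y):
    # Assumed debiasing: remove the noise variance se_x^2 that inflates
    # gamma_hat^2 in the IVW denominator.
    w = 1.0 / se_y ** 2
    return np.sum(w * gamma_hat * Gamma_hat) / np.sum(w * (gamma_hat ** 2 - se_x ** 2))

def simulate(p=20_000, beta=0.5, seed=0):
    # Toy summary statistics: p weak instruments whose true effects are
    # of the same order as their standard errors.
    rng = np.random.default_rng(seed)
    se_x = np.full(p, 0.02)
    se_y = np.full(p, 0.02)
    gamma = 0.02 * rng.standard_normal(p)                      # true exposure effects
    gamma_hat = gamma + se_x * rng.standard_normal(p)          # estimated exposure effects
    Gamma_hat = beta * gamma + se_y * rng.standard_normal(p)   # estimated outcome effects
    return gamma_hat, Gamma_hat, se_x, se_y

g, G, sx, sy = simulate()
print("IVW:", ivw(g, G, sy), "debiased IVW:", debiased_ivw(g, G, sx, sy))
```

In this toy setting the true effect is 0.5. The plain IVW estimate is pulled toward zero because the denominator absorbs the estimation noise in the exposure effects, while the debiased version stays close to the truth.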
Given functional data from a survival process with time-dependent covariates, we derive a smooth convex representation for its nonparametric log-likelihood functional and obtain its functional gradient. From this, we devise a generic gradient boosting procedure for estimating the hazard function nonparametrically. An illustrative implementation of the procedure using regression trees is described to show how to recover the unknown hazard. The generic estimator is consistent if the model is correctly specified; alternatively, an oracle inequality can be demonstrated for tree-based models. To avoid overfitting, boosting employs several regularization devices. One of them is stepsize restriction, but the rationale for this is somewhat mysterious from the viewpoint of consistency. Our work brings some clarity to this issue by revealing that stepsize restriction is a mechanism for preventing the curvature of the risk from derailing convergence.
We extend a recently proposed 1-nearest-neighbor based multiclass learning algorithm and prove that our modification is universally strongly Bayes consistent in all metric spaces admitting any such learner, making it an “optimistically universal” Bayes-consistent learner. This is the first learning algorithm known to enjoy this property; by comparison, the k-NN classifier and its variants are not generally universally Bayes consistent, except under additional structural assumptions, such as an inner product, a norm, finite dimension or a Besicovitch-type property.
The metric spaces in which universal Bayes consistency is possible are the “essentially separable” ones—a notion that we define, which is more general than standard separability. The existence of metric spaces that are not essentially separable is widely believed to be independent of the ZFC axioms of set theory. We prove that essential separability exactly characterizes the existence of a universal Bayes-consistent learner for the given metric space. In particular, this yields the first impossibility result for universal Bayes consistency.
Taken together, our results completely characterize strong and weak universal Bayes consistency in metric spaces.
We consider the problem of conditional independence testing of X and Y given Z where X, Y and Z are three real random variables and Z is continuous. We focus on two main cases: when X and Y are both discrete, and when X and Y are both continuous. In view of recent results on conditional independence testing [Ann. Statist. 48 (2020) 1514–1538], one cannot hope to design nontrivial tests that control the type I error for all absolutely continuous conditionally independent distributions, while still ensuring power against interesting alternatives. Consequently, we identify various natural smoothness assumptions on the conditional distributions of X, Y | Z = z as z varies in the support of Z, and study the hardness of conditional independence testing under these smoothness assumptions. We derive matching lower and upper bounds on the critical radius of separation between the null and alternative hypotheses in the total variation metric. The tests we consider are easily implementable and rely on binning the support of the continuous variable Z. To complement these results, we provide a new proof of the hardness result of Shah and Peters [Ann. Statist. 48 (2020) 1514–1538].
We propose a novel estimator for the number of mixture components (denoted by M) in a nonparametric finite mixture model. The setting that we consider is one where the analyst has repeated observations of variables that are conditionally independent given a finitely supported latent variable with M support points. Under a mild assumption on the joint distribution of the observed and latent variables, we show that an integral operator T that is identified from the data has rank equal to M. We use this observation, in conjunction with the fact that singular values of operators are stable under perturbations, to propose an estimator of M, which essentially consists of a thresholding rule that counts the number of singular values of a consistent estimator of T that are greater than a data-driven threshold. We prove that our estimator of M is consistent, and establish nonasymptotic results, which provide finite sample performance guarantees for our estimator. We present a Monte Carlo study, which shows that our estimator performs well for samples of moderate size.
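The thresholding idea can be sketched on a finite matrix standing in for the estimated operator. This is only an illustration: the matrix construction, the noise level, and the fixed threshold are hypothetical, and the data-driven threshold of the paper is not reproduced.

```python
import numpy as np

def count_singular_values_above(t_hat, threshold):
    # Estimate the rank as the number of singular values exceeding the threshold.
    s = np.linalg.svd(t_hat, compute_uv=False)
    return int(np.sum(s > threshold))

rng = np.random.default_rng(0)
M = 3                                                    # true number of components (toy value)
T = rng.standard_normal((20, M)) @ rng.standard_normal((M, 20))  # rank-M stand-in for the operator
T_hat = T + 1e-3 * rng.standard_normal(T.shape)          # small estimation error
M_hat = count_singular_values_above(T_hat, threshold=0.1)
print("estimated number of components:", M_hat)
```

Because singular values move by at most the spectral norm of the perturbation (Weyl's inequality), the zero singular values of `T` stay below the threshold while the M nonzero ones stay above it, provided the threshold sits between the noise level and the smallest nonzero singular value.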
We consider robust algorithms for the k-means clustering problem where a quantizer is constructed based on N independent observations. Our main results are median of means based nonasymptotic excess distortion bounds that hold under the two bounded moments assumption in a general separable Hilbert space. In particular, our results extend the renowned asymptotic result of Pollard (Ann. Statist. 9 (1981) 135–140), who showed that the existence of two moments is sufficient for strong consistency of an empirically optimal quantizer in $\mathbb{R}^d$. In a special case of clustering in $\mathbb{R}^d$, under two bounded moments, we prove matching (up to constant factors) nonasymptotic upper and lower bounds on the excess distortion, which depend on the probability mass of the lightest cluster of an optimal quantizer. Our bounds have the sub-Gaussian form, and the proofs are based on the versions of uniform bounds for robust mean estimators.
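The basic median of means construction for a scalar mean can be sketched as follows. This is the generic building block only, not the clustering procedure of the paper; the function name and the block count are illustrative.

```python
import numpy as np

def median_of_means(x, n_blocks, rng=None):
    # Split the sample into blocks, average within each block,
    # and return the median of the block means.
    x = np.asarray(x, dtype=float)
    if rng is not None:
        x = rng.permutation(x)  # optional shuffle before blocking
    blocks = np.array_split(x, n_blocks)
    return float(np.median([b.mean() for b in blocks]))

sample = [1, 2, 3, 4, 5, 1000]  # one gross outlier
print(median_of_means(sample, n_blocks=3), np.mean(sample))
```

With three blocks the outlier contaminates only one block mean, so the median of the block means is 3.5, whereas the ordinary sample mean is dragged above 169.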
Recent results in nonparametric regression show that deep learning, that is, neural network estimates with many hidden layers, is able to circumvent the so-called curse of dimensionality provided that suitable restrictions on the structure of the regression function hold. One key feature of the neural networks used in these results is that their network architecture has a further constraint, namely the network sparsity. In this paper, we show that we can get similar results also for least squares estimates based on simple fully connected neural networks with ReLU activation functions. Here, either the number of neurons per hidden layer is fixed and the number of hidden layers tends to infinity suitably fast for sample size tending to infinity, or the number of hidden layers is bounded by some logarithmic factor in the sample size and the number of neurons per hidden layer tends to infinity suitably fast for sample size tending to infinity. The proof is based on new approximation results concerning deep neural networks.
This paper introduces the $(\alpha, \Gamma)$-descent, an iterative algorithm which operates on measures and performs α-divergence minimisation in a Bayesian framework. This gradient-based procedure extends the commonly-used variational approximation by adding a prior on the variational parameters in the form of a measure. We prove that for a rich family of functions Γ, this algorithm leads at each step to a systematic decrease in the α-divergence and derive convergence results. Our framework recovers the Entropic Mirror Descent algorithm and provides an alternative algorithm that we call the Power Descent. Moreover, in its stochastic formulation, the $(\alpha, \Gamma)$-descent allows one to optimise the mixture weights of any given mixture model without any information on the underlying distribution of the variational parameters. This renders our method compatible with many choices of parameter updates and applicable to a wide range of Machine Learning tasks. We demonstrate empirically on both toy and real-world examples the benefit of using the Power Descent and going beyond the Entropic Mirror Descent framework, which fails as the dimension grows.
We propose a method for the detection of a change point in a sequence of distributions, which are available through a large number of observations at each . Under the null hypothesis, the distributions are equal. Under the alternative hypothesis, there is a change point , such that for and some unknown distribution G, which is not equal to . The change point, if it exists, is unknown, and the distributions before and after the potential change point are unknown. The decision about the existence of a change point is made sequentially, as new data arrive. At each time i, the count of observations, N, can increase to infinity. The detection procedure is based on a weighted version of the Wasserstein distance. Its asymptotic and finite sample validity is established. Its performance is illustrated by an application to returns on stocks in the S&P 500 index.
As a general rule of thumb, the resolution of a light microscope (i.e., the ability to discern objects) is predominantly described by the full width at half maximum (FWHM) of its point spread function (psf), that is, the diameter of the blurring density at half of its maximum. Classical wave optics suggests a linear relationship between FWHM and resolution, also manifested in the well-known Abbe and Rayleigh criteria, dating back to the end of the 19th century. However, during the last two decades conventional light microscopy has undergone a shift from microscopic scales to nanoscales. This increase in resolution comes with the need to incorporate the random nature of observations (light photons) and challenges the classical view of discernability, as we argue in this paper. Instead, we suggest a statistical description of resolution obtained from such random data. Our notion of discernability is based on statistically testing whether one or two objects with the same total intensity are present. For Poisson measurements, we get linear dependence of the (minimax) detection boundary on the FWHM, whereas for a homogeneous Gaussian model the dependence of resolution is nonlinear. Hence, at small physical scales modeling by homogeneous Gaussians is inadequate, although often implicitly assumed in many reconstruction algorithms. In contrast, the Poisson model and its variance-stabilized Gaussian approximation seem to provide a statistically sound description of resolution at the nanoscale. Our theory is also applicable to other imaging setups, such as telescopes.
The Lasso is a popular regression method for high-dimensional problems in which the number of parameters p is larger than the number n of samples: $p > n$. A useful heuristic relates the statistical properties of the Lasso estimator to those of a simple soft-thresholding denoiser, in a denoising problem in which the parameters are observed in Gaussian noise, with a carefully tuned variance. Earlier work confirmed this picture in the limit $n, p \to \infty$, pointwise in the parameters θ and in the value of the regularization parameter.
Here, we consider a standard random design model and prove exponential concentration of its empirical distribution around the prediction provided by the Gaussian denoising model. Crucially, our results are uniform with respect to θ belonging to balls, , and with respect to the regularization parameter. This allows us to derive sharp results for the performances of various data-driven procedures to tune the regularization.
Our proofs make use of Gaussian comparison inequalities, and in particular of a version of Gordon’s minimax theorem developed by Thrampoulidis, Oymak and Hassibi, which controls the optimum value of the Lasso optimization problem. Crucially, we prove a stability property of the minimizer in Wasserstein distance that allows one to characterize properties of the minimizer itself.
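The soft-thresholding denoiser that this abstract uses as a reference model is simple enough to write down. The sketch below applies it to a sparse vector observed in Gaussian noise; the signal, the unit noise variance, and the threshold are arbitrary illustrative choices, not the carefully tuned values the abstract refers to.

```python
import numpy as np

def soft_threshold(y, lam):
    # Shrink each coordinate toward zero by lam and set small ones to exactly zero.
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

rng = np.random.default_rng(0)
theta = np.concatenate([np.full(5, 4.0), np.zeros(95)])  # hypothetical sparse parameter vector
y = theta + rng.standard_normal(theta.size)              # parameters observed in Gaussian noise
theta_hat = soft_threshold(y, lam=2.0)
print("nonzero coordinates kept:", int(np.count_nonzero(theta_hat)))
```

For instance, `soft_threshold([3.0, -0.5, -2.0], 1.0)` gives 2, 0 and -1: large entries are shrunk by the threshold and entries below it are zeroed.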
This paper establishes asymptotic theory for optimal estimation of change points in general time series models under α-mixing conditions. We show that the Bayes-type estimator is asymptotically minimax for change-point estimation under squared error loss. Two bootstrap procedures are developed to construct confidence intervals for the change points. An approximate limiting distribution of the change-point estimator under small change is also derived. Simulations and real data applications are presented to investigate the finite sample performance of the Bayes-type estimator and the bootstrap procedures.
In a seminal article, Berger, De Oliveira and Sansó [J. Amer. Statist. Assoc.96 (2001) 1361–1374] compare several objective prior distributions for the parameters of Gaussian process models with isotropic correlation kernel. The reference prior distribution stands out among them insofar as it always leads to a proper posterior. They prove this result for rough correlation kernels: Spherical, Exponential with power , Matérn with smoothness . This paper provides a proof for smooth correlation kernels: Exponential with power , Matérn with smoothness , Rational Quadratic, along with tail rates of the reference prior for these kernels.
In network analysis, within-community members are more likely to be connected than between-community members, which is reflected in the intercorrelation of edges within a community. However, existing probabilistic models for community detection such as the stochastic block model (SBM) are not designed to capture the dependence among edges. In this paper, we propose a new community detection approach to incorporate intra-community dependence of connectivities through the Bahadur representation. The proposed method does not require specifying the likelihood function, which could be intractable for correlated binary connectivities. In addition, the proposed method allows for heterogeneity among edges between different communities. In theory, we show that incorporating correlation information can achieve a faster convergence rate compared to the independent SBM, and the proposed algorithm has a lower estimation bias and accelerated convergence compared to the variational EM. Our simulation studies show that the proposed algorithm outperforms the existing multinetwork community detection methods assuming conditional independence among edges. We also demonstrate the application of the proposed method to agricultural product trading networks from different countries and to brain fMRI imaging networks.