We review mathematically tractable models for connected networks on random points in the plane, emphasizing the class of proximity graphs, which deserves to be better known to applied probabilists and statisticians. We introduce and motivate a particular statistic R measuring the shortness of routes in a network. We illustrate, in part via Monte Carlo simulation, the trade-off between normalized network length and R in a one-parameter family of proximity graphs. How close this family comes to the optimal trade-off over all possible networks remains an intriguing open question.
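To make the proximity-graph class concrete: the β-skeletons are a standard one-parameter family of proximity graphs, and the Gabriel graph is its β = 1 member. The following brute-force sketch (our own illustrative code, not the authors'; the function name and O(n³) construction are ours) builds the Gabriel graph on random points and totals the network length.

```python
import numpy as np

def gabriel_graph(pts):
    """Brute-force Gabriel graph: points i and j are joined iff no other
    point lies inside the disc whose diameter is the segment (i, j);
    equivalently, for every other k: |ik|^2 + |jk|^2 >= |ij|^2."""
    n = len(pts)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if all(d2[i, k] + d2[j, k] >= d2[i, j]
                   for k in range(n) if k not in (i, j)):
                edges.add((i, j))
    return edges

# total network length on 50 uniform random points in the unit square
rng = np.random.default_rng(0)
pts = rng.random((50, 2))
edges = gabriel_graph(pts)
length = sum(float(np.hypot(*(pts[i] - pts[j]))) for i, j in edges)
```

Other members of the family change only the exclusion region tested in the `all(...)` condition, which is what makes a Monte Carlo comparison of length versus route efficiency across the family straightforward.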
The paper is a write-up of a talk developed by the first author during 2007–2009.
Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation, along with the assumption that models with high explanatory power are inherently of high predictive power. Conflation of explanation and prediction is common, yet the distinction must be understood if scientific knowledge is to progress. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise when modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction for each step of the modeling process.
This article discusses the potential of graphics processing units (GPUs) in high-dimensional optimization problems. A single GPU card with hundreds of arithmetic cores can be inserted in a personal computer and dramatically accelerates many statistical algorithms. To exploit these devices fully, optimization algorithms should reduce to multiple parallel tasks, each accessing a limited amount of data. These criteria favor EM and MM algorithms that separate parameters and data. To a lesser extent, block relaxation and coordinate descent/ascent also qualify. We demonstrate the utility of GPUs in nonnegative matrix factorization, PET image reconstruction, and multidimensional scaling. Speedups of 100-fold can easily be attained. Over the next decade, GPUs will fundamentally alter the landscape of computational statistics. It is time for more statisticians to get on board.
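The separability that makes MM algorithms GPU-friendly is visible in the classic Lee–Seung multiplicative updates for nonnegative matrix factorization: every entry of the factors is updated by the same elementwise formula. A CPU NumPy sketch (ours, for illustration; a GPU version would run the identical updates through, e.g., CuPy or JAX):

```python
import numpy as np

def nmf(V, r, iters=200, eps=1e-9, seed=0):
    """Lee–Seung multiplicative updates for V ≈ W @ H under Frobenius loss.
    Each update is elementwise, so the work decomposes into many small
    independent tasks -- the property that lets MM algorithms of this kind
    exploit hundreds of GPU arithmetic cores."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # MM step: never increases the loss
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

The `eps` guard keeps the multiplicative updates well defined when a denominator entry underflows to zero.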
Non-Gaussian outcomes are often modeled using members of the so-called exponential family. Well-known members are the Bernoulli model for binary data, leading to logistic regression, and the Poisson model for count data, leading to Poisson regression. Two of the main reasons for extending this family are (1) the occurrence of overdispersion, meaning that the variability in the data is not adequately described by the models, which often exhibit a prescribed mean–variance link, and (2) the accommodation of hierarchical structure in the data, stemming from clustering, which may in turn result from repeatedly measuring the outcome, from sampling several members of the same family, and so on. The first issue is dealt with through a variety of overdispersion models, such as the beta-binomial model for grouped binary data and the negative-binomial model for counts. Clustering is often accommodated through the inclusion of random subject-specific effects. Conventionally, though not always, such random effects are assumed to be normally distributed. Although both phenomena may occur simultaneously, models combining them are uncommon. This paper proposes a broad class of generalized linear models accommodating overdispersion and clustering through two separate sets of random effects. We place particular emphasis on so-called conjugate random effects at the level of the mean for the first aspect and normal random effects embedded within the linear predictor for the second aspect, even though our family is more general. The binary, count, and time-to-event cases are given particular emphasis. Apart from model formulation, we present an overview of estimation methods, and then settle for maximum likelihood estimation with analytic–numerical integration. Implications for the derivation of marginal correlation functions are discussed.
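A toy simulation (our own illustration, not the paper's model or estimator) shows how the two random-effect layers act in the count case: a mean-one gamma multiplier on the Poisson mean induces overdispersion, while a normal effect shared within each subject's linear predictor induces clustering.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_counts(n_subjects=200, n_reps=5, mu=4.0, alpha=2.0, sigma=0.5):
    """Counts with two random-effect layers (parameter names are ours):
    theta_ij ~ Gamma with mean 1 ("conjugate" effect, creates overdispersion),
    b_i ~ Normal(0, sigma) shared within subject (creates clustering)."""
    b = rng.normal(0.0, sigma, size=n_subjects)                        # subject effect
    theta = rng.gamma(alpha, 1.0 / alpha, size=(n_subjects, n_reps))   # mean-1 gamma
    lam = mu * np.exp(b)[:, None] * theta
    return rng.poisson(lam)

y = simulate_counts()
# For a pure Poisson model the marginal mean and variance would be equal;
# here the gamma and normal layers push the variance above the mean.
print(y.mean(), y.var())
```

Integrating the gamma layer out analytically gives a negative-binomial marginal, which is the sense in which the conjugate effect generalizes the classical overdispersion models mentioned above.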
The methodology is applied to data from a study of epileptic seizures, a clinical trial in toenail infection (onychomycosis), and survival data in children with asthma.
The Bayesian measure of sample information about the parameter, known as Lindley’s measure, is widely used in problems such as the development of prior distributions, models for likelihood functions, and optimal designs. The predictive information is defined similarly and used for model selection and optimal designs, though to a lesser extent. The parameter and predictive information measures are proper utility functions and have also been used in combination. Yet the relationship between the two measures, and the effects of conditional dependence between the observable quantities on the Bayesian information measures, remain unexplored. We address both issues. The relationship between the two information measures is explored through the information provided by the sample about the parameter and prediction jointly. The role of dependence is explored along with the interplay between the information measures, prior, and sampling design. For a conditionally independent sequence of observable quantities, decompositions of the joint information characterize Lindley’s measure as the sample information about the parameter and prediction jointly, and the predictive information as part of it. For the conditionally dependent case, the joint information about parameter and prediction exceeds Lindley’s measure by an amount due to the dependence. More specific results are shown for normal linear models and a broad subfamily of the exponential family. Conditionally independent samples provide relatively little information for prediction, and the gap between the parameter and predictive information measures grows rapidly with the sample size. Three dependence structures are studied: the intraclass (IC) and serially correlated (SC) normal models, and order statistics. For the IC and SC models, the information about the mean parameter decreases and the predictive information increases with the correlation, but the joint information is not monotone and has a unique minimum.
Compensation of the loss of parameter information due to dependence requires larger samples. For the order statistics, the joint information exceeds Lindley’s measure by an amount which does not depend on the prior or the model for the data, but it is not monotone in the sample size and has a unique maximum.
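The relationship the abstract describes can be stated compactly (in our notation, not necessarily the authors') via the mutual-information chain rule:

```latex
% Lindley's measure: the mutual information between the sample X and parameter Theta
\[
  I(X;\Theta) \;=\; \mathbb{E}\!\left[\log\frac{p(\theta \mid x)}{p(\theta)}\right].
\]
% For a future observable Y, the joint information obeys the chain rule
\[
  I\bigl(X;(\Theta,Y)\bigr) \;=\; I(X;\Theta) \;+\; I(X;Y \mid \Theta).
\]
% If Y is conditionally independent of X given Theta, the last term vanishes and
% the joint information reduces to Lindley's measure; under conditional
% dependence, the joint information exceeds Lindley's measure by I(X;Y | Theta) > 0.
```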
We consider situations where data have been collected such that the sampling depends on the outcome of interest and possibly further covariates, as for instance in case-control studies. Graphical models represent assumptions about the conditional independencies among the variables. By including a node for the sampling indicator, assumptions about sampling processes can be made explicit. We demonstrate how to read off such graphs whether consistent estimation of the association between exposure and outcome is possible. Moreover, we give sufficient graphical conditions for testing and estimating the causal effect of exposure on outcome. The practical use is illustrated with a number of examples.
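Reading conditional independencies off such a graph relies on d-separation. A minimal, self-contained sketch of the standard moralization criterion (our own illustrative code; the paper's graphical conditions for consistent estimation are more specific):

```python
from itertools import combinations

def d_separated(dag, xs, ys, zs):
    """d-separation via the moralization criterion:
    (1) restrict to the ancestral subgraph of xs | ys | zs,
    (2) moralize (join co-parents of each node, drop edge directions),
    (3) xs and ys are d-separated by zs iff deleting zs disconnects them.
    `dag` maps each node to the set of its parents."""
    # 1. ancestral set (the query nodes and all their ancestors)
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(dag.get(v, set()))
    # 2. moral (undirected) graph on the ancestral set
    adj = {v: set() for v in keep}
    for v in keep:
        pa = dag.get(v, set()) & keep
        for p in pa:
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(pa, 2):   # "marry" co-parents
            adj[p].add(q); adj[q].add(p)
    # 3. reachability from xs to ys avoiding zs
    seen, stack = set(), [v for v in xs if v not in zs]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if v in ys:
            return False
        stack.extend(w for w in adj[v] if w not in zs)
    return True
```

With a sampling indicator S as a common child (collider) of exposure X and outcome Y, X and Y are d-separated marginally but not once S is conditioned on, which is the graphical signature of selection bias in outcome-dependent sampling.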
A two-groups mixed-effects model for the comparison of (normalized) microarray data from two treatment groups is considered. Most competing parametric methods that have appeared in the literature are obtained as special cases of, or by minor modification of, the proposed model. Approximate maximum likelihood fitting is accomplished via a fast and scalable algorithm, which we call LEMMA (Laplace approximated EM Microarray Analysis). The posterior odds of treatment × gene interactions, derived from the model, involve shrinkage estimates of both the interactions and the gene-specific error variances. Genes are classified as being associated with treatment based on the posterior odds and the local false discovery rate (f.d.r.) with a fixed cutoff. Our model-based approach also allows one to declare the non-null status of a gene by controlling the false discovery rate (FDR). A detailed simulation study shows that the approach outperforms well-known competitors. We also apply the proposed methodology to two previously analyzed microarray examples. Extensions of the proposed method to paired treatments and multiple treatments are also discussed.
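For readers unfamiliar with the generic FDR-control step, here is the standard Benjamini–Hochberg step-up procedure (the paper's own machinery is model-based via posterior odds; this sketch is just the classical frequentist counterpart, in our own code):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini–Hochberg step-up procedure: with sorted p-values p_(1) <= ...
    <= p_(m), reject the k smallest, where k is the largest i such that
    p_(i) <= i * q / m. Controls the FDR at level q under independence."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest index meeting the bound
        reject[order[: k + 1]] = True
    return reject
```

Note the "step-up" character: a p-value that would fail its own threshold can still be rejected if a later (larger) sorted p-value meets its bound.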
George C. Tiao was born in London in 1933. After graduating with a B.A. in Economics from National Taiwan University in 1955, he went to the US to obtain an M.B.A. from New York University in 1958 and a Ph.D. in Economics from the University of Wisconsin, Madison in 1962. From 1962 to 1982 he was, successively, Assistant Professor, Associate Professor, Professor, and Bascom Professor of Statistics and Business at the University of Wisconsin, Madison, and from 1973 to 1975 he was Chairman of the Department of Statistics. He moved to the Graduate School of Business at the University of Chicago in 1982 and is the W. Allen Wallis Professor of Econometrics and Statistics (emeritus).
George Tiao has played a leading role in the development of Bayesian statistics, time series analysis, and environmental statistics. He is co-author, with G. E. P. Box, of Bayesian Inference in Statistical Analysis, and is the developer of a model-based approach to seasonal adjustment (with S. C. Hillmer), of outlier analysis in time series (with I. Chang), and of new ways of vector ARMA model building (with R. S. Tsay). He is the author, co-author, or co-editor of 7 books and over 120 articles in refereed econometric, environmental, and statistical journals, and has been thesis advisor to over 25 students. He is a leading figure in the development of statistics in Taiwan and China, the Founding President of the International Chinese Statistical Association (1987–1988), and the Founding Chair Editor of the journal Statistica Sinica (1988–1993). He played a leading role over the 20-year period 1979–1999 in the organization of the annual NBER/NSF Time Series Workshop, and he was a founding member of the annual conference “Making Statistics More Effective in Schools of Business” (1986–2006). Among other honors, he was elected ASA Fellow (1973), IMS Fellow (1974), member of Academia Sinica, Taiwan (1976), and member of the ISI (1980), and was the recipient of the Distinguished Service Medal, DGBAS, Taiwan (1993), the Julius Shiskin Award (2001), the Wilks Memorial Medal Award (2001), and the Statistician of the Year Award of the ASA Chicago Chapter (2005). He received honorary doctorates in 2003 from the Universidad Carlos III de Madrid and National Tsinghua University, Hsinchu, Taiwan.