The Annals of Statistics

A lasso for hierarchical interactions

Jacob Bien, Jonathan Taylor, and Robert Tibshirani

Full-text: Open access


We add a set of convex constraints to the lasso to produce sparse interaction models that honor the hierarchy restriction that an interaction be included in a model only if one or both of its variables are marginally important. We give a precise characterization of the effect of this hierarchy constraint, prove that hierarchy holds with probability one, and derive an unbiased estimate of the degrees of freedom of our estimator. A bound on this estimate reveals the amount of fitting “saved” by the hierarchy constraint.
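The hierarchy restriction can be made concrete with a small check. The sketch below is illustrative only, not the authors' estimator: the function name and the coefficient layout (a main-effect vector `beta` and an upper-triangular interaction matrix `theta`) are assumptions. Weak hierarchy requires that an interaction coefficient theta[j, k] be nonzero only when at least one of the main effects beta[j], beta[k] is nonzero.

```python
import numpy as np

def satisfies_weak_hierarchy(beta, theta, tol=1e-10):
    """Check the weak hierarchy property: an interaction theta[j, k]
    may be nonzero only if at least one of the main effects
    beta[j], beta[k] is nonzero (up to a numerical tolerance)."""
    p = len(beta)
    for j in range(p):
        for k in range(j + 1, p):
            if abs(theta[j, k]) > tol and abs(beta[j]) <= tol and abs(beta[k]) <= tol:
                return False
    return True

# A model where the x1*x2 interaction enters alongside the main
# effect of x1: weak hierarchy holds.
beta = np.array([1.5, 0.0, 0.0])
theta = np.zeros((3, 3))
theta[0, 1] = 0.7
print(satisfies_weak_hierarchy(beta, theta))   # True

# Adding an x2*x3 interaction with neither main effect present
# violates hierarchy.
theta[1, 2] = -0.3
print(satisfies_weak_hierarchy(beta, theta))   # False
```

The convex constraints of the paper enforce this property on the estimator itself; the check above only verifies it after the fact for a given set of coefficients.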

We distinguish between parameter sparsity—the number of nonzero coefficients—and practical sparsity—the number of raw variables one must measure to make a new prediction. Hierarchy focuses on the latter, which is more closely tied to important data collection concerns such as cost, time and effort. We develop an algorithm, available in the R package hierNet, and perform an empirical study of our method.
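The two notions of sparsity can be illustrated directly. The following is a minimal Python sketch under assumed names (it is not part of hierNet): parameter sparsity counts nonzero coefficients, while practical sparsity counts the distinct raw variables appearing in any nonzero term, which is what a hierarchical interaction keeps small.

```python
import numpy as np

def parameter_sparsity(beta, theta, tol=1e-10):
    """Number of nonzero coefficients (main effects plus interactions)."""
    p = len(beta)
    n_main = int(np.sum(np.abs(beta) > tol))
    n_int = sum(1 for j in range(p) for k in range(j + 1, p)
                if abs(theta[j, k]) > tol)
    return n_main + n_int

def practical_sparsity(beta, theta, tol=1e-10):
    """Number of raw variables one must measure to predict: a variable
    counts if it appears in any nonzero main effect or interaction."""
    p = len(beta)
    needed = {j for j in range(p) if abs(beta[j]) > tol}
    for j in range(p):
        for k in range(j + 1, p):
            if abs(theta[j, k]) > tol:
                needed.update({j, k})
    return len(needed)

# Three nonzero coefficients, but only two raw variables are required,
# because the interaction reuses variables already in the model.
beta = np.array([1.0, 2.0, 0.0, 0.0])
theta = np.zeros((4, 4))
theta[0, 1] = 0.5
print(parameter_sparsity(beta, theta))   # 3
print(practical_sparsity(beta, theta))   # 2
```

An interaction that respects hierarchy adds a coefficient without adding a new variable to measure; a non-hierarchical interaction can increase practical sparsity by two at once.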

Article information

Ann. Statist., Volume 41, Number 3 (2013), 1111-1141.

First available in Project Euclid: 13 June 2013

Primary: 62J07: Ridge regression; shrinkage estimators

Keywords: Regularized regression; lasso; interactions; hierarchical sparsity; convexity


Bien, Jacob; Taylor, Jonathan; Tibshirani, Robert. A lasso for hierarchical interactions. Ann. Statist. 41 (2013), no. 3, 1111--1141. doi:10.1214/13-AOS1096.


References

  • Agresti, A. (2002). Categorical Data Analysis, 2nd ed. Wiley-Interscience, New York.
  • Bach, F. (2011). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4 1–106.
  • Bach, F., Jenatton, R., Mairal, J. and Obozinski, G. (2012). Structured sparsity through convex optimization. Statist. Sci. 27 450–468.
  • Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2 183–202.
  • Bickel, P., Ritov, Y. and Tsybakov, A. (2010). Hierarchical selection of variables in sparse high-dimensional regression. In Borrowing Strength: Theory Powering Applications—A Festschrift for Lawrence D. Brown. Inst. Math. Stat. Collect. 6 56–69. Inst. Math. Statist., Beachwood, OH.
  • Bien, J., Taylor, J. and Tibshirani, R. (2013). Supplement to “A lasso for hierarchical interactions.” DOI:10.1214/13-AOS1096SUPP.
  • Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 1–124.
  • Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37 373–384.
  • Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth Advanced Books and Software, Belmont, CA.
  • Chipman, H. (1996). Bayesian variable selection with related predictors. Canad. J. Statist. 24 17–36.
  • Choi, N. H., Li, W. and Zhu, J. (2010). Variable selection with the strong heredity constraint and its oracle property. J. Amer. Statist. Assoc. 105 354–364.
  • Cox, D. R. (1984). Interaction. Internat. Statist. Rev. 52 1–31.
  • Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc. 81 461–470.
  • Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499.
  • Forina, M., Armanino, C., Lanteri, S. and Tiscornia, E. (1983). Classification of olive oils from their fatty acid composition. In Food Research and Data Analysis 189–214. Applied Science Publishers, London.
  • Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1–141.
  • Friedman, J. H., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 1–22.
  • George, E. and McCulloch, R. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 884–889.
  • Hamada, M. and Wu, C. (1992). Analysis of designed experiments with complex aliasing. Journal of Quality Technology 24 130–137.
  • Jenatton, R., Audibert, J.-Y. and Bach, F. (2011). Structured variable selection with sparsity-inducing norms. J. Mach. Learn. Res. 12 2777–2824.
  • Jenatton, R., Mairal, J., Obozinski, G. and Bach, F. (2010). Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML).
  • McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models. Chapman & Hall, London.
  • Nardi, Y. and Rinaldo, A. (2012). The log-linear group-lasso estimator and its asymptotic properties. Bernoulli 18 945–974.
  • Nelder, J. A. (1977). A reformulation of linear models. J. Roy. Statist. Soc. Ser. A 140 48–76.
  • Nelder, J. A. (1997). Letters to the editors: Functional marginality is important. J. R. Stat. Soc. Ser. C. Appl. Stat. 46 281–286.
  • Obozinski, G., Jacob, L. and Vert, J. (2011). Group lasso with overlaps: The latent group lasso approach. Available at arXiv:1110.0413.
  • Park, M. and Hastie, T. (2008). Penalized logistic regression for detecting gene interactions. Biostatistics 9 30–50.
  • Peixoto, J. (1987). Hierarchical variable selection in polynomial regression models. Amer. Statist. 41 311–313.
  • Radchenko, P. and James, G. M. (2010). Variable selection using adaptive nonlinear interaction structures in high dimensions. J. Amer. Statist. Assoc. 105 1541–1553.
  • Rhee, S., Taylor, J., Wadhera, G., Ben-Hur, A., Brutlag, D. and Shafer, R. (2006). Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proc. Natl. Acad. Sci. USA 103 17355.
  • Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9 1135–1151.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • Tibshirani, R. J. and Taylor, J. (2011). The solution path of the generalized lasso. Ann. Statist. 39 1335–1371.
  • Tibshirani, R. J. and Taylor, J. (2012). Degrees of freedom in lasso problems. Ann. Statist. 40 1198–1232.
  • Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109 475–494.
  • Turlach, B. (2004). Discussion of “Least angle regression.” Ann. Statist. 32 481–490.
  • Wu, J., Devlin, B., Ringquist, S., Trucco, M. and Roeder, K. (2010). Screen and clean: A tool for identifying interactions in genome-wide association studies. Genetic Epidemiology 34 275–285.
  • Yuan, M., Joseph, V. R. and Lin, Y. (2007). An efficient variable selection approach for analyzing designed experiments. Technometrics 49 430–439.
  • Yuan, M., Joseph, V. R. and Zou, H. (2009). Structured variable selection and estimation. Ann. Appl. Stat. 3 1738–1757.
  • Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67.
  • Zhao, P., Rocha, G. and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Statist. 37 3468–3497.
  • Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301–320.
  • Zou, H., Hastie, T. and Tibshirani, R. (2007). On the “degrees of freedom” of the lasso. Ann. Statist. 35 2173–2192.

Supplemental materials

  • Supplementary material: Supplement to “A lasso for hierarchical interactions”. We include proofs of Property 1 and of the statement in Remark 3. Additionally, we show that the algorithm for the logistic regression case is nearly identical and give more detail on Algorithm 2.