Readability prediction: How many features are necessary?
Florian Schwendinger, Laura Vana, Kurt Hornik
Ann. Appl. Stat. 18(2): 1010-1034 (June 2024). DOI: 10.1214/23-AOAS1820

Abstract

Traditionally, readability prediction has relied on readability formulas, which are based on shallow text characteristics such as average word and sentence length. With recent advances in text mining and natural language processing, more complex text properties can be incorporated into readability prediction models, with papers in the literature suggesting the use of up to 200 features for predicting text readability. However, many of the features generated using natural language processing tools are highly correlated and can be thought of as measuring similar latent text properties. When dealing with a high-dimensional space of correlated features, removing redundant variables has two advantages: it (1) improves interpretability and (2) increases the predictive power of the model. In this paper, we propose an ordinal version of the averaged lasso, which combines hierarchical clustering with the lasso, in order to identify relevant features for readability prediction. We illustrate the approach on two corpora and show improved prediction accuracy when benchmarking against a set of competing models. The annotated corpora as well as the steps necessary for feature creation are freely available as R packages, thus allowing the obtained results to be directly incorporated into a readability estimation pipeline.
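
The abstract does not specify the linkage, the dissimilarity, or the ordinal model used, so the following is only a minimal sketch of the clustering-and-averaging idea behind an averaged lasso, not the authors' implementation. It assumes single-linkage grouping on Pearson correlation with a hypothetical cut-off of 0.9, and the feature names and values are invented for illustration. The (ordinal) lasso step, which would then be fitted on the averaged cluster features, is omitted.

```python
from itertools import combinations
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cluster_features(columns, threshold=0.9):
    """Group features linked by a chain of pairs with correlation >= threshold.

    This equals cutting a single-linkage dendrogram built on the
    dissimilarity 1 - correlation at height 1 - threshold.
    """
    names = list(columns)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path compression.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if pearson(columns[a], columns[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


def standardize(x):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]


def average_clusters(columns, clusters):
    """Replace each cluster by the average of its standardized members."""
    z = {name: standardize(values) for name, values in columns.items()}
    averaged = {}
    for members in clusters:
        rows = zip(*(z[m] for m in members))
        averaged["+".join(members)] = [mean(row) for row in rows]
    return averaged


# Hypothetical toy data: three features measured on five texts.
features = {
    "avg_word_length": [4.1, 4.5, 5.0, 5.6, 6.2],
    "avg_syllables_per_word": [1.2, 1.35, 1.5, 1.7, 1.9],
    "avg_sentence_length": [12.0, 9.0, 20.0, 11.0, 18.0],
}

clusters = cluster_features(features, threshold=0.9)
averaged = average_clusters(features, clusters)
print(clusters)
```

In this toy example the two word-level features are almost perfectly correlated and collapse into one averaged feature, while sentence length stays on its own. A sparse model fitted on `averaged` would then select among cluster representatives rather than among redundant raw features.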

Funding Statement

Laura Vana and Florian Schwendinger acknowledge funding from the Austrian Science Fund (FWF) for the project “High-dimensional statistical learning: New methods to advance economic and sustainability policies” (ZK 35), jointly carried out by Alpen-Adria University Klagenfurt, Vienna University of Economics and Business (WU), Paris Lodron University Salzburg, TU Wien and the Austrian Institute of Economic Research (WIFO).

Acknowledgments

The authors would like to thank the Associate Editors and the anonymous referees for their valuable insights, which helped improve the manuscript.

Citation

Florian Schwendinger, Laura Vana, Kurt Hornik. "Readability prediction: How many features are necessary?" Ann. Appl. Stat. 18 (2) 1010-1034, June 2024. https://doi.org/10.1214/23-AOAS1820

Information

Received: 1 June 2020; Revised: 1 May 2023; Published: June 2024
First available in Project Euclid: 5 April 2024

Digital Object Identifier: 10.1214/23-AOAS1820

Keywords: Averaged ordinal lasso, Model selection, NLP, pipeline, readability prediction

Rights: Copyright © 2024 Institute of Mathematical Statistics

JOURNAL ARTICLE
25 PAGES

