The Annals of Applied Statistics

Concise comparative summaries (CCS) of large text corpora with a human experiment

Jinzhu Jia, Luke Miratrix, Bin Yu, Brian Gawalt, Laurent El Ghaoui, Luke Barnesmoore, and Sophie Clavier

Full-text: Open access


In this paper we propose a general framework for topic-specific summarization of large text corpora and illustrate how it can be used for the analysis of news databases. Our framework, concise comparative summarization (CCS), is built on sparse classification methods. CCS is a lightweight and flexible tool that offers a compromise between the simple word-frequency-based methods currently in wide use and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). We argue that sparse methods have much to offer for text analysis and hope CCS opens the door for a new branch of research in this important field.

For a particular topic of interest (e.g., China or energy), CCS automatically labels documents as being either on- or off-topic (usually via keyword search), and then uses sparse classification methods to predict these labels from the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found to be predictive is then harvested as the summary.
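The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `tokenize`, `soft_threshold`, and `ccs_summary`, the penalty parameter `lam`, and the toy documents are all hypothetical. The sketch assumes single words rather than phrases, a squared-error Lasso fitted by coordinate descent on ±1 labels, $L^{2}$ normalization taken to mean rescaling each word's count column to unit Euclidean length, and a summary made of the positively weighted words only.

```python
import math
import re


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def soft_threshold(value, lam):
    """Shrink value toward zero by lam (the Lasso coordinate update)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def ccs_summary(docs, topic_word, lam=1.0, n_iter=200):
    """Toy CCS-style summary: words that predict the on-topic label.

    Assumptions of this sketch: keyword labeling, unit-length count
    columns, squared-error Lasso with an unpenalized intercept.
    """
    tokens = [tokenize(d) for d in docs]
    # Step 1: label documents +1 (on-topic) or -1 (off-topic) by keyword search.
    y = [1.0 if topic_word in t else -1.0 for t in tokens]
    # Step 2: count every other word; the labeling keyword itself is excluded.
    vocab = sorted({w for t in tokens for w in t if w != topic_word})
    cols = {}
    for w in vocab:
        col = [float(t.count(w)) for t in tokens]
        norm = math.sqrt(sum(v * v for v in col))
        cols[w] = [v / norm for v in col]  # rescale column to unit length
    # Step 3: Lasso by cyclic coordinate descent on the residuals.
    n = len(docs)
    beta = {w: 0.0 for w in vocab}
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    for _ in range(n_iter):
        for w in vocab:
            col = cols[w]
            # Correlation of the column with the partial residual.
            rho = sum(c * (r + c * beta[w]) for c, r in zip(col, resid))
            new = soft_threshold(rho, lam)
            delta = new - beta[w]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                beta[w] = new
        # Refit the intercept so the residuals have mean zero.
        shift = sum(resid) / n
        intercept += shift
        resid = [r - shift for r in resid]
    # Step 4: harvest the positively weighted words as the summary.
    selected = [(w, b) for w, b in beta.items() if b > 0]
    return sorted(selected, key=lambda pair: -pair[1])


toy_docs = [
    "china trade tariffs beijing",
    "beijing talks china exports",
    "china beijing summit",
    "football match score goal",
    "election vote senate",
    "weather rain storm forecast",
]
print(ccs_summary(toy_docs, "china", lam=1.2))
```

On this toy corpus, a word that occurs in every on-topic document is weighted more heavily than words that occur in only one, and a larger `lam` yields a shorter summary. Real corpora would call for sparse matrices and a tuned penalty rather than pure-Python loops.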

To validate our tool, we designed and conducted a human survey, using news articles from the New York Times international section, to compare the different summarizers with human understanding. We demonstrate our approach with two case studies: a media analysis of the framing of “Egypt” in the New York Times throughout the Arab Spring, and an informal comparison of the New York Times’ and Wall Street Journal’s coverage of “energy.” Overall, we find that the Lasso with $L^{2}$ normalization can be used effectively to summarize large corpora, regardless of document size.

Article information

Ann. Appl. Stat., Volume 8, Number 1 (2014), 499-529.

First available in Project Euclid: 8 April 2014


Keywords: Text summarization; high-dimensional analysis; sparse modeling; Lasso; L1 regularized logistic regression; co-occurrence; tf-idf; L2 normalization


Jia, Jinzhu; Miratrix, Luke; Yu, Bin; Gawalt, Brian; El Ghaoui, Laurent; Barnesmoore, Luke; Clavier, Sophie. Concise comparative summaries (CCS) of large text corpora with a human experiment. Ann. Appl. Stat. 8 (2014), no. 1, 499--529. doi:10.1214/13-AOAS698.


