The Annals of Applied Statistics

Exploiting TIMSS and PIRLS combined data: Multivariate multilevel modelling of student achievement

Leonardo Grilli, Fulvia Pennoni, Carla Rampichini, and Isabella Romeo

Full-text: Open access


We illustrate how to perform a multivariate multilevel analysis in the complex setting of large-scale assessment surveys, dealing with plausible values and accounting for the survey design. In particular, we consider the Italian sample of the TIMSS&PIRLS 2011 Combined International Database on fourth-grade students. The multivariate approach jointly considers educational achievement in Reading, Mathematics and Science, thus allowing us to test for differential associations of the covariates with the three outcomes, and to estimate the residual correlations among pairs of outcomes within and between classes. Multilevel modelling allows us to disentangle student and contextual factors affecting achievement. We also account for territorial differences in wealth by means of an index from an external data source. The model residuals identify classes with high or low performance. As educational achievement is measured by plausible values, the estimates are obtained through multiple imputation formulas.
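
As a rough illustration of the multiple imputation formulas mentioned in the abstract, the sketch below pools results obtained separately on each set of plausible values using Rubin's combining rules: the pooled estimate is the mean of the per-set estimates, and its variance adds the average within-set variance to the between-set variance inflated by (1 + 1/M). The function name and the numbers in the usage lines are hypothetical and are not taken from the article; this is a minimal sketch of the general technique, not the authors' code.

```python
from statistics import mean, variance


def pool_plausible_values(estimates, variances):
    """Pool one coefficient across M plausible-value fits (Rubin's rules).

    estimates: the coefficient estimated once per plausible-value set
    variances: the squared standard error from each of those fits
    Returns (pooled_estimate, total_variance, standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two fits and one variance per fit")

    pooled = mean(estimates)        # average of the M estimates
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # variance across fits (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total, total ** 0.5


# Hypothetical values for one coefficient fitted on five plausible-value sets.
est = [12.1, 11.4, 12.9, 11.8, 12.3]
var = [1.21, 1.30, 1.17, 1.25, 1.19]
pooled, total_var, se = pool_plausible_values(est, var)
print(f"pooled estimate = {pooled:.2f}, standard error = {se:.2f}")
```

The between-set term is what distinguishes this from simply averaging the plausible values before fitting the model, which would understate the uncertainty due to measuring achievement with error.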

Article information

Ann. Appl. Stat., Volume 10, Number 4 (2016), 2405-2426.

Received: March 2016
Revised: July 2016
First available in Project Euclid: 5 January 2017

Keywords: Hierarchical linear model; large-scale assessment data; multiple imputation; plausible values; school effectiveness; secondary data analysis


Grilli, Leonardo; Pennoni, Fulvia; Rampichini, Carla; Romeo, Isabella. Exploiting TIMSS and PIRLS combined data: Multivariate multilevel modelling of student achievement. Ann. Appl. Stat. 10 (2016), no. 4, 2405–2426. doi:10.1214/16-AOAS988.


