In a challenging time in world history, questions arise about how education systems can adapt to rapidly changing student populations. Cyclic international large-scale assessments (ILSAs), such as TIMSS (Trends in International Mathematics and Science Study), PIRLS (Progress in International Reading Literacy Study), and PISA (Programme for International Student Assessment), seek to provide information on student ability and background. Of particular interest is understanding how policy decisions may be made about increasingly diverse student populations. This study investigates the comparability of background information obtained from students, teachers, and principals in ILSA studies.
Questionnaire results reported in ILSAs are often based on latent (not directly observable) constructs, or scales, such as students' academic self-concept, teachers' beliefs, or teaching methods. Scales in TIMSS and PIRLS typically consist of several items from the background questionnaires, which are aggregated to create a single score for a scale, e.g. Self-Efficacy; these scores are then averaged at the country level for reporting purposes. Countries are then ranked in tables based on these averages in the studies' reports. The current statistical procedure applied to create scores for the latent constructs is based on the assumption that scales are comparable across different groups, meaning that all parameters are fixed across diverse populations.
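The aggregation procedure described above can be illustrated with a minimal sketch. All responses, countries, and the simple item-averaging rule below are hypothetical placeholders; the operational studies use more elaborate IRT-based scaling, but the reporting logic (student score, country mean, country ranking) follows the same pattern.

```python
# Hypothetical illustration of the scale-scoring and reporting procedure:
# item responses are aggregated into one scale score per student, scale
# scores are averaged per country, and countries are ranked by that mean.

student_responses = {
    # country -> list of students, each a list of Likert-type item responses
    "A": [[4, 3, 4], [2, 2, 3]],
    "B": [[3, 3, 3], [4, 4, 4]],
}

def scale_score(items):
    """Aggregate one student's item responses into a single scale score."""
    return sum(items) / len(items)

# Country-level averages of the student scale scores.
country_means = {
    country: sum(scale_score(s) for s in students) / len(students)
    for country, students in student_responses.items()
}

# Countries ranked by mean scale score, highest first.
ranking = sorted(country_means, key=country_means.get, reverse=True)
```

The ranking step is where the comparability assumption becomes consequential: the ordering is only meaningful if the scale measures the same construct in the same metric in every country.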
There is an increasing emphasis on cross-cultural comparability of latent constructs, yet there are no standard procedures, and invariance testing is not common, despite evidence that psychological constructs as well as response behavior on items are subject to cultural differences (Hamamura, Heine, and Paulhus, 2008; van Herk, Poortinga, and Verhallen, 2004); this is especially true for the non-cognitive scales found in background questionnaires. Measurement invariance testing examines whether items contributing to a latent construct, or scale, function differently for different populations (Meredith, 1993). Investigation into measurement invariance seeks to determine whether differences in response behavior affect a scale's factor score estimation to a substantial degree, thereby reducing the level of its comparability across certain groups (e.g. countries). If a minimum level of measurement invariance is not achieved, then the latent constructs assessed in a study should not be directly compared between populations.
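A minimal numerical sketch can show why non-invariance undermines direct comparison. The intercepts, loading, and latent means below are hypothetical: two countries have identical true construct levels, but one item's intercept is shifted in country B (scalar non-invariance), so the observed scale means diverge even though no real difference exists.

```python
# Hypothetical illustration: identical latent construct means, but one
# item intercept is biased in country B. The observed scale-score gap
# then reflects item bias rather than a true group difference.

latent_mean = {"A": 0.0, "B": 0.0}   # true construct level: equal by design
intercepts = {
    "A": [2.0, 2.0, 2.0, 2.0],       # item intercepts in country A
    "B": [2.5, 2.0, 2.0, 2.0],       # item 1 biased upward in country B
}
loading = 1.0                         # equal loadings, so metric invariance holds

def expected_scale_mean(country):
    """Expected observed scale score: mean over items of intercept + loading * latent mean."""
    items = [tau + loading * latent_mean[country] for tau in intercepts[country]]
    return sum(items) / len(items)

# Observed gap between countries, produced entirely by the biased intercept.
gap = expected_scale_mean("B") - expected_scale_mean("A")
```

In practice such intercept and loading differences are tested via multiple-group confirmatory factor analysis, comparing configural, metric, and scalar models; the sketch above only isolates the mechanism the test is designed to detect.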
This study seeks to provide insights into the cross-cultural comparability of latent scales used in the international studies TIMSS and PIRLS across diverse populations. The scales used in this study are a selection from the background questionnaires of the TIMSS 2015 and PIRLS 2016 international studies that are particularly sensitive to a diverse range of experiences, such as teacher work satisfaction, students' perception of teaching methods, and their self-perception of abilities.
National education policies are sometimes affected by the comparisons made in international large-scale studies such as TIMSS, PIRLS, PISA, and others (e.g. Grek, 2009). Without a clear understanding of how such data function across the diverse international samples and their target populations, national education policy makers may receive an incomplete or misleading picture of their country's progress on international measures when compared to other participating countries. Furthermore, these policy makers may also be unaware of which comparisons are possible with such data. Measurement invariance testing can not only support more accurate comparisons, but also warn when certain comparisons are not advisable, in cases where a latent construct (or scale) is not directly comparable across populations. This study seeks not only to increase the methodological rigor of scale score creation in international studies, but also to provide more accurate information for policy makers using international large-scale study data.