Session Information
09 SES 02 B, Investigating the Validity of TIMSS & PIRLS
Paper Session
Contribution
In a challenging time in world history, questions arise about how education systems can adapt to rapidly changing student populations. Cyclic international large-scale assessments (ILSAs), such as TIMSS (Trends in International Mathematics and Science Study), PIRLS (Progress in International Reading Literacy Study), and PISA (Programme for International Student Assessment), seek to provide information on student ability and background. Of particular interest is understanding how policy decisions may be made about increasingly diverse student populations. This study investigates the comparability of background information obtained from students, teachers, and principals in ILSA studies.
Questionnaire results reported in ILSAs are often based on latent (not directly observable) constructs, or scales, such as students’ academic self-concept, teachers’ beliefs, or teaching methods. Scales in TIMSS and PIRLS typically consist of several items from the background questionnaires, which are aggregated into a single score per scale (e.g. Self-Efficacy); these scores are then averaged at the country level for reporting purposes. Countries are then ranked in tables based on these averages in the studies’ reports. The current statistical procedure used to create scores for the latent constructs rests on the assumption that scales are comparable across different groups, meaning that all parameters are held equal across diverse populations.
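As a rough illustration of this reporting pipeline, the sketch below aggregates hypothetical item responses into a scale score per respondent and ranks countries by their average. The item data, country codes, and the simple sum score are stand-ins for illustration only; the studies themselves use IRT-based scaling, as described in the Method section.

```python
import pandas as pd

# Hypothetical item responses (e.g. 1-4 Likert categories) for a
# Self-Efficacy-style scale; none of these values come from TIMSS or PIRLS.
df = pd.DataFrame({
    "country": ["AUT", "AUT", "SWE", "SWE", "SGP", "SGP"],
    "item1":   [3, 4, 2, 3, 4, 4],
    "item2":   [2, 4, 3, 3, 4, 3],
    "item3":   [3, 3, 2, 2, 4, 4],
})

# Aggregate the items into one scale score per respondent
# (a simple sum score stands in for the IRT-based factor score).
df["scale_score"] = df[["item1", "item2", "item3"]].sum(axis=1)

# Average at the country level and rank from highest to lowest,
# mirroring the ranked scale tables in the international reports.
ranking = (
    df.groupby("country")["scale_score"]
      .mean()
      .sort_values(ascending=False)
)
print(ranking)
```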
There is an increasing emphasis on the cross-cultural comparability of latent constructs, yet there are no standard procedures, and invariance testing is not common, despite evidence that psychological constructs as well as response behavior on items are subject to cultural differences (Hamamura, Heine, and Paulhus, 2008; Van Herk, Poortinga, and Verhallen, 2004); this is especially true for the non-cognitive scales found in background questionnaires. Measurement invariance testing examines whether items contributing to a latent construct, or scale, function differently for different populations (Meredith, 1993). Such an investigation seeks to determine whether differences in response behavior affect a scale’s factor score estimation to a substantial degree, thereby reducing its comparability across certain groups (e.g. countries). If a minimum level of measurement invariance is not achieved, the latent constructs assessed in a study should not be directly compared between populations.
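In standard multi-group CFA notation (ours, not taken from the abstract), the commonly tested levels of invariance can be summarised as follows, where x_ig is the item response vector of person i in group g, Λ_g the loadings, ν_g the intercepts, and ξ_ig the latent factor:

```latex
% Hierarchy of measurement invariance levels in multi-group CFA
% (standard formulation; parameter names are illustrative).
\begin{aligned}
\text{Configural:} \quad & x_{ig} = \nu_g + \Lambda_g \xi_{ig} + \varepsilon_{ig}
  && \text{same factor structure in every group } g \\
\text{Metric:}     \quad & \Lambda_g = \Lambda
  && \text{loadings equal across groups} \\
\text{Scalar:}     \quad & \Lambda_g = \Lambda, \quad \nu_g = \nu
  && \text{loadings and intercepts equal; latent means comparable}
\end{aligned}
```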
This study seeks to provide insights into the cross-cultural comparability of latent scales used in the international studies TIMSS and PIRLS across diverse populations. The scales used in this study are a selection from the background questionnaires of TIMSS 2015 and PIRLS 2016 that are particularly sensitive to a diverse range of experiences, such as teachers’ work satisfaction, students’ perception of teaching methods, and students’ self-perception of their abilities.
National education policies are sometimes affected by the comparisons made in international large-scale studies such as TIMSS, PIRLS, PISA, and others (e.g. Grek, 2009). Without a clear understanding of how such data function across the diverse international samples and their target populations, national education policy makers may receive an incomplete or misleading picture of their country’s progress on international measures when compared with other participating countries. Furthermore, these policy makers may also be unaware of which comparisons are possible with such data. Measurement invariance testing can not only provide more accurate comparisons, but also warn when certain comparisons are not advisable, namely when a latent construct (or scale) is not directly comparable across populations. This study therefore seeks not only to increase the methodological rigor of scale score creation in international studies, but also to provide more accurate information for policy makers using international large-scale study data.
Method
Both the TIMSS 2015 and PIRLS 2016 studies use similar methods to create factor scores for scales (latent constructs) derived from the contextual questionnaires. Using a Partial Credit Model (PCM) within the Item Response Theory (IRT) framework, the studies estimate individual participants’ factor scores for each scale and then present the scores in ranked tables, listing the participating countries’ average scale scores from highest to lowest. In this study we use several existing TIMSS and PIRLS scales that are assumed to be particularly sensitive to cultural differences.

The analysis consists of three steps. First, measurement invariance testing is performed using a Confirmatory Factor Analysis (CFA) approach. If a model achieves the highest level of measurement invariance, known as scalar invariance, it can be used for comparisons across populations; if only lower levels of measurement invariance are reached, then only some parameters from the model may be directly compared, and appropriate cautions for interpretation are given to the data user. Based on previous research on the TALIS latent scales (Rutkowski and Svetina, 2014, 2017; Cigler, Stancel-Piątak, and Chen, 2019), it is likely that not all of the scales will prove to be comparable across countries and/or their sub-populations. Second, if a scale’s model does not meet the assumptions of full invariance, partial measurement invariance testing (Byrne, Shavelson, and Muthén, 1989) is applied using the “partial-by-item-and-country” approach proposed by Cigler, Stancel-Piątak and Chen (2019) to further examine the scale’s comparability. Third, models constructed within the partial measurement invariance framework are expected to produce different scores than models constructed under the assumption of full invariance; these scores are used to compare the rankings obtained under partial measurement invariance to those reported in TIMSS 2015 and PIRLS 2016, in order to understand how the partial invariance approach may influence the interpretation of scale scores.
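For reference, the Partial Credit Model mentioned above is shown below in its standard form; the notation is ours rather than the studies’, with θ_n denoting the latent trait of respondent n, δ_ik the k-th step parameter of item i, and m_i the highest response category.

```latex
% Standard Partial Credit Model: probability that respondent n selects
% category x on item i (the empty sum for x = 0 is defined as zero).
P(X_{ni} = x \mid \theta_n) =
  \frac{\exp\!\left(\sum_{k=1}^{x} (\theta_n - \delta_{ik})\right)}
       {\sum_{j=0}^{m_i} \exp\!\left(\sum_{k=1}^{j} (\theta_n - \delta_{ik})\right)},
  \qquad x = 0, 1, \dots, m_i .
```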
Expected Outcomes
Based on prior research, we assume that some of the latent constructs will not be directly comparable across countries and/or their sub-groups under the full measurement invariance approach. We expect some differences across cultures; however, we assume that these differences can be accounted for using the partial measurement invariance approach. Although we do not expect major differences in the country rankings when scores from the full and partial measurement invariance approaches are compared, we still recommend the partial measurement invariance approach, as it produces less biased and more reliable point estimates. Furthermore, the interpretation of the results is more realistic because cross-group differences in response behavior are accounted for in the model to a greater extent. The steps outlined in the analysis contribute to enhancing the appropriateness of the interpretation of the scale scores, particularly with regard to comparisons of group averages within each scale. Beyond the methodological considerations, this should aid researchers in making methodologically sound recommendations for policy makers, specifically when considering viable changes to national education systems based on the results from international large-scale studies.
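One simple way to quantify how much the country rankings shift between the two approaches is a rank correlation, as in the sketch below. The country codes and scale averages are hypothetical placeholders and do not come from TIMSS 2015 or PIRLS 2016.

```python
from scipy.stats import spearmanr

# Hypothetical country averages of one scale under the two scoring approaches;
# none of these values are taken from TIMSS 2015 or PIRLS 2016.
full_invariance    = {"SGP": 10.8, "AUT": 10.1, "SWE": 9.9, "CHL": 9.4}
partial_invariance = {"SGP": 10.7, "SWE": 10.0, "AUT": 9.8, "CHL": 9.5}

countries = sorted(full_invariance)
rho, p_value = spearmanr(
    [full_invariance[c] for c in countries],
    [partial_invariance[c] for c in countries],
)

# A rho close to 1 indicates the ranking is largely preserved even though
# individual point estimates differ between the two approaches.
print(f"Rank agreement (Spearman rho): {rho:.2f}, p = {p_value:.3f}")
```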
References
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105(3), 456.
Cigler, H., Stancel-Piątak, A., & Chen, M. (2019). The “partial-by-item-and-country” approach to measurement invariance in TALIS 2018. Manuscript in progress.
Grek, S. (2009). Governing by numbers: The PISA ‘effect’ in Europe. Journal of Education Policy, 24(1), 23-37.
Hamamura, T., Heine, S. J., & Paulhus, D. L. (2008). Cultural differences in response styles: The role of dialectical thinking. Personality and Individual Differences, 44(4), 932-942.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525-543.
Rutkowski, L., & Svetina, D. (2014). Assessing the hypothesis of measurement invariance in the context of large-scale international surveys. Educational and Psychological Measurement, 74(1), 31-57.
Rutkowski, L., & Svetina, D. (2017). Measurement invariance in international surveys: Categorical indicators and fit measure performance. Applied Measurement in Education, 30(1), 39-51.