Session Information
09 ONLINE 29 A, Linking and Equating Large-scale Assessment Scales
Paper Session
Contribution
It is a well-known concern among researchers in the social sciences that the cross-sectional survey designs of international large-scale assessments (ILSAs) make it difficult to draw causal inferences from the data. However, several researchers have suggested ways of drawing such inferences from ILSA data (e.g., Gustafsson, 2008; Robinson, 2013). These country-level longitudinal methods have encouraged researchers to revisit data from the early ILSAs (e.g., Chmielewski, 2019; Hanushek & Woessmann, 2012).
Gustafsson (2008) identified two phases of the International Association for the Evaluation of Educational Achievement (IEA): before and after 1990, when a new organization was set up. During the first phase, the IEA conducted separate ILSAs in mathematics and science on four occasions: data were collected on mathematics in 1964 and 1980-82 and on science in 1970-71 and 1983-84. In the second phase, the Third International Mathematics and Science Study in 1995 was the first IEA study to test mathematics and science together. The assessment has been repeated every four years, most recently in 2019. Since 1999, the study has been named the Trends in International Mathematics and Science Study (TIMSS).
The studies from the first phase have never officially been linked to the TIMSS reporting scale. The purpose of the present study is therefore to contribute such links and long-term scales for mathematics and science in grade eight. Previous research has shown that it is possible to link the cognitive outcomes from the two phases of IEA ILSAs in reading and mathematics (Afrassa, 2005; AUTHORS, 2021; Strietholt & Rosén, 2016). However, the mathematics studies remained limited in their comparability with the TIMSS reporting scale and in the range of educational systems covered.
Kolen and Brennan (2014) argue that the usefulness of linking depends on the degree of similarity between assessments. They propose four criteria for establishing similarity: inferences, populations, constructs, and measurement characteristics. First, we need to evaluate whether the two tests to be linked share common measurement goals, so that similar types of inferences can be drawn. Second, the similarity of the target populations should be considered. Third, for the test scores to be functionally related, the tests need to measure the same constructs. Finally, the measurement conditions, such as test length, test format, and administration, need to be evaluated. If these criteria are sufficiently fulfilled, the next step is to link the studies.
When statistical adjustments are made to scores on tests that differ in content and/or difficulty, Kolen and Brennan (2014) refer to the relationship between scores as a linking, following the terminology of Holland and Dorans (2006), Linn (1993), and Mislevy (1992). We also use the term linking in the sense of Mazzeo and von Davier (2013), who define scale linking as the process of establishing a scale for the results produced by a sequence of assessments that maintains a stable, comparable meaning over time.
Against this background, this study first investigates the utility of linking the mathematics and science studies administered before 1990 with the scale in use after 1990, that is, the current TIMSS scales, by evaluating the degrees of similarity and the behaviour of the common items across assessments and in relation to the whole test. By common items, we mean items that are repeated in succeeding assessments. Second, we place the assessments from the first phase of the IEA on the TIMSS reporting scale, comparing two linking approaches in terms of the amount of data used and the scores produced. We thereby extend previous research by including data from more educational systems and by adding the subject domain of science from the first phase of IEA ILSAs.
Method
In this study, we use student achievement data in mathematics and science for the populations representing 13-year-old students (the First and Second International Mathematics Studies, FIMS and SIMS), 14-year-old students (the First and Second International Science Studies, FISS and SISS), and eighth-grade students (TIMSS 1995). We used data from all participating educational systems. We identified 37 common items bridging FIMS and SIMS and 18 items overlapping from SIMS to TIMSS 1995. Concerning the science studies, 19 identical items were administered in both FISS and SISS, and 13 items were repeated from SISS to TIMSS 1995. Before performing the linking procedures, we evaluated these bridges by testing the correlations of the sets of bridge items with the whole tests and in terms of differential item functioning (DIF). We used Angoff's delta plot method (Angoff & Ford, 1973) to detect item parameter drift between cycles.

Building on previous research (AUTHORS, 2021), two linking approaches were applied to construct the mathematics scales. In the four-country-all-studies approach, we used previously estimated item parameters, which were calibrated using the pooled data of four countries that participated in every administration from FIMS to TIMSS 2015. First, we estimated the student abilities separately for FIMS and SIMS, fixing the item parameters to these previously estimated values. Then we matched the distributions of the five plausible values (PVs) estimated for FIMS and SIMS with the reported TIMSS 1995 PVs. This was done by calculating transformation constants, following the TIMSS scale linking procedure (see e.g. Martin et al., 2016). The other, first-second-study approach involved the concurrent calibration of item parameters using FIMS and SIMS data, fixing the bridge items' parameters to the values reported for TIMSS 1995. These item parameters were reported after a rescaling procedure in the 1999 assessment cycle (Martin et al., 2000). Thus, unlike in the first linking approach, the IRT models were the same as those used in the TIMSS procedures. Then we matched the ability distribution of SIMS with the reported TIMSS 1995 scale and matched the ability distributions between FIMS and SIMS.

For the science scale, we only applied the second linking approach. Since the primary goal was to achieve scores on the TIMSS scale, the emphasis was put on the link to 1995. In the first-second-study approach, it was possible to use data from all countries for item calibration. Consequently, the amount of information, i.e. the number of item responses, was nearly three times as large in this approach as in the four-country-all-studies approach.
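To make the two computational steps above concrete, the following minimal sketch illustrates (a) Angoff's delta plot check for DIF among bridge items and (b) the linear (mean-sigma) calculation of transformation constants used to match one ability distribution to another. This is an illustration with simulated data, not the authors' code; all function names, inputs, and the 1.5 flagging threshold are assumptions made here for the example.

import numpy as np
from scipy.stats import norm

def delta_scores(p):
    # Angoff's delta metric (mean 13, SD 4); higher delta = harder item.
    return 4.0 * norm.ppf(1.0 - np.asarray(p)) + 13.0

def flag_dif_delta_plot(p_cycle1, p_cycle2, threshold=1.5):
    # Flag items whose perpendicular distance from the principal axis of
    # the delta-delta scatter exceeds the threshold (1.5 is a conventional
    # cut-off; the study's exact criterion may differ).
    d1, d2 = delta_scores(p_cycle1), delta_scores(p_cycle2)
    s1, s2 = d1.var(ddof=1), d2.var(ddof=1)
    s12 = np.cov(d1, d2)[0, 1]
    a = (s2 - s1 + np.sqrt((s2 - s1) ** 2 + 4.0 * s12 ** 2)) / (2.0 * s12)
    b = d2.mean() - a * d1.mean()
    distance = np.abs(a * d1 - d2 + b) / np.sqrt(a ** 2 + 1.0)
    return distance > threshold

def linear_linking_constants(theta_source, theta_target):
    # Mean-sigma constants A, B such that A * theta_source + B has the
    # same mean and SD as the target distribution.
    A = np.std(theta_target, ddof=1) / np.std(theta_source, ddof=1)
    B = np.mean(theta_target) - A * np.mean(theta_source)
    return A, B

# Illustration with simulated data (18 bridge items, 5000 students).
rng = np.random.default_rng(1)
p_sims = rng.uniform(0.3, 0.9, 18)                               # cycle 1 proportions correct
p_timss = np.clip(p_sims + rng.normal(0, 0.05, 18), 0.05, 0.95)  # cycle 2 proportions correct
print(flag_dif_delta_plot(p_sims, p_timss))

pv_source = rng.normal(0.1, 1.1, 5000)      # abilities on the source scale
pv_target = rng.normal(500.0, 100.0, 5000)  # TIMSS-like reporting scale
A, B = linear_linking_constants(pv_source, pv_target)
pv_linked = A * pv_source + B               # source abilities on the target scale

In practice, the flagged items would be treated as unique (non-bridge) items before calibration, and the transformation would be applied to each of the five plausible values rather than to a single ability estimate.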
Expected Outcomes
The delta plot method was applied to the sets of bridge items and common populations in the mathematics assessments between (1) 1964-1980, (2) 1964-1995, and (3) 1980-1995, and in the science studies between (4) 1970-1984, (5) 1970-1995, and (6) 1984-1995. Two items in the first bridge, one item in the third, and two items in the fourth were flagged for DIF. In the first-second-study approach, these items were treated as unique items.

Strong individual-level correlations were found between the mathematics scales. This relationship translated well to the country-level aggregation in all cases except Japan, a finding that potentially indicates cultural differences. Six educational systems sampled the same grades, i.e. eight years of schooling, in the respective mathematics studies: England, France, Israel, the Netherlands, Scotland, and the United States. Results showed a large decline from FIMS to SIMS in three educational systems: France, Israel, and England. The country-level changes from SIMS to 1995 showed a less sharp decline in the Netherlands, Israel, and Scotland, no change in France and the United States, and a slight improvement in England.

Five educational systems sampled the same grades, i.e. eight years of schooling, in the respective science studies: Australia, England, Hungary, Italy, and Sweden. Two countries, Australia and Sweden, improved their performance from the first ILSA to 1995. England, after a decline in 1984, returned to its 1970 level of performance. Italy showed the most consistent performance. Hungary showed a considerable increase in 1984 and a large decline in 1995.

The first-second-study scales of the first-phase studies are published here: https://www.gu.se/en/center-for-comparative-analysis-of-educational-achievement-compeat/linking-projects/mathematics-and-science. It is important to emphasize that sampling differences need to be considered when using the scales. One approach to accounting for these differences across time and countries is to treat age and grade level as plausible explanatory variables, as sketched below.
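As a rough illustration of that last point, a country-by-cycle analysis of the linked scores could include the sampled age and grade level as covariates. The sketch below uses a hypothetical long-format data frame; the column names and all values are invented for the example and are not the published data.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per country and assessment cycle,
# with the linked mean score and the sampled age / grade level.
df = pd.DataFrame({
    "country": ["ENG", "ENG", "ENG", "FRA", "FRA", "FRA"],
    "year":    [1964, 1980, 1995, 1964, 1980, 1995],
    "score":   [520, 480, 495, 530, 490, 490],          # made-up linked means
    "age":     [13.2, 13.4, 14.0, 13.3, 13.5, 14.1],    # made-up sampled ages
    "grade":   [8, 8, 8, 8, 8, 8],
})

# Trend model with country fixed effects, adjusting for sampled age.
# Grade is constant in this toy data and is therefore omitted from the
# formula; in a fuller data set it would enter as a further covariate.
model = smf.ols("score ~ year + age + C(country)", data=df).fit()
print(model.params)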
References
Afrassa, T. M. (2005). Monitoring mathematics achievement over time: A secondary analysis of FIMS, SIMS and TIMS: A Rasch analysis. In S. Alagumalai, D. D. Curtis, & N. Hungi (Eds.), Education in the Asia-Pacific Region: Vol. 4. Applied Rasch measurement: A book of exemplars: Papers in honour of John P. Keeves (pp. 61-77). Springer.
Angoff, W., & Ford, S. (1973). Item-race interaction on a test of scholastic aptitude. Journal of Educational Measurement, 10(2), 95-106.
AUTHORS (2021).
Chmielewski, A. K. (2019). The global increase in the socioeconomic achievement gap, 1964 to 2015. American Sociological Review, 84(3), 517-544. https://doi.org/10.1177/0003122419847165
Gustafsson, J.-E. (2008). Effects of international comparative studies on educational quality on the quality of educational research. European Educational Research Journal, 7(1), 1-17. https://doi.org/10.2304/eerj.2008.7.1.1
Hanushek, E. A., & Woessmann, L. (2012). Do better schools lead to more growth? Cognitive skills, economic outcomes, and causation. Journal of Economic Growth, 17(4), 267-321. https://doi.org/10.1007/s10887-012-9081-x
Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 187-220). Praeger Publishers.
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). Springer.
Linn, R. L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6(1), 83-102. https://doi.org/10.1207/s15324818ame0601_5
Martin, M. O., Gregory, K. D., & Stemler, S. E. (Eds.). (2000). TIMSS 1999 technical report. Boston College.
Martin, M. O., Mullis, I. V. S., & Hooper, M. (Eds.). (2016). Methods and procedures in TIMSS 2015. Boston College.
Mazzeo, J., & von Davier, M. (2013). Linking scales in international large-scale assessments. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment. Chapman and Hall/CRC. https://doi.org/10.1201/b16061-13
Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. ETS Policy Information Center.
Robinson, J. P. (2013). Causal inference and comparative analysis with large-scale assessment data. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment (pp. 535-560). Chapman and Hall/CRC. https://doi.org/10.1201/b16061-26
Strietholt, R., & Rosén, M. (2016). Linking large-scale reading assessments: Measuring international trends over 40 years. Measurement: Interdisciplinary Research and Perspectives, 14(1), 1-26. https://doi.org/10.1080/15366367.2015.1112711