Session Information
09 ONLINE 29 A, Linking and Equating Large-scale Assessment Scales
Paper Session
Contribution
In large-scale assessments that involve multiple test forms, individual student test scores are expected to be comparable regardless of which form the students take (DePascale & Gong, 2020). Test scores are considered comparable if the same interpretations can be made, with the same level of confidence, from variations of the same test (Winter, 2010). For high-stakes testing programmes, a more stringent claim of “interchangeability” has been raised (Dorans & Walker, 2007; Holland, 2007). According to the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014), appropriate statistical methodology should be applied to ensure that test scores from alternate test forms can be used interchangeably, which typically involves score equating (DePascale & Gong, 2020; Holland, 2007).
The traditional way to equate scores from alternate test forms is to include anchor items. These are administered to all test takers and are used to adjust for possible differences in ability between the groups taking different forms (Kolen & Brennan, 2004; von Davier, 2011, 2013; González & Wiberg, 2017). In the absence of anchor items, one approach is to replace anchor test scores with covariates, such as grades or other test scores (Wiberg & Bränberg, 2015; Longford, 2015; Wallin & Wiberg, 2019). The key assumption is that the covariates explain the differences in ability between the groups. This approach was designed for situations in which the groups taking different test forms cannot be considered equivalent, but it may improve test score comparability for equivalent groups as well. The approach is undermined, however, if some of the covariates are themselves measured using multiple test forms.
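To make the idea concrete, the sketch below illustrates covariate equating in a deliberately simplified form: both groups are reweighted to a common (synthetic-population) covariate distribution before an equipercentile mapping from one form to the other is built. It is a minimal Python illustration with hypothetical names, not the kernel-equating methodology used in the cited work (for which R-based implementations are described by González & Wiberg, 2017).

```python
# Minimal illustration (not the cited implementation) of covariate equating for
# nonequivalent groups: both groups are reweighted so their covariate
# distributions match a pooled synthetic population, and an equipercentile
# mapping from form Y to form X is built from the weighted score distributions.
import numpy as np
import pandas as pd


def weighted_cdf(scores, weights, grid):
    """Weighted cumulative distribution of `scores`, evaluated at each point of `grid`."""
    s = np.asarray(scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.array([w[s <= t].sum() for t in grid]) / w.sum()


def covariate_equate(df_x, df_y, score_col, cov_col, grid):
    """Map form-Y scores onto the form-X scale after reweighting both groups to the
    pooled (synthetic-population) distribution of a categorical covariate."""
    target = pd.concat([df_x[cov_col], df_y[cov_col]]).value_counts(normalize=True)
    w_x = df_x[cov_col].map(target / df_x[cov_col].value_counts(normalize=True))
    w_y = df_y[cov_col].map(target / df_y[cov_col].value_counts(normalize=True))
    F_x = weighted_cdf(df_x[score_col], w_x, grid)   # weighted CDF of form-X scores
    F_y = weighted_cdf(df_y[score_col], w_y, grid)   # weighted CDF of form-Y scores
    # Equipercentile transform: each form-Y score on `grid` is sent to the form-X
    # score with the same weighted percentile rank.
    return np.interp(F_y, F_x, grid)
```

In practice, several covariates (e.g., school type, gender, and another test score) enter jointly, and kernel smoothing replaces the raw weighted distributions used in this sketch.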
This study is motivated by the Czech Republic’s national matura exam (the upper secondary school leaving exam). The exam comprises four school subjects, of which only Czech Language is mandatory for all students. Students then choose between Mathematics and Foreign Language (mostly English), plus two additional subjects from a wider selection. The majority of students take the exam in the spring term; those who were not admitted to it (e.g., due to low school performance), missed it, or failed it have a second opportunity in the autumn term. The groups of students taking the exam in spring and in autumn thus differ in their characteristics, and also in their matura test scores. Given the high-stakes nature of the assessment, evidence of test score comparability between the two examination terms is essential for its fairness (AERA, APA, & NCME, 2014). In addition, comparability of test scores from the same subject across years is desirable to allow for monitoring trends in student achievement. In our data, comparability across years is an issue even within the spring terms, because the composition of students taking the tests in non-mandatory subjects (including English or Mathematics) varies from year to year, depending on changing preferences and on organizational arrangements that may influence which subject students select.
In this work we use repeated covariate equating to provide evidence of comparability in a situation where no anchor items are available, the groups are not equivalent, and both the construct of interest and the covariates are measured using multiple test forms. We argue that the method proposed in this submission yields more accurate equated scores than simple covariate equating.
Method
We use data from the English Language matura tests administered in the spring and autumn terms of 2016-2019. We equate the scores by means of kernel equating for nonequivalent groups (Wiberg & Bränberg, 2015; Wallin & Wiberg, 2019) with type of school, gender, and score on the Czech Language matura test as covariates. We demonstrate how improved information on student characteristics (i.e., adding more covariates) helps provide evidence that multiple forms of the test are of comparable difficulty. Because the Czech Language test score is itself measured using different test forms, we equate this score first and then repeat the analyses with the equated Czech Language test score. We use an equivalent-groups design (Kolen & Brennan, 2004; von Davier, 2011, 2013; González & Wiberg, 2017) to equate the Czech Language test scores of all spring terms across the years. The assumption of equivalent groups is plausible here because the Czech Language matura exam is mandatory for all students. We then use covariate equating to equate the spring and autumn terms, with school type and an indicator of whether this is the first or a repeated attempt to pass the exam as covariates. We compare the resulting equated English Language scores obtained (a) without equating the Czech Language test scores and (b) with equating. We show how closely the distributions of equated scores align when different sets of covariates are used in covariate equating, and when the covariate measured using different test forms is itself equated before entering the covariate equating algorithm.
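The sketch below outlines this two-stage procedure on synthetic placeholder data, reusing the simplified helpers from the previous sketch in place of kernel equating; all variable and column names are hypothetical. For brevity, the Czech covariate is equated here with a single equivalent-groups link, whereas the study combines an equivalent-groups design across spring terms with covariate equating between the spring and autumn terms.

```python
# Two-stage sketch of the procedure described above (synthetic data, hypothetical
# names; `weighted_cdf` and `covariate_equate` come from the previous sketch and
# stand in for kernel equating).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
grid = np.arange(0, 101)                                   # assumed 0-100 score scale

# Synthetic stand-ins for the spring (reference) and autumn samples.
spring = pd.DataFrame({"english": rng.normal(60, 12, 2000).clip(0, 100).round(),
                       "czech":   rng.normal(62, 10, 2000).clip(0, 100).round()})
autumn = pd.DataFrame({"english": rng.normal(48, 14, 500).clip(0, 100).round(),
                       "czech":   rng.normal(50, 11, 500).clip(0, 100).round()})

# (a) Equate the Czech Language covariate first: a plain equipercentile link with
# unit weights maps autumn-form Czech scores onto the spring-form scale.
F_cz_spring = weighted_cdf(spring["czech"], np.ones(len(spring)), grid)
F_cz_autumn = weighted_cdf(autumn["czech"], np.ones(len(autumn)), grid)
czech_table = np.interp(F_cz_autumn, F_cz_spring, grid)    # autumn -> spring scale
autumn["czech_eq"] = np.interp(autumn["czech"], grid, czech_table)

# (b) Covariate equating of English (autumn onto the spring scale), once with the
# raw Czech score and once with the equated one; the covariate is banded first.
spring["band"] = (spring["czech"] // 20).astype(int)
for col in ("czech", "czech_eq"):
    autumn["band"] = (autumn[col] // 20).astype(int)
    table = covariate_equate(spring, autumn, "english", "band", grid)
    print(col, np.round(table[::20], 1))                   # compare conversion tables
```

Comparing the two conversion tables, or the resulting distributions of equated English scores, corresponds to the comparison described above; in the study this is done with kernel equating and the full covariate set.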
Expected Outcomes
We expect that even if multiple test forms are designed to be of similar difficulty, the distributions of total test scores will differ if the groups taking the forms differ substantially in their characteristics. We assume that, when the characteristics of the tested groups are taken into account, the comparability of alternate test forms can be demonstrated. We also assume that adding more informative covariates strengthens this evidence. Finally, we assume that if some of the covariates are measured using multiple test forms, equating them before their inclusion in the covariate equating process will provide stronger evidence of test form comparability. We discuss the implications of our conclusions for high-stakes testing programmes, such as national exams.
References
AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: AERA.
DePascale, C., & Gong, B. (2020). Comparability of individual students’ scores on the “same test”. In A. I. Berman, E. H. Haertel, & J. W. Pellegrino (Eds.), Comparability of large-scale educational assessments: Issues and recommendations (pp. 25–48). Washington, DC: National Academy of Education.
Dorans, N. J., & Walker, M. E. (2007). Sizing up linkages. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 180–198). New York: Springer.
González, J., & Wiberg, M. (2017). Applying test equating methods using R. Cham, Switzerland: Springer.
Holland, P. W. (2007). A framework and history for score linking. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 5–30). New York: Springer.
Kolen, M. J., & Brennan, R. L. (2004). Test equating: Methods and practices (2nd ed.). New York: Springer.
Longford, N. T. (2015). Equating without an anchor for nonequivalent groups of examinees. Journal of Educational and Behavioral Statistics, 40, 227–253.
von Davier, A. A. (2011). Statistical models for test equating, scaling, and linking. New York: Springer.
von Davier, A. A. (2013). Observed-score equating: An overview. Psychometrika, 78(4), 605–623.
Wallin, G., & Wiberg, M. (2019). Kernel equating using propensity scores for nonequivalent groups. Journal of Educational and Behavioral Statistics, 44(4), 390–414.
Wiberg, M., & Bränberg, K. (2015). Kernel equating under the non-equivalent groups with covariates design. Applied Psychological Measurement, 39, 349–361.
Winter, P. C. (2010). Comparability and test variations. In P. C. Winter (Ed.), Evaluating the comparability of scores from achievement test variations (pp. 1–11). Washington, DC: Council of Chief State School Officers.