Session Information
09 SES 14 B, Psychometric Approaches to Fair and Valid Achievement Assessment
Paper Session
Contribution
This study explores the linking of two forms of a nationwide exam that serves both as a 12th-grade certification and as a criterion for competitive higher education admissions. The mixed-format exam includes multiple-choice items scored dichotomously and open-ended items scored polytomously; some items are compulsory and others are optional. In light of the educational measurement literature, the lack of evidence that standard linking procedures are in place is at odds with the use made of the examination results, which have a high impact on the lives of young people applying to higher education programmes (AERA-American Educational Research Association et al., 2014; Hambleton, 2004; Kolen & Brennan, 2014). Since the national examination result determines who is admitted to public higher education and to which programme and institution, the implementation of standard procedures for educational measurement is a relevant policy topic. The primary goal of this study is to contribute to obtaining linked scores when two exam forms with mixed item types are administered, thereby enhancing both fairness and accuracy.

Item Response Theory (IRT) is the most appropriate theoretical framework, as it aligns directly with the objectives of scale equating/linking, mixed-format test analysis, measurement invariance, and fairness in high-stakes exams. Five key aspects highlight its suitability: (1) modelling latent traits: IRT is specifically designed to model the relationship between latent traits (e.g., students' knowledge or skills) and item responses; (2) calibration of mixed-format tests: IRT estimation procedures enable the simultaneous calibration of two exam forms with mixed item types, such as dichotomously and polytomously scored items; (3) ensuring measurement invariance: IRT offers statistical tools to test for differential item functioning (DIF) and to verify measurement invariance across groups or exam forms, a critical requirement for fairness; (4) anchor-item identification: IRT is particularly effective for identifying anchor items, which are essential for linking and equating different exam forms accurately; (5) reducing measurement error: by estimating item parameters and placing them on a common scale, IRT reduces measurement error and enhances the precision of scoring. This combination of characteristics makes IRT an indispensable framework for achieving reliable and equitable outcomes in high-stakes educational assessments.
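For reference, the two item response models applied in the Method section below have the following standard forms; the notation is the common textbook one, not notation taken from the exam documentation. For a dichotomous item j with discrimination a_j and difficulty b_j, the two-parameter logistic (2PL) model gives

P_j(\theta) = \frac{\exp\{a_j(\theta - b_j)\}}{1 + \exp\{a_j(\theta - b_j)\}},

and for a polytomous item j with ordered categories k = 0, 1, \dots, m_j, Samejima's graded response model defines the category probabilities as differences of cumulative boundary curves,

P_{jk}(\theta) = P^{*}_{jk}(\theta) - P^{*}_{j,k+1}(\theta), \qquad P^{*}_{jk}(\theta) = \frac{\exp\{a_j(\theta - b_{jk})\}}{1 + \exp\{a_j(\theta - b_{jk})\}},

with P^{*}_{j0}(\theta) = 1 and P^{*}_{j,m_j+1}(\theta) = 0.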
Method
In Portugal, the Educational Assessment Institute [Instituto de Avaliação Educativa (IAVE)] oversees nationwide examinations (Decreto-Lei No 102/2013, Criação do IAVE, I.P., 2013). Admission to public higher education institutions is based on the final national secondary school exams [ENES, Exame Nacional do Ensino Secundário], as regulated by the National Commission for Access to Higher Education (CNAES). The ENES exams eligible for use in each stage of the competition are determined by CNAES decisions (Portaria No 183-B/2022, Regulamento do Concurso Nacional de Acesso e Ingresso no Ensino Superior Público para a Matrícula e Inscrição no Ano Letivo de 2022-2023 [Regulations for the National Competition for Access and Admission to Public Higher Education], 2022). Classical Test Theory (CTT) has been the primary framework used by IAVE for exam scoring. To our knowledge, all validation studies of the ENES scale conducted thus far have relied on a CTT approach to evaluate reliability, validity, and item difficulty. However, research suggests that incorporating Item Response Theory (IRT), either alone or in combination with CTT, could enhance the scale's edumetric properties.

The data used in this study come from the 2022 administration of the ENES in Mathematics A (code 635), specifically the 1st stage (Form 1) and the 2nd stage (Form 2). These data were provided under Protocols DGEEC 5/2020 and 7/2023, in compliance with the General Data Protection Regulation (GDPR). Most of the examinees who took Form 2 (2nd stage) had also taken Form 1 (1st stage); data from these examinees (n = 6,097) are used for linking.

The IRT model-based approach addresses the critical need for comparability between different test forms, with different item types, administered at different times. A mixed IRT model, combining the two-parameter logistic model (Birnbaum, 1968; Lord, 1980) for dichotomous items and the graded response model (Samejima, 1997) for polytomous items, is applied. This methodological framework enhances measurement invariance, reduces measurement error, and ultimately increases reliability and the fairness of scoring. Key components of the approach include testing for the absence of DIF among the compulsory items of the two forms, identification of anchor items, estimation of item parameters and latent traits, evaluation of model fit, and comparison with alternative methodological approaches. We used the open-source R package mirt (Chalmers, 2012, 2019) for data modelling.
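As an illustration of how a mixed-format calibration of this kind can be set up with mirt, the sketch below fits a 2PL to the dichotomously scored items and the graded response model to the polytomously scored ones; the object name resp and the 10 + 5 item split are assumptions made for the example, not the actual structure of the exam forms or the authors' scripts.

library(mirt)

## 'resp' is assumed to be a data frame of responses to one exam form:
## 0/1-scored multiple-choice items followed by polytomous open-ended items.
item_type <- c(rep("2PL", 10), rep("graded", 5))   # hypothetical item mix

## Simultaneous calibration of the mixed-format form under a unidimensional model.
fit <- mirt(resp, model = 1, itemtype = item_type)

## Item parameters in the usual IRT parameterisation, plus a
## limited-information overall fit statistic.
coef(fit, IRTpars = TRUE, simplify = TRUE)
M2(fit)

## Latent trait estimates, placing all examinees on the common theta scale.
theta <- fscores(fit, method = "EAP")

Once anchor items common to the two forms have been identified and their calibrated parameters held fixed, the second form can be calibrated on the same scale, which is the sense in which the linking of Form 1 and Form 2 is achieved.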
Expected Outcomes
Based on a sample of approximately 6,000 students who took both exam forms and scored at least 95 (on a 0-200 scale) on the first form, preliminary results from the DIF hypothesis tests indicate that, at a 5% significance level, the null hypothesis of no DIF is not rejected for more than 30% of the items. This suggests that a subset of these items can serve as anchor items. In addition, when the stakes are high, examinees are typically more motivated to put in greater effort. Given that our results might be influenced by variation in the examinees' purposes (for example, applying to highly competitive programmes, applying to other programmes, or being ineligible for higher education), the reference and focal groups for DIF testing were defined as general performers and high achievers, using a threshold of 160 on the 95–200 range of observed scores. The results contribute to improving the reliability, fairness, and transparency of educational assessments, particularly in high-stakes contexts such as admission to highly competitive higher education programmes. These advancements are essential for fostering trust and equity in the selection process.
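As a rough illustration of how this reference/focal comparison could be specified in mirt, the sketch below codes the grouping from the Form 1 score and runs likelihood-ratio DIF tests; resp, score_form1, and the item layout are assumed objects for the example, not the study's actual data structures.

library(mirt)

## Illustrative objects: 'resp' holds the compulsory-item responses of the
## examinees who took both forms, 'score_form1' their Form 1 score, and the
## 10 dichotomous + 5 polytomous item layout is purely hypothetical.
item_type <- c(rep("2PL", 10), rep("graded", 5))
group     <- ifelse(score_form1 >= 160, "high", "general")   # 160-point threshold

## Multiple-group model: all item parameters constrained equal across groups,
## focal-group mean and variance freely estimated.
mg <- multipleGroup(resp, model = 1, group = group, itemtype = item_type,
                    invariance = c("free_means", "free_var", colnames(resp)))

## Likelihood-ratio DIF tests on the dichotomous items: the equality
## constraints on slope (a1) and intercept (d) are dropped one item at a time.
## (Polytomous items would be tested analogously via their intercepts d1, d2, ...)
## Items for which the no-DIF null is retained at the 5% level are candidate anchors.
dif_res <- DIF(mg, which.par = c("a1", "d"), scheme = "drop", items2test = 1:10)
dif_res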
References
AERA-American Educational Research Association, APA-American Psychological Association, & NCME-National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–480). Addison-Wesley.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6). https://doi.org/10.18637/jss.v048.i06
Chalmers, R. P. (2019). Package 'mirt.'
Decreto-Lei no 102/2013, criação do IAVE, I.P., 4400 (2013).
Hambleton, R. K. (2004). Theory, methods, and practices in testing for the 21st century. Psicothema, 16, 696–701.
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices. Springer.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Erlbaum.
Portaria no 183-B/2022. Regulamento do concurso nacional de acesso e ingresso no ensino superior público para a matrícula e inscrição no ano letivo de 2022-2023 [Regulations for the national competition for access and admission to public higher education], 139 Diário da República, 1.a série 16 (2022).
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory. Springer.