Session Information
09 SES 14 B, Psychometric Approaches to Fair and Valid Achievement Assessment
Paper Session
Contribution
This research examines fairness and consistency in the marking of constructed response items, a concern that extends to any assessment requiring subjective scoring. Variations in marker severity have been observed consistently across assessment systems, and these discrepancies are particularly evident in the scoring of constructed response items such as essays. Errors in expert judgment, such as leniency, severity, the halo effect, central tendency, and restriction of range, introduce systematic variation in awarded scores, potentially compromising the validity of score interpretations. For instance, Engelhard (1994) and Zhang (2013) have detailed these marker errors, highlighting their impact on assessment outcomes.
This study, conducted by an English exam board, focuses on the marking of GCSE English Language, a qualification taken predominantly by students aged sixteen that is crucial for gaining employment and accessing further or higher education. The scale of the dataset available to an exam board provides unique insights into marker behaviour, with findings that hold relevance for international assessment contexts. The high-stakes nature of this assessment makes fairness and consistency a key consideration, and regulatory requirements for its delivery and marking must be met.
The primary focus of the work is to evaluate marker bias and variability. We consider markers to demonstrate bias if their marking displays patterns of severity or leniency; variability captures markers' consistency of approach, with low variability reflecting dependable, consistent marking. Bias and variability in marker behaviour can significantly affect student outcomes (Bramley, 2007; Newton, 2010). These concepts frame our research questions: 'Is there evidence of marker biases, such as severity and leniency?' and 'How reliable are the markers?' A simple numerical illustration of these two quantities follows below.
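As a hedged illustration only (not the model used in the study), a marker's bias can be summarised as the mean deviation of their awarded marks from reference marks on the same scripts, and their variability as the spread of those deviations; all identifiers and values in this sketch are hypothetical.

```python
import numpy as np

# Hypothetical marks awarded by one marker and agreed reference marks
# for the same ten scripts (illustrative values only).
awarded = np.array([12, 9, 15, 7, 11, 14, 8, 10, 13, 6])
reference = np.array([13, 11, 16, 9, 11, 15, 10, 11, 14, 8])

deviations = awarded - reference
bias = deviations.mean()              # negative = severity, positive = leniency
variability = deviations.std(ddof=1)  # larger spread = less consistent marking

print(f"bias = {bias:.2f} marks, variability = {variability:.2f} marks")
```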
The GCSE English Language examination comprises five free-response questions, each requiring an essay. A particular challenge in this context is that each student's essay is scored by a single marker, so markers are nested within the five questions without overlap. This incomplete block marking design creates sparse data matrices with substantial missing data, posing challenges for traditional IRT-based analyses. To address the lack of connectivity among markers, the study uses a model-based approach that calibrates marker severity. A toy example of this nested, sparse structure follows below.
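A minimal sketch of the design described above, using hypothetical script, question, and marker identifiers: pivoting long-format ratings into a scripts-by-markers matrix for a single question makes the sparsity visible, since each script is marked by exactly one marker.

```python
import pandas as pd

# Hypothetical long-format ratings: each script on each question is marked
# by exactly one marker, and markers are nested within questions.
ratings = pd.DataFrame({
    "script":   ["S1", "S1", "S2", "S2", "S3", "S3"],
    "question": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "marker":   ["M1", "M3", "M2", "M3", "M1", "M4"],
    "mark":     [14, 9, 11, 12, 16, 8],
})

# Scripts-by-markers view for one question: most cells are missing (NaN),
# which is the sparse, disconnected structure the calibration approach
# has to accommodate.
wide = ratings[ratings["question"] == "Q1"].pivot(
    index="script", columns="marker", values="mark"
)
print(wide)
```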
Beyond examining marker severity, the study also investigates interactions between marker severity and marker role (e.g., principal examiner, team leader, examiner, new examiner) to identify potential biases in essay ratings. Research has shown that experience and training are crucial determinants of marker reliability (Meadows & Billington, 2005; Suto et al., 2008). Understanding how different levels of marker seniority influence bias and consistency can inform improvements to marker training and quality assurance protocols. These analyses provide insight into possible sources of bias, contributing to a more equitable assessment process.
By leveraging advanced IRT-based methodologies, this study provides actionable insights into marker bias and reliability, with implications for improving assessment fairness. The findings support a fairer scoring system, ensuring that each student's grade reflects their true achievement regardless of the assigned marker. They also offer lessons on fairness, transparency, and consistency in scoring for international assessments, and support ongoing international efforts to improve marker training, refine mark schemes, and explore statistical models for monitoring assessment quality.
Method
In educational assessment, ensuring the validity and reliability of rating systems is of paramount importance, particularly in high-stakes examinations. Traditional scoring methods often rely on subjective judgment, introducing potential bias from markers, variability across items, and inconsistencies in student responses. To address these concerns, advanced measurement approaches have emerged as robust tools for analysing marker-mediated assessments. Among these, the Hierarchical Rater Model (HRM) is a powerful framework for analysing marker effects within a nested, multilevel structure (Patz & Junker, 1999; Wilson, 2005). The HRM extends traditional measurement models by accounting for hierarchical dependencies in marker data, such as multiple markers scoring the same examinees across different tasks. It simultaneously estimates item difficulty, student ability, and marker severity while capturing variation both within and across markers, and it is particularly advantageous when ratings are influenced by group-level factors such as marker training, experience, or contextual biases. By modelling these dependencies explicitly, the HRM provides a nuanced understanding of marker behaviour, reducing error and improving the precision of student scores (Junker & Patz, 2001).

In our data, each student's response to a question was marked by only one marker. This lack of overlap introduced complexity into the modelling process and influenced our choice of methodology. The HRM's ability to account for group-level factors and its flexible structure made it well suited to the structure of our data. An illustrative formulation follows below.
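The following is an illustrative two-stage formulation of the HRM, broadly following Patz and Junker (1999) and Junker and Patz (2001); the symbols and the role-level prior are a sketch of the general approach rather than the exact specification fitted in this study. In the first stage an 'ideal' rating for student i on item j is generated by a polytomous IRT model; in the second stage the observed rating from marker r departs from that ideal rating according to the marker's severity bias φ_r and variability ψ_r.

```latex
% Stage 1: ideal rating \xi_{ij} for student i on item j (partial-credit form,
% with the empty sum for k = 0 taken to be zero)
\[
  P(\xi_{ij} = k \mid \theta_i) \propto \exp\Big( \sum_{v=1}^{k} (\theta_i - \beta_{jv}) \Big)
\]

% Stage 2: observed rating X_{ijr} from marker r, given the ideal rating
\[
  P(X_{ijr} = x \mid \xi_{ij} = \xi) \propto
    \exp\left( -\frac{\big(x - (\xi + \phi_r)\big)^2}{2\,\psi_r^2} \right)
\]

% Illustrative role-level structure for marker severity
\[
  \phi_r \sim N\big( \mu_{\mathrm{role}(r)}, \tau^2 \big)
\]
```

In this sketch, φ_r < 0 indicates severity and φ_r > 0 leniency, while a larger ψ_r indicates less consistent marking; allowing φ_r to depend on marker role is one way the seniority effects discussed above could enter the model.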
Expected Outcomes
This study evaluates the evidence for marker bias and the reliability of marking, employing advanced statistical models such as the Hierarchical Rater Model, and also considers the role of marker seniority in ensuring reliability. On the first research question, concerning marker bias, we found that most markers exhibited neutral bias. In Paper 1 there was some evidence of lenient marking, but most markers aligned with the expected marking standard. For Paper 2 there were minimal occurrences of leniency or severity, showing that markers were consistent in their marking. With regard to marker reliability, we found moderate levels of inconsistency. Paper 1 showed greater reliability, with fewer markers demonstrating extreme variability. For both papers, items with more mark categories showed evidence of lower reliability. Team leaders consistently aligned with the ideal marking standard, showing reduced bias and greater consistency, while new markers faced challenges with greater inconsistency and both lenient and severe biases. These findings align with previous research suggesting that experience and training are key determinants of marker reliability (Meadows & Billington, 2005; Suto et al., 2008). This research underscores the importance of enhancing marker training programmes and refining mark schemes to promote uniformity in interpretation and application. In line with previous studies (Johnson & Black, 2012; Benton et al., 2020), the findings also highlight the potential of advanced statistical models to monitor marker performance in real time.
References
Benton, T., Leech, T., & Wheadon, C. (2020). The influence of item design on marking reliability in constructed-response items. Cambridge Assessment Research Report.
Bramley, T. (2007). Paired comparison methods in the assessment of writing: Reliability, validity, and bias. Research Papers in Education, 22(2), 137–156.
Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93–112. https://doi.org/10.1111/j.1745-3984.1994.tb00436.x
Johnson, M., & Black, B. (2012). The role of training in ensuring marking reliability: A review of evidence. Educational Research and Evaluation, 18(3), 245–270.
Junker, B. W., & Patz, R. J. (2001). Hierarchical rater models. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 271–288). Springer.
Meadows, M., & Billington, L. (2005). A review of literature on marking reliability. Ofqual Research Paper.
Newton, P. E. (2010). Educational assessment: Concepts and issues. Bloomsbury Publishing.
Patz, R. J., & Junker, B. W. (1999). Fitting item response models to incomplete, multiple item type educational assessment data using Markov chain Monte Carlo methods. Journal of Educational and Behavioral Statistics, 24(2), 146–178.
Suto, I., Nádas, R., & Bell, J. (2008). Who should mark what? A study of factors affecting marking accuracy in a biology examination. Research Papers in Education, 23(4), 477–497.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Routledge.