Session Information
99 ERC ONLINE 21 C, Research in Education
Paper Session
Meeting-ID: 836 8291 6296 Code: 4PGV4V
Contribution
The Objective Structured Clinical Examination (OSCE) is a performance assessment common in the health sciences. In a traditional OSCE, a test-taker moves through an examination hall, completing a series of stations, at each of which they must perform a specific task or series of tasks. At each station, test-takers are judged by a trained examiner, who awards them a grade on the basis of a marking guide specific to that station. Test-takers may be assessed on a wide variety of skills, from simple tasks such as hand-washing to complex procedures such as taking a patient’s history (Khan et al., 2013). Some stations involve the use of Standardised Patients (SPs), who are actors trained to display specific symptoms, or to act in a certain way, to every test-taker who completes that station. A key advantage of the OSCE is that all students who take the exam complete exactly the same set of stations, under the same conditions, and are judged according to a standardised set of criteria (the marking guides).
The question of whether the OSCE produces reliable scores is an important one. The standardisation inherent in the OSCE is intended (in part) to ensure that OSCE scores are sufficiently reliable for valid decisions to be made on their basis. One method of calculating reliability is inter-rater reliability: the agreement between different human raters when they assess the same thing (Stemler, 2004). In theory, it should be possible to train assessors such that a single student performance would receive the same score from every assessor. However, multiple researchers have pointed out that a range of factors affects how assessors determine whether a student has performed well, and that efforts to remove all of these factors completely will be unsuccessful (e.g., Gingerich et al., 2014). In the literature, there is an increasing consensus that assessors are “active information processors who interpret and construct their own personal reality of the assessment context” (Govaerts & van der Vleuten, 2013, p. 1169).
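As a concrete illustration of what a consensus estimate of inter-rater reliability involves (Stemler, 2004), the minimal sketch below computes percent agreement and Cohen’s kappa for two hypothetical raters scoring the same ten binary checklist items. The scores are invented for demonstration only and are not data from this study.

```python
# Illustrative sketch: two consensus estimates of inter-rater reliability
# (percent agreement and Cohen's kappa) for two hypothetical raters
# scoring the same ten binary checklist items (1 = done, 0 = not done).
# The scores below are invented for demonstration purposes.

from collections import Counter

rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

n = len(rater_a)

# Percent agreement: proportion of items on which the raters give the same mark.
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Agreement expected by chance, from each rater's marginal score frequencies.
freq_a = Counter(rater_a)
freq_b = Counter(rater_b)
expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))

# Cohen's kappa corrects the observed agreement for chance agreement.
kappa = (observed - expected) / (1 - expected)

print(f"Percent agreement: {observed:.2f}")  # 0.80 for these data
print(f"Cohen's kappa:     {kappa:.2f}")     # ~0.52 for these data
```

Percent agreement alone can look flattering when one outcome dominates the checklist; kappa is reported alongside it because it discounts the agreement two raters would reach by chance.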
This ongoing doctoral research project attempts to determine how assessors of undergraduate nursing OSCEs make decisions about student performance levels, and whether these decisions have implications for score reliability. In order to investigate these issues, a mixed-methods study is being conducted (Johnson & Onwuegbuzie, 2004). The researcher has recorded a series of videos depicting student nurses completing two OSCE stations: blood pressure measurement and naso-gastric tube insertion. Assessors will be shown these videos and instructed to “think aloud” (Ericsson & Simon, 1980) while watching them, in order to ascertain how they observe and interpret the performances. Additionally, assessors will be instructed to complete the marking guides for each OSCE station, which will allow inter-rater reliability to be calculated across assessors (Stemler, 2004).
The results from this study will add to the discussion around the design of performance assessments, particularly in the health sciences. However, the study also has relevance within broader educational contexts. The Teacher Selection Project (TSP) is an international research project funded by the European Research Council that aims to develop “contextualised teacher selection methods based on robust research” (TSP, 2019). The TSP is working to develop OSCE-style protocols for selecting prospective teachers onto programmes of initial teacher education (Rushby & Granger, 2020). Any work on the use of OSCEs is therefore potentially relevant within the field of education. Furthermore, this research project touches on many issues important in educational research, particularly regarding assessment: validity, reliability, and assessment design. As the OSCE continues to proliferate around the globe, it is vital that this spread is underpinned by high-quality research.
Method
This study is utilising a mixed-methods approach to investigate assessors’ judgement mechanisms. Under the mixed-methods typology devised by Johnson & Onwuegbuzie (2004), the study is QUAL + quan: the bulk of the collected data will be qualitative, augmented by a smaller amount of quantitative data collected concurrently. In line with previous research investigating assessors’ decisions about students, this study uses a combination of a semi-structured interview and a think-aloud protocol in which assessors (n = 15) vocalise their thought processes while watching videos of students completing an OSCE (Hyde et al., 2020; Roberts et al., 2020). The interview questions were informed by key concepts discussed in previous studies: observation, interpretation, expertise and variance. For the think-aloud protocol, assessors will watch a series of four videos: two nursing students each completing two OSCE stations (blood pressure measurement and naso-gastric tube insertion). Although verbal reports can never be considered a completely accurate reflection of an individual’s mental processing, numerous studies have documented their utility for investigating judgement mechanisms (e.g., Roberts et al., 2020). Qualitative data will be analysed using thematic analysis. Additionally, assessors will be asked to complete a marking guide for each station, comprising a series of checklist items marked in a binary done/not done fashion, as well as two global judgements about the student’s communication skills and overall performance (marked as fail/borderline pass/good/excellent). Calculating inter-rater reliability will allow for the identification of whether specific assessors are harsher or more lenient than others, as well as whether certain checklist items are more likely to cause disagreement between assessors (Stemler, 2004). Triangulating the qualitative and quantitative data will allow for a deeper understanding of why specific items might be more prone to unwanted score variance.
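To make the planned quantitative analysis concrete, the sketch below illustrates the two checks described above on a panel of fifteen assessors: each assessor’s overall leniency, and the pairwise agreement on each checklist item (the per-item term used in Fleiss’ kappa, a common multi-rater extension of Cohen’s kappa). The data matrix is randomly generated placeholder data, and the matrix shape and analysis choices are assumptions for illustration rather than the study’s final analysis plan.

```python
# Illustrative sketch, on invented placeholder data, of two analyses:
# (1) per-assessor leniency (mean proportion of items marked "done"), and
# (2) per-item agreement across assessors, to flag checklist items that
#     attract the most disagreement.
# Rows = 15 assessors, columns = checklist items, entries are binary
# done (1) / not done (0) marks for a single video.

import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(15, 8))  # placeholder for real marking-guide data

# (1) Leniency: assessors with a high row mean mark "done" more often.
leniency = scores.mean(axis=1)
for i, m in enumerate(leniency):
    print(f"Assessor {i + 1:2d}: proportion 'done' = {m:.2f}")

# (2) Per-item agreement: the proportion of assessor pairs giving the same
# mark. For a binary item scored 1 by k of n assessors, the agreeing pairs
# number C(k, 2) + C(n - k, 2) out of C(n, 2); this is the per-item term
# in Fleiss' kappa. Items with low values attract the most disagreement.
n = scores.shape[0]
pairs = n * (n - 1) / 2
k = scores.sum(axis=0)
agreement = (k * (k - 1) / 2 + (n - k) * (n - k - 1) / 2) / pairs

for j, a in enumerate(agreement):
    print(f"Item {j + 1}: pairwise agreement = {a:.2f}")
```

In practice the placeholder matrix would be replaced by the completed marking guides, with one such matrix per video, so that low-agreement items can be cross-referenced against the think-aloud data.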
Expected Outcomes
Research into assessors’ cognitive processes, regardless of discipline, has proliferated in the last decade, driven by the realisation that no amount of training is likely to completely mitigate the range of individual factors that affect how an assessor forms a judgement about a student’s performance (Gingerich et al., 2014). As such, there is an emerging understanding that complete agreement between assessors might be impossible to attain. However, this does not mean that improvements to assessment design (by aligning the grading process with how assessors actually make decisions in practice) cannot remove significant amounts of unwanted score variance. Recent years have seen the OSCE become more common in undergraduate assessments across the world; it is now used in over 50 countries (Patrício et al., 2013). Investigating how these assessments are graded will add to the empirical justification for their expanded use. Moreover, as noted earlier, assessors’ judgement processes are an area of increasing focus in contexts beyond health sciences education, with implications for assessment across educational environments more broadly. It is expected that this investigation will have implications for validity, reliability, and overall assessment quality. The last two years have seen performance assessments such as the OSCE move to online provision (Kunutsor et al., 2021). For the scores produced by such assessments to be valid, it is important that researchers are able to determine how assessors make decisions about students, and whether these decisions differ between the online format and traditional, face-to-face assessments. Additionally, as the OSCE becomes a global assessment, investigating assessors’ judgements will add to the discussion around the extent to which such decisions are made in a universal way, or whether local factors will always mediate and inform the grades awarded to students.
References
Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological Review, 87(3), 215–251.
Gingerich, A., van der Vleuten, C. P. M., Eva, K. W., & Regehr, G. (2014). More consensus than idiosyncrasy: Categorizing social judgments to examine variability in Mini-CEX ratings. Academic Medicine, 89(11), 1510–1519.
Govaerts, M., & van der Vleuten, C. P. (2013). Validity in work-based assessment: Expanding our horizons. Medical Education, 47(12), 1164–1174.
Hyde, C., Yardley, S., Lefroy, J., Gay, S., & McKinley, R. K. (2020). Clinical assessors’ working conceptualisations of undergraduate consultation skills: A framework analysis of how assessors make expert judgements in practice. Advances in Health Sciences Education.
Johnson, R. B., & Onwuegbuzie, A. J. (2004). Mixed methods research: A research paradigm whose time has come. Educational Researcher, 33(7), 14–26.
Khan, K. Z., Gaunt, K., Ramachandran, S., & Pushkar, P. (2013). The Objective Structured Clinical Examination (OSCE): AMEE Guide No. 81. Part II: Organisation & administration. Medical Teacher, 35(9).
Kunutsor, S., Metcalf, E., Westacott, R., Revell, L., & Blythe, A. (2021). Are remote clinical assessments a feasible and acceptable method of assessment? A systematic review. Medical Teacher.
Patrício, M. F., Julião, M., Fareleira, F., & Carneiro, A. V. (2013). Is the OSCE a feasible tool to assess competencies in undergraduate medical education? Medical Teacher, 35, 503–514.
Roberts, R., Cook, M., & Chao, I. (2020). Exploring assessor cognition as a source of score variability in a performance assessment of practice-based competencies. BMC Medical Education, 20(1), 168.
Rushby, J., & Granger, H. (2020). Can multiple mini-interviews assist in selecting candidates for initial teacher education? Teacher Select. https://www.teacherselect.org/can-mmis-assist-in-selecting-candidates-for-ite/
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research, and Evaluation, 9(1).