Session Information
09 SES 02 B, Issues in Assessing Reading and Language Competencies in Mother Tongue, Second and Foreign Language
Paper Session
Contribution
Equity in the assessment of students’ knowledge and skills is a cornerstone of all forms of education. This principle should pervade both everyday classroom assessment and large-scale testing systems. From previous research, however, it is well known that subjective scoring of complex abilities such as writing and speaking inevitably involves disagreement between raters (Borgström & Ledin, 2014; Engelhard, 2002; Gwet, 2014; Meadows & Billington, 2005; Tengberg, Roe & Skar, submitted). When the disagreement is too large, the reliability of the assessment is jeopardized. By extension, so is the fundamental right to impartiality and equal opportunities in education. This presentation reports on a study that aimed to contribute to the reliability of assessment of complex language performances by investigating the effects of a rater training program designed to improve inter-rater reliability.
Unlike in many other countries, national test performances in Sweden are assessed not by external rater panels but by classroom teachers across the country (EACEA, 2009). Often, teachers are responsible for assessing the performances of their own students, or at least of students at their own school. Rater training would therefore be an extensive and expensive enterprise, and to date no particular rater training has been provided to teachers in Sweden. At the same time, the national test results are high-stakes for both students and schools, and rater reliability is therefore a critical dimension of test validity.
Previous research on rater reliability has identified various forms of rater effects (cf. Eckes, 2005; Haladyna & Rodriguez, 2013) and theorized about rater cognition (cf. Bejar, 2012). However, empirical findings from studies that investigate the effects of rater training programs are mixed. While some studies indicate positive effects, for instance by bringing outliers (i.e., extremely severe or lenient raters) more into line or by promoting awareness of construct dimensions (Harsch & Martin, 2012; McIntyre, 1993; Knoch, 2011), other studies demonstrate minor or no effects on rater reliability (Lunz, Wright, & Linacre, 1990; Weigle, 1998). In studies outside the field of education, various types of rater training programs have been identified and compared. Woehr and Huffcutt (1994), and later Roch et al. (2012), argue for an approach called frame-of-reference (FOR) training, which emphasizes in particular raters’ awareness of dimensionality and their access to performance prototypes. It seems clear, however, that although there is some evidence in favor of training, generalization across contexts and performance types is unwarranted: first, because in-depth descriptions of the training practices are to a large extent missing; second, because different educational contexts and task types interact differently with various forms of rater effects (Haladyna & Rodriguez, 2013).
In language assessment, it is often preferred to use high-fidelity tasks, i.e., tasks that simulate processes associated with language use in the real world (McNamara, 2000). But insofar as subjective scoring is necessary, test developers and administrators must ensure that test results depend on the level of performance and not on who is doing the scoring.
The present study investigated the effects of a rater training program called Rater Identity Development (RID) on the reliability of assessing middle-grade students’ L1 (Swedish) writing performances and L2 (English) oral performances. More specifically, the aim was to determine
(1) to what extent RID contributes to consistency and consensus in the assessment of students’ L1 writing performances and L2 oral performances (see the illustrative sketch below), and
(2) in regard to which dimensions of L1 writing performance and L2 oral performance the training contributes to inter-rater reliability.
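As a point of reference for these aims, the following is a minimal, hypothetical sketch of how consensus and consistency between raters can be operationalized, here as exact score agreement and rank-order correlation, respectively. The abstract does not specify the study’s analytic procedures, and the rater scores in the sketch are invented for illustration only.

```python
# Hypothetical sketch (not the study's actual analysis): two simple ways to
# quantify inter-rater reliability for a set of performances scored by two
# raters. "Consensus" is operationalized as exact score agreement and
# "consistency" as rank-order correlation. The scores below are invented.

from scipy.stats import spearmanr

# Invented scores from two raters for the same ten performances (1-5 scale)
rater_a = [3, 4, 2, 5, 3, 4, 1, 3, 2, 4]
rater_b = [3, 3, 2, 5, 4, 4, 2, 3, 2, 5]

# Consensus: proportion of performances that received identical scores
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Consistency: do the two raters rank the performances in the same order?
consistency, _ = spearmanr(rater_a, rater_b)

print(f"Exact agreement (consensus):        {exact_agreement:.2f}")
print(f"Spearman correlation (consistency): {consistency:.2f}")
```

In operational settings, analyses of rater effects in performance assessment often rely on more elaborate approaches, such as the many-facet Rasch models referred to above (Eckes, 2005; Lunz, Wright, & Linacre, 1990; Weigle, 1998).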
The study yields valuable knowledge for test developers and administrators, as well as for teachers and teacher educators.
Method
Expected Outcomes
References
Bejar, I. I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9.
Borgström, E., & Ledin, P. (2014). Bedömarvariation. Balansen mellan teknisk och hermeneutisk rationalitet vid bedömning av skrivprov [Rater variation: The balance between technical and hermeneutic rationality in the assessment of writing tests]. Språk och Stil, 24, 133–165.
EACEA; Eurydice (2009). National testing of pupils in Europe: Objectives, organisation and use of results. Brussels: Eurydice.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197–221.
Engelhard, G. (2002). Monitoring raters in performance assessments. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 261–288). Mahwah, NJ: Lawrence Erlbaum Associates.
Gwet, K. L. (2014). Handbook of inter-rater reliability. Gaithersburg, MD: Advanced Analytics, LLC.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. New York: Routledge.
Harsch, C., & Martin, G. (2012). Adapting CEF-descriptors for rating purposes: Validation by a combined rater training and scale revision approach. Assessing Writing, 17, 228–250.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training raters online: How does it compare with face-to-face training? Assessing Writing, 12, 26–43.
Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331–345.
Matre, S., & Solheim, R. (2015). Writing education and assessment in Norway: Towards shared understanding, shared language and shared responsibility. L1 Educational Studies in Language and Literature, 15, 1–33.
McIntyre, P. N. (1993). The importance and effectiveness of moderation training on the reliability of teachers’ assessments of ESL writing samples. Unpublished master’s thesis, University of Melbourne, Melbourne.
McNamara, T. F. (2000). Language testing. Oxford: Oxford University Press.
Meadows, M., & Billington, L. (2005). A review of the literature on marking reliability. London: National Assessment Agency.
Roch, S. G., Woehr, D. J., Mishra, V., & Kieszczynska, U. (2012). Rater training revisited: An updated meta-analytic review of frame-of-reference training. Journal of Occupational and Organizational Psychology, 85, 370–395.
Swedish National Agency for Education (2014). Co-assessment in school [Sambedömning i skolan]. Stockholm: Swedish National Agency for Education.
Tengberg, M., Roe, A., & Skar, G. B. (submitted).
Weigle, S. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287.
Woehr, D. J., & Huffcutt, A. J. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67, 189–205.