Session Information
09 SES 02 B, Issues in Assessing Reading and Language Competencies in Mother Tongue, Second and Foreign Language
Paper Session
Contribution
Equity in the assessment of students’ knowledge and skills is a cornerstone of all forms of education. This principle should pervade both everyday classroom assessment and large-scale testing systems. From previous research, however, it is well known that subjective scoring of complex abilities such as writing and speaking inevitably involves disagreement between raters (Borgström & Ledin, 2014; Engelhard, 2002; Gwet, 2014; Meadows & Billington, 2005; Tengberg, Roe & Skar, submitted). When the disagreement is too large, the reliability of the assessment is jeopardized. By extension, so is the fundamental right to impartiality and equal opportunities in education. This presentation reports on a study that aimed to contribute to the reliability of assessment of complex language performances by investigating the effects of a rater training program designed to improve inter-rater reliability.
Unlike in many other countries, national test performances in Sweden are assessed not by external rater panels but by classroom teachers across the country (EACEA, 2009). Often, teachers are responsible for assessing the performances of their own students, or at least of students at their own school. Rater training would therefore be an extensive and expensive enterprise, and to date no particular rater training has been provided to teachers in Sweden. At the same time, the national test results are high-stakes for both students and schools, and rater reliability is therefore a critical dimension of test validity.
Previous research on rater reliability has identified various forms of rater effects (cf. Eckes, 2005; Haladyna & Rodriguez, 2013) and theorized about rater cognition (cf. Bejar, 2012). However, empirical findings from studies that investigate the effects of rater training programs are mixed. While some studies indicate positive effects, for instance by bringing outliers (i.e., extremely severe or lenient raters) more into line or by promoting awareness of construct dimensions (Harsch & Martin, 2012; McIntyre, 1993; Knoch, 2011), other studies demonstrate minor or no effects on rater reliability (Lunz, Wright, & Linacre, 1990; Weigle, 1998). In studies outside the field of education, various types of rater training programs have been identified and compared. Woehr and Huffcutt (1994), and later Roch et al. (2012), argue for an approach called frame-of-reference (FOR) training, which emphasizes in particular raters’ awareness of dimensionality and their access to performance prototypes. It seems clear, however, that although there is some evidence in favor of training, generalization across contexts and performance types is unwarranted: first, because in-depth descriptions of the training practices are to a large extent missing; second, because different educational contexts and task types interact differently with various forms of rater effects (Haladyna & Rodriguez, 2013).
In language assessment, it is often preferred to use high-fidelity tasks, i.e., tasks that simulate processes associated with language use in the real world (McNamara, 2000). But insofar as subjective scoring is necessary, test developers and administrators must ensure that test results depend on the level of performance and not on who is doing the scoring.
The present study investigated the effects of a rater training program called Rater Identity Development (RID) on the reliability of assessing middle-grade students’ L1 (Swedish) writing performances and L2 (English) oral performances. More specifically, the aim was to determine
(1) to what extent RID contributes to consistency and consensus in the assessment of students’ L1 writing performances and L2 oral performances (see the illustrative sketch below), and
(2) in regard to which dimensions of L1 writing performance and L2 oral performance the training contributes to inter-rater reliability.
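As a point of reference for these aims, the following is a minimal, hypothetical sketch of how consensus and consistency between raters can be operationalized, here as exact score agreement and rank-order correlation, respectively. The abstract does not specify the study’s analytic procedures, and the rater scores in the sketch are invented for illustration only.

```python
# Hypothetical sketch (not the study's actual analysis): two simple ways to
# quantify inter-rater reliability for a set of performances scored by two
# raters. "Consensus" is operationalized as exact score agreement and
# "consistency" as rank-order correlation. The scores below are invented.

from scipy.stats import spearmanr

# Invented scores from two raters for the same ten performances (1-5 scale)
rater_a = [3, 4, 2, 5, 3, 4, 1, 3, 2, 4]
rater_b = [3, 3, 2, 5, 4, 4, 2, 3, 2, 5]

# Consensus: proportion of performances that received identical scores
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Consistency: do the two raters rank the performances in the same order?
consistency, _ = spearmanr(rater_a, rater_b)

print(f"Exact agreement (consensus):        {exact_agreement:.2f}")
print(f"Spearman correlation (consistency): {consistency:.2f}")
```

In operational settings, analyses of rater effects in performance assessment often rely on more elaborate approaches, such as the many-facet Rasch models referred to above (Eckes, 2005; Lunz, Wright, & Linacre, 1990; Weigle, 1998).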
The study yields valuable knowledge for test developers and administrators, as well as for teachers and teacher educators.
Method
Expected Outcomes
References
Bejar, I. I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9.
Borgström, E., & Ledin, P. (2014). Bedömarvariation. Balansen mellan teknisk och hermeneutisk rationalitet vid bedömning av skrivprov [Rater variation: The balance between technical and hermeneutic rationality in the assessment of writing tests]. Språk och Stil, 24, 133–165.
EACEA; Eurydice (2009). National testing of pupils in Europe: Objectives, organisation and use of results. Brussels: Eurydice.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197–221.
Engelhard, G. (2002). Monitoring raters in performance assessments. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 261–288). Mahwah, NJ: Lawrence Erlbaum Associates.
Gwet, K. L. (2014). Handbook of inter-rater reliability. Gaithersburg, MD: Advanced Analytics, LLC.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. New York: Routledge.
Harsch, C., & Martin, G. (2012). Adapting CEF-descriptors for rating purposes: Validation by a combined rater training and scale revision approach. Assessing Writing, 17, 228–250.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training raters online: How does it compare with face-to-face training? Assessing Writing, 12, 26–43.
Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331–345.
Matre, S., & Solheim, R. (2015). Writing education and assessment in Norway: Towards shared understanding, shared language and shared responsibility. L1 Educational Studies in Language and Literature, 15, 1–33.
McIntyre, P. N. (1993). The importance and effectiveness of moderation training on the reliability of teachers’ assessments of ESL writing samples. Unpublished master’s thesis, University of Melbourne, Melbourne.
McNamara, T. F. (2000). Language testing. Oxford: Oxford University Press.
Meadows, M., & Billington, L. (2005). A review of the literature on marking reliability. London: National Assessment Agency.
Roch, S. G., Woehr, D. J., Mishra, V., & Kieszczynska, U. (2012). Rater training revisited: An updated meta-analytic review of frame-of-reference training. Journal of Occupational and Organizational Psychology, 85, 370–395.
Swedish National Agency for Education (2014). Co-assessment in school [Sambedömning i skolan]. Stockholm: Swedish National Agency for Education.
Tengberg, M., Roe, A., & Skar, G. B. (submitted).
Weigle, S. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287.
Woehr, D. J., & Huffcutt, A. J. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67, 189–205.