Session Information
09 SES 05 B, Assessing Competencies in Mother Tongue, Second and Foreign Language
Paper Session
Contribution
The current study is situated within the domain of high-stakes performance assessment. The aim of the study was to explore rater variability in an oral foreign language (FL) high-stakes performance test. This was done using a mixed-methods approach to examine both score variation and the rating process; in other words, to look behind the scores with the aim of interpreting their meaning. The research questions posed in the study address three main issues: a) inter-rater variability, b) performance features salient to raters in their decision-making, and c) the possible relationship between scores and raters’ justifications of these scores.
Performance-based, constructed-response tasks are increasingly used in high-stakes testing in Europe and internationally, since some of the most central educational goals, such as speaking and writing, necessitate the evaluation of complex performances by human raters. An advantage of performance assessment is the direct link between the performance observed and the performance the assessment is intended to capture. This directness is considered a sign of high fidelity, or authenticity (Fitzpatrick & Morrison, 1971; Wiggins, 2011). However, the psychometric challenges related to the reliability of rater-mediated assessments are well known (Congdon & McQueen, 2000; Dunbar, Koretz, & Hoover, 1991). Factors associated with raters rather than with test-taker performance, so-called rater effects, may affect the validity and reliability of assessment outcomes and are thus regarded as sources of construct-irrelevant variance (Messick, 1989). Both quantitative and qualitative methods are used to investigate rater effects, which can manifest in many different forms (Bachman & Palmer, 2010; McNamara, 1996; Weir, 2005), such as rater severity and leniency, differences in the perception and weighting of rating criteria, and interaction effects between raters and other facets of the test situation. Further research into rater effects is needed, especially in high-stakes testing, since the results may contribute important evidence to the validity argument of performance assessments by identifying systematic and predictable patterns of rater behaviour that can be compensated for in various ways.
The performance test used in this study is a paired speaking test, part of a mandatory Swedish national test of English as a foreign language (EFL). An advantage of paired or group speaking tests is the potential for test-takers to produce a wider range of interactional functions than in the traditional oral proficiency interview (OPI), with one interviewer and one interviewee. This makes the paired or group speaking test a stronger indicator of the construct of oral interaction. On the other hand, variables associated with test-takers, so-called interlocutor effects, such as personality, familiarity, gender and proficiency level (Berry, 1993; Iwashita, 1996), may affect the peer interaction in unpredictable ways, thereby threatening validity and reliability. Moreover, assessing multiple speakers is complex due to the co-constructed nature of the performance (Galaczi, 2008), raising the question of whose performance is actually being graded (McNamara, 1997).
In the Swedish school system, teachers bear considerable responsibility for assessment and grading. There are no external examinations with central marking, and final grades are assigned exclusively by the students’ own teachers. To support teachers’ grading decisions, there is a system of mandatory national tests in different subjects – EFL being one – and at different levels. The national tests are considered high-stakes, even though they have an advisory rather than a decisive function (Erickson, 2010). Given their high-stakes nature and the consequences of the assessment outcomes for students’ futures, investigations into different aspects of the marking of the national tests are necessary, in particular of the performance-based parts. The results of such research may have implications for any educational context where large-scale performance assessments are used.
Method
Expected Outcomes
References
Bachman, L. F., & Palmer, A. S. (2010). Language Assessment in Practice. Oxford: Oxford University Press.
Berry, V. (1993). Personality characteristics as a potential source of language test bias. In A. Huhta, K. Sajavaara, & S. Takala (Eds.), Language testing: New openings (pp. 115-124). Jyväskylä, Finland: Institute for Educational Research.
Congdon, P. J., & McQueen, J. (2000). The Stability of Rater Severity in Large-Scale Assessment Programs. Journal of Educational Measurement, 37(2), 163-178. doi:10.1111/j.1745-3984.2000.tb01081.x
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality Control in the Development and Use of Performance Assessments. Applied Measurement in Education, 4(4), 289. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=7366095&site=ehost-live
Erickson, G. (2010). A New Look at Teaching and Testing: English as Subject and Vehicle. In T. Kao & Y. Lin (Eds.), Good Practice in Language Testing and Assessment – A Matter of Responsibility and Respect (pp. 237-258). Taipei, Taiwan: Bookman Books Ltd.
Fitzpatrick, R., & Morrison, E. J. (1971). Performance and product evaluation. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed., pp. 237-270). Washington, D.C.: American Council on Education. (Reprinted in F. L. Finch (Ed.) (1991), Educational Performance Assessment (pp. 89-138). Chicago: The Riverside Publishing Company.)
Galaczi, E. D. (2008). Peer–Peer Interaction in a Speaking Test: The Case of the First Certificate in English Examination. Language Assessment Quarterly, 5(2), 89-119. doi:10.1080/15434300801934702
Iwashita, N. (1996). The validity of the paired interview format in oral performance assessment. Melbourne Papers in Language Testing, 5(2), 51-66. Retrieved from http://www.ltrc.unimelb.edu.au/mplt/volumes/05_02 November1996.pdf
McNamara, T. F. (1996). Measuring Second Language Performance. London and New York: Addison Wesley Longman.
McNamara, T. F. (1997). ‘Interaction’ in second language performance assessment: Whose performance? Applied Linguistics, 18(4), 446-466. doi:10.1093/applin/18.4.446
Messick, S. A. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13-103). New York: Macmillan.
Weir, C. J. (2005). Language Testing and Validation: An Evidence-Based Approach. Basingstoke: Palgrave Macmillan.
Wiggins, G. (2011). A True Test: Toward More Authentic and Equitable Assessment. Phi Delta Kappan, 92(7), 81-93. doi:10.1177/003172171109200721