Session Information
09 SES 05 B, Assessing Competencies in Mother Tongue, Second and Foreign Language
Paper Session
Contribution
The current study is situated within the domain of high-stakes performance assessment. The aim of the study was to explore rater variability in an oral foreign language (FL) high-stakes performance test. This was done using a mixed-methods approach to examine both score variation and the rating process; in other words, to look behind the scores with the aim of interpreting their meaning. The research questions posed in the study address three main issues: a) inter-rater variability, b) performance features salient to raters in their decision-making, and c) the possible relationship between scores and raters’ justifications of these scores.
Performance-based, constructed-response tasks are increasingly used in high-stakes testing in Europe and internationally, since some of the most central educational goals, such as speaking and writing, necessitate the evaluation of complex performances by human raters. An advantage of performance assessment is the direct link between the performance observed and the performance the assessment is intended to capture. This directness is considered a sign of high fidelity, or authenticity (Fitzpatrick & Morrison, 1971; Wiggins, 2011). However, the psychometric challenges related to the reliability of rater-mediated assessments are well known (Congdon & McQueen, 2000; Dunbar, Koretz, & Hoover, 1991). Factors associated with raters rather than with test-taker performance, so-called rater effects, may affect the validity and reliability of assessment outcomes and are thus regarded as sources of construct-irrelevant variance (Messick, 1989). Both quantitative and qualitative methods are used to investigate rater effects, which can manifest in many different forms (Bachman & Palmer, 2010; McNamara, 1996; Weir, 2005), such as rater severity and leniency, differences in the perception and weighting of rating criteria, and interaction effects between raters and other facets of the test situation. Further research into rater effects is needed, especially in high-stakes testing, since the results may contribute important evidence to the validity argument of performance assessments by identifying systematic and predictable patterns of rater behaviour that can be compensated for in various ways.
The performance test used in this study is a paired speaking test, part of a mandatory Swedish national test of English as a foreign language (EFL). An advantage of paired or group speaking tests is the potential for test-takers to produce a wider range of interactional functions than in the traditional oral proficiency interview (OPI), with one interviewer and one interviewee. This makes the paired or group speaking test a stronger indicator of the construct of oral interaction. On the other hand, variables associated with test-takers, so-called interlocutor effects, such as personality, familiarity, gender and proficiency level (Berry, 1993; Iwashita, 1996), may affect the peer interaction in unpredictable ways, thereby threatening validity and reliability. Moreover, assessing multiple speakers is complex due to the co-constructed nature of the performance (Galaczi, 2008), raising the question of whose performance is actually being graded (McNamara, 1997).
In the Swedish school system, teachers bear considerable responsibility for assessment and grading. There are no external examinations with central marking, and final grades are assigned exclusively by the students’ own teachers. To support teachers’ grading decisions, there is a system of mandatory national tests in different subjects – EFL being one – and at different levels. The national tests are considered high-stakes, even though they have an advisory rather than a decisive function (Erickson, 2010). Given their high-stakes nature and the consequences of the assessment outcomes for students’ futures, investigations into different aspects of the marking of the national tests are necessary, in particular of the performance-based parts. The results of such research may have implications for any educational context where large-scale performance assessments are used.
Method
Expected Outcomes
References
Bachman, L. F., & Palmer, A. S. (2010). Language Assessment in Practice. Oxford: Oxford University Press.
Berry, V. (1993). Personality characteristics as a potential source of language test bias. In A. Huhta, K. Sajavaara, & S. Takala (Eds.), Language testing: New openings (pp. 115-124). Jyväskylä, Finland: Institute for Educational Research.
Congdon, P. J., & McQueen, J. (2000). The Stability of Rater Severity in Large-Scale Assessment Programs. Journal of Educational Measurement, 37(2), 163-178. doi:10.1111/j.1745-3984.2000.tb01081.x
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality Control in the Development and Use of Performance Assessments. Applied Measurement in Education, 4(4), 289. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=7366095&site=ehost-live
Erickson, G. (2010). A New Look at Teaching and Testing: English as Subject and Vehicle. In T. Kao & Y. Lin (Eds.), Good Practice in Language Testing and Assessment – A Matter of Responsibility and Respect (pp. 237-258). Taipei, Taiwan: Bookman Books Ltd.
Fitzpatrick, R., & Morrison, E. J. (1971). Performance and product evaluation. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed., pp. 237-270). Washington, D.C.: American Council on Education. (Reprinted in F. L. Finch (Ed.) (1991), Educational Performance Assessment (pp. 89-138). Chicago: The Riverside Publishing Company.)
Galaczi, E. D. (2008). Peer–Peer Interaction in a Speaking Test: The Case of the First Certificate in English Examination. Language Assessment Quarterly, 5(2), 89-119. doi:10.1080/15434300801934702
Iwashita, N. (1996). The validity of the paired interview format in oral performance assessment. Melbourne Papers in Language Testing, 5(2), 51-66. Retrieved from http://www.ltrc.unimelb.edu.au/mplt/volumes/05_02 November1996.pdf
McNamara, T. F. (1996). Measuring Second Language Performance. London and New York: Addison Wesley Longman.
McNamara, T. F. (1997). ‘Interaction’ in second language performance assessment: Whose performance? Applied Linguistics, 18(4), 446-466. doi:10.1093/applin/18.4.446
Messick, S. A. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13-103). New York: Macmillan.
Weir, C. J. (2005). Language Testing and Validation: An Evidence-Based Approach. Basingstoke: Palgrave Macmillan.
Wiggins, G. (2011). A True Test: Toward More Authentic and Equitable Assessment. Phi Delta Kappan, 92(7), 81-93. doi:10.1177/003172171109200721