Interrater Reliability of Constructed Response Items in Standardized Tests of Reading

Author(s):

Michael Tengberg(presenting / submitting)Astrid Roe(presenting)Gustaf B. Skar

Conference:

ECER 2016

Network:

09. Assessment, Evaluation, Testing and Measurement

Format:

Paper

Session Information

09 SES 07 C, Scrutinizing Tests and Assessments in Reading

Paper Session

Time:

2016-08-24

17:15-18:45

Room:

NM-F107

Chair:

Rolf Vegar Olsen

Contribution

Interrater reliability is a crucial component in any test program where test-takers’ responses are judged by human raters using scales or scoring rubrics. For oral presentations or extended written responses to reading test items there are unsually no single predefined correct answers. Rather, scoring rubrics must be interpreted by raters and used to determine whether a particular item response displays the expected competence or knowledge. Standardized tests of reading comprehension, such as national tests or the PISA and PIRLS tests generally include a share of constructed response (CR) items, for which this type of rater interpretation of student performances is required. In order to validate the test construction, thus, rating of CR items must be reliable, meaning that raters need to be both consistent and free from different forms of rater effects (Haladyna & Rodriguez, 2013). For reasons of ecological validity, the CR format is often favored by both test constructors and teachers, but the use of it in standardized tests is still restricted because of rater variation, and the multiple-choice (MC) format is used instead (Campbell, 2005; Solheim & Skaftun, 2009)

The study of interrater reliability of reading test items is a limited area of research, although some researchers have demonstrated that it is possible to attain high levels of consistency between raters (DeSanti & Sullivan, 1984; Taboada, Tonks, Wigfield & Guthrie, 2013). Obviously, the extent of reliability, as well as the exact definition of what might qualify as a “high level” of reliability, will depend on both item construction and the level of rater training. Therefore, any test program that requires subjective scoring needs to evaluate and validate their own proportion of rater reliability (Bejar, 2012).

National tests of reading in Norway and Sweden are population based and both of them involve classroom teachers from the entire country to do the scoring. Often teachers score the performances of their own students, or at least students at their own school. Rater training is therefore an extensive and expensive enterprise, why it is necessary that open-ended items (CR items) are composed in a way that supports reliable assessment. The Norwegian national reading test in 8^th grade contains about 25% CR items and 75% MC items. As with the MC items, CR items are scored on a dichotomous scale (students receive full credit or no credit). The Swedish reading test in 9^th grade, quite oppositely, contains about 75% CR items and about 25% MC items. Here, CR items are normally scored using longer (up to 5 levels) scales. It is thus reasonable to assume that the Swedish reading test may be more sensitive to rater variation than the Norwegian test is.

In this paper presentation, we bring together the results from two different pilot studies in which interrater reliability of the national reading tests in Norway and Sweden respectively was examined. The purpose is to find out to what range the two test designs support reliable assessment of open-ended responses. In addition, we want to investigate possible causes of rater variability connected to item design. This knowledge is valuable for several reasons. First of all, as pointed out above, interrater reliability is a critical aspect of validating the results of any test design involving subjective scoring. Second, it is useful in order to make informed decisions about the balance between different response format in future test designs. Third, knowledge about teachers’ rating of reading test responses is a key to understand the daily classroom assessment of reading comprehension in school, and thereby also a key to understand formative use of assessments as well as semester grading of student achievements.

Method

Ratings of student responses to CR items were collected from the Swedish national reading test in 9th grade (administered during 2015) and from the Norwegian national reading test in 8th grade (administered during 2015). In the Swedish pilot study, six teachers rated the responses of three students on 14 different open-ended items, 252 ratings in all. Considering the small sample of students and teachers, it is essential to point out that what is being measured is not the extent to which teachers of Swedish are capable of reliable assessment of students reading ability, but rather whether the reading test itself is designed in a way that supports reliable assessment. The teachers were all qualified and experienced teachers of Swedish. In the Norwegian study, 20 teachers rated the responses of 23 students on 11 open-ended items, 5060 ratings in total. The teachers were all qualified and participants in an in-depth course on the use of national test results. To investigate the range to which the two test designs support reliable assessment, we have studied interrater reliability on the three categories defined by Stemler (2004): consensus estimates, consistency estimates, and measurement estimates. In the analysis, these categories are represented by statistical measures such as Cohen’s kappa (weighted), Intraclass correlation coefficient (ICC), and rater mean and standard deviation. We have also conducted a Many-Facets Rasch Measurement (MFRM) analysis (Eckes, 2015; c.f. Rasch, 1980), using the rating scale model in the software FACETS (Linacre, 2014). Non-technically, the MFRM model assumes that a person’s score primarily is a function of person ability, item difficulty and rater severity. By modelling ability, difficulty and severity on the logit scale, it is possible to compare observed raw scores with scaled scores, where rater severity is controlled for. The MFRM analysis also can be used for investigation of a number of known rater effects, such as halo and central tendency, as well as rater severity (Myford & Wolfe, 2003, 2004). Based on reliablity measures, we have also conducted qualitative item analysis in order to identify characteristics of both items and student responses that may contribute to rater variation.

Expected Outcomes

It would be expected that the less complex CR items in the Norwegian test would result in higher rater reliability. However, preliminary findings suggest that this is not the case. Results from Cohen’s kappa-analyses of both the Swedish and the Norwegian material show that the average level of agreement is only slightly above the level that McNamara (2000:58) defines as the “rock-bottom minimum of acceptable agreement among raters”, i.e., .70. Cohen’s kappa for both Swedish and Norwegian ratings was .73. In the presentation, these reults are complemented with results from ICC and MFRM analyses as well as results from qualitative item analysis. Preliminary findings from the latter demonstrate that certain items result in considerable disagreement between raters. Typical characteristics of these items are that they require extensive interpretation or reflection and that students are to provide reasons for their responses. Items that require retrieval of information generally result in larger levels of agreement. As for the responses themselves, the anlysis indicates that vague responses interact significantly with rater effects such as leniniency or severity. Furthermore, responses which seem reasonable or even qualified to raters, although they are not listed in the scoring guide, may also cause confusion and rater variation. In the presentation, findings from the study are discussed both in terms of their implications for equal assessment of students’ reading ability, and in terms of necessary future development of the national reading tests in Norway and Sweden.

References

Bejar, I (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9. Campbell, J. R. (2005). Single instrument, multiple measures: Considering the use of multiple item formats to assess reading comprehension. In S. G. Paris, & S. A. Stahl (Eds.), Children’s reading comprehension and assessment (pp. 347–368). Mahwah, New Jersey: Lawrence Erlbaum Ass. DeSanti & Sullivan (1984). Inter-rater reliability of the cloze reading inventory as a qualitative measure of reading comprehension. Reading Psychology: An International Journal, 5, 203–208. Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Frankfurt am Main: Peter Lang. Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. New York: Routledge. Illinois State Board of Education (2013). Illinois standards achievement test 2013. Technical Manual. Springfield, IL: Illinois State Board of Education, Division of Assessment. Linacre, J. M. (2014). Facets® (version 3.71.4) [Computer Software]. Beaverton, Oregon: Winsteps.com. McNamara, T. F. (2000). Language testing. Oxford: Oxford University Press. Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using Many-Facet Rasch Measurement: Part I. Journal of Applied Measurement, 4(4), 386–422. Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using Many-Facet Rasch Measurement: Part II. Journal of Applied Measurement, 5(2), 189–227. Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. Solheim, O. J., & Skaftun, A. (2009). The problem of semantic openness and constructed response. Assessment in Education: Principles, Policy & Practice, 16(2), 149–164. Stemler, S. E. (2004). A comparison of consensus, consistency, and measurment approaches to estimating interrater reliability. Practical Assessment, Research, and Evaluation, 9(4), 1–19. Taboada, A., Tonks, S. M., Wigfield, A., & Guthrie, J. (2013). Effects of motivational and cognitive variables on reading comprehension. In D. E. Alvermann, N. J. Unrau, & R. B. Ruddell (Eds.), Theoretical models and processes of reading (6th ed.) (p. 589–610). Newark, DE: International Reading Association.

Author Information

Michael Tengberg (presenting / submitting)

Karlstad university

Department of Educational Studies

Karlstad

Astrid Roe (presenting)

University of Oslo, Norway

Gustaf B. Skar

Norwegian University of Science and Technology, Norway

Search the ECER Programme

Search for keywords and phrases in "Text Search"
Restrict in which part of the abstracts to search in "Where to search"
Search for authors and in the respective field.
For planning your conference attendance you may want to use the conference app, which will be issued some weeks before the conference
If you are a session chair, best look up your chairing duties in the conference system (Conftool) or the app.