Session Information
09 SES 07 C, Scrutinizing Tests and Assessments in Reading
Paper Session
Contribution
Interrater reliability is a crucial component in any test program where test-takers’ responses are judged by human raters using scales or scoring rubrics. For oral presentations or extended written responses to reading test items, there is usually no single predefined correct answer. Rather, scoring rubrics must be interpreted by raters and used to determine whether a particular item response displays the expected competence or knowledge. Standardized tests of reading comprehension, such as national tests or the PISA and PIRLS tests, generally include a share of constructed response (CR) items, for which this type of rater interpretation of student performances is required. Thus, in order to validate the test construction, the rating of CR items must be reliable, meaning that raters need to be both consistent and free from different forms of rater effects (Haladyna & Rodriguez, 2013). For reasons of ecological validity, the CR format is often favored by both test constructors and teachers, but its use in standardized tests is still restricted because of rater variation, and the multiple-choice (MC) format is used instead (Campbell, 2005; Solheim & Skaftun, 2009).
The study of interrater reliability of reading test items is a limited area of research, although some researchers have demonstrated that it is possible to attain high levels of consistency between raters (DeSanti & Sullivan, 1984; Taboada, Tonks, Wigfield & Guthrie, 2013). Obviously, the extent of reliability, as well as the exact definition of what might qualify as a “high level” of reliability, will depend on both item construction and the level of rater training. Therefore, any test program that requires subjective scoring needs to evaluate and validate its own level of rater reliability (Bejar, 2012).
National tests of reading in Norway and Sweden are population-based, and both rely on classroom teachers from across the country to do the scoring. Often teachers score the performances of their own students, or at least of students at their own school. Rater training is therefore an extensive and expensive enterprise, which is why open-ended (CR) items must be composed in a way that supports reliable assessment. The Norwegian national reading test in 8th grade contains about 25% CR items and 75% MC items. As with the MC items, CR items are scored on a dichotomous scale (students receive full credit or no credit). The Swedish reading test in 9th grade, by contrast, contains about 75% CR items and about 25% MC items. Here, CR items are normally scored using longer scales (up to five levels). It is thus reasonable to assume that the Swedish reading test may be more sensitive to rater variation than the Norwegian test is.
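To illustrate how rater agreement could be quantified under the two scale designs described above, the sketch below computes a consensus-style estimate (cf. Stemler, 2004): an unweighted Cohen’s kappa for a dichotomously scored CR item and a quadratically weighted kappa for a five-level rubric. This is a minimal, hypothetical example; the function and the score data are illustrative only and are not taken from either national test or pilot study.

```python
import numpy as np

def cohens_kappa(rater_a, rater_b, n_levels, weighted=False):
    """Cohen's kappa for two raters; quadratic weights for ordinal scales."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    # Observed joint distribution of the two raters' scores.
    observed = np.zeros((n_levels, n_levels))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()
    # Expected joint distribution under chance agreement (product of marginals).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Disagreement weights: all-or-nothing, or quadratic for ordinal scales.
    idx = np.arange(n_levels)
    if weighted:
        weights = (idx[:, None] - idx[None, :]) ** 2 / (n_levels - 1) ** 2
    else:
        weights = (idx[:, None] != idx[None, :]).astype(float)
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical scores for ten student responses rated by two teachers.
dichotomous_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # full credit / no credit
dichotomous_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
five_level_a = [4, 2, 3, 0, 1, 4, 2, 3, 1, 0]    # rubric levels 0-4
five_level_b = [4, 1, 3, 0, 2, 4, 2, 2, 1, 0]

print(cohens_kappa(dichotomous_a, dichotomous_b, n_levels=2))                # unweighted kappa
print(cohens_kappa(five_level_a, five_level_b, n_levels=5, weighted=True))   # weighted kappa
```

Because the weighted kappa penalizes large disagreements more heavily than adjacent-level disagreements, the two statistics are not directly comparable, which is one reason why rater variation in the two test designs needs to be examined separately.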
In this paper presentation, we bring together the results from two different pilot studies in which the interrater reliability of the national reading tests in Norway and Sweden, respectively, was examined. The purpose is to find out to what extent the two test designs support reliable assessment of open-ended responses. In addition, we want to investigate possible causes of rater variability connected to item design. This knowledge is valuable for several reasons. First of all, as pointed out above, interrater reliability is a critical aspect of validating the results of any test design involving subjective scoring. Second, it is useful for making informed decisions about the balance between different response formats in future test designs. Third, knowledge about teachers’ rating of reading test responses is a key to understanding daily classroom assessment of reading comprehension in school, and thereby also to understanding the formative use of assessments as well as the semester grading of student achievement.
Method
Expected Outcomes
References
Bejar, I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9.
Campbell, J. R. (2005). Single instrument, multiple measures: Considering the use of multiple item formats to assess reading comprehension. In S. G. Paris & S. A. Stahl (Eds.), Children’s reading comprehension and assessment (pp. 347–368). Mahwah, NJ: Lawrence Erlbaum Associates.
DeSanti & Sullivan (1984). Inter-rater reliability of the cloze reading inventory as a qualitative measure of reading comprehension. Reading Psychology: An International Journal, 5, 203–208.
Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Frankfurt am Main: Peter Lang.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. New York: Routledge.
Illinois State Board of Education (2013). Illinois standards achievement test 2013: Technical manual. Springfield, IL: Illinois State Board of Education, Division of Assessment.
Linacre, J. M. (2014). Facets® (Version 3.71.4) [Computer software]. Beaverton, OR: Winsteps.com.
McNamara, T. F. (2000). Language testing. Oxford: Oxford University Press.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189–227.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press.
Solheim, O. J., & Skaftun, A. (2009). The problem of semantic openness and constructed response. Assessment in Education: Principles, Policy & Practice, 16(2), 149–164.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research, and Evaluation, 9(4), 1–19.
Taboada, A., Tonks, S. M., Wigfield, A., & Guthrie, J. (2013). Effects of motivational and cognitive variables on reading comprehension. In D. E. Alvermann, N. J. Unrau, & R. B. Ruddell (Eds.), Theoretical models and processes of reading (6th ed., pp. 589–610). Newark, DE: International Reading Association.