Session Information
09 SES 13 A JS, Advancing Assessment Tools and Strategies in Subject-Specific Contexts
Joint Paper Session NW09 and NW27
Contribution
Standardized testing is a manifestation of the neoliberal agenda and human capital theory (Rizvi & Lingard, 2010). Testing is perceived as one of the instruments for holding teachers and schools accountable for students' performance, which can lead either to rewards or to sanctions. Standardized testing is also a means of implicit control and governance that allows policymakers and politicians to audit the education system (Graham & Neu, 2004). Critics of standardized testing argue that it widens the gap between different groups of the student population (Au, 2016), encourages teachers to teach to the test and to ignore unassessed curriculum content and other subjects (Lingard, 2011; Koretz, 2017; Bach, 2020), and facilitates the practice of gaming the system to demonstrate growth in student performance (Rezai-Rashti & Segeren, 2020; Heilig & Darling-Hammond, 2008). Despite this severe criticism, standardized testing can still be used as an effective tool to inform teaching and learning. Testing can help curriculum designers, test developers, teachers, and educators identify students' needs and tailor instruction to those needs (Hamilton et al., 2002; Brown, 2013; Singh et al., 2015). It can also allow policymakers to evaluate the success and efficacy of the education system and identify potential issues to be addressed (Campbell & Levin, 2009).
The purpose of the study is to construct and validate reading assessments that account for local contextual factors, such as curriculum standards and expectations, and that can provide formative information to students and teachers. The research comprises several stages: a pre-pilot study, a pilot study, and the main studies. This abstract presents the results of the pilot study.
The study aims to answer the following research questions:
What are the students' perceptions of the proposed testing instrument?
What are the psychometric properties of the pilot test?
The theoretical framework that guides my research is evidence-centered design (ECD; Mislevy & Riconscente, 2006). ECD employs the concept of layers, where each layer possesses its own characteristics and processes. The goal of the first layer, domain analysis, is to collect substantive information about the target domain and to determine the knowledge, skills, and abilities about which assessment claims will be made. Domain modelling organizes the results of domain analysis to articulate an assessment argument that links observations of student actions to inferences about what students know or can do. Design patterns in domain modelling are arguments that enable assessment specialists and domain experts to gather evidence about student knowledge (Mislevy & Haertel, 2006). The third layer, the conceptual assessment framework (CAF), provides the internals and details of operational assessments. The structure of the CAF is expressed as variables, task schemas, and scoring mechanisms; this layer generates a blueprint for the intended assessment and gives it concrete shape. Assessment implementation constructs and prepares all operational elements specified in the CAF: authoring tasks, finalizing scoring rubrics, establishing parameters in measurement models, and the like. The assessment delivery layer is where students engage with the assessment tasks, their performances are evaluated and measured, and feedback and reports are produced. Thus, ECD provides an essential framework for approaching test design, scoring, and analysis. In my study, the ECD framework will act as guidance to ensure that each layer is constructed and relevant evidence is accumulated.
Method
The instrument of the proposed research is designed to assess students' reading literacy in English. A number of standardized tests that measure reading literacy were reviewed. The main criteria for selecting tests were: 1) Anglophone tests; 2) availability of an online version for public use; 3) standardized reading literacy tests; 4) tests for secondary and high school students; 5) grade-appropriate language and cognitive difficulty levels of the reading passages and test items; 6) a sufficient number of test items; 7) tests that have been administered to large populations. Texts that displayed cultural bias or other features that might negatively affect test validity and reliability were not selected. Since it is important to ensure alignment between the assessment instrument and the curriculum, subject experts were involved in the present study. I also used one element of Webb's (1997) alignment model, the depth-of-knowledge (DOK) criteria, which comprise four levels of cognitive complexity: recall of information, basic reasoning, complex reasoning, and extended reasoning. First, the experts matched DOK levels and curriculum objectives with the test items. Before coding the full set of test items, the subject experts independently coded five to ten items and then compared the DOK levels they had assigned to those items and the corresponding learning objectives (Webb, 2002). After this calibration stage, the experts identified one or two curriculum objectives corresponding to each test item; the experts were not required to reach a unanimous decision about the correspondence between items and objectives. Teachers' feedback would help to eliminate items that (1) do not map onto the curriculum or (2) might be ambiguous or confusing for students. Expert judgement may also help to identify items with construct-irrelevant sources of difficulty or items that might be far too easy for lower-ability students (AERA, 2014). Piloting assessment items is one of the ways to ensure test validity. Standard 3.3 of the AERA (2014) Standards states that analyses carried out in pilot testing should identify aspects of the test design, content, and format that might distort the interpretation of test scores for the intended population. In the current study, pre-test items were piloted with Grade 11 students in one of the target schools. After test piloting, retrospective probing was conducted; its main goal is to examine participants' understanding of the tasks or questions (Leighton, 2017).
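To illustrate how agreement between expert coders can be inspected at the calibration stage, a minimal R sketch is provided below. The data frame, item labels, and DOK codes are hypothetical examples constructed for illustration; they are not data from the study.

```r
# Hypothetical example: comparing two experts' depth-of-knowledge (DOK) codes
# for a small set of calibration items (all values are illustrative only).
dok <- data.frame(
  item     = paste0("Q", 1:8),
  expert_1 = c(1, 2, 2, 3, 1, 4, 2, 3),   # DOK levels 1-4
  expert_2 = c(1, 2, 3, 3, 1, 4, 2, 2)
)

# Exact agreement rate across the calibration items
agreement <- mean(dok$expert_1 == dok$expert_2)
agreement

# Cross-tabulation showing where the two experts' codes diverge
table(Expert1 = dok$expert_1, Expert2 = dok$expert_2)
```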
Expected Outcomes
Five Grade 11 students were interviewed regarding their perceptions of the reading test. Three female and two male students participated in the interviews. Overall, students made recommendations regarding some of the questions and distractors. For instance, students pointed out that the distractors in some questions were unclear. Furthermore, some students argued that two correct options were possible in one of the questions. The questions identified as problematic or confusing for students were reviewed and revised accordingly. The reading literacy test comprised 32 multiple-choice questions (31 items were scored dichotomously, while one item was scored with partial credit: 0, 1, 2). The test was administered to 69 Grade 11 students at a pilot school site. The mean total score was 17.23 (SD = 5.71); the minimum score was 5 and the maximum was 29. Cronbach's alpha was estimated at .79, which indicates an acceptable level of test reliability (DeVellis, 2017). However, the point-biserial correlations indicated that some items had low discrimination, although all items exhibited positive values, suggesting that every item taps the reading construct. Test items were analyzed with the Rasch model using the TAM package (Robitzsch et al., 2021) and marginal maximum likelihood (MML) estimation (Bock & Aitkin, 1981). As one item involved partial scoring, the Masters (1982) partial credit Rasch model was used. The model was identified by constraining the mean of the latent ability distribution to zero; the mean item difficulty estimate was -0.09 (SD = 0.87). Item fit analysis revealed several problematic items that should be reviewed prior to testing with a larger population of students.
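The analysis steps reported above can be reproduced with a short R script of the kind sketched below. The sketch assumes the scored responses are stored in a data frame named resp (69 rows by 32 columns, with the partial-credit item coded 0-2); the object and file names are illustrative and do not come from the study's actual scripts.

```r
# Minimal sketch of the pilot item analysis (assumed data layout: one row per
# student, one column per scored item; file and object names are hypothetical).
library(TAM)

resp <- read.csv("pilot_responses.csv")   # hypothetical file name

## Classical statistics ---------------------------------------------------
total <- rowSums(resp, na.rm = TRUE)
summary(total); sd(total)                 # mean, min, max, SD of raw scores

k <- ncol(resp)
# Cronbach's alpha from item and total-score variances
alpha <- (k / (k - 1)) *
  (1 - sum(apply(resp, 2, var, na.rm = TRUE)) / var(total, na.rm = TRUE))

# Corrected item-total (point-biserial) correlations
item_total <- sapply(seq_len(k), function(j)
  cor(resp[[j]], total - resp[[j]], use = "pairwise.complete.obs"))

## Rasch / partial credit model via MML (TAM) ------------------------------
mod <- tam.mml(resp, irtmodel = "PCM")    # PCM reduces to the Rasch model
                                          # for the dichotomous items
summary(mod)                              # item difficulty (xsi) estimates

fit <- tam.fit(mod)                       # infit / outfit statistics
fit$itemfit                               # flag misfitting items for review
```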
References
American Educational Research Association (2014). Standards for educational and psychological testing. American Educational Research Association.
Au, W. (2016). Meritocracy 2.0: High-stakes, standardized testing as a racial project of neoliberal multiculturalism. Educational Policy, 30(1), 39-62.
Bach, A. J. (2020). High-stakes, standardized testing and emergent bilingual students in Texas. Texas Journal of Literacy Education, 8(1), 18-37. Retrieved September 30, 2021, from https://www.talejournal.com/index.php/TJLE/article/view/42
Brown, G. T. (2013). asTTle - A national testing system for formative assessment: How the national testing policy ended up helping schools and teachers. In M. Lai & S. Kushner (Eds.), A developmental and negotiated approach to school self-evaluation (pp. 39-56). Emerald Group Publishing Limited.
Campbell, C., & Levin, B. (2009). Using data to support educational improvement. Educational Assessment, Evaluation and Accountability, 21(1), 47-65.
DeVellis, R. F. (2017). Scale development: Theory and applications (4th ed.). SAGE.
Hamilton, L. S., Stecher, B. M., & Klein, S. P. (2002). Introduction. In L. S. Hamilton, B. M. Stecher & S. P. Klein (Eds.), Making sense of test-based accountability in education (pp. 1-12). RAND.
Heilig, J. V., & Darling-Hammond, L. (2008). Accountability Texas-style: The progress and learning of urban minority students in a high-stakes testing context. Educational Evaluation and Policy Analysis, 30(2), 75-110.
Koretz, D. (2017). The testing charade: Pretending to make schools better. The University of Chicago Press.
Leighton, J. P. (2017). Using think-aloud interviews and cognitive labs in educational research. Oxford University Press.
Lingard, B. (2011). Policy as numbers: Ac/counting for educational research. The Australian Educational Researcher, 38(4), 355-382.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174. https://doi.org/10.1007/BF02296272
Mislevy, R. J., & Haertel, G. (2006). Implications of evidence-centered design for educational assessment. Educational Measurement: Issues and Practice, 25, 6-20.
Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design: Layers, concepts, and terminology. In S. Downing & T. Haladyna (Eds.), Handbook of test development (pp. 61-90). Erlbaum.
Rezai-Rashti, G. M., & Segeren, A. (2020). The game of accountability: Perspectives of urban school leaders on standardized testing in Ontario and British Columbia, Canada. International Journal of Leadership in Education, 1-18. https://doi.org/10.1080/13603124.2020.1808711
Robitzsch, A., Kiefer, T., & Wu, M. (2021). TAM: Test analysis modules. R package version 3.7-16. https://CRAN.R-project.org/package=TAM
Singh, P., Märtsin, M., & Glasswell, K. (2015). Dilemmatic spaces: High-stakes testing and the possibilities of collaborative knowledge work to generate learning innovations. Teachers and Teaching, 21(4), 379-399.