Session Information
24 SES 13 B JS, Assessing Mathematics Achievement
Joint Paper Session NW 09 and NW 24
Contribution
When item response modelling is employed to analyse data, it is important to verify that the observed data are consistent with the theoretical assumptions underlying the model. Although many different methods and techniques have been proposed to assess item fit, the topic still poses theoretical and methodological challenges that go beyond the method itself, since it also requires a close examination of the causes generating item misfit.
The general issue we tackle here is how to use quantitative educational data to understand the causes of item misfit. Our research hypothesis is that misfit need not be interpreted as a limitation (of the test, or even of the choice of model) but as a potential source of information about both the actual construct measured by the test and the misfitting item itself.
In light of this, we propose a possible approach to the interpretation of Rasch output in educational research, with particular attention to Mathematics Education. To this end, we employ a mixed-method approach that combines quantitative analysis of data collected by the Italian National Institute for the Evaluation of the Educational System (INVALSI) and analysed with the Rasch model, with an interpretation of this output from a didactical point of view. Each year, INVALSI administers a Math achievement test aimed at assessing mathematical competence, i.e. «the ability to develop and apply mathematical thinking in order to solve a range of problems in everyday situations. Building on a sound mastery of numeracy, the emphasis is on process and activity, as well as knowledge. Mathematical competence involves, to different degrees, the ability and willingness to use mathematical modes of thought (logical and spatial thinking) and presentation (formulas, models, constructs, graphs, charts)» (European Recommendation 2006/962/EC, p. 6).
In particular, to assess item fit we compare different methods, as implemented in ConQuest 4.0 and in RUMM2030.
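To fix notation for these methods: both programs report residual-based mean-square fit statistics of the kind discussed by Wright (1977), Linacre (2002) and Wright et al. (1994). A minimal sketch, writing $x_{vi}$ for the observed response of person $v$ to item $i$ and $p_{vi}$ for the probability expected under the Rasch model:

\[
z_{vi} = \frac{x_{vi} - p_{vi}}{\sqrt{p_{vi}(1 - p_{vi})}}, \qquad
\mathrm{Outfit}_i = \frac{1}{N}\sum_{v=1}^{N} z_{vi}^{2}, \qquad
\mathrm{Infit}_i = \frac{\sum_{v=1}^{N}(x_{vi} - p_{vi})^{2}}{\sum_{v=1}^{N} p_{vi}(1 - p_{vi})}
\]

Values close to 1 indicate data consistent with the model, while values well below 1 (responses more predictable than the model expects) are typical of over-discriminating items (Wright et al., 1994).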
Our analysis confirms that items’ deviation from Rasch model expectations is more than a measurement problem: it is a valuable source of information for understanding the causes of misfit and, therefore, for preventing it.
As an additional output of our analysis, we propose some possible uses of the empirical evidence obtained here to interpret specific features of different national educational systems. Although our case study is based on Italian data and makes explicit reference to features of Italian didactic practice, our results show how the approach may provide new insights for the general interpretation of standard Rasch output, which is commonly used in educational research all over the world.
Method
Our methodological frame refers to the mixed-method paradigm: we perform a qualitative interpretation of a large quantitative data set, analysed by means of the Rasch model and collected by the Italian National Institute for the Evaluation of the Educational System (INVALSI) to assess students’ ability in Mathematics at grade 10 (upper secondary school). In particular, we deepen the study of item misfit, assuming that item deviations from the model’s expectations can tell us something about students’ answering behaviour and hence about teaching practice.

In our investigation, fit control is based on the graphical inspection of Item Characteristic Curves (ICCs), which allows the identification of deviations between observed and expected values at specific ability levels. This is particularly useful for formulating specific hypotheses aimed at understanding and identifying possible causes of violations. Each ICC is a logistic regression curve, with item performance regressed on examinee ability, which links a student’s probability of success on an item to the trait measured by the set of test items. The probability of a correct answer is estimated by comparing the student’s ability with the item’s difficulty. Since the Rasch model hypothesizes that student ability and item difficulty alone determine the interaction between person and item, it generates a very robust estimation environment «(...) against which to test data for the presence of anomalous behavior that may influence the estimation of item and person parameters. This identification (…) addresses any potential measurement disturbance» (Smith, 1993, p. 262).

By means of the Rasch model, we analyse 354 Math items administered by INVALSI to all Italian students (around 30,000 per year) attending grade 10 of upper secondary school (15 or 16 years old), from 2010 to 2017. We focus our attention on over-discriminating items, which we refer to as o-DM items: for these items, the model over-estimates the probability of a correct answer for low-ability students and under-estimates it for high-ability students. Unlike other kinds of violations, over-discrimination generally does not cause strongly unfair estimation, but the study of this item behaviour reveals new possible interpretative uses of item misfit at the system level.
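As a concrete illustration of this graphical fit control, the following minimal Python sketch compares the Rasch-expected ICC of a single item with the observed proportions correct within ability groups. It is our own illustrative code, not the INVALSI or ConQuest/RUMM procedure: the data are simulated, and the slope of 1.8 used to generate the responses makes the item over-discriminating by construction.

import numpy as np
import matplotlib.pyplot as plt

def rasch_icc(theta, delta):
    # Rasch-expected probability of success given ability theta and item difficulty delta
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

def empirical_icc(theta, x, n_groups=10):
    # Observed proportion correct within equally sized ability groups
    order = np.argsort(theta)
    groups = np.array_split(order, n_groups)
    return (np.array([theta[g].mean() for g in groups]),
            np.array([x[g].mean() for g in groups]))

rng = np.random.default_rng(0)
theta = rng.normal(size=2000)                          # simulated ability estimates (logits)
delta = 0.3                                            # illustrative item difficulty
p_true = 1.0 / (1.0 + np.exp(-(1.8 * theta - delta)))  # generating slope 1.8 > 1: over-discrimination
x = (rng.random(2000) < p_true).astype(int)            # simulated 0/1 responses

grid = np.linspace(-3, 3, 100)
g_theta, g_obs = empirical_icc(theta, x)
plt.plot(grid, rasch_icc(grid, delta), label="Rasch-expected ICC")
plt.plot(g_theta, g_obs, "o", label="observed proportions")
plt.xlabel("ability (logits)")
plt.ylabel("P(correct)")
plt.legend()
plt.show()

With these simulated data the observed proportions fall below the expected curve for the low-ability groups and above it for the high-ability groups, which is exactly the o-DM pattern described above.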
Expected Outcomes
The items we analysed are correct in their formulation, coherent with what is perceived as belonging to mathematical ability, and consistent with both the framework of the Italian national curricula and the theoretical framework of the INVALSI large-scale assessment. Nevertheless, they may still present features that do not completely fit the predictions of the model (even though their behaviour does not affect the reliability of the measurement). We focused our analysis on a group of over-discriminating items. Via a graphical inspection of the characteristic curves and the distractor plots, on the one hand, and a qualitative analysis of the tasks, on the other, we highlighted a set of features shared by these items, both in their input (formulation, relationship with didactic practice, etc.) and in their output (missing answers, guessing, etc.). This mixed approach allowed us to formulate a feasible interpretation of the misfit, based on a conjectured behaviour of the students related to the specific features of the items.

Our methodology, based on the integration of quantitative evidence from Rasch analysis with an interpretation from a didactic point of view, suggests a new and useful application of the Rasch model also in the case of misfitting items. The deviation of empirical data from the expected theoretical curve signals a disturbance factor; in our case, however, this deviation and these factors can be explained and framed in a coherent setting. Hence, deviations from Rasch expectations that do not cause concern from a psychometric point of view can be conceived as a further “result” of the Rasch model, because they may produce relevant information about students’ behaviour, allowing us to formulate specific conjectures about the causes generating item misfit. From a symmetric perspective, these items allow us to better outline the actual construct measured by the test.
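A possible quantitative counterpart to this graphical identification, sketched below under our own assumptions (simulated data, an arbitrary flagging threshold), is to regress the item responses on the ability estimates and compare the fitted slope with the value of 1 implied by the Rasch model; slopes clearly above 1 correspond to the empirical over-discrimination discussed above (cf. the discrimination parameter in Birnbaum, 1968).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
theta = rng.normal(size=2000)                      # ability estimates (simulated)
p = 1.0 / (1.0 + np.exp(-(1.8 * theta - 0.3)))     # item generated with slope 1.8
x = (rng.random(2000) < p).astype(int)             # simulated 0/1 responses

model = LogisticRegression(C=1e6)                  # large C: effectively unpenalised fit
model.fit(theta.reshape(-1, 1), x)
slope = model.coef_[0, 0]                          # empirical "discrimination"

print(f"empirical slope = {slope:.2f}")            # about 1.8 here; the Rasch model implies 1
if slope > 1.2:                                    # illustrative threshold, not the authors' criterion
    print("flag as possibly over-discriminating (o-DM)")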
References
Birnbaum, A. (1968). Some latent trait models. In F. Lord & M. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
D'Amore, B. (2014). Il problema di matematica nella pratica didattica. Modena: Digital Docet.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
Gustafson, J. (1980). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.
INVALSI. (2017). Rilevazioni Nazionali degli Apprendimenti 2016-2017. Roma: INVALSI.
Johansson, R. (2003). Case study methodology. International Conference on Methodologies in Housing Research, Stockholm.
Levine, M., & Rubin, D. (1979). Measuring appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4, 269-290.
Li, M., & Olejnik, S. (1997). The power of Rasch person-fit statistics in detecting unusual response patterns. Applied Psychological Measurement, 21(3), 215-231.
Linacre, J. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2).
Morgan, D. L. (1998). Practical strategies for combining qualitative and quantitative methods: Applications to health research. Qualitative Health Research, 8(3), 362-376.
OECD. (2013). PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial literacy. OECD Publishing. http://dx.doi.org/10.1787/9789264190511-en
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Reise, S. (1990). A comparison of item- and person-fit methods of assessing model-data fit in IRT. Applied Psychological Measurement, 14(2), 127-137.
Reise, S., & Waller, N. (1993). Traitedness and assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.
The European Parliament and the Council of the European Union. (2006, December 18). Recommendation of the European Parliament and of the Council of 18 December 2006 on key competences for lifelong learning (2006/962/EC).
Wright, B. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.
Wright, B. D., Linacre, J. M., Gustafson, J. E., & Martin-Löf, P. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3).
Zimowski, M., Muraki, E., Mislevy, R., & Bock, R. (1996). BILOG-MG: Multiple-group IRT analysis and test maintenance for binary items. Chicago: Scientific Software.