Introduction
Almost half a century of research on differential item functioning (DIF) has not produced a set of recommendations for writing items with little or no bias, and we still seem to have little control over the sources of item bias. New approaches to design and analysis are needed to advance the field. Propensity score matching is one procedure with the potential to shed light on item bias in cross-cultural research. It can be used to produce comparable samples by equating groups on relevant background variables. Bias detection procedures and propensity matching procedures share an important characteristic: both look for matches across ethnic groups or countries on the basis of some background or psychological characteristic, such as socioeconomic status or total test score. The main difference is that propensity matching, unlike bias detection procedures, allows multiple background variables to be considered simultaneously, and its matching variables need not be derived from the instrument that is scrutinized for bias (e.g., an educational achievement test), as is typically the case in DIF studies. As a consequence, propensity score matching may provide a better tool to control sources of item bias. We examined the impact of propensity matching by comparing DIF and the size of cross-cultural differences before and after matching on student background variables, using PISA 2012 mathematics data.
When researchers employ randomized experimental designs, the comparison groups are formed so that they differ only randomly on all background covariates. In studies comparing intact groups or nations, however, randomization is impossible. Matching methods based on propensity scores can then be used to compose comparable samples by equating the distribution of covariates across the comparison groups (Stuart, 2010). If pre-existing achievement differences between countries disappear after matching, it can be concluded that those differences are attributable to differences in background. Used in this way, propensity matching can be seen as an advanced kind of covariance analysis (Van de Vijver & Poortinga, 1997).
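To make the idea of equating covariate distributions concrete, the following minimal sketch estimates propensity scores with a logistic regression and then forms greedy 1:1 nearest neighbor matches. Everything in it is an assumption made for illustration: the data are simulated, the single covariate `ses`, the caliper of 0.05, and the helper names `fit_logistic`, `propensity`, and `nearest_neighbor_match` are hypothetical, and none of it reproduces the analysis reported in this study.

```python
import math
import random
from statistics import mean


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit a logistic regression by batch gradient ascent.

    Returns the weights as [intercept, coef_1, ..., coef_p].
    """
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = yi - 1.0 / (1.0 + math.exp(-z))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def propensity(w, x):
    """Estimated probability of belonging to group 1 given covariates x."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def nearest_neighbor_match(scores, group, caliper=0.05):
    """Greedy 1:1 matching without replacement on the propensity score.

    Group-1 units without a group-0 unit inside the caliper stay unmatched.
    """
    available = [i for i, g in enumerate(group) if g == 0]
    pairs = []
    for i in (k for k, g in enumerate(group) if g == 1):
        if not available:
            break
        j = min(available, key=lambda k: abs(scores[k] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


# Simulated data (assumption): group 1 has lower average SES than group 0.
random.seed(0)
n = 200
ses = [random.gauss(0.5, 1.0) for _ in range(n)]
ses += [random.gauss(-0.5, 1.0) for _ in range(n)]
group = [0] * n + [1] * n

weights = fit_logistic([[s] for s in ses], group)
scores = [propensity(weights, [s]) for s in ses]
pairs = nearest_neighbor_match(scores, group, caliper=0.05)

# Covariate imbalance (group 1 minus group 0) before and after matching.
before = mean(ses[n:]) - mean(ses[:n])
after = mean(ses[i] for i, _ in pairs) - mean(ses[j] for _, j in pairs)
print(f"matched pairs: {len(pairs)}")
print(f"SES difference before matching: {before:.3f}")
print(f"SES difference after matching:  {after:.3f}")
```

In this sketch the SES gap between the two simulated groups shrinks in the matched sample, which is the sense in which matching composes comparable samples. Exact and optimal matching follow the same logic but select the pairs differently.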
DIF procedures are based on matching on test score. We argue that matching on additional, potentially bias-relevant background variables would help to identify sources of DIF. Our approach can be seen as a combination of two procedures: thin matching (the use of total score as the matching variable) and thick matching (forming the matching variable by pooling total score levels) (Donoghue & Allen, 1993). Using exact, nearest neighbor, and optimal matching methods, we analyzed PISA 2012 mathematics items for DIF among Indonesian, Turkish, Australian, and Dutch students. According to the PISA 2012 results, these groups represent a low-achieving country (Indonesia), a below-average country (Turkey), an above-average country (Australia), and a high-achieving country (the Netherlands). By applying several matching methods to data from these differentially achieving countries, we aim to evaluate how propensity score matching affects DIF results and to understand the nature of bias in comparisons of educational achievement across the four countries. Specifically, we examined (a) to what extent propensity score matching methods help in understanding the nature of bias by reducing or eliminating sources of bias in the comparison of PISA mathematics achievement, and (b) to what extent propensity score matching can explain cross-national differences in mathematics performance.
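The distinction between thin and thick matching can be illustrated with a small sketch. It uses the Mantel-Haenszel common odds ratio purely as an example DIF statistic; the text above does not state which DIF statistic was used, so that choice is an assumption, as are the toy counts and the helper name `mh_odds_ratio`. The data are constructed so that, at every total score level, both groups answer the studied item correctly at the same rate (no DIF), while the focal group is concentrated at lower scores.

```python
from collections import defaultdict


def mh_odds_ratio(records, stratum_of):
    """Mantel-Haenszel common odds ratio across matching strata.

    records: iterable of (group, total_score, correct), group in {"ref", "focal"}.
    stratum_of: function mapping a total score to its matching stratum.
    """
    # Per stratum: [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, total, correct in records:
        idx = (0 if correct else 1) + (0 if group == "ref" else 2)
        tables[stratum_of(total)][idx] += 1
    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den if den else float("nan")


# Toy data (assumption): score level -> (n reference, n focal, correct per 5).
levels = {0: (5, 20, 1), 1: (10, 15, 2), 2: (15, 10, 3), 3: (20, 5, 4)}
records = []
for score, (n_ref, n_focal, k) in levels.items():
    for group, size in (("ref", n_ref), ("focal", n_focal)):
        n_correct = size * k // 5
        records += [(group, score, True)] * n_correct
        records += [(group, score, False)] * (size - n_correct)

# Thin matching: every total score level is its own stratum.
thin = mh_odds_ratio(records, lambda s: s)
# Thick matching: adjacent score levels are pooled into one stratum.
thick = mh_odds_ratio(records, lambda s: s // 2)

print(f"thin matching odds ratio:  {thin:.2f}")   # 1.00
print(f"thick matching odds ratio: {thick:.2f}")  # 1.25
```

With these counts, thin matching returns an odds ratio of 1.00 (no DIF), whereas thick matching returns 1.25, because pooling score levels leaves residual ability differences between the groups within each stratum. Thick strata contain more examinees, however, which is why the two procedures trade off matching precision against stratum size.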