Understand Sources of DIF via Propensity Score Matching

Author(s):

Serkan Arikan (presenting / submitting), Fons van de Vijver, Kutlay Yagmur

Conference:

ECER 2017

Network:

09. Assessment, Evaluation, Testing and Measurement

Format:

Paper

Session Information

09 SES 05 A, Issues in Measurement and Sampling in Large Scale Assessments

Paper Session

Time:

2017-08-23

13:30-15:00

Room:

W3.11

Chair:

Agnes Stancel-Piatak

Contribution

Introduction

Almost half a century of differential item functioning (DIF) studies has not produced a set of recommendations on how to write items with little or no bias, and we still have little control over the sources of item bias. New approaches to design and analysis are needed to advance the field. Propensity score matching is one candidate procedure that could shed light on item bias in cross-cultural research. It can be used to produce comparable samples by equating groups on relevant background variables. Bias detection procedures and propensity matching procedures share an important characteristic: both look for matches across ethnic groups or countries on the basis of some background or psychological characteristic, such as socioeconomic status or total test score. The main difference is that, unlike bias detection procedures, propensity matching allows multiple background variables to be considered simultaneously, and the matching variables do not need to be derived from the instrument scrutinized for bias (such as an educational achievement test), as is typically the case in DIF studies. As a consequence, propensity scoring may provide a better tool for controlling sources of item bias. We examined the impact of propensity matching by comparing DIF and the size of cross-cultural differences before and after matching on student background variables, using PISA 2012 mathematics data.

When researchers employ randomized experimental designs, the comparison groups are formed so that they differ only randomly on all background covariates. However, in studies comparing intact groups or nations, randomization is impossible. Matching methods based on propensity scores can then be used to compose comparable samples by equating the distribution of covariates in the comparison groups (Stuart, 2010). If the pre-existing achievement differences between countries disappear after matching, it can be concluded that the country differences in achievement are attributable to the background differences. When used this way, propensity matching can be seen as an advanced kind of covariance analysis (Van de Vijver & Poortinga, 1997).

DIF procedures are based on matching on test score. We argue that matching on additional, potentially bias-relevant background variables helps to identify sources of DIF. What we do here can be seen as a combination of thin matching (the use of total score as the matching variable) and thick matching (forming the matching variable by pooling total score levels) (Donoghue & Allen, 1993). In this study, PISA 2012 mathematics items were analyzed for DIF among Indonesian, Turkish, Australian, and Dutch students, using exact, nearest neighbor, and optimal matching methods. According to the PISA 2012 results, Indonesia represents a low-achieving country, Turkey a below-average country, Australia an above-average country, and the Netherlands a high-achieving country. By applying several matching methods to data from these differentially achieving countries, we aim to evaluate the effects of propensity-score-based matching on DIF results and to understand the nature of bias in comparisons of educational achievement across the four countries. We therefore examined two questions: to what extent propensity score matching helps to understand the nature of bias by reducing or eliminating bias sources in the comparison of PISA mathematics achievement, and to what extent it can explain cross-national differences in mathematics performance.

Method

Participants

The data for this study were obtained from the PISA 2012 data set. The sample comprised all Indonesian, Turkish, Australian, and Dutch students who answered the released mathematics items: 1078 Indonesian, 951 Turkish, 2824 Australian, and 839 Dutch students.

Measures

PISA 2012 gathered data on students' mathematics performance and student characteristics via cognitive items and a student questionnaire, respectively. The present study used the released sample items of PISA to evaluate DIF. In the PISA 2012 mathematics test, 13 released items were answered by the samples described above. The student background variables considered to be sources of DIF, and controlled for by propensity score matching, were gender, the index of economic, social, and cultural status (ESCS), and opportunity to learn. ESCS, reported by PISA, is a combination of the highest occupational status of parents, the highest educational level of parents, family wealth, cultural possessions, and home educational resources (OECD, 2014). Opportunity to learn is defined as a student's previous exposure to subject domain content in school and is an important predictor of achievement (Schmidt & Maier, 2009).

Data Analysis

First, DIF analyses were conducted with structural equation modeling (SEM) and logistic regression (LR) DIF detection methods, without matching students on contextual variables. In the SEM procedure, a confirmatory factor analysis was conducted, assessing configural, metric, and scalar invariance. In the LR procedure, total test score, country, and their interaction were used as predictors. A significant country effect was taken as evidence for uniform bias (akin to a lack of scalar invariance), and a significant interaction as evidence for non-uniform bias (akin to a lack of metric invariance).
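
The LR procedure described above corresponds to the standard logistic regression DIF model, which can be written as follows. The notation is ours and is not taken from the original analysis: Y_{ij} is the scored (0/1) response of student i to item j, X_i is the total test score, and G_i is a 0/1 country indicator for the pair of countries being compared.

```latex
\operatorname{logit} P(Y_{ij} = 1)
  = \beta_{0j} + \beta_{1j} X_i + \beta_{2j} G_i + \beta_{3j} \, (X_i \times G_i)
% Uniform DIF for item j:      \beta_{2j} \neq 0  (country effect at equal total score)
% Non-uniform DIF for item j:  \beta_{3j} \neq 0  (country effect varies with total score)
```

Under this formulation, item j shows no DIF when both \beta_{2j} and \beta_{3j} are zero, that is, when students with the same total score have the same probability of a correct response regardless of country.
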
Then, for each comparison group, exact, nearest neighbor, and optimal matching were performed using gender, ESCS, and opportunity to learn as contextual variables. There is not yet a single best procedure for propensity matching, and there is no guarantee that different procedures yield similar outcomes; therefore, we applied multiple procedures. The MatchIt R package (Ho, Imai, King, & Stuart, 2011) was used to estimate propensity scores and to perform the matching. Finally, the DIF analyses were repeated on the matched data produced by each matching method.
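
To make the matching step concrete, the following is a minimal sketch of greedy 1:1 nearest neighbor matching on already-estimated propensity scores, followed by a balance check using the standardized mean difference. It is not the MatchIt implementation used in the study; the function names, the matching order, the optional caliper, and all data values are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import mean, variance


def nearest_neighbor_match(focal, reference, caliper=None):
    """Greedy 1:1 nearest neighbor matching without replacement.

    focal, reference: dicts mapping a student id to a propensity score.
    caliper: optional maximum allowed score distance for a valid match.
    Returns a list of (focal_id, reference_id) pairs.
    """
    available = dict(reference)  # copy so the caller's dict is untouched
    pairs = []
    # Assumed ordering: focal students with the highest scores are matched first.
    for f_id, f_score in sorted(focal.items(), key=lambda kv: kv[1], reverse=True):
        if not available:
            break
        r_id = min(available, key=lambda k: abs(available[k] - f_score))
        if caliper is not None and abs(available[r_id] - f_score) > caliper:
            continue  # closest remaining candidate is too far away; leave unmatched
        pairs.append((f_id, r_id))
        del available[r_id]  # without replacement: each reference student is used once
    return pairs


def standardized_mean_difference(x_focal, x_reference):
    """Mean difference on a covariate divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(x_focal) + variance(x_reference)) / 2)
    if pooled_sd == 0:
        raise ValueError("covariate has zero variance in both groups")
    return (mean(x_focal) - mean(x_reference)) / pooled_sd


# Hypothetical propensity scores and ESCS values for two small groups.
focal_ps = {"f1": 0.80, "f2": 0.55, "f3": 0.30}
reference_ps = {"r1": 0.78, "r2": 0.50, "r3": 0.35, "r4": 0.10}
focal_escs = {"f1": -0.2, "f2": -0.9, "f3": -1.6}
reference_escs = {"r1": -0.1, "r2": -0.8, "r3": -1.2, "r4": 0.9}

pairs = nearest_neighbor_match(focal_ps, reference_ps)

# Balance on ESCS before matching (all students) and after (matched pairs only).
before = standardized_mean_difference(
    list(focal_escs.values()), list(reference_escs.values())
)
after = standardized_mean_difference(
    [focal_escs[f] for f, _ in pairs], [reference_escs[r] for _, r in pairs]
)
print(pairs)
print(round(before, 3), round(after, 3))
```

A standardized mean difference close to zero after matching indicates that the groups are balanced on that covariate. When the groups differ strongly on the background variables, few close matches exist and the imbalance remains large, which is the situation a balance evaluation is meant to detect.
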

Expected Outcomes

Results

In the original data, only item 3 was flagged as showing DIF in both SEM and LR for the Indonesian-Turkish and Indonesian-Dutch comparisons. Between Turkish and Dutch students, items 2 and 4 were flagged. These findings served as the baseline for evaluating the results obtained with the matching methods.

Nearest neighbor and optimal matching produced the same results in the DIF analysis. Compared with the items flagged in the original data, there was no clear pattern of diminishing DIF. Because the balance evaluation indicated that the matching was not adequate, finding the same DIF results in the matched data as in the original data was not surprising.

As propensity score matching generally could not produce a good match, especially for countries that differed strongly on the background variables, the propensity scores themselves were used in flagging DIF items. In LR, the propensity score and the interaction between propensity score and group membership were added to the equation used to compute DIF. In SEM, the propensity score was added as a predictor of the outcome of each item.

For LR, results involving Indonesian data showed the same pattern as before, with no reduction in the number of items flagged as biased. However, in the Turkish-Dutch comparison, all items originally flagged were no longer flagged. For SEM, using the propensity score as a predictor eliminated all DIF in the Indonesian-Turkish, Turkish-Dutch, and Australian-Dutch comparisons. In addition, the number of items showing DIF decreased from five to two in the Indonesian-Dutch comparison and from two to one in the Indonesian-Australian comparison.

Overall, using propensity scores as a predictor in DIF detection was found to be an effective way to reduce the number of items showing DIF. Our analyses strongly suggest that the propensity variables should be added to a DIF analysis.
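
The LR variant with propensity scores described above can be sketched as an extension of the standard logistic regression DIF model. The notation is ours: Y_{ij} is the scored response of student i to item j, X_i the total test score, G_i a 0/1 country indicator, and e_i the estimated propensity score based on the background variables Z_i (gender, ESCS, and opportunity to learn). Whether the original score-by-country interaction was retained alongside the new terms is an assumption of this sketch.

```latex
e_i = P(G_i = 1 \mid Z_i)

\operatorname{logit} P(Y_{ij} = 1)
  = \beta_{0j} + \beta_{1j} X_i + \beta_{2j} G_i + \beta_{3j} \, (X_i \times G_i)
  + \beta_{4j} e_i + \beta_{5j} \, (e_i \times G_i)
% DIF is still read from \beta_{2j} (uniform) and \beta_{3j} (non-uniform),
% now conditional on the propensity score terms.
```

If an item's country effect is carried by the background variables, \beta_{2j} and \beta_{3j} should shrink toward zero once the propensity terms are in the model, which is how an item flagged in the original analysis can cease to be flagged.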

References

Donoghue, J. R., & Allen, N. L. (1993). Thin versus thick matching in the Mantel-Haenszel procedure for detecting DIF. Journal of Educational and Behavioral Statistics, 18, 131-154.

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric preprocessing for parametric causal inference. Retrieved from http://r.iq.harvard.edu/docs/matchit/2.4-20/matchit.pdf

OECD (2014). PISA 2012 Technical Report. Paris, France: OECD Publishing.

Schmidt, W. H., & Maier, A. (2009). Opportunity to learn. In G. Sykes, B. Schneider, & D. N. Plank (Eds.), Handbook of education policy research (pp. 541-559). New York, NY: Routledge for American Educational Research Association.

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1-21.

van de Vijver, F. J. R., & Poortinga, Y. H. (1997). Towards an integrated analysis of bias in cross-cultural assessment. European Journal of Psychological Assessment, 13(1), 29-37.

Author Information

Serkan Arikan (presenting / submitting)

Mugla Sitki Kocman University

Muğla

Fons van de Vijver

Tilburg University, The Netherlands

Kutlay Yagmur