Session Information
09 SES 02 A, Substantive and Methodological Issues in Assessing Social and Civic Skills and Mathematics
Paper Session
Contribution
In our rapidly changing world, computers have become ubiquitous and advantageous in many settings, and educational assessment is no exception. Computer-based testing (CBT) avoids the waste of paper and records additional information, such as the time each student spends on each item and the order in which students move through the questions. However, it is important to verify the equivalence of CBT with the traditional paper-and-pencil test (PPT) in order to guarantee comparability of results. Only then is it possible to compare historical results from before and after a transition from the PPT format to CBT.
Therefore, we use differential item functioning (DIF) methods to identify possible differences in the standardized Colombian test Saber 359 when it is administered in CBT and PPT formats. Such differences could be associated with the way a question is presented in CBT, and with whether students make the same effort to find the answer in both formats or whether, on the contrary, CBT discourages them from working out the solution on paper. This is especially relevant in mathematics tests.
This study is observational in the sense that examinees were not assigned randomly to CBT or PPT: some schools with technological facilities decided to administer CBT to their students. This could induce selection bias, since students from schools with technological facilities may differ on average from students in regular schools. For this reason, we employ matching techniques to obtain comparable populations before studying DIF of the items between CBT and PPT.
Saber 359 is a standardized national assessment in Colombia, applied annually to students in grades 3, 5 and 9 to evaluate the development of basic skills, mainly in language and mathematics. In 2017, some schools opted for the CBT format, so the sample does not correspond to a random assignment of students to the two formats. In this research we analyze the mathematics test for students in 5th grade. In total, 22,226 students took the PPT and 1,940 examinees were assessed with CBT. The test was composed of 44 items, which were exactly the same in the two modalities (CBT and PPT).
To obtain comparable samples, two matching methodologies were used: propensity score matching and Mahalanobis matching via a genetic algorithm. After matching, DIF analyses based on the non-compensatory DIF index with item parameter replication (NCDIF) and on logistic regression (LR) were carried out to detect items with DIF. We found DIF for many items before applying matching, but the number of items with differential functioning was substantially reduced after matching was implemented. This shows the importance of having similar populations when comparing CBT and PPT formats.
In addition, the mathematics test comprised three components (numeric, spatial, random), and we found more items with DIF in some specific components. We will analyze this in more depth and try to characterize the items with DIF in order to propose hypotheses about what causes such differential functioning between CBT and PPT. This can help to ease the transition from PPT to CBT, since the latter modality will probably be preferred in the future given all the advantages that computers offer.
Method
Matching methods are commonly used in non-experimental studies to reduce the bias induced by the underlying selection mechanism (treatment vs. non-treatment), which is not known at the time of analysis. Rosenbaum and Rubin (1983) proposed matching based on probability measures that allow the selection mechanism to be ignored: for each observational unit in the treatment group, the methodology tries to find a non-treatment unit with similar characteristics, using the propensity score as a distance measure. We analyze two common approaches in this paper: propensity score matching, proposed by Rosenbaum and Rubin (1983), and Mahalanobis matching via a genetic algorithm (Cochran and Rubin, 1973; Diamond and Sekhon, 2013).

DIF analysis refers to methodologies that identify differences in the functioning of an item between two or more sub-populations of interest. It is therefore a preliminary tool for evaluating the invariance property of a test and for guaranteeing comparability of results across the sub-populations under study. In this paper, we study differences between the two sub-populations associated with the test modalities, i.e. CBT and PPT. Wiberg (2007) presented a theoretical review of methodologies commonly used for DIF analysis. Based on this, we focus on two DIF methodologies: logistic regression (LR) (Swaminathan and Rogers, 1990) and the non-compensatory DIF index (NCDIF) (Raju et al., 1995).

In this study, there is a high risk of bias in the composition of the groups, because schools with technological facilities offering CBT are likely to teach students of high socio-economic status. Our proposal is to control for this variable first, following the ideas in Rubin (1973). We also took as control variables the tuition fee at the school, the gender of the student, the number of laptops at the institution, and whether the school is governmental or non-governmental.
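As an illustration, a 1:1 nearest-neighbour propensity score match of the kind described above can be sketched as follows. This is a minimal sketch: the covariates and data are synthetic stand-ins, not the actual Saber 359 variables, and a greedy matching pass is used for brevity.

```python
# Minimal 1:1 nearest-neighbour propensity score matching (no replacement).
# Covariate names and values are illustrative, not the Saber 359 data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_score_match(X_treat, X_control):
    """Match each treated unit (CBT) to the closest unused control unit (PPT)
    on the estimated propensity score."""
    X = np.vstack([X_treat, X_control])
    y = np.concatenate([np.ones(len(X_treat)), np.zeros(len(X_control))])
    # Propensity score: estimated P(treatment | covariates)
    ps = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    ps_t, ps_c = ps[:len(X_treat)], ps[len(X_treat):]
    matches, used = [], set()
    for i in np.argsort(ps_t):  # greedy pass over treated units
        order = np.argsort(np.abs(ps_c - ps_t[i]))
        j = next(j for j in order if j not in used)
        used.add(j)
        matches.append((i, j))
    return matches

# Toy covariates: [socio-economic stratum, number of laptops at school]
rng = np.random.default_rng(0)
cbt = rng.normal([3, 20], 1.0, size=(5, 2))
ppt = rng.normal([2, 5], 1.0, size=(50, 2))
pairs = propensity_score_match(cbt, ppt)
print(len(pairs))  # one PPT match per CBT student
```

Production analyses would typically rely on dedicated tooling (the genetic Mahalanobis matching mentioned above, for instance, is implemented in the R package Matching of Diamond and Sekhon), but the logic is the same: estimate a propensity score, then pair units across groups by distance.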
We used these auxiliary variables to perform an initial matching, which yields a PPT sample with a size similar to that of the sub-population that followed CBT. Then, with two comparable samples for CBT and PPT obtained by matching, we carry out a DIF analysis based on the LR and NCDIF methodologies to identify items that may be problematic for the equivalence between the CBT and PPT modalities.
Expected Outcomes
When implementing the matching methodologies prior to DIF analysis, we found that Mahalanobis matching via genetic algorithm finds more similar sub-populations for the comparison. The number of matched students with this method was 1,120, compared with 1,940 for propensity score matching. Looking at the averages of the auxiliary variables employed for the matching, the CBT and PPT averages are closer under Mahalanobis matching. Nevertheless, both methods substantially reduce the differences between CBT and PPT averages relative to the original, unmatched sub-populations.

DIF analysis without matching flags 55% of the items as showing DIF according to LR and 48% according to NCDIF. These percentages are extremely large in both cases and indicate serious problems for comparing the CBT and PPT formats. Each DIF method rests on different statistical machinery, and agreement among DIF techniques can be low to moderate (Atalay et al., 1973); we therefore flagged as DIF only those items detected by both methods. Under this criterion, the percentage is 30% for the original unmatched sub-populations; after matching, it decreases to 23% with propensity score matching and to 20% with Mahalanobis matching. These results show the importance of employing matching techniques to obtain comparable sub-populations and to ensure that DIF is attributable to the variable of interest, in this case the type of administration (CBT or PPT). Given that 20-23% of the items present DIF between CBT and PPT, comparisons between the two test formats remain feasible, especially after analyzing whether the DIF compensates on average rather than favoring one of the two groups.
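One common way to quantify the kind of balance improvement reported above is the standardized mean difference (SMD) of each auxiliary variable before and after matching; values near zero indicate comparable groups. The sketch below uses simulated laptop counts with purely illustrative numbers:

```python
# Covariate-balance check via standardized mean difference (SMD).
# The SMD between CBT and PPT samples should shrink after matching.
# All values here are simulated for illustration, not Saber 359 data.
import numpy as np

def smd(x_a, x_b):
    """Absolute standardized mean difference, pooled-SD denominator."""
    pooled_sd = np.sqrt((x_a.var(ddof=1) + x_b.var(ddof=1)) / 2)
    return abs(x_a.mean() - x_b.mean()) / pooled_sd

rng = np.random.default_rng(2)
laptops_cbt = rng.normal(20, 5, 1940)      # CBT schools: better equipped
laptops_ppt = rng.normal(8, 5, 22226)      # full PPT population
laptops_matched = rng.normal(19, 5, 1940)  # PPT sample after matching

print(round(smd(laptops_cbt, laptops_ppt), 2))      # large imbalance
print(round(smd(laptops_cbt, laptops_matched), 2))  # near zero after matching
```

A frequently used rule of thumb treats SMD below 0.1 as acceptable balance, which makes the metric a simple diagnostic to report alongside the matched sample sizes.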
References
Atalay, K., Gok, B., Kelecioglu, H., and Arsan, N. (1973). Comparing different differential item functioning methods: A simulation study. Biometrics, 29(1):159–183.
Cochran, W. G. and Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhya: The Indian Journal of Statistics, Series A, 35(4):417–446.
Diamond, A. and Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95(3):932–945.
Raju, N. S., Van der Linden, W. J., and Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19(4):353–368.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
Rubin, D. B. (1973). Matching to remove bias in observational studies. Biometrics, 29(1):159–183.
Swaminathan, H. and Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4):361–370.
Wiberg, M. (2007). Measuring and detecting differential item functioning in the criterion-referenced licensing test: A theoretic comparison of methods. Institutionen for beteendevetenskapliga matningar.