Session Information
09 SES 01 A, Investigating Achievement in Different Contexts
Paper Session
Contribution
There is an increasing gap between rich and poor worldwide, and educational systems must equip students with the capabilities to live adequately under any social condition. An essential requirement for this is to assess students fairly across different social groups. In item response theory (IRT), equal functioning of the items across subpopulations is essential for test consistency and is associated with the invariance property across subgroups. Otherwise, the test would favor some populations, e.g., students from more advantaged social conditions.
In the ideal case, the parameters that define the item characteristic curve (ICC) should be similar across subpopulations. The absence of such behavior is known as differential item functioning (DIF). The analysis of DIF started in the 1960s, but it was not until the 1990s that numerous DIF detection methodologies were proposed, focusing on the comparison between a focal group and a reference group. In particular, when the scoring model is a two- or three-parameter logistic (2PL or 3PL) IRT model, the identification of items with DIF is more complex.
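For reference, the 3PL model referred to here specifies the ICC of item j in group g as (a standard formulation; the notation is ours):

P_{jg}(\theta) = c_{jg} + \frac{1 - c_{jg}}{1 + \exp\{-D\, a_{jg}(\theta - b_{jg})\}},

where a, b, and c are the discrimination, difficulty, and pseudo-guessing parameters and D is a scaling constant (commonly 1.7). Invariance means these parameters, and hence the ICCs, coincide across groups once the scales are linked; DIF means they do not, so the probability of a correct response at the same ability level depends on group membership.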
In the case of Colombia, students are classified into four categories according to their socioeconomic level. Therefore, it is important to verify that there is no DIF across the four groups. However, the common methodologies for DIF detection are weak and insufficient when there are more than two comparison groups, producing inflated Type I error rates and involving a considerable computational cost. When DIF is evaluated in more than two groups, it is usually called multiple-group DIF (Raju, 1988; Raju et al., 1995; Raju et al., 2009; Oshima et al., 2015).
In this document, we employ the methodology of Oshima et al. (2015) for multiple-group DIF. This methodology is not commonly applied by researchers: in practice, only two groups are compared for DIF, or the subpopulations are recategorized to create only two subgroups. Therefore, we show the utility of the method of Oshima et al. (2015), combined with other strategies, to study DIF across socioeconomic levels in Colombia. In addition, we study DIF for a categorization related to zone-sector, which creates three populations of schools: non-governmental, governmental-urban, and governmental-rural. It is important to study DIF across these subgroups of students to guarantee fairness in the evaluation process.
The design of the test is based on a balanced incomplete block (BIB) design (Bose, 1939; Yates, 1936; Youden, 1972). Each block contains a set of items and is constructed from item parameters previously calibrated with a 3PL model. The blocks are distributed in pairs at random, creating groups of blocks known as forms, each with 44 items. This yields a valid and reliable tool for estimating students' skills (Lord, 1980). However, the items are placed in different positions across forms, and the graphical design also differs in each form. Therefore, it is also of interest to determine whether items show DIF across forms. A toy sketch of this kind of assembly is given below.
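To make the assembly concrete, the following toy R sketch builds 44-item forms from pairs of 22-item blocks. The number of blocks, the block size, and the cyclic pairing are our assumptions for illustration only; they are not the actual ICFES assembly rules.

## Toy sketch: pair 22-item blocks into 44-item forms (hypothetical design)
n_blocks <- 14                                   # assumed number of blocks
blocks <- split(paste0("item", 1:(22 * n_blocks)),
                rep(1:n_blocks, each = 22))      # 14 blocks of 22 items
forms <- lapply(seq_len(n_blocks), function(k) {
  pair <- c(k, k %% n_blocks + 1)                # block k paired with k + 1
  unlist(blocks[pair], use.names = FALSE)        # one 44-item form
})
lengths(forms)                                   # 14 forms, 44 items each

Under this cyclic pairing each block appears in exactly two forms, mimicking the balance property of a BIB design.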
The data used in this study come from the test Saber 11, which is applied to students in the last grade of high school. This is one of the standardized tests applied by the Colombian Institute for Educational Evaluation (ICFES) to assess the quality of high-school education. We use data from two applications in 2017 for four subjects: natural sciences, critical reading, mathematics, and social studies and citizenship. The objective of this study is to evaluate multiple-group DIF across subgroups of students defined by socioeconomic status, the zone-sector variable, and the test forms for Saber 11 in 2017.
Method
We use the statistic proposed by Oshima et al. (2015), which is based on a multiple-group non-compensatory DIF index (MG-NCDIF). It detects DIF among subgroups with respect to a base group. The groups to be compared are called reference groups, and the method selects as base group a representative sample of test takers drawn across all reference groups. In the procedure, the items are calibrated separately in each subgroup, and a common scale is defined relative to the base group via equating. For each of the G pairs (the base group and each reference group g), the statistic accumulates the squared difference between the base-group ICC and the reference-group ICC, weighted by the distribution of the latent trait (usually assumed normal). MG-NCDIF is then computed as the mean of these pairwise differences.

An important point in the proposal of Oshima et al. (2015) is the size of the representative sample: the authors suggest that the sample size of the base group should be the average of the sizes of the reference groups. Sample size is an essential issue because MG-NCDIF may be unstable when it differs strongly across subgroups. The statistic does not have a known distributional form, although it is sometimes approximated by a chi-squared distribution. Therefore, statistical hypothesis testing is based on a Monte Carlo methodology called Item Parameter Replication (IPR), which builds on the proposal of Raju et al. (2009) and was implemented and improved by Cervantes (2017). We employ this methodology to test the null hypothesis of no differential functioning. Additionally, the statistic is classified using a proposed effect size (ES) measure (Wright & Oshima, 2015), which follows the foundations of the classification for the ETS delta in Dorans and Holland (1993). The classification has three categories: (A) items with negligible or nonsignificant DIF, (B) items with slight to moderate DIF, and (C) items with moderate to large DIF.

This methodology was applied to the Colombian test Saber 11 for 2017. We analyzed DIF across subpopulations defined by socioeconomic level (4 categories), zone-sector (3 categories), and forms (14 categories). The calibrations were obtained in BILOG-MG, and the equating process and statistical computations were carried out in R using the packages SNSequate and DFIT. A sketch of the core computation follows.
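To make the procedure concrete, the following R sketch computes MG-NCDIF for a single item, together with an IPR-style significance cutoff. It is illustrative only: the item parameters and the sampling covariance are hypothetical, a standard normal latent-trait density stands in for the empirical one, and in practice the DFIT package (Cervantes, 2017) provides the full framework.

## Minimal sketch of MG-NCDIF and an IPR-style cutoff for one 3PL item,
## assuming parameters already equated to the base-group scale.

# 3PL item characteristic curve (D = 1.7 scaling constant)
icc_3pl <- function(theta, p) {
  p["c"] + (1 - p["c"]) / (1 + exp(-1.7 * p["a"] * (theta - p["b"])))
}

# NCDIF for one reference group vs. the base group: expected squared ICC
# difference, weighted here by a standard normal latent-trait density
ncdif_pair <- function(p_g, p_base, theta = seq(-4, 4, by = 0.01)) {
  w <- dnorm(theta)
  sum((icc_3pl(theta, p_g) - icc_3pl(theta, p_base))^2 * w) / sum(w)
}

# MG-NCDIF: mean of the pairwise NCDIF values over the G reference groups
mg_ncdif <- function(pars_groups, p_base) {
  mean(sapply(pars_groups, ncdif_pair, p_base = p_base))
}

# Hypothetical equated parameters for one item in three reference groups
p_base <- c(a = 1.0, b = 0.0, c = 0.20)
groups <- list(g1 = c(a = 1.0, b =  0.1, c = 0.20),
               g2 = c(a = 1.2, b = -0.2, c = 0.20),
               g3 = c(a = 0.9, b =  0.4, c = 0.20))
observed <- mg_ncdif(groups, p_base)

# IPR-style cutoff (after Raju et al., 2009): draw no-DIF parameter
# replicates around the base estimates (hypothetical diagonal covariance;
# in practice it comes from the calibration), recompute MG-NCDIF, and use
# an upper quantile of the replicated statistics as the cutoff.
ipr_cutoff <- function(p_base, Sigma, G = 3, R = 1000, alpha = 0.05) {
  stats <- replicate(R, {
    reps <- lapply(seq_len(G), function(g)
      setNames(MASS::mvrnorm(1, mu = p_base, Sigma = Sigma), c("a", "b", "c")))
    mg_ncdif(reps, p_base)
  })
  unname(quantile(stats, 1 - alpha))
}
cutoff <- ipr_cutoff(p_base, Sigma = diag(c(0.02, 0.02, 0.001)))
observed > cutoff   # TRUE would flag the item for DIF

An item whose observed MG-NCDIF exceeds the IPR cutoff would be flagged and then graded with the ES categories (A, B, C) described above.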
Expected Outcomes
The methodology proposed by Oshima et al. (2015) for multiple-group DIF analysis is an important tool for detecting DIF under logistic IRT models and for testing the assumption of invariance across subpopulations. It can be used in international tests such as PISA and TIMSS to evaluate the functioning of items across countries, or to detect DIF in subgroups defined by social conditions or other inequality characteristics of the populations.

When applying the MG-NCDIF method across forms, we found few items with DIF in the 2017 Saber 11 test. Natural Sciences and Social Studies and Citizenship had around 10% of items (5 or 6 items) in the moderate (B) and large (C) DIF categories, while Critical Reading and Mathematics had around 5% (2 or 3 items). Possible reasons for these differences include item length and item position within the form, among others. In general, Critical Reading and Mathematics presented fewer items with DIF.

The multiple-group DIF analysis comparing subpopulations defined by socioeconomic level and zone-sector did not show large DIF effects (C). The results showed moderate (B) effects for around 5% of items (2 or 3 items); hence, the test in each area had 95% of its items free of DIF. The differences across the four subjects suggest the need for a qualitative analysis of the content and context of the items in each subject. Nevertheless, the results suggest that students at different socioeconomic levels and in different zone-sector categories are being assessed fairly, and that DIF is present in very few items. These conclusions are based on the proposal of Oshima et al. (2015), which is not commonly applied in practice; therefore, this document also aims to show the applicability of the method.
References
-Bose, R. C. (1939). On the construction of balanced incomplete block designs. Annals of Eugenics, 9(4), 353–399.
-Cervantes, V. H. (2017). DFIT: An R package for Raju's differential functioning of items and tests framework. Journal of Statistical Software, 76(5), 1–24. doi:10.18637/jss.v076.i05
-Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Hillsdale, NJ: Lawrence Erlbaum Associates.
-Lord, F. M. (1980). Applications of item response theory to practical testing problems. Routledge.
-Oshima, T., Wright, K., & White, N. (2015). Multiple-group noncompensatory differential item functioning in Raju's differential functioning of items and tests. International Journal of Testing, 15(3), 254–273.
-Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4), 495–502.
-Raju, N. S., Fortmann-Johnson, K. A., Kim, W., Morris, S. B., Nering, M. L., & Oshima, T. (2009). The item parameter replication method for detecting differential functioning in the polytomous DFIT framework. Applied Psychological Measurement, 33(2), 133–147.
-Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19(4), 353–368.
-Wright, K. D., & Oshima, T. (2015). An effect size measure for Raju's differential functioning for items and tests. Educational and Psychological Measurement, 75(2), 338–358.
-Yates, F. (1936). Incomplete randomized blocks. Annals of Eugenics, 7(2), 121–140.
-Youden, W. (1972). Use of incomplete block replications in estimating tobacco-mosaic virus. Journal of Quality Technology, 4(1), 50–57.