Analysis of Differential Item Functioning on Some TIMSS 2011 Data for Ukraine
Author(s):
Tetiana Lisova (presenting / submitting), Yuriy Kovalchuk
Conference:
ECER 2015
Format:
Paper

Session Information

09 SES 08 A, Theoretical and Methodological Issues in Tests and Assessments (Part 1)

Paper Session to be continued in 09 SES 12 A

Time:
2015-09-10
09:00-10:30
Room:
326. [Main]
Chair:
Eugenio Gonzalez

Contribution

In international comparative studies such as TIMSS or PISA, confidence that the test measures the same construct and is fair across countries and cultures is particularly important. At the same time, there are many possible sources of bias in such large-scale assessments: problems of translating tests into different languages, cultural features of the participating countries, and differences in curricula and teaching methods. In addition, the use of unidimensional models for scaling the results means that a large number of items show bias. A deep analysis of the functioning of each item allows one to better understand a country's position in the international ranking and to indicate ways to improve it.

Item bias can be detected through the analysis of differential item functioning (DIF). DIF "occurs when examinees from groups R (reference) and F (focal) have the same degree of proficiency in a certain domain, but different rates of success on an item" (Camilli, 2006). Depending on the interaction between group membership and ability level, two classes of DIF are distinguished: uniform and non-uniform. When no such interaction is found, the DIF is uniform; otherwise, non-uniform DIF is present.
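In the notation of the logistic-regression approach discussed below, this distinction can be written compactly (a standard formulation, e.g., Swaminathan & Rogers, 1990, stated here for clarity rather than quoted from the paper):

    \operatorname{logit} P(Y = 1 \mid \theta, g) = \beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 (\theta g)

where θ is the matching ability and g the group indicator. If β₂ ≠ 0 while β₃ = 0, one group has a constant advantage at every ability level (uniform DIF); if β₃ ≠ 0, the size or direction of the advantage changes with ability (non-uniform DIF).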

Many statistical methods for detecting DIF have been developed. Some are based on observed scores; others rely on ability estimates obtained from IRT models. Some identify uniform DIF better, while others detect non-uniform DIF more efficiently. There is no universal method among them; each has certain advantages and disadvantages. In this paper we consider two nonparametric methods, Mantel-Haenszel (MH) and the Simultaneous Item Bias Test (SIBTEST), and two parametric methods, Item Response Theory Likelihood Ratio (IRT-LR) and Logistic Regression (LR). The MH uniform DIF detection procedure (Holland & Thayer, 1988) is based on the analysis of contingency tables; the MH statistic is a chi-square that tests the null hypothesis of no DIF between the groups. With SIBTEST (Shealy & Stout, 1993), the complete latent space is viewed as multidimensional, (θ, η), where θ is the unidimensional target ability and η represents the extraneous abilities; true scores for both groups are estimated using linear regression, and the β statistic is used to test the null hypothesis of no DIF. Both methods make it possible to estimate the amount of DIF and to classify it as negligible, moderate, or large. IRT-LR (Thissen, Steinberg & Wainer, 1988) is based on comparing the fit of IRT models using the likelihood-ratio statistic; it can detect DIF arising from differential difficulty, differential relations with the construct being measured, or even differential guessing rates. The LR method (Swaminathan & Rogers, 1990) reveals both uniform and non-uniform DIF through successive comparison of nested regression models.
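To make the MH procedure concrete, here is a minimal Python sketch of the dichotomous MH statistic, stratified by total score. The continuity-corrected chi-square, the common odds ratio, and the ETS delta transformation are standard; the function name and interface are our own illustration, not the software used in this study.

    import numpy as np

    def mantel_haenszel_dif(item, total, group):
        """item: 0/1 responses; total: matching total scores;
        group: 0 = reference (R), 1 = focal (F)."""
        item, total, group = (np.asarray(a) for a in (item, total, group))
        num = den = 0.0                       # common odds-ratio components
        a_sum = e_sum = v_sum = 0.0           # chi-square components
        for k in np.unique(total):            # one 2x2 table per score level
            s = total == k
            A = np.sum(s & (group == 0) & (item == 1))   # reference, correct
            B = np.sum(s & (group == 0) & (item == 0))   # reference, incorrect
            C = np.sum(s & (group == 1) & (item == 1))   # focal, correct
            D = np.sum(s & (group == 1) & (item == 0))   # focal, incorrect
            T = A + B + C + D
            if T < 2 or min(A + B, C + D) == 0:
                continue                      # skip strata lacking one group
            num += A * D / T
            den += B * C / T
            m1, nR, nF = A + C, A + B, C + D  # correct total, group sizes
            a_sum += A
            e_sum += nR * m1 / T
            v_sum += nR * nF * m1 * (B + D) / (T * T * (T - 1))
        alpha = num / den                                 # MH common odds ratio
        chi_sq = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum  # continuity-corrected
        delta = -2.35 * np.log(alpha)                     # ETS delta scale
        # ETS rule of thumb: |delta| < 1 negligible (A),
        # 1 <= |delta| < 1.5 moderate (B), |delta| >= 1.5 large (C)
        return alpha, chi_sq, delta

At matched ability, alpha > 1 (delta < 0) indicates an item favoring the reference group; the A/B/C bands correspond to the negligible/moderate/large labels used above.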

Although the presence of DIF is a necessary but not sufficient condition for bias, studying it is very useful for better understanding how items function in different groups and for detecting possible problems.

This research had two objectives. The first was to compare the capabilities of different methods and tools for investigating DIF. The second was to analyze the DIF results obtained by these methods for the TIMSS 2011 mathematics items, comparing the Ukrainian group with groups from the USA and the Russian Federation.

Method

The TIMSS 2011 grade 8 data from booklet 2, all of whose items have been released, were used for the study (http://timssandpirls.bc.edu/). Booklet 2 contains 32 mathematics items drawn from all content domains and cognitive levels in the same proportions as the whole item pool. Among them are 15 multiple-choice items and 17 constructed-response items scored 0, 1, or 2 points. 249 participants from Ukraine, 742 from the USA, and 354 from the Russian Federation took the booklet 2 test, with mean total mathematics scores of 15.8, 16.6, and 20.5 respectively. In the first stage (test level), the factor structure of the test was investigated for each country separately in Winsteps (J. Linacre) using a standard principal-components analysis of residuals (without rotation, with orthogonal axes). No significant evidence of multidimensionality was found at this stage for any country. Next, DIF analysis for each item (item level) was conducted with the different methods. Results of DIF analysis by the MH and Rasch-Welch methods were also obtained in Winsteps. The Poly-SIBTEST program from the DIFPAK v.1.7 package (Stout, 2005) was used to detect uniform DIF with SIBTEST, and Crossing-SIBTEST was used to detect non-uniform DIF in dichotomous items. The IRTLRDIF v.2.0 program (Thissen, 2001) performs pairwise comparisons of full and reduced models for each parameter using the IRT-LR test; it allows the 3PL model for multiple-choice items and the graded response model (Samejima, 1969) for constructed-response items. The LR method was implemented in SPSS following Zumbo's technique (1999) with an updated macro, and the Nagelkerke R² statistic was used to estimate effect size. All conclusions were drawn at the 0.05 significance level. The percentage of agreement was calculated for each pair of methods: the highest agreement (88%) was found between MH and SIBTEST, and the lowest (70%) between SIBTEST and LR. At the last stage, a content analysis was conducted of the items that all methods classified as showing DIF and for which at least three methods detected large DIF. This analysis helped to better understand the reasons for Ukraine's low ranking.
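For readers who want to reproduce the effect-size logic outside SPSS, the following is a minimal Python sketch of Zumbo's three-step logistic-regression procedure with the Nagelkerke R² effect size, for dichotomous items only (the constructed-response items scored 0–2 would require ordinal logistic regression). This statsmodels-based re-implementation, including its function names, is an illustrative assumption, not the macro used in the study.

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    def nagelkerke_r2(llf, llnull, n):
        """Nagelkerke's R^2: Cox-Snell R^2 rescaled to a maximum of 1."""
        cox_snell = 1.0 - np.exp(2.0 * (llnull - llf) / n)
        return cox_snell / (1.0 - np.exp(2.0 * llnull / n))

    def zumbo_lr_dif(item, total, group):
        """item: 0/1 responses; total: matching score; group: 0 = reference, 1 = focal."""
        item, total, group = (np.asarray(a, dtype=float) for a in (item, total, group))
        n = len(item)
        llnull = sm.Logit(item, np.ones((n, 1))).fit(disp=0).llf  # intercept only
        # Step 1: ability only; step 2: + group (uniform DIF);
        # step 3: + ability-by-group interaction (non-uniform DIF)
        designs = [total[:, None],
                   np.column_stack([total, group]),
                   np.column_stack([total, group, total * group])]
        fits = [sm.Logit(item, sm.add_constant(X)).fit(disp=0) for X in designs]
        g2 = 2.0 * (fits[2].llf - fits[0].llf)   # 2-df simultaneous test of any DIF
        p = chi2.sf(g2, df=2)
        r2 = [nagelkerke_r2(f.llf, llnull, n) for f in fits]
        delta_r2 = r2[2] - r2[0]                 # effect size for total DIF
        return g2, p, delta_r2

A significant two-degree-of-freedom G² combined with a small ΔR² indicates statistically detectable but practically negligible DIF; commonly cited cutoffs for ΔR² (e.g., Jodoin and Gierl's 0.035 and 0.07) are then used to label DIF as negligible, moderate, or large.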

Expected Outcomes

When comparing the Ukrainian and US groups, between 50% and 63% of items showed some degree of DIF, depending on the detection method. The LR and IRT-LR methods revealed non-uniform DIF in three items that MH and SIBTEST did not detect. The smallest number of items with large DIF (three) was flagged by LR; however, for these items all the other methods also indicated large DIF. On two of them, which concerned calculating the probability of a random event, US participants performed significantly better, while the geometry item on drawing a triangle favored Ukrainian students. For another three items the LR method detected moderate DIF while all the other methods indicated large DIF. On the item concerning rotation of a figure and on the chart-reading item Ukrainian students performed better, whereas in solving a linear inequality they showed significantly worse results. In the comparison of the Ukrainian and Russian groups fewer items (34%) showed DIF, as students in these countries have similar study programs for historical reasons. In solving linear inequalities Ukrainian participants again performed worse, whereas on two geometry items they showed better results than both the US and the Russian groups. In most cases the DIF is easily explained by analyzing the curricula. For example, solving linear inequalities is studied in Ukrainian schools at the beginning of 9th grade, while Russian students cover this topic at the end of 8th grade. At the same time, the introduction of the concept of the probability of a random event into the curriculum is an equal struggle in both countries, owing to the many years during which the stochastics strand was absent from the school mathematics course.

References

1. Ayala, R.J. (2009) The Theory and Practice of Item Response Theory. New York, London: The Guilford Press.
2. Camilli, G. (2006) Test fairness. In R. Brennan (Ed.), Educational Measurement (pp. 221–256). Westport, CT: ACE/Praeger Series on Higher Education.
3. Crocker, L., Algina, J. (1986) Introduction to Classical and Modern Test Theory. New York: Holt, Rinehart and Winston.
4. Linacre, J. (2011) A User's Guide to Winsteps. Retrieved from: http://www.winsteps.com/winman/index.htm?guide.htm
5. Ministry of Education and Science of Ukraine (2012) Mathematics. The curriculum for students of grades 5–9 of secondary schools. Retrieved from: http://www.mon.gov.ua/ua/activity/education/56/692/educational_programs/
6. Mullis, I., Martin, M., Foy, P., Arora, A. (2012) TIMSS 2011 International Results in Mathematics. Retrieved from: http://timss.bc.edu/timss2011/downloads
7. State Standard for Basic and Secondary Education (2004) Mathematics in School, No. 2.
8. Stout, W., Roussos, L. (1995) SIBTEST Manual. Champaign, IL: University of Illinois, Department of Statistics, Statistical Laboratory for Educational and Psychological Measurement.
9. Thissen, D. (2001) IRTLRDIF v.2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning. Retrieved from: http://www.unc.edu/~dthissen/dl.html
10. Yildirim, H.H. (2006) The differential item functioning (DIF) analysis of mathematics items in the international assessment programs. Doctoral thesis, Graduate School of Natural and Applied Sciences, Middle East Technical University. Retrieved from: http://etd.lib.metu.edu.tr/upload/12607135/index
11. Zumbo, B.D. (1999) A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-Type (Ordinal) Item Scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense. Retrieved from: http://www.educ.ubc.ca/faculty/zumbo/DIF/index.html

Author Information

Tetiana Lisova (presenting / submitting)
Nizhyn State Mykola Gogol University
Physics and Mathematics Faculty
Nizhyn

Yuriy Kovalchuk
Nizhyn State Mykola Gogol University
Physics and Mathematics Faculty
Nizhyn
