Session Information
09 SES 12 A, Theoretical and Methodological Issues in Tests and Assessments (Part 2)
Paper Session continues from 09 SES 08 A
Contribution
Outline of PISA
The PISA study (Programme for International Student Assessment) is a very large worldwide survey conducted by the Organisation for Economic Co-operation and Development (OECD). It was first conducted in 2000 and has been repeated every three years since. PISA assesses literacy in reading (in the mother tongue), mathematics and science. In these areas PISA mainly assesses 15-year-old students’ capacity to use their knowledge and skills to meet real-life challenges, rather than merely how well they have mastered a specific school curriculum. The test has been translated (and/or adapted) into more than 40 test languages, equivalent to the English and French source versions developed by the PISA consortium. In each cycle, one of the assessment areas serves in turn as the major domain: science was the major domain in 2006 (cycle 3) and will be the major domain again in 2015 (cycle 6).
PISA test design and IRT models
In the PISA study, cognitive items are organised into different test booklets following a linked design (see, for example, OECD, 2014), and students are randomly assigned to booklets. The item formats in PISA include multiple-choice and open-ended items (short-answer or extended response).
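As an illustration, a linked booklet design of the kind described above can be sketched as follows. The cluster and booklet labels here are hypothetical and much smaller than the actual PISA layout; the point is that overlapping clusters chain all booklets onto a common scale:

```python
import random
from collections import Counter

# Hypothetical linked design: four item clusters rotated through four
# booklets so that consecutive booklets share one cluster, which links
# every item onto the same scale via the common clusters.
clusters = ["C1", "C2", "C3", "C4"]
booklets = {
    "B1": ("C1", "C2"),
    "B2": ("C2", "C3"),
    "B3": ("C3", "C4"),
    "B4": ("C4", "C1"),
}

# Balance check: every cluster appears in the same number of booklets.
usage = Counter(c for pair in booklets.values() for c in pair)

# Students are assigned a booklet at random, as in PISA's rotated design.
random.seed(0)
students = {f"S{i}": random.choice(sorted(booklets)) for i in range(10)}
```

The overlap structure is what allows item parameters estimated from different booklets to be placed on one scale.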
Item response theory (IRT) models have been widely used in analysing and constructing educational and psychological tests. PISA data have been analysed with the IRT partial credit model (Masters, 1982), an extension of the simple logistic Rasch model (Rasch, 1960/1980).
The partial credit model (PCM), or one-parameter partial credit model, was developed for polytomously scored items. By adding a discrimination parameter to each item, Muraki (1992) generalised this model into a two-parameter partial credit model, known as the generalized partial credit model (GPCM).
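A minimal sketch of the GPCM category probabilities for a single item, with the PCM recovered as the special case a = 1; the parameter values below are illustrative only:

```python
import numpy as np

def gpcm_prob(theta, a, b):
    """Category probabilities for one item under the GPCM (Muraki, 1992):
    P(X = k | theta) is proportional to exp(sum_{j<=k} a * (theta - b_j)),
    with the empty sum for category 0 fixed at zero. Setting a = 1.0
    reduces the model to the partial credit model (Masters, 1982)."""
    steps = np.concatenate(([0.0], a * (theta - np.asarray(b, dtype=float))))
    z = np.cumsum(steps)
    p = np.exp(z - z.max())  # subtract the max for numerical stability
    return p / p.sum()

# A 3-category item (two step parameters) at ability theta = 0.5
probs = gpcm_prob(theta=0.5, a=1.2, b=[-0.4, 0.8])
```

When theta equals every step parameter, all categories are equally likely; raising the discrimination a makes the category probabilities change more sharply with theta, which is exactly the extra flexibility the GPCM buys over the PCM.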
From the model formulas it is expected that the more parameters a model includes, the better its model-data goodness-of-fit statistics are likely to be (Fitzpatrick, Link, Yen, Burket, Ito, & Sykes, 1996; Harris, 1989). However, the same may not hold for the stability of the item parameter estimates across examinee groups: item parameter invariance is not guaranteed by the mere fact that an IRT model fits the data (van der Linden & Hambleton, 1997; Engelhard, 1994).
Choosing an appropriate model for the test data is essential for assuring its quality. Together with fit statistics, the stability (invariance) of the item parameter estimates across examinee groups is an important criterion for selecting an appropriate IRT model for a data set.
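One simple way to quantify this stability, sketched here with invented difficulty estimates purely for illustration, is to calibrate the same items separately in two groups and compare the estimates after aligning the scales (mean-mean linking):

```python
import numpy as np

# Hypothetical difficulty estimates for the same six items, calibrated
# separately in two examinee groups (illustrative numbers only).
group_a = np.array([-1.2, -0.5, 0.0, 0.4, 0.9, 1.6])
group_b = np.array([-1.1, -0.6, 0.1, 0.5, 0.8, 1.7])

# Under parameter invariance the two sets should be linearly related:
# a high correlation and small residuals after linking indicate stability.
r = np.corrcoef(group_a, group_b)[0, 1]

# Mean-mean linking: put both sets on a common scale by equating means,
# then summarise the remaining item-by-item disagreement.
residuals = (group_b - group_b.mean()) - (group_a - group_a.mean())
rmsd = np.sqrt(np.mean(residuals ** 2))
```

Large residuals for particular items, or a markedly lower correlation for one model than another, would signal that its parameter estimates do not travel well across groups.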
Study objectives
This study investigates the stability of the item parameter estimates obtained from the PCM and the GPCM, respectively, by test booklet across country and language groups.
The results should indicate whether an item discrimination parameter is needed in the item calibration model for the large and heterogeneous samples of an international test such as PISA, and in particular for PISA Science 2015.
Method
Expected Outcomes
References
Adams, R. J., Wu, M. L., & Wilson, M. R. (2012). ACER ConQuest 3.0 [Computer program]. Melbourne: ACER.
Engelhard, G., Jr. (1994). Historical views of the concept of invariance in measurement theory. In M. Wilson (Ed.), Objective measurement: Theory into practice (Vol. 2, pp. 73-99). Norwood, NJ: Ablex.
Fitzpatrick, A. R., Link, V. B., Yen, W. M., Burket, G. R., Ito, K., & Sykes, R. C. (1996). Scaling performance assessments: A comparison of one-parameter and two-parameter partial credit models. Journal of Educational Measurement, 33(3), 291-314.
Harris, D. (1989). Comparison of 1-, 2-, and 3-parameter IRT models. Educational Measurement: Issues and Practice, 8, 35-41.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159-176.
OECD (2014). PISA 2012 Technical Report. Paris: OECD.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research, 1960. (Expanded edition, Chicago: The University of Chicago Press, 1980.)
Thissen, D., Chen, W.-H., & Bock, R. D. (2003). Multilog (version 7) [Computer software]. Lincolnwood, IL: Scientific Software International.
van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. New York: Springer-Verlag.