Session Information
09 SES 14 B, Psychometric Approaches to Fair and Valid Achievement Assessment
Paper Session
Contribution
The Standards for Educational and Psychological Testing (AERA et al., 2014) specify five sources of validity evidence that can be used to assess the validity of a proposed interpretation of test scores for a specific purpose. These sources are based on (1) test content, (2) response processes, (3) internal structure, (4) relations to other variables, and (5) testing consequences. Evidence based on internal structure concerns how the relationships among items and the dimensions underlying the test support the proposed interpretation of the scores; in this process, it is examined whether participants' responses reflect the intended test structure (Sireci & Benitez, 2023).
This study explores the internal structure of the reading section of the Pearson Test of English Core (PTE Core) using a confirmatory factor analysis (CFA) approach. The PTE Core is a computer-based test administered globally to assess general speaking, writing, reading, and listening skills. It provides a score of test-takers' language ability to assist government organizations (e.g., Immigration, Refugees and Citizenship Canada) that require a standard of English language proficiency (Pearson, 2024). Because test results have significant consequences for individuals, it is crucial to evaluate the validity of test scores and provide evidence to support their use and interpretation. By examining internal structure, this study contributes to the ongoing validation efforts for the PTE Core.
The PTE Core uses a form of linear-on-the-fly test assembly (LOFT), an automated test construction approach in which different test takers receive different test forms, which enhances test security (Becker & Bergstrom, 2013). In this design, every test taker receives a unique combination of items assembled in real time from a large item pool. Consequently, the item pool is substantially larger than the set of items any individual answers, so the response matrix stacked across test takers contains a great deal of missing data. This sparseness poses challenges for frequentist CFA estimation methods such as maximum likelihood (ML). Bayesian CFA offers an alternative to traditional ML-based CFA for assessing the validity of educational and psychological constructs (Hoofs et al., 2018). Therefore, we used Bayesian CFA to examine the factor structure of the PTE Core reading section.
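To make this sparseness concrete, the following toy Python sketch simulates a simplified LOFT assignment (the random form-assembly rule and dichotomous scoring are illustrative assumptions, not Pearson's actual algorithm): each of 1,115 simulated test takers answers a random 10-item form drawn from a 129-item pool, leaving roughly 92% of the stacked response matrix missing by design.

    import numpy as np

    rng = np.random.default_rng(42)
    n_takers, pool_size, items_per_form = 1115, 129, 10

    # NaN marks items a test taker never saw; 0/1 is a toy dichotomous score.
    responses = np.full((n_takers, pool_size), np.nan)
    for taker in range(n_takers):
        form = rng.choice(pool_size, size=items_per_form, replace=False)
        responses[taker, form] = rng.integers(0, 2, size=items_per_form)

    missing_rate = np.isnan(responses).mean()
    respondents_per_item = (~np.isnan(responses)).sum(axis=0)
    print(f"{missing_rate:.0%} of cells are missing")  # about 92%
    print(respondents_per_item.min(), respondents_per_item.max())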
Bayesian analysis is generally used when models are too complex for traditional estimation methods to handle, when only relatively small samples are available, when researchers want to incorporate background knowledge into their analysis, or when the results provided by Bayesian methods are preferred (Depaoli et al., 2020). Fundamentally, Bayesian methods differ from traditional methods in how they treat the unknown parameters of a model: parameters are regarded as random variables with a probability distribution that reflects uncertainty about their true values (Kaplan & Depaoli, 2012). Bayesian statistics has three essential ingredients: (1) the prior distribution, the background knowledge about the parameters of the model being tested; (2) the information in the data, the observed evidence expressed through the likelihood function of the data given the parameters; and (3) the posterior distribution, which combines the first two ingredients into updated information that balances prior knowledge with observed data (van de Schoot et al., 2014). Researchers may select their priors based on prior publications, Bayesian approaches, or relevant datasets (e.g., secondary data), and some researchers may use their own data to construct priors (Depaoli & van de Schoot, 2017). With its theoretical and practical advantages, Bayesian statistics can help researchers in education and psychology address the challenges posed by small sample sizes and complex models, although its implementation can be computationally intensive and requires specialized knowledge.
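In symbols, for model parameters θ and data y, Bayes' theorem combines these ingredients as

    p(θ | y) = p(y | θ) p(θ) / p(y) ∝ p(y | θ) p(θ),

that is, the posterior is proportional to the likelihood times the prior.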
Method
The sample comprised 1,115 test-takers (51.3% male, 48.7% female; M = 34.69, SD = 8.22). The dataset included participants from various countries, with the highest representation from Brazil, Nigeria, Ukraine, the United States, China, Egypt, India, and Iran (each 10.8%, n = 120). Smaller proportions were from France (3.4%, n = 38), Germany (2.6%, n = 29), Iraq (3.0%, n = 34), and Jordan (4.8%, n = 54).

The PTE Core reading section comprises four item types: (1) multiple-choice, multiple-answer items scored with partial credit (e.g., 0, 1, 2); (2) reorder-paragraph items scored with partial credit (e.g., 0, 1, 2, 3); (3) fill-in-the-blanks items scored with partial credit (e.g., 0, 1, 2, 3, 4); and (4) multiple-choice, single-answer items scored dichotomously (0, 1) (Pearson, 2024). The assessment used a LOFT design with a total of 129 items; however, each test-taker responded to only 9-10 items, so each item was answered by approximately 42-127 individuals. Consequently, this design produced a substantial number of missing values across items.

To address this issue, Bayesian CFA was employed to examine the factor structure of the reading test. We anticipated identifying a unidimensional factor structure: previous research on PTE Academic, a computer-based English test assessing academic-level language competence, supports the idea that the reading section may have one underlying construct (e.g., Pae, 2011). Mplus 8.3 (Muthén & Muthén, 1998-2012) was used to conduct the analysis. The data analysis was organized around three steps: (1) set up the full probability model, including the priors; (2) estimate the posterior distributions; and (3) evaluate the appropriateness of the model and interpret the results (Taylor, 2019). There are three types of priors: (1) non-informative priors, (2) weakly informative priors, and (3) informative priors with specific hyperparameters (van de Schoot et al., 2013). In this study, we used non-informative priors (i.e., no prior information) because little information about the parameters was available. This choice is particularly important because priors exert greater influence on the posterior when the sample size is small (van de Schoot et al., 2014). For evaluating model-data fit, the posterior predictive p-value (ppp) is recommended (Muthén & Asparouhov, 2012): it is the proportion of MCMC iterations in which the χ² statistic computed from replicated data exceeds the χ² statistic computed from the observed data (Lee, 2007). A ppp around 0.50 indicates good fit (Muthén & Asparouhov, 2012).
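As a minimal sketch of this fit check (the variable names are hypothetical; it assumes the χ² discrepancies for the observed and replicated data have already been saved from each MCMC iteration), the ppp reduces to a simple proportion:

    import numpy as np

    def posterior_predictive_p(chi2_obs, chi2_rep):
        """Proportion of MCMC iterations in which the chi-square discrepancy
        of the replicated data exceeds that of the observed data."""
        return float(np.mean(np.asarray(chi2_rep) > np.asarray(chi2_obs)))

    # Toy check: for a well-fitting model the two discrepancies are
    # exchangeable, so the ppp should land near 0.50.
    rng = np.random.default_rng(1)
    obs = rng.chisquare(df=100, size=5000)
    rep = rng.chisquare(df=100, size=5000)
    print(posterior_predictive_p(obs, rep))  # approximately 0.50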
Expected Outcomes
The Bayesian one-factor CFA model with non-informative priors showed an acceptable fit to the data, with a ppp-value of 0.47. The 95% confidence interval for the difference between the observed-data and replicated-data test statistics ranged from -353.85 to 386.85; because this interval covers zero, there was no systematic discrepancy between the observed data and data replicated under the model. However, four items had low, non-significant factor loadings. These items were excluded, and the model was re-run with the remaining 125 items. This model was accepted based on its ppp-value of 0.51; the corresponding 95% confidence interval for the difference between the observed-data and replicated-data test statistics ranged from -340.189 to 397.019 and again covered zero. Standardized factor loading estimates ranged from 0.31 to 0.94, and none of the 95% credible intervals included zero, indicating that all retained items loaded on the latent factor. Overall, the results indicated that the PTE Core reading section has a one-factor structure measuring test-takers' reading skills. This result is in line with Pae (2011), who reported a one-factor structure for PTE Academic reading items. Further studies are needed to strengthen the validity evidence for PTE Core reading test scores, such as conducting CFA with larger samples or assessing measurement invariance across diverse populations.
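For illustration, the small Python sketch below (with simulated draws standing in for real MCMC output) shows the kind of computation behind this credible-interval screening: an equal-tailed 95% interval that excludes zero is what supports retaining an item.

    import numpy as np

    def credible_interval(draws, level=0.95):
        """Equal-tailed credible interval from posterior draws."""
        tail = (1.0 - level) / 2.0
        return np.quantile(draws, [tail, 1.0 - tail])

    # Toy posterior draws for one standardized loading.
    rng = np.random.default_rng(3)
    draws = rng.normal(loc=0.55, scale=0.08, size=4000)
    lo, hi = credible_interval(draws)
    print(f"95% CrI: [{lo:.2f}, {hi:.2f}]")  # excludes zero, so retain the item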
References
Becker, K. A., & Bergstrom, B. A. (2013). Test administration models. Practical Assessment, Research & Evaluation, 18(14).
Depaoli, S., & van de Schoot, R. (2017). Improving transparency and replication in Bayesian statistics: The WAMBS-Checklist. Psychological Methods, 22(2), 240-261. https://doi.org/10.1037/met0000065
Depaoli, S., Winter, S. D., & Visser, M. (2020). The importance of prior sensitivity analysis in Bayesian statistics: Demonstrations using an interactive Shiny app. Frontiers in Psychology, 11. https://doi.org/10.3389/fpsyg.2020.608045
Hoofs, H., van de Schoot, R., Jansen, N. W., & Kant, I. (2018). Evaluating model fit in Bayesian confirmatory factor analysis with large samples: Simulation study introducing the BRMSEA. Educational and Psychological Measurement, 78(4), 537-568. https://doi.org/10.1177/0013164417709314
Kaplan, D., & Depaoli, S. (2012). Bayesian structural equation modeling. In R. Hoyle (Ed.), Handbook of structural equation modeling (pp. 650-673). Guilford Press.
Lee, S. Y. (2007). Structural equation modeling: A Bayesian approach. Wiley.
Muthén, B. O., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17, 313-335. https://doi.org/10.1037/a0026802
Muthén, L. K., & Muthén, B. O. (1998-2012). Mplus user's guide (7th ed.). Muthén & Muthén.
Pae, H. (2011). Differential item functioning and unidimensionality in the Pearson Test of English Academic. Pearson.
Pearson. (2024). PTE Core test taker score guide. Pearson.
Sireci, S., & Benitez, I. (2023). Evidence for test validation: A guide for practitioners. Psicothema, 35(3), 217-226. https://doi.org/10.7334/psicothema2022.477
Taylor, J. M. (2019). Overview and illustration of Bayesian confirmatory factor analysis with ordinal indicators. Practical Assessment, Research & Evaluation, 24(4).
van de Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J., Neyer, F., & van Aken, M. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85(3), 842-860. https://doi.org/10.1111/cdev.12169
van de Schoot, R., Kluytmans, A., Tummers, L., Lugtig, P., Hox, J., & Muthén, B. (2013). Facing off with Scylla and Charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Frontiers in Psychology, 4, 770. https://doi.org/10.3389/fpsyg.2013.00770