Taking Reading Tests On Paper And Computer – The Analysis Of Mode Effects In Reading Assessments.

Author(s):

Sarah Bürger (presenting / submitting), Ulf Kröhne, Frank Goldhammer

Conference:

ECER 2017

Network:

09. Assessment, Evaluation, Testing and Measurement

Format:

Paper

Session Information

09 SES 03 A, Comparing Computer- and Paper-Based-Assessment

Paper Session

Time:

2017-08-22

17:15-18:45

Room:

W5.13

Chair:

Eugenio Gonzalez

Contribution

To benefit from the possibilities of technology-based testing, an existing paper-based assessment (PBA) needs to be transferred to a computer-based assessment (CBA). In longitudinal studies, such as the National Educational Panel Study (NEPS; Blossfeld, Roßbach, & von Maurice, 2011) in Germany, the comparability of ability estimates measured over time is a fundamental requirement for valid interpretations of change scores and precise comparisons of ability distributions between cohorts. Hence, the replacement of PBA with CBA must be prepared carefully, and the consequences of the mode change need to be investigated.

Previous research revealed heterogeneous mode effects that are not predictable without empirical investigation (e.g., Wang, Jiao, Young, Brooks, & Olson, 2008). The risk of mode effects differs between domains and increases with the complexity of items. The response format is therefore a plausible predictor of mode effects, because its complexity may differ between modes (e.g., Heerwegh & Loosveldt, 2002). Assignment tasks are an example of a more complex format. They are typically used in reading tests when given headings have to be assigned to paragraphs of the text. Assignment tasks can be computerized using so-called combo boxes (or drop-down boxes), and in this form they were found to be more difficult than assignment tasks on paper tests (Heerwegh & Loosveldt, 2002). Moreover, previous findings give reason to assume that reading tests are more susceptible to mode effects when scrolling in longer texts and navigation between tasks within a unit are required (e.g., Poggio, Glasnapp, Yang, & Poggio, 2005; Pommerich, 2004).

The ongoing transition from PBA to CBA in the NEPS is accompanied by additional experimental mode effect studies to learn more about whether it makes a difference if one takes a reading test on computer or on paper. For this presentation, we analyze data from two reading tests (for more details see Gehrer, Zimmermann, Artelt, & Weinert, 2013) for different grades (seven and twelve) that were computerized and administered in a between-subject design in which students were randomly assigned to modes. In addition, each student completed a common PBA reading test of a lower grade as well as a test of basic computer skills (BCS). Both served as external criteria to inspect construct equivalence.

To evaluate mode effects, appropriate equivalence criteria need to be derived from the intended use of test scores and test score interpretations (Buerger, Kroehne, & Goldhammer, 2016). Therefore, the following research questions were investigated for each test: Do CBA and PBA measure the same underlying construct? Is reliability equal between modes? Are the item parameters invariant between modes? Is there a homogeneous shift in item difficulty on computer? Can mode effects be explained by item properties such as the response format or navigation requirements?
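The questions about invariant item parameters and a homogeneous shift can be written down as nested constraints on item difficulties. The following is only an illustrative sketch for dichotomous items under a Rasch-type model; the symbols (\theta_p, b_i^{m}, \delta, \delta_i) are assumptions introduced here and are not notation from the study, which also includes polytomous items.

```latex
% Illustrative notation, not taken from the study:
% \theta_p   ability of person p
% b_i^{m}    difficulty of item i in mode m
% \delta     common (homogeneous) mode shift
% \delta_i   item-specific mode effect
\begin{align}
  P(X_{pi} = 1 \mid \theta_p, m)
    &= \frac{\exp(\theta_p - b_i^{m})}{1 + \exp(\theta_p - b_i^{m})},
    \qquad m \in \{\mathrm{PBA}, \mathrm{CBA}\} \\
  b_i^{\mathrm{CBA}} &= b_i^{\mathrm{PBA}}
    && \text{(invariant item difficulties)} \\
  b_i^{\mathrm{CBA}} &= b_i^{\mathrm{PBA}} + \delta
    && \text{(homogeneous shift)} \\
  b_i^{\mathrm{CBA}} &= b_i^{\mathrm{PBA}} + \delta_i
    && \text{(item-specific mode effects)}
\end{align}
```

Under this parametrization, invariance is the special case \delta = 0 of the homogeneous shift, and the homogeneous shift is the special case \delta_i = \delta of item-specific effects.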

Method

Data were analyzed in R (R Core Team, 2014) and Mplus (Muthén & Muthén, 1998-2015). In the first step, a measurement model was fitted for each test in both modes simultaneously in R with the package TAM (Kiefer, Robitzsch, & Wu, 2015). Because the tests included polytomous items, the partial credit model (PCM) and the generalized partial credit model (GPCM) were fitted as the polytomous counterparts of the 1-PL and 2-PL models and compared using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). A measurement model that holds for both modes prepares the mode effect analysis by providing the parameters in which mode differences can be described, for instance differences in item difficulties and item discriminations. Determining a measurement model is also tied to the question of whether the same latent construct is assessed in both modes (AERA, APA, & NCME, 2014; Huff & Sireci, 2001; ITC, 2005; Parshall, Spray, Kalohn, & Davey, 2002; Penfield & Camilli, 2007), because it enables investigating construct equivalence in a latent variable model that includes responses from both modes (Buerger et al., 2016). To analyze construct equivalence, the relation to external variables (AERA, APA, & NCME, 2014) was investigated: the latent correlations of each test with the lower-grade PBA reading test and with BCS were compared between CBA and PBA. Multiple-group IRT models were estimated in Mplus, and a Wald test was used to test the equality of the latent correlations. The overall reliability of the paper-based and the computer-based test was compared using the EAP reliability obtained from the IRT analysis in TAM (Kiefer et al., 2015). To test for equal item parameters between the modes, multiple-group IRT models were used in Mplus.
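The two generic statistical tools named above, information criteria for model comparison and a Wald test for the equality of two estimates, can be sketched in a few lines. This is a minimal Python illustration rather than the TAM or Mplus analysis itself: the function names are hypothetical, all numbers are made up, and the Wald test assumes two independent groups (as in a between-subject design), so that the covariance between the two estimates is zero.

```python
import math


def aic(log_likelihood: float, n_parameters: int) -> float:
    """Akaike Information Criterion: deviance plus 2 per estimated parameter."""
    return -2.0 * log_likelihood + 2.0 * n_parameters


def bic(log_likelihood: float, n_parameters: int, n_persons: int) -> float:
    """Bayesian Information Criterion: deviance plus log(n) per parameter."""
    return -2.0 * log_likelihood + n_parameters * math.log(n_persons)


def wald_test_equal(estimate_a: float, se_a: float,
                    estimate_b: float, se_b: float) -> tuple[float, float]:
    """Wald test of H0: estimate_a == estimate_b.

    Assumes the two estimates come from independent groups, so the variance
    of their difference is the sum of the squared standard errors.
    Returns the chi-square statistic (1 df) and its p-value.
    """
    difference = estimate_a - estimate_b
    statistic = difference ** 2 / (se_a ** 2 + se_b ** 2)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Made-up fit results for a restricted (PCM-like) and a more general
    # (GPCM-like) model; these are NOT values from the study.
    n_persons = 800
    fits = {"PCM": (-10250.0, 30), "GPCM": (-10220.0, 50)}
    for name, (loglik, n_par) in fits.items():
        print(name, round(aic(loglik, n_par), 1),
              round(bic(loglik, n_par, n_persons), 1))

    # Made-up latent correlations with an external criterion in each mode.
    statistic, p_value = wald_test_equal(0.80, 0.04, 0.76, 0.05)
    print(f"Wald chi-square = {statistic:.3f}, p = {p_value:.3f}")
```

Lower AIC or BIC values indicate the preferred model. Because BIC penalizes additional parameters more heavily than AIC for any realistic sample size, the two criteria can disagree, which is one reason to inspect both.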

Expected Outcomes

The most important result is that the construct did not change when switching to the computer. Latent correlations were equally high between computer and paper tests. Moreover, reliabilities were equally high between the modes. This could be shown for both reading tests. Regarding item difficulties, the grade 12 test showed a homogeneous shift in item difficulty. The probability of a correct response for each item on computer (compared to the paper-based test) decreases on average by about five percent. For the grade 7 test, differences were found for selected items. This mode effect was not homogeneous, but could be explained and systematized by item properties. Investigating the effect of five item properties on the mode difference showed no shift in item difficulty when reading texts were split across multiple screens, as is often necessary when an existing paper test has to be computerized. Higher difficulties were found for items in the first and second positions, which had additional navigation requirements on computer. Three different response formats were used in the NEPS reading tests: normal multiple choice, complex multiple choice, and combo box. Of these response formats, only the combo box turned out to increase item difficulty. As there is sufficient evidence for equivalence in the measured construct, mode-specific item parameters could be used for at least some items, taking into account the change in difficulty. For the grade 12 test, mode-specific item parameters can be simplified to one mode-specific shift parameter that can be applied to all item difficulty parameters.
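To make the idea of a single mode-specific shift parameter concrete, the following Python sketch shows how a constant shift on the logit scale translates into a change in the probability of a correct response. It assumes a dichotomous Rasch model for simplicity; the function names, the shift of 0.2 logits, and the item difficulties are hypothetical values chosen for illustration and are not estimates from the study.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch (1-PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def probability_change(theta: float, difficulty_pba: float, shift: float) -> float:
    """Change in success probability when moving from paper to computer.

    The computer-based difficulty is modeled as the paper-based difficulty
    plus one common shift, so a positive shift makes every item harder.
    """
    p_pba = rasch_probability(theta, difficulty_pba)
    p_cba = rasch_probability(theta, difficulty_pba + shift)
    return p_cba - p_pba


if __name__ == "__main__":
    hypothetical_shift = 0.2  # illustrative value in logits, not a study result
    average_ability = 0.0
    for difficulty in (-1.0, 0.0, 1.0):  # hypothetical paper-based difficulties
        change = probability_change(average_ability, difficulty, hypothetical_shift)
        print(f"b = {difficulty:+.1f}: change in P(correct) = {change:+.3f}")
```

Although the shift is the same for every item on the logit scale, the resulting change in probability is largest for items whose difficulty is close to the person's ability and smaller for very easy or very hard items, which is why an average change in probability is reported alongside the shift.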

References

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for Educational and Psychological Testing. Washington: AERA, APA, NCME.

Blossfeld, H.-P., Roßbach, H.-G., & von Maurice, J. (Eds.) (2011). Education as a Lifelong Process – The German National Educational Panel Study (NEPS). [Special Issue] Zeitschrift für Erziehungswissenschaft, 14.

Buerger, S., Kroehne, U., & Goldhammer, F. (2016). The Transition to Computer-Based Testing in Large-Scale Assessments: Investigating (Partial) Measurement Invariance between Modes. Psychological Test and Assessment Modeling, 58 (4), 487-606.

Gehrer, K., Zimmermann, S., Artelt, C. & Weinert, S. (2013). NEPS framework for assessing reading competence and results from an adult pilot study. Journal for educational research online, Volume 5 (No. 2), 50–79.

Heerwegh, D. & Loosveldt, G. (2002). An Evaluation of the Effect of Response Formats on Data Quality in Web Surveys. Social Science Computer Review, 20 (4), 471–484.

Huff, K. L., & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational Measurement: Issues and Practices, 20 (3), 16–25.

International Test Commission (ITC). (2005). International Guidelines on Computer-Based and Internet Delivered Testing. Retrieved from https://www.intestcom.org/files/guideline_computer_based_testing.pdf

Kiefer, T., Robitzsch, A., & Wu, M. (2015). TAM: Test analysis modules. (R package version 1.15-0).

Muthén, L.K., & Muthén, B.O. (1998-2015). Mplus User’s Guide. Seventh Edition. Los Angeles, CA: Muthén & Muthén.

Parshall, C. G., Spray, J. A., Kalohn, J. C., & Davey, T. (2002). Practical considerations in computer-based testing. New York: Springer.

Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In C. R. Rao, & S. Sinharay (Eds.), Handbook of Statistics: Vol. 26. Psychometrics, (pp. 125–167). New York, NY: Elsevier.

Poggio, J., Glasnapp, D. R., Yang, X. & Poggio, A. J. (2005). A Comparative Evaluation of Score Results from Computerized and Paper & Pencil Mathematics Testing in a Large Scale State Assessment Program. The Journal of Technology, Learning, and Assessment, 3 (6).

Pommerich, M. (2004). Developing Computerized Versions of Paper-and-Pencil Tests: Mode Effects for Passage-Based Tests. The Journal of Technology, Learning, and Assessment, 2 (6).

R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available from http://www.R-project.org.

Wang, S., Jiao, H., Young, M. J., Brooks, T. & Olson, J. (2008). Comparability of Computer-Based and Paper-and-Pencil Testing in K-12 Reading Assessments: A Meta-Analysis of Testing Mode Effects. Educational and Psychological Measurement, 68 (1).

Author Information

Sarah Bürger (presenting / submitting)

German Institute for International Educational Research (DIPF)

Frankfurt

Ulf Kröhne

German Institute for International Educational Research (DIPF), Germany

Frank Goldhammer