Session Information
09 ONLINE 25 A, Tackling Challenges Associated with Electronic Assessment of Reading Proficiency
Paper Session
Contribution
An important role of educational measurement is to inform the classification of individuals for decisions on, for example, professional licensure, admissions, or placement into educational courses of study. Classification decisions are often high stakes, as “errors in classification may lead individuals to be deprived of opportunities such as well-deserved educational or career development”. Despite these high stakes, misclassification of individuals is common and may result from measurement errors associated with sampling, equating, the assignment of cut scores, standard setting methods, and standard setting committees. Classification of individuals whose assessment scores fall close to decision cutoff points is particularly critical because misclassification of such individuals is highly likely. It is therefore incumbent upon test developers to provide evidence of classification accuracy (CA) and classification consistency (CC).
In second language assessment, proficiency testing is probably the area where classification has the largest impact. Increasingly, language proficiency tests are used to determine prospective students’ ability to follow English-medium instruction at the undergraduate level. Turkey is no exception, and the growing number of English-medium universities in this context has led to increased scrutiny of the development of institutional English language proficiency tests.
Researchers have commented on the high probability of misclassification involved in traditional paper-based tests (PBTs) and suggested that computer adaptive testing (CAT) may reduce misclassification by selecting the most informative items in an item bank to increase discrimination around cutoff points, and hence enhance the validity of test-based classification decisions.
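To make the item-selection idea concrete, the following is a minimal sketch of maximum-information item selection near a cutoff, assuming a 2PL IRT model and a hypothetical item bank; the abstract does not specify the model or selection rule actually used, so all names and parameters here are illustrative.

```python
import numpy as np

def fisher_information_2pl(a, b, theta):
    """Fisher information of a 2PL item with discrimination a and difficulty b at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def most_informative_item(item_bank, theta, administered):
    """Index of the not-yet-administered item with maximum information at theta.

    item_bank: array of shape (n_items, 2) holding (a, b) parameters (hypothetical).
    """
    info = np.array([
        fisher_information_2pl(a, b, theta) if i not in administered else -np.inf
        for i, (a, b) in enumerate(item_bank)
    ])
    return int(np.argmax(info))

# Illustration with a hypothetical 100-item bank and a cutoff at theta = 0.0:
rng = np.random.default_rng(0)
bank = np.column_stack([rng.uniform(0.5, 2.0, 100),    # discrimination (a)
                        rng.uniform(-2.5, 2.5, 100)])  # difficulty (b)
print(most_informative_item(bank, theta=0.0, administered=set()))
```

Targeting items that are most informative near the decision point concentrates measurement precision exactly where the pass–fail classification is made, which is the mechanism the cited researchers appeal to.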
CA and CC are two important concepts that define the precision of classifications. CA refers to “the extent to which the true classifications of examinees agree with the observed classifications”. CC is “the rate at which the classification decision will be the same on two identical and independent administrations of the test”. Both indices evaluate the classification performance of test takers by taking into account the measurement error associated with ability estimates.
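One common way to make these definitions operational, and the approach cited later in the Method (Rudner, 2005), expresses both indices through a normal approximation to the error in each ability estimate. The formulation below is a sketch; the notation (category boundaries c_k, estimate θ̂_i, standard error SE_i) is introduced here for illustration rather than taken from the abstract.

```latex
% For examinee i with ability estimate \hat\theta_i and standard error SE_i,
% and category boundaries c_0 = -\infty < c_1 < \dots < c_K = +\infty,
% the probability that the true ability falls in category k is
p_{ik} = \Phi\!\left(\frac{c_k - \hat\theta_i}{SE_i}\right)
       - \Phi\!\left(\frac{c_{k-1} - \hat\theta_i}{SE_i}\right).

% Expected classification accuracy: agreement with the observed category k_i
CA = \frac{1}{N}\sum_{i=1}^{N} p_{i\,k_i}

% Expected classification consistency: probability of landing in the same
% category on two independent administrations
CC = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} p_{ik}^{2}
```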
Despite the increasingly prevalent use of CATs to classify examinees, CA and CC have predominantly been discussed with reference to linear tests, and these indices remain relatively unexplored in the CAT literature. In addition, CAT-based classification has not been investigated with data obtained from real examinees, and the effect of systematic manipulation of cutoff points on CA and CC in CAT is currently unclear.
Thus, there is a need for research that examines classification with multiple cutoff points on typical IRT-based CATs, as the CAT literature has so far focused mainly on estimating ability. Furthermore, the location of cutoff points on the ability continuum has an important impact on the classification performance of CAT: classification performance depends on the cutoff location and increases when the cutoff point is located toward the extremes of the ability continuum.
Many researchers use generated data to examine the classification performance of variable-length CAT. However, in practice not every CAT is designed in a variable-length format. Overall, test termination represents a key consideration in CAT design that may influence the classification of examinees.
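As a minimal, hypothetical sketch of the two termination rules compared in this study (fixed length vs. a standard-error threshold), the helper below shows how a stopping decision could be coded; the function and its parameters are illustrative and are not taken from the study’s software.

```python
def should_terminate(n_administered, current_se, max_items=None, se_threshold=None):
    """Decide whether to stop a CAT under either termination rule.

    Fixed-length rule: stop after max_items items (e.g., 10, 15, 20, 25, or 30).
    Standard-error rule: stop once the SE of the ability estimate falls below
    se_threshold (e.g., 0.5, 0.4, 0.3, 0.2, or 0.1).
    """
    if max_items is not None and n_administered >= max_items:
        return True
    if se_threshold is not None and current_se < se_threshold:
        return True
    return False
```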
The research questions are:
- To what extent do CA/CC values differ between PBT and CAT versions of an English reading subtest simulated with different test termination rules?
- What is the effect of the location of the cutoff point on CA/CC values when a binary pass–fail decision is required?
- How do CA/CC values under multiple cutoff points compare with those obtained under single-cutoff scenarios?
Method
To answer RQ1, we calculated CA/CC values for different CAT scenarios from the test data to compare classification performance between the CAT and PBT versions of the reading subtest. To answer RQ2, we identified relevant pass–fail cutoff points on the ability continuum and calculated CA/CC values to determine classification performance at these points. To answer RQ3, we conducted an analysis involving multiple cutoff points simultaneously to investigate the effect of using more than one cutoff point on CA/CC.

The study was conducted at a non-profit university in Turkey. The data were drawn from students at the university’s English language preparatory school (N = 1,182), which produces and administers the institutional English proficiency test. The data come from the PBT version of the proficiency test’s reading subtest, because examinee item responses were made available by the university only for this subtest. The reading subtest has three parts comprising a total of six reading texts with multiple-choice items that assess the students’ reading skills.

First, the typical premises of CAT (i.e., a reduction in the number of items administered and individual reliability estimates) were examined under different test termination rules. This first phase involved post-hoc (real data) simulations, a common research strategy for investigating the feasibility of CAT in a given situation. CATs terminated after 10, 15, 20, 25, and 30 items were simulated. In addition, five SE thresholds (i.e., below 0.5, 0.4, 0.3, 0.2, and 0.1), corresponding to alpha values from .75 to .99, were used to terminate the CAT simulations.

The second stage involved classification analysis of the simulation results. Different conditions based on the two termination rules were tested (i.e., fixed-length termination and standard error termination). For each examinee, final ability estimates and associated standard errors were calculated; for comparison, the same quantities were obtained for the real PBT. CA/CC values were estimated with the method proposed by Rudner (2001, 2005) (see the sketch below). To systematically examine the effect of different cutoff points on the ability continuum, nine cutoff values were set from the 10th to the 90th percentile in increments of 10, using percentile ranks for both the CATs and the PBT. An additional classification analysis was conducted using five cutoff points to investigate the potential of the CAT to classify individuals according to the six levels of the English preparatory program.
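The following is a minimal sketch of the classification stage under the Rudner-style approach named above: given each examinee’s final ability estimate and standard error, it computes expected CA and CC for cutoffs placed at percentiles of the estimated abilities. The data are simulated here, and the implementation details (normal approximation via scipy, variable names) are assumptions made for illustration rather than the study’s actual code.

```python
import numpy as np
from scipy.stats import norm

def rudner_ca_cc(theta_hat, se, cut_points):
    """Rudner-style expected classification accuracy (CA) and consistency (CC).

    theta_hat : array of final ability estimates (one per examinee)
    se        : array of the corresponding standard errors
    cut_points: cutoff values on the ability scale (one or more)
    """
    cuts = np.sort(np.atleast_1d(cut_points))
    bounds = np.concatenate(([-np.inf], cuts, [np.inf]))
    # Probability that each examinee's true ability falls in each category.
    z_upper = (bounds[1:][None, :] - theta_hat[:, None]) / se[:, None]
    z_lower = (bounds[:-1][None, :] - theta_hat[:, None]) / se[:, None]
    p = norm.cdf(z_upper) - norm.cdf(z_lower)
    # Observed category of each examinee, based on the point estimate.
    observed = np.searchsorted(cuts, theta_hat, side="right")
    ca = p[np.arange(len(theta_hat)), observed].mean()
    cc = (p ** 2).sum(axis=1).mean()
    return ca, cc

# Hypothetical example: one cutoff at the 60th percentile of the ability estimates.
rng = np.random.default_rng(1)
theta_hat = rng.normal(0.0, 1.0, 1182)   # simulated final ability estimates
se = np.full(1182, 0.3)                   # e.g., an SE-based termination rule at 0.3
print(rudner_ca_cc(theta_hat, se, np.percentile(theta_hat, 60)))
```

The same function applies unchanged whether one cutoff or several are supplied, which is how single-cutoff and multiple-cutoff conditions can be compared within one analysis.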
Expected Outcomes
Results show that the typical premises of CAT were observed. When one cutoff point was used at different locations on the ability continuum, CA and CC estimates for all CATs were higher than 75%. In all CATs, CA estimates were slightly higher than CC estimates. CA/CC estimates showed an upward trend at the lower and upper tails of the ability estimates. Fixed-length CATs produced differing CA/CC values around the middle ability range, whereas SE-based CATs produced relatively more similar CA/CC estimates. At lower and higher ability levels, CAT CA/CC values were similar to those of the PBT; however, around the middle ability levels, the PBT recorded higher CA/CC values. CAT scenarios with fewer items or higher SE thresholds were associated with relatively lower CA/CC values around the middle ability ranges.

With five cutoff points, a similar pattern was observed. CATs produced CA estimates greater than 85% and CC values higher than 80% in the middle ability range. For fixed-length CATs, terminating after 30 items produced slightly higher CA/CC values, and for CATs terminated after a prespecified SE threshold, CA/CC values increased slightly in all ability ranges. Unlike the nine-cutoff analysis, CA/CC values differed around the high and low ability levels: the higher the ability estimate, the higher the CA/CC estimates were for all CATs, regardless of the termination rule applied. For both nine and five cutoff points, CAT classification performance was equal to or slightly lower than that of the PBT version.

When nine and five cutoff points were applied simultaneously, CA/CC values decreased substantially compared with classifications based on a single cutoff score: classifications with a single cutoff score recorded CA/CC values ranging from 75% to 95%, whereas simultaneous classification produced CA/CC values between 25% and 60%.
References
AERA, APA, & NCME. (2014). Standards for educational and psychological testing. American Educational Research Association.
Babcock, B., & Weiss, D. J. (2009). Termination criteria in computerized adaptive tests: Variable length CATs are not biased. In D. J. Weiss (Ed.), Proceedings of the 2009 GMAC Conference on Computerized Adaptive Testing (pp. 1–21). http://iacat.org/sites/default/files/biblio/cat09babcock.pdf
Cheng, Y., & Morgan, D. (2012). Classification accuracy and consistency of computerized adaptive testing. Behavior Research Methods, 45(1), 132–142. https://doi.org/10.3758/s13428-012-0237-6
Davey, T., & Pitoniak, M. J. (2006). Designing computerized adaptive tests. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 543–573). Lawrence Erlbaum.
Diao, H., & Sireci, S. G. (2018). Item response theory-based methods for estimating classification accuracy and consistency. Journal of Applied Testing Technology, 19(1), 20–25. http://www.jattjournal.com/index.php/atp/article/view/131016
Dunkel, P. (1999). Considerations in developing or using second/foreign language proficiency computer-adaptive tests. Language Learning and Technology, 2(2), 77–93. https://doi.org/10125/25044
Eckes, T. (2017). Setting cut scores on an EFL placement test using the prototype group method: A receiver operating characteristic (ROC) analysis. Language Testing, 34(3), 383–411. https://doi.org/10.1177/0265532216672703
Eggen, T. J. H. (2009). Three-category adaptive classification testing. In W. van der Linden & C. Glas (Eds.), Elements of adaptive testing. Statistics for social and behavioral sciences (pp. 373–387). Springer. https://doi.org/10.1007/978-0-387-85461-8_19
Eggen, T. J. H. M., & Straetmans, G. J. J. M. (2000). Computerized adaptive testing for classifying examinees into three categories. Educational and Psychological Measurement, 60(5), 713–734. https://doi.org/10.1177/00131640021970862
Kalender, İ. (2015). Simulate_CAT: A computer program for post-hoc simulation for computerized adaptive testing. Journal of Measurement and Evaluation in Education and Psychology, 6(1), 173–176. https://doi.org/10.21031/epod.15905
Kim, D., Choi, S. W., Um, K. R., & Kim, J. (2006, April). A comparison of methods for estimating classification consistency [Paper presentation]. Annual Meeting of the National Council on Measurement in Education, San Francisco, CA, United States.
Lathrop, Q. N., & Cheng, Y. (2013). Two approaches to estimation of classification accuracy rate under item response theory. Applied Psychological Measurement, 37(3), 226–241. https://doi.org/10.1177/0146621612471888
Lee, W. (2010). Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47(1), 1–17. https://doi.org/10.1111/j.1745-3984.2009.00096.x
Lee, W., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26(4), 412–432. https://doi.org/10.1177/014662102237797
Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment, Research & Evaluation, 10(13), 1–4. https://doi.org/10.7275/56a5-6b14