The Development and Implementation of a Multi-Stage Computer Based Assessment: Theory vs Implementation – An Investigation of Reliability and Validity
Author(s):
Frances Eveleigh (presenting / submitting) Chris Freeman (presenting)
Conference:
ECER 2016
Format:
Paper

Session Information

09 SES 11 C, Methodological Issues in Tests and Assessments

Paper Session

Time:
2016-08-25
17:15-18:45
Room:
NM-F107
Chair:
Jan-Eric Gustafsson

Contribution

The possibilities that Computer Based Testing brings to large-scale assessment broaden the scope, usage and utility of such programs.

National full-cohort assessments provide a wealth of diagnostic information for all key stakeholders. As large-scale assessment moves closer to online adaptive testing, the reliability and validity of the tests must remain paramount: the data researchers collect should show whether online assessments measure what they purport to measure, whether they measure a continuation or derivative of traditional pen-and-paper forms, or whether new variables are being tested.

Pre-existing scales, developed from proven paper-and-pencil instruments and analysis methods, embed a traditional view of the skills and content deemed assessable and measurable. Worth investigating is whether these scales transfer to the online medium, and whether student abilities are captured and measured appropriately and consequently reported accurately to stakeholders.

This study examines the design considerations in the development of an online assessment instrument, and compares the expected outcomes with the actual findings in terms of the distribution of student abilities. Grounded in current research, the structural and design development was initially guided by previous studies, then adapted and refined to serve the requirements of the users and stakeholders.

As part of the design and development process, a mode-effect study was undertaken to ascertain whether significant differences existed between pen-and-paper and computer screen formats. The results were analysed to determine whether the delivery mode affected the parameters of the items within the tests, and the choice of items included in the final instruments was determined using these data.
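To illustrate, the minimal sketch below shows one way such a mode-effect screening could be carried out: item difficulties calibrated separately from the paper and screen administrations are compared, and items whose difficulty shifts by more than a chosen tolerance are flagged. The item names, difficulty values and flagging tolerance are hypothetical, not taken from the study.

```python
# Hypothetical paper- and screen-calibrated item difficulties (logits).
PAPER_DIFFICULTY = {"item01": -1.20, "item02": 0.35, "item03": 1.10}
SCREEN_DIFFICULTY = {"item01": -1.15, "item02": 0.80, "item03": 1.05}

TOLERANCE = 0.30  # assumed flagging criterion in logits


def flag_mode_effect(paper, screen, tolerance=TOLERANCE):
    """Return items whose paper/screen difficulty estimates differ by more than `tolerance`."""
    flagged = {}
    for item, b_paper in paper.items():
        b_screen = screen.get(item)
        if b_screen is not None and abs(b_paper - b_screen) > tolerance:
            flagged[item] = b_paper - b_screen
    return flagged


if __name__ == "__main__":
    # item02 would be flagged here: its difficulty shifts by 0.45 logits between modes.
    print(flag_mode_effect(PAPER_DIFFICULTY, SCREEN_DIFFICULTY))
```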

Consideration was given to the branching design involving set modules, the optimum number of items within initial and subsequent modules, and specific subject domain constraints. As the structure of the assessments evolved using pre-calibrated items, the cut points were determined theoretically, using testlet information curves and the item-location-to-raw-score relationship derived from Rasch first principles as the underpinning methodologies.
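The following sketch, assuming dichotomous Rasch items with illustrative pre-calibrated difficulties, shows how a routing module's information curve and the Rasch expected-score relationship can suggest a raw-score cut point. It illustrates the general approach only; the difficulties and the resulting cut are not the actual values used in the study.

```python
import math

MODULE_DIFFICULTIES = [-1.0, -0.4, 0.0, 0.3, 0.9]  # illustrative item difficulties (logits)


def prob_correct(theta, b):
    """Rasch (1PL) probability of a correct response at ability theta for an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def module_information(theta, difficulties):
    """Testlet information at theta: sum of p*(1-p) over the module's items."""
    return sum(p * (1 - p) for p in (prob_correct(theta, b) for b in difficulties))


def expected_raw_score(theta, difficulties):
    """Expected raw score at theta: the Rasch first-principles link from ability to raw score."""
    return sum(prob_correct(theta, b) for b in difficulties)


if __name__ == "__main__":
    # Locate the ability at which the module is most informative, then express it as a raw-score cut.
    grid = [x / 10 for x in range(-30, 31)]
    peak = max(grid, key=lambda t: module_information(t, MODULE_DIFFICULTIES))
    print(f"information peaks near theta = {peak:.1f} logits")
    print(f"corresponding raw-score cut = {expected_raw_score(peak, MODULE_DIFFICULTIES):.1f}")
```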

The assessment scales built from previous assessment data have defined standards and cut scores, and suggest the proportions of students that will fall into the pre-defined Levels. Given the importance of the results to stakeholders, it is essential that the actual results be interrogated to determine whether the predicted outcome eventuates or whether, conversely, the online branching model explores a tangential variable. Are there issues of reliability and validity relative to the established scale?

This paper reports an investigation into whether a Multi-Stage Computer Based Assessment assigns ability estimates as accurately as traditional methods. It examines the use of the online instrument in calculating ability estimates for all students, and tests its validity against the results produced by pen-and-paper instruments.

It examines the appropriateness of the model selected, the cut points determined for branching decisions and, ultimately, the ranking of students and the assignment of ‘Levels’ based on their demonstrated ability on differing sets of items.

Data from a large-scale student assessment in a Middle Eastern region will be compared with a dataset from the online proof of concept, using a sample of students matched to the same curriculum-based outcomes and assessed with common items.

Using IRT analysis techniques (Rasch, 1960), the results of the online trial will be analysed and compared with data from the 2016 national assessment in the same domains and Grades.

The findings of this study will help guide the direction of the online assessment by providing recommendations for its future use.

Method

A cohort sample of approximately 9000 students per grade (equal gender balance) for each of the 2014 and 2015 testing periods provides the first data source and distribution of Levels. These data were used to determine the cut points for the current online assessment. The second data set is from a Proof of Concept study in which students undertake an online assessment that branches according to demonstrated ability on the pre-calibrated items. Student ability estimates in four different domains assessed in the 2016 national assessment will be extracted for matched students. These estimates will be measured using the partial credit model (Masters, 1988), an extension of the Rasch 1PL model (Rasch, 1960). A number of comparisons will be investigated, including: 1) a comparison of the proportion of students in pre-defined ability Levels by domain; 2) the mean ability of matched students in each Level; 3) the proportion of students in pre-defined ability Levels by gender; 4) a post-hoc evaluation of the efficacy of the cut points, with modelling of the outcomes.
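As an illustration of comparisons 1) and 2), the sketch below classifies ability estimates into Levels using a common set of cut points and summarises the proportion of students and mean ability per Level for the two data sources. The cut points, data layout and values are hypothetical, not the study's actual parameters.

```python
LEVEL_CUTS = [-1.0, 0.0, 1.0]  # assumed Level boundaries on the logit scale


def assign_level(theta, cuts=LEVEL_CUTS):
    """Assign a Level (1 .. len(cuts)+1) from an ability estimate."""
    return sum(theta >= c for c in cuts) + 1


def level_summary(abilities, cuts=LEVEL_CUTS):
    """Proportion of students and mean ability within each Level."""
    by_level = {}
    for theta in abilities:
        by_level.setdefault(assign_level(theta, cuts), []).append(theta)
    n = len(abilities)
    return {
        level: {"proportion": len(vals) / n, "mean_ability": sum(vals) / len(vals)}
        for level, vals in sorted(by_level.items())
    }


if __name__ == "__main__":
    paper_estimates = [-1.4, -0.2, 0.1, 0.8, 1.3]    # illustrative matched-student estimates
    online_estimates = [-1.1, -0.3, 0.4, 0.9, 1.6]
    print("paper :", level_summary(paper_estimates))
    print("online:", level_summary(online_estimates))
```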

Expected Outcomes

The results will be examined to determine whether there are any validity issues associated with the use of the Computer Based Assessment and the extent to which the outcomes of the CBA reflect the outcomes of a traditional paper-and-pencil assessment. The degree to which sub-groups (based on ability) are either advantaged or disadvantaged by the CBA model will also be investigated. At the macro level, the study will investigate the extent to which paper-and-pencil scales are directly transferable to computer-based scales. In addition, the information gleaned from these investigations will be used to refine the test design of the online assessment instruments, review the existing cut points, and compare the distributions of students in the pre-defined ‘Levels’. An additional benefit will be to provide recommendations for the future use of the instrument. The outcomes of this research and these analyses will be provided to the client and to the participants at the conference.

References

• Australian Council for Educational Research (2013). Analytical Report: Psychometric Analysis for the Trial of the Tailored Test Design. Melbourne: ACER, December. Prepared for the Australian Curriculum, Assessment and Reporting Authority.
• Holling, H. (2015). Keynote 6: CAT and optimal design for Rasch Poisson Counts Models. Keynote presentation at the IACT Conference, Cambridge.
• Masters, G.N. (1988). The Analysis of Partial Credit. Applied Measurement in Education, 1(4), 279-297.
• Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research.
• Wu, M.L., Adams, R.J., Wilson, M.R., & Haldane, S.A. (2007). ACER ConQuest Version 2: Generalised item response modelling software [computer program]. Camberwell: Australian Council for Educational Research.

Author Information

Frances Eveleigh (presenting / submitting)
Australian Council for Educational Research
Alexandria
Chris Freeman (presenting)
Australian Council for Educational Research, Australia
