Session Information
09 SES 12 B, Reimagining Assessment Practices and Teacher Autonomy
Paper Session
Contribution
We report ongoing research that assesses how well AI can evaluate teaching, which we define as “effective” to the degree that it helps students learn. Our current research builds on a body of prior work in which we assessed how well human judges performed the same task. Under varying conditions (length of instructional sample; instruction documented as video, audio, and transcript; and judgments based on intuition alone, high-inference rubrics, and low-inference rubrics), human judges demonstrated significant limitations. Experts and nonexperts did no better than chance when they relied solely on their intuitive judgment. Experts fared no better when using high-inference rubrics. However, experts and nonexperts were more accurate than chance when they used low-inference rubrics, and were just as accurate with transcripts of instruction as with video. Machines are very good at performing low-inference tasks, and AI in particular is very good at “understanding” written text, such as transcripts. Is AI better at judging teaching effectiveness from transcripts than humans? If so, should human judges be replaced by machines? We provide data that may help answer these questions, and engage our audience in a discussion of the moral dilemmas they pose.
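To make the comparisons with chance concrete, the following is a minimal sketch, in Python, of one standard way such a result can be tested: an exact binomial test of a judge's classification accuracy against the 50% rate expected by chance on a balanced high/low task. The counts are illustrative only, and we do not claim this reproduces the analyses in the studies cited below.

```python
# Minimal sketch: is a judge's high/low classification accuracy better than
# chance? Assumes a balanced set of lessons of known effectiveness, so the
# chance accuracy is 0.5. All counts below are illustrative, not study data.
from scipy.stats import binomtest

n_lessons = 40   # lessons classified by one judge (illustrative)
n_correct = 24   # correct high/low classifications (illustrative)

# Exact binomial test against the chance rate of 0.5
result = binomtest(n_correct, n=n_lessons, p=0.5)
print(f"accuracy = {n_correct / n_lessons:.2f}, p = {result.pvalue:.3f}")
```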
Method
We investigate two types of evaluative judgments: unstructured and structured. Unstructured judgments were investigated by asking subjects to “use what they know” to classify classroom instruction of known quality as being of either high or low effectiveness. Structured judgments were investigated by asking subjects to count the occurrences of six concrete teaching behaviors using the RATE rubric. The performance of two groups of subjects is compared: human judges and AI. The tasks given to human subjects are replications of experiments we previously conducted and published (Strong et al., 2011; Gargani & Strong, 2014, 2015). We are, therefore, able to compare the performance of AI and humans on the same tasks at the same time, as well as with the human judges in our previous studies. A further contribution of our work addresses the difficult problem of developing prompts that instruct an AI to complete the evaluation tasks. Our protocol is iterative: we developed and piloted prompts, revised them, piloted again, and so on until we were satisfied that any failure to complete a task well would not be attributable to weaknesses in the prompts. We developed our own criteria for prompts, which we will share. One hundred human subjects were recruited to serve as a benchmark for the AI, and they complete the tasks on an online platform. Comparisons of accuracy and reliability will be made across groups and tasks, providing a basis for judging the relative success of AI and human judges.
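As an illustration of how the planned reliability comparison might be computed, the sketch below estimates inter-rater agreement with an intraclass correlation, ICC(2,1) in the notation of Shrout and Fleiss (1979, cited below): a two-way random-effects model for the absolute agreement of single ratings. The ratings matrix is invented for illustration and is not our data.

```python
# Sketch of the reliability comparison: ICC(2,1) per Shrout & Fleiss (1979),
# computed over a ratings matrix (rows = lessons, columns = judges).
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): absolute agreement of single ratings, two-way random effects."""
    n, k = ratings.shape                 # n targets (lessons), k raters (judges)
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # rows (targets)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # columns (raters)
    sse = np.sum((ratings - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative only: 5 lessons, each scored by 3 judges on one behavior count
scores = np.array([[4, 5, 4], [2, 2, 3], [5, 5, 5], [1, 2, 1], [3, 4, 3]])
print(f"ICC(2,1) = {icc_2_1(scores):.2f}")
```

ICC(2,1) treats both lessons and judges as random samples, which matches a design in which the particular judges are interchangeable with others drawn from the same population.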
Expected Outcomes
We hypothesize that the use of lesson transcripts, rather than video or audio alone, will reduce sources of bias such that humans will more accurately distinguish between above-average and below-average teachers. We further hypothesize that AI will be more accurate than humans and can be successfully trained to produce reliable evaluations using a formal observation system.
References
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.
Strong, M. (2011). The highly qualified teacher: What is teacher quality and how do we measure it? New York: Teachers College Press.
Strong, M., Gargani, J., & Hacifazlioğlu, Ö. (2011). Do we know a successful teacher when we see one? Experiments in the identification of effective teachers. Journal of Teacher Education, 20(10), 1-16.