Session Information
16 SES 07 A, ICT, Language Learning and Media Literacy
Paper Session
Contribution
In the near future, advanced generative technologies, including ChatGPT and other services built on large language models (LLMs), may greatly impact the field of education and the role of teachers within it. In particular, chatbots can perform four roles: interlocutor, content provider, teaching assistant, and evaluator [1].
A notable characteristic of large language models is their capacity for further training: the initial model can be adapted and refined to cater to a specific subject area. In particular, an LLM can be fine-tuned on the written works of specific authors, enabling the creation of a “digital counterpart” of a real historical figure.
The application of LLMs holds significant potential for assisting both students and teachers in their work with texts. For students, an LLM can serve as a reviewer of creative assignments, offering guidance by flagging obvious and serious mistakes. Likewise, teachers can use an LLM to conduct preliminary assessments of students' work and identify areas that require further educational attention [2]. This may be particularly useful when evaluating creative essays, a genre known for its concise format and flexible style of presentation. Although essays have a flexible structure, they generally include an introduction, thesis statement, argumentation, and conclusion.
This research investigates the implementation of an LLM as a personal assistant in this context. To train the LLM on specific data and create a “digital counterpart,” several tasks need to be accomplished:
- Gathering and preprocessing a dataset.
- Establishing evaluation criteria and annotating the dataset accordingly.
- Identifying educational shortcomings in LLM.
- Collecting and constructing a training set based on the “question-answer” principle for further training of the large language model.
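The last step above, assembling a “question–answer” training set, can be sketched as follows. The record fields, file name, and example data are illustrative assumptions, not the authors' actual format; the `BALLS`/`BALLS_DESCRIPTION` labels mirror the output fields named in the prompt described in the Method section.

```python
import json

# Hypothetical annotated records: essay text plus an expert score and
# rationale for one criterion (illustrative data, not the real dataset).
essays = [
    {
        "text": "I chose this profile because ...",
        "criterion": "Expression of the author's position",
        "score": 2,
        "rationale": "The author states a clear position in the opening paragraph.",
    },
]

# Build question-answer pairs: the question embeds the essay and the
# criterion; the answer is the expert score with its justification.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for e in essays:
        pair = {
            "question": (
                f"Evaluate the following essay on the criterion "
                f"'{e['criterion']}' (scale 0-2):\n{e['text']}"
            ),
            "answer": f"BALLS: {e['score']}\nBALLS_DESCRIPTION: {e['rationale']}",
        }
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Each JSONL line then serves as one supervised example for further training of the model.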
The primary research focuses are the annotation criteria required for subsequent training and the potential limitations of LLMs for educational purposes.
Method
To evaluate the LLM’s effectiveness, a dataset of essays on two topics was prepared. The first topic asked applicants to explain their reasons for selecting a specific profile for master’s degree admission and to discuss research directions within that profile. The second topic covered entrance-test themes such as “Socio-psychological mechanisms of the influence of the additional education system on the child giftedness development”, “Mentoring as a method of developing outstanding abilities of students with signs of giftedness”, and “Modern domestic concepts of giftedness”. A total of 80 essays were analysed for each topic.
Evaluation criteria were established and rated on a scale of 0 to 2:
- Expression of the author’s position regarding the presented problem or topic.
- Concise presentation of key points and theses.
- Well-reasoned grounds for profile selection and research direction (first topic only).
Interaction with the LLM-powered chatbot takes place over an HTTP API, with prompt instructions used to issue the tasks. Through several iterations, a final prompt was refined to resolve issues and ensure the desired response from the chatbot: “You are a text evaluation system. You have the text and the criteria by which you need to make an assessment. Evaluate the text based on the criteria, based solely on the criteria given. You should only use the attached criteria. Set the final number of points (‘BALLS’) and describe why you set exactly such an assessment (‘BALLS_DESCRIPTION’) using only the presence of criteria in the text. Don’t try to make up the answer”.
To evaluate the accuracy [3] of the chatbot’s results, the Mean Absolute Error (MAE) was used as the main metric, along with the 75th quantile of absolute error (AE_75P). Based on the data collected, the model deviates from the expert scores by an average of one point for most criteria.
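The two metrics above can be sketched in a few lines. The score lists are illustrative, not the study’s data; the quantile uses a simple nearest-rank method, which may differ from the interpolation the authors used.

```python
# Parallel lists of expert and chatbot scores for one criterion
# (illustrative data on the 0-2 scale used in the study).
expert_scores = [2, 1, 0, 2, 1, 2, 0, 1]
chatbot_scores = [2, 2, 1, 2, 2, 1, 1, 1]

abs_errors = sorted(abs(e - c) for e, c in zip(expert_scores, chatbot_scores))

# Mean Absolute Error: average deviation of the chatbot from the experts.
mae = sum(abs_errors) / len(abs_errors)

# 75th quantile of absolute error (nearest-rank on the sorted errors).
ae_75p = abs_errors[int(0.75 * (len(abs_errors) - 1))]

print(mae, ae_75p)  # 0.625 1
```

A low MAE with a small AE_75P indicates the chatbot rarely strays more than one point from the expert score.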
During grading, it was noticed that the chatbot often gives higher scores than the experts, deviating from their grade distribution. To investigate this, the Pearson contingency coefficient was calculated to analyse the correlation between the nominal indicators X and Y; however, the analysis found no evidence of consistent overestimation. To evaluate the level of agreement among the experts, including the chatbot, Kendall’s coefficient of concordance was calculated. This coefficient, ranging from 0 to 1, quantifies the consistency of expert opinions. The analysis concluded that there is minimal agreement between the ratings of the experts and the chatbot.
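For reference, a minimal sketch of Kendall’s coefficient of concordance W, assuming each rater has ranked the same set of essays and ignoring tie correction (which a full implementation would need):

```python
def kendalls_w(ratings):
    """Kendall's W for m raters ranking n items (no tie correction).

    ratings: list of per-rater rank lists, each a permutation of 1..n.
    Returns a value from 0 (no agreement) to 1 (perfect agreement).
    """
    m, n = len(ratings), len(ratings[0])
    # Sum of the ranks each item received across raters.
    rank_sums = [sum(r[i] for r in ratings) for i in range(n)]
    mean = sum(rank_sums) / n
    # S: squared deviations of rank sums from their mean.
    s = sum((rs - mean) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three raters in perfect agreement yield W = 1.0.
print(kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))  # 1.0
```

A W near 0, as reported here between the experts and the chatbot, means the rankings are essentially unrelated.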
Expected Outcomes
Pre-trained large language models in the form of chatbots can function as teaching assistants by conducting initial reviews of essays and providing feedback on how to correct and enhance the work. This type of solution can be particularly beneficial for teachers, as it allows them to evaluate students’ work efficiently and generate a set of basic comments addressing common mistakes, significantly reducing the teacher’s workload and saving valuable time. As experience with artificial intelligence systems shows, the quality of the feedback received depends on the precision of the request: it is crucial to establish clear evaluation criteria and avoid ambiguous statements in grading scales, such as “clear author’s position” or “partially presented author’s position.” To evaluate the quality of the chatbot’s feedback, multiple experts should assess each essay so that the consistency of their opinions can be checked. In the future, this system has the potential to become a valuable tool for the initial analysis of students’ work. The chatbot can benefit both students, allowing them to assess the quality of their work before submitting it to the teacher, and teachers, providing an additional perspective on the student’s work.
References
1. Jeon, J., & Lee, S. (2023). Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT. Education and Information Technologies, 28, 15873–15892. https://doi.org/10.1007/s10639-023-11834-1
2. Elkins, S., Kochmar, E., Serban, I., & Cheung, J. C. K. (2023). How useful are educational questions generated by large language models? In: Wang, N., Rebolledo-Mendez, G., Dimitrova, V., Matsuda, N., & Santos, O. C. (eds.), Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky. AIED 2023. Communications in Computer and Information Science, vol. 1831. Springer, Cham. https://doi.org/10.1007/978-3-031-36336-8_83