Session Information
09 SES 12 B, Reimagining Assessment Practices and Teacher Autonomy
Paper Session
Contribution
Student Teaching Evaluation (STE) is the procedure by which teaching performance is measured and assessed through questionnaires administered to students. Typically, these questionnaires or scales refer to the teaching practices of academic staff and are administered in one of the final class meetings of the semester. From a practical standpoint, the primary purpose of the procedure is to satisfy universities' obligation to report STE results to quality assurance agencies. The other main objective of STE, and certainly the most important one from a pedagogical perspective, is to provide teachers with feedback on their teaching practices.
Previous studies on this topic present arguments both for and against the validity and utility of STE. On one hand, some studies suggest that STE results are influenced by external variables such as the teacher's gender or ethnicity (e.g., Boring, 2017), lenient grading (e.g., Griffin, 2004), or even the teacher's personality (e.g., Clayson & Sheffet, 2006).
On the other hand, published work shows that STE scales are valid and useful (e.g., Hammonds et al., 2017; Wright & Jenkins-Guarnieri, 2012). Furthermore, when STE scales are rigorously developed and validated, as is the case with the SEEQ (Marsh, 1982, 2009), there is consistent agreement and evidence that STE scale scores are multidimensional, precise, valid, and relatively unaffected by external variables (Marsh, 2007; Richardson, 2005; Spooren et al., 2013).
Although this debate was most active in the 1970s and the evidence leaned in favor of STE validity (Richardson, 2005; Marsh, 2007), a recent meta-analysis (Uttl et al., 2017) presented evidence that seriously threatens the validity of STE results: it found no relationship between STE results and student performance levels. This relationship is vital to the debate on STE validity, since if STE results accurately reflect good or efficient teaching, then teachers rated as more effective should facilitate higher performance among their students.
In light of the above, and with reference to the results of the meta-analysis by Uttl et al. (2017), the present study investigates whether the relationship between STE results and student learning/performance is stronger when the STE scale used is more rigorously developed and validated. For this purpose, a multilevel meta-analysis was conducted, allowing multiple effect sizes to be considered for each included study.
The results of this study can help nuance the picture of the validity of STE scales by showing whether scales developed and validated in accordance with field standards measure the quality of teaching more accurately and precisely. This research can also help identify which psychometric characteristics of STE scales contribute to a better measurement of teaching efficiency/effectiveness.
Therefore, the research questions guiding the present study are as follows:
- What is the average effect size of the relationship between STE results and student performance across all multi-section STE studies published to date?
- Does the average effect size of this relationship differ depending on the validity evidence for the STE scales used?
- Does the average effect size of this relationship differ depending on the content of the dimensions of the STE scales used?
- Does the average effect size of this relationship differ depending on the observability of the teaching behaviors described in the items of the STE scales used?
Method
The present study is a multilevel meta-analysis of the relationship between STE results and student performance in multi-section STE studies, and of how this relationship is moderated by psychometric characteristics of the STE scales used in these studies: the level and type of validity evidence, the content of the dimensions, and the level of observability/clarity of the items.

To be included in this meta-analysis, a study had to meet the following inclusion criteria:
1. It presents correlational results between STE results and student performance.
2. It analyzes the relationship between STE results and student performance across multiple sections of the same course (“multi-section STE studies”).
3. Students completed the same STE scale and the same performance assessment tests.
4. Student performance was measured through objective assessments focusing on actual learning, not on students' perceptions of it.
5. The correlation between STE results and student performance was estimated using data aggregated at the section level, not at the individual student level.

The literature search was conducted through three procedures: 1) analysis of the reference lists of similar meta-analyses; 2) examination of all articles citing Uttl et al. (2017); and 3) use of a search algorithm in the Academic Search Complete, Scopus, PsycINFO, and ERIC databases. After screening the abstracts and reading the full text of promising studies, 43 studies meeting the inclusion criteria were identified and extracted.

To code the level of validity evidence of the STE measures used, we adapted a framework of psychometric evaluation criteria proposed by Hunsley and Mash (2008), also taking into account the recommendations of Onwuegbuzie et al. (2009) and of AERA, APA, and NCME (2014). To code the level of observability/clarity of the items composing the STE scales used in the analyzed studies, we created a coding grid based on Murray (2007), which explains the importance of using highly observable items to reduce the subjectivity of the students responding to them.

The data were analyzed in R (metafor package) using the multilevel meta-analysis technique, because most of the included studies report multiple effect sizes, usually one for each dimension of the STE scale. This approach yields more accurate average-effect estimates because it preserves the original structure of the data reported in the primary studies, with effect sizes nested within studies, as illustrated in the sketch below.
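The following is a minimal sketch of how such a multilevel model can be fitted with the metafor package in R. The data frame and all variable names (study_id, es_id, ri, ni) are hypothetical placeholders for illustration, not the actual dataset of this study.

```r
# Minimal sketch of a three-level meta-analysis with metafor.
# The data frame and its columns (study_id, es_id, ri, ni) are
# hypothetical: one row per effect size, several rows per study.
library(metafor)

dat <- data.frame(
  study_id = c(1, 1, 2, 2, 3),          # study identifier
  es_id    = 1:5,                        # effect-size identifier
  ri       = c(0.31, 0.25, 0.12, 0.18, 0.40),  # section-level correlations
  ni       = c(14, 14, 30, 30, 21)       # number of sections per correlation
)

# Convert correlations to Fisher's z (variance-stabilizing transform);
# escalc() adds the effect sizes (yi) and sampling variances (vi).
dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = dat)

# Three-level model: effect sizes nested within studies.
res <- rma.mv(yi, vi, random = ~ 1 | study_id / es_id, data = dat)
summary(res)

# Back-transform the pooled Fisher's z to a correlation.
predict(res, transf = transf.ztor)
```

Specifying the random effects as `~ 1 | study_id / es_id` treats effect sizes as nested within studies, which is what allows multiple correlations per study to be combined without artificially inflating precision.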
Expected Outcomes
The obtained results suggest that:
1. STE scales with more validity evidence tend to measure teaching effectiveness better.
2. Some dimensions are more suitable than others for measuring teaching effectiveness correctly; for example, clarity of presentation, instructor enthusiasm, interaction with students, and availability for support had the strongest relationships with performance.
3. The degree of observability of the items composing the STE scales is a major factor in how accurately these scales measure teaching effectiveness.

Regarding the level of observability, the items contained in the STE scales were divided into three categories (low/medium/high observability), and the relationship between STE results and student performance was compared across categories. As expected, the moderating effect is significant: the correlations obtained differ significantly across the three categories of studies. The strongest relationships are found for items with a high degree of observability, and as the degree of observability decreases, the correlation between STE results and student performance also weakens significantly (a sketch of this moderator analysis is given at the end of this section).

These results can help nuance the picture of the validity of STE scales, suggesting that scales developed and validated in accordance with the standards of the field measure the quality of teaching more accurately and precisely. The proposed dimensionality and the level of observability of the items are therefore of major importance in the development of any STE scale. These recommendations can be useful in any process of developing or adapting an STE scale for use in quality assurance of university teaching.
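As an illustration of the moderator analysis described above, the sketch below adds an observability category as a moderator to the hypothetical model from the Method section. The observability column and its values are assumed placeholders, not the study's actual coding.

```r
# Continues the hypothetical `dat` from the previous sketch
# (with yi and vi already computed by escalc).
dat$observability <- factor(
  c("high", "medium", "low", "medium", "high"),
  levels = c("low", "medium", "high")
)

# Add the moderator; the omnibus test of moderators (QM) indicates
# whether the pooled correlation differs across the three categories.
res_mod <- rma.mv(yi, vi,
                  mods   = ~ observability,
                  random = ~ 1 | study_id / es_id,
                  data   = dat)
summary(res_mod)
```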
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Boring, A. (2017). Gender biases in student evaluations of teaching. Journal of Public Economics, 145, 27–41.
Clayson, D. E., & Sheffet, M. J. (2006). Personality and the student evaluation of teaching. Journal of Marketing Education, 28, 149–160.
Griffin, B. W. (2004). Grading leniency, grade discrepancy, and student ratings of instruction. Contemporary Educational Psychology, 29, 410–425.
Hammonds, F., Mariano, G. J., Ammons, G., & Chambers, S. (2017). Student evaluations of teaching: Improving teaching quality in higher education. Perspectives: Policy and Practice in Higher Education, 21(1), 26–33.
Hunsley, J., & Mash, E. J. (2008). Developing criteria for evidence-based assessment: An introduction to assessments that work. In J. Hunsley & E. J. Mash (Eds.), A guide to assessments that work (pp. 3–14). New York, NY: Oxford University Press.
Marsh, H. W. (2007). Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases and usefulness. In R. P. Perry & J. C. Smart (Eds.), The scholarship of teaching and learning in higher education: An evidence-based perspective (pp. 319–383). Dordrecht: Springer.
McPherson, M. A., Todd Jewell, R., & Kim, M. (2009). What determines student evaluation scores? A random effects analysis of undergraduate economics classes. Eastern Economic Journal, 35, 37–51.
Onwuegbuzie, A. J., Daniel, L. G., & Collins, K. M. (2009). A meta-validation model for assessing the score-validity of student teaching evaluations. Quality & Quantity, 43(2), 197–209.
Richardson, J. T. E. (2005). Instruments for obtaining student feedback: A review of the literature. Assessment & Evaluation in Higher Education, 30(4), 387–415.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598–642.
Spooren, P., Vandermoere, F., Vanderstraeten, R., & Pepermans, K. (2017). Exploring high impact scholarship in research on student's evaluation of teaching (SET). Educational Research Review, 22, 129–141.
Uttl, B., White, C. A., & Gonzalez, D. W. (2017). Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22–42.
Wright, S. L., & Jenkins-Guarnieri, M. A. (2012). Student evaluations of teaching: Combining the meta-analyses and demonstrating further evidence for effective use. Assessment & Evaluation in Higher Education, 37(6), 683–699.