Session Information
09 SES 12 A, Theoretical and Methodological Issues in Tests and Assessments (Part 2)
Paper Session continues from 09 SES 08 A
Contribution
Student Evaluation of Teaching (SET), which is usually used to learn about students’ experiences with a course and its instructor’s performance at some point in the semester, seems to be the most widely used method of gathering data for what are supposedly both formative and summative evaluation purposes (Penny, 2003; Seldin, 1993; Zabaleta, 2007). SET is a performance assessment. Performance assessment typically requires the person being evaluated to display a performance and/or construct a response or product, and the quality of this output is judged by at least one evaluator/rater. The nature of performance assessment can therefore be considered subjective. “This long, and possibly fragile, interpretation–evaluation–scoring chain highlights the need to carefully investigate the psychometric quality of rater-mediated assessments. One of the major difficulties facing the researcher, and the practitioner alike, is the occurrence of rater variability.” (Eckes, 2009, p. 4). The term ‘rater variability’ generally refers to variability that is associated with characteristics of the raters (e.g., leniency, severity) and not with the performance of the person being evaluated (Eckes, 2009). In other words, rater variability that produces construct-irrelevant variance in person scores threatens the validity, reliability, and fairness of performance assessment (Lane & Stone, 2006; Messick, 1989). Needless to say, inter-rater reliability should be as high as possible in such an assessment process. Previous research in different settings shows that significant rater effects exist in rater-mediated performance assessment (Eckes, 2005).
Reliability
Although reliability is considered an important aspect of measurement, there is no consensus on the meaning of reliability or on how to calculate reliability coefficients. These problems become even more visible when the scores come from a performance assessment. There are many procedures/statistics for reporting the reliability of rater-mediated performance assessment results, and the resulting reliability coefficients may mean vastly different things. Consensus and consistency are the two broad categories of reliability coefficients. Consensus reliability (known as interrater agreement) refers to the degree to which two or more independent raters using the same scale provide the same rating of a particular person or object in identical observable situations. Consistency reliability (known as interrater reliability), in contrast, refers to the extent to which two or more raters provide the same relative ordering or ranking of particular persons or objects in identical situations. Both approaches can be used to examine rater similarity; however, it is important to recognize their differences.
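The distinction can be illustrated with a minimal sketch, assuming two hypothetical raters scoring the same eight persons on a 1–5 scale (the data below are invented purely for illustration): a systematically lenient rater can produce low exact agreement (consensus) while the rank ordering, and hence the consistency coefficient, remains high.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical ratings by two raters of the same eight persons on a 1-5 scale
rater_a = np.array([3, 4, 2, 5, 4, 3, 2, 4])
rater_b = np.array([4, 5, 3, 5, 5, 4, 3, 5])  # rater B is one point more lenient almost everywhere

# Consensus (interrater agreement): proportion of identical ratings
exact_agreement = np.mean(rater_a == rater_b)

# Consistency (interrater reliability): correlation of the two rating vectors
consistency, _ = pearsonr(rater_a, rater_b)

print(f"Exact agreement:     {exact_agreement:.2f}")  # about 0.12 -- low consensus
print(f"Pearson correlation: {consistency:.2f}")      # about 0.95 -- high consistency
```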
As an alternative to the two standard approaches mentioned above, Many-Facet Rasch Measurement (MFRM) was developed by Linacre (1989) as an extension of the Rasch model. The Rasch model, a one-parameter item response theory (IRT) model, has typically been used for the analysis of multiple-choice items. The model provides estimates of each examinee's ability and each item's difficulty and places them on a common equal-interval (logit) scale (Wright & Stone, 1979). The estimated parameters are sample independent; that is, estimates of person abilities do not depend on the specific sample of items used, and estimated item difficulties are likewise independent of the specific group of persons to whom the items were administered.
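For reference, one common way of writing an MFRM rating scale model with a rater facet (following the general Linacre/Eckes formulation; the exact facet structure used in this study is not specified here, so the notation is illustrative) is:

```latex
% Many-facet Rasch (rating scale) model with person, item, and rater facets -- illustrative notation
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \beta_i - \alpha_j - \tau_k
% P_{nijk}     : probability that rater j awards category k to person n on item i
% P_{nij(k-1)} : probability of the adjacent lower category k-1
% \theta_n     : ability (measure) of person n
% \beta_i      : difficulty of item i
% \alpha_j     : severity of rater j
% \tau_k       : threshold of category k relative to category k-1
```

Under this kind of model, rater severity is estimated as a separate facet on the same logit scale as person measures and item difficulties, which is what allows rater effects and potential bias to be examined directly.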
Both theoretical and psychometric issues remain unresolved for SET questionnaires. Studies have accumulated around two main concerns, which are in fact related to each other. The first concern focuses on whether students’ evaluations are valid and actually measure what we intend to measure, namely teaching effectiveness. The second concern focuses on the reliability of the measures and on potential sources of bias (Gursoy & Umbreit, 2005).
The purposes of this study are:
- to examine the internal (factorial) structure of the SET questionnaire used,
- to estimate the reliability of the student ratings,
- to explore students’ judging behavior using Many-Facet Rasch Measurement and to examine potential sources of bias.
Method
Expected Outcomes
References
Cohen, E. H. (2005). Student evaluations of course and teacher: Factor analysis and SSA approaches. Assessment & Evaluation in Higher Education, 30(2), 123-136.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197-221.
Eckes, T. (2009). Many-facet Rasch measurement. In S. Takala (Ed.), Reference supplement to the manual for relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment (Section H). Strasbourg, France: Council of Europe/Language Policy Division.
Gursoy, D., & Umbreit, W. T. (2005). Exploring students' evaluations of teaching effectiveness: What factors are important? Journal of Hospitality & Tourism Research, 29, 91-109.
Hoyt, W. T., & Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4, 403-424.
Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387-431). Westport, CT: American Council on Education/Praeger.
LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11(4), 815-852.
Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Penny, A. R. (2003). Changing the agenda for research into students' views about university teaching: Four shortcomings of SRT research. Teaching in Higher Education, 8(3), 399-411.
Seldin, P. (1993). The use and abuse of student ratings of professors. The Chronicle of Higher Education, 39(46), A40.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598-642.
Wright, B. D., & Linacre, J. M. (1994). Chi-square fit statistics. Rasch Measurement Transactions, 8(2), 360.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Zabaleta, F. (2007). The use and misuse of student evaluation of teaching. Teaching in Higher Education, 12(1), 55-76.