Session Information
09 SES 12 A, Theoretical and Methodological Issues in Tests and Assessments (Part 2)
Paper Session continues from 09 SES 08 A
Contribution
Student Evaluation of Teaching (SET), which is usually used to learn about students’ experiences with a course and its instructor’s performance at some point in the semester, seems to be the most widely used method of gathering data for what are supposedly both formative and summative evaluation purposes (Penny, 2003; Seldin, 1993; Zabaleta, 2007). SET is a performance assessment. Performance assessment typically requires the person being evaluated to display a performance and/or construct a response or product, and the quality of this output is judged by at least one evaluator/rater. The nature of performance assessment can therefore be considered subjective. “This long, and possibly fragile, interpretation–evaluation–scoring chain highlights the need to carefully investigate the psychometric quality of rater-mediated assessments. One of the major difficulties facing the researcher, and the practitioner alike, is the occurrence of rater variability.” (Eckes, 2009, p. 4). The term ‘rater variability’ generally refers to variability that is associated with characteristics of the raters (e.g., leniency, severity) and not with the performance of the person being evaluated (Eckes, 2009). In other words, rater variability that produces construct-irrelevant variance in person scores threatens the validity, reliability, and fairness of performance assessment (Lane & Stone, 2006; Messick, 1989). Needless to say, inter-rater reliability should be as high as possible in such an assessment process. Previous research in different settings shows that significant rater effects exist in rater-mediated performance assessment (Eckes, 2005).
Reliability
Although reliability is considered an important aspect of measurement, there is no consensus on the meaning of reliability or on how to calculate reliability coefficients. These problems become even more visible when the scores come from a performance assessment. There are many procedures/statistics for reporting the reliability of rater-mediated performance assessment results, and the resulting reliability coefficients may mean vastly different things. Consensus and consistency are the two broad categories of reliability coefficients. Consensus reliability (known as interrater agreement) refers to the degree to which two or more independent raters using the same scale provide the same rating of a particular person or object in identical observable situations. Consistency reliability (known as interrater reliability), in contrast, refers to the extent to which two or more raters provide the same relative ordering or ranking of particular persons or objects in identical situations. Both approaches can be used to examine rater similarity; however, it is important to recognize their differences.
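The distinction can be illustrated with a minimal sketch, assuming two hypothetical raters scoring the same eight persons on a 1–5 scale (the data below are invented purely for illustration): a systematically lenient rater can produce low exact agreement (consensus) while the rank ordering, and hence the consistency coefficient, remains high.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical ratings by two raters of the same eight persons on a 1-5 scale
rater_a = np.array([3, 4, 2, 5, 4, 3, 2, 4])
rater_b = np.array([4, 5, 3, 5, 5, 4, 3, 5])  # rater B is one point more lenient almost everywhere

# Consensus (interrater agreement): proportion of identical ratings
exact_agreement = np.mean(rater_a == rater_b)

# Consistency (interrater reliability): correlation of the two rating vectors
consistency, _ = pearsonr(rater_a, rater_b)

print(f"Exact agreement:     {exact_agreement:.2f}")  # about 0.12 -- low consensus
print(f"Pearson correlation: {consistency:.2f}")      # about 0.95 -- high consistency
```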
As an alternative to the two standard approaches mentioned above, Many-Facet Rasch Measurement (MFRM) was developed by Linacre (1989) as an extension of the Rasch model. The Rasch model, a one-parameter item response theory (IRT) model, has typically been used for the analysis of multiple-choice items. The model provides estimates of each examinee's ability and each item's difficulty and places them on a common equal-interval (logit) scale (Wright & Stone, 1979). The estimated parameters are sample independent; that is, estimates of person abilities do not depend on the specific sample of items used, and estimated item difficulties are likewise independent of the specific group of persons to whom the items were administered.
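For reference, one common way of writing an MFRM rating scale model with a rater facet (following the general Linacre/Eckes formulation; the exact facet structure used in this study is not specified here, so the notation is illustrative) is:

```latex
% Many-facet Rasch (rating scale) model with person, item, and rater facets -- illustrative notation
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \beta_i - \alpha_j - \tau_k
% P_{nijk}     : probability that rater j awards category k to person n on item i
% P_{nij(k-1)} : probability of the adjacent lower category k-1
% \theta_n     : ability (measure) of person n
% \beta_i      : difficulty of item i
% \alpha_j     : severity of rater j
% \tau_k       : threshold of category k relative to category k-1
```

Under this kind of model, rater severity is estimated as a separate facet on the same logit scale as person measures and item difficulties, which is what allows rater effects and potential bias to be examined directly.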
Both theoretical and psychometric issues remain unresolved for SET questionnaires. Studies have accumulated around two main concerns, which are in fact related to each other. The first concern focuses on whether students’ evaluations are valid and actually measure what we intend to measure, namely teaching effectiveness. The second concern focuses on the reliability of the measures and on potential sources of bias (Gursoy & Umbreit, 2005).
The purposes of this study are:
- to examine the internal (factorial) structure of the SET questionnaire used,
- to estimate the reliability of the student ratings,
- to explore students’ judging behavior using Many-Facet Rasch Measurement and to examine potential sources of bias.
Method
Expected Outcomes
References
Cohen, E. H. (2005). Student evaluations of course and teacher: Factor analysis and SSA approaches. Assessment & Evaluation in Higher Education, 30(2), 123-136.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197-221.
Eckes, T. (2009). Many-facet Rasch measurement. In S. Takala (Ed.), Reference supplement to the manual for relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment (Section H). Strasbourg, France: Council of Europe/Language Policy Division.
Gursoy, D., & Umbreit, W. T. (2005). Exploring students' evaluations of teaching effectiveness: What factors are important? Journal of Hospitality & Tourism Research, 29, 91-109.
Hoyt, W. T., & Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4, 403-424.
Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387-431). Westport, CT: American Council on Education/Praeger.
LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11(4), 815-852.
Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Penny, A. R. (2003). Changing the agenda for research into students' views about university teaching: Four shortcomings of SRT research. Teaching in Higher Education, 8(3), 399-411.
Seldin, P. (1993). The use and abuse of student ratings of professors. The Chronicle of Higher Education, 39(46), A40.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598-642.
Wright, B. D., & Linacre, J. M. (1994). Chi-square fit statistics. Rasch Measurement Transactions, 8(2), 360.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Zabaleta, F. (2007). The use and misuse of student evaluation of teaching. Teaching in Higher Education, 12(1), 55-76.