Recent Research on Student Ratings of Teaching

Student ratings of teaching as elicited in surveys typical in higher education[1] are

  • not reliable, statistically;
  • invalid as measures of student learning (no researcher disputes this);
  • invalid even as measures of student satisfaction with courses and instructors;
  • not reliable as sources of even basic factual information about students, courses and instructors (e.g., whether the instructor was habitually late for class, whether there was a required textbook);[2]
  • typically misrepresented in summaries provided;
  • rarely checked for satisficing (e.g., straight-lining) other than skipping;
  • rarely checked for non-response bias;
  • rarely if ever checked for lying; and
  • readily gamed by instructors.
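Several of the checks listed above (e.g., screening for straight-lining) are cheap to automate, which makes their rarity more striking. As a hedged illustration only, a minimal straight-lining screen in Python; the data and the `straight_lined` helper are invented, and real screening would also examine response times and reverse-coded items:

```python
# Illustrative sketch only: a minimal straight-lining (satisficing) screen.
# The survey data and the `straight_lined` helper are invented for this
# example; real screening would also use response times and reverse-coded items.

def straight_lined(responses):
    """True if every answered item received the identical rating."""
    answered = [r for r in responses if r is not None]  # None = skipped item
    return len(answered) > 1 and len(set(answered)) == 1

surveys = [
    [5, 5, 5, 5, 5, 5],     # same rating for every item: flagged
    [4, 5, 3, None, 4, 2],  # varied ratings, one skipped item: not flagged
    [3, 3, 3, 3, None, 3],  # straight-lined despite a skipped item: flagged
]

flags = [straight_lined(s) for s in surveys]
print(flags)  # [True, False, True]
```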

Despite the cautions universities often give about how to use survey results, those cautions can be ignored in practice, and the results are nonetheless typically at least one component of higher-stakes employee assessments (hiring, firing, retention, promotion, merit pay). The US EEOC’s Uniform Guidelines on Employee Selection Procedures (reflecting the court cases on disparate impact)[3] may imply that the results of these surveys are not sufficiently trustworthy to be used legally in such assessments.[4]

Some small studies from the 1990s gave good reason to be concerned about the validity and reliability of student ratings (not that these were the only studies that should have prompted deep reservations):

N. Ambady and R. Rosenthal, “Half a minute: Predicting teacher evaluations from thin slices of nonverbal behavior and physical attractiveness,” Journal of Personality and Social Psychology v64 (1993) 431–441.

Abstract: The accuracy of strangers’ consensual judgments of personality based on “thin slices” of targets’ nonverbal behavior was examined in relation to an ecologically valid criterion variable. In the 1st study, consensual judgments of college teachers’ molar nonverbal behavior based on very brief (under 30 sec) silent video clips significantly predicted global end-of-semester student evaluations of teachers. In the 2nd study, similar judgments predicted a principal’s ratings of high school teachers. In the 3rd study, ratings of even thinner slices (6 and 15 sec clips) were strongly related to the criterion variables. Ratings of specific micrononverbal behaviors and ratings of teachers’ physical attractiveness were not as strongly related to the criterion variable. These findings have important implications for the areas of personality judgment, impression formation, and nonverbal behavior. [emphasis added]

Wendy M. Williams & Stephen J. Ceci, “How’m I doing? Problems with Student Ratings of Instructors and Courses,” Change: The Magazine of Higher Learning v29 n5 (Sept/Oct 1997) 12–23.

[Ceci taught two sections of a large psychology course in parallel, using the skills he learned in an acting class in one section but not the other. Both sections took the same tests and performed equally on them, but in the section in which he used his acting skills, he received much higher student ratings. He had taught the course for many years and was able to deliver the same content almost word for word in each section as verified from recordings of the lectures.]

Much of any debate about student ratings of teaching (and the effect of their use in promoting grade inflation) should have ended in 2003:

Valen E. Johnson, Grade inflation: a crisis in college education (Springer, 2003) LB2368 .J65 2003 ISBN 038700125

Table of Contents

Acknowledgments (p. v)
1. Introduction (p. 1)
   Summary (p. 13)
2. The DUET Experiment (p. 15)
   DUET Survey Items (p. 19)
   Appendix: Issues of Nonresponse (p. 27)
3. Grades and Student Evaluations of Teaching (p. 47)
   Observational Studies (p. 51)
   Experimental Studies (p. 73)
   Summary (p. 81)
4. DUET Analysis of Grades and SETs (p. 83)
   Intervening Variables and Student Evaluations of Teaching (p. 85)
   Implications for Grading Theories (p. 94)
   Causal Effects of Student Grades on SETs (p. 101)
   Summary (p. 114)
   Appendix (p. 118)
      Standardization Procedures for Analyses of "Intervening Variables and SETs" (p. 118)
      Effects of Nonresponse in Analyses of "Intervening Variables and SETs" (p. 120)
      Regression Coefficients for Analyses of Causal Effects of Grades on SETs (p. 125)
5. Validity of SETs (p. 133)
   SET Development (p. 139)
   Toward Product Measures of Student Achievement (p. 150)
   Conclusions (p. 164)
   Appendix (p. 166)
6. Grades and Student Course Selection (p. 167)
   Analysis of the DUET Course Selection Data (p. 172)
   Effects of Sample Selection (p. 179)
   Extension to Selection Decisions Between Academic Fields (p. 188)
   Conclusions (p. 193)
7. Grading Equity (p. 195)
   Differential Grading Standards (p. 197)
   Methods for Grade Adjustment (p. 209)
   Discussion (p. 224)
   Appendix (p. 226)
      Pairwise Differences in Grades at Duke University (a la Goldman and Widawski [GW76]) (p. 226)
      Explanation of Achievement Index Adjustment Scheme (p. 226)
8. Conclusions (p. 233)
   Reform (p. 239)
Bibliography (p. 247)
Index (p. 259)

a brief preview of which is in:

VE Johnson, “Teacher Course Evaluations and Student Grades: An Academic Tango,” Chance v15 n3 (2002) 9–16.

But Johnson’s work was ignored or dismissed by most School of Education faculty, while the Wall Street Journal dismissed its conclusions as so obviously correct as to need no support from research.[5]

Some large studies published in 2009 and 2010 renewed debate:

Bruce A. Weinberg, Masanori Hashimoto, and Belton M. Fleisher, “Evaluating Teaching in Higher Education,” Journal of Economic Education v40 n3 (Summer 2009) 227–261.

Abstract: The authors develop an original measure of learning in higher education, based on grades in subsequent courses. Using this measure of learning, they show that student evaluations are positively related to current grades but unrelated to learning once current grades are controlled. They offer evidence that the weak relationship between learning and student evaluations arises, in part, because students are unaware of how much they have learned in a course. They conclude with a discussion of easily implemented, optimal methods for evaluating teaching.

“Our data cover more than 45,000 enrollments in almost 400 offerings of Principles of Microeconomics, Principles of Macroeconomics, and Intermediate Microeconomics over a decade at The Ohio State University. In addition to information on student evaluations, the data contain all grades that students received in subsequent economics courses and rich information on student background, including race, gender, ethnicity, high school class rank, and SAT or ACT scores (or both).”


“The following are highlights of what we found:

1. There is a consistent positive relationship between grades in the current course and evaluations. This finding is robust to the inclusion of a wide range of controls and specifications.

2. There is no evidence of an association between learning and evaluations controlling for current course grades.

3. Learning is no more related to student evaluations of the amount learned in the course than it is to student evaluations of other aspects of the course.

4. In some cases women and foreign-born instructors receive lower evaluations than other instructors, all else being equal.”[6]

Dennis E. Clayson, long a “voice in the wilderness” as a critic of student ratings, published:

DE Clayson, “Student Evaluation of Teaching: Are They Related to What Students Learn? A Meta-Analysis and Review of the Literature,” Journal of Marketing Education v31 n1 (2009) 16–30.

Clayson found that, of the many studies he reviewed, only Johnson (2003) and Weinberg et al. (2009) provided useful information.

Because of its large student population and the unusually tight curricular controls enforced at the US Air Force Academy, the following study amplified criticism of student ratings:

SE Carrell and JE West, “Does Professor Quality Matter? Evidence from Random Assignment of Students to Professors,” Journal of Political Economy v118 (2010) 409–432.

Abstract: In primary and secondary education, measures of teacher quality are often based on contemporaneous student performance on standardized achievement tests. In the postsecondary environment, scores on student evaluations of professors are typically used to measure teaching quality. We possess unique data that allow us to measure relative student performance in mandatory follow-on classes. We compare metrics that capture these three different notions of instructional quality and present evidence that professors who excel at promoting contemporaneous student achievement teach in ways that improve their student evaluations but harm the follow-on achievement of their students in more advanced classes.

In 2007, Stephen R. Porter, trained as a political scientist in a quantitatively oriented graduate program, began to raise serious questions about the trustworthiness of any student survey results.[7] The results of his research were published during 2011–2012:

Stephen R. Porter, “Do College Student Surveys Have Any Validity?” The Review of Higher Education v35 n1 (Fall 2011) 45–76.

“In this article, I argue that the typical college student survey question has minimal validity and that our field requires an ambitious research program to reestablish the foundation of quantitative research on students. Our surveys lack validity because (a) they assume that college students can easily report information about their behaviors and attitudes, when the standard model of human cognition and survey response clearly suggests they cannot, (b) existing research using college students suggests they have problems correctly answering even simple questions about factual information, and (c) much of the evidence that higher education scholars cite as evidence of validity and reliability actually demonstrates the opposite. I choose the National Survey of Student Engagement (NSSE) for my critical examination of college student survey validity ….”[8]

Stephen R. Porter, Corey Rumann, and Jason Pontius, “The Validity of Student Engagement Survey Questions: Can We Accurately Measure Academic Challenge?” New Directions for Institutional Research n150 (Summer 2011) 87–98 DOI: 10.1002/ir.391

This chapter examines the validity of several questions about academic challenge taken from the National Survey of Student Engagement. We compare student self-reports about the number of books assigned to the same number derived from course syllabi, finding little relationship between the two measures.

Stephen R. Porter, “Self-Reported Learning Gains: A Theory and Test of College Student Survey Response,” Research in Higher Education (Nov 2012) DOI 10.1007/s11162-012-9277-0

Abstract: Recent studies have asserted that self-reported learning gains (SRLG) are valid measures of learning, because gains in specific content areas vary across academic disciplines as theoretically predicted. In contrast, other studies find no relationship between actual and self-reported gains in learning, calling into question the validity of SRLG. I reconcile these two divergent sets of literature by proposing a theory of college student survey response that relies on the belief-sampling model of attitude formation. This theoretical approach demonstrates how students can easily construct answers to SRLG questions that will result in theoretically consistent differences in gains across academic majors, while at the same time lacking the cognitive ability to accurately report their actual learning gains. Four predictions from the theory are tested, using data from the 2006–2009 Wabash National Study. Contrary to previous research, I find little evidence as to the construct and criterion validity of SRLG questions.

Stephen R. Porter, “Using Student Learning as a Measure of Quality in Higher Education” (2012)

“The purpose of this paper is to review existing measures of student learning, and to explore their strengths and weaknesses as a quality metric for higher education.”

The objective of the Context for Success project was to ask scholars of higher education to weigh in on the issues – both theoretical and practical – that need to be considered in designing “input-adjusted metrics” for judging the effectiveness of postsecondary institutions. With the support of the Bill & Melinda Gates Foundation, the consulting firm HCM Strategists invited a number of scholars from around the country to write papers that would discuss the methodological issues in accounting for differences in student populations when evaluating institutional performance. In some cases, these authors were also asked to demonstrate the effects of such adjustments using actual data.

[In the latter paper, Porter makes a point seldom remarked upon: almost all of the validity research on the self-reported behaviors, self-reported learning gains and direct measures of student learning that he discusses (e.g., the CLA, relied on in the methodologically flawed Arum and Roksa, Academically Adrift: Limited Learning on College Campuses[9]) has

been conducted by researchers heavily involved with the organizations that are designing, marketing and administering these surveys and tests. … Unlike the field of medicine, which has openly struggled with issues surrounding research funded by drug companies and doctors recommending procedures using medical devices created by their own companies, the field of postsecondary research has largely ignored this topic.[10]]

Porter’s conclusions about student surveys receive strong support in a recent doctoral dissertation:

William R. Standish, III, A Validation Study of Self-Reported Behavior: Can College Student Self-Reports of Behavior Be Accepted as Being Self-Evident? (2017)

Abstract excerpt: This validation study of self-reported behaviors compares institution-reported, transactional data to corresponding self-reported academic performance, class attendance, and co-curricular participation from a sample of 6,000 students, using the Model of the Response Process by Tourangeau (1984, 1987). Response bias, observed as measurement error, is significant in 11 of the 13 questions asked and evaluated in this study. Socially desirable behaviors include campus recreation facility (CRF) use and academic success being overstated as much as three times. Nonresponse bias, observed as nonresponse error, is also significant in 11 of the same 13 questions asked and evaluated with high GPA and participatory students over represented in the survey statistic. For most of the questions, measurement error and nonresponse error combine to misstate behavior by at least 20%. The behaviors most affected are CRF use, which is overstated by 112% to 248%; semester GPA self-reports of 3.36 versus an actual value of 3.04; and co-curricular participation that misstated by between -21% to +46%. This validation study sufficiently demonstrates that measurement error and nonresponse error are present in the self-reported data collected for the commonly studied topics in higher education that were represented by the 13 questions. Researchers using self-reported data cannot presume the survey statistic to be an unbiased estimate of actual behavior that it is generalizable to larger populations.[11]
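The nonresponse bias Standish documents is mechanically simple: if response propensity correlates with the behavior being measured, the survey statistic drifts away from the population value. A toy sketch, with all numbers invented for illustration:

```python
# Toy illustration of nonresponse bias (all numbers invented): students with
# higher GPAs are more likely to answer the survey, so the respondent mean
# overstates the population mean.

population = [
    # (GPA, responded?) -- response propensity rises with GPA
    (2.0, False), (2.2, False), (2.5, False), (2.8, True),
    (3.0, False), (3.1, True),  (3.3, True),  (3.5, True),
    (3.7, True),  (3.9, True),
]

true_mean = sum(gpa for gpa, _ in population) / len(population)
respondents = [gpa for gpa, responded in population if responded]
survey_mean = sum(respondents) / len(respondents)

print(round(true_mean, 2), round(survey_mean, 2))  # survey mean is inflated
```

The direction and size of the gap depend entirely on how strongly responding correlates with the measured trait, which is why checking nonresponse against administrative records, as Standish does, matters.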

Nor is Porter the only one who has highlighted problems with student surveys:

NA Bowman, “Can 1st-Year College Students Accurately Report Their Learning and Development?” American Educational Research Journal v47 n2 (2010) 466–496.

Abstract: Many higher education studies use self-reported gains as indicators of college student learning and development. However, the evidence regarding the validity of these indicators is quite mixed. It is proposed that the temporal nature of the assessment—whether students are asked to report their current attributes or how their attributes have changed over time—best accounts for students’ (in)ability to make accurate judgments. Using a longitudinal sample of over 3,000 first-year college students, this study compares self-reported gains and longitudinal gains that are measured either objectively or subjectively. Across several cognitive and noncognitive outcomes, the correlations between self-reported and longitudinal gains are small or virtually zero, and regression analyses using these two forms of assessment yield divergent results.[12]

The strongest correlations ever found between student learning and student ratings in a methodologically sound study are in the range 0.18–0.25:

Trinidad Beleche, David Fairris, Mindy Marks, “Do course evaluations truly reflect student learning? Evidence from an objectively graded post-test,” Economics of Education Review 31 (2012) 709–719.

Abstract: It is difficult to assess the extent to which course evaluations reflect how much students truly learn from a course because valid measures of learning are rarely available. This paper makes use of a unique setting in which students take a common, high-stakes post-test which is centrally graded and serves as the basis for capturing actual student learning. We match these student-specific measures of learning to student-specific course evaluation scores from electronic records and a rich set of student-level covariates, including a pre-test score and other measures of skills prior to entering the course. While small in magnitude, we find a robust positive, and statistically significant, association between our measure of student learning and course evaluations. [13]

The following seven survey items are listed in decreasing order of their (weak) correlations (0.25–0.18)[14] with student learning (as measured by a uniform, high-stakes final exam):

  • Supplementary materials (e.g. films, slides, videos, guest lectures, web pages, etc.) were informative. [strongest correlation = 0.25]
  • Instructor was clear and understandable.
  • Instructor used class time effectively.
  • Instructor was effective as a teacher overall.
  • Instructor was prepared and organized.
  • Instructor was fair in evaluating students.
  • The course overall as a learning experience was excellent. [weakest correlation = 0.18]
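One way to read these numbers: even the strongest item correlation, r = 0.25, implies that ratings share only about 6% of their variance with measured learning (and the weakest item about 3%), since shared variance is r². A trivial check:

```python
# Variance in ratings shared with measured learning, for the strongest and
# weakest item correlations reported above (r^2, coefficient of determination).
for r in (0.25, 0.18):
    print(f"r = {r:.2f} -> r^2 = {r * r:.4f}")  # 0.0625 and 0.0324
```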

Nilson comments:

[Beleche et al] … uncovered a weak relationship between student ratings and learning, but the learning measured was not of the long-term type. Rather it was based on students’ scores on a high-stakes final exam administered across multiple sections.[15]

Herbert Marsh, regarded by many School of Education faculty as a, if not the, leading researcher on student ratings, repeatedly commits an elementary statistical error in his work, as argued in:

Donald D. Morley, “Claims about the reliability of student evaluations of instruction: The ecological fallacy rides again,” Studies in Educational Evaluation v38 (2012) 15–20

Abstract: The vast majority of the research on student evaluation of instruction has assessed the reliability of groups of courses and yielded either a single reliability coefficient for the entire group, or grouped reliability coefficients for each student evaluation of teaching (SET) item. This manuscript argues that these practices constitute a form of ecological correlation and therefore yield incorrect estimates of reliability. Intraclass reliability and agreement coefficients were proposed as appropriate for making statements about the reliability of SETs in specific classes. An analysis of 1073 course sections using inter-rater coefficients found that students using this particular instrument were generally unable to reliably evaluate faculty. In contrast, the traditional ecologically flawed multi-class “group” reliability coefficients had generally acceptable reliability.
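Morley’s point can be reproduced in a few lines of simulation. In this sketch (all parameters invented), a small true instructor effect is buried in large student-level disagreement; correlating half-sample class means across sections yields a respectable “group” reliability, while agreement between individual raters is near zero:

```python
import random
import statistics

random.seed(1)

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x) ** 0.5
                  * sum((b - my) ** 2 for b in y) ** 0.5)

# 1,000 simulated course sections: a small true instructor effect (sd 0.2)
# buried in large student-level disagreement (sd 1.0), 100 raters per section.
sections = []
for _ in range(1000):
    instructor_effect = random.gauss(0, 0.2)
    sections.append([instructor_effect + random.gauss(0, 1.0)
                     for _ in range(100)])

# "Group" reliability: correlate half-sample class means across sections.
half_a = [statistics.fmean(s[:50]) for s in sections]
half_b = [statistics.fmean(s[50:]) for s in sections]
group_r = pearson(half_a, half_b)   # respectable (roughly 0.6 to 0.7 here)

# Individual-rater agreement: correlate one rater with another across sections.
single_r = pearson([s[0] for s in sections], [s[1] for s in sections])

print(round(group_r, 2), round(single_r, 2))
```

Averaging 50 raters washes out the student-level noise, so the class means track the (tiny) instructor effect; no individual student’s rating does. Reporting the group coefficient as “the reliability of SETs” is exactly the ecological inference Morley criticizes.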

Philip Stark[16] provides a pellucid presentation of the myriad problems with student ratings and of the ways in which they are (mis)used:


Philip B. Stark & Richard Freishtat, Evaluating Evaluations: Part I (Oct 9, 2013)

Philip B. Stark & Richard Freishtat, What Evaluations Measure: Part II (Oct 17, 2013)[17]

[Like Stark, Boysen et al. emphasize that ordinal scales (e.g., Likert scales) should not be treated as interval scales.

Boysen, G. A., Kelly, T. J., Raesly, H. N., & Casner, R. W. (2014). The (mis)interpretation of teaching evaluations by college faculty and administrators. Assessment & Evaluation in Higher Education, 39, 641–656.

Norman argues that little harm is done by treating ordinal scales as interval:

Geoff Norman, “Likert scales, levels of measurement and the ‘laws’ of statistics,” Advances in Health Sciences Education v15 n5 (December 2010) 625–632.

But Fayers points out that caution is on the contrary necessary:

Peter Fayers, “Alphas, betas and skewy distributions: two ways of getting the wrong answer,” Advances in Health Sciences Education v16 (March 2011) 291–296 DOI 10.1007/s10459-011-9283-6.

Abstract: “Although many parametric statistical tests are considered to be robust, as recently shown [by Geoff Norman], it still pays to be circumspect about the assumptions underlying statistical tests. In this paper I show that robustness mainly refers to α, the type-I error. If the underlying distribution of data is ignored there can be a major penalty in terms of β, the type-II error, representing a large increase in false negative rate or, equivalently, a severe loss of power of the test.” “… type-II errors can be substantially increased if non-normality is ignored.” (292)

Liddell and Kruschke explain why even more kinds of error than Fayers identifies are common:

Torrin M. Liddell and John K. Kruschke, “Analyzing ordinal data with metric models: What could possibly go wrong?” Journal of Experimental Social Psychology 79 (2018) 328-348.

Abstract: We surveyed all articles in the Journal of Personality and Social Psychology (JPSP), Psychological Science (PS), and the Journal of Experimental Psychology: General (JEP:G) that mentioned the term “Likert,” and found that 100% of the articles that analyzed ordinal data did so using a metric model. We present novel evidence that analyzing ordinal data as if they were metric can systematically lead to errors. We demonstrate false alarms (i.e., detecting an effect where none exists, Type I errors) and failures to detect effects (i.e., loss of power, Type II errors). We demonstrate systematic inversions of effects, for which treating ordinal data as metric indicates the opposite ordering of means than the true ordering of means. We show the same problems — false alarms, misses, and inversions — for interactions in factorial designs and for trend analyses in regression. We demonstrate that averaging across multiple ordinal measurements does not solve or even ameliorate these problems. A central contribution is a graphical explanation of how and when the misrepresentations occur. Moreover, we point out that there is no sure-fire way to detect these problems by treating the ordinal values as metric, and instead we advocate use of ordered-probit models (or similar) because they will better describe the data. Finally, although frequentist approaches to some ordered-probit models are available, we use Bayesian methods because of their flexibility in specifying models and their richness and accuracy in providing parameter estimates. An R script is provided for running an analysis that compares ordered-probit and metric models.]
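The inversion Liddell and Kruschke describe can be shown analytically with an ordered-probit-style model (all numbers below are invented): a latent normal opinion is cut at fixed thresholds into a 1–5 response, and a floor effect makes the instructor with the higher latent mean receive the lower observed mean:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

THRESHOLDS = [1.5, 2.5, 3.5, 4.5]  # cut-points mapping latent opinion to 1..5

def observed_mean(latent_mean, latent_sd):
    """Mean of the 1-5 response implied by a latent normal opinion."""
    cdf = [phi((t - latent_mean) / latent_sd) for t in THRESHOLDS]
    probs = [cdf[0]] + [b - a for a, b in zip(cdf, cdf[1:])] + [1.0 - cdf[-1]]
    return sum(k * p for k, p in enumerate(probs, start=1))

# Instructor A is better on the latent scale (mean 1.0 vs 0.0) and students
# agree closely (sd 0.2), but a floor effect pins nearly all responses at "1".
# Instructor B is worse but polarizing (sd 3.0), so some responses escape the
# floor, and B's observed mean comes out higher: the true ordering inverts.
a = observed_mean(1.0, 0.2)
b = observed_mean(0.0, 3.0)
print(round(a, 2), round(b, 2))  # a < b despite A's higher latent mean
```

An ordered-probit model fitted to the full response distributions, as Liddell and Kruschke advocate, recovers the correct ordering of the latent means; comparing raw averages cannot.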

Uttl et al. provide a far more comprehensive meta-analysis than Clayson’s 2009 review, one that removes the support older studies may have seemed to provide for the use of student ratings (studies on which defenders of student ratings typically relied):

Bob Uttl, Carmela A. White, Daniela Wong Gonzalez, “Meta-analysis of faculty’s teaching effectiveness: Student evaluation of teaching ratings and student learning are not related,” Studies in Educational Evaluation 54 (2017) 22–42. doi:10.1016/j.stueduc.2016.08.007

Abstract: Student evaluation of teaching (SET) ratings are used to evaluate faculty’s teaching effectiveness based on a widespread belief that students learn more from highly rated professors. The key evidence cited in support of this belief are meta-analyses of multisection studies showing small-to-moderate correlations between SET ratings and student achievement (e.g., Cohen, 1980, 1981; Feldman, 1989). We re-analyzed previously published meta-analyses of the multisection studies and found that their findings were an artifact of small sample sized studies and publication bias. Whereas the small sample sized studies showed large and moderate correlation, the large sample sized studies showed no or only minimal correlation between SET ratings and learning. Our up-to-date meta-analysis of all multisection studies revealed no significant correlations between the SET ratings and learning. These findings suggest that institutions focused on student learning and career success may want to abandon SET ratings as a measure of faculty’s teaching effectiveness.
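The small-sample artifact Uttl et al. identify is easy to demonstrate by simulation (parameters invented for illustration): when the true SET-learning correlation is zero, the subset of studies that clears a significance filter necessarily reports large correlations:

```python
import random
import statistics

random.seed(7)

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x) ** 0.5
                  * sum((b - my) ** 2 for b in y) ** 0.5)

N_SECTIONS = 15     # a typical small multisection study
CRITICAL_R = 0.514  # |r| needed for p < .05 (two-tailed) with n = 15

# 2,000 simulated studies in which SET ratings and learning are INDEPENDENT.
all_rs, published = [], []
for _ in range(2000):
    ratings = [random.gauss(0, 1) for _ in range(N_SECTIONS)]
    learning = [random.gauss(0, 1) for _ in range(N_SECTIONS)]
    r = pearson(ratings, learning)
    all_rs.append(r)
    if abs(r) > CRITICAL_R:   # publication filter: significant results only
        published.append(abs(r))

print(round(statistics.fmean(all_rs), 2),     # near zero: the truth
      round(statistics.fmean(published), 2))  # above 0.5: the published record
```

This is the funnel-plot logic behind the re-analysis: small studies scatter widely around zero, and selecting the significant ones manufactures a “small-to-moderate” meta-analytic correlation out of nothing.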

Along with Uttl et al.[18] and the other research cited above, the following study should result in significant change in the use of student ratings of instructors, if only to avoid EEOC complaints and lawsuits:[19]

Boring, A., Ottoboni, K., & Stark, P. B. (2016) “Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness.” ScienceOpen (57,464 views to date in late August 2018)

Abstract: Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. We show:

1. SET are biased against female instructors by an amount that is large and statistically significant;

2. the bias affects how students rate even putatively objective aspects of teaching, such as how promptly assignments are graded;

3. the bias varies by discipline and by student gender, among other things;

4. it is not possible to adjust for the bias, because it depends on so many factors;

5. SET are more sensitive to students’ gender bias and grade expectations than they are to teaching effectiveness;

6. gender biases can be large enough to cause more effective instructors to get lower SET than less effective instructors.

These findings are based on nonparametric statistical tests applied to two datasets: 23,001 SET of 379 instructors by 4,423 students in six mandatory first-year courses in a five-year natural experiment at a French university, and 43 SET for four sections of an online course in a randomized, controlled, blind experiment at a US university.

The authors conclude:

In the US, SET have two primary uses: instructional improvement and personnel decisions, including hiring, firing, and promoting instructors. We recommend caution in the first use, and discontinuing the second use, given the strong student biases that influence SET.

Overall, SET disadvantages female instructors. There is no evidence that this is the exception rather than the rule. Hence, the onus should be on universities that rely on SET for employment decisions to provide convincing affirmative evidence that such reliance does not have disparate impact on women, underrepresented minorities, or other protected groups. Absent such specific evidence, SET should not be used for personnel decisions.[20]
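The nonparametric machinery Boring, Ottoboni, and Stark rely on can be sketched as a two-sample permutation test. The ratings below are invented illustration data, not theirs:

```python
import random
import statistics

random.seed(42)

# Invented illustration data: mean SET per section for two instructor groups.
ratings_a = [4.1, 4.4, 3.9, 4.6, 4.2, 4.5, 4.3, 4.0]
ratings_b = [3.6, 3.9, 3.4, 4.0, 3.7, 3.5, 3.8, 3.3]

observed = statistics.fmean(ratings_a) - statistics.fmean(ratings_b)

# Permutation test: under the null hypothesis that group labels are
# exchangeable, reshuffle the labels and count how often the mean difference
# is at least as extreme as the one observed.
pooled = ratings_a + ratings_b
n_a = len(ratings_a)
extreme = 0
TRIALS = 10_000
for _ in range(TRIALS):
    random.shuffle(pooled)
    diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / TRIALS
print(round(observed, 2), p_value)
```

Because the test conditions only on exchangeability of labels, it needs no interval-scale or normality assumptions about the ratings themselves, which is precisely why it suits ordinal SET data.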

On bias, see also:

Kristina M. W. Mitchell & Jonathan Martin, “Gender Bias in Student Evaluations,” The Teacher (American Political Science Association, 2018) 1–5 doi:10.1017/S104909651800001X

Abstract: Many universities use student evaluations of teachers (SETs) as part of consideration for tenure, compensation, and other employment decisions. However, in doing so, they may be engaging in discriminatory practices against female academics. This study further explores the relationship between gender and SETs described by MacNell, Driscoll, and Hunt (2015) [see below] by using both content analysis in student-evaluation comments and quantitative analysis of students’ ordinal scoring of their instructors. The authors show that the language students use in evaluations regarding male professors is significantly different than language used in evaluating female professors. They also show that a male instructor administering an identical online course as a female instructor receives higher ordinal scores in teaching evaluations, even when questions are not instructor-specific. Findings suggest that the relationship between gender and teaching evaluations may indicate that the use of evaluations in employment decisions is discriminatory against women.

Anne Boring, “Gender Biases in Student Evaluations of Teaching,” Journal of Public Economics v145 (2017) 27–41.

This article uses data from a French university to analyze gender biases in student evaluations of teaching (SETs). The results of fixed effects and generalized ordered logit regression analyses show that male students express a bias in favor of male professors. Also, the different teaching dimensions that students value in male and female professors tend to match gender stereotypes. Men are perceived by both male and female students as being more knowledgeable and having stronger class leadership skills (which are stereotypically associated with males), despite the fact that students appear to learn as much from women as from men.

“The database includes a total of 20,197 observations: 11,522 evaluations by female students and 8675 evaluations by male students. Evaluations are obtained from 4362 different students (57% female students and 43% male students) and 359 different professors (33% women and 67% men) for a total of 1050 seminars.”

Friederike Mengel, Jan Sauermann, Ulf Zolitz, “Gender Bias in Teaching Evaluations,” IZA DP No. 11000 (September 2017)

Abstract: This paper provides new evidence on gender bias in teaching evaluations. We exploit a quasi-experimental dataset on 19,952 student evaluations of university faculty in a context where students are randomly allocated to female or male teachers. Despite the fact that neither students’ grades nor self-study hours are affected by the teacher’s gender, we find that in particular male students evaluate female teachers worse than male teachers. The bias is largest for junior teachers, which is worrying since their lower evaluations might affect junior women’s confidence and hence have direct as well as indirect effects on women’s progression into academic careers.

Natascha Wagner, Matthias Rieger, Katherine Voorvelt, “Gender, ethnicity and teaching evaluations: Evidence from mixed teaching teams,” Working Paper No. 617, The International Institute of Social Studies (March 2016), and Economics of Education Review 54 (2016) 79–94.

Abstract: This paper studies the effect of teacher gender and ethnicity on student evaluations of teaching at university. We analyze a unique data-set featuring mixed teaching teams and a diverse, multicultural, multi-ethnic group of students and teachers. Blended co-teaching allows us to study the link between student evaluations of teaching and teacher gender as well as ethnicity exploiting within course variation in a panel data model with course-year fixed effects. We document a negative effect of being a female teacher on student evaluations of teaching, which amounts to roughly one fourth of the sample standard deviation of teaching scores. Overall women are 11 percentage points less likely to attain the teaching evaluation cut-off for promotion to associate professor compared to men. The effect is robust to a host of co-variates such as course leadership, teacher experience and research quality, as well as an alternative teacher fixed effect specification. There is no evidence of a corresponding ethnicity effect. Our results are suggestive of a gender bias against female teachers and indicate that the use of teaching evaluations in hiring and promotion decisions may put female lecturers at a disadvantage.

Daniel Storage, Zachary Horne, Andrei Cimpian, Sarah-Jane Leslie, “The Frequency of “Brilliant” and “Genius” in Teaching Evaluations Predicts the Representation of Women and African Americans across Fields,” PLOS ONE DOI:10.1371/journal.pone.0150194 (March 3, 2016).

Abstract: Women and African Americans—groups targeted by negative stereotypes about their intellectual abilities—may be underrepresented in careers that prize brilliance and genius. A recent nationwide survey of academics provided initial support for this possibility. Fields whose practitioners believed that natural talent is crucial for success had fewer female and African American PhDs. The present study seeks to replicate this initial finding with a different, and arguably more naturalistic, measure of the extent to which brilliance and genius are prized within a field. Specifically, we measured field-by-field variability in the emphasis on these intellectual qualities by tallying—with the use of a recently released online tool—the frequency of the words “brilliant” and “genius” in over 14 million reviews on, a popular website where students can write anonymous evaluations of their instructors. This simple word count predicted both women’s and African Americans’ representation across the academic spectrum. That is, we found that fields in which the words “brilliant” and “genius” were used more frequently on also had fewer female and African American PhDs. Looking at an earlier stage in students’ educational careers, we found that brilliance-focused fields also had fewer women and African Americans obtaining bachelor’s degrees. These relationships held even when accounting for field-specific averages on standardized mathematics assessments, as well as several competing hypotheses concerning group differences in representation. The fact that this naturalistic measure of a field’s focus on brilliance predicted the magnitude of its gender and race gaps speaks to the tight link between ability beliefs and diversity.

Lillian MacNell, Adam Driscoll, Andrea N. Hunt, “What’s in a Name: Exposing Gender Bias in Student Ratings of Teaching,” Innovative Higher Education v40 n4 (August 2015) 291–303.

Abstract: Student ratings of teaching play a significant role in career outcomes for higher education instructors. Although instructor gender has been shown to play an important role in influencing student ratings, the extent and nature of that role remains contested. While difficult to separate gender from teaching practices in person, it is possible to disguise an instructor’s gender identity online. In our experiment, assistant instructors in an online class each operated under two different gender identities. Students rated the male identity significantly higher than the female identity, regardless of the instructor’s actual gender, demonstrating gender bias. Given the vital role that student ratings play in academic career trajectories, this finding warrants considerable attention.[21]

These slightly older studies also find some evidence of gender bias:

S. A. Basow, S. Codos and J. L. Martin, “The Effects of Professors’ Race and Gender on Student Evaluations and Performance,” College Student Journal v47 (2013) 352–363.

Abstract: This experimental study examined the effects of professor gender, professor race, and student gender on student ratings of teaching effectiveness and amount learned. After watching a three-minute engineering lecture presented by a computer-animated professor who varied by gender and race (African American, White), female and male undergraduates (N = 325) completed a 26-question student evaluation form and a 10-question true/false quiz on the lecture content. Contrary to predictions, male students gave significantly higher ratings than female students on most teaching factors and African American professors were rated higher than White professors on their hypothetical interactions with students. Quiz results, however, supported predictions: higher scores were obtained by students who had a White professor compared to those who had an African American professor, and by students who had a male professor compared to those who had a female professor. These results may be due to students paying more attention to the more normative professor. Thus, performance measures may be a more sensitive indication of race and gender biases than student ratings. The limited relationship between student ratings and student learning suggests caution in using the former to assess the latter.

Lisa L. Martin, “Gender, Teaching Evaluations, and Professional Success in Political Science,” American Political Science Association Annual Meeting, August 29–September 1, 2013.

“In this paper I make a number of interrelated arguments. First, a review of the psychological literature on gender and leadership assessments suggests that there is an interaction between gender and student assumptions about leadership roles. Thus, when a course requires that a teacher take on a stereotypical leader role – such as a large lecture course or a Massive Open Online Course (MOOC) – assumptions about gender roles could have a significant impact on evaluations. Second, I provide a preliminary assessment of this hypothesis using publicly available SET data from a political science department at a large public university. These data suggest, as expected, that female faculty receive lower evaluations of general teaching effectiveness in large courses than do male faculty, while no substantial difference exists for small courses. To the extent that teaching evaluations are an important part of promotion and compensation decisions and other reward systems within universities, reliance on SETs that may be biased creates concerns. Third, the race by universities to join the MOOC game so far has exhibited a strong preference for courses taught by male faculty. All of these concerns suggest that the discipline needs to reconsider its methods of faculty evaluation and the role that such evaluations play in professional advancement.”

Satish Nargundkar and Milind Shrikhande, “Norming of Student Evaluations of Instruction: Impact of Noninstructional Factors,” Decision Sciences Journal of Innovative Education v12 n1 (January 2014) 55–72.

Abstract: Student Evaluations of Instruction (SEIs) from about 6,000 sections over 4 years representing over 100,000 students at the college of business at a large public university are analyzed, to study the impact of noninstructional factors on student ratings. Administrative factors like semester, time of day, location, and instructor attributes like gender and rank are studied. The combined impact of all the noninstructional factors studied is statistically significant. Our study has practical implications for administrators who use SEIs to evaluate faculty performance. SEI scores reflect some inherent biases due to noninstructional factors. Appropriate norming procedures can compensate for such biases, ensuring fair evaluations.

Despite the relatively small size of the populations involved, both of the following studies highlight disturbing possibilities:

M. Oliver-Hoyo, “Two Groups in the Same Class: Different Grades,” Journal of College Science Teaching, 38(1) (2008) 37–39. [Recounts the experience of one award-winning instructor who taught two sections in the same classroom at the same time and received significantly different evaluations from the two sections. One of the matters about which there was disagreement was how available the instructor was outside of class. All students were informed in the same way about the instructor’s office hours, she kept them, and many students made use of them. There was a significant association between receiving lower grades and rating the instructor as significantly less available during office hours.]

Robert J. Youmans and Benjamin D. Jee, “Fudging the Numbers: Distributing Chocolate Influences Student Evaluations of an Undergraduate Course,” Teaching of Psychology v34 n4 (2007) 245-247.

Abstract: Student evaluations provide important information about teaching effectiveness. Research has shown that student evaluations can be mediated by unintended aspects of a course. In this study, we examined whether an event unrelated to a course would increase student evaluations. Six discussion sections completed course evaluations administered by an independent experimenter. The experimenter offered chocolate to 3 sections [immediately] before they completed the evaluations. Overall, students offered chocolate gave more positive evaluations than students not offered chocolate. This result highlights the need to standardize evaluation procedures to control for the influence of external factors on student evaluations.[22]

A useful overview is at Dennis E. Clayson’s web site:

Student Evaluation of Teaching: A Multi-Disciplined Review of the Student Teacher Evaluation Process (updated April 2017)

Some of the very many master’s and doctoral dissertations:

Verena Sylvia Bonitz, Student evaluation of teaching: Individual differences and bias effects (2011)

Andamlak Terkik, Analyzing Gender Bias in Student Evaluations (2016)

Erica L. DeFrain, An Analysis of Differences in Non-Instructional Factors Affecting Teacher-Course Evaluations Over Time and Across Disciplines (2016)

Zacharia J. Varughese, The influence of teacher gender on college student motivation and engagement in an online environment (2017)

Kenneth Ancell and Emily Wu, Teaching, Learning, and Achievement: Are Course Evaluations Valid Measures of Instructional Quality at the University of Oregon? (2017)

Edgar Andrés Valencia Acuña, Response Styles in Student Evaluation of Teaching (2017)

To refute a likely suggestion about what to do instead:

Ronald A. Berk, “Should Student Outcomes Be Used to Evaluate Teaching?” The Journal of Faculty Development v28 n2 (May 2014) 87–96, from which the following excerpt is taken:

Factory Worker–Instructor Productivity Analogy

If you are considering student outcomes, student achievement or growth is the measure of teaching effectiveness; that is, it is outcome based. If a factory worker’s performance can be measured by the number of wickets (Remember: World Wide Wicket Company in “How to Succeed in Business without Really Trying!”) he or she produces over a given period of time, why not evaluate an instructor’s productivity by his or her students’ success on outcome measures? …

The arguments for this factory worker–instructor productivity analogy are derived from the principles of a piece-rate compensation system …. Piece-rate contracts are the most common form of “payment by results”…. These contracts provide a strong incentive for workers to produce, because high productivity results in immediate rewards, possibly even a decent minimum wage and healthcare benefits.

When this “contract” concept is applied to teaching, it disintegrates for three reasons:

1. A factory worker uses the same materials, such as plywood and chewing gum, to make each wicket. Instructors work with students whose characteristics vary considerably within each class and from course to course.

2. The characteristics of a factory worker’s materials rarely influence his or her skills and rate of production; that is, the quality and quantity of wicket production can be attributed solely to the worker. Instructors have no control over the individual differences and key characteristics of their students, such as ability, attitude, motivation, age, gender, ethnicity, cholesterol, and blood glucose, and of their courses, such as class size, composition, classroom facilities, available technology, and class climate. These characteristics can affect students’ performance regardless of how well an instructor teaches.

3. The production of wickets is easy to measure. Just count them. Measuring students’ performance on different outcomes is considerably more complicated with significant challenges to obtaining adequate degrees of reliability and validity for the scores. Then one has to pinpoint the component in the scores that is attributable to the instructor’s teaching.

Consequently, the factory worker analogy just doesn’t stick. It’s like Teflon® to instructor evaluation. Student outcomes provide a patina of credibility as a measure of teaching rather than an authentic source of evidence. …

Non-response Bias:

Aside from general questions about the validity and reliability of ClassEval surveys, non-response bias is a serious concern. See, for example:

Ronald A. Berk, “Top 20 Strategies to Increase the Online Response Rates of Student Rating Scales,” International Journal of Technology in Teaching and Learning, 8(2) (2012) 98-107.

The problem with low response rates is that they provide an inadequate data base from which to infer teaching effectiveness from the scores on a student rating scale as well as other measures. If the percentage of responses is too small, the sampling error can be frightfully large and the representativeness of the student responses can be biased. The nonresponse bias also becomes a concern. The error (reliability) and biases (validity) significantly diminish the usefulness of the ratings and make administrators unhappy. Those psychometric deficiencies can undermine the evaluation process. Although the minimum response rate based on sampling error for a seminar with 10 students may be different from a class with 50, 100, or larger, rates in the 80–100% range will be adequate for most any class size. Statistical tables of response rates for different errors and confidence intervals are available (Nulty, 2008). … Unfortunately, the rules of survey sampling do not provide a simple statistical answer to the response rate question for online rating scales. The class (sample) size that responds in relation to the class (population) size is not the only issue. There are at least two major sources of error (or unreliability) to consider: (1) standard error of the mean rating based on sample size and (2) standard error of measurement based on the reliability of the item, subscale, or total scale ratings. Confidence intervals can be computed for both. In typical survey research, inferences about characteristics of the population are drawn from the sample statistics. Only decisions about groups are rendered; not about individuals. In contrast, the inferences from sample (class) ratings are used for teaching improvement (formative) and important career (summative) decisions about individual professors. The response rate for one type of decision may not be adequate for other types of decisions … .

Duncan D. Nulty, “The adequacy of response rates to online and paper surveys: What can be done?” Assessment & Evaluation in Higher Education, 33(3) (June 2008) 301–314.

Starting with the data from the liberal conditions, … for class sizes below 20 the response rate required needs to be above 58%. … (47%) is only adequate when class sizes are above … 30 … . … class size … needs to exceed 100 before its existing response rate of 20% can be considered adequate. When … more traditional and conservative conditions are set, … response rate … [of] 65% is only adequate when the class size exceeds approximately 500 students. The … response rate [of] 47% [is] only adequate for class sizes above 750 students. The 20% response rate … would not be adequate even with class sizes of 2000 students.
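Nulty’s thresholds follow from standard finite-population sampling arithmetic: the smaller the class, the larger the fraction of it that must respond to keep sampling error within bounds. The sketch below uses a generic Dillman-style finite-population sample-size formula; the function name and the default parameters (3% margin of error, 95% confidence) are illustrative assumptions and are not claimed to reproduce Nulty’s exact tables.

```python
def required_response_rate(class_size, margin=0.03, z=1.96, p=0.5):
    """Fraction of a class that must respond so that the sample mean stays
    within `margin` of the class mean at confidence level z (finite-population
    sample-size formula; p = 0.5 is the most conservative proportion)."""
    zpq = z ** 2 * p * (1 - p)
    n_required = class_size * zpq / (margin ** 2 * (class_size - 1) + zpq)
    return n_required / class_size

# Required response rates fall as class size grows, which is the qualitative
# pattern in Nulty's tables: small classes need near-complete participation.
for n in (20, 100, 500, 2000):
    print(n, round(required_response_rate(n), 2))
```

The design point is the direction of the relationship, not the specific cutoffs: a seminar of 20 needs almost everyone to respond before its mean rating is trustworthy, while a class of 2000 can tolerate a much lower rate.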

Nulty is the most widely cited source on this matter (with a Google Scholar citation count of 1,489 at present). NC State-wide, response rates are currently around 45% on average. A study conducted at NC State (see Standish 2017, cited above) found significant, often large non-response biases in student surveys that were in part topically driven:

Trey Standish and Paul D. Umbach, “Should We Be Concerned About Nonresponse Bias in College Student Surveys? Evidence of Bias from a Validation Study,” Research in Higher Education (2018)

Abstract: This study uses college student survey data and corresponding administrative data on campus recreation facility usage, academic performance, physical education class attendance, and co-curricular participation to examine nonresponse bias in college student surveys. Within the context of the Groves (Public Opin Q 70:646–675, 2006) Alternative Cause Model, we found compelling evidence of the presence of nonresponse error observed as student characteristics related to the survey topic that also explain their response propensity. An individual’s survey response propensity has a statistically significant relationship with their actual behavior for 2 of 3 survey topics. In 11 of the 13 survey questions used to measure the survey topic behaviors, we found statistically significant differences between the respondent and nonrespondent behavioral measures. These findings hold important implications for survey researchers and those using student surveys for high-stakes accountability measures because survey summary statistics may not be generalizable to the target population.

General Comments on Student Ratings Research:

One of the oddest things about the vast literature on student ratings of teaching is that almost all of the survey results’ defects are the result of widely acknowledged human psychological tendencies, yet the defenders of the surveys’ use seem to ‘set aside’ the influence of the tendencies, as if they had forgotten that the responses came from human beings. In no particular order, here are eleven of the many familiar tendencies in human behavior that have been shown to (indeed, could hardly fail to) influence student ratings. There are some overlaps among the eleven but all are worth mentioning.

  1. If you give me a better rating, I’ll give you a better rating. (Better grades buy better ratings. The “reciprocity effect.”)
  2. If you’re in charge then I’ll hold you responsible for what goes wrong while you’re in charge. (Characteristics of a classroom or online environment that are annoying might as well be the instructor’s fault. A noisy ventilation system or a poor software interface, for example, will irritate students and this irritation will leak into ratings, even if the students say and know that the problem was not the instructor’s fault. Adding items to the surveys to try to control for such factors won’t plug the leaks and will interact with 4., below.)
  3. If you are a man/woman, then you should not disappoint my expectation that you’ll behave in the way that I think men/women are supposed to behave. (A female instructor who is more challenging and less warmly supportive may be seen as not doing her job and is regarded as deserving lower ratings.)
  4. If I’m asked to fill out lots of surveys when I’m very busy, then I’m entitled to blow some of them off. (Students engage in satisficing behavior in completing surveys, are less than careful in reporting instructor behavior and may even confuse one course with another, leading them to, e.g., rate non-existent instructors.)[23]
  5. If you didn’t rate [grade] me as I wanted to be or caused me some discomfort, then it’s OK for me to lie about what you did to retaliate against you. (Students lie in completing surveys and believe that they are justified in doing so. Although universities rightly exhibit concern about academic integrity when it comes to graded work, the same kind of concern about student survey results is rarely if ever exhibited.[24])
  6. If you are conventionally attractive, then I’ll find you more likeable and smarter. (Conventionally attractive instructors receive higher ratings. If, however, you are very attractive, then your students might resent that, depending on their genders and yours.)
  7. If you are more entertaining, then I’m sure that I’ll learn better from you. (Becoming better at acting without changing course content or exams yields much higher ratings without improving exam performance. There is a moderate positive correlation between student impressions of instructors based on viewing a 30 second video clip with no audio and end-of-term ratings of the instructors.)
  8. If you’re a woman or minority (and I’m not?), then I’m biased against you. (This is distinct from 3., above. Women and minorities whose students do as well later in follow-up courses as those taught by white males get lower ratings than white males. For online only courses, where the students never meet the instructor or learn the instructor’s true identity, if male or female students think that the instructor is a white male, then they give the instructor higher ratings than if they think the instructor is female or a minority.)
  9. If you talk about a topic about which I have opinions, and I don’t want my opinions evaluated in any way, then I’ll be angry at you. (If a course even raises questions about students’ identity-defining religious, political, etc., beliefs, then students will give lower ratings to the instructor.)
  10. “If I know little or nothing about a subject, I can still tell how much I’ve learned by taking a course in it, because I’ve got lots of experience with my learning and so I know how much I’ve learned.” (For college courses, the strongest correlation ever found in a methodologically sound study between learning and ratings is about 0.2, and there is evidence from another, larger study that the population of students was more likely than most students to refrain from blaming the instructor for their failures. A correlation of 0.2 indicates that ratings ‘explain’ only about 4% of the variation in the scores. And generally, most people tend to overestimate how much they know; even genuine experts engage in such overestimation. See the literature on the “Dunning-Kruger effect”[25] and Philip Tetlock’s work on experts’ predictions.[26])
  11. I’ll rely on the information most readily available to me and on the most recent influences in responding to surveys about a term. (Recently acquired information and recent influences will have a disproportionate effect on ratings. Handing out chocolate during completion of surveys in class raises ratings for the term significantly. Briefly praising students in class shortly before they complete surveys in class raises ratings for the term significantly.[27])
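The arithmetic behind item 10 is the coefficient of determination: the share of variance in ratings that learning “explains” is the square of the correlation. A minimal check (the function name is mine, for illustration):

```python
def variance_explained(r):
    """Coefficient of determination: the share of variance one variable
    'explains' in the other, given their Pearson correlation r."""
    return r ** 2

# The strongest correlation cited, r = 0.2, explains about 4% of the
# variation in scores; even a 'moderate' r = 0.3 explains under 10%.
print(round(variance_explained(0.2), 2))  # 0.04
print(round(variance_explained(0.3), 2))  # 0.09
```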

Read as strict universal generalizations about student behavior, all of the above are clearly false. Read as intended – descriptions of familiar human tendencies – they should engender no surprise at all. The only dispute ought to have been about relative effect size. And the list above is incomplete.[28]

About 3. and 8., above: Do you think that many people are biased against women and minorities? If so, wouldn’t you be amazed if such biases did not influence student ratings of teaching? Indeed, wouldn’t you be suspicious about the claimed validity of any student survey about teaching if its creators claimed that respondents showed little or no evidence of such biases?

There is a kind of hierarchy of vulnerability within the categories of <tenured, non-tenured-tenure-track, non-tenure-track> instructors that is a function of (among other things) conventional attractiveness, gender, race and detectable ethnicity. Probably, tenured White males are generally the least vulnerable[29] and non-tenure-track African American females are generally the most vulnerable.

Given how many factors (scores if not hundreds, very many beyond an instructor’s control) influence student ratings of teaching, it is only at the departmental/discipline level that it might be feasible to correct for a significant percentage of the many kinds of biases that students are likely to express in their ratings, though it will remain difficult in part because the population sizes will often be relatively small.

If for example two female African American non-tenure track instructors teaching simultaneous sections of a mid-sized-to-large course with very similar materials, demographically similar students, similar assignments and grade distributions, were carrying similar professional service loads and were not facing sharply disparate burdens in their personal lives, etc., and one received low ratings on an inclusivity-related survey item while the other did not, then there might be a problem. And the same goes for a similar pair of White male instructors.

When the ratings leave a department for the more remote college and university environments, will the complexity of the relevant judgments be handled with appropriate care, when a career may be at stake?

Often those who draw attention to problems will be told that they have no right to speak up unless they can also propose effective solutions. For example, the current managers of the NSSE chastised Porter for failing to solve the problems on which he’d blown the whistle. It is difficult to know where to begin in demonstrating the patent absurdity of such complaints, which seem intended to silence critics and not to secure remedies. If such complaints were apt then few medical treatments would have been discovered; few of those who first suffer a disease can be their own physicians. (Should the first patient with, say, Ebola virus have been told, “Stop your moaning and find a cure!”?)

However, it is a good idea to be careful what you wish for because you might get it. If student ratings were jettisoned – not that I think this is at all likely – then something worse could be imposed on post-secondary instructors. What could be worse? Value-Added Models (VAMs, or “growth models”), which have been used with disastrous effects in some of the largest US public school systems. The rough idea is to look at changes in student performance before and after they are taught by a particular teacher for an extended period (e.g., a school year) and to estimate how much of the relevant changes is attributable to the teacher. Their use in practice (in, e.g., CA and NY) has been so harmful that the American Statistical Association, which is generally reluctant to take such public policy positions, issued a warning that the use of VAMs as implemented to date should be discontinued.[30] When, for example, VAMs were used to rate and rank public school system teachers, the “error bars” on the scores received by teachers were often nearly 50% on a 100% scale; so, if a teacher received a rather poor 50%, the actual rating could, according to the model’s designers, be anywhere between 0% and 100%; that is, the scores were “noise.” Yet such ‘noisy’ low ratings were used to fire teachers.

There is still debate among qualified researchers about whether it is feasible to improve VAMs. The claim that they are better than the measures of teacher quality in wider use is an extremely weak defense at best. One of VAMs’ most well-informed and suitably cautious advocates, economist Douglas N. Harris of Tulane University, authored a book-length defense in which he lists nineteen conditions that must be met if VAMs are to be useful in rating teachers.[31] Some of the conditions concern the administrative and political environment (e.g., “Don’t create perverse incentives in … schools.”) and others are more focused on the models themselves (“Average value-added measures over at least two years.”). I asked Harris if it seemed optimistic to suppose that each of just ten of the nineteen conditions has an 80% probability of being met in any given school system; he agreed. Setting the other nine conditions at 1.0, which is wildly optimistic, I pointed out that the probability that all nineteen conditions would be met is then no greater than 0.80^10 ≈ 11%. Harris (and Porter when I checked with him) agreed that this is a problem. (If all nineteen conditions are imposed, as Harris recommends, and each is given a very optimistic probability of 80%, then the probability that all nineteen conditions will be met is 0.80^19, which drops below 2%.)
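The back-of-envelope calculation above treats the conditions as independent, so the joint probability is just the product of the individual probabilities. A sketch (the function name is mine; independence is the stated assumption):

```python
def prob_all_conditions_met(p_each, n_conditions):
    """Joint probability that every condition holds, assuming each is met
    independently with probability p_each."""
    return p_each ** n_conditions

# Ten conditions at 80% each (the other nine set, wildly optimistically, to 1.0):
print(round(prob_all_conditions_met(0.80, 10), 3))  # 0.107, i.e. about 11%
# All nineteen conditions at 80% each:
print(round(prob_all_conditions_met(0.80, 19), 3))  # 0.014, below 2%
```

Even generous per-condition odds collapse quickly under conjunction, which is the force of the objection Harris conceded.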

Former President of the UNC System Margaret Spellings advocated use of VAMs when she was US Secretary of Education.

Campbell’s Law

“The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”

Donald T. Campbell, “Assessing the Impact of Planned Social Change” December 1976 Paper #8 Occasional Paper Series (Reprinted with permission of The Public Affairs Center, Dartmouth College), page 49.


Not every use of student ratings is invidious. Careful use in evaluating nominations for teaching awards is defensible. Having myself seen some teaching award winners teach, I’m confident that their awards were wholly warranted, even after making a strenuous effort to correct for any biases that I might suffer. (See “About the author of this summary,” at the end, below.)

Postscript on Student Ratings, Grade Inflation and Gaming the Ratings

Although implicit in the title of Johnson (2003), concerns about promoting grade inflation ought to be central along with concerns about injustice to instructors. One of the main ways in which departments can be pressured to lower standards (and to inflate grades and graduation rates along with them) is by measuring their productivity using student credit hours per full-time-equivalent faculty member (SCH/FTE).[32] This is the inverse of what might aptly be called a “Socratic measure” of effective teaching: the lower the ratio of students to instructor, the better, with an optimum close to 1.0 for almost every (if not every) discipline.[33] While the Socratic ideal may now be dismissed as “unrealistic” (a term that can be a cover for telling students, “You’re not worth it”), that is not a good reason to stand it on its head or to forget about it. An advantage of ideals-based assessment using a Socratic measure is that it begins with a long-known truth about teaching and then looks at how far below the ideal one has sunk. At least you can tell where you are, even though it’s not where you should be.

If one keeps clearly in mind Berk’s comments on the Factory Worker–Instructor Productivity Analogy and the three reasons that he gives to explain why the analogy breaks down, then the difficulty in determining teaching effectiveness should be unsurprising.

Determining quality of educational practices (teaching effectiveness) is not easy even when the student/teacher ratio is 1.0. Things get a lot more complicated when the ratio is 3.0,[34] and when it exceeds 10.0, one might as well try to juggle ten buttered bowling balls while riding a rusty unicycle on rapidly melting ice in high winds. It probably is feasible to make small improvements even when classes are larger, but one is tinkering with a relatively low mean level of success – relative to what might be achievable with a ratio near to 1.0.

Gaming Ratings

Rodney C. Roberts,[35] “Are Some of the Things Faculty Do to Maximize Their Student Evaluation of Teachers Scores Unethical?,” Journal of Academic Ethics (2016) 14:133–148, DOI 10.1007/s10805-015-9247-1.

Abstract: This paper provides a philosophical analysis of some of the things faculty do to maximize their Student Evaluation of Teachers (SET) scores. It examines 28 practices that are claimed to be unethical methods for maximizing SET scores. The paper offers an argument concerning the morality of each behavior and concludes that 13 of the 28 practices suggest unethical behavior. The remaining 15 behaviors are morally permissible.


The list of 28 practices is taken from:


Crumbley, D. L., & Smith, G. S. (2000). “The games professors play in the dysfunctional performance evaluation system used in higher education: Brainstorming some recommendations.” In R. E. Flinn & D. L. Crumbley (Eds.), Measure learning rather than satisfaction in higher education (pp. 41–57). Sarasota: American Accounting Association.


Professor Roberts concludes:


If this analysis holds, the answer to the question posed by its title is “yes”—some of the things faculty do to maximize their SET scores are unethical. Among them are the things suggested by Rules 1, 3, 4, 5, 7, 10, 12, 13, 16, 21, 23, 24 and 28. Following any of the other rules is morally permissible.


Appendix to (Roberts 2016): Rules of the Trade (from Crumbley and Smith 2000, 46–47)

[red = STOP!]

1. First and foremost, inflate your students’ grades.

2. Reduce the course material covered and drop the most difficult material first.

3. Give easy examinations (e.g., true-false; broad, open-ended discussion questions; take home exams; open book exams).

4. Join the college party environment by giving classroom parties on SET day. Sponsoring students’ officially approved class skipping days to ball games, etc. is a means to increase student satisfaction. One Oregon professor prepared cupcakes on the day the SET questionnaires are distributed.

5. Give financial rewards such as establishing connections to potential employers.

6. Spoon-feed watered-down material to the students.

7. Give answers to exam questions beforehand. Either pass them out in class or if you want the students to work harder, put them on reserve in the library or on the Internet.

8. Do not risk embarrassing students by calling on them in the classroom.

9. Hand out sample exams, or take your examination questions from the students’ online exercises provided by the textbook publisher.

10. Grade on a wide curve.

11. Give [the] SET as early as possible in the term and then give hard exams, projects, etc.

12. Keep telling students how much they are learning and that they are intelligent.

13. Delete grading exams, projects, and other material. If they turn in work, give them credit. The correctness of the work is not an issue.

14. Teach during the bankers’ hours (9:00–3:00) favored by the students.

15. Give the same exams each semester allowing the answers to get out and grades to move higher and higher each semester.

16. Avoid the effort of trying to teach students to think (e.g., avoid the Socratic method).

17. Provide more free time (e.g., cancel classes on or near holidays, Mondays, Fridays, etc.).

18. Avoid giving a cumulative final exam.

19. Do not give a final exam and dismiss the class on the last class day. Even if the final is administratively required there are methods to avoid the final exam.

20. Use simple slides so the students do not need to read the book and post the slides to the course website from which test questions will be taken.

21. Where multiple classes are taught by different instructors, always ensure that your classes have the highest GPA.

22. Allow students to participate in determining material coverage and the number of points assigned to difficult test questions.

23. When possible, teach classes where common exams are used; then help students pass “this bad exam” for which you are not responsible.

24. Allow students to re-take exams until they pass. It helps to put a page reference next to each question so the students can find the answer during an open-book examination.

25. Give significant above-the-curve extra or bonus credit.

26. Remember to spend the first ten minutes of class schmoozing and joking with the students.

27. In online courses allow the students a two-day window to take the posted online examination.

28. Allow anonymous taking of online examinations by students (i.e., do not use a test center).


Roberts’ Conclusion about How to Raise Student Ratings without being Unethical

[green = GO?]

1. Allow students to participate in determining material coverage and the number of points assigned to difficult test questions.

2. Avoid giving a cumulative final exam.

3. Do not give a final exam and dismiss the class on the last class day. Even if the final is administratively required there are methods to avoid the final exam.

4. Do not risk embarrassing students by calling on them in the classroom.

5. Give [the] SET as early as possible in the term and then give hard exams, projects, etc.

6. Give significant above-the-curve extra or bonus credit.

7. Give the same exams each semester allowing the answers to get out and grades to move higher and higher each semester.

8. Hand out sample exams, or take your examination questions from the students’ online exercises provided by the textbook publisher.

9. In online courses allow the students a two-day window to take the posted online examination.

10. Provide more free time (e.g., cancel classes on or near holidays, Mondays, Fridays, etc.).

11. Reduce the course material covered and drop the most difficult material first.

12. Remember to spend the first ten minutes of class schmoozing and joking with the students.

13. Spoon-feed watered-down material to the students.

14. Teach during the bankers’ hours (9:00–3:00) favored by the students.

15. Use simple slides so the students do not need to read the book and post the slides to the course website from which test questions will be taken.


In other published work, Professor Crumbley has provided a list of over one hundred factors typically beyond an instructor’s control that can affect student ratings of teaching and that are seldom if ever corrected for in evaluating instructors’ teaching. Given this demonstrated capacity for thoroughness, it is shocking that Crumbley’s Rules of the Trade included only 28 items.[36] While I hesitate to add to the list, I will mention one practice that I have seen used and which might inspire others to think more broadly:


Set a reasonable deadline for a major assignment around a third to a half of the way through a course and then, 3-5 days before the deadline, announce that you are giving students an extra 2-3 weeks because they are “working so hard and could do with some extra time.”


It’s a very safe bet[37] that most students won’t have started to work on the assignment before the extension is announced and will be grateful for the additional time to procrastinate further. There is a small risk that the few students who had begun or even completed the work ahead of the deadline will feel (irrationally) resentful, but the benefits should swamp that cost. (Remember: even though it makes little sense for ordinal scales, only average scores count in the ratings game!)
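The parenthetical point about ordinal scales can be made concrete with a toy example (the ratings below are hypothetical, not data from any study): two very different 5-point rating distributions can share the same average, so a summary built on the mean treats them as equivalent while hiding the shape that actually matters.

```python
from statistics import mean, median

# Hypothetical 5-point ratings for two course sections (illustration only):
polarized = [1, 1, 5, 5, 5]      # loved by some students, loathed by others
mildly_liked = [3, 3, 3, 4, 4]   # uniformly lukewarm

print(mean(polarized), median(polarized))        # 3.4 5
print(mean(mildly_liked), median(mildly_liked))  # 3.4 3
```

Both sections “average 3.4,” yet a median or a full frequency table would immediately distinguish them.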


This practice has an added ‘feature’: it is very difficult to detect as ratings-gaming if it is not done to the same extent every semester. It is also difficult to stop even if detected.[38]


Perhaps there should be a blog devoted to recording any further suggestions. Anonymous submissions could be allowed, even encouraged. A system might be developed for rating suggestions on a number of scales, detectability among them. All and only those who believe that use of student ratings does not contribute to grade inflation would be welcome.


[I am not recommending gaming ratings and would thus not recommend even the gaming strategies that Roberts claims are not unethical. A few of the practices may be defensible for other purposes, however.]

Gaming Syllabi

Some universities archive course syllabi and make them available to students. Some maintain centralized repositories that students can access, and some leave it to individual departments to make the syllabi for their courses available. I know of no research on the effects of making syllabi available to students.[39] I suggest some hypotheses for future research.

If one sees students as customers shopping for products that they believe will decrease their time to graduation while minimizing their effort, then the practice of making fully detailed syllabi available to students during pre-registration periods might be defensible.

Hypothesis: Making section-specific information readily available to students prior to the start of the semester would have an effect similar to the effect of making grade distribution information readily available, and thus would likely result in (or intensify) marketing battles, as students shopped for course sections with workloads and policies that they preferred. (Departmental budgets are to a significant extent determined by student credit hours per FTEs generated.) Instructors would have an incentive to conceal or shade the truth. Non-current syllabi may be misleading or irrelevant.

At many universities, the vagaries of the budget cycle and reliance on NTT instructors can mean that many course sections do not have instructors during the preceding semester. This includes sections to be taught by new T/TT and NTT faculty as well as courses taught by NTT faculty who don’t yet have contracts for the coming semester. Because of the time taken by HR processes, appointments can be delayed by many weeks. This can easily affect a large percentage of the course sections being offered by a department.

Providing just the following six information items in sample syllabi can and often will be misleading without a great deal of additional detail that cannot be made available during registration.

  • Course description
  • Learning outcomes
  • Evaluative methods
  • Course materials
  • Course schedule
  • Attendance policy

See below for four illustrations at three levels of detail; many more such examples can easily be imagined.

Illusions of Comparability at Multiple Levels of Detail for Sections or Courses 

Level of Detail 1

| | *Course or Sec 1 | Course or Sec 2 | Course or Sec 3 | Course or Sec 4 |
|---|---|---|---|---|
| Other Readings | | | | |
| Attendance policy | 1 unexcused absence | 1 unexcused absence | Attendance not taken | Attendance not taken |

Level of Detail 2

| | Course or Sec 1 | Course or Sec 2 | Course or Sec 3 | Course or Sec 4 |
|---|---|---|---|---|
| Total cost of texts | | | $10 + $0 (free pdf) | $120 + $30 |
| Rdg pages/wk | | | | |
| Assignment Length | | | 100 words each | 1000/1000/3000 words |
| Test design | 50 validated MC each, in class | 20 T/F, 5 short essays each, in class | 15 T/F, 15 MC each, open book, online, 24 hrs. | 3 with 1 long essay; Final Exam – 3 long essays |
| Attendance policy | Policy leniently interpreted | Policy strictly interpreted | Those who miss >10% tend to earn D’s or F’s | Meets once weekly. Miss at your peril! |

Level of Detail 3

| | Course or Sec 1 | Course or Sec 2 | Course or Sec 3 | Course or Sec 4 |
|---|---|---|---|---|
| Other information | Test questions of no more than modest difficulty. | Mostly very difficult test questions. | Readings at 12th grade level or lower | Readings akin to cryptic 20thC poetry (e.g., Prynne, Pound) in difficulty |

*“or Sec” indicates that these could be different sections of the same course.

Unless a great deal of detail about work, standards and policies is provided, information of the sorts displayed above will be either uninterpretable and almost useless, or misleading when used to compare courses or different sections of the same course.

One obligation of a critic is to present the position being criticized at its strongest. The request for a syllabus archive might be defended using an analogy with patient choice of medical care providers. Choosing poorly can result in much wasted time and money, suffering and even death. Patients have often expressed a need for more transparency in demanding information about health care providers and their facilities (e.g., hospitals). The following details the relevant sort of analogy.

Level of Detail 1

| | Physician 1 | Physician 2 | Physician 3 | Physician 4 |
|---|---|---|---|---|
| | Top 5 US medical school, chief resident, fellowship, MD/PhD | Top 5 US medical school, chief resident, fellowship, MD/PhD | non-US medical school | non-US medical school |
| | Full professor at top 5 medical school, board certified | Full professor at top 5 medical school, board certified | Private group practice, board certified | Private group practice, board certified |
| Office staff | | | | |
| Average wait time | 20 mins | 20 mins | 10 mins | 10 mins |
| Billing rate | $600 1st visit | $600 1st visit | $150 1st visit | $150 1st visit |
| Success rate | | | | |
| Patient rating | | | | |
Level of Detail 2

| | Physician 1 | Physician 2 | Physician 3 | Physician 4 |
|---|---|---|---|---|
| | Earned highest grades, PhD 6 years of research beyond MD | Barely passed most courses, PhD was 1 year “add on” to MD; his parents were on faculty | Oxford University Medical School (top five), top 5% of class | Atlantis University of Health Sciences College of Osteopathic Medicine |
| | Serves on certifying board | See above | Offered faculty positions at several top schools worldwide | Board certified on 5th attempt |
| Office staff | Efficient liaisons with 200 medical center staff | Smiling, incompetent relatives | Cordial, efficient and helpful | Moody and inefficient |
| Average wait time | Never rushes, listens, stays late | Rushes, doesn’t listen, cancels for ‘golf’ | Efficient, doesn’t rush, listens | Rushes, quadruple books, talks instead of listening |
| Billing rate | Waives or pays fees for many | Full payment at time of visit | Accepts what insurance pays | Refers to collection agency after 30 days |
| Success rate | Takes only very difficult cases | Takes only easy cases | Takes mostly difficult cases | Takes only easy cases |
| Patient rating | 500 responses, mainly other physicians and health-care professionals | 10 responses, half of them the result of threats | Unfairly blamed for unavoidable poor outcomes | Botches many cases |


Level of Detail 3

| | Physician 1 | Physician 2 | Physician 3 | Physician 4 |
|---|---|---|---|---|
| Other information | Would have won Nobel Prize if she were not so dedicated to patient care; shy, modest and kind. Lost one malpractice case – judge was incompetent, but she decided not to appeal to avoid causing pain to confused plaintiff; malpractice rate raised. | Smart, lazy narcissist with psychopathic tendencies, appears highly confident, rules by intimidation. | Settled two malpractice cases at insistence of insurer to save them cost of trial, where he would have won. Unwarranted FDA investigation eventually resolved without finding fault. | She bought her ratings via e-bay, justly loses almost every malpractice case, and is under investigation for Medicaid fraud – outcome not yet public. |

The most important line in the tables above is probably “Success rate,” which is worse than useless unless one has reliable and accurate background information. While it is not easy to measure success in medicine, it is often feasible, though a fair assessment will require expertise that few patients have.

Full information may be incomprehensible; less than full information, if comprehensible, will be grossly misleading unless it is very carefully presented, accurate and reliable.

It is not true that “something is better than nothing”; the ‘something’ can easily be misleading. It would be better to avoid providing misleading information.

Measuring success of individual course/instructor combinations in higher education is far more difficult than in medicine.

The fundamental point is that students are not customers who should be treated as “knowing what they want” on a “let the buyer beware” basis. While their interests are very important (and they may need more opportunities to explore in order to figure out which interests they want to pursue), they are not dispositive. Students are more like patients, who may or may not have well-defined, readily discernible symptoms and who are not competent to determine what they need to achieve good health. The latter determination requires a fairly high level of expertise. A modest degree of paternalism is wholly compatible with treating someone with respect and may be required by respect for their actual needs. Part of treating someone with respect is not acceding to requests or demands that they make when it is known that doing so is likely to harm them (because, predictably, they will misunderstand and misuse information).

From: Cristobal Young and Xinxiang Chen, “Patients as Consumers in the Market for Medicine: The Halo Effect of Hospitality,” Aug 8, 2018

                   Drawing on a sample of over 3,000 American hospitals, this research finds that patients have limited ability to observe the technical quality of their medical care, but are very sensitive to the quality of room and board care….

                  Higher medical quality has a weak effect on patient satisfaction. In contrast, the quality of interaction with nurses has a positive effect size three or four times larger than medical quality. Even relatively minor customer service aspects, such as the quietness of rooms, have as much or more impact on patient satisfaction than medical quality or hospital survival rates. When evaluating the overall hospital experience, patients can find that the non-medical, hospitality aspects of their experience are more visible, visceral, and memorable….

                  The hospital experience can be understood in a classical Goffmanian sense of having front-stage and back-stage elements …. Front-stage aspects are highly visible to patients, and these mostly relate to the hospitality or hotel amenities of the experience. The back-stage aspects are highly technical medical services and operations, which are mostly invisible to patients. In a sense, form is more visible to patients than content. Visibility is not necessarily well-connected to importance. The things patients can see are not necessarily those that matter for their well-being. Indeed, the skew in what is visible means that consumer satisfaction responses focus on hotel aspects of their stay, with little conscious attention placed on the quality of medical treatment they received, or how well the hospital protected them from risk of accidental injury, illness, or death. Patients end up using the non-medical aspect of their hospital stay as a marker of quality on all dimensions (both seen and unseen)– what we term a halo effect of hospitality….

                  Today, patient satisfaction is becoming a central dimension on which hospitals and doctors are evaluated. This carries great potential to redirect both patients and hospitals from the core mission of medical excellence. In a medical market with more high-charged incentives, competition for patients may lead hospitals to focus on what their consumers can immediately observe, and economize on what they cannot. In a truly consumer-driven health care system where what matters most is patient satisfaction, we expect to see developments such as 24-hour room service, gourmet meals, HBO channels, designer hospital gowns, non-medical staff to tend to patient comfort, hospital executives recruited from the service industry, and growing capital investments in private rooms, ‘healing gardens,’ atriums, WiFi and waterfalls. Patients suffering through the pains and discomforts of medical treatment will appreciate a higher standard of hospitality. However, this same movement may lead to cutbacks in or crowding out of what medical consumers cannot readily observe: the provision of excellent medical treatment and vigorous commitment to patient safety. Over time, hospitals may become increasingly comfortable places to stay, but less ideal places to undergo medical treatment. This is a market driven health care system that turns hospitals into hotels where our caregivers play concierge …. Deluxe accommodation in hospitals may come to set the gold standard of what good medicine ‘looks like’ ….

                   In higher education, universities face similar pressures in the competition for student applications. In US News and World Report rankings, the best universities are those that admit the smallest share of their applicant pools. The fast track to a lower acceptance rate is to attract more and more student applicants without admitting them. Such metrics nudge colleges towards a public face of college-as-country-club or summer camp, giving greater leeway to a party culture and sports programs while often downplaying the academic rigor of their programs (Armstrong and Hamilton 2015; O’Neil 2016, Chapter 3). Moreover, college teaching evaluations appear to have a minimal or even negative relationship with student learning, but a strong connection with the easiness of courses (Uttl et al 2017; Braga et al 2014; …). These two customer satisfaction metrics are not pushing colleges towards the best interest of students – high quality, affordable education that can change their life course – but rather towards student experiences that are more immediately likable. [emphasis added]

About the author of this summary:

Although I am “statistically literate,” having begun professional use of statistical methods in 1970 (in consulting on design for reliability and maintainability of large weapon systems, e.g. systems with nuclear submarines and aircraft carriers among their components), I am not a statistician, and I am of course subject to the usual range of cognitive biases.[40]

[Daniel Kahneman[41] (Thinking, Fast and Slow) repeatedly emphasizes how difficult it is to avoid errors in critical thinking; few know more than he does about the kinds of inferential errors that people are inclined to make. Yet Kahneman reports that he himself continues to make such errors and must remain vigilant to detect and correct them[42] despite knowing well the principles of correct deductive and inductive inference taught by basic texts.

What can be done about biases? How can we improve judgments and decisions, both our own and those of the institutions that we serve and serve us? The short answer is that little can be achieved without a considerable investment of effort. As I know from experience, System 1 [the source of “intuition”] is not readily educable. Except for some effects I attribute mostly to age, my intuitive thinking is just as prone to overconfidence, extreme predictions, and the planning fallacy as it was before I made a study of these issues. I have improved only in my ability to recognize situations in which errors are likely…. And I have made more progress in recognizing the errors of others than my own.[43]]

To help mitigate the effects of bias, I’ve studied a lot of the available research from 2003 to the present – about 800 papers or books, as I noted above – but the full body of research dates from 1930 or so, and few human beings live long enough to examine it all. I’ve also relied on the expertise of researchers with good track records in other domains, and I’ve tried to be alert to possible bias in their work. Among the most persistent advocates for continued use of student ratings, there seems to me to be more evidence of bias, a point also made by Porter. Of course, the research on cognitive bias applies equally to the interpretation of student ratings, as noted in this summary.

In any case, I am certain that my summary is not error free, and that counsels some caution in any recommendations about what to do with results of student ratings.

[1] The use of the label “student ratings” is recommended by Ronald Berk (cited below) and, as he argues, the label matters. “Student-customer satisfaction snapshots,” might be more nearly accurate.

[2] Reliability and validity are of course matters of degree. I base these seven claims on study of over 800 sources concentrated in the time period 2003-2018. I’d planned to provide a full summary of all relevant sources, but now believe that such a summary would go largely ignored. So, I provide this relatively brief summary. Any opinions that I express here are my own and should not be taken as reflecting views of my department, college or university.

[3] See: US EEOC Guidelines, 29 CFR 1607.5 (General standards for validity studies) and 29 CFR 1607.14 (Technical standards for validity studies).

[4] See e.g., Ronald Berk, “Start Spreading the News: Use Multiple Sources of Evidence to Evaluate Teaching,” The Journal of Faculty Development v32 n1 (2013) 73–81, and the article it cites: W. A. Wines & T. J. Lau, “Observations on The Folly of Using Student Evaluations of College Teaching for Faculty Evaluation, Pay and Retention Decisions and its Implications for Academic Freedom,” William & Mary Journal of Women and the Law v13 (2006) 167–202. On the use of statistics in assessing disparate impact, see Kevin Tobia, “Disparate Statistics,” Yale Law Journal v126 (2017) 2382–2420. There is as yet no relevant case law on reliability of student ratings.

[5] R. P. Perry, University of Manitoba, “Review of V Johnson’s Grade Inflation: A Crisis in College Education,” Academe v90 n1 (January-February, 2004) 90-91, the published review by an Education faculty member, accused Johnson of ignoring relevant prior research. All of the supposedly ignored research and more was listed in a table in Grade Inflation and considered carefully by Johnson. With some exceptions, some cited below, the research literature is polarized: that produced by School of Education faculty favors use of student ratings of teaching and the research produced by political scientists, sociologists, statisticians, economists, and marketing research faculty finds them unreliable, invalid and/or biased. Porter/HCM Strategists (2012) (see below) offers one possible explanation for this divide. An investigation of a range of possible explanations for the polarization would make for an interesting social science dissertation.

[6] A large study (10 years, 5,454 instructors, 53,658 classes, University of Washington College of Arts & Sciences) similar to Weinberg et al reached much the same conclusions: Andrew Ewing, Essays on Measuring Instructional Quality in Higher Education Using Students’ Evaluations of Teachers and Students’ Grades, PhD Dissertation in Economics, University of Washington, 2009. See also Andrew Ewing, “Estimating the Impact of Relative Expected Grade on Student Evaluations of Teachers,” February 13, 2009; published version: Economics of Education Review v31 (2012) 141–154.

[8] Cited in an early version by Porter: Garry, M., Sharman, S. J., Feldman, J., Marlatt, G. A., & Loftus, E. F. (2002). “Examining memory for heterosexual college students’ sexual experiences using an electronic mail diary.” Health Psychology, 21(6), 629-634. Abstract: To examine memory for sexual experiences, the authors asked 37 sexually active, nonmonogamous, heterosexual college students to complete an e-mail diary every day for 1 month. The diary contained questions about their sexual behaviors. Six to 12 months later, they returned for a surprise memory test, which contained questions about their sexual experiences from the diary phase .… [Except for their accurate recollection of (low) frequency of anal sex, the students grossly overestimated by as much as a factor of four the frequency of vaginal or oral sex; men and women did not differ significantly in their overestimates.]

[9] “The most widely reported claim made by Arum and Roska—that 45 percent of students made “no measurable gains in general skills”—is simply wrong. As both Alexander Astin and John Etchemendy have explained at length, this erroneous claim is due to a common statistical fallacy—namely, a failure to distinguish false positives from false negatives. The fact that 45 percent of students failed to pass a statistical threshold designed to assure us that they in fact improved their basic skills means only that we don’t know how many of these students did in fact improve their basic skills—conceivably quite a large number.” – Bowen, William G., and McPherson, Michael S.. Lesson Plan: An Agenda for Change in American Higher Education. (Princeton University Press, 2016), citing: Alexander Astin, “The Promise and Peril of Outcomes Assessment,” Chronicle of Higher Education, September 3, 2013; and John Etchemendy “Are Our Colleges and Universities Failing Us?” Carnegie Reporter 7, n3 (Winter 2014).

[10] As epidemiologist John P. A. Ioannidis (among others) has pointed out, researchers are not always aware of the influence that their funding sources have and may thus sincerely deny the existence of such influence.

[11] Emphasis added. Porter was a committee member, but not the dissertation director. The director was Professor Paul Umbach. The other committee members were Associate Vice Chancellor Carrie Zelna and Vice Chancellor Mike Mullen. Before beginning his dissertation research, Dr. Standish was an experienced higher education data analyst; he subsequently moved to SAS Institute.

[12] See also: NA Bowman & TE Seifert, “Can College Students Accurately Assess What Affects Their Learning and Development?” Journal of College Student Development, v52 n3 (May-Jun 2011) 270-290; and NA Bowman, “Examining Systematic Errors in Predictors of College Student Self-Reported Gains,” New Directions for Institutional Research n150 (Sum 2011).

[13] Cited in: Linda B. Nilson, “Time to Raise Questions about Student Ratings,” in: James. E. Groccia and Laura Cruz, eds. To Improve the Academy: Resources for Faculty, Instructional and Organizational Development (Jossey-Bass, 2012), Chapter 14.

Portions available at: &dq=linda+nilson&ots=qnyCuVvRz7&sig=nZ7STaU6NTb2Hrgrv- f26RZxapM#v=onepage&q=linda%20nilson&f=false

See also: Linda B. Nilson, “Measuring Student Learning to Document Faculty Teaching Effectiveness,” in To Improve the Academy v32 n1 (June 2013) 287–300. DOI: 10.1002/j.2334-4822.2013.tb00711.x

Abstract: Recent research has questioned the validity of student ratings as proxy measures for how much students learn, and this learning is a commonly accepted meaning of faculty teaching effectiveness. Student ratings capture student satisfaction more than anything else. Moreover, the overriding assessment criterion in accreditation and accountability—that applied to programs, schools, and institutions—is student learning, so it only makes sense to evaluate faculty by the same standard. This chapter explains and evaluates course-level measures of student learning based on data that are easy for faculty to collect and administrators to use.

Nilson had long been seen as an ally by School of Education advocates for student ratings. She was the target of a group attack in a listserv because she criticized the use of student ratings relying on recent research, some cited here. See the POD Archives (POD@LISTSERV.ND.EDU), threads “Replacing student evaluation of teaching,” “NEW Review Articles on Student Ratings & Peer Review of Teaching,” and “Today’s Student.”

[14] These would standardly be taken as “explaining” 0.18² ≈ 3% to 0.25² ≈ 6% of the variance, thus leaving at least 94–97% unexplained; these are indeed weak correlations.
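The footnote’s arithmetic is just the squared-correlation (coefficient of determination) rule; a short loop confirms it for the two reported correlations:

```python
# Variance "explained" by a correlation r is r**2 (the coefficient of determination).
for r in (0.18, 0.25):
    explained = r ** 2
    print(f"r = {r:.2f} explains {explained:.1%}, leaving {1 - explained:.1%} unexplained")
```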

[15] Beleche et al do not attempt to determine the possible effect of the remedial nature of the courses and the characteristics of the students who are required to take them. Valen E. Johnson (2003), page 106, found that the weakest students at Duke were less inclined to attribute their poor performance to instructors than students who performed better. Duke University admits less than 20% of its applicants, UC/Riverside about 85%; a student in the 25th percentile of the freshman class at Duke would have an SAT score over 1300; at UC/Riverside, a student in the 25th percentile of its freshman class would have a score under 1000. It is reasonable to wonder whether the UC/Riverside students studied by Beleche et al are more like the weakest Duke students in the way Johnson describes. In mid-2010, relying on a pre-print of the paper, I wrote to Professor Marks about this but have as yet (Dec 2018) received no reply.

[16] “Philip B. Stark is the Associate Dean for the Division of Mathematical and Physical Sciences at UC Berkeley. Stark’s research centers on inference (inverse) problems, especially confidence procedures tailored for specific goals. Applications include the Big Bang, causal inference, the U.S. census, climate modeling, earthquake prediction, election auditing, food web models, the geomagnetic field, geriatric hearing loss, information retrieval, Internet content filters, nonparametrics (confidence sets for function and probability density estimates with constraints), risk assessment, the seismic structure of the Sun and Earth, spectroscopy, spectrum estimation, and uncertainty quantification for computational models of complex systems. In 2015, he received the Leamer-Rosenthal Prize for Transparency in Social Science award. Stark was Department Chair of Statistics [a top 2 PhD program] and Director of the Statistical Computing Facility at UC Berkeley.” He is also a highly experienced and successful expert witness.

[17] Both Freishtat and Stark served as experts in a Canadian case, Ryerson Faculty Association v. Ryerson University, in which the arbitrator found for the Ryerson Faculty Association. Accepting Freishtat’s and Stark’s reports, the arbitrator ruled (June 28, 2018; 2018 CanLII 58446) that student ratings of teaching cannot be used to assess teaching effectiveness.

[18] Ryalls, K. R., Benton, S. L., Li, D., & Barr, J. (2016). Response to “Bias against female instructors” (IDEA Editorial Note). Manhattan, KS: The IDEA Center. This does, as Berk (2018) notes, “repudiate most” of Uttl et al’s conclusions. In one sense of the term “repudiate” – “refuse to accept or be associated with” – that’s accurate. (The link supplied by Berk is broken, but several others yield IDEA responses.) The survey used by IDEA has 47 items, an invitation to satisficing that makes it more likely that any biases will be obscured. IDEA also makes available a short form with 18 items.

[19] I’m expressing a worry, not rendering a legal judgment (which I’m not competent to provide). “Free legal advice is worth what one pays for it.”

[21] The results of this relatively small study were re-analyzed by Stark using non-parametric permutation methods. The re-analysis showed larger differences between ‘male’ and ‘female’ instructors and stronger evidence of anti-female bias than found in the authors’ analysis.
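For readers unfamiliar with the method: a two-sample permutation test repeatedly re-shuffles the group labels to see how often a difference as large as the one observed arises by chance, making no normality assumptions. The following is a minimal illustrative sketch with made-up ratings, not Stark’s actual code or the study’s data:

```python
import random

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided p-value for the observed difference in group means,
    estimated by randomly re-assigning observations to groups."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-assign group labels at random
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_perm

# Hypothetical ratings, purely to show usage:
rated_as_male = [4.2, 4.5, 4.0, 4.4, 4.6, 4.3]
rated_as_female = [3.8, 4.0, 3.7, 4.1, 3.9, 3.6]
p = permutation_test(rated_as_male, rated_as_female)
```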

[22] See also: Michael Hessler et al, “Availability of [chocolate] cookies during an academic course session affects evaluation of teaching,” Medical Education (2018), doi: 10.1111/medu.13627, 9 pp.

OBJECTIVES: Results from end-of-course student evaluations of teaching (SETs) are taken seriously by faculties and form part of a decision base for the recruitment of academic staff, the distribution of funds and changes to curricula. However, there is some doubt as to whether these evaluation instruments accurately measure the quality of course content, teaching and knowledge transfer. We investigated whether the provision of chocolate cookies as a content-unrelated intervention influences SET results. METHOD: We performed a randomised controlled trial in the setting of a curricular emergency medicine course. Participants were 118 third-year medical students. Participants were randomly allocated into 20 groups, 10 of which had free access to 500 g of chocolate cookies during an emergency medicine course session (cookie group) and 10 of which did not (control group). All groups were taught by the same teachers. Educational content and course material were the same for both groups. After the course, all students were asked to complete a 38-question evaluation form. RESULTS: A total of 112 students completed the evaluation form. The cookie group evaluated teachers significantly better than the control group (113.4 ± 4.9 versus 109.2 ± 7.3; p = 0.001, effect size 0.68). Course material was considered better (10.1 ± 2.3 versus 8.4 ± 2.8; p = 0.001, effect size 0.66) and summation scores evaluating the course overall were significantly higher (224.5 ± 12.5 versus 217.2 ± 16.1; p = 0.008, effect size 0.51) in the cookie group. CONCLUSIONS: The provision of chocolate cookies had a significant effect on course evaluation. These findings question the validity of SETs and their use in making widespread decisions within a faculty.
Concluding Remarks: Whether this effect is mostly attributable to the cookies themselves or to the influence of the broader social variable of reciprocity cannot be answered. Would we have found similar effects if we had offered the students unpalatable kale and celery, a monogrammed commemorative course T-shirt or a coffee mug? Reciprocity might induce demand effects and enhanced evaluations, but it may also increase motivation and commitment to learning the material. This needs to be investigated in future studies.
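The reported effect sizes can be checked against the published means and standard deviations. The sketch below assumes Cohen’s d computed with a pooled standard deviation and equal group weighting (my assumption about the authors’ computation, which the published numbers appear to bear out):

```python
from math import sqrt

def cohens_d(mean1, sd1, mean2, sd2):
    """Cohen's d using a pooled SD with equal group weights."""
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (mean1 - mean2) / pooled_sd

# Means and SDs as reported in the abstract:
d_teachers = cohens_d(113.4, 4.9, 109.2, 7.3)    # reported effect size: 0.68
d_material = cohens_d(10.1, 2.3, 8.4, 2.8)       # reported effect size: 0.66
d_overall  = cohens_d(224.5, 12.5, 217.2, 16.1)  # reported effect size: 0.51
```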

[23] If survey results are analyzed without checking for satisficing, results may also appear to have greater reliability and validity: Tyler Hamby & Wyn Taylor, “Survey Satisficing Inflates Reliability and Validity Measures: An Experimental Comparison of College and Amazon Mechanical Turk Samples,” Educational and Psychological Measurement 2016, Vol. 76(6) 912–932. The same can obviously apply to lying by respondents.
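Checking for the crudest form of satisficing is computationally trivial; a minimal illustrative sketch (respondent names and data are hypothetical):

```python
def straight_liners(responses):
    """Flag respondents who give the identical answer to every item
    on a multi-item form (one basic satisficing check)."""
    return [rid for rid, answers in responses.items()
            if len(answers) > 1 and len(set(answers)) == 1]

surveys = {
    "r1": [5, 5, 5, 5, 5],   # same answer throughout: flagged
    "r2": [4, 5, 3, 4, 2],
    "r3": [1, 1, 1, 1, 1],   # flagged
}
flagged = straight_liners(surveys)
```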

[24] In some disciplines, e.g. philosophy and religious studies, instructors are virtually obligated to raise questions about student beliefs that are apt to be deeply held; instructors in such disciplines are therefore at greater risk of retaliatory ratings, for this particular reason, than, say, physics or materials science instructors. This is not at all to imply that STEM instruction is risk-free: there are many ways in which students can become unhappy even when the instructor is excellent.

[25] David Dunning, “The Dunning–Kruger Effect: On Being Ignorant of One’s Own Ignorance,” Advances in Experimental Social Psychology v44 (2011) 247–296.

[27] E.g., see Paul Rice, Chronicle of Higher Education, 10/7/1981. The instructor’s brief complimentary remarks made just before students completed surveys raised the instructor’s ratings by an average of 17% over those received on the previous week’s surveys, which he pretended had been lost. The instructor, a poet, was then teaching at UNC/Asheville. So, chocolate or other material inducements may not be necessary.

[28] Brae V. Surgeoner, Benjamin J. Chapman, Douglas A. Powell, “University Students’ Hand Hygiene Practice During a Gastrointestinal Outbreak in Residence: What They Say They Do and What They Actually Do,” Journal of Environmental Health v72n2 (September 2009) 24-28.

Abstract: Published research on outbreaks of gastrointestinal illness has focused primarily on the results of epidemiological and clinical data collected postoutbreak; little research has been done on actual preventative practices during an outbreak. In this study, the authors observed student compliance with hand hygiene recommendations at the height of a suspected norovirus outbreak in a university residence in Ontario, Canada. Data on observed practices was compared to postoutbreak self-report surveys administered to students to examine their beliefs and perceptions about hand hygiene. Observed compliance with prescribed hand hygiene recommendations occurred 17.4% of the time. Despite knowledge of hand hygiene protocols and low compliance, 83.0% of students indicated that they practiced correct hygiene recommendations. To proactively prepare for future outbreaks, a current and thorough crisis communications and management strategy, targeted at a university student audience and supplemented with proper hand washing tools, should be enacted by residence administration. [emphasis added]. See also: Cristobal Young and Xinxiang Chen, “Patients as Consumers in the Market for Medicine: The Halo Effect of Hospitality,” Aug 8, 2018 (quoted below)

[29] Though hardly invulnerable: see Clark Glymour, “Why the University Should Abolish Faculty Course Evaluations.”

[30] “ASA Statement on Using Value-Added Models for Educational Assessment,” April 8, 2014.

[31] Value-Added Measures in Education: What Every Educator Needs to Know (Harvard Education Press, 2011), pp. 173, 201.

[32] Providing students during advance registration with grade distributions for courses and instructors, and giving them ready access to details about course workload as specified in syllabi, leads to course-shopping for GPA-yield, enrollment shifts, and marketing wars, and thus also promotes grade inflation. Syllabi can easily be designed to be accurate while also misleading students about workload. Providing grade distributions has been shown to promote grade inflation; see: Talia Bar, Vrinda Kadiyali, and Asaf Zussman, “Quest for Knowledge and Pursuit of Grades: Information, Course Selection, and Grade Inflation,” and Talia Bar, Vrinda Kadiyali, and Asaf Zussman, “Grade Information and Grade Inflation: The Cornell Experiment,” Journal of Economic Perspectives, Volume 23, Number 3 (Summer 2009) 93–108. And then there is the effect of technology-enhanced academic dishonesty, which facilitates and amplifies common tendencies (of >50%) to cheat or plagiarize.

[33] The decline in effectiveness may be, and likely is, non-linear. For example, there might be a steep drop between 1 and 3, a linear decline from 3 to 19, another steep drop around 20, and then a more gradual decline until much larger ratios (>100) are at issue. There is some evidence for non-linearity around 20.

[34] Divorce, which involves only two individuals, is notoriously difficult to predict, even with massive quantities of relevant data across time. The oft-repeated, inflated claims of Gottman et al – that divorce is readily predictable, via correlations with specified behaviors – result from what is now regarded as a classic instance of over-fitting of data.

[35] Department of Philosophy & Religious Studies, East Carolina University, Brewster A-327, Mail Stop 562, Greenville, NC 27858.

[36] Maybe his co-author Smith or co-editor Flinn held Crumbley back. Maybe they foisted their qualms on him.

[37] Any takers?

[38] It’s also surprising that the Oregon professor mentioned in 4. of the Rules of the Trade offered cupcakes only on the day on which student ratings were being filled out in class. Instead, offering snacks and (non-alcoholic, detectable-drug-free) beverages every day that the class met would have been a much more effective and less clearly objectionable practice. That can get expensive, but it is after all an investment in long-term success and might even be tax deductible. (Check with at least one of your tax attorneys first; that’s what retainers are for!) End-of week feeding sessions might be an effective alternative practice. And don’t forget the chocolate!

[39] Searches of relevant online databases and consultation with Professors Talia Bar, Stephen Porter (NC State College of Education, Faculty Senate), and Paul Umbach (NC State College of Education) located no research, published or unpublished, about the likely effects of making course-section-specific workload information readily available online to students during registration. This is in itself cause for concern.

[42] Kahneman also offers an anecdote about the well-known cognitive psychologist Robyn Dawes, who also “knew better” than to make a common framing error but did so anyway. (Thinking, Fast and Slow, 149).

[43] Thinking, Fast and Slow, 417.