Examining Bias in Course Evaluations

by Gregory Hum, Assistant Director, Teaching Assessment, CTSI

Popular articles about course evaluations are often brought to our team’s attention, and many of them question the value and validity of course evaluations. For instance, a recent Inside Higher Ed article discussed gender bias. This blog post examines that article in detail and describes how, as part of the continuous assessment of our University of Toronto system, we are committed to ongoing examination of potentially biasing factors in course evaluations, including gender bias. I will address this issue in two parts: 1) what the research says; 2) what actions are underway at the University of Toronto.

What the research says

Despite headlines that make it seem as if course evaluations are hugely biased by gender, and while individual studies have found conflicting results, most of the research on gender bias in course evaluations does not support the notion that gender is a major factor. For instance, the extensive review by Pamela Gravestock and Emily Gregor-Greenleaf (2008), conducted at the outset of the current course evaluations framework at the University of Toronto, found:

“In general, studies relating to gender have produced inconclusive results, but most have shown that this variable has little or no impact on evaluations (Algozzine et al., 2004; Theall & Franklin, 2001; Marsh & Roche, 1997; Cashin, 1995; Arreola, 2000; Aleamoni & Hexner, 1980).” (p. 41)

A recent, rigorous and comprehensive book by Nira Hativa (2014), Professor of Higher Education, came to the same conclusion in reviewing the research on course evaluations:

“The majority of these studies reported no significant differences between genders suggesting that teacher gender does not bias student ratings. Most other reviews of studies of gender-SRI relationships have also concluded that these ratings have no strong or regular pattern of gender-based bias. That is, teacher gender has little or no impact on how students rate their teachers and thus does not bias SRI results (Algozzine et al., 2004; Arreola, 2000; Cashin, 1995; Feldman, 1992a,b; Gravestock & Gregor-Greenleaf, 2008; Huston, 2005; Theall & Franklin, 2001; Wright & Jenkins-Guarnieri, 2012).” (p. 85-86)

How, then, does this impression arise? For the most part, it is due to the selective choice of research that popular articles highlight. As is often the case with popular reporting of research, a finding of difference is more interesting, and more likely to be written about, than a finding of no difference or effect (even if the latter make up the majority of findings). The studies that have shown gender bias have often been relatively small and/or have shown very small effects. This is true of the studies discussed in the Inside Higher Ed article.

The MacNell et al. (2015) paper had 43 participants. At first glance this seems like a relatively large number of student participants. However, these participants were divided into four groups: two led by a male “assistant instructor” and two led by a female “assistant instructor”, with each instructor identified as male to one of their groups and as female to the other. Each of the four resulting groups was quite small (ranging from 8 to 12; see below). With so few students, it would be difficult to say that they are representative of all students’ attitudes towards instructor gender. Further, small studies such as this one can often turn up effects by pure chance, which is why replication and the consideration of multiple studies (as in the reviews cited above) are important.

Experimental design (source: Innov High Educ (2015) 40:291-303)

  • Discussion Group A (n = 8), Instructor’s Perceived Gender: Female, Instructor’s Actual Gender: Female
  • Discussion Group B (n = 12), Instructor’s Perceived Gender: Female, Instructor’s Actual Gender: Male
  • Discussion Group C (n = 12), Instructor’s Perceived Gender: Male, Instructor’s Actual Gender: Female
  • Discussion Group D (n = 11), Instructor’s Perceived Gender: Male, Instructor’s Actual Gender: Male
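
The concern about chance findings in small groups can be illustrated with a short simulation. The sketch below is not a reanalysis of the MacNell data. It assumes hypothetical ratings drawn from a single normal distribution (mean 4.0 and standard deviation 1.0, both invented for illustration), so there is no true gender effect at all, and it simply counts how often two groups of 20 and 23 students (the combined sizes of the perceived-female and perceived-male groups) differ by 0.3 points or more in their mean rating.

```python
import random
import statistics


def chance_gap_rate(n_a=20, n_b=23, gap=0.3, trials=10_000, seed=1):
    """Share of simulated studies in which two groups drawn from the SAME
    rating distribution differ in mean rating by at least `gap`.

    All numbers here are illustrative assumptions, not data from any study.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both groups share one distribution, so there is no true effect.
        group_a = [rng.gauss(4.0, 1.0) for _ in range(n_a)]
        group_b = [rng.gauss(4.0, 1.0) for _ in range(n_b)]
        if abs(statistics.fmean(group_a) - statistics.fmean(group_b)) >= gap:
            hits += 1
    return hits / trials


print(f"{chance_gap_rate():.0%} of simulated studies show a gap of 0.3 or more")
```

Under these assumptions, something on the order of a third of the simulated studies show such a gap even though none exists in the underlying population. The exact share depends on the assumed spread of ratings and on the threshold chosen.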

As well, while MacNell et al. cite the online nature of the course as providing an opportunity to “test” gender bias in purely experimental conditions by manipulating students’ perceived gender of the instructor, this is not as clean an experiment as it seems. First, the study tests “assistant instructors” rather than the actual main instructor of the course. Second, given the many differences between face-to-face and online courses, it is questionable to assume that the findings would extend to course evaluations or gender perceptions more generally. In short, it is not usually appropriate to extend findings from one context to another (e.g., from assistant instructors in an online course to regular professors in face-to-face classrooms) when the contexts are so fundamentally dissimilar.

Turning to the Stark paper (Boring, Ottoboni & Stark, 2016) cited in the Inside Higher Ed article, we should consider not only whether an effect exists but also how large it is. With very large sample sizes (here, roughly 23,000), even very small effects can be detected. That is, there is a gender difference, but it is so small that it is not practically relevant.
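
To make the link between sample size and detectability concrete, here is a minimal sketch. It assumes a simple Pearson correlation on independent observations and uses the Fisher z approximation, which need not match the testing procedure used in the paper; the function name is hypothetical. It estimates the smallest correlation that would reach the conventional two-sided 0.05 threshold for a given sample size.

```python
import math


def min_significant_r(n, z_crit=1.96):
    """Approximate smallest |r| that reaches two-sided p < 0.05 for n
    independent observations, using the Fisher z approximation
    (standard error of atanh(r) is about 1 / sqrt(n - 3))."""
    return math.tanh(z_crit / math.sqrt(n - 3))


# Sample sizes: the MacNell study, an arbitrary mid-sized study, and
# the approximate sample size of the Stark paper.
for n in (43, 1_000, 23_001):
    print(f"n = {n:>6}: |r| above about {min_significant_r(n):.3f} is 'significant'")
```

Under these assumptions, a study of 43 participants would need a correlation of around 0.30 to reach significance, while a sample of about 23,000 would flag correlations as small as roughly 0.013.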

Let us look first at their broad findings:

Average correlation between SET and instructor gender

Course                    rho bar    p
Overall                   0.09       0.00
History                   0.11       0.08
Political institutions    0.11       0.10
Macroeconomics            0.10       0.16
Microeconomics            0.09       0.16
Political science         0.04       0.63
Sociology                 0.08       0.34

p-values are two-sided.

The left-hand value is the correlation coefficient, which serves as the effect size. The right-hand value is the p-value, or significance value (p-values smaller than 0.05 are conventionally deemed ‘significant’, meaning that an effect has been detected, but this says nothing about the size of the effect). Correlation coefficients range from -1 to 1, with 0 indicating no effect. Generally, correlations under 0.30 are considered “small” effects. At 0.04-0.11, these are very small effects. Only the aggregate “overall” result reached statistical significance, and while that effect was detectable statistically, it was very small (see below for how small this is).

The specific finding cited in the Inside Higher Ed article comes from Table 5 of the paper, which suggests that it is male students who are biased against female instructors.

Average correlation between SET and gender concordance


Course                    Male student           Female student
                          rho bar    p           rho bar    p
Overall                   0.15       0.00        0.05       0.09
History                   0.17       0.01        -0.03
Political institutions    0.12                   -0.11      0.12
Macroeconomics            0.14       0.04        -0.05      0.49
Microeconomics            0.18       0.01        -0.00      0.97
Political science         0.17       0.06        0.04       0.64
Sociology                 0.12       0.16        -0.03      0.76

p-values are two-sided.

Again, while the p-values cited sound quite significant (and the differences are thus detectable), the correlations for male students are very low, ranging from 0.12 to 0.18. To put in perspective how small this effect is, the proportion of variation explained by a Pearson correlation is given by the square of the correlation coefficient. The strongest correlation in this study, 0.18, would therefore mean that even for the male students in microeconomics, who appear the most biased in this study, gender concordance accounts for only about 3% of the variation in their ratings (0.18² = 0.0324).
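
The same arithmetic can be applied to every row of the male-student column. The sketch below squares each reported correlation; it treats each average correlation (rho bar) as if it were a single Pearson correlation, which is a simplifying assumption, and the variable and function names are hypothetical.

```python
# Male-student correlations (rho bar) as reported in the table above.
male_student_rho = {
    "Overall": 0.15,
    "History": 0.17,
    "Political institutions": 0.12,
    "Macroeconomics": 0.14,
    "Microeconomics": 0.18,
    "Political science": 0.17,
    "Sociology": 0.12,
}


def variance_explained(r):
    """Proportion of variation explained by a Pearson correlation r."""
    return r ** 2


for course, r in male_student_rho.items():
    share = variance_explained(r)
    print(f"{course:<24} r = {r:.2f}   r^2 = {share:.4f}   ({share:.1%})")
```

On this reading, none of the correlations accounts for more than about 3% of the variation in ratings.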

What we are doing at the University of Toronto

Ultimately, however, what matters most is our own context: is there gender bias in course evaluations at the University of Toronto? This possibility is something that we are actively engaged in studying, and if necessary, confronting. As we engage in an examination of this and other equally important questions, we will share the results of our analyses with the broader University of Toronto community.

At present, our analyses are in their early stages, but thus far no clear difference exists in course evaluation scores between genders. Based on our current analyses, the average ratings given to male and female instructors across the University of Toronto as a whole are identical. This is not to say, however, that there is no gender bias within particular departments, or that one could not emerge in the future. Thus, we will continue looking closely to ensure the integrity and fairness of the evaluations instructors receive.

As well, we continue to emphasize that education is a key component. It is important to note that student evaluations of teaching, by design, form only one part of the overall assessment of teaching at the University of Toronto. Administrators and instructors are advised not to use course evaluations as the sole source of information regarding an instructor’s teaching and/or effectiveness. Course evaluations are to be compared and contextualised against all available sources of data. For instance, in the assessment of Teaching Dossiers, evaluators are instructed to make reasoned judgements across all sources of data, whereby course evaluations are but one part of the holistic evaluation. This assessment also includes accounting for potential biasing factors. Thus, should we detect an important systematic bias in terms of gender, or any other variable, we will be sure to flag it for consideration.


In brief, thus far there is no strong evidence in the broader literature and research for gender bias. That said, at the University of Toronto we will actively engage in conducting our own internal research to look for and confront any potential sources of bias.


Boring, A., Ottoboni, K. & Stark, P.B. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research.

Gravestock, P. & Gregor-Greenleaf, E. (2008). Student course evaluations: research models and trends. Higher Education Quality Council of Ontario: Toronto.

Hativa, N. (2014). Student ratings of instruction: recognizing effective teaching. Oron Publications: USA.

MacNell, L., Driscoll, A. & Hunt, A.N. (2015). What’s in a name: Exposing gender bias in student ratings of teaching. Innovative Higher Education. 40(4). 291-303.