Equity. Diversity. Inclusion. I have held these values – at least as I define them – my whole life, long before they became a mantra in academia and elsewhere. So why, then, do I often find myself at odds with their modern-day proponents? Part of the reason is that rigorous statistical thinking – the foundation of my world view – rarely governs discussions of their application in practice. I frequently encounter misunderstandings of three statistical concepts in this context: causality, direct versus total effects, and population-level effects. In this article, I describe each of these concepts and provide examples that have arisen in discussions with faculty members at Simon Fraser University (SFU) where they have been wrongly ignored or misapplied. I will conclude with recommendations for ways in which statistical methods could be used to advance equity, diversity, and inclusion (EDI).
The controversial practice of street checks by police has been one such topic of discussion. The Vancouver Police Department (VPD) has been under fire because its officers disproportionately stop Black and Indigenous people. For example, in 2019, Indigenous people represented 2% of Vancouver’s population but 16% of those stopped by police. Likewise, Black people represented less than 1% of Vancouver’s population but 5.2% of those stopped. Activists have accused the VPD of racism, i.e., they have assumed that racism is the cause of these discrepancies.
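The size of these discrepancies can be summarized as a representation ratio: a group's share of street checks divided by its share of the population. The short sketch below uses only the percentages quoted above; the function name is an illustrative choice, and because the Black population share is given only as "less than 1%", the second figure is a lower bound. The ratio describes an association and says nothing about its cause.

```python
def representation_ratio(share_of_stops: float, share_of_population: float) -> float:
    """How many times more often a group appears among stops than in the population."""
    return share_of_stops / share_of_population

# 2019 Vancouver percentages quoted in the text.
indigenous = representation_ratio(16.0, 2.0)

# The population share is stated only as "less than 1%", so dividing by 1.0
# gives a lower bound on the true ratio (an assumption made explicit here).
black_lower_bound = representation_ratio(5.2, 1.0)

print(f"Indigenous people: {indigenous:.1f} times their population share")
print(f"Black people: at least {black_lower_bound:.1f} times their population share")
```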
To assess the truth of this accusation, we first need a clear definition of racism – a tall order given the many varied and vague definitions currently in use. Ibram X. Kendi, in his best-selling book, How to Be an Antiracist, provides the clearest modern definition I have encountered. Kendi defines actions and policies that lead to inequalities among racial groups as racist. Under this definition, the VPD data do, in fact, suggest that street checks are racist; race is statistically associated with the probability of being stopped. But I disagree with Kendi’s definition. To me, the critical question in this case is, “If we had two individuals, identical except for race, with identical behaviour in identical circumstances, would one of the individuals be more likely than the other to be stopped by police?” Under this definition, claiming that the VPD data provide evidence of racism requires more than just evidence of inequalities. To determine whether race causes the observed inequalities, we would need key contextual information about individual stops. For example, if Black and Indigenous individuals more often appeared unwell or unsafe (leading officers to stop them to check on their condition, per VPD policy), then their condition, not their race, could explain their increased probability of being stopped. Without access to such variables, we cannot say anything about the causal effect of race.
To determine causal relationships, the gold standard in statistics is the randomized controlled trial (RCT). The key idea is that the experimenter randomly assigns the level of the factor of interest (“treatment”) to the experimental units. Studies that do not use such randomization (e.g., the VPD case) are called observational. The RCT framework is ideal for assessing the causal effects of factors that can be manipulated. For example, in COVID-19 vaccine trials, researchers are able to assign a treatment (vaccine or placebo) at random to volunteers. In general, we are not able to measure all variables that affect a response, regardless of whether the study is an observational study or an RCT (e.g., in a vaccine trial, we cannot measure individuals’ underlying susceptibility to COVID-19). However, in RCTs, unlike in observational studies, the distributions of these variables tend to be similar across treatment groups. Consequently, differences in outcomes across groups are likely caused by differences in treatment, not by differences in other variables – a likelihood that increases with sample size. In the social sciences, some RCTs (the famous résumé experiment and its offshoots, online class experiments, etc.) suggest causal effects of race and gender on evaluations of individuals’ skills and performance. (The cited online class experiment is a correctly designed RCT, but the analysis of the teaching evaluation outcomes is flawed. Boring et al. (2016) provide a correct analysis.) However, the vast majority of studies rely on observational data, where the factor of interest was not controlled by the researcher. In those cases, we generally cannot make causal statements.
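The balancing property of randomization can be illustrated with a small simulation. The sketch below is not drawn from any of the studies mentioned; it invents an unmeasured trait and compares its average across groups when treatment is assigned by coin flip versus when people self-select. The selection probabilities (0.8 and 0.2) and the function name are hypothetical choices made only for illustration.

```python
import random
import statistics

def covariate_gap(n: int, randomized: bool, seed: int) -> float:
    """Absolute between-group difference in the mean of an unmeasured trait."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # unmeasured trait, e.g. underlying susceptibility
        if randomized:
            gets_treatment = rng.random() < 0.5  # coin flip, unrelated to u
        else:
            # Hypothetical self-selection: a high trait value makes treatment more likely.
            gets_treatment = rng.random() < (0.8 if u > 0 else 0.2)
        (treated if gets_treatment else control).append(u)
    return abs(statistics.fmean(treated) - statistics.fmean(control))

for n in (100, 10_000):
    rct = covariate_gap(n, randomized=True, seed=1)
    obs = covariate_gap(n, randomized=False, seed=1)
    print(f"n={n}: randomized gap={rct:.3f}, self-selected gap={obs:.3f}")
```

Under random assignment the gap shrinks toward zero as the sample grows, whereas under self-selection it stays large no matter how many units are observed.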
How, then, can we assess the causal effects of factors that are immutable (e.g., race) or that we cannot ethically manipulate (e.g., cigarette consumption)? One important step is identification of the so-called direct, indirect, and total effects of the factor of interest. The directed acyclic graph (DAG) allows the visualization of these effects by making explicit the relationships among variables and potential causal pathways. (An excellent resource for understanding and creating DAGs is dagitty.net.) Under this framework, racism can be clearly specified as a non-zero direct effect of race. To illustrate, I use the work of Fuji Johnson and Howsam (2020) – henceforth called FJH – on racial diversity among Canadian university administrators, which has been receiving attention at SFU and elsewhere. Based on a survey of five Canadian universities, these authors conclude that “White men have easy access to all administrative ranks, and while White women appear to be making it through to senior administrative ranks, racialized women and men are getting stuck in the middle ranks.” To translate their work into statistical terms, for each individual, the authors are interested in the outcome variable administrative rank and the exposure variables race and gender. Figure 1 is a (highly oversimplified) DAG representing possible causal pathways relating race to administrative rank. The direct effect of race on the probability of becoming a senior administrator is represented by the arrow connecting race and administrative rank. To estimate this effect, the analyst must control for any confounding variables (variables that affect both race and administrative rank). In this case, country of origin (via English fluency) is a confounding variable; at least one of country of origin and English fluency must be accounted for in a careful analysis of possible racism. However, the race effect reported by FJH ignores confounding variables. The discrepancies observed could be due to the fact that non-White faculty were, in general, less fluent in English. We cannot know.
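To see how an unadjusted comparison can mislead, consider a simulated world in which the exposure has no direct effect on the outcome at all. The sketch below uses entirely synthetic data; the variable names and every probability in it are hypothetical and are not estimates from FJH or any other study. A binary confounder influences both group membership and the outcome, and the code compares the raw gap between groups with the gap inside each level of the confounder.

```python
import random

def simulate(n: int = 200_000, seed: int = 0):
    """Synthetic data in which the exposure has NO direct effect on the outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.random() < 0.5                    # confounder (hypothetical binary trait)
        g = rng.random() < (0.7 if c else 0.3)    # group membership depends on the confounder
        y = rng.random() < (0.4 if c else 0.1)    # outcome depends on the confounder only
        rows.append((c, g, y))
    return rows

def rate(rows, g, c=None):
    """Share with the outcome in group g, optionally within confounder level c."""
    selected = [y for (ci, gi, y) in rows if gi == g and (c is None or ci == c)]
    return sum(selected) / len(selected)

rows = simulate()
naive_gap = rate(rows, True) - rate(rows, False)
adjusted_gaps = [rate(rows, True, c) - rate(rows, False, c) for c in (False, True)]

print(f"Unadjusted gap between groups: {naive_gap:.3f}")
print("Gap within each confounder level:", [round(gap, 3) for gap in adjusted_gaps])
```

The unadjusted gap is clearly non-zero even though group membership plays no role in generating the outcome; within each level of the confounder the gap is close to zero. Real data offer no such certainty about which variables to adjust for, which is why the DAG must be specified first.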
The DAG in Figure 2 focuses on the effect of gender on administrative rank. This DAG shows a different type of causal pathway, reflecting Charlotte Whitton’s oft-quoted quip, “Whatever women do they must do twice as well as men to be thought half as good.” If Whitton is correct, quality of work (via effort) is a mediating variable in the relationship between gender and administrative rank. Specifically, the effect of gender on the probability of becoming a senior administrator could be direct (as represented by the arrow connecting gender and administrative rank) or indirect via effort. The direct effect represents sexism. FJH’s analysis suggests that the total effect of gender (in the simplest case, the sum of the direct and indirect effects) is small. But this finding does not imply that the direct and indirect effects are also small, i.e., that sexism is not at play in the hiring of administrators. The small total effect could result from a positive indirect effect (women put in more effort, making their work of higher quality, making them more likely to be chosen for an administrative role) and a negative direct effect (all other factors being equal, women are less likely to be chosen). In other words, to assess whether sexism plays a role, we would need to control for effort.
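The cancellation of direct and indirect effects can be made concrete with a linear simulation. This is a minimal sketch under strong assumptions: the relationships are linear, there is no confounding, and the coefficients (a direct effect of -1, an effect of the exposure on effort of +1, and an effect of effort on the outcome of +1) are hypothetical values chosen so that the two pathways offset each other. In this simplest case the total effect is the direct effect plus the product of the two coefficients along the indirect path.

```python
import random
import statistics

def cov(x, y):
    """Population covariance of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

rng = random.Random(0)
n = 50_000

# Hypothetical coefficients: direct = -1, exposure -> effort = +1, effort -> outcome = +1.
g = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]   # exposure indicator
m = [1.0 * gi + rng.gauss(0.0, 1.0) for gi in g]             # mediator (effort)
y = [-1.0 * gi + 1.0 * mi + rng.gauss(0.0, 1.0) for gi, mi in zip(g, m)]  # outcome score

# Total effect: difference in mean outcome between the two exposure groups.
y_exposed = [yi for gi, yi in zip(g, y) if gi == 1.0]
y_unexposed = [yi for gi, yi in zip(g, y) if gi == 0.0]
total = statistics.fmean(y_exposed) - statistics.fmean(y_unexposed)

# Direct effect: coefficient on g in a least-squares fit of y on g and m,
# obtained by solving the 2x2 normal equations on centred data.
det = cov(g, g) * cov(m, m) - cov(g, m) ** 2
direct = (cov(g, y) * cov(m, m) - cov(m, y) * cov(g, m)) / det
indirect = total - direct

print(f"total={total:.2f}, direct={direct:.2f}, indirect={indirect:.2f}")
```

A total effect near zero coexists here with a direct effect near -1, which is exactly the situation in which an analysis of outcomes alone would miss the direct effect.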
FJH’s analysis of the total effects of race and gender is equivalent to assessing equality of outcomes (“equity”, as defined by Kendi). If the goal of such studies is, in fact, to assess equality of opportunity (a more sensible definition of “equity”, in my opinion), analysis of the direct effects of race and gender is required. For example, Roland Fryer’s work in understanding the direct effect of race on police violence (by adjusting for confounding or mediating variables such as suspect and officer demographics, encounter characteristics, and whether the suspect had a weapon) is far more useful than studies that simply show an association between police violence and race; if our goal is to reduce such violence, we must identify its causes and the relative magnitudes of the effects of these causes. Similar arguments apply if we wish to understand the gender gap in salaries, STEM fields, and executive-level managerial positions.
The last statistical concept I will discuss is population-level effects. As an example, consider the claim made by some SFU faculty members of a direct effect of gender on teaching evaluations, i.e., of sexism as the cause of female instructors’ lower overall teaching scores. If true, all other factors being equal, the average scores of the population of female instructors are lower than those of their male counterparts. However, the statement says nothing about sexism in the case of individual female instructors. In other words, an individual female instructor’s scores may reflect a negative, non-existent, or even positive effect of her gender! We cannot know. For this reason, the suggestion I have heard to add some (arbitrary) positive number to female instructors’ overall scores is misguided. Likewise, across-the-board raises or salary bonuses for female faculty members (as have occurred at SFU and elsewhere) seem unjustifiable.
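The distinction between a population-level effect and individual effects can be sketched numerically. In the simulation below, each instructor is given her own effect of gender on her score, drawn from a distribution that is negative on average. The mean (-0.3) and spread (0.5) are hypothetical, and in real data these individual effects are never observed; only the simulation lets us see them.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical individual-level effects of gender on a teaching score:
# negative on average, but varying from person to person.
effects = [rng.gauss(-0.3, 0.5) for _ in range(10_000)]

average_effect = statistics.fmean(effects)
share_not_harmed = sum(e >= 0 for e in effects) / len(effects)

# A uniform correction equal to the average shortfall fixes the population mean
# but leaves every individual's deviation from that mean untouched.
corrected = [e - average_effect for e in effects]
share_overcorrected = sum(c > 0 for c in corrected) / len(corrected)

print(f"Average effect: {average_effect:.2f}")
print(f"Share with zero or positive individual effect: {share_not_harmed:.2f}")
print(f"Share pushed above zero by a uniform correction: {share_overcorrected:.2f}")
```

Even with a clearly negative average, a substantial share of simulated individuals experience no penalty, and a uniform adjustment overcompensates some while undercompensating others.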
The misunderstanding of the notion of population-level effects was on clear display in December 2020 at SFU. A Black SFU alumnus violated COVID-19 protocols by coming to campus and purchasing food at the Dining Hall. The alumnus, who was known to Campus Security, refused to leave when asked. The situation escalated, and the RCMP were called. The alumnus continued to refuse to leave and ultimately put a police officer in a chokehold. As a result, the officer discharged his taser. The alumnus was arrested and charged. Much of the incident was captured on video, which was then shared on social media (after selective editing). In the days following, the SFU community erupted in outrage. The past and present executives of the Simon Fraser Student Society (SFSS) led the charge, making public statements such as “these instances of racism on campus are nothing new” and “SFU is a sick institution”. Subsequently, a group of SFU community members penned an open letter to the SFU President, stating that “being Black at SFU is not safe, accepted, or appreciated” and that they “are relentlessly violated by persistent anti-Black racism and state-sanctioned violence where [they] live and work”. The letter garnered over 700 signatures, including over 100 from SFU faculty members. The SFSS went so far as to demand that police never again be called on a Black or Indigenous person. But, importantly, none of these accusations was accompanied by evidence of anti-Black racism as motivation for the actions of the Dining Hall staff, Campus Security, or the RCMP. Likewise, a later, independent review of the incident found no evidence of racial profiling or deviations from security protocols.
What, then, is the explanation for the widely held belief that racism was at play in this incident? Research (e.g., by Roland Fryer) has shown that Black Americans, as a population, are more likely than White Americans to experience non-lethal violence at the hands of police, even after adjusting for important confounding and mediating variables. I will assume that these findings apply to Canadians as well. However, we have no way of knowing whether the police discriminated against this particular Black individual on the basis of race. Some faculty members cited the “balance of probabilities” when claiming racism in this case. But this argument requires knowledge of the probabilities that two individuals, alike except for race, would experience police violence in the circumstances surrounding the Dining Hall incident. To my knowledge, these probabilities are unknown. Moreover, the claim of racism would be credible only if the probability were very high (say, over 90%) for the Black individual and very low (say, less than 10%) for the White individual. I can conclude only that misunderstanding of the concept of population-level effects is widespread.
In summary, to address real EDI problems and avoid wasting resources on non-problems, broader understanding of statistical methods is advantageous. Statistical methods should play a greater role in informing the design of studies, interpreting their outcomes, and making decisions about which EDI initiatives to pursue. Claims about causes of inequalities without application of proper causal inference methods and claims of discrimination against individuals in the absence of individual-level evidence should be discouraged. We should learn to accept the frustration of not being able to know the causes of many inequalities and events, however heartbreaking, and acknowledge the harm caused by wrongly assuming these causes. Our focus should be on applying scientific principles to identify both the causes of inequalities and the magnitudes of the effects of these causes. With that information in hand, we are poised to address systematic injustice in our society most effectively.
Figure 1: The relationship between race and administrative rank: possible causal pathways.
Figure 2: The relationship between gender and administrative rank: possible causal pathways.