Simpson’s Paradox

Case Study: Berkeley Admissions

In fall 1973, the University of California, Berkeley’s graduate division admitted 44% of male applicants and 35% of female applicants. School administrators were concerned about the potential for bias (and lawsuits!) and asked statistics professor Peter Bickel to examine the data more carefully.


We have a subset of the admissions data for 6 departments.

load("~/Documents/GitHub/STA440/2020/decks/data/UCBadmit.RData")
d <- UCBadmit
library(tidyverse)
d <-
  d%>%
  mutate(male=ifelse(applicant.gender=="male",1,0),
         dept_id = rep(1:6, each = 2))
d$successrate=d$admit/d$applications
sum(d$admit[d$male==1])/sum(d$applications[d$male==1])
## [1] 0.4451877
sum(d$admit[d$male==0])/sum(d$applications[d$male==0])
## [1] 0.3035422

We see in this subset of departments that roughly 45% of male applicants were admitted, while only 30% of female applicants were admitted.
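The same aggregate rates can be computed in one pass with dplyr (a sketch, assuming `d` as constructed above):

```r
# Aggregate admission rate by gender in one summarize() call
d %>%
  group_by(applicant.gender) %>%
  summarize(rate = sum(admit) / sum(applications))
# female ~0.304, male ~0.445
```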

Because admissions decisions for graduate school are made on a departmental level (not at the school level), it makes sense to examine results of applications by department.

d[,c(1,2,3,4,7)]
##    dept applicant.gender admit reject dept_id
## 1     A             male   512    313       1
## 2     A           female    89     19       1
## 3     B             male   353    207       2
## 4     B           female    17      8       2
## 5     C             male   120    205       3
## 6     C           female   202    391       3
## 7     D             male   138    279       4
## 8     D           female   131    244       4
## 9     E             male    53    138       5
## 10    E           female    94    299       5
## 11    F             male    22    351       6
## 12    F           female    24    317       6

Hmm, what’s going on here?

Following McElreath’s analysis in Statistical Rethinking, we start by fitting a simple logistic regression model and examining diagnostic measures.

The model for department \(i\) and gender \(j\) with \(n_{admit,ij}\) of \(n_{ij}\) applicants admitted is given as:

\(n_{admit,ij} \sim \text{Binomial}(n_{ij},p_{ij})~~~\) \(\text{logit}(p_{ij})=\alpha+\beta\text{male}_{j}\)

adm1 <-
  glm(data = d, family = binomial,
      cbind(admit,reject) ~ 1 + male )
summary(adm1)
## 
## Call:
## glm(formula = cbind(admit, reject) ~ 1 + male, family = binomial, 
##     data = d)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -16.7915   -4.7613   -0.4365    5.1025   11.2022  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.83049    0.05077 -16.357   <2e-16 ***
## male         0.61035    0.06389   9.553   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 877.06  on 11  degrees of freedom
## Residual deviance: 783.61  on 10  degrees of freedom
## AIC: 856.55
## 
## Number of Fisher Scoring iterations: 4

Here it appears male applicants have \(e^{0.61}=1.8\) (95% CI (1.6, 2.1)) times the odds of admission as female applicants.
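These numbers can be recovered directly from the fitted object (a sketch using Wald intervals via `confint.default`; plain `confint` would give profile-likelihood intervals instead):

```r
# Odds ratio for male vs. female applicants, with a 95% Wald interval
exp(coef(adm1)["male"])               # about 1.84
exp(confint.default(adm1)["male", ])  # about (1.62, 2.09)
```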

Model Checking

How do our model’s predictions align with observed probabilities?

male=c(1,0)
predadm1=c(exp(adm1$coefficients[1]+adm1$coefficients[2])/
             (1+exp(adm1$coefficients[1]+adm1$coefficients[2])),
           exp(adm1$coefficients[1])/(1+exp(adm1$coefficients[1])))
#this is so you see how to get the prediction, not the most efficient code
# more efficient code comes later :)
predprob=cbind(male,predadm1)
d1=merge(d,predprob)
d1[,c(1,2,8,9)]
##    male dept successrate  predadm1
## 1     0    A  0.82407407 0.3035422
## 2     0    B  0.68000000 0.3035422
## 3     0    C  0.34064081 0.3035422
## 4     0    D  0.34933333 0.3035422
## 5     0    E  0.23918575 0.3035422
## 6     0    F  0.07038123 0.3035422
## 7     1    A  0.62060606 0.4451877
## 8     1    B  0.63035714 0.4451877
## 9     1    C  0.36923077 0.4451877
## 10    1    D  0.33093525 0.4451877
## 11    1    E  0.27748691 0.4451877
## 12    1    F  0.05898123 0.4451877

Eew, this model is way off the mark! Clearly there are some large departmental effects, so let’s fit a more flexible model.

Multiple Logistic Regression

\(n_{admit,ij} \sim \text{Binomial}(n_{ij},p_{ij})~~~\) \(\text{logit}(p_{ij})=\alpha+\beta\text{male}_{j}+\gamma_1I(\text{dept}_i=B)+\gamma_2I(\text{dept}_i=C)\) \(+\gamma_3I(\text{dept}_i=D)+\gamma_4I(\text{dept}_i=E)+\gamma_5I(\text{dept}_i=F)\)

adm2 <-
  glm(data = d, family = binomial,
      cbind(admit,reject) ~ 1 + male + dept)
summary(adm2)
## 
## Call:
## glm(formula = cbind(admit, reject) ~ 1 + male + dept, family = binomial, 
##     data = d)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
## -1.2487   3.7189  -0.0560   0.2706   1.2533  -0.9243   0.0826  -0.0858  
##       9       10       11       12  
##  1.2205  -0.8509  -0.2076   0.2052  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.68192    0.09911   6.880 5.97e-12 ***
## male        -0.09987    0.08085  -1.235    0.217    
## deptB       -0.04340    0.10984  -0.395    0.693    
## deptC       -1.26260    0.10663 -11.841  < 2e-16 ***
## deptD       -1.29461    0.10582 -12.234  < 2e-16 ***
## deptE       -1.73931    0.12611 -13.792  < 2e-16 ***
## deptF       -3.30648    0.16998 -19.452  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 877.056  on 11  degrees of freedom
## Residual deviance:  20.204  on  5  degrees of freedom
## AIC: 103.14
## 
## Number of Fisher Scoring iterations: 4

Note that the gender term no longer reaches statistical significance once department is in the model; the point estimate is now slightly negative.
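On the odds scale (a sketch, again using Wald intervals from `confint.default`):

```r
# Adjusted odds ratio for gender, controlling for department
exp(coef(adm2)["male"])               # about 0.90, a slight edge for women
exp(confint.default(adm2)["male", ])  # about (0.77, 1.06): the interval covers 1
```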

Model Checking

predlogit=predict(adm2) #predict() uses the fitted data by default; for other data, pass newdata=
predprob=exp(predlogit)/(1+exp(predlogit)) #inverse logit; predict(adm2,type="response") gives this directly
d2=cbind(d,predprob)
d2[,c(1,2,8,9)]
##    dept applicant.gender successrate   predprob
## 1     A             male  0.62060606 0.64153930
## 2     A           female  0.82407407 0.66416742
## 3     B             male  0.63035714 0.63149912
## 4     B           female  0.68000000 0.65441963
## 5     C             male  0.36923077 0.33613931
## 6     C           female  0.34064081 0.35877694
## 7     D             male  0.33093525 0.32903451
## 8     D           female  0.34933333 0.35144696
## 9     E             male  0.27748691 0.23916654
## 10    E           female  0.23918575 0.25780964
## 11    F             male  0.05898123 0.06154717
## 12    F           female  0.07038123 0.06757450

This model provides a much closer fit to the data, though the deviance residuals above show it is still imperfect for department A, where women’s observed admission rate (0.82) far exceeds the prediction (0.66).

What Happened?

d[,c(1,2,8)]
##    dept applicant.gender successrate
## 1     A             male  0.62060606
## 2     A           female  0.82407407
## 3     B             male  0.63035714
## 4     B           female  0.68000000
## 5     C             male  0.36923077
## 6     C           female  0.34064081
## 7     D             male  0.33093525
## 8     D           female  0.34933333
## 9     E             male  0.27748691
## 10    E           female  0.23918575
## 11    F             male  0.05898123
## 12    F           female  0.07038123

In the raw data, women had higher acceptance probabilities than men in 4 of the 6 departments. However, the departments to which they applied in higher numbers were the departments that had lower overall acceptance rates.
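We can see this directly by tabulating where each gender’s applications went (a self-contained sketch that rebuilds the counts printed above):

```r
# Rebuild the admit/reject counts from the table above
apps <- data.frame(
  dept   = rep(LETTERS[1:6], each = 2),
  gender = rep(c("male", "female"), times = 6),
  admit  = c(512, 89, 353, 17, 120, 202, 138, 131, 53, 94, 22, 24),
  reject = c(313, 19, 207, 8, 205, 391, 279, 244, 138, 299, 351, 317)
)
apps$applications <- apps$admit + apps$reject
# Column proportions: share of each gender's applications going to each department
round(prop.table(xtabs(applications ~ dept + gender, data = apps), margin = 2), 2)
```

Roughly half of men’s applications went to the high-acceptance departments A and B, versus about 7% of women’s.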

What Happened?

What happened is that women were more likely to apply to departments like English, which have financial trouble supporting grad students, and they were less likely to apply to STEM departments, which had more plentiful funding for graduate students. The men, on the other hand, were much more likely to apply to the STEM departments that had higher acceptance rates.