Generalized Linear Mixed Effects Models

Generalized Linear Mixed Effects Model (GLMM)

We now discuss the incorporation of random effects into generalized linear models. The basic idea is that we assume there is natural heterogeneity across groups in a subset of the regression coefficients. These coefficients are assumed to vary across groups according to some distribution.


Conditional on the random effects, we assume the responses for a single subject are independent observations from a distribution in the exponential family.

In the GLMM, we assume the conditional distribution of each \(Y_{ij}\), conditional on \({\bf b}_i\), belongs to the exponential family with conditional mean \[g(E[Y_{ij} \mid {\bf b}_i])={\bf X}_{ij}'\boldsymbol{\beta}+{\bf Z}_{ij}'{\bf b}_i,\] where \(g(\cdot)\) is a known link function.

Assume the \({\bf b}_i\) are independent across subjects with \({\bf b}_i \sim N({\bf 0}, {\bf D})\). We also assume that given \({\bf b}_i\), the responses \(Y_{i1}, \ldots, Y_{in}\) are mutually independent.

Example: Multilevel Linear Regression

\[Y_{ij}={\bf X}_{ij}'\boldsymbol{\beta}+b_i + \varepsilon_{ij},\] where \(b_i\overset{iid}\sim N(0,\sigma_b^2) \perp \varepsilon_{ij} \overset{iid}\sim N(0,\sigma_e^2)\)

Here \(E(Y_{ij} \mid b_i)={\bf X}_{ij}'\boldsymbol{\beta}+b_i\), and averaging over the random effects gives \(E(Y_{ij}) = {\bf X}_{ij}'\boldsymbol{\beta}\), since \(E(b_i)=0\) and the link is the identity.

An example is the radon study we previously discussed: the response was the (log) radon level, the fixed effects in \({\bf X}\) included the floor of the home (in some models) and the uranium content of the soil, and we had a random effect \(b_i\) for county (with a random slope for floor added in an extended model).

Example: Multilevel logistic model with random intercepts

\[\text{logit}(E(Y_{ij} \mid b_i))={\bf X}_{ij}'\boldsymbol{\beta}+b_i,\] where \(b_i\sim N(0,\sigma^2)\)


Question: what happened to \(\varepsilon_{ij}\)?

Example: Random coefficients Poisson regression

\(\log(E(Y_{ij} \mid {\bf b}_i))={\bf X}_{ij}'\boldsymbol{\beta}+{\bf Z}_{ij}'{\bf b}_i,\) with \({\bf X}_{ij}={\bf Z}_{ij}=[1,t_{ij}]\) (random slopes and intercepts) and \({\bf b}_i \sim N({\bf 0}, {\bf D})\).
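As a hedged sketch (not from these notes), a model like this could be fit with lme4's glmer; the data frame dat and the variables y, time, and id below are hypothetical placeholders.

library(lme4)
# random intercepts and slopes for time; Poisson outcome with log link
fit <- glmer(y ~ time + (1 + time | id), family = poisson, data = dat)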

Interpretation of GLMM Estimates

In the model \[\text{logit}(E[Y_{ij} \mid b_i])={\bf X}_{ij}'\boldsymbol{\beta}+b_i,\] with \(b_i \sim N(0,\sigma^2)\), each element of \(\boldsymbol{\beta}\) measures the change in the log odds of a ‘positive’ response per unit change in the respective covariate, in a specific group that has an underlying propensity to respond positively given by \(b_i\).


That is, we need to hold group membership constant when interpreting \(\beta_k\), just as we would hold the values of \({\bf x}_{i,(-k)}\) constant when interpreting \(\beta_k\).

Note also that with a non-linear link function, a non-linear function of an average is not the average of the non-linear function, so the parameters do not in general have population-average interpretations (as they would in a linear mixed effects model, which uses the identity link). In the LMM, \[g(E(Y_{ij} \mid {\bf X}_{ij}, {\bf b}_i))={\bf X}_{ij}'\boldsymbol{\beta}+{\bf Z}_{ij}'{\bf b}_i\] implies \(E(Y_{ij} \mid {\bf X}_{ij})={\bf X}_{ij}'\boldsymbol{\beta}\), but when \(g(\cdot)\) is non-linear (say the logit), then in general \[g(E(Y_{ij} \mid {\bf X}_{ij}))\neq {\bf X}_{ij}'\boldsymbol{\beta}\] once we average over the distribution of the random effects.
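A quick numerical sketch of this point (the values below are illustrative, not taken from any data set in these notes): average the inverse logit over simulated random intercepts and compare it to the inverse logit evaluated at the fixed-effect linear predictor alone.

set.seed(1)
b <- rnorm(1e5, 0, 2)   # simulated random intercepts with SD 2
mean(plogis(1 + b))     # marginal P(Y=1) when X'beta = 1; roughly 0.66
plogis(1)               # conditional P(Y=1 | b_i = 0); about 0.73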

Estimation using ML

The joint density of the data and the random effects is given by \[f({\bf Y}_i \mid {\bf X}_i, {\bf b}_i)f({\bf b}_i).\] However, recall that the \({\bf b}_i\) are unobserved. How then do we handle the \({\bf b}_i\) in the maximization?


Typically, we base frequentist inferences on the marginal (integrated) likelihood function, given by \[\prod_{i=1}^N \int f({\bf Y}_i \mid {\bf X}_i, {\bf b}_i)f({\bf b}_i) d{\bf b}_i.\] Estimation using maximum likelihood involves a two-step procedure.

ML Estimation Steps

  • Obtain ML estimates of \(\boldsymbol{\beta}\) and \({\bf D}\) based on the marginal likelihood of the data. While this may sound simple, numerical or Monte Carlo integration techniques are typically required, and the techniques used may have substantial impacts on resulting inferences.

  • Given estimates of \(\boldsymbol{\beta}\) and \({\bf D}\), predict the random effects as \[\widehat{{\bf b}}_i=E({\bf b}_i \mid {\bf Y}_i, \widehat{\boldsymbol{\beta}}, \widehat{{\bf D}}).\] Again, simple analytic solutions are rarely available.


Even when the burden of integration is not that great, the optimization problem may be difficult to solve.
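As a minimal sketch of the integration step (not the implementation used by any particular package), one subject's marginal likelihood contribution in a random-intercept logistic model can be approximated with Gauss-Hermite quadrature. The statmod package's gauss.quad() supplies the nodes and weights; y, eta_fixed, and sigma_b below are hypothetical inputs.

library(statmod)
gh <- gauss.quad(20, kind = "hermite")   # 20 quadrature nodes and weights for weight function exp(-x^2)
marg_lik_i <- function(y, eta_fixed, sigma_b) {
  b <- sqrt(2) * sigma_b * gh$nodes      # change of variables absorbs the N(0, sigma_b^2) density
  contrib <- sapply(b, function(bi) {
    p <- plogis(eta_fixed + bi)          # conditional success probabilities given b_i
    prod(p^y * (1 - p)^(1 - y))          # conditional likelihood of this subject's responses
  })
  sum(gh$weights * contrib) / sqrt(pi)
}
marg_lik_i(y = c(1, 1, 0, 0), eta_fixed = c(0.5, 0.2, -0.1, -0.4), sigma_b = 2)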

Random Effects Logistic Regression

The National Institute of Mental Health (NIMH) conducted a study of three active treatments versus placebo among individuals with schizophrenia. Schizophrenia is a complex mental condition that makes it difficult to tell the difference between real and unreal experiences, to think logically, to have normal emotional responses, and to behave normally in social situations.

This Institute is the evil NIMH from the kids’ book and movie!

Our data combine all three active treatment groups. The outcome is coded 1-4, with 1 indicating normal mental health and increasing numbers indicating more severe illness.

Measures were taken at baseline and at three planned follow-up visits, though there was some variation in the timing of follow-up, as shown below.

Group       Week 0   Week 1   Week 2   Week 3   Week 4   Week 5   Week 6
Placebo        107      105        5       87        2        2       70
Active Tx      327      321        9      287        9        7      265

Note there is substantial loss to follow-up, with only 65% of those on placebo and 81% of those on treatment completing the study. This missing data could be problematic (perhaps those on placebo are more likely to drop out because they feel they are not improving).

The question of interest in the study is whether the response trajectory on active treatment differs from that on placebo.

  • Let \(t_j\) denote the square-root transformed week
  • Let \(x_i\) take value 1 if the participant is on active treatment and 0 otherwise (treatment is fixed for the duration of the study)
  • Let \(Y_{ij}\) take value 1 if the symptom severity is moderately ill or worse (levels 3 and 4 of the outcome) and take value 0 otherwise

We consider the model

\[\text{logit}(Pr(Y_{ij}=1))=\beta_0+\beta_1x_i+\beta_2t_j+\beta_3x_it_j+b_i\]


  • \(\beta_0\) is the log odds of response for a ‘typical’ subject (\(b_i=0\)) at baseline (week 0) on placebo
  • \(\beta_1\) is the log odds ratio (OR) associated with a unit change in \(x_i\) for a given subject at baseline (given subject=same \(b_i\) value); it tells us about how a subject’s response probability changes if the value of \(x_i\) were to change (this is a randomized study, so \(\beta_1\) should be close to zero because before treatment is administered, the two groups should be similar)
  • \(\beta_2\) is the log OR describing the time trend of symptoms for those on placebo
  • \(\beta_3\) gets at the real treatment effect here; it tells us how the log OR over time differs for those on treatment versus those on placebo (\(\beta_2+\beta_3\) describes the time trend on treatment)
  • \(\text{Var}(b_i)\) tells us about the degree of heterogeneity across subjects that is not attributable to covariates
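To make the treatment effect concrete, note that the model implies that for a given subject (same \(b_i\)), the difference in log odds between active treatment and placebo at time \(t_j\) is \[\text{logit}(Pr(Y_{ij}=1 \mid x_i=1, b_i)) - \text{logit}(Pr(Y_{ij}=1 \mid x_i=0, b_i)) = \beta_1 + \beta_3 t_j,\] so \(\beta_1\) is the treatment effect at baseline (\(t_j=0\)) and \(\beta_3\) describes how that effect changes per unit of \(t_j\).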

Estimation

ML estimation is typically carried out using adaptive Gaussian quadrature. To improve accuracy over many package defaults (the Laplace approximation), increase the number of quadrature points above one. Note that some software packages (including lme4's glmer) fall back to the Laplace approximation (a single quadrature point) when the model has more than one random effect per subject, for the sake of computational efficiency.

library(lme4)
schiz=read.table(file='schiz.txt',header=T)
schiz$threefour=as.numeric(schiz$Y>2) #worst two categories
schiz$sqrtwk=sqrt(schiz$WK) #recommended by researchers
m1=glmer(threefour~TX+sqrtwk+TX*sqrtwk+(1|ID),family=binomial,data=schiz,nAGQ=50) #default nAGQ=1 is Laplace approx
schiz$slopeplac=schiz$sqrtwk*(1-schiz$TX)
schiz$slopetx=schiz$sqrtwk*schiz$TX
m2=glmer(threefour~TX+slopeplac+slopetx+(1|ID),family=binomial,data=schiz,nAGQ=50) #get slopes by trt

summary(m1)
## Generalized linear mixed model fit by maximum likelihood (Adaptive
##   Gauss-Hermite Quadrature, nAGQ = 50) [glmerMod]
##  Family: binomial  ( logit )
## Formula: threefour ~ TX + sqrtwk + TX * sqrtwk + (1 | ID)
##    Data: schiz
## 
##      AIC      BIC   logLik deviance df.resid 
##   1578.7   1605.6   -784.4   1568.7     1598 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -9.2908 -0.3121  0.1076  0.3478  3.4756 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  ID     (Intercept) 5.608    2.368   
## Number of obs: 1603, groups:  ID, 437
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.5148     0.4444   7.909 2.59e-15 ***
## TX           -0.3570     0.4720  -0.756    0.449    
## sqrtwk       -1.2136     0.2115  -5.738 9.58e-09 ***
## TX:sqrtwk    -1.0538     0.2469  -4.268 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) TX     sqrtwk
## TX        -0.830              
## sqrtwk    -0.711  0.601       
## TX:sqrtwk  0.457 -0.670 -0.765

summary(m2)
## Generalized linear mixed model fit by maximum likelihood (Adaptive
##   Gauss-Hermite Quadrature, nAGQ = 50) [glmerMod]
##  Family: binomial  ( logit )
## Formula: threefour ~ TX + slopeplac + slopetx + (1 | ID)
##    Data: schiz
## 
##      AIC      BIC   logLik deviance df.resid 
##   1578.7   1605.6   -784.4   1568.7     1598 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -9.2909 -0.3121  0.1076  0.3478  3.4756 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  ID     (Intercept) 5.608    2.368   
## Number of obs: 1603, groups:  ID, 437
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.5149     0.4444   7.909 2.60e-15 ***
## TX           -0.3570     0.4720  -0.756    0.449    
## slopeplac    -1.2136     0.2115  -5.738 9.59e-09 ***
## slopetx      -2.2674     0.1608 -14.104  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) TX     slpplc
## TX        -0.830              
## slopeplac -0.711  0.601       
## slopetx   -0.234 -0.239  0.141

Interpreting Results

  • Because the interaction term is significantly different from zero, we can say that for any given individual, active treatment is associated with lower odds of moderate to severe symptoms over time.
  • Because the main effect term for square root of week is significantly different from zero, we can say that placebo patients also saw a decline in symptoms over time.
  • In particular, for a given participant on treatment, an increase of 1 in the square root of week (say, from week 0 to week 1) is associated with odds of having moderate to severe symptoms that are only \(\exp(-2.2674)=0.10\) times the prior odds, with 95% CI \((e^{-2.2674-1.96(0.1608)},e^{-2.2674+1.96(0.1608)})=(0.08,0.14)\). The corresponding odds ratio for placebo patients is 0.30 (0.20, 0.45).
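These numbers can be reproduced (approximately) from the m2 fit above; a quick sketch:

est <- fixef(m2)[c("slopeplac", "slopetx")]                              # time-trend log ORs by arm
se  <- summary(m2)$coefficients[c("slopeplac", "slopetx"), "Std. Error"]
round(exp(cbind(OR = est, lower = est - 1.96 * se, upper = est + 1.96 * se)), 2)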

How do we characterize the size of the random effects variance \(\sigma^2\)?

\(\text{logit}(Pr(Y_{ij}=1))={\bf X}_{ij}'\boldsymbol{\beta}+b_i,\) where \(b_i\sim N(0,\sigma^2)\)


Suppose \({\bf X}_{ij}'\boldsymbol{\beta}=0\).

  • If \(\sigma^2=1\), then a subject with \(b_i=0\) (‘typical’ subject) has \(Pr(Y=1)=\frac{\exp(0)}{1+\exp(0)}=0.5\).
  • A subject with \(b_i\) one standard deviation above the mean would have \(Pr(Y=1)=\frac{\exp(1)}{1+\exp(1)}=0.73\)
  • A subject with \(b_i\) one standard deviation below the mean would have \(Pr(Y=1)=\frac{\exp(-1)}{1+\exp(-1)}=0.27\).

So \(\sigma^2=1\) gives us large variation across subjects.
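A one-line check of these probabilities (using base R's plogis, the inverse logit):

plogis(c(-1, 0, 1))   # 0.27, 0.50, 0.73 for subjects at -1 SD, the mean, and +1 SD when sigma^2 = 1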

Intraclass Correlation

Consider an unobserved continuous variable \(W_{ij}\). \(W_{ij}\) is related to \(Y_{ij}\) in the following manner: \(Y_{ij}=1\) if \(W_{ij}<c\), and \(Y_{ij}=0\) otherwise.


The location of \(c\) and the distribution of \(W\) govern the probability that \(Y=1\).

A useful way of thinking about the model (though not an essential assumption): \[W_{ij}= {\bf X}_{ij}'\boldsymbol{\beta}+b_i+\varepsilon_{ij} \]

  • \(\varepsilon_{ij} \sim N(0,1)\): probit regression
  • \(\varepsilon_{ij} \sim\) standard logistic (mean 0, variance \(\frac{\pi^2}{3}\)): logistic regression


We can use this framework to calculate ICCs:

  • \(ICC=\frac{\sigma^2}{\sigma^2+1} ~~~ \text{for probit}\)

  • \(ICC=\frac{\sigma^2}{\sigma^2+\frac{\pi^2}{3}} ~~~ \text{for logistic}\)


What is the ICC in the schizophrenia data?
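A sketch of this computation from the m1 fit above, using the latent-variable formula for the logistic model:

sigma2 <- VarCorr(m1)$ID[1, 1]    # estimated random-intercept variance, about 5.61
sigma2 / (sigma2 + pi^2 / 3)      # ICC on the latent scale, roughly 0.63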

We next will consider time trends for a variety of subjects in the study, starting with a typical (\(b_i=0\)) subject.

timevec=seq(0,sqrt(6),by=.1); t2=timevec*timevec;
plotnull=rep(0,length(t2))
predplac0=fixef(m1)[1]+fixef(m1)[3]*timevec
predtrt0=fixef(m1)[1]+fixef(m1)[2]+(fixef(m1)[3]+fixef(m1)[4])*timevec
probplac0=exp(predplac0)/(1+exp(predplac0))
probtrt0=exp(predtrt0)/(1+exp(predtrt0))
par(mfrow=c(1,3))

plot(timevec,plotnull,type='n',ylim=c(-2.2,3.6),xlab='Sqrt(Weeks)',ylab='Logit',main='Typical subject') #ylim widened to cover the fitted logits
lines(timevec,predplac0,col=2)
lines(timevec,predtrt0,col=3)


plot(timevec,plotnull,type='n',ylim=c(0,1),xlab='Sqrt(Weeks)',
  ylab='Probability',main='Typical subject')
lines(timevec,probplac0,col=2)
lines(timevec,probtrt0,col=3)

plot(t2,plotnull,type='n',ylim=c(0,1),xlab='Weeks',
  ylab='Probability',main='Typical subject')
lines(t2,probplac0,col=2)
lines(t2,probtrt0,col=3)

Red lines: placebo; green lines: treatment.

Now let’s look at a subject with \(b_i\) 1 SD above the mean and a subject with \(b_i\) 1 SD below the mean.

predplac1up=fixef(m1)[1]+fixef(m1)[3]*timevec+sqrt(VarCorr(m1)$ID[,1])
predtrt1up=fixef(m1)[1]+fixef(m1)[2]+(fixef(m1)[3]+fixef(m1)[4])*timevec+sqrt(VarCorr(m1)$ID[,1])
probplac1up=exp(predplac1up)/(1+exp(predplac1up))
probtrt1up=exp(predtrt1up)/(1+exp(predtrt1up))

predplac1dn=fixef(m1)[1]+fixef(m1)[3]*timevec-sqrt(VarCorr(m1)$ID[,1])
predtrt1dn=fixef(m1)[1]+fixef(m1)[2]+(fixef(m1)[3]+fixef(m1)[4])*timevec-sqrt(VarCorr(m1)$ID[,1])
probplac1dn=exp(predplac1dn)/(1+exp(predplac1dn))
probtrt1dn=exp(predtrt1dn)/(1+exp(predtrt1dn))

par(mfrow=c(1,2))


plot(timevec,predplac1up,type='n',ylim=c(0,1),xlab='Sqrt(Weeks)',ylab='Probability',main='1SD Above Mean')
lines(timevec,probplac1up,col=2)
lines(timevec,probtrt1up,col=3)

plot(timevec,plotnull,type='n',ylim=c(0,1),xlab='Sqrt(Weeks)',
  ylab='Probability',main='1 SD Below Mean')
lines(timevec,probplac1dn,col=2)
lines(timevec,probtrt1dn,col=3)
legend('topleft',c('Plac','Trt'),col=c(2,3),lty=c(1,1)) #legend moved inside the plot region (y=2 was outside ylim)

With multiple random effects per subject, the glmer function forces us to use the Laplace approximation (i.e., a single quadrature point). I tweaked the optimizer a bit to get convergence here; you can read more about the BOBYQA optimizer at your leisure.

m2=glmer(threefour~TX+sqrtwk+TX*sqrtwk+(1+sqrtwk|ID),family=binomial,
         control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)),data=schiz)
summary(m2)

How do we interpret the output of this model?

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: threefour ~ TX + sqrtwk + TX * sqrtwk + (1 + sqrtwk | ID)
##    Data: schiz
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
##      AIC      BIC   logLik deviance df.resid 
##   1592.1   1629.7   -789.0   1578.1     1596 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.15875 -0.29682  0.09688  0.32814  2.46326 
## 
## Random effects:
##  Groups Name        Variance Std.Dev. Corr 
##  ID     (Intercept) 6.5314   2.5557        
##         sqrtwk      0.9793   0.9896   -0.28
## Number of obs: 1603, groups:  ID, 437
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.7749     0.5525   6.832 8.37e-12 ***
## TX           -0.2974     0.4948  -0.601    0.548    
## sqrtwk       -1.2200     0.2834  -4.305 1.67e-05 ***
## TX:sqrtwk    -1.3452     0.3250  -4.139 3.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) TX     sqrtwk
## TX        -0.716              
## sqrtwk    -0.741  0.552       
## TX:sqrtwk  0.250 -0.614 -0.633

Case Study: Berkeley Admissions

In fall 1973, the University of California, Berkeley’s graduate division admitted 44% of male applicants and 35% of female applicants. School administrators were concerned about the potential for bias (and lawsuits!) and asked statistics professor Peter Bickel to examine the data more carefully.


We have a subset of the admissions data for 6 departments. Previously, we used a fixed effects model to analyze the data. Now we’ll use a random effects approach, and we’ll also look at coding binomial models with grouped data (efficient for data storage and perhaps helpful in the election case study).

library(rethinking)
data(UCBadmit)
d <- UCBadmit
detach(package:rethinking,unload=T)
library(tidyverse)
library(brms)
d <-
  d%>%
  mutate(male=ifelse(applicant.gender=="male",1,0),
         dept_id = rep(1:6, each = 2))
d$successrate=d$admit/d$applications
sum(d$admit[d$male==1])/sum(d$applications[d$male==1])
## [1] 0.4451877
sum(d$admit[d$male==0])/sum(d$applications[d$male==0])
## [1] 0.3035422

We see in this subset of departments that roughly 45% of male applicants were admitted, while only 30% of female applicants were admitted.

Because admissions decisions for graduate school are made on a departmental level (not at the school level), it makes sense to examine results of applications by department. Note the data are in grouped format (counts by gender and department) instead of in individual-level (one line per applicant) format.

d[,c(1,2,3,4,7)]
##    dept applicant.gender admit reject dept_id
## 1     A             male   512    313       1
## 2     A           female    89     19       1
## 3     B             male   353    207       2
## 4     B           female    17      8       2
## 5     C             male   120    205       3
## 6     C           female   202    391       3
## 7     D             male   138    279       4
## 8     D           female   131    244       4
## 9     E             male    53    138       5
## 10    E           female    94    299       5
## 11    F             male    22    351       6
## 12    F           female    24    317       6

Hmm, what’s going on here?

Following McElreath’s analysis in Statistical Rethinking, we start by fitting a simple logistic regression model and examining diagnostic measures.

The model for department \(i\) and gender \(j\) with \(n_{admit,ij}\) of \(n_{ij}\) applicants admitted is given as:

\[n_{admit,ij} \sim \text{Binomial}(n_{ij},p_{ij}), \qquad \text{logit}(p_{ij})=\alpha+\beta\,\text{male}_{j}\]

\[\alpha \sim N(0,3^2), \qquad \beta \sim N(0,1)\]

Prior motivation: note the intercept and slope are typically on different scales. The intercept prior is fairly flat. For the slope prior, note that \(e^2=7.4\) and \(e^{-2}=0.14\), and we really don’t expect the male effect to be stronger than this (roughly 95% of the prior mass is between odds ratios of 0.14 and 7.4).
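A quick sketch checking the implied prior range on the odds-ratio scale:

exp(qnorm(c(0.025, 0.975), mean = 0, sd = 1))   # central 95% prior interval for the OR: about (0.14, 7.1)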

adm1 <-
  brm(data = d, family = binomial,
      admit | trials(applications) ~ 1 + male ,
      prior = c(prior(normal(0, 3), class = Intercept),
                prior(normal(0, 1), class = b)),
      iter = 2500, warmup = 500, cores = 2, chains = 2,
      seed = 10)
summary(adm1)
##  Family: binomial 
##   Links: mu = logit 
## Formula: admit | trials(applications) ~ 1 + male 
##    Data: d (Number of observations: 12) 
## Samples: 2 chains, each with iter = 2500; warmup = 500; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -0.83      0.05    -0.93    -0.73 1.00     2182     2417
## male          0.61      0.06     0.48     0.73 1.00     2744     2699
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Here it appears male applicants have \(e^{0.61}=1.8\) (95% credible interval (1.6, 2.1)) times the odds of admission as female applicants.

We can also put this on the probability scale.

post <- posterior_samples(adm1)

post %>%
  mutate(p_admit_male   = inv_logit_scaled(b_Intercept + b_male),
         p_admit_female = inv_logit_scaled(b_Intercept),
         diff_admit     = p_admit_male - p_admit_female) %>%
  summarise(`2.5%`  = quantile(diff_admit, probs = .025),
            `50%`   = median(diff_admit),
            `97.5%` = quantile(diff_admit, probs = .975))
##        2.5%      50%     97.5%
## 1 0.1122165 0.141135 0.1695555

Overall it appears the median probability of admission was 14 percentage points higher for males.

Model Checking

Here we take some posterior predictions and plot against the observed proportions in the data.

library(dutchmasters)
library(ggplot2)
d <-
  d %>%
  mutate(case = factor(1:12))

p <- 
  predict(adm1) %>% 
  as_tibble() %>% 
  bind_cols(d)

d_text <-
  d %>%
  group_by(dept) %>%
  summarise(case  = mean(as.numeric(case)),
            admit = mean(admit / applications) + .05)

ggplot(data = d, aes(x = case, y = admit / applications)) +
  geom_pointrange(data = p, 
                  aes(y    = Estimate / applications,
                      ymin = Q2.5     / applications ,
                      ymax = Q97.5    / applications),
                  color = '#2F4F4F',
                  shape = 1, alpha = 1/3) +
  geom_point(color = "#b34a00") +
  geom_line(aes(group = dept),
            color = "#b34a00") +
  geom_text(data = d_text,
            aes(y = admit, label = dept),
            color = "#b34a00",
            family = "serif") +
  coord_cartesian(ylim = 0:1) +
  labs(y     = "Proportion admitted",
       title = "Posterior validation check") +
  theme(axis.ticks.x = element_blank())

The orange lines connect observed proportions admitted in each department (odd numbers indicate males; even females). The grey circles indicate point and interval estimates of the model-predicted proportion admitted. Clearly the model fits the data poorly.

Varying Intercepts

Based on the plot, we have some big departmental differences. Let’s specify department as a random effect in the model.

\[n_{admit,ij} \sim \text{Binomial}(n_{ij},p_{ij}), \qquad \text{logit}(p_{ij})=\alpha_{0i}+\beta\,\text{male}_{j}\]

\[\alpha_{0i} \sim N(\alpha,\sigma^2), \qquad \sigma \sim \text{HalfCauchy}(0,1), \qquad \alpha \sim N(0,3^2), \qquad \beta \sim N(0,1)\]

adm2 <- 
  brm(data = d, family = binomial,
      admit | trials(applications) ~ 1 + male + (1 | dept_id),
      prior = c(prior(normal(0, 3), class = Intercept),
                prior(normal(0, 1), class = b),
                prior(cauchy(0, 1), class = sd)),
      iter = 4500, warmup = 500, chains = 3, cores = 3,
      seed = 13,
      control = list(adapt_delta = 0.99))
adm2$fit

## Compiling Stan program...
## Start sampling
## Inference for Stan model: cfc03ce4b5376496b22da527429eff69.
## 3 chains, each with iter=4500; warmup=500; thin=1; 
## post-warmup draws per chain=4000, total post-warmup draws=12000.
## 
##                          mean se_mean   sd   2.5%    25%    50%    75%  97.5%
## b_Intercept             -0.58    0.01 0.58  -1.74  -0.92  -0.57  -0.23   0.60
## b_male                  -0.09    0.00 0.08  -0.25  -0.15  -0.09  -0.04   0.06
## sd_dept_id__Intercept    1.36    0.01 0.49   0.75   1.04   1.25   1.56   2.62
## r_dept_id[1,Intercept]   1.25    0.01 0.58   0.08   0.91   1.24   1.59   2.43
## r_dept_id[2,Intercept]   1.20    0.01 0.58   0.02   0.86   1.20   1.55   2.38
## r_dept_id[3,Intercept]  -0.01    0.01 0.58  -1.18  -0.35  -0.02   0.34   1.17
## r_dept_id[4,Intercept]  -0.04    0.01 0.58  -1.22  -0.39  -0.05   0.30   1.13
## r_dept_id[5,Intercept]  -0.48    0.01 0.58  -1.68  -0.83  -0.49  -0.14   0.68
## r_dept_id[6,Intercept]  -2.03    0.01 0.59  -3.23  -2.38  -2.03  -1.68  -0.86
## lp__                   -60.93    0.05 2.53 -66.96 -62.41 -60.57 -59.08 -57.07
##                        n_eff Rhat
## b_Intercept             1973    1
## b_male                  4652    1
## sd_dept_id__Intercept   2071    1
## r_dept_id[1,Intercept]  1983    1
## r_dept_id[2,Intercept]  1994    1
## r_dept_id[3,Intercept]  1977    1
## r_dept_id[4,Intercept]  1980    1
## r_dept_id[5,Intercept]  1999    1
## r_dept_id[6,Intercept]  2053    1
## lp__                    2404    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Oct  5 08:26:05 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

In this model we see no evidence of a difference in admission probabilities by gender, though we do see substantial departmental variability.

Let’s evaluate whether we need a random effect for gender.

adm3 <- 
  brm(data = d, family = binomial,
      admit | trials(applications) ~ 1 + male + (1 + male | dept_id),
      prior = c(prior(normal(0, 3), class = Intercept),
                prior(normal(0, 1), class = b),
                prior(cauchy(0, 1), class = sd),
                prior(lkj(2), class = cor)),
      iter = 5000, warmup = 1000, chains = 4, cores = 4,
      seed = 13,
      control = list(adapt_delta = .99,
                     max_treedepth = 12))
adm3$fit

## Compiling Stan program...
## Start sampling
## Inference for Stan model: 2c08311663aa21c2a5d74151615de975.
## 4 chains, each with iter=5000; warmup=1000; thin=1; 
## post-warmup draws per chain=4000, total post-warmup draws=16000.
## 
##                                mean se_mean   sd   2.5%    25%    50%    75%
## b_Intercept                   -0.49    0.01 0.66  -1.81  -0.88  -0.49  -0.10
## b_male                        -0.16    0.00 0.22  -0.61  -0.29  -0.15  -0.03
## sd_dept_id__Intercept          1.55    0.01 0.57   0.85   1.18   1.43   1.77
## sd_dept_id__male               0.46    0.00 0.23   0.14   0.31   0.42   0.56
## cor_dept_id__Intercept__male  -0.32    0.00 0.34  -0.86  -0.58  -0.36  -0.09
## r_dept_id[1,Intercept]         1.77    0.01 0.69   0.41   1.35   1.77   2.19
## r_dept_id[2,Intercept]         1.22    0.01 0.71  -0.17   0.78   1.22   1.65
## r_dept_id[3,Intercept]        -0.16    0.01 0.66  -1.48  -0.55  -0.15   0.23
## r_dept_id[4,Intercept]        -0.13    0.01 0.66  -1.44  -0.53  -0.12   0.26
## r_dept_id[5,Intercept]        -0.64    0.01 0.67  -1.96  -1.04  -0.63  -0.24
## r_dept_id[6,Intercept]        -2.11    0.01 0.68  -3.47  -2.52  -2.10  -1.70
## r_dept_id[1,male]             -0.61    0.00 0.31  -1.29  -0.80  -0.59  -0.40
## r_dept_id[2,male]             -0.04    0.00 0.34  -0.72  -0.25  -0.04   0.16
## r_dept_id[3,male]              0.24    0.00 0.25  -0.22   0.08   0.22   0.39
## r_dept_id[4,male]              0.07    0.00 0.24  -0.42  -0.08   0.06   0.21
## r_dept_id[5,male]              0.27    0.00 0.27  -0.22   0.10   0.26   0.44
## r_dept_id[6,male]              0.04    0.00 0.31  -0.57  -0.15   0.04   0.23
## lp__                         -65.40    0.06 3.75 -73.93 -67.62 -64.97 -62.75
##                               97.5% n_eff Rhat
## b_Intercept                    0.83  3945    1
## b_male                         0.28  6930    1
## sd_dept_id__Intercept          2.95  4826    1
## sd_dept_id__male               1.03  5744    1
## cor_dept_id__Intercept__male   0.43 11185    1
## r_dept_id[1,Intercept]         3.16  4189    1
## r_dept_id[2,Intercept]         2.63  4555    1
## r_dept_id[3,Intercept]         1.17  3976    1
## r_dept_id[4,Intercept]         1.21  3980    1
## r_dept_id[5,Intercept]         0.69  3999    1
## r_dept_id[6,Intercept]        -0.76  4182    1
## r_dept_id[1,male]             -0.06  8219    1
## r_dept_id[2,male]              0.64 12811    1
## r_dept_id[3,male]              0.76  7727    1
## r_dept_id[4,male]              0.57  7666    1
## r_dept_id[5,male]              0.84  8835    1
## r_dept_id[6,male]              0.64 11681    1
## lp__                         -59.32  3360    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Oct  5 08:37:59 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Before we get too excited let’s look at some diagnostics.

post <- posterior_samples(adm3, add_chain = T)

post %>% 
  select(-lp__) %>% 
  gather(key, value, -chain, -iter) %>% 
  mutate(chain = as.character(chain)) %>% 

  ggplot(aes(x = iter, y = value, group = chain, color = chain)) +
  geom_line(size = 1/15) +
  scale_color_manual(values = c("#80A0C7", "#B1934A", "#A65141", "#EEDA9D")) +
  scale_x_continuous(NULL, breaks = c(1001, 5000)) +
  ylab(NULL) +
  theme_pearl_earring +
  theme(legend.position  = c(.825, .06),
        legend.direction = "horizontal") +
  facet_wrap(~key, ncol = 3, scales = "free_y")

rbind(coef(adm3)$dept_id[, , 1],
      coef(adm3)$dept_id[, , 2]) %>% 
  as_tibble() %>% 
  mutate(param   = c(paste("Intercept", 1:6), paste("male", 1:6)),
         reorder = c(6:1, 12:7)) %>% 

  # plot
  ggplot(aes(x = reorder(param, reorder))) +
  geom_hline(yintercept = 0, linetype = 3, color = "#8B9DAF") +
  geom_pointrange(aes(ymin = Q2.5, ymax = Q97.5, y = Estimate, color = reorder < 7),
                  shape = 20, size = 3/4) +
  scale_color_manual(values = c("#394165", "#A65141")) +
  xlab(NULL) +
  coord_flip() +
  theme_pearl_earring +
  theme(legend.position = "none",
        axis.ticks.y    = element_blank(),
        axis.text.y     = element_text(hjust = 0))

We see much more variability in the random intercepts than in the random slopes.

What happened at Berkeley?

It actually doesn’t require much sophisticated modeling to see what happened: this is Simpson’s paradox.

d[,c(1,2,3,4,8)]
##    dept applicant.gender admit reject successrate
## 1     A             male   512    313  0.62060606
## 2     A           female    89     19  0.82407407
## 3     B             male   353    207  0.63035714
## 4     B           female    17      8  0.68000000
## 5     C             male   120    205  0.36923077
## 6     C           female   202    391  0.34064081
## 7     D             male   138    279  0.33093525
## 8     D           female   131    244  0.34933333
## 9     E             male    53    138  0.27748691
## 10    E           female    94    299  0.23918575
## 11    F             male    22    351  0.05898123
## 12    F           female    24    317  0.07038123

In the raw data, women had higher acceptance probabilities in 4 of the 6 departments. However, the departments to which they applied in higher numbers were the departments that had lower overall acceptance rates.


What happened is that women were more likely to apply to departments like English, which have trouble supporting graduate students, and less likely to apply to STEM departments, which had more plentiful funding for graduate students. The men, on the other hand, were much more likely to apply to the STEM departments with higher acceptance rates.
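A sketch confirming this numerically from the grouped data, comparing department-level acceptance rates to the share of each department's applicants who were women:

d %>%
  group_by(dept) %>%
  summarise(dept_admit_rate = sum(admit) / sum(applications),
            female_share    = applications[applicant.gender == "female"] / sum(applications))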