We now discuss the incorporation of random effects into generalized linear models. The basic idea is that we assume there is natural heterogeneity across groups in a subset of the regression coefficients. These coefficients are assumed to vary across groups according to some distribution.
Conditional on the random effects, we assume the responses for a single subject are independent observations from a distribution in the exponential family.
In the GLMM, we assume the conditional distribution of each \(Y_{ij}\), conditional on \({\bf b}_i\), belongs to the exponential family with conditional mean \[g(E[Y_{ij} \mid {\bf b}_i])={\bf X}_{ij}'\boldsymbol{\beta}+{\bf Z}_{ij}'{\bf b}_i,\] where \(g(\cdot)\) is a known link function.
Assume the \({\bf b}_i\) are independent across subjects with \({\bf b}_i \sim N({\bf 0}, {\bf D})\). We also assume that given \({\bf b}_i\), the responses \(Y_{i1}, \ldots, Y_{in}\) are mutually independent.
\[Y_{ij}={\bf X}_{ij}'\boldsymbol{\beta}+b_i + \varepsilon_{ij},\] where \(b_i\overset{iid}\sim N(0,\sigma_b^2) \perp \varepsilon_{ij} \overset{iid}\sim N(0,\sigma_e^2)\)
and \(E(Y_{ij} \mid b_i)={\bf X}_{ij}'\boldsymbol{\beta}+b_i\) and \(E(Y_{ij}) = {\bf X}_{ij}'\boldsymbol{\beta}\)
An example of this is the radon study we previously discussed, where our response was radon level (logged), elements of \({\bf X}\) (fixed effects) included the floor of the home (in some models) and the uranium content of the soil. We had a random effect \(b_i\) for county (and in an extended model added a random slope for floor).
\[\text{logit}(E(Y_{ij} \mid b_i))={\bf X}_{ij}'\boldsymbol{\beta}+b_i,\] where \(b_i\sim N(0,\sigma^2)\)
Question: what happened to \(\varepsilon_{ij}\)?
\(\log(E(Y_{ij} \mid {\bf b}_i))={\bf X}_{ij}'\boldsymbol{\beta}+{\bf Z}_{ij}'{\bf b}_i,\) with \({\bf X}_{ij}={\bf Z}_{ij}=[1,t_{ij}]\) (random slopes and intercepts) and \({\bf b}_i \sim N({\bf 0}, {\bf D})\).
In the model \[\text{logit}(E[Y_{ij} \mid b_i])={\bf X}_{ij}'\boldsymbol{\beta}+b_i,\] with \(b_i \sim N(0,\sigma^2)\), each element of \(\boldsymbol{\beta}\) measures the change in the log odds of a ‘positive’ response per unit change in the respective covariate, in a specific group that has an underlying propensity to respond positively given by \(b_i\).
That is, we need to hold the group membership constant when interpreting \(\beta_k\), just as we would hold the values of \(\bf{x}_{i,(-k)}\) constant when interpreting \(\beta_k\)
Note also that with a non-linear link function, a non-linear contrast of the averages is not equal to the average of non-linear contrasts, so that the parameters do not in general have population-average interpretations (as they would in a linear mixed effects model, which has identity link). So while in the lmm \[g(E(Y_{ij} \mid {\bf X}_{ij}, {\bf b}_i))={\bf X}_{ij}'\boldsymbol{\beta}+{\bf Z}_{ij}'{\bf b}_i\] so that \(E(Y_{ij} \mid {\bf X}_{ij})={\bf X}_{ij}'\boldsymbol{\beta}\), when \(g(\cdot)\) is non-linear (say the logit), then \[g(E(Y_{ij} \mid {\bf X}_{ij}))\neq {\bf X}_{ij}'\boldsymbol{\beta}\] for all \(\boldsymbol{\beta}\) when averaged over the distribution of the random effects.
The joint probability density function is given by \[f({\bf Y}_i \mid {\bf X}_i, {\bf b}_i)f({\bf b}_i).\] However, recall that the \({\bf b}_i\) are unobserved. How then do we handle the \({\bf b}_i\) in the maximization?
Typically, we base frequentist inferences on the marginal (integrated) likelihood function, given by \[\prod_{i=1}^N \int f({\bf Y}_i \mid {\bf X}_i, {\bf b}_i)f({\bf b}_i) d{\bf b}_i.\] Estimation using maximum likelihood involves a two-step procedure.
Obtain ML estimates of \(\boldsymbol{\beta}\) and \({\bf D}\) based on the marginal likelihood of the data. While this may sound simple, numerical or Monte Carlo integration techniques are typically required, and the techniques used may have substantial impacts on resulting inferences.
Given estimates of \(\boldsymbol{\beta}\) and \({\bf D}\), predict the random effects as \[\widehat{{\bf b}}_i=E({\bf b}_i \mid {\bf Y}_i, \widehat{\boldsymbol{\beta}}, \widehat{{\bf D}}).\] Again, simple analytic solutions are rarely available.
Even when the burden of integration is not that great, the optimization problem may be difficult to solve.
The National Institute of Mental Health (NIMH) conducted a study of three active treatments versus placebo among individuals with schizophrenia. Schizophrenia is a complex mental condition that makes it difficult to tell the difference between real and unreal experiences, to think logically, to have normal emotional responses, and to behave normally in social situations.
This Institute is the evil NIMH from the kids’ book and movie!
Our data combine all three active treatment groups. The outcome is coded 1-4, with 1 indicating normal mental health and increasing numbers indicating more severe illness.
Measures were taken weekly at baseline at three planned follow-up visits, though there was some variation in timing of the follow-up as shown below.
Group | Week 0 | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 |
---|---|---|---|---|---|---|---|
Placebo | 107 | 105 | 5 | 87 | 2 | 2 | 70 |
Active Tx | 327 | 321 | 9 | 287 | 9 | 7 | 265 |
Note there is significant loss to follow-up, with 65% of those on placebo completing the study and 81% of those on treatment completing the study. This issue of missing data could be problematic (maybe those on placebo are more likely to drop out due to feeling they are not improving).
The question of interest in the study is whether the response trajectory on active treatment differs from that on placebo.
We consider the model
\[\text{logit}(Pr(Y_{ij}=1))=\beta_0+\beta_1x_i+\beta_2t_j+\beta_3x_it_j+b_i\]
\[\text{logit}(Pr(Y_{ij}=1))=\beta_0+\beta_1x_i+\beta_2t_j+\beta_3x_it_j+b_i\]
ML estimation is carried out typically using adaptive Gaussian quadrature. To improve accuracy over many package defaults (Laplace approximation), increase the number of quadrature points to be greater than one. Note that some software packages require Laplace approximation with Gaussian quadrature if the number of random effects is more than 1 for the sake of computational efficiency.
library(lme4)
schiz=read.table(file='schiz.txt',header=T)
schiz$threefour=as.numeric(schiz$Y>2) #worst two categories
schiz$sqrtwk=sqrt(schiz$WK) #recommended by researchers
m1=glmer(threefour~TX+sqrtwk+TX*sqrtwk+(1|ID),family=binomial,data=schiz,nAGQ=50) #default nAGQ=1 is Laplace approx
schiz$slopeplac=schiz$sqrtwk*(1-schiz$TX)
schiz$slopetx=schiz$sqrtwk*schiz$TX
m2=glmer(threefour~TX+slopeplac+slopetx+(1|ID),family=binomial,data=schiz,nAGQ=50) #get slopes by trt
## Generalized linear mixed model fit by maximum likelihood (Adaptive
## Gauss-Hermite Quadrature, nAGQ = 50) [glmerMod]
## Family: binomial ( logit )
## Formula: threefour ~ TX + sqrtwk + TX * sqrtwk + (1 | ID)
## Data: schiz
##
## AIC BIC logLik deviance df.resid
## 1578.7 1605.6 -784.4 1568.7 1598
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -9.2908 -0.3121 0.1076 0.3478 3.4756
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 5.608 2.368
## Number of obs: 1603, groups: ID, 437
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.5148 0.4444 7.909 2.59e-15 ***
## TX -0.3570 0.4720 -0.756 0.449
## sqrtwk -1.2136 0.2115 -5.738 9.58e-09 ***
## TX:sqrtwk -1.0538 0.2469 -4.268 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) TX sqrtwk
## TX -0.830
## sqrtwk -0.711 0.601
## TX:sqrtwk 0.457 -0.670 -0.765
## Generalized linear mixed model fit by maximum likelihood (Adaptive
## Gauss-Hermite Quadrature, nAGQ = 50) [glmerMod]
## Family: binomial ( logit )
## Formula: threefour ~ TX + slopeplac + slopetx + (1 | ID)
## Data: schiz
##
## AIC BIC logLik deviance df.resid
## 1578.7 1605.6 -784.4 1568.7 1598
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -9.2909 -0.3121 0.1076 0.3478 3.4756
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 5.608 2.368
## Number of obs: 1603, groups: ID, 437
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.5149 0.4444 7.909 2.60e-15 ***
## TX -0.3570 0.4720 -0.756 0.449
## slopeplac -1.2136 0.2115 -5.738 9.59e-09 ***
## slopetx -2.2674 0.1608 -14.104 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) TX slpplc
## TX -0.830
## slopeplac -0.711 0.601
## slopetx -0.234 -0.239 0.141
How do we characterize the size of the random effects variance \(\sigma^2\)?
\(\text{logit}(Pr(Y_{ij}=1))={\bf X}_{ij}'\boldsymbol{\beta}+b_i,\) where \(b_i\sim N(0,\sigma^2)\)
Suppose \({\bf X}_{ij}'\boldsymbol{\beta}=0\).
So \(\sigma^2=1\) gives us large variation across subjects.
Consider an unobserved continuous variable \(W_{ij}\). \(W_{ij}\) is related to \(Y_{ij}\) in the following manner: \(Y_{ij}=1\) if \(W_{ij}<c\), and \(Y_{ij}=0\) otherwise.
The location of \(c\) and the distribution of \(W\) govern the probability that \(Y=1\).
Useful way of thinking about model but not an essential assumption: \[W_{ij}= {\bf X}_{ij}'\boldsymbol{\beta}+b_i+\varepsilon_{ij} \]
We can use this framework to calculate ICC’s:
\(ICC=\frac{\sigma^2}{\sigma^2+1} ~~~ \text{for probit}\)
\(ICC=\frac{\sigma^2}{\sigma^2+\frac{\pi^2}{3}} ~~~ \text{for logistic}\)
What is the ICC in the schizophrenia data?
We next will consider time trends for a variety of subjects in the study, starting with a typical (\(b_i=0\)) subject.
timevec=seq(0,sqrt(6),by=.1); t2=timevec*timevec;
plotnull=rep(0,length(t2))
predplac0=fixef(m1)[1]+fixef(m1)[3]*timevec
predtrt0=fixef(m1)[1]+fixef(m1)[2]+(fixef(m1)[3]+fixef(m1)[4])*timevec
probplac0=exp(predplac0)/(1+exp(predplac0))
probtrt0=exp(predtrt0)/(1+exp(predtrt0))
par(mfrow=c(1,3))
plot(timevec,plotnull,type='n',ylim=c(-2.2,.5),xlab='Sqrt(Weeks)',ylab='Logit',main='Typical subject')
lines(timevec,predplac0,col=2)
lines(timevec,predtrt0,col=3)
plot(timevec,plotnull,type='n',ylim=c(0,1),xlab='Sqrt(Weeks)',
ylab='Probability',main='Typical subject')
lines(timevec,probplac0,col=2)
lines(timevec,probtrt0,col=3)
plot(t2,plotnull,type='n',ylim=c(0,1),xlab='Weeks',
ylab='Probability',main='Typical subject')
lines(t2,probplac0,col=2)
lines(t2,probtrt0,col=3)
Red: placebo and green: treatment
Now let’s look at a subject with \(b_i\) 1 SD above the mean and a subject with \(b_i\) 1 SD below the mean.
predplac1up=fixef(m1)[1]+fixef(m1)[3]*timevec+sqrt(VarCorr(m1)$ID[,1])
predtrt1up=fixef(m1)[1]+fixef(m1)[2]+(fixef(m1)[3]+fixef(m1)[4])*timevec+sqrt(VarCorr(m1)$ID[,1])
probplac1up=exp(predplac1up)/(1+exp(predplac1up))
probtrt1up=exp(predtrt1up)/(1+exp(predtrt1up))
predplac1dn=fixef(m1)[1]+fixef(m1)[3]*timevec-sqrt(VarCorr(m1)$ID[,1])
predtrt1dn=fixef(m1)[1]+fixef(m1)[2]+(fixef(m1)[3]+fixef(m1)[4])*timevec-sqrt(VarCorr(m1)$ID[,1])
probplac1dn=exp(predplac1dn)/(1+exp(predplac1dn))
probtrt1dn=exp(predtrt1dn)/(1+exp(predtrt1dn))
par(mfrow=c(1,2))
plot(timevec,predplac1up,type='n',ylim=c(0,1),xlab='Sqrt(Weeks)',ylab='Probability',main='1SD Above Mean')
lines(timevec,probplac1up,col=2)
lines(timevec,probtrt1up,col=3)
plot(timevec,plotnull,type='n',ylim=c(0,1),xlab='Sqrt(Weeks)',
ylab='Probability',main='1 SD Below Mean')
lines(timevec,probplac1dn,col=2)
lines(timevec,probtrt1dn,col=3)
legend(0,2,c('Plac','Trt'),col=c(2,3),lty=c(1,1))
With multiple random effects, the glmer function forces us to use Laplace approximation (e.g., 1 quadrature point). I tweaked the optimizer a bit to get convergence here - you can read more about the BOBYQA optimizer at your leisure.
How do we interpret the output of this model?
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: threefour ~ TX + sqrtwk + TX * sqrtwk + (1 + sqrtwk | ID)
## Data: schiz
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
##
## AIC BIC logLik deviance df.resid
## 1592.1 1629.7 -789.0 1578.1 1596
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.15875 -0.29682 0.09688 0.32814 2.46326
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 6.5314 2.5557
## sqrtwk 0.9793 0.9896 -0.28
## Number of obs: 1603, groups: ID, 437
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.7749 0.5525 6.832 8.37e-12 ***
## TX -0.2974 0.4948 -0.601 0.548
## sqrtwk -1.2200 0.2834 -4.305 1.67e-05 ***
## TX:sqrtwk -1.3452 0.3250 -4.139 3.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) TX sqrtwk
## TX -0.716
## sqrtwk -0.741 0.552
## TX:sqrtwk 0.250 -0.614 -0.633
In fall 1973, the University of California, Berkeley’s graduate division admitted 44% of male applicants and 35% of female applicants. School administrators were concerned about the potential for bias (and lawsuits!) and asked statistics professor Peter Bickel to examine the data more carefully.
We have a subset of the admissions data for 6 departments. Previously, we used a fixed effects model to analyze the data. Now we’ll use a random effects approach, and we’ll also look at coding of binomial models with grouped data (efficient for data storage and perhaps helpful in election case study).
library(rethinking)
data(UCBadmit)
d <- UCBadmit
detach(package:rethinking,unload=T)
library(tidyverse)
library(brms)
d <-
d%>%
mutate(male=ifelse(applicant.gender=="male",1,0),
dept_id = rep(1:6, each = 2))
d$successrate=d$admit/d$applications
sum(d$admit[d$male==1])/sum(d$applications[d$male==1])
## [1] 0.4451877
## [1] 0.3035422
We see in this subset of departments that roughly 45% of male applicants were admitted, while only 30% of female applicants were admitted.
Because admissions decisions for graduate school are made on a departmental level (not at the school level), it makes sense to examine results of applications by department. Note the data are in grouped format (counts by gender and department) instead of in individual-level (one line per applicant) format.
## dept applicant.gender admit reject dept_id
## 1 A male 512 313 1
## 2 A female 89 19 1
## 3 B male 353 207 2
## 4 B female 17 8 2
## 5 C male 120 205 3
## 6 C female 202 391 3
## 7 D male 138 279 4
## 8 D female 131 244 4
## 9 E male 53 138 5
## 10 E female 94 299 5
## 11 F male 22 351 6
## 12 F female 24 317 6
Hmm, what’s going on here?
Following McElreath’s analysis in Statistical Rethinking, we start fitting a simple logistic regression model and examine diagnostic measures.
The model for department \(i\) and gender \(j\) with \(n_{admit,ij}\) of \(n_{ij}\) applicants admitted is given as:
\(n_{admit,ij} \sim \text{Binomial}(n_{ij},p_{ij})~~~\) \(\text{logit}(p_{ij})=\alpha+\beta\text{male}_{j}\)
\(\alpha \sim N(0,3^2)\) and \(\beta \sim N(0,1)\)
Prior motivation: note the intercept and slope are typically on different scales. The intercept prior is fairly flat. For the slope prior, note that \(e^2=7.4\) and \(e^{-2}=0.14\), and we really don’t expect the male effect to be stronger than this (roughly 95% of the prior mass is between odds ratios of 0.14 and 7.4).
adm1 <-
brm(data = d, family = binomial,
admit | trials(applications) ~ 1 + male ,
prior = c(prior(normal(0, 3), class = Intercept),
prior(normal(0, 1), class = b)),
iter = 2500, warmup = 500, cores = 2, chains = 2,
seed = 10)
summary(adm1)
## Family: binomial
## Links: mu = logit
## Formula: admit | trials(applications) ~ 1 + male
## Data: d (Number of observations: 12)
## Samples: 2 chains, each with iter = 2500; warmup = 500; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept -0.83 0.05 -0.93 -0.73 1.00 2182 2417
## male 0.61 0.06 0.48 0.73 1.00 2744 2699
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Here it appears male applicants have \(e^{0.61}=1.8\) (95% credible interval (1.6, 2.1)) times the odds of admission as female applicants.
We can also put this on the probability scale.
post <- posterior_samples(adm1)
post %>%
mutate(p_admit_male = inv_logit_scaled(b_Intercept + b_male),
p_admit_female = inv_logit_scaled(b_Intercept),
diff_admit = p_admit_male - p_admit_female) %>%
summarise(`2.5%` = quantile(diff_admit, probs = .025),
`50%` = median(diff_admit),
`97.5%` = quantile(diff_admit, probs = .975))
## 2.5% 50% 97.5%
## 1 0.1122165 0.141135 0.1695555
Overall it appears the median probability of admission was 14 percentage points higher for males.
Here we take some posterior predictions and plot against the observed proportions in the data.
library(dutchmasters)
library(ggplot2)
d <-
d %>%
mutate(case = factor(1:12))
p <-
predict(adm1) %>%
as_tibble() %>%
bind_cols(d)
d_text <-
d %>%
group_by(dept) %>%
summarise(case = mean(as.numeric(case)),
admit = mean(admit / applications) + .05)
ggplot(data = d, aes(x = case, y = admit / applications)) +
geom_pointrange(data = p,
aes(y = Estimate / applications,
ymin = Q2.5 / applications ,
ymax = Q97.5 / applications),
color = '#2F4F4F',
shape = 1, alpha = 1/3) +
geom_point(color = #b34a00') +
geom_line(aes(group = dept),
color = #b34a00') +
geom_text(data = d_text,
aes(y = admit, label = dept),
color = #b34a00',
family = "serif") +
coord_cartesian(ylim = 0:1) +
labs(y = "Proportion admitted",
title = "Posterior validation check") +
theme(axis.ticks.x = element_blank())
The orange lines connect observed proportions admitted in each department (odd numbers indicate males; even females). The grey circles indicate point and interval estimates of the model-predicted proportion admitted. Clearly the model fits the data poorly.
Based on the plot, we have some big departmental differences. Let’s specify department as a random effect in the model.
\(n_{admit,ij} \sim \text{Binomial}(n_{ij},p_{ij})~~~\) \(\text{logit}(p_{ij})=\alpha_{0i}+\beta\text{male}_{j}\) \(\alpha_{0i} \sim N(\alpha,\sigma) ~~~ \sigma \sim \text{HalfCauchy}(0,1)\) \(\alpha \sim N(0,3^2)\) and \(\beta \sim N(0,1)\)
adm2 <-
brm(data = d, family = binomial,
admit | trials(applications) ~ 1 + male + (1 | dept_id),
prior = c(prior(normal(0, 3), class = Intercept),
prior(normal(0, 1), class = b),
prior(cauchy(0, 1), class = sd)),
iter = 4500, warmup = 500, chains = 3, cores = 3,
seed = 13,
control = list(adapt_delta = 0.99))
## Compiling Stan program...
## Start sampling
## Inference for Stan model: cfc03ce4b5376496b22da527429eff69.
## 3 chains, each with iter=4500; warmup=500; thin=1;
## post-warmup draws per chain=4000, total post-warmup draws=12000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5%
## b_Intercept -0.58 0.01 0.58 -1.74 -0.92 -0.57 -0.23 0.60
## b_male -0.09 0.00 0.08 -0.25 -0.15 -0.09 -0.04 0.06
## sd_dept_id__Intercept 1.36 0.01 0.49 0.75 1.04 1.25 1.56 2.62
## r_dept_id[1,Intercept] 1.25 0.01 0.58 0.08 0.91 1.24 1.59 2.43
## r_dept_id[2,Intercept] 1.20 0.01 0.58 0.02 0.86 1.20 1.55 2.38
## r_dept_id[3,Intercept] -0.01 0.01 0.58 -1.18 -0.35 -0.02 0.34 1.17
## r_dept_id[4,Intercept] -0.04 0.01 0.58 -1.22 -0.39 -0.05 0.30 1.13
## r_dept_id[5,Intercept] -0.48 0.01 0.58 -1.68 -0.83 -0.49 -0.14 0.68
## r_dept_id[6,Intercept] -2.03 0.01 0.59 -3.23 -2.38 -2.03 -1.68 -0.86
## lp__ -60.93 0.05 2.53 -66.96 -62.41 -60.57 -59.08 -57.07
## n_eff Rhat
## b_Intercept 1973 1
## b_male 4652 1
## sd_dept_id__Intercept 2071 1
## r_dept_id[1,Intercept] 1983 1
## r_dept_id[2,Intercept] 1994 1
## r_dept_id[3,Intercept] 1977 1
## r_dept_id[4,Intercept] 1980 1
## r_dept_id[5,Intercept] 1999 1
## r_dept_id[6,Intercept] 2053 1
## lp__ 2404 1
##
## Samples were drawn using NUTS(diag_e) at Mon Oct 5 08:26:05 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
In this model we see no evidence of a difference in admissions probabilities by gender though we do see big departmental variability.
Let’s evaluate whether we need a random effect for gender.
adm3 <-
brm(data = d, family = binomial,
admit | trials(applications) ~ 1 + male + (1 + male | dept_id),
prior = c(prior(normal(0, 3), class = Intercept),
prior(normal(0, 1), class = b),
prior(cauchy(0, 1), class = sd),
prior(lkj(2), class = cor)),
iter = 5000, warmup = 1000, chains = 4, cores = 4,
seed = 13,
control = list(adapt_delta = .99,
max_treedepth = 12))
adm3$fit
## Compiling Stan program...
## Start sampling
## Inference for Stan model: 2c08311663aa21c2a5d74151615de975.
## 4 chains, each with iter=5000; warmup=1000; thin=1;
## post-warmup draws per chain=4000, total post-warmup draws=16000.
##
## mean se_mean sd 2.5% 25% 50% 75%
## b_Intercept -0.49 0.01 0.66 -1.81 -0.88 -0.49 -0.10
## b_male -0.16 0.00 0.22 -0.61 -0.29 -0.15 -0.03
## sd_dept_id__Intercept 1.55 0.01 0.57 0.85 1.18 1.43 1.77
## sd_dept_id__male 0.46 0.00 0.23 0.14 0.31 0.42 0.56
## cor_dept_id__Intercept__male -0.32 0.00 0.34 -0.86 -0.58 -0.36 -0.09
## r_dept_id[1,Intercept] 1.77 0.01 0.69 0.41 1.35 1.77 2.19
## r_dept_id[2,Intercept] 1.22 0.01 0.71 -0.17 0.78 1.22 1.65
## r_dept_id[3,Intercept] -0.16 0.01 0.66 -1.48 -0.55 -0.15 0.23
## r_dept_id[4,Intercept] -0.13 0.01 0.66 -1.44 -0.53 -0.12 0.26
## r_dept_id[5,Intercept] -0.64 0.01 0.67 -1.96 -1.04 -0.63 -0.24
## r_dept_id[6,Intercept] -2.11 0.01 0.68 -3.47 -2.52 -2.10 -1.70
## r_dept_id[1,male] -0.61 0.00 0.31 -1.29 -0.80 -0.59 -0.40
## r_dept_id[2,male] -0.04 0.00 0.34 -0.72 -0.25 -0.04 0.16
## r_dept_id[3,male] 0.24 0.00 0.25 -0.22 0.08 0.22 0.39
## r_dept_id[4,male] 0.07 0.00 0.24 -0.42 -0.08 0.06 0.21
## r_dept_id[5,male] 0.27 0.00 0.27 -0.22 0.10 0.26 0.44
## r_dept_id[6,male] 0.04 0.00 0.31 -0.57 -0.15 0.04 0.23
## lp__ -65.40 0.06 3.75 -73.93 -67.62 -64.97 -62.75
## 97.5% n_eff Rhat
## b_Intercept 0.83 3945 1
## b_male 0.28 6930 1
## sd_dept_id__Intercept 2.95 4826 1
## sd_dept_id__male 1.03 5744 1
## cor_dept_id__Intercept__male 0.43 11185 1
## r_dept_id[1,Intercept] 3.16 4189 1
## r_dept_id[2,Intercept] 2.63 4555 1
## r_dept_id[3,Intercept] 1.17 3976 1
## r_dept_id[4,Intercept] 1.21 3980 1
## r_dept_id[5,Intercept] 0.69 3999 1
## r_dept_id[6,Intercept] -0.76 4182 1
## r_dept_id[1,male] -0.06 8219 1
## r_dept_id[2,male] 0.64 12811 1
## r_dept_id[3,male] 0.76 7727 1
## r_dept_id[4,male] 0.57 7666 1
## r_dept_id[5,male] 0.84 8835 1
## r_dept_id[6,male] 0.64 11681 1
## lp__ -59.32 3360 1
##
## Samples were drawn using NUTS(diag_e) at Mon Oct 5 08:37:59 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Before we get too excited let’s look at some diagnostics.
post <- posterior_samples(adm3, add_chain = T)
post %>%
select(-lp__) %>%
gather(key, value, -chain, -iter) %>%
mutate(chain = as.character(chain)) %>%
ggplot(aes(x = iter, y = value, group = chain, color = chain)) +
geom_line(size = 1/15) +
scale_color_manual(values = c("#80A0C7", "#B1934A", "#A65141", "#EEDA9D")) +
scale_x_continuous(NULL, breaks = c(1001, 5000)) +
ylab(NULL) +
theme_pearl_earring +
theme(legend.position = c(.825, .06),
legend.direction = "horizontal") +
facet_wrap(~key, ncol = 3, scales = "free_y")
rbind(coef(adm3)$dept_id[, , 1],
coef(adm3)$dept_id[, , 2]) %>%
as_tibble() %>%
mutate(param = c(paste("Intercept", 1:6), paste("male", 1:6)),
reorder = c(6:1, 12:7)) %>%
# plot
ggplot(aes(x = reorder(param, reorder))) +
geom_hline(yintercept = 0, linetype = 3, color = "#8B9DAF") +
geom_pointrange(aes(ymin = Q2.5, ymax = Q97.5, y = Estimate, color = reorder < 7),
shape = 20, size = 3/4) +
scale_color_manual(values = c("#394165", "#A65141")) +
xlab(NULL) +
coord_flip() +
theme_pearl_earring +
theme(legend.position = "none",
axis.ticks.y = element_blank(),
axis.text.y = element_text(hjust = 0))
We see much more variability in the random intercepts than in the random slopes.
What happened at Berkeley? It actually doesn’t require too much sophisticated modeling. What we’re seeing is Simpson’s paradox.
## dept applicant.gender admit reject successrate
## 1 A male 512 313 0.62060606
## 2 A female 89 19 0.82407407
## 3 B male 353 207 0.63035714
## 4 B female 17 8 0.68000000
## 5 C male 120 205 0.36923077
## 6 C female 202 391 0.34064081
## 7 D male 138 279 0.33093525
## 8 D female 131 244 0.34933333
## 9 E male 53 138 0.27748691
## 10 E female 94 299 0.23918575
## 11 F male 22 351 0.05898123
## 12 F female 24 317 0.07038123
In the raw data, women had higher acceptance probabilities in 4 of the 6 departments. However, the departments to which they applied in higher numbers were the departments that had lower overall acceptance rates.
What happened is that women were more likely to apply do departments like English, which have trouble supporting grad students, and they were less likely to apply to STEM departments, which had more plentiful funding for graduate students. The men, on the other hand, were much more likely to apply to the STEM departments that had higher acceptance rates.