knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(lme4)
library(brms) #simple interface with Stan for Bayes MLM
library(rstan)
#devtools::install_github("hrbrmstr/albersusa")
library(albersusa) #plotting
library(cowplot) # plotting
library(directlabels)
library(tidybayes) #work easily with posterior samples
rstan_options(auto_write=TRUE)
options(mc.cores=parallel::detectCores())
First, we use multilevel regression to model individual survey responses as a function of demographic and geographic predictors.
Then we use poststratification, in which we weight (poststratify) the estimates for each demographic-geographic respondent type by the percentages of each type in the actual state populations.
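Concretely, if \(\theta_c\) is the estimated support in demographic-geographic cell \(c\) and \(N_c\) is that cell's population count, the poststratified estimate for state \(s\) is the population-weighted average over the cells in that state:
\[ \theta^{PS}_s = \frac{\sum_{c \in s} N_c \theta_c}{\sum_{c \in s} N_c} \]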
This tutorial draws heavily on Jonathan Kastellec’s MrP Primer and Tim Mastny’s version using Stan. You may find the full primer on Kastellec’s website useful in addition to the shorter version presented here. First, download all three datasets from Sakai.
The goal is to estimate support for gay marriage in each state based on survey data that are potentially non-representative. Because not all subgroups of the population are equally likely to respond to polls, we worry that relying only on a survey could lead to biased estimates of support for gay marriage. (For example, younger people may be more likely than older people to answer questions about gay marriage; however, older people may be more likely than younger people to answer the phone and participate in a survey.)
The US Census is a good source of information about characteristics of the full US population. Using Census data, we can scale the average support of population subgroups of interest in proportion to their representation in the state population.
Demographic data are often of limited quality. For the US Census, respondents are only given a binary sex classification. Data quality is further limited by having one person (“head of household”) make this classification for all household residents.
On the NC voter registration form, gender is recorded as binary, and race and ethnicity are self-classified into a small number of groups.
If we were to combine data from the US Census with data from the NC voter registration files, we would need to make some decisions about how to deal with missing data, classifications that are not 1:1 mappings, and variables with differing definitions (biological sex versus gender identity, to the extent respondents distinguish between them).
marriage.data=foreign::read.dta('data/gay_marriage_megapoll.dta',
convert.underscore=TRUE)
marriage.data$state=as.factor(marriage.data$state)
marriage.data$region=as.factor(marriage.data$region)
Statelevel=foreign::read.dta('data/state_level_update.dta',
convert.underscore=TRUE)
Census=foreign::read.dta('data/poststratification 2000.dta',
convert.underscore=TRUE)
Statelevel=Statelevel %>% rename(state=sstate) #rename to state
marriage.data=Statelevel %>%
select(state,p.evang,p.mormon,kerry.04) %>%
right_join(marriage.data)
## Joining, by = "state"
# combine demographic groups and label them
marriage.data$race.female <- (marriage.data$female *3) + marriage.data$race.wbh
marriage.data$race.female=factor(marriage.data$race.female,levels=1:6,
labels=c("WhMale","BlMale","HMale","WhFem","BlFem","HFem"))
marriage.data$age.edu.cat <- 4 * (marriage.data$age.cat -1) + marriage.data$edu.cat
marriage.data$age.edu.cat=factor(marriage.data$age.edu.cat,levels=1:16,
labels=c("18-29,<HS","18-29,HS","18-29,SC","18-29,CG",
"30-44,<HS","30-44,HS","30-44,SC","30-44,CG",
"45-64,<HS","45-64,HS","45-64,SC","45-64,CG",
"65+,<HS","65+,HS","65+,SC","65+,CG"))
marriage.data$p.evang <- Statelevel$p.evang[marriage.data$state.initnum]
# proportion of evangelicals in respondent's state
marriage.data$p.mormon <-Statelevel$p.mormon[marriage.data$state.initnum]
# proportion of LDS church members in respondent's state
marriage.data$p.relig <- marriage.data$p.evang + marriage.data$p.mormon
# combined evangelical + LDS proportions
marriage.data$kerry.04 <- Statelevel$kerry.04[marriage.data$state.initnum]
# John Kerry's % of 2-party vote in respondent's state in 2004
marriage.data <- marriage.data %>%
filter(state!="")
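# Quick sanity check on the combined-category coding (a sketch; assumes
# age.cat and edu.cat each take values 1-4): every (age.cat, edu.cat)
# pair should land in exactly one age.edu.cat label
with(marriage.data, table(age.cat, age.edu.cat))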
# Census variables
Census<-Census %>%
rename(state=cstate, age.cat=cage.cat, edu.cat=cedu.cat,region=cregion)
Census$race.female <- (Census$cfemale *3) + Census$crace.WBH
Census$race.female=factor(Census$race.female,levels=1:6,
labels=c("WhMale","BlMale","HMale","WhFem","BlFem","HFem"))
Census$age.edu.cat <- 4 * (Census$age.cat-1) + Census$edu.cat
Census$age.edu.cat=factor(Census$age.edu.cat,levels=1:16,
labels=c("18-29,<HS","18-29,HS","18-29,SC","18-29,CG",
"30-44,<HS","30-44,HS","30-44,SC","30-44,CG",
"45-64,<HS","45-64,HS","45-64,SC","45-64,CG",
"65+,<HS","65+,HS","65+,SC","65+,CG"))
Census <- Statelevel %>%
select(state,p.evang,p.mormon,kerry.04) %>%
right_join(Census)
## Joining, by = "state"
We’ll call these averages the disaggregated model – just taking the mean response within each state.
# Get state averages
mod.disag=marriage.data%>%
group_by(state) %>%
summarise(support=mean(yes.of.all)) %>%
mutate(model="no_ps")
These averages will not be representative of the actual statewide means if the sampled respondents are not in proportion to each group’s share of the state population, so next we poststratify.
First, we find the within-group average response in each state.
Next we weight each group’s average by that group’s percentage of the state population.
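A sketch of these two steps, assuming the Census column cpercent.state gives each cell’s share of its state’s population; the names grp.means and mod.disag.ps match the plotting code below:

grp.means <- marriage.data %>%
  group_by(state, race.female, age.edu.cat) %>%
  summarise(support = mean(yes.of.all, na.rm = TRUE)) %>%
  left_join(Census %>%
              select(state, race.female, age.edu.cat, cpercent.state),
            by = c("state", "race.female", "age.edu.cat"))
# weight each group mean by the group's share of its state's population
mod.disag.ps <- grp.means %>%
  group_by(state) %>%
  summarise(support = sum(support * cpercent.state, na.rm = TRUE)) %>%
  mutate(model = "ps")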
#compare poststratified and empirical means -- nice plot!
disag.point <- bind_rows(mod.disag,mod.disag.ps) %>%
group_by(model) %>%
arrange(support, .by_group=TRUE) %>%
ggplot(aes(x=support,y=forcats::fct_inorder(state),color=model)) +
geom_point() + theme_classic() +
theme(legend.position='none') +
directlabels::geom_dl(aes(label=model),method='smart.grid') +
ylab('state')
disag.point
Variation in poststratified estimates is pretty large. Also, the poststratified estimates appear closer to zero – what is going on?
Let’s sum, by state, the Census percentages of the demographic cells that actually appear in the survey, to make sure they sum to 1.
grp.means %>%
group_by(state) %>%
summarize(total_percent=sum(cpercent.state, na.rm=TRUE)) %>%
filter(state != "") %>%
ggplot(aes(x=state,y=total_percent)) +
geom_text(aes(label=state),hjust=0.5,vjust=0.25) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
coord_fixed(ratio=8)+ylim(c(0,1.1))
Ahh, the surveys do not have responses from each demographic group in each state. Our poststratification is assuming the missing demographic groups have 0% support, which is not good – even though we have no black men from South Dakota in the survey, there are some in the state (1.7% of the SD population identifies as black or African-American). We underestimate the level of support by assuming no black men in SD support gay marriage.
One advantage of fitting a multilevel model is that we can borrow information to get better estimates.
In the case of African-American men from South Dakota, we do have responses from black men in nearby states (North Dakota has roughly 3x the African-American population of South Dakota) and other US states, which we can use to make a better guess about the level of support for gay marriage among black men in South Dakota.
Now we fit a regression model for individual survey responses on gay marriage rights given demographics and geography, i.e. each individual’s response will be a function of their demographics and state.
We’ll denote each individual as \(i\), with indices \(j\), \(k\), and \(s\) for race-gender combination, age-education combination, and state, respectively. We let \(y_i=1\) for supporters of same-sex marriage and \(y_i=0\) for opponents and those with no opinion.
\[ \mathrm{logit}\left(\mathrm{Pr}(y_{i} = 1)\right) = \beta_0 + \alpha^{race,gender}_{j[i]} + \alpha^{age,edu}_{k[i]} + \alpha^{state}_{s[i]} \]
Here \(i\) indexes the individual, \(j\) the race/gender group, \(k\) the age/education group, and \(s\) the state. We can think of the terms after the intercept as modeled effects for different groups of respondents, such as individuals in a specific age/education category. All of them except the state effect are modeled as drawn from a normal distribution with mean zero and a variance specific to that grouping. For example,
\[\begin{eqnarray*} \alpha^{race,gender}_j &\sim& N(0, \sigma^2_{race,gender}), ~~~\mathrm{j=1, ..., 6} \\ \alpha^{age,edu}_k &\sim& N(0, \sigma^2_{age,edu}), ~~~\mathrm{k=1,...,16} \end{eqnarray*}\]
For the state effect, we model the mean for the state effect as a function of 3 state level variables: the region into which the state falls, the state’s conservative (defined as evangelical+LDS) religious percentage, and its Democratic 2004 presidential vote share. Here \(m[s]\) indexes the region \(m\) containing state or entity \(s\) (regions include the District of Columbia, Midwest, Northeast, South, and West).
\[\begin{eqnarray*} \alpha^{state}_s &\sim& N(\alpha^{reg}_{m[s]} + \beta^{relig}\cdot relig_s + \beta^{vote} \cdot vote_s, \sigma^2_{state}), ~~~\mathrm{s=1,...,51} \\ \alpha^{reg}_m &\sim& N(0,\sigma^2_{region}), ~~~\mathrm{m=1,...,5} \end{eqnarray*}\]
#run individual-level opinion model
ml.mod <- glmer(formula = yes.of.all ~ (1|race.female)
+ (1|age.edu.cat)
+ (1|state) + (1|region)
+ p.relig + kerry.04,
data=marriage.data, family=binomial(link="logit"))
## boundary (singular) fit: see ?isSingular
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: yes.of.all ~ (1 | race.female) + (1 | age.edu.cat) + (1 | state) +
## (1 | region) + p.relig + kerry.04
## Data: marriage.data
##
## AIC BIC logLik deviance df.resid
## 7504.8 7552.1 -3745.4 7490.8 6334
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8404 -0.7100 -0.4845 0.9989 3.8023
##
## Random effects:
## Groups Name Variance Std.Dev.
## state (Intercept) 0.00000 0.0000
## age.edu.cat (Intercept) 0.39449 0.6281
## race.female (Intercept) 0.04959 0.2227
## region (Intercept) 0.03519 0.1876
## Number of obs: 6341, groups:
## state, 49; age.edu.cat, 16; race.female, 6; region, 5
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.497277 0.437339 -3.424 0.000618 ***
## p.relig -0.014779 0.004889 -3.023 0.002503 **
## kerry.04 0.019112 0.006755 2.829 0.004664 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) p.relg
## p.relig -0.660
## kerry.04 -0.868 0.661
## convergence code: 0
## boundary (singular) fit: see ?isSingular
Note we have no responses from AK or HI.
# note nobody from AK or HI in survey
marriage.data %>%
filter(state %in% c("AK", "HI"))
## [1] state p.evang p.mormon
## [4] kerry.04 poll poll.firm
## [7] poll.year id statenum
## [10] statename region.cat female
## [13] race.wbh edu.cat age.cat
## [16] age.cat6 age.edu.cat6 educ
## [19] age democrat republican
## [22] black hispanic weight
## [25] yes.of.opinion.holders yes.of.all state.initnum
## [28] region no.of.all no.of.opinion.holders
## [31] race.female age.edu.cat p.relig
## <0 rows> (or 0-length row.names)
Using the predict() function, we make predictions for each state, broken out by the demographic groups of interest, which will allow us to poststratify down the road. For now we calculate the predictions; we’ll examine them closely later.
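A minimal sketch of that step, assuming lme4’s predict() with allow.new.levels = TRUE (so states absent from the survey, like AK and HI, get predictions from the fixed effects and region terms); the name ps.ml.mod matches the comparison code below:

preds <- predict(ml.mod, newdata = Census,
                 allow.new.levels = TRUE, type = "response")
# weight each cell's predicted probability by its share of the state
ps.ml.mod <- Census %>%
  mutate(support = preds * cpercent.state) %>%
  group_by(state) %>%
  summarise(support = sum(support))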
Now we fit a fully Bayesian model with the same likelihood as the ML model, but with weakly informative priors on the SDs of the varying intercepts that help with borrowing of information and convergence.
bayes.mod <- brm(yes.of.all ~ (1|race.female)
+ (1|age.edu.cat) + (1|state) + (1|region)
+ p.relig + kerry.04,
data=marriage.data, family=bernoulli(),
prior=c(set_prior("normal(0,0.2)", class='b'),
set_prior("normal(0,0.2)", class='sd', group="race.female"),
set_prior("normal(0,0.2)", class='sd', group="age.edu.cat"),
set_prior("normal(0,0.2)", class='sd', group="state"),
set_prior("normal(0,0.2)", class='sd', group="region")))
##
## SAMPLING FOR MODEL '7bee431173565af885bd8e0929266588' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.002087 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 20.87 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 146.9 seconds (Warm-up)
## Chain 1: 49.4163 seconds (Sampling)
## Chain 1: 196.317 seconds (Total)
## Chain 1:
## Family: bernoulli
## Links: mu = logit
## Formula: yes.of.all ~ (1 | race.female) + (1 | age.edu.cat) + (1 | state) + (1 | region) + p.relig + kerry.04
## Data: marriage.data (Number of observations: 6341)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Group-Level Effects:
## ~age.edu.cat (Number of levels: 16)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.53 0.08 0.40 0.70 1.01 1094 1752
##
## ~race.female (Number of levels: 6)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.23 0.07 0.12 0.40 1.00 1631 2405
##
## ~region (Number of levels: 5)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.21 0.08 0.10 0.41 1.00 2058 2386
##
## ~state (Number of levels: 49)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.06 0.04 0.00 0.15 1.01 1074 1221
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept -1.50 0.46 -2.43 -0.62 1.00 1916 2687
## p.relig -0.01 0.01 -0.02 -0.00 1.00 3352 2825
## kerry.04 0.02 0.01 0.01 0.03 1.00 3607 2495
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The most obvious benefit of the Bayesian approach is its full accounting of uncertainty, which we can see by comparing the estimated SDs of the group-level intercepts from the frequentist model to the corresponding posterior distributions from the Bayesian model.
Recall the SD estimates from the frequentist model were around 0 for state, 0.19 for region, 0.22 for race.female, and 0.63 for age.edu.cat.
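One way to visualize the contrast (a sketch using tidybayes; brms names these parameters sd_<group>__Intercept, and older tidybayes versions use geom_halfeyeh() in place of stat_halfeye()):

# plot the posterior of each group-level SD against the ML point estimates
bayes.mod %>%
  gather_draws(`sd_.*__Intercept`, regex = TRUE) %>%
  ggplot(aes(x = .value, y = .variable)) +
  stat_halfeye() +
  labs(x = "posterior SD of varying intercepts", y = "")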
Wow, this is cool! The Bayesian model gives you an idea of the full posterior distribution of values, from which we can sample, as opposed to a single point estimate.
#next let's get the point estimate and poststratify from the Bayesian model
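# (note: newer tidybayes renames add_predicted_samples() to
#  add_predicted_draws(), with the column .prediction instead of pred)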
ps.bayes.mod <- bayes.mod %>%
add_predicted_samples(newdata=Census, allow_new_levels=TRUE) %>%
rename(support = pred) %>%
mean_qi() %>%
mutate(support = support * cpercent.state) %>%
group_by(state) %>%
summarize(support = sum(support))
## `summarise()` ungrouping output (override with `.groups` argument)
Now we consider comparisons across the 3 approaches.
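# AK and HI have no survey respondents; fill in the overall mean support
# so the disaggregated model has a row for every state before joining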
mod.disag[nrow(mod.disag) + 1,] = list("AK", mean(mod.disag$support), "no_ps")
mod.disag[nrow(mod.disag) + 1,] = list("HI", mean(mod.disag$support), "no_ps")
test=full_join(mod.disag,ps.ml.mod,by=c("state"),suffix=c("_disag","_ml"))
test2=full_join(test,ps.bayes.mod,by=c("state"))
test2$support_bayes=test2$support
ggplot(data=test2, aes(x = support_bayes, y=support_ml)) +
geom_text(aes(label=state), hjust=0.5, vjust=0.25) +
geom_abline(slope = 1, intercept = 0) +
xlim(c(0,0.7)) + ylim(c(0,0.7)) +
xlab("Bayes poststrat support") + ylab("ML poststrat support") +
coord_fixed()
ggplot(data=test2, aes(x = support_bayes, y=support_disag)) +
geom_text(aes(label=state), hjust=0.5, vjust=0.25) +
geom_abline(slope = 1, intercept = 0) +
xlim(c(0,0.7)) + ylim(c(0,0.7)) +
xlab("Bayes poststrat support") + ylab("Unstratified support") +
coord_fixed()
Note that the predicted probabilities from the ML and Bayesian approaches are similar, despite somewhat different parameter estimates, and both disagree with the disaggregated model, which does not borrow information across states.
Now we can evaluate predictions, taking advantage of the uncertainty quantification advantages of the Bayesian approach. We will sample from the posterior to get predicted probabilities for each group of interest based on proportions obtained from the Census data.
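For reference, the dimensions and first rows of the poststratification frame (the output below is consistent with this inspection):

dim(Census)
head(Census)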
## [1] 4896 15
## state p.evang p.mormon kerry.04 crace.WBH age.cat edu.cat cfemale .freq
## 1 AK 12.44 3.003126 35.5 1 1 1 0 467
## 2 AK 12.44 3.003126 35.5 1 2 1 0 377
## 3 AK 12.44 3.003126 35.5 1 3 1 0 419
## 4 AK 12.44 3.003126 35.5 1 4 1 0 343
## 5 AK 12.44 3.003126 35.5 1 1 2 0 958
## 6 AK 12.44 3.003126 35.5 1 2 2 0 1359
## cfreq.state cpercent.state region race.female age.edu.cat p.relig
## 1 21222 0.02200547 west WhMale 18-29,<HS 15.44313
## 2 21222 0.01776458 west WhMale 30-44,<HS 15.44313
## 3 21222 0.01974366 west WhMale 45-64,<HS 15.44313
## 4 21222 0.01616247 west WhMale 65+,<HS 15.44313
## 5 21222 0.04514183 west WhMale 18-29,HS 15.44313
## 6 21222 0.06403732 west WhMale 30-44,HS 15.44313
We can generate individual samples from our Bayesian model’s predicted probabilities and use them to estimate public opinion under a variety of assumptions (e.g., estimating opinion of all residents, or just of likely voters).
options(width = 10000)
str(predict(bayes.mod, newdata=Census, allow_new_levels=TRUE,
nsamples=500, summary=FALSE))
## int [1:500, 1:4896] 0 1 1 0 0 1 0 0 0 1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : NULL
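As one example of such a calculation, here is a sketch that converts this draws-by-cells matrix into posterior draws of poststratified state-level support (it assumes cpercent.state gives each Census cell’s share of its state, so each column of the weight matrix sums to 1):

pred.mat <- predict(bayes.mod, newdata = Census, allow_new_levels = TRUE,
                    nsamples = 500, summary = FALSE)
# cells-by-states indicator matrix, with each cell weighted by its share
# of its state's population
W <- Census$cpercent.state * model.matrix(~ state - 1, data = Census)
state.draws <- pred.mat %*% W  # one row per posterior draw, one column per state
dim(state.draws)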