Introduction to Multilevel Models

Multilevel models

Multilevel models are a natural way to handle correlated or clustered data. Our goals for studying multilevel modeling include the following.

Understanding consequences of ignoring correlation in data
Learning how to construct and apply multilevel models for correlated data
Using multilevel models to borrow information to inform predictions for continuous and categorical data.

Clustered data

Clustering of observations generally leads to positive correlation among members of the same unit due to shared characteristics.

The TAAG Study (Trial of Activity for Adolescent Girls) is a study of school-based interventions to promote physical activity. The schools themselves are randomized to the interventions, and the girls are clustered within schools (also within grades in schools).
In studies of social determinants of health, we may study families clustered by neighborhood within a community.

Clustered data

In developmental toxicity studies, pregnant dams are dosed with a study chemical, and outcomes are measured in the offspring.
In a longitudinal study, repeated measurement occasions are nested within subjects (e.g., measure height of kids annually)

Hierarchical linear model

The hierarchical model is the dominant approach to analysis of multilevel data. The correlation induced by clustering is described by random effects at each “level.” The higher levels in the hierarchy involve clusters of members at the lower levels.

Pooling motivation

Multilevel regression is a method for compromising between two extremes:

Excluding a categorical predictor from a model (complete pooling)
Estimating separate models within each level of the categorical predictor (no pooling)

Sometimes, the no pooling approach is not a good option due to limited sample sizes within groups. With a multilevel model, we can get decent estimates for small groups by borrowing information.

Radon study

Radon, a naturally-occurring radioactive gas, is a carcinogen known to cause lung cancer in high concentrations. Several thousand lung cancer deaths in the US per year are attributable to radon exposure.

Radon levels in US homes varies greatly, and some homes have dangerously high radon levels. In order to identify areas of the US with high radon exposures, the US EPA conducted a study of radon levels in a random sample of more than 80,000 homes.

Click here to check highest recorded radon levels in your area. Note that these levels may come from short-term home test kits, which vary widely in accuracy.

Radon study

We wish to estimate the distribution of radon levels in houses \(i\) within 85 counties \(j\) in Minnesota. The outcome \(y_i\) is the natural log of measured radon levels.

One estimate would be the average of all radon levels in Minnesota (same estimate for all counties), \(\overline{y}_{\cdot \cdot}\), but this ignores variation across counties, and some counties may have higher radon levels naturally than others (radon is more commonly found in soils with granite rock, as opposed to some other soil types).
Another estimate would be just to average the radon level in each county, \(\overline{y}_j\), which can over-fit the data within county (for example, Lac Qui Parle County, which has the highest radon level of all the 85 MN counties, has radon measures from only 2 homes).

Data exploration

Estimates and SE’s for log radon levels in MN versus the (jittered) sample size. The horizontal line indicates the complete pooling estimate (Gelman and Hill).

Radon study

The multilevel estimates in the previous slide represent a compromise between the two extremes. In this simple setting (with no predictors), the multilevel estimate for county \(j\) can be approximated as a weighted average of the mean of all observations in the county, weighting both the unpooled estimate \(\overline{y}_j\) and the mean over all counties \(\overline{y}_{\cdot \cdot}\).

Radon study

\[\widehat{\alpha}_j \approx \frac{\frac{n_j}{\sigma^2_y}\overline{y}_j+\frac{1}{\sigma^2_{\alpha}}\overline{y}_{\cdot\cdot}}{\frac{n_j}{\sigma^2_y}+\frac{1}{\sigma^2_{\alpha}}},\] where

\(n_j\) is the number of homes measured in county \(j\)
\(\sigma^2_y\) is the within-county variance in the log radon measurements
\(\sigma^2_{\alpha}\) is the variance across the average log radon levels of different counties

Radon study

The weighted average reflects the relative amount of information available on each individual county, compared to that available across all counties.

Averages from counties with smaller sample sizes are less precise, so the weighting shrinks the multilevel estimates closer to the overall state average. For example, if \(n_j=0,\) the multilevel estimate is just \(\overline{y}_{\cdot \cdot}\).
Averages from counties with larger sample sizes are more precise, and the multilevel estimates are closer to the county averages. As \(n_j \rightarrow \infty\), the multilevel estimate is just the county average \(\overline{y}_j\).
In intermediate cases, the multilevel estimate is in between the extremes.

Radon study

In practice, we carry out all estimation together (estimate variances along with the \(\alpha\) parameters), but we won’t worry about this yet.

Code for the radon analysis as presented in Gelman and Hill is available at this link .

Location of measurement in radon study

One important predictor of radon levels is the floor on which the measurement is taken: basement (\(x_i=0\)) or first floor (\(x_i=1\)); radon comes from underground and can enter more easily when the house is built into the ground.

First, we examine the complete-pooling regression, \(y_i=\alpha+\beta x_i + \varepsilon_i\), and the no-pooling regression \(y_i=\alpha_{j[i]}+\beta x_i + \varepsilon_i\), where \(\alpha_{j[i]}\) is the mean log radon level from basement measures of homes (indexed by i) in county \(j\).

The following plot shows the dashed lines \(\widehat{y}=\widehat{\alpha}+\widehat{\beta} x\) for eight selected counties from the complete pooling model, and the solid lines \(\widehat{y}=\widehat{\alpha}_j+\widehat{\beta}x\) from no pooling model.

No pooling and pooling

Complete-pooling (dashed) and no-pooling (solid) regression fits to radon data (Gelman and Hill)

Interpretation

The estimates of \(\beta\) (the association between floor of home and radon level) differ slightly for the two regressions, with \(\widehat{\beta}=-0.61\) for the pooling model, and \(\widehat{\beta}=-0.72\) for the no-pooling model. (Click here for code and output) As we might expect, we tend to have higher radon levels in the basement (p<0.0001). (Note: the models differ a LOT in adjusted \(R^2\): 0.07 in the complete pooling model, and 0.74 in the no pooling model.)

Neither analysis is perfect. The complete-pooling analysis ignores variation in radon levels between counties, which is undesirable because our goal is to identify counties with high-radon homes – we can’t pool away the main research question! The no-pooling analysis is also problematic – for example the Lac Qui Parle County line is estimated based on just two data points.

Multilevel model

We will start with a simple multilevel model, \(y_i=\gamma_0 + \alpha_{j[i]}+\beta x_i + \varepsilon_i\), where now \(\alpha_{j} \sim N(0,\sigma^2_\alpha)\) and \(\varepsilon_i \sim N(0, \sigma^2_y)\). We fit this model using the lmer() function in the lme4 package.

This model can also be written \(y_i \sim N(\alpha_{j[i]}+\beta x_i, \sigma_y^2),\)

\(\alpha_j \sim N\left(\gamma_0,\sigma_\alpha^2 \right)\)

Code to fit multilevel model

library(lme4)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following object is masked from 'package:tidyr':
## 
##     expand

#basic MLM with just random intercept for county
M1<-lmer(y~x+(1|county))

Output from MLM

summary(M1)

## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ x + (1 | county)
## 
## REML criterion at convergence: 2171.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.3989 -0.6155  0.0029  0.6405  3.4281 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  county   (Intercept) 0.1077   0.3282  
##  Residual             0.5709   0.7556  
## Number of obs: 919, groups:  county, 85
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1.46160    0.05158  28.339
## x           -0.69299    0.07043  -9.839
## 
## Correlation of Fixed Effects:
##   (Intr)
## x -0.288

The ratio of county-level variance to total variance is an estimate of the correlation of within-county measures. Here \(\hat{\rho}=\frac{.1077}{.1077+.5709}=0.16\).

County-level estimates

coef(M1)

## $county
##    (Intercept)          x
## 1    1.1915004 -0.6929937
## 2    0.9276468 -0.6929937
## 3    1.4792143 -0.6929937
## 4    1.5045012 -0.6929937
## 5    1.4461503 -0.6929937
## 6    1.4801817 -0.6929937
## 7    1.8581255 -0.6929937
## 8    1.6827736 -0.6929937
## 9    1.1600747 -0.6929937
## 10   1.5086099 -0.6929937
## 11   1.4322449 -0.6929937
## 12   1.5771520 -0.6929937
## 13   1.2370518 -0.6929937
## 14   1.8380232 -0.6929937
## 15   1.4024982 -0.6929937
## 16   1.2432992 -0.6929937
## 17   1.3723633 -0.6929937
## 18   1.2209415 -0.6929937
## 19   1.3462611 -0.6929937
## 20   1.5840332 -0.6929937
## 21   1.6311136 -0.6929937
## 22   1.0211903 -0.6929937
## 23   1.4409443 -0.6929937
## 24   1.8605721 -0.6929937
## 25   1.8135585 -0.6929937
## 26   1.3626875 -0.6929937
## 27   1.6222663 -0.6929937
## 28   1.3467692 -0.6929937
## 29   1.3149879 -0.6929937
## 30   1.0999775 -0.6929937
## 31   1.7329562 -0.6929937
## 32   1.3646863 -0.6929937
## 33   1.7197950 -0.6929937
## 34   1.5015319 -0.6929937
## 35   1.0870316 -0.6929937
## 36   1.8680899 -0.6929937
## 37   0.7928242 -0.6929937
## 38   1.6303574 -0.6929937
## 39   1.5979922 -0.6929937
## 40   1.8260564 -0.6929937
## 41   1.7636308 -0.6929937
## 42   1.4456250 -0.6929937
## 43   1.5404841 -0.6929937
## 44   1.2199767 -0.6929937
## 45   1.3375197 -0.6929937
## 46   1.3416955 -0.6929937
## 47   1.2995481 -0.6929937
## 48   1.2623708 -0.6929937
## 49   1.6294468 -0.6929937
## 50   1.6253580 -0.6929937
## 51   1.7641693 -0.6929937
## 52   1.6300755 -0.6929937
## 53   1.3820836 -0.6929937
## 54   1.3328317 -0.6929937
## 55   1.5494611 -0.6929937
## 56   1.3246513 -0.6929937
## 57   1.0877739 -0.6929937
## 58   1.6303984 -0.6929937
## 59   1.5675867 -0.6929937
## 60   1.4116740 -0.6929937
## 61   1.1995431 -0.6929937
## 62   1.7120492 -0.6929937
## 63   1.5338618 -0.6929937
## 64   1.7205672 -0.6929937
## 65   1.4170797 -0.6929937
## 66   1.5982671 -0.6929937
## 67   1.6981946 -0.6929937
## 68   1.2380854 -0.6929937
## 69   1.3673371 -0.6929937
## 70   0.8899487 -0.6929937
## 71   1.4829168 -0.6929937
## 72   1.5389227 -0.6929937
## 73   1.5520593 -0.6929937
## 74   1.2574763 -0.6929937
## 75   1.5530274 -0.6929937
## 76   1.6938490 -0.6929937
## 77   1.6642922 -0.6929937
## 78   1.3708520 -0.6929937
## 79   1.0944378 -0.6929937
## 80   1.3404792 -0.6929937
## 81   1.9060482 -0.6929937
## 82   1.5835784 -0.6929937
## 83   1.5716875 -0.6929937
## 84   1.5906331 -0.6929937
## 85   1.3862294 -0.6929937
## 
## attr(,"class")
## [1] "coef.mer"

The first column gives estimates of \(\alpha_j\), and the second column gives the estimate of \(\beta\) (does not vary over county \(j\) according to the model we specified), for the first 19 of 85 counties.

Multilevel model fit

No, complete, and partial pooling models

Note that there is strong pooling in counties with few data points (multilevel line is closer to the dashed line, as in Lac Qui Parle county), and weak pooling in counties with more data points (multilevel line is on top of no pooling line in St. Louis county).

Summary

Multilevel models can provide a good fit to the data that is a compromise between complete and no pooling across groups. We will delve more deeply into the topic with a focus on multilevel models for binary data.