Multilevel Models for Binary Data

Example: Lung cancer

A large HMO wants to know what patient and physician factors are most related to whether a patient’s lung cancer goes into remission after treatment as part of a larger study of treatment outcomes and quality of life in patients with lunger cancer.

We consider simulated data that mimic real data relevant to this question. This simulation was created by the UCLA Institute for Digital Research and Education (IDRE), which has great examples online of many different analyses. Check it out!

Lung cancer data

In this hypothetical study, a variety of information was collected from patients, including the names of their doctors and hospitals. One can think of a nested structure in which a given hospital has many doctors, and a given doctor in the hospital has many patients in that hospital.

Let’s read in the data and do some basic processing.

lung=read.csv("https://stats.idre.ucla.edu/stat/data/hdp.csv")
lung=within(lung, {
  Married=factor(Married,levels=0:1,labels=c("no","yes"))
  remission=factor(remission,levels=0:1,labels=c("no","yes"))
  DID=factor(DID) # doctor ID
  HID=factor(HID) # hospital ID
})

Data cleaning

Let’s look at basic descriptive statistics using the R function summary. Because we have a fair number of variables in these data, instead of typing summary(varname) for every single variable, it’s easier to use the handy sapply function. The sapply function can be used to apply another function (like summary) to every element of a data frame as follows.

sapply(lung,summary,na.rm=TRUE)

## $tumorsize
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.97   62.49   70.07   70.88   79.02  116.46 
## 
## $co2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.222   1.519   1.601   1.605   1.687   2.128 
## 
## $pain
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   5.473   6.000   9.000 
## 
## $wound
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   5.732   7.000   9.000 
## 
## $mobility
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00    6.00    6.08    7.00    9.00 
## 
## $ntumors
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.066   5.000   9.000 
## 
## $nmorphine
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.624   5.000  18.000 
## 
## $remission
##   no  yes 
## 6004 2521 
## 
## $lungcapacity
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01612 0.67647 0.81560 0.77409 0.91150 0.99980 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.32   46.69   50.93   50.97   55.27   74.48 
## 
## $Married
##   no  yes 
## 3410 5115 
## 
## $FamilyHx
##   no  yes 
## 6820 1705 
## 
## $SmokingHx
## current  former   never 
##    1705    1705    5115 
## 
## $Sex
## female   male 
##   5115   3410 
## 
## $CancerStage
##    I   II  III   IV 
## 2558 3409 1705  853 
## 
## $LengthofStay
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   5.000   5.492   6.000  10.000 
## 
## $WBC
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2131    5323    6007    5998    6663    9776 
## 
## $RBC
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.919   4.802   4.994   4.995   5.190   6.065 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.38   24.20   27.73   29.07   32.54   58.00 
## 
## $IL6
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.03521  1.93039  3.34400  4.01698  5.40551 23.72777 
## 
## $CRP
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0451  2.6968  4.3330  4.9730  6.5952 28.7421 
## 
## $DID
##      69      76      86     104     258     271     289     296     306 
##      40      40      40      40      40      40      40      40      40 
##     320     349     376      45     176     183     184     284     291 
##      40      40      40      39      39      39      39      39      39 
##     305     355     368      74     100     154     159     179     190 
##      39      39      39      38      38      38      38      38      38 
##     215     217     225     265     338     358     370     375     393 
##      38      38      38      38      38      38      38      38      38 
##     396      26     145     208     369     405      35      41      59 
##      38      37      37      37      37      37      36      36      36 
##     127     130     141     148     150     248     324     395      17 
##      36      36      36      36      36      36      36      36      35 
##      42     115     138     195     290     322     343       6      37 
##      35      35      35      35      35      35      35      34      34 
##     137     163     278     328     331     353     385     394      61 
##      34      34      34      34      34      34      34      34      33 
##      64     180     273     292     299     352     401       2       8 
##      33      33      33      33      33      33      33      32      32 
##      11      44      60     129     165     229     245     293     318 
##      32      32      32      32      32      32      32      32      32 
##     373     379      33      50      90     113     152     178     216 
##      32      32      31      31      31      31      31      31      31 
## (Other) 
##    4988 
## 
## $Experience
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   15.00   18.00   17.64   21.00   29.00 
## 
## $School
## average     top 
##    6405    2120 
## 
## $Lawsuits
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.866   3.000   9.000 
## 
## $HID
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
## 252 202 227 377 135 356 322 224 302 174 157 213 312 255 276 261 243 307 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35 
## 248 211 208 134 246 209 293 222 153 229 300 217 219 291 271 183 296 
## 
## $Medicaid
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1416  0.3369  0.5215  0.5125  0.7083  0.8187

Looking deeper

There are so many doctor ID’s that R stopped printing. Let’s count them just so that we know. At the same time we will see how many observations are in our data.

length(lung$DID) #how many rows in data (i.e., people)

## [1] 8525

length(unique(lung$DID)) # how many doctors in data

## [1] 407

length(unique(lung$HID)) # how many hospitals in data

## [1] 35

Variables

The lung cancer dataset contains variables at the patient, doctor, and hospital level.

At the hospital level, we have a hospital identifier and the proportion of patients in the hospital who are insured by Medicaid.

At the physician level, we have a doctor ID, the number of years of experience held by the doctor, an indicator of medical school quality (average or high), and the number of lawsuits against the doctor.

Other variables are at the patient level and include the outcome of remission (1=yes, 0=no), tumor and other health characteristics, demographic and medical history variables, and the length of stay in the current hospital visit.

Data exploration

We will look at the distribution of covariates for those participants who achieved remission and those who did not.

library(reshape2)
library(ggplot2)
tmp <- melt(lung[, c("remission", "IL6", "CRP", "LengthofStay", "co2")],
  id.vars="remission")
ggplot(tmp, aes(factor(remission), y = value, fill=factor(remission))) +
  geom_boxplot() +
  facet_wrap(~variable, scales="free_y")

Boxplots

tmp <- melt(lung[, c("remission", "tumorsize", "ntumors", "lungcapacity", "pain")],
  id.vars="remission")
ggplot(tmp, aes(factor(remission), y = value, fill=factor(remission))) +
  geom_boxplot() +
  facet_wrap(~variable, scales="free_y")

Boxplots

tmp <- melt(lung[, c("remission", "Age","Experience","Lawsuits","Medicaid")],
  id.vars="remission")
ggplot(tmp, aes(factor(remission), y = value, fill=factor(remission))) +
  geom_boxplot() +
  facet_wrap(~variable, scales="free_y")

Boxplots

Categorical predictors

We can also examine how categorical predictors are associated with remission.

prop.table(table(lung$ntumors,lung$remission),1)

##    
##            no       yes
##   0 0.7482066 0.2517934
##   1 0.6840640 0.3159360
##   2 0.6666667 0.3333333
##   3 0.6621392 0.3378608
##   4 0.6847826 0.3152174
##   5 0.6978193 0.3021807
##   6 0.7072165 0.2927835
##   7 0.7412399 0.2587601
##   8 0.7854406 0.2145594
##   9 0.8348837 0.1651163

prop.table(table(lung$Married,lung$remission),1)

##      
##              no       yes
##   no  0.7149560 0.2850440
##   yes 0.6971652 0.3028348

Categorical predictors

We can also examine how categorical predictors are associated with remission.

prop.table(table(lung$FamilyHx,lung$remission),1)

##      
##              no       yes
##   no  0.6693548 0.3306452
##   yes 0.8439883 0.1560117

prop.table(table(lung$Sex,lung$remission),1)

##         
##                 no       yes
##   female 0.7088954 0.2911046
##   male   0.6973607 0.3026393

Categorical predictors

We can also examine how categorical predictors are associated with remission.

prop.table(table(lung$CancerStage,lung$remission),1)

##      
##              no       yes
##   I   0.6176701 0.3823299
##   II  0.6914051 0.3085949
##   III 0.7630499 0.2369501
##   IV  0.8980070 0.1019930

prop.table(table(lung$School,lung$remission),1)

##          
##                  no       yes
##   average 0.6927400 0.3072600
##   top     0.7391509 0.2608491

Multilevel logistic regression

Multilevel models can be used for binary outcomes (and those on other scales) using a similar approach to that used for normal data: we group coefficients into batches, and a probability distribution is assigned to each batch. Recall our normal data model: \[y_i \sim N(\alpha_{j[i]}+\beta x_i, \sigma_y^2), ~~~ \alpha_j \sim N\left(\gamma_0,\sigma_\alpha^2 \right).\] In this case we grouped the county-level intercepts and assigned them a normal distribution. This method is a compromise between complete pooling across counties (same intercept for each county) and no pooling (estimating a separate intercept for each county without borrowing information), where the degree of pooling is determined by the amount of information within and between counties.

Multilevel linear regression

Another way to think of this model is as a way of handling correlated responses – due to geography (in the radon case), we expect homes closer together to have more similar radon levels than homes further apart. One rough way to account for this spatial correlation is by including a random effect, \(\alpha_j\), for county. The model under each motivation (adjusting for correlation within counties, or borrowing information across counties) is the same.

Fitting the multilevel logistic model

Let’s examine the probability of remission as a function of IL6, CRP, cancer stage, length of stay, family history, and physician experience. We’ll have a random effect for doctor to control for possible correlation within physician.

library(lme4)
m1=glmer(remission ~ IL6 + CRP + CancerStage + LengthofStay + Experience + FamilyHx
         +  (1 | DID), family = binomial, data=lung,
         control=glmerControl(optimizer="bobyqa"),
         nAGQ=10,optCtrl=list(maxfun=2e5)) 
summary(m1)
#these models can be flaky. Specifying a different optimizer to avoid 
# convergence problems with default

Multilevel logistic model

## Warning: extra argument(s) 'optCtrl' disregarded

## Generalized linear mixed model fit by maximum likelihood (Adaptive
##   Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
##  Family: binomial  ( logit )
## Formula: 
## remission ~ IL6 + CRP + CancerStage + LengthofStay + Experience +  
##     FamilyHx + Age + (1 | DID)
##    Data: lung
## Control: glmerControl(optimizer = "bobyqa")
## 
##      AIC      BIC   logLik deviance df.resid 
##   7188.4   7266.0  -3583.2   7166.4     8514 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.1656 -0.4218 -0.1859  0.3562  7.9281 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  DID    (Intercept) 4.275    2.068   
## Number of obs: 8525, groups:  DID, 407
## 
## Fixed effects:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.633402   0.590364  -2.767  0.00566 ** 
## IL6            -0.058305   0.011740  -4.967 6.82e-07 ***
## CRP            -0.023239   0.010382  -2.238  0.02520 *  
## CancerStageII  -0.322440   0.078566  -4.104 4.06e-05 ***
## CancerStageIII -0.864059   0.102706  -8.413  < 2e-16 ***
## CancerStageIV  -2.166176   0.165898 -13.057  < 2e-16 ***
## LengthofStay   -0.041553   0.036459  -1.140  0.25440    
## Experience      0.125876   0.028147   4.472 7.75e-06 ***
## FamilyHxyes    -1.308942   0.095548 -13.699  < 2e-16 ***
## Age            -0.016080   0.006067  -2.651  0.00803 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) IL6    CRP    CncSII CnSIII CncSIV LngthS Exprnc FmlyHx
## IL6         -0.070                                                        
## CRP         -0.084 -0.001                                                 
## CancerStgII  0.083  0.005 -0.002                                          
## CancrStgIII  0.138  0.010  0.012  0.514                                   
## CancerStgIV  0.145  0.036  0.010  0.359  0.351                            
## LengthofSty -0.137  0.012 -0.026 -0.186 -0.240 -0.194                     
## Experience  -0.842 -0.007 -0.003 -0.002 -0.006 -0.013 -0.004              
## FamilyHxyes -0.015  0.022  0.016 -0.053 -0.051 -0.009 -0.112 -0.023       
## Age         -0.385 -0.011  0.015 -0.181 -0.222 -0.221 -0.321 -0.009  0.105

Picking off OR’s and CI’s

library(lme4)

## Loading required package: Matrix

se <- sqrt(diag(as.matrix(vcov(m1))))
# table of estimates with 95% CI
tab <- cbind(Est = fixef(m1), LL = fixef(m1) - 1.96 * se, 
             UL = fixef(m1) + 1.96 *se)
exp(tab)

##                      Est         LL        UL
## (Intercept)    0.1952642 0.06138961 0.6210840
## IL6            0.9433620 0.92190339 0.9653201
## CRP            0.9770285 0.95734781 0.9971139
## CancerStageII  0.7243793 0.62099623 0.8449735
## CancerStageIII 0.4214479 0.34460288 0.5154289
## CancerStageIV  0.1146150 0.08279918 0.1586563
## LengthofStay   0.9592985 0.89314018 1.0303574
## Experience     1.1341411 1.07326665 1.1984683
## FamilyHxyes    0.2701058 0.22397609 0.3257363
## Age            0.9840483 0.97241667 0.9958191

Getting predicted probabilities of remission

Now suppose we want to compare doctors. One way to do this is to look at predicted probabilities of remission for the different doctors, conditional on a set of prespecified patient characteristics (we do this so that a doctor is not penalized for attracting patients with more serious disease). In this case, we may look at median values of patient characteristics in our overall study and then predict probabilities of remission for this type of patient for each doctor.

tmpmat=lung[,c("Experience","DID")]
docmat=unique(tmpmat)
newn=length(unique(lung$DID))
new=data.frame(IL6=rep(3.344,newn),CRP=rep(4.333,newn),CancerStage=rep("II",newn),
               LengthofStay=rep(5,newn),FamilyHx=rep("no",newn),Age=rep(50.97,newn),
               Experience=docmat$Experience,DID=docmat$DID)
probremiss=predict(m1,newdata=new,type="response")
summary(probremiss)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01509 0.08705 0.25879 0.35103 0.58598 0.98707

Plots

We can also plot the doctor-specific effects; note that these are centered around zero, with positive values indicating doctors whose patients are more likely to experience remission, and negative values those whose patients are less likely to experience remission.

#turn off doctor ID's due to inability to read all
lattice::dotplot(ranef(m1,which="DID",condVar=TRUE), 
                 scales = list(y = list(alternating = 0)))

Plots

## $DID

3-level model

The model we fit to the data is commonly called a two-level model, with levels for patient and doctor. We can also fit a three-level model, adding the hospital to the mix.

m2=glmer(remission ~ IL6 + CRP + CancerStage + LengthofStay + Experience + FamilyHx
         + Age+ (1 | DID) + (1|HID), family = binomial, data=lung, 
         control=glmerControl(optimizer="bobyqa"),optCtrl=list(maxfun=2e5)) 
summary(m2)

3-level model

## Warning: extra argument(s) 'optCtrl' disregarded

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: 
## remission ~ IL6 + CRP + CancerStage + LengthofStay + Experience +  
##     FamilyHx + Age + (1 | DID) + (1 | HID)
##    Data: lung
## Control: glmerControl(optimizer = "bobyqa")
## 
##      AIC      BIC   logLik deviance df.resid 
##   7198.9   7283.5  -3587.5   7174.9     8513 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.1298 -0.4243 -0.1886  0.3576  7.6201 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  DID    (Intercept) 3.843    1.960   
##  HID    (Intercept) 0.253    0.503   
## Number of obs: 8525, groups:  DID, 407; HID, 35
## 
## Fixed effects:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.624335   0.572807  -2.836  0.00457 ** 
## IL6            -0.058407   0.011459  -5.097 3.45e-07 ***
## CRP            -0.023181   0.010142  -2.286  0.02228 *  
## CancerStageII  -0.321448   0.076596  -4.197 2.71e-05 ***
## CancerStageIII -0.861103   0.100320  -8.584  < 2e-16 ***
## CancerStageIV  -2.156594   0.162995 -13.231  < 2e-16 ***
## LengthofStay   -0.041712   0.035596  -1.172  0.24127    
## Experience      0.126236   0.026895   4.694 2.68e-06 ***
## FamilyHxyes    -1.305410   0.093629 -13.942  < 2e-16 ***
## Age            -0.016023   0.005925  -2.704  0.00684 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) IL6    CRP    CncSII CnSIII CncSIV LngthS Exprnc FmlyHx
## IL6         -0.071                                                        
## CRP         -0.084  0.000                                                 
## CancerStgII  0.083  0.006 -0.002                                          
## CancrStgIII  0.138  0.011  0.012  0.513                                   
## CancerStgIV  0.144  0.036  0.009  0.356  0.347                            
## LengthofSty -0.139  0.012 -0.025 -0.185 -0.238 -0.192                     
## Experience  -0.828 -0.007 -0.003 -0.002 -0.006 -0.012 -0.004              
## FamilyHxyes -0.016  0.022  0.015 -0.054 -0.051 -0.010 -0.112 -0.022       
## Age         -0.388 -0.011  0.014 -0.183 -0.223 -0.221 -0.321 -0.009  0.105

Examining hospital effects

lattice::dotplot(ranef(m2,which="HID",condVar=TRUE))

## $HID