A large HMO wants to know what patient and physician factors are most related to whether a patient’s lung cancer goes into remission after treatment as part of a larger study of treatment outcomes and quality of life in patients with lunger cancer.
We consider simulated data that mimic real data relevant to this question. This simulation was created by the UCLA Institute for Digital Research and Education (IDRE), which has great examples online of many different analyses. Check it out!
In this hypothetical study, a variety of information was collected from patients, including the names of their doctors and hospitals. One can think of a nested structure in which a given hospital has many doctors, and a given doctor in the hospital has many patients in that hospital.
Let’s read in the data and do some basic processing.
lung=read.csv("https://stats.idre.ucla.edu/stat/data/hdp.csv")
lung=within(lung, {
Married=factor(Married,levels=0:1,labels=c("no","yes"))
remission=factor(remission,levels=0:1,labels=c("no","yes"))
DID=factor(DID) # doctor ID
HID=factor(HID) # hospital ID
})
Let’s look at basic descriptive statistics using the R function summary
. Because we have a fair number of variables in these data, instead of typing summary(varname)
for every single variable, it’s easier to use the handy sapply
function. The sapply
function can be used to apply another function (like summary) to every element of a data frame as follows.
sapply(lung,summary,na.rm=TRUE)
## $tumorsize
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33.97 62.49 70.07 70.88 79.02 116.46
##
## $co2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.222 1.519 1.601 1.605 1.687 2.128
##
## $pain
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 5.473 6.000 9.000
##
## $wound
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 5.732 7.000 9.000
##
## $mobility
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 6.00 6.08 7.00 9.00
##
## $ntumors
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.066 5.000 9.000
##
## $nmorphine
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 3.624 5.000 18.000
##
## $remission
## no yes
## 6004 2521
##
## $lungcapacity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01612 0.67647 0.81560 0.77409 0.91150 0.99980
##
## $Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.32 46.69 50.93 50.97 55.27 74.48
##
## $Married
## no yes
## 3410 5115
##
## $FamilyHx
## no yes
## 6820 1705
##
## $SmokingHx
## current former never
## 1705 1705 5115
##
## $Sex
## female male
## 5115 3410
##
## $CancerStage
## I II III IV
## 2558 3409 1705 853
##
## $LengthofStay
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 5.000 5.492 6.000 10.000
##
## $WBC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2131 5323 6007 5998 6663 9776
##
## $RBC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.919 4.802 4.994 4.995 5.190 6.065
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.38 24.20 27.73 29.07 32.54 58.00
##
## $IL6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03521 1.93039 3.34400 4.01698 5.40551 23.72777
##
## $CRP
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0451 2.6968 4.3330 4.9730 6.5952 28.7421
##
## $DID
## 69 76 86 104 258 271 289 296 306
## 40 40 40 40 40 40 40 40 40
## 320 349 376 45 176 183 184 284 291
## 40 40 40 39 39 39 39 39 39
## 305 355 368 74 100 154 159 179 190
## 39 39 39 38 38 38 38 38 38
## 215 217 225 265 338 358 370 375 393
## 38 38 38 38 38 38 38 38 38
## 396 26 145 208 369 405 35 41 59
## 38 37 37 37 37 37 36 36 36
## 127 130 141 148 150 248 324 395 17
## 36 36 36 36 36 36 36 36 35
## 42 115 138 195 290 322 343 6 37
## 35 35 35 35 35 35 35 34 34
## 137 163 278 328 331 353 385 394 61
## 34 34 34 34 34 34 34 34 33
## 64 180 273 292 299 352 401 2 8
## 33 33 33 33 33 33 33 32 32
## 11 44 60 129 165 229 245 293 318
## 32 32 32 32 32 32 32 32 32
## 373 379 33 50 90 113 152 178 216
## 32 32 31 31 31 31 31 31 31
## (Other)
## 4988
##
## $Experience
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 15.00 18.00 17.64 21.00 29.00
##
## $School
## average top
## 6405 2120
##
## $Lawsuits
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.866 3.000 9.000
##
## $HID
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 252 202 227 377 135 356 322 224 302 174 157 213 312 255 276 261 243 307
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 248 211 208 134 246 209 293 222 153 229 300 217 219 291 271 183 296
##
## $Medicaid
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1416 0.3369 0.5215 0.5125 0.7083 0.8187
There are so many doctor ID’s that R stopped printing. Let’s count them just so that we know. At the same time we will see how many observations are in our data.
length(lung$DID) #how many rows in data (i.e., people)
## [1] 8525
length(unique(lung$DID)) # how many doctors in data
## [1] 407
length(unique(lung$HID)) # how many hospitals in data
## [1] 35
The lung cancer dataset contains variables at the patient, doctor, and hospital level.
At the hospital level, we have a hospital identifier and the proportion of patients in the hospital who are insured by Medicaid.
At the physician level, we have a doctor ID, the number of years of experience held by the doctor, an indicator of medical school quality (average or high), and the number of lawsuits against the doctor.
Other variables are at the patient level and include the outcome of remission (1=yes, 0=no), tumor and other health characteristics, demographic and medical history variables, and the length of stay in the current hospital visit.
We will look at the distribution of covariates for those participants who achieved remission and those who did not.
library(reshape2)
library(ggplot2)
tmp <- melt(lung[, c("remission", "IL6", "CRP", "LengthofStay", "co2")],
id.vars="remission")
ggplot(tmp, aes(factor(remission), y = value, fill=factor(remission))) +
geom_boxplot() +
facet_wrap(~variable, scales="free_y")
tmp <- melt(lung[, c("remission", "tumorsize", "ntumors", "lungcapacity", "pain")],
id.vars="remission")
ggplot(tmp, aes(factor(remission), y = value, fill=factor(remission))) +
geom_boxplot() +
facet_wrap(~variable, scales="free_y")
tmp <- melt(lung[, c("remission", "Age","Experience","Lawsuits","Medicaid")],
id.vars="remission")
ggplot(tmp, aes(factor(remission), y = value, fill=factor(remission))) +
geom_boxplot() +
facet_wrap(~variable, scales="free_y")
We can also examine how categorical predictors are associated with remission.
prop.table(table(lung$ntumors,lung$remission),1)
##
## no yes
## 0 0.7482066 0.2517934
## 1 0.6840640 0.3159360
## 2 0.6666667 0.3333333
## 3 0.6621392 0.3378608
## 4 0.6847826 0.3152174
## 5 0.6978193 0.3021807
## 6 0.7072165 0.2927835
## 7 0.7412399 0.2587601
## 8 0.7854406 0.2145594
## 9 0.8348837 0.1651163
prop.table(table(lung$Married,lung$remission),1)
##
## no yes
## no 0.7149560 0.2850440
## yes 0.6971652 0.3028348
We can also examine how categorical predictors are associated with remission.
prop.table(table(lung$FamilyHx,lung$remission),1)
##
## no yes
## no 0.6693548 0.3306452
## yes 0.8439883 0.1560117
prop.table(table(lung$Sex,lung$remission),1)
##
## no yes
## female 0.7088954 0.2911046
## male 0.6973607 0.3026393
We can also examine how categorical predictors are associated with remission.
prop.table(table(lung$CancerStage,lung$remission),1)
##
## no yes
## I 0.6176701 0.3823299
## II 0.6914051 0.3085949
## III 0.7630499 0.2369501
## IV 0.8980070 0.1019930
prop.table(table(lung$School,lung$remission),1)
##
## no yes
## average 0.6927400 0.3072600
## top 0.7391509 0.2608491
Multilevel models can be used for binary outcomes (and those on other scales) using a similar approach to that used for normal data: we group coefficients into batches, and a probability distribution is assigned to each batch. Recall our normal data model: \[y_i \sim N(\alpha_{j[i]}+\beta x_i, \sigma_y^2), ~~~ \alpha_j \sim N\left(\gamma_0,\sigma_\alpha^2 \right).\] In this case we grouped the county-level intercepts and assigned them a normal distribution. This method is a compromise between complete pooling across counties (same intercept for each county) and no pooling (estimating a separate intercept for each county without borrowing information), where the degree of pooling is determined by the amount of information within and between counties.
Another way to think of this model is as a way of handling correlated responses – due to geography (in the radon case), we expect homes closer together to have more similar radon levels than homes further apart. One rough way to account for this spatial correlation is by including a random effect, \(\alpha_j\), for county. The model under each motivation (adjusting for correlation within counties, or borrowing information across counties) is the same.
Let’s examine the probability of remission as a function of IL6, CRP, cancer stage, length of stay, family history, and physician experience. We’ll have a random effect for doctor to control for possible correlation within physician.
library(lme4)
m1=glmer(remission ~ IL6 + CRP + CancerStage + LengthofStay + Experience + FamilyHx
+ (1 | DID), family = binomial, data=lung,
control=glmerControl(optimizer="bobyqa"),
nAGQ=10,optCtrl=list(maxfun=2e5))
summary(m1)
#these models can be flaky. Specifying a different optimizer to avoid
# convergence problems with default
Let’s examine the probability of remission as a function of IL6, CRP, cancer stage, length of stay, family history, and physician experience. We’ll have a random effect for doctor to control for possible correlation within physician.
## Warning: extra argument(s) 'optCtrl' disregarded
## Generalized linear mixed model fit by maximum likelihood (Adaptive
## Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
## Family: binomial ( logit )
## Formula:
## remission ~ IL6 + CRP + CancerStage + LengthofStay + Experience +
## FamilyHx + Age + (1 | DID)
## Data: lung
## Control: glmerControl(optimizer = "bobyqa")
##
## AIC BIC logLik deviance df.resid
## 7188.4 7266.0 -3583.2 7166.4 8514
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.1656 -0.4218 -0.1859 0.3562 7.9281
##
## Random effects:
## Groups Name Variance Std.Dev.
## DID (Intercept) 4.275 2.068
## Number of obs: 8525, groups: DID, 407
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.633402 0.590364 -2.767 0.00566 **
## IL6 -0.058305 0.011740 -4.967 6.82e-07 ***
## CRP -0.023239 0.010382 -2.238 0.02520 *
## CancerStageII -0.322440 0.078566 -4.104 4.06e-05 ***
## CancerStageIII -0.864059 0.102706 -8.413 < 2e-16 ***
## CancerStageIV -2.166176 0.165898 -13.057 < 2e-16 ***
## LengthofStay -0.041553 0.036459 -1.140 0.25440
## Experience 0.125876 0.028147 4.472 7.75e-06 ***
## FamilyHxyes -1.308942 0.095548 -13.699 < 2e-16 ***
## Age -0.016080 0.006067 -2.651 0.00803 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) IL6 CRP CncSII CnSIII CncSIV LngthS Exprnc FmlyHx
## IL6 -0.070
## CRP -0.084 -0.001
## CancerStgII 0.083 0.005 -0.002
## CancrStgIII 0.138 0.010 0.012 0.514
## CancerStgIV 0.145 0.036 0.010 0.359 0.351
## LengthofSty -0.137 0.012 -0.026 -0.186 -0.240 -0.194
## Experience -0.842 -0.007 -0.003 -0.002 -0.006 -0.013 -0.004
## FamilyHxyes -0.015 0.022 0.016 -0.053 -0.051 -0.009 -0.112 -0.023
## Age -0.385 -0.011 0.015 -0.181 -0.222 -0.221 -0.321 -0.009 0.105
library(lme4)
## Loading required package: Matrix
se <- sqrt(diag(as.matrix(vcov(m1))))
# table of estimates with 95% CI
tab <- cbind(Est = fixef(m1), LL = fixef(m1) - 1.96 * se,
UL = fixef(m1) + 1.96 *se)
exp(tab)
## Est LL UL
## (Intercept) 0.1952642 0.06138961 0.6210840
## IL6 0.9433620 0.92190339 0.9653201
## CRP 0.9770285 0.95734781 0.9971139
## CancerStageII 0.7243793 0.62099623 0.8449735
## CancerStageIII 0.4214479 0.34460288 0.5154289
## CancerStageIV 0.1146150 0.08279918 0.1586563
## LengthofStay 0.9592985 0.89314018 1.0303574
## Experience 1.1341411 1.07326665 1.1984683
## FamilyHxyes 0.2701058 0.22397609 0.3257363
## Age 0.9840483 0.97241667 0.9958191
Now suppose we want to compare doctors. One way to do this is to look at predicted probabilities of remission for the different doctors, conditional on a set of prespecified patient characteristics (we do this so that a doctor is not penalized for attracting patients with more serious disease). In this case, we may look at median values of patient characteristics in our overall study and then predict probabilities of remission for this type of patient for each doctor.
tmpmat=lung[,c("Experience","DID")]
docmat=unique(tmpmat)
newn=length(unique(lung$DID))
new=data.frame(IL6=rep(3.344,newn),CRP=rep(4.333,newn),CancerStage=rep("II",newn),
LengthofStay=rep(5,newn),FamilyHx=rep("no",newn),Age=rep(50.97,newn),
Experience=docmat$Experience,DID=docmat$DID)
probremiss=predict(m1,newdata=new,type="response")
summary(probremiss)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01509 0.08705 0.25879 0.35103 0.58598 0.98707
We can also plot the doctor-specific effects; note that these are centered around zero, with positive values indicating doctors whose patients are more likely to experience remission, and negative values those whose patients are less likely to experience remission.
#turn off doctor ID's due to inability to read all
lattice::dotplot(ranef(m1,which="DID",condVar=TRUE),
scales = list(y = list(alternating = 0)))
## $DID
The model we fit to the data is commonly called a two-level model, with levels for patient and doctor. We can also fit a three-level model, adding the hospital to the mix.
m2=glmer(remission ~ IL6 + CRP + CancerStage + LengthofStay + Experience + FamilyHx
+ Age+ (1 | DID) + (1|HID), family = binomial, data=lung,
control=glmerControl(optimizer="bobyqa"),optCtrl=list(maxfun=2e5))
summary(m2)
## Warning: extra argument(s) 'optCtrl' disregarded
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula:
## remission ~ IL6 + CRP + CancerStage + LengthofStay + Experience +
## FamilyHx + Age + (1 | DID) + (1 | HID)
## Data: lung
## Control: glmerControl(optimizer = "bobyqa")
##
## AIC BIC logLik deviance df.resid
## 7198.9 7283.5 -3587.5 7174.9 8513
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.1298 -0.4243 -0.1886 0.3576 7.6201
##
## Random effects:
## Groups Name Variance Std.Dev.
## DID (Intercept) 3.843 1.960
## HID (Intercept) 0.253 0.503
## Number of obs: 8525, groups: DID, 407; HID, 35
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.624335 0.572807 -2.836 0.00457 **
## IL6 -0.058407 0.011459 -5.097 3.45e-07 ***
## CRP -0.023181 0.010142 -2.286 0.02228 *
## CancerStageII -0.321448 0.076596 -4.197 2.71e-05 ***
## CancerStageIII -0.861103 0.100320 -8.584 < 2e-16 ***
## CancerStageIV -2.156594 0.162995 -13.231 < 2e-16 ***
## LengthofStay -0.041712 0.035596 -1.172 0.24127
## Experience 0.126236 0.026895 4.694 2.68e-06 ***
## FamilyHxyes -1.305410 0.093629 -13.942 < 2e-16 ***
## Age -0.016023 0.005925 -2.704 0.00684 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) IL6 CRP CncSII CnSIII CncSIV LngthS Exprnc FmlyHx
## IL6 -0.071
## CRP -0.084 0.000
## CancerStgII 0.083 0.006 -0.002
## CancrStgIII 0.138 0.011 0.012 0.513
## CancerStgIV 0.144 0.036 0.009 0.356 0.347
## LengthofSty -0.139 0.012 -0.025 -0.185 -0.238 -0.192
## Experience -0.828 -0.007 -0.003 -0.002 -0.006 -0.012 -0.004
## FamilyHxyes -0.016 0.022 0.015 -0.054 -0.051 -0.010 -0.112 -0.022
## Age -0.388 -0.011 0.014 -0.183 -0.223 -0.221 -0.321 -0.009 0.105
lattice::dotplot(ranef(m2,which="HID",condVar=TRUE))
## $HID