library(readstata13)
dep=read.dta13("~/Documents/TEACHING/SHARE-Child data/
SC_anonymized_forAmy_20180314.dta")
#PHQ>9 a sign of potential moderate to severe depression
dep$depressed=dep$phqTotal>9; dep$depressedbin=as.numeric(dep$depressed)
dep$educat=cut(dep$momEduc,breaks=c(-.001,12,16))
dep$agecat=cut(dep$momAge,breaks=c(15,30,41))
tabtmp=count(dep,c("timepoint","depressed"))
ggplot(tabtmp,aes(x=timepoint,y=freq,fill=as.factor(depressed)))+
geom_col(aes(fill=as.factor(depressed)))
## Warning in read.dta13("~/Documents/TEACHING/SHARE-Child data/SC_anonymized_forAmy_20180314.dta"):
## timepoint:
## Factor codes of type double or float detected - no labels assigned.
## Set option nonint.factors to TRUE to assign labels anyway.
Here we see the prevalence of depression decreases some after baseline, and there is a fair amount of missing data after baseline as well.
We can test whether there is a significant decrease in depression levels after baseline using a longitudinal analysis. This analysis is needed because data are paired in a complicated manner: the participants at time 1 are a subset of those at baseline (time 0), the participants at time 2 are a subset of those at baseline, but there is no particular nesting structure among those measured at time 1 and time 2, complicating any paired tests we might conduct. We need to think about models not just for the mean at a given time, but also for the association within person.
Marginal models specify the mean and covariance models separately. We will assume that the marginal distribution for the response \(Y_{ij}\) of subject \(i\) at time \(j\) is given by \[Y_{ij} = \exp \Big[\frac{Y_{ij}\theta_{ij} -b(\theta_{ij})}{{\phi}}-c(Y_{ij},\phi)\Big],\] so that each \(Y_{ij}\) has a distribution from the exponential family. We model the marginal expectation of each response, \(E(Y_{ij})\), as a function of covariates of interest.
Yikes! What does this mean for some models we know about?
In marginal models, parameters have population-averaged interpretations. It is important to note that the type or size of the correlation does not affect the interpretation of \(\beta\). The regression parameters \(\beta\) describe the effect of covariates on the average responses.
Liang and Zeger (1986) proposed an alternative to maximum likelihood for estimating marginal models based on the concept of estimating equations. Their approach is a single general and unified method for analyzing discrete (or continuous) responses using marginal models.
Nice aspects of this approach include simple interpretations and easy estimation methods.
Drawbacks are that the emphasis is on describing trends in the mean response, missing data is assumed to be missing completely at random, that estimates of correlation are not a focus (if that’s your goal, try a multilevel model instead).
To fit a GEE, you specify three components.
GEE is robust to misspecification of the working correlation structure if the empirical (a.k.a. “sandwich”) covariance estimator is used, but estimates will be most precise when you have specified the correct structure. However, this “insurance” is nice when you do not have a lot of data about what the association among repeated measures should look like.
With GEE we have two options for covariance estimation: model-based and empirical estimators.
If we correctly specify the variance-covariance model, we can use the model-based variance estimator in order to conduct inferences. We would expect to gain some power in this situation. However, if the covariance model is reasonable and the sample size is large enough, the empirical and model-based standard errors should not differ greatly. We note that you want to have around \(N=100\) subjects to use the empirical covariance estimator with binary data.
For the depression data, a simple model would evaluate the odds of depression as a function of time time period of measurement. This model is specified as
\[\text{logit}(Pr(y_{ij}=1))=\beta_0+\beta_1I(time_i=1)+\beta_2I(time_i=2),\] with a Bernoulli variance function. For simplicity, we will assume a compound symmetric (exchangeable) working correlation, which implies \(\text{Corr}(y_{i1},y_{i2})=\text{Corr}(y_{i1},y_{i3})=\text{Corr}(y_{i2},y_{i3})\).
library(geepack)
dep2=select(dep,depressedbin,timepoint,anon_id,educat)
#geepack needs NA values to be dropped
newdep=na.omit(dep2)
#id is used to identify the repeated measures (independent units)
#waves tells R the ordering of the observations within participant
m1=geeglm(depressedbin~factor(timepoint),id=as.factor(anon_id),family=binomial(),
corstr="exchangeable",data=newdep,waves=timepoint)
summary(m1)
##
## Call:
## geeglm(formula = depressedbin ~ factor(timepoint), family = binomial(),
## data = newdep, id = as.factor(anon_id), waves = timepoint,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.1613 0.1521 1.124 0.289
## factor(timepoint)1 -1.1594 0.2519 21.187 4.17e-06 ***
## factor(timepoint)2 -1.2619 0.2429 26.987 2.05e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.9985 0.09178
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.1585 0.06278
## Number of clusters: 174 Maximum cluster size: 3
It is easy to use the robust standard errors to calculate odds ratios and confidence intervals.
m1est=coef(summary(m1))$Estimate
m1se=coef(summary(m1))$Std.err
cbind(exp(m1est),exp(m1est-1.96*m1se),exp(m1est+1.96*m1se))[2:3,]
## [,1] [,2] [,3]
## [1,] 0.3137 0.1914 0.5139
## [2,] 0.2831 0.1759 0.4558
Clearly, the odds of depression are lower after the baseline measurement. For example, the odds of depression at time 1 are only 0.31 (95%CI=(0.19, 0.51)) times the odds of depression at baseline.
In geepack
, it is easy to request a “chunk test” of the significance of an entire group of predictors using the anova
command. This test is a multivariate Wald test (the GEE model does not involve a full likelihood, and we do not get likelihood ratio tests).
anova(m1)
## Analysis of 'Wald statistic' Table
## Model: binomial, link: logit
## Response: depressedbin
## Terms added sequentially (first to last)
##
## Df X2 P(>|Chi|)
## factor(timepoint) 2 33.2 6.2e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value listed is for the test of the hypothesis \(H_0: \beta_1=\beta_2=0\), which tests whether there is any difference in the depression scores over time. In this case, clearly there is a change in depression scores over time.
We can use our previous trick of resetting the referent level in order to evaluate whether there is a difference in depression levels at the two post-baseline measures.
newdep$time2ref=relevel(as.factor(newdep$timepoint),ref="2")
m1b=geeglm(depressedbin~time2ref,id=as.factor(anon_id),family=binomial(),
corstr="exchangeable",data=newdep,waves=timepoint)
summary(m1b)
##
## Call:
## geeglm(formula = depressedbin ~ time2ref, family = binomial(),
## data = newdep, id = as.factor(anon_id), waves = timepoint,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -1.423 0.209 46.35 9.9e-12 ***
## time2ref0 1.262 0.243 26.99 2.0e-07 ***
## time2ref1 0.102 0.257 0.16 0.69
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.998 0.0918
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.159 0.0628
## Number of clusters: 174 Maximum cluster size: 3
We see the estimate for the time point 1 (compared to the referent of time 2) is not statistically significant, indicating there is no difference in depression levels at the two post-baseline occasions.
Next, we may wish to evaluate whether there are any differences in depression levels as a function of maternal education. To start, we can fit a model that allows a different probability of depression according to whether a woman has \(>12\) years of education or not.
\[\text{logit}(Pr(y_{ij}=1))=\beta_0+\beta_1I(time_i=1)+\beta_2I(time_i=2)+\beta_3I(edu_i>12),\]
m2=geeglm(depressedbin~as.factor(timepoint)+educat,id=as.factor(anon_id),
family=binomial(),corstr="exchangeable",data=newdep,waves=timepoint)
summary(m2)
##
## Call:
## geeglm(formula = depressedbin ~ as.factor(timepoint) + educat,
## family = binomial(), data = newdep, id = as.factor(anon_id),
## waves = timepoint, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.132 0.162 0.66 0.42
## as.factor(timepoint)1 -1.164 0.252 21.32 3.9e-06 ***
## as.factor(timepoint)2 -1.265 0.243 27.13 1.9e-07 ***
## educat(12,16] -0.207 0.306 0.46 0.50
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 1 0.0921
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.157 0.0616
## Number of clusters: 174 Maximum cluster size: 3
Examining the p-value, we do not see evidence of an association between having \(>\) 12 years of education and depression level.
One possibility we have not checked is that education has a different impact on depression risk at each time point. We can evaluate this using an interaction term as follows.
\[\begin{eqnarray*} \text{logit}(Pr(y_{ij}=1))&=&\beta_0+\beta_1I(time_i=1)+\beta_2I(time_i=2) \\ & & +\beta_3I(edu_i>12) + \beta_4I(time_i=1)I(edu_i>12) \\ & & + \beta_5I(time_i=2)I(edu_i>12) \end{eqnarray*}\]m3=geeglm(depressedbin~as.factor(timepoint)*educat,id=as.factor(anon_id),
family=binomial(),corstr="exchangeable",data=newdep,waves=timepoint)
summary(m3)
##
## Call:
## geeglm(formula = depressedbin ~ as.factor(timepoint) * educat,
## family = binomial(), data = newdep, id = as.factor(anon_id),
## waves = timepoint, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.1210 0.1641 0.54 0.46
## as.factor(timepoint)1 -1.2171 0.2627 21.47 3.6e-06 ***
## as.factor(timepoint)2 -1.2577 0.2532 24.66 6.8e-07 ***
## educat(12,16] -0.2845 0.4400 0.42 0.52
## as.factor(timepoint)1:educat(12,16] 0.4261 0.8568 0.25 0.62
## as.factor(timepoint)2:educat(12,16] -0.0759 0.8551 0.01 0.93
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.998 0.0937
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.158 0.0617
## Number of clusters: 174 Maximum cluster size: 3
anova(m3)
## Analysis of 'Wald statistic' Table
## Model: binomial, link: logit
## Response: depressedbin
## Terms added sequentially (first to last)
##
## Df X2 P(>|Chi|)
## as.factor(timepoint) 2 33.2 6.2e-08 ***
## educat 1 0.5 0.50
## as.factor(timepoint):educat 2 0.4 0.83
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We don’t see any evidence of any interaction here. It looks like education (at least in terms of our parameterization here) is unrelated to depression levels and is also unrelated to any potential change in depression levels over time.