Missing Data

Motivating Example: Pregnancy, Infection, & Nutrition (PIN) Study

Prospective study of risk factors for adverse birth outcomes in central NC prenatal care clinics
We would like to study how cocaine use relates to small-for-gestational-age (SGA) births
Covariates: maternal age, race, income, parity, and education
Cocaine measured on questionnaire, in urine, and in hair

Mathematical notation

Let \(Y\) = \((Y_{obs},Y_{mis})\)
\(r_i = 1\) if data for unit \(i\) is missing, \(r_i = 0\) otherwise
\(R = (r_1,..., r_n)\)
\(\theta\) denote the parameters associated with \(Y\)
\(\psi\) denote the parameters associated with \(R\)
this assumes we have just one variable subject to missingness; in practice we may have multiple variables per subject and thus multiple indicators of missingness per subject (one per variable)

Types/Mechanisms of Missing Data

Missing completely at random (MCAR)

Subjects with missing data are a completely random sample of study subjects
e.g. hair cocaine missing for subset of women who were never asked to provide it for study logistical reasons that are not related to the probability of detecting cocaine in hair
in mathematical notation \(f(R|Y,\theta, \psi) = f(R|\psi)\)

Missing at random (MAR)

Missingness may depend on observed variables, but not on the missing values themselves
e.g., hair samples missing more often for older women or for women with very short hair (and we record age and hair length in study)
in mathematical notation \(f(R|Y,\theta, \psi) = f(R|Y_{obs},\psi)\)
in the PIN study, there was some MAR data due to the protocol, which specified women needed hair of a certain length (around an inch) in order to have sufficient quantities to assay

Missing not at random/nonignorable (MNAR)

Missingness may depend on the missing values themselves and may also depend on observed values
e.g., women who have used cocaine recently are less likely to provide biospecimens for drug testing
mathematcally, we have \(f(R|Y,\theta, \psi) = f(R|Y_{obs},Y_{mis},\psi)\)

Methods of Handling Missing Data

Complete case analysis

Use only the set of observations with no missing values
When missingness is MCAR, then the complete case estimator is unbiased
Often good approach when percentage of data missing is small (\(<\) 5-10%)
If there are many different variables with missing values, a large fraction of the observations may be dropped, resulting in inefficient use of information even under this mechanism

Ad hoc methods

Creation of an indicator of missingness as a new variable
Simple imputation with mean of observed values, or prediction from regression model
Last value carried forward
Hot deck imputation: replace missing value with an observed response from a “similar” subject
Easy to implement, but have the potential to induce bias, not recommended

Maximum likelihood

Involves specification of model for any variables subject to missing data
Then one can integrate the likelihood over the missing values, obtaining the likelihood of the observed data (observed data likelihood)
Then we maximize the observed data likelihood
Can be quite computationally intensive
Done automatically in multilevel/mixed effects models though assumes MCAR or MAR missing data

Multiple imputation

Generate \(p\) possible values for each missing observation (typically 5-10 imputated datasets are created)
Each of these datasets is analyzed using complete-data methods
Combine the results of \(p\) separate analyses, allowing the uncertainty regarding the imputation to be taken into account
Often the easiest solution
Software implementations offer a variety of choices for models from which to impute data (many flavors of MI are available, and MI can be used for any missing data mechanism, though software may restrict this)
Bayesian inference can easily accommodate “imputation” directly in sampling algorithms when distribution of variables subject to missingness specified (treat missing data as parameters to estimate)

MICE

Multiple Imputation using Chained Equations
Also known as “fully conditional specification” or “sequential regression multiple imputation”
Involves a variable-by-variable approach for a set of variables \((Y_1,\ldots,Y_q)\) for each participant
Imputation model is specified separately for each variable, using the other variables as predictors, e.g. specify \(f(Y_j|Y_{(-j)})\), where \(Y_{(-j)}=(Y_1,Y_2, ... ,Y_{j-1},Y_{j+1}, ... ,Y_q)\)
At each stage of the algorithm, an imputation is generated for the missing variable, then this imputed value is used in the imputation of the next variable
Repeat this process to obtain one imputed dataset
Then impute multiple datasets

Example

The National Health and Nutrition Examination Survey (NHANES) is a critical public health survey, funded by the US government, that provides valuable information about the health and nutritional status of people in the US. The survey examines a nationally-representative sample of about 5000 people taken from 15 counties each year. This survey is used for a wide variety of purposes, including determining the prevalence of major diseases and the prevalences of some important risk factors.

We’ll walk through a small example in R with a small subset of the NHANES dataset. It contains 25 obs and four variables: age (age groups: 1=20-39, 2=40-59, 3=60+), bmi (body mass index, in kg/m\(^2\)), hyp (hypertension status, 1=no, 2=yes) and chl (total cholesterol, in mg/dL).

# required libraries
library(mice)
library(VIM)
library(lattice)


# load data
data(nhanes)
str(nhanes)

## 'data.frame':    25 obs. of  4 variables:
##  $ age: num  1 2 1 3 1 3 1 1 2 2 ...
##  $ bmi: num  NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
##  $ hyp: num  NA 1 1 NA 1 NA 1 1 1 NA ...
##  $ chl: num  NA 187 187 NA 113 184 118 187 238 NA ...

nhanes$age <- factor(nhanes$age)

Here the missing data are indicated by the value \(NA\) in the data set. Note other codes are often used for missing values, including negative numbers (for typically positive variables), or digits like 9, 99, 999 (depending on the scale of the variable missing). Data exploration and cleaning are always important first steps in any analysis to ensure you’re not treating a value of 99 as someone’s BMI or 999 as their cholesterol!

Patterns of missing data

md.pattern(nhanes)

##    age hyp bmi chl   
## 13   1   1   1   1  0
## 3    1   1   1   0  1
## 1    1   1   0   1  1
## 1    1   0   0   1  2
## 7    1   0   0   0  3
##      0   8   9  10 27

- 5 patterns observed from \(2^3\) possible patterns

Patterns of missing data

##    age hyp bmi chl   
## 13   1   1   1   1  0
## 3    1   1   1   0  1
## 1    1   1   0   1  1
## 1    1   0   0   1  2
## 7    1   0   0   0  3
##      0   8   9  10 27

- 1st column gives pattern count (subjects) - last column gives # variables missing in each pattern - last row gives total number of missing values (not subjects) - e.g., 3 cases in which only cholesterol is missing - age always observed - complete case analysis would use the 52% of data that are observed (pretty high fraction of missingness)

Visualizing missing data patterns

The VIM package provides nice visualizations of the fraction of data missing and missing data patterns. Blue is used for observed data and red for missing.

nhanes_aggr <- aggr(nhanes,col=mdc(1:2),numbers=TRUE,sortVars=TRUE,
labels=names(nhanes),cex.axis=.7, gap=3, 
ylab=c("Proportion missing","Missingness Pattern"))

Visualizing missing data patterns

## 
##  Variables sorted by number of missings: 
##  Variable Count
##       chl  0.40
##       bmi  0.36
##       hyp  0.32
##       age  0.00

More visualization

Margin plots can be used to understand how missingness affects the distribution of values on other variables.

Blue boxplots summarize the distribution of observed data given the other variable is observed
Red box plots summarize the distribution of observed data given the other variable is missing
If data are MCAR, you expect the boxplots to be the same (hard to evaluate in this small sample)

More visualization

marginplot(nhanes[, c("chl", "bmi")], col = mdc(1:2), cex.numbers = 1.2, pch = 19)

Multiple imputation

imp <- mice(nhanes, m=20, printFlag=FALSE, maxit = 40, seed=2525) 
# The output imp contains m=20 completed datasets 
# Imputation by default used predictive mean matching 
# (we will discuss that method shortly!)
fit.mi <- with(data=imp, expr = lm(chl ~ age + bmi))
#Function with() used to fit a simple linear model to each imputed dataset
#Expr stands for model expression

Look at individual results

fit.mi$analyses[[1]] #see analysis of 1st imputed dataset

## 
## Call:
## lm(formula = chl ~ age + bmi)
## 
## Coefficients:
## (Intercept)         age2         age3          bmi  
##     -27.599       43.262       70.924        7.318

fit.mi$analyses[[20]] #see analysis of 20th imputed dataset

## 
## Call:
## lm(formula = chl ~ age + bmi)
## 
## Coefficients:
## (Intercept)         age2         age3          bmi  
##     -27.487       41.434       65.659        7.164

Combining estimates

Clearly, point estimates differ across imputations; we have \(\widehat{\beta}_1, ..., \widehat{\beta}_{20}\). Variance estimates will also differ across imputed datasets; letting the matrix \(U_j=\widehat{\text{Cov}}(\widehat{\beta}_j)\) then we have \(U_1, ..., U_{20}\).

What if age seems important in one imputed data set but not another? How do we combine estimates to make an overall conclusion?

Rubin’s rules dictate how estimates should be combined across imputed datasets.

Rubin’s rules

Our overall point estimate over \(m\) imputations is given by \[\widetilde{\beta}=\frac{1}{m} \sum_{j=1}^m \widehat{\beta}_j.\]

The average variance over imputations is given by \[\widetilde{U}=\frac{1}{m} \sum_{j=1}^m U_j,\] but this only tells part of the story, as we also need to consider the variance in the \(\widehat{\beta}_j\) across the \(m\) imputations, which is given by \[B=\frac{1}{m-1}\sum_{j=1}^m (\widehat{\beta}_j-\widetilde{\beta})'(\widehat{\beta}_j-\widetilde{\beta}).\]

Total variance

Rubin defines the total variance as \[T=\widetilde{U}+B(1+\frac{1}{m}),\] where

\(\widetilde{U}\) represents the contribution of the sampling variance,
\(B\) is the extra variance due to having missing data, and
\(\frac{1}{m}B\) is the extra simulation variance due to estimating our parameters using a finite \(m\).

Combining estimates across models

combFit <- pool(fit.mi) # Combine all the results of the 20 imputed datasets

round(summary(combFit),2)

##             estimate std.error statistic    df p.value
## (Intercept)     8.37     63.78      0.13 11.15    0.90
## age2           43.83     18.17      2.41 14.66    0.03
## age3           67.94     22.20      3.06 10.59    0.01
## bmi             5.85      2.27      2.58 10.43    0.03

Is this result similar to what you would obtain in a complete case analysis?

How was imputation done?

Many different methods can be used to impute missing data. For these data, we can see what was used (the default in this case):

imp$method

##   age   bmi   hyp   chl 
##    "" "pmm" "pmm" "pmm"

Predictive mean matching (PMM, Rubin 1986) is a semiparametric method often used in multiple imputation. Advantages include matching the distribution of the original observed variable, as imputed values are taken from the real data.

Predictive mean matching

Suppose \(x\) is subject to missing values while \(z\) is completely observed.

Basic approach:

Using complete cases, estimate regression of \(x\) on \(z\), obtaining \(\widehat{\beta}\)
Draw a new \(\beta^*\) from the posterior predictive distribution of \(\widehat{\beta}\) (e.g., MVN)
Using \(\beta^*\), generate predicted values of \(x\) for all cases
For each case with a missing \(x\), identify set of donors with observed \(x\) who have predicted \(x\) values close to that of the case with missing data
From among these cases, randomly select one and assign its observed value of \(x\) as the imputed value
Repeat for all observations and imputation data sets

Increasing number of imputations

Increase the number of imputations to \(m=50\) to see whether we get similar results

imp50 <- mice(nhanes, m=50, printFlag=FALSE, maxit = 40, seed=2525)
fit.mi50 <- with(data=imp50, expr = lm(chl ~ age + bmi))
combFit <- pool(fit.mi50)
round(summary(combFit),2)

##             estimate std.error statistic    df p.value
## (Intercept)     6.06     63.31      0.10 13.01    0.93
## age2           45.08     19.77      2.28 13.04    0.04
## age3           66.89     23.02      2.91 11.09    0.01
## bmi             5.94      2.18      2.73 13.31    0.02

Results are similar. One rule of thumb is that the # of imputations should roughly equal the % of missing data.

Examine imputations created by MICE

# Extract the imputations for bmi
imp$imp$bmi

##       1    2    3    4    5    6    7    8    9   10   11   12   13   14
## 1  30.1 20.4 35.3 22.0 27.2 27.4 22.7 35.3 33.2 30.1 35.3 22.0 30.1 35.3
## 3  30.1 22.0 27.2 26.3 30.1 29.6 26.3 26.3 30.1 26.3 27.2 35.3 30.1 33.2
## 4  27.4 24.9 29.6 21.7 21.7 26.3 25.5 27.5 22.0 22.0 26.3 27.4 30.1 27.4
## 6  21.7 27.5 21.7 25.5 22.7 22.7 27.4 27.4 25.5 22.0 27.5 27.4 22.7 20.4
## 10 27.2 26.3 27.4 28.7 22.5 33.2 22.0 22.7 27.5 22.7 26.3 27.2 28.7 28.7
## 11 27.2 22.0 30.1 22.7 33.2 27.2 22.7 29.6 27.2 26.3 35.3 22.0 30.1 27.2
## 12 22.7 30.1 27.5 21.7 22.7 27.5 26.3 22.0 25.5 22.7 27.5 27.5 22.7 21.7
## 16 27.2 35.3 27.2 21.7 33.2 35.3 28.7 25.5 35.3 27.2 33.2 22.0 22.0 29.6
## 21 27.2 22.0 30.1 22.5 35.3 35.3 26.3 26.3 25.5 30.1 28.7 35.3 35.3 29.6
##      15   16   17   18   19   20
## 1  27.2 33.2 30.1 22.0 27.2 27.2
## 3  29.6 22.0 25.5 27.5 30.1 27.2
## 4  27.5 27.4 20.4 33.2 20.4 24.9
## 6  22.7 24.9 28.7 27.5 20.4 22.5
## 10 28.7 27.4 29.6 25.5 22.5 22.5
## 11 30.1 30.1 35.3 22.5 35.3 29.6
## 12 22.5 26.3 26.3 28.7 22.7 22.7
## 16 28.7 35.3 27.2 25.5 29.6 26.3
## 21 27.2 22.7 30.1 25.5 30.1 27.2

Examine imputations created by MICE

# Extract the second complete dataset
imp_2 <- complete(imp, 2)
head(imp_2)

##   age  bmi hyp chl
## 1   1 20.4   1 118
## 2   2 22.7   1 187
## 3   1 22.0   1 187
## 4   3 24.9   2 284
## 5   1 20.4   1 113
## 6   3 27.5   2 184

Checking imputations visually

# scatterplot (of chl and bmi) for each imputed dataset
xyplot(imp, bmi ~ chl | .imp, pch = 20, cex = 1.4)

We expect the red points (imputed data) have almost the same shape as blue points. Blue points are constant across imputed datasets, but red points differ from each other, which represents our uncertainty about the true values of missing data.

Checking imputations visually

# To detect interesting differences in distribution between observed and imputed data
densityplot(imp)

What if we only know BMI as a binary indicator?

# Create an indicator variable of BMI, coded as 1 if BMI >= 25, 0 if BMI<25, 
# NA if BMI is missing
nhanes$overwt <- ifelse(is.na(nhanes$bmi),NA,ifelse(nhanes$bmi<25,0,1))
nhanes_new <- nhanes[,-which(names(nhanes) %in% "bmi")]
nhanes_new$overwt <- as.factor(nhanes_new$overwt)

# MICE
imp_new <- mice(nhanes_new, m=20, printFlag=FALSE, maxit = 40, seed=2525) 
fit.mi_new <- with(data=imp_new, expr = lm(chl ~ age + overwt))
combFit_new <- pool(fit.mi_new)
round(summary(combFit_new),2)

##             estimate std.error statistic    df p.value
## (Intercept)   154.68     19.27      8.03 12.57    0.00
## age2           25.38     20.97      1.21 13.06    0.25
## age3           42.18     23.06      1.83 12.31    0.09
## overwt1        34.46     18.87      1.83 13.35    0.09

How mice() actually imputes values

# See the univariate imputation model for each incomplete variable that mice() used
# for your data as default
imp_new$meth

##      age      hyp      chl   overwt 
##       ""    "pmm"    "pmm" "logreg"

# Possible imputation models provided by mice()
methods(mice)

##  [1] mice.impute.2l.bin       mice.impute.2l.lmer     
##  [3] mice.impute.2l.norm      mice.impute.2l.pan      
##  [5] mice.impute.2lonly.mean  mice.impute.2lonly.norm 
##  [7] mice.impute.2lonly.pmm   mice.impute.cart        
##  [9] mice.impute.jomoImpute   mice.impute.lda         
## [11] mice.impute.logreg       mice.impute.logreg.boot 
## [13] mice.impute.mean         mice.impute.midastouch  
## [15] mice.impute.norm         mice.impute.norm.boot   
## [17] mice.impute.norm.nob     mice.impute.norm.predict
## [19] mice.impute.panImpute    mice.impute.passive     
## [21] mice.impute.pmm          mice.impute.polr        
## [23] mice.impute.polyreg      mice.impute.quadratic   
## [25] mice.impute.rf           mice.impute.ri          
## [27] mice.impute.sample       mice.mids               
## [29] mice.theme              
## see '?methods' for accessing help and source code

Logistic regression method

The logistic regression method is used as a default when \(x\) is binary. Suppose again \(z\) is completely observed. Then the method is as follows.

Basic approach:

Using complete cases, estimate logistic regression for response \(x\) with predictor \(z\), obtaining \(\widehat{\beta}\)
Draw a new \(\beta^*\) from the posterior predictive distribution of \(\widehat{\beta}\)
Using \(\beta^*\), generate predicted probability of \(p=Pr(x=1)\) for subjects with missing \(x\)
Randomly sample a \(U(0,1)\) random variable \(u\) and set \(x=1\) if \(u<p\) and \(x=0\) otherwise
Repeat for all observations and imputation data sets

Change the default imputation methods

# Since hypertension is binary, we want to change the default imputation method
meth=imp_new$meth
meth = c("", "logreg", "pmm","logreg")
nhanes_new$hyp <- as.factor(nhanes_new$hyp)
imp_new2 <- mice(nhanes_new, m=20, printFlag=FALSE, maxit = 40, seed=2525,
                 method = meth) 
imp_new2$method

##      age      hyp      chl   overwt 
##       "" "logreg"    "pmm" "logreg"

# Visiting sequence
imp_new2$vis

## [1] "age"    "hyp"    "chl"    "overwt"

We could have declared hypertension as a factor at the beginning (as we did for age at the beginning), and the default would have been to use logistic regression for imputation for that variable as well.

Look at imputation model

# Predictor matrix: Which variables does mice() use as predictors 
# for imputation of each incomplete variable?
imp_new2$pred

##        age hyp chl overwt
## age      0   1   1      1
## hyp      1   0   1      1
## chl      1   1   0      1
## overwt   1   1   1      0

# We can specify relevant predictors as follows:
# Suppose hypertension should not predict overweight and cholesterol
pred=imp_new2$pred;
pred[, "hyp"] = 0
pred

##        age hyp chl overwt
## age      0   0   1      1
## hyp      1   0   1      1
## chl      1   0   0      1
## overwt   1   0   1      0