8/19/2020

Basic Visualizations for EDA

descrim <- read.csv("data/cal_discrimination.csv", stringsAsFactors = TRUE)
par(mfrow = c(2, 2))

# Scatterplot matrix of age, age cohort, and expenditures
plot(descrim[, c("Age", "Age.Cohort", "Expenditures")])

# Expenditures vs. age within each ethnicity, with a smoother per panel
coplot(Expenditures ~ Age | Ethnicity, panel = panel.smooth, data = descrim)

# Expenditure distribution by gender
boxplot(Expenditures ~ Gender, data = descrim)
par(mfrow = c(1, 1))

It is also worth learning the ggplot2 and lattice packages.
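For instance, a minimal ggplot2 sketch (assuming the package is installed) that redoes the gender boxplot above with lattice-style conditioning panels:

library(ggplot2)

# The gender boxplot from above, with one panel per ethnicity
ggplot(descrim, aes(x = Gender, y = Expenditures)) +
  geom_boxplot() +
  facet_wrap(~ Ethnicity)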

Basic Model Fitting

flmod <- lm(Expenditures ~ Age.Cohort + Gender + Ethnicity,
            data = descrim)
summary(flmod)
## 
## Call:
## lm(formula = Expenditures ~ Age.Cohort + Gender + Ethnicity, 
##     data = descrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19863.6  -1346.0     36.5   1315.7  21169.3 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    137.8     2001.5   0.069 0.945143    
## Age.Cohort 51 +              52298.1      587.8  88.977  < 2e-16 ***
## Age.Cohort12-Jun               785.7      520.1   1.511 0.131207    
## Age.Cohort13-17               2552.9      507.4   5.031 5.78e-07 ***
## Age.Cohort18-21               8557.7      515.4  16.605  < 2e-16 ***
## Age.Cohort22-50              39007.0      516.1  75.573  < 2e-16 ***
## GenderMale                    -954.1      246.1  -3.877 0.000113 ***
## EthnicityAsian                1492.9     1979.7   0.754 0.450969    
## EthnicityBlack                1865.2     2012.7   0.927 0.354310    
## EthnicityHispanic             1904.5     1962.6   0.970 0.332086    
## EthnicityMulti Race           1678.2     2101.3   0.799 0.424698    
## EthnicityNative Hawaiian      -474.7     2966.5  -0.160 0.872887    
## EthnicityOther                1102.9     3373.0   0.327 0.743755    
## EthnicityWhite not Hispanic   1491.8     1955.8   0.763 0.445783    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3879 on 986 degrees of freedom
## Multiple R-squared:  0.9611, Adjusted R-squared:  0.9606 
## F-statistic:  1874 on 13 and 986 DF,  p-value: < 2.2e-16

Bayesian Model Fitting a.k.a. Posterior Computation

  1. Fully Conjugate Models
  2. Writing Custom Gibbs/MH Samplers
  3. Using JAGS
  4. Using Stan

Fully Conjugate

Advantages:

  1. Fastest
  2. No convergence ambiguity: independent draws come directly from the exact posterior

Disadvantages:

  1. (Usually) only available for simple models

Example: Normal-Inverse Gamma prior for mean and variance of normal data
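A minimal sketch of this on the expenditure data, with illustrative hyperparameters (the prior settings below are assumptions, not values from the source). Because the prior is conjugate, we sample directly from the exact posterior with no Markov chain at all:

# Exact posterior sampling for (mu, sigma2) with y_i ~ N(mu, sigma2),
# sigma2 ~ Inv-Gamma(a0, b0), and mu | sigma2 ~ N(m0, sigma2 / k0)
y  <- descrim$Expenditures
n  <- length(y); ybar <- mean(y)
m0 <- 0; k0 <- 0.01; a0 <- 1/2; b0 <- 1/2   # illustrative hyperparameters

# Standard conjugate updates
kn <- k0 + n
mn <- (k0 * m0 + n * ybar) / kn
an <- a0 + n / 2
bn <- b0 + sum((y - ybar)^2) / 2 + k0 * n * (ybar - m0)^2 / (2 * kn)

S      <- 4000
sigma2 <- 1 / rgamma(S, shape = an, rate = bn)         # sigma2 | y
mu     <- rnorm(S, mean = mn, sd = sqrt(sigma2 / kn))  # mu | sigma2, y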

Writing Custom Gibbs/MH Samplers

Advantages:

  1. Full customization lets you implement model-specific speed-ups
  2. No reliance on black-box methods

Disadvantages:

  1. Hand-written code can be slower or contain errors
  2. (Often) has to be written from scratch for each project

Example: Half-normal priors for variance of normal data
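A minimal sketch of such a sampler for y_i ~ N(mu, sigma^2) with a flat prior on mu and a half-normal prior on sigma (the prior scale and the Metropolis step size below are assumed tuning choices, not values from the source):

y  <- descrim$Expenditures
n  <- length(y); ybar <- mean(y)
s0 <- 20000   # half-normal prior scale for sigma (illustrative)

# Log posterior for sigma up to a constant; the half-normal prior
# puts no mass on sigma <= 0
log_post_sigma <- function(sigma, mu) {
  if (sigma <= 0) return(-Inf)
  sum(dnorm(y, mu, sigma, log = TRUE)) + dnorm(sigma, 0, s0, log = TRUE)
}

S     <- 5000
draws <- matrix(NA, S, 2, dimnames = list(NULL, c("mu", "sigma")))
mu    <- ybar; sigma <- sd(y)
for (s in 1:S) {
  # Gibbs step: with a flat prior, mu | sigma, y ~ N(ybar, sigma^2 / n)
  mu <- rnorm(1, ybar, sigma / sqrt(n))
  # Random-walk Metropolis step for sigma (step size 400 is a tuning choice)
  prop <- rnorm(1, sigma, 400)
  if (log(runif(1)) < log_post_sigma(prop, mu) - log_post_sigma(sigma, mu))
    sigma <- prop
  draws[s, ] <- c(mu, sigma)
}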

Using JAGS

Advantages:

  1. Once you are familiar with the syntax, writing down models is easy.
  2. Pretty fast for some models

Disadvantages:

  1. Need to become familiar with the syntax
  2. Pretty slow for some models

JAGS Example - Model

\[Y_i = \sum_j \beta_jX_{ij} + \epsilon_i, \ \epsilon_i \sim N(0,\sigma^2) \implies Y_i \sim N\left(\sum_j \beta_jX_{ij},\sigma^2\right)\]

Prior:

\[\boldsymbol{\beta}|\sigma^2 \sim MVN(\mathbf{0},n \sigma^2 (X^T X)^{-1}), \ \sigma^{2} \sim Gam(1/2,1/2) \]

JAGS Example - Model Syntax

model <- function(){
  for(i in 1:n){
    y[i] ~ dnorm(mu[i], tau)          # JAGS parameterizes the normal by precision
    mu[i] <- inprod(X[i,1:p], beta)   # linear predictor
    resid[i] <- y[i] - mu[i]          # saved for model checking
  }

  m0 <- rep(0, p)                     # prior mean vector for beta

  # Matches the prior above: beta | sigma2 ~ MVN(0, n * sigma2 * (X^T X)^{-1});
  # dmnorm takes a precision matrix, hence tau * (X^T X) / n
  beta ~ dmnorm(m0, tau * (t(X) %*% X) / n)
  tau <- pow(sigma2, -1)              # precision = 1 / variance
  sigma <- pow(sigma2, 1/2)           # sd, monitored instead of the variance
  sigma2 ~ dgamma(1/2, 1/2)
}

JAGS Example - Fitting the Model

library(R2jags)   # provides jags()

descrim.modelmat <- 
  model.matrix(Expenditures ~ Age.Cohort + Gender + Ethnicity,
               data = descrim)

# No burn-in: early iterations still reflect the starting values
jags.result.notconverged <- 
  jags(data = list(y = descrim$Expenditures, X = descrim.modelmat,
                   n = nrow(descrim.modelmat), p = ncol(descrim.modelmat)),
       model.file = model, parameters.to.save = c("beta", "sigma"),
       n.iter = 1000, n.burnin = 0)
## module glm loaded

# Default burn-in (floor(n.iter/2)) discards the transient phase
jags.result.converged <- 
  jags(data = list(y = descrim$Expenditures, X = descrim.modelmat,
                   n = nrow(descrim.modelmat), p = ncol(descrim.modelmat)),
       model.file = model, parameters.to.save = c("beta", "sigma"),
       n.iter = 1000)

Using Stan

Advantages:

  1. Workflow is basically the same as JAGS, but faster for some models and slower for others.
  2. Has automatic warnings/checks for posterior convergence

Disadvantages:

  1. Cannot sample discrete parameters (they must be marginalized out of the model)
  2. Another model syntax to learn

Stan Example

model_text <- "
data {
  int<lower=0> N;   // Number of observations
  int<lower=0> P;   // Number of predictors
  matrix[N,P] x;    // Covariate data
  vector[N] y;      // Response
}
parameters {
  vector[P] beta;       // regression coeffs
  real<lower=0> sigma;  // sd of error term
}
model {
  // No explicit prior statements: beta and sigma get implicit (improper)
  // uniform priors over their declared supports
  y ~ normal(x * beta, sigma); // sd parameterization
}
"

Stan Example – Fit

library(rstan)   # provides stan()

model_data <- list(y = descrim$Expenditures, x = descrim.modelmat, 
                   N = nrow(descrim.modelmat), 
                   P = ncol(descrim.modelmat))
stanfit <- stan(model_code = model_text, data = model_data)
colnames(descrim.modelmat)
##  [1] "(Intercept)"                 "Age.Cohort 51 +"            
##  [3] "Age.Cohort12-Jun"            "Age.Cohort13-17"            
##  [5] "Age.Cohort18-21"             "Age.Cohort22-50"            
##  [7] "GenderMale"                  "EthnicityAsian"             
##  [9] "EthnicityBlack"              "EthnicityHispanic"          
## [11] "EthnicityMulti Race"         "EthnicityNative Hawaiian"   
## [13] "EthnicityOther"              "EthnicityWhite not Hispanic"
stanfit
## Inference for Stan model: 93c7df36ddbda0cebec5440a0d588862.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##              mean se_mean      sd     2.5%      25%      50%      75%    97.5%
## beta[1]    204.44   65.87 1918.93 -3536.86 -1089.45   222.17  1499.11  3995.12
## beta[2]  52277.24   13.71  574.93 51172.30 51890.92 52273.76 52661.89 53412.46
## beta[3]    760.30   13.34  508.28  -247.83   418.45   752.54  1105.71  1776.87
## beta[4]   2533.83   12.71  507.93  1531.75  2192.43  2538.56  2862.19  3528.03
## beta[5]   8546.30   12.47  506.81  7575.76  8199.76  8547.02  8893.16  9549.65
## beta[6]  38982.23   13.12  508.79 37980.72 38641.13 38986.24 39319.11 39991.48
## beta[7]   -953.13    4.16  249.50 -1440.55 -1122.87  -955.30  -785.54  -477.56
## beta[8]   1442.47   65.25 1904.00 -2403.06   165.51  1446.47  2750.71  5212.78
## beta[9]   1824.81   65.75 1941.29 -1979.38   525.75  1816.95  3130.25  5653.17
## beta[10]  1859.60   64.52 1877.61 -1875.81   610.87  1825.89  3139.55  5542.52
## beta[11]  1619.36   66.27 2052.90 -2401.84   262.00  1622.77  3000.93  5617.17
## beta[12]  -531.25   71.23 2861.61 -6316.08 -2452.85  -466.35  1436.04  4925.53
## beta[13]  1056.29   76.42 3315.15 -5391.77 -1209.35  1024.02  3364.17  7472.77
## beta[14]  1445.28   64.16 1870.37 -2278.21   190.35  1436.81  2707.75  5101.64
## sigma     3883.73    1.56   87.06  3718.07  3823.89  3882.64  3942.01  4060.67
## lp__     -8755.65    0.06    2.69 -8761.88 -8757.24 -8755.35 -8753.71 -8751.40
##          n_eff Rhat
## beta[1]    849 1.01
## beta[2]   1759 1.00
## beta[3]   1451 1.00
## beta[4]   1597 1.00
## beta[5]   1652 1.00
## beta[6]   1505 1.00
## beta[7]   3599 1.00
## beta[8]    851 1.01
## beta[9]    872 1.01
## beta[10]   847 1.01
## beta[11]   960 1.00
## beta[12]  1614 1.00
## beta[13]  1882 1.00
## beta[14]   850 1.01
## sigma     3120 1.00
## lp__      1723 1.00
## 
## Samples were drawn using NUTS(diag_e) at Tue Aug 25 17:26:17 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

What do we do with all those simulations?

  1. Run diagnostics for the Markov chain (did the MCMC converge?)
  2. Run diagnostics for the model (does it fit the data?)
  3. Calculate interesting quantities!

MCMC Diagnostics

Traceplots

Lag-1 Scatter plots

ACF plots

Rhat/ESS

MCMC Diagnostics - Traceplots
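The original figure is not reproduced here; a minimal sketch of how these traceplots could be drawn from the fits above:

# One line per chain; well-mixed chains overlap and hover around a common value
rstan::traceplot(stanfit, pars = c("beta[7]", "sigma"))

# R2jags objects have their own traceplot method
R2jags::traceplot(jags.result.converged)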

MCMC Diagnostics - Lag-1 Scatter Plots
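The figure is missing; a minimal sketch, extracting draws with `permuted = FALSE` so the time ordering needed for lag plots is preserved:

# Draws as an iterations x chains x parameters array, in original order
draws <- rstan::extract(stanfit, permuted = FALSE)
sig   <- draws[, 1, "sigma"]   # chain 1

# Each draw against the next; a tight diagonal band signals high autocorrelation
plot(sig[-length(sig)], sig[-1],
     xlab = "sigma (iteration t)", ylab = "sigma (iteration t+1)")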

MCMC Diagnostics - ACF Plots
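Again only a sketch, since the original figure is missing:

# Autocorrelation at increasing lags; quick decay toward 0 means the chain
# behaves almost like independent draws
draws <- rstan::extract(stanfit, permuted = FALSE)
acf(draws[, 1, "sigma"], main = "ACF of sigma, chain 1")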

MCMC Diagnostics - Rhat/ESS

jags.result.converged$BUGSoutput$summary[,c(1:3,8:9)]
##                 mean          sd        2.5%     Rhat n.eff
## beta[1]     140.3624 187.8502728   -234.5242 1.000219  1500
## beta[2]   52246.5102  53.6474951  52144.3184 1.000920  1500
## beta[3]     784.6471  48.5097168    689.7892 1.001325  1500
## beta[4]    2550.3469  46.9390941   2462.9752 1.000112  1500
## beta[5]    8550.5161  46.1282421   8466.2312 1.000012  1500
## beta[6]   38967.7458  47.7812196  38876.5798 1.000146  1500
## beta[7]    -953.3196  22.3303788   -998.5721 1.001217  1500
## beta[8]    1487.0187 185.3278499   1137.0435 1.000218  1500
## beta[9]    1862.9309 188.8188078   1495.0616 1.000384  1500
## beta[10]   1900.0254 183.9362548   1553.0195 1.000381  1500
## beta[11]   1672.6059 195.6848633   1306.7816 1.000123  1500
## beta[12]   -483.7150 273.7178072   -995.4265 1.001246  1500
## beta[13]   1098.9160 309.2735209    490.8849 1.002236  1500
## beta[14]   1486.9836 183.1513027   1141.6414 1.000151  1500
## deviance 133142.7341 324.1448455 132530.8350 1.001113  1500
## sigma       352.3108   0.4812708    351.4069 1.001074  1500

MCMC Diagnostics - Rhat/ESS

jags.result.notconverged$BUGSoutput$summary[,c(1:3,8:9)]
##                  mean           sd         2.5%     Rhat n.eff
## beta[1]      135.5560 7.507217e+01    -16.17822 1.025969  3000
## beta[2]    52245.7844 2.256324e+01  52197.85118 1.029244  3000
## beta[3]      784.7867 1.991599e+01    744.54120 1.031381  2800
## beta[4]     2550.4673 1.975082e+01   2507.64459 1.030485  3000
## beta[5]     8549.0843 1.973604e+01   8505.46399 1.034938  3000
## beta[6]    38968.0390 1.941139e+01  38927.34089 1.029140  3000
## beta[7]     -952.9584 9.371752e+00   -972.28388 1.036897   600
## beta[8]     1493.6661 7.422163e+01   1343.41710 1.027225  3000
## beta[9]     1865.6389 7.544433e+01   1706.97182 1.026419  3000
## beta[10]    1904.4777 7.354164e+01   1750.21943 1.026816  3000
## beta[11]    1678.4865 7.919101e+01   1520.02263 1.029884  3000
## beta[12]    -472.4816 1.117833e+02   -705.12299 1.026765  3000
## beta[13]    1101.8786 1.264931e+02    836.64745 1.040336  1700
## beta[14]    1492.3234 7.337845e+01   1340.07519 1.027601  3000
## deviance 1447338.5030 1.949311e+06 273667.36324 1.191420    15
## sigma        138.0631 5.019918e+01     43.88492 1.194866    15

Model Diagnostics

Posterior predictive distribution

QQ-plots for regression

Residuals vs. fitted values

Model Diagnostics - Posterior Predictive

“The idea behind posterior predictive checking is simple: if a model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed.” –Jonah Gabry in “Graphical posterior predictive checks using the bayesplot package” https://mc-stan.org/bayesplot/articles/graphical-ppcs.html

## [1] 0.4831
## [1] 0
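The two values printed above come from posterior predictive checks whose code is not shown in the source. A minimal sketch of one such check using the Stan draws (the maximum as test statistic is an illustrative choice, so these numbers are not expected to match):

post <- rstan::extract(stanfit)   # beta: draws x P matrix, sigma: vector
S    <- length(post$sigma)

# Simulate one replicated dataset per posterior draw and record a test statistic
yrep_max <- numeric(S)
for (s in 1:S) {
  mu_s        <- as.vector(descrim.modelmat %*% post$beta[s, ])
  yrep        <- rnorm(length(mu_s), mu_s, post$sigma[s])
  yrep_max[s] <- max(yrep)
}

# Posterior predictive p-value: how often replicated maxima exceed the observed one
mean(yrep_max >= max(descrim$Expenditures))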

Model Diagnostics - QQ-plots
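The original figure is missing; a minimal sketch using residuals at the posterior mean coefficients:

post     <- rstan::extract(stanfit)
beta_hat <- colMeans(post$beta)
resids   <- as.vector(descrim$Expenditures - descrim.modelmat %*% beta_hat)

# Points far from the line suggest the normal error assumption is off
qqnorm(resids); qqline(resids)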

Model Diagnostics - Residuals vs Fitted Values
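Also sketched here in place of the missing figure:

post   <- rstan::extract(stanfit)
fitted <- as.vector(descrim.modelmat %*% colMeans(post$beta))
resids <- descrim$Expenditures - fitted

# Look for trends or funnel shapes, which would suggest nonlinearity
# or non-constant variance
plot(fitted, resids, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)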

Posterior Inference

One of the big payoffs of Bayesian inference, and especially Monte Carlo computation, is straightforward calculation of interpretable inferential quantities. In the typical frequentist setting there can be annoying computational and interpretational challenges in estimating, say, \(\Pr(\alpha > 3\beta + \gamma)\), but this is no problem when we've simulated a large number of draws from the joint posterior distribution of our parameters.

Posterior Inference

Once we have fit a model that we are satisfied with, and have performed basic checks for MCMC convergence, we can proceed with inference tasks. Using our posterior samples from Stan or JAGS, let's calculate:

  1. A 95% CI for the difference in expenditure for Males vs. Females (of referent ethnicity and age cohort, if you included interactions)
  2. Probability that a person who identifies as Multi Race receives less expenditure than a person who is Hispanic
  3. The probability that the residual standard deviation (sigma) is greater than 4000 (a sketch of these calculations appears after the output below).
## [1] 7
## [1] 0.6125
## [1] 0.52629
## [1] 0.5241373
## [1] 0.0955
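The unlabeled values above were produced by code not shown in the source (the `[1] 7` is plausibly the model-matrix column index of GenderMale, and the near-duplicate probabilities plausibly come from computing the same quantity with both the Stan and JAGS draws; both are inferences). A minimal sketch of the three calculations from the Stan draws, using the column ordering printed earlier (7 = GenderMale, 10 = EthnicityHispanic, 11 = EthnicityMulti Race):

post <- rstan::extract(stanfit)

# 1. 95% credible interval for the Male vs. Female difference
quantile(post$beta[, 7], c(0.025, 0.975))

# 2. Pr(Multi Race coefficient < Hispanic coefficient), i.e. the probability a
#    Multi Race person receives less expenditure, all else equal
mean(post$beta[, 11] < post$beta[, 10])

# 3. Pr(sigma > 4000)
mean(post$sigma > 4000)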