Linear Regression

Birth weight

Birth weight is a commonly-used indicator of a newborn infant’s health status. Because this indicator is easily obtained, it is a critical component of population health reporting by the World Health Organization and governments around the globe. The distribution of birth weight has often been described as approximately normal, though the left tail in particular is inflated relative to that from a Gaussian distribution.

Review: Slope-intercept form of line

Here, \(m=0.75\) and \(b=-1\)
\(b\) is the y-intercept, or where the line crosses the y-axis. It is the predicted value of \(y\) when \(x=0\)

Review: Slope-intercept form of line

\(m\) is the slope, which tells us the predicted increase in \(y\) when \(x\) changes by 1 unit
What is our predicted \(y\) when \(x=2\)?

Review: Slope-intercept form of line

In statistics, we often represent the slope with \(\beta_1\) and the intercept with \(\beta_0\), and these values are usually not known but must be estimated, yielding \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\). What are those values here?

Linear regression review

Heart disease risk as function of alcohol intake

What are the approximate values of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) and their interpretation?

Read data into R

We will evaluate whether there is a relationship between the response, birth weight, and the predictors gestational age (measured in weeks) and sex.

Check for NAs in data - appear as 9, 99, 9999 for our variables of interest (birth weight, gestational age, and sex)
For now, remove the rows with missing values
Focus on most recent year with data, 2016
Do some rudimentary data cleaning

#Read in birth data
o_data <- read.csv("~/Documents/TEACHING/vitalstats/Yr1116Birth.csv", 
  na.strings=c("9","99", "9999"))
#SEX=1 male, 2 female; male=1 male, 0 female
o_data$male=2-o_data$SEX #binary gender for interpretation
o_data$GEST_C=o_data$GEST; o_data$BWTG_C=o_data$BWTG
o_data$GEST_C[o_data$GEST_C>50]=NA
o_data$GEST_C[o_data$GEST_C<20]=NA
o_data$BWTG_C[o_data$BWTG_C<500]=NA
birth_data <- na.omit(o_data)
birth_data_2016=birth_data[which(birth_data$YOB==2016),]

Is birth weight normally distributed?

Compare histogram of birth weight to a normal distribution parameterized with a mean and variance that is set based on birth weight variable, BWTG.
Do the distributions align?

Linear regression

Let \(y_i\) denote the observed response, ‘BWTG’, for observation \(i\), and \(x_i\) the observed covariates, ‘GEST’ and ‘male’. Then we aim to fit the model:

\[ y_i = \beta_0 + \beta_1 \text{GEST}_i + \beta_2 \text{male}_i + \epsilon_i \]

When we fit the line, we estimate the coefficients \(\mathbf{\widehat{\beta}}= \{\widehat{\beta}_0, \widehat{\beta}_1, \widehat{\beta}_2\}\), where \(\widehat{\beta} = (X^\prime X)^{-1} X^\prime Y\).

\[ \widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 \text{GEST}_i + \widehat{\beta}_2 \text{male}_i \] We denote residuals, \(e_i = y_i - \widehat{y}_i\) where \(\widehat{y}_i\) is the estimated predicted value for observation \(i\).

Model Assumptions

To make inference on the model parameters, \(\beta\), we check four assumptions for the linear regression model:

The mean of the response, \(E(y_i) = \beta_0 + \beta_1 \text{GEST}_i + \beta_2 \text{male}_i\) is linear function of the \(x_i\) at each value of the predictors \(x_i\).
The errors, \(\epsilon_i\), are independent.
The errors, \(\epsilon_i\), at each value of the predictor, \(x_i\), are normally distributed.
The errors, \(\epsilon_i\), at each value of the predictor, \(x_i\), have equal variances.

Or in summary, we assume the errors, \(\epsilon_i\), are independent, normally distributed with zero mean and constant variance: \[\epsilon_i \overset{\text{iid}}{\sim} \text{Normal}(0, \sigma^2).\]

Fit regression to data

Let’s fit a linear model using the function lm.

model1 = lm(BWTG~GEST+male, data=birth_data)
summary(model1)

## 
## Call:
## lm(formula = BWTG ~ GEST + male, data = birth_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2956.9  -294.7   -19.3   272.3  4879.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4140.5308     9.5509  -433.5   <2e-16 ***
## GEST          190.4367     0.2468   771.7   <2e-16 ***
## male          126.2132     1.0502   120.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 441.9 on 708696 degrees of freedom
## Multiple R-squared:  0.4616, Adjusted R-squared:  0.4616 
## F-statistic: 3.038e+05 on 2 and 708696 DF,  p-value: < 2.2e-16

How do we interpret the results?

Look at residuals

What do the residuals tell us about model fit?

par(mfrow=c(2,2))
plot(model1)

Guide to residual plots

RxP plot (residuals versus fitted): you want to see a “happy cloud” of residuals with no obvious trends or patterns, centered around 0
Q-Q plot (standardized residuals vs order stats from normal): want to see a straight line with slope 1
Scale-Location plot (are residuals spread equally across range of predictors?): want equal spacing of points with horizontal smoother
Residuals vs Leverage (are outliers influencing model fit?): look for points in top and bottom right side corners to see if outliers are also high leverage points that may affect fit of line

It is also a good idea to examine the predictions from the model against the true observed values

ggplot(data = birth_data, mapping = aes(x = model1$fitted.values, y = BWTG)) +
  geom_point(alpha=1/20) + geom_abline(intercept=0,slope=1,col='red') + 
  xlab("Predicted birth weight (g)") + ylab("Observed birth weight (g)")

What do you notice about the predicted values?

Diagnosing the problem

Diagnostic plots of the model are essential for checking model fit. Some common remedies depending on the problem include the following.

Model improvements - predictions can often be improved by refining the mean model, e.g. by including polynomial terms or interactions
Deletion of outliers - often done in practice, but throwing away data is generally a bad idea. Careful data cleaning is often helpful (and some algorithms are availble for vital records data).
Transformations (e.g. sqrt(BTWG)) - transforming outcome may fix skewness but can make interpretting the model more difficult
Robust regression - use a more flexible regression model to accommodate the data. While there are many methods for robust regression, they share a general strategy of giving lower weight to observations that have large influence on the regression model fit.

Interaction

In some cases, the association between a predictor and the outcome may depend on a third variable, indicating an interaction term is needed. Let’s see if the gender-birth weight relationship is constant over gestational age.

\[ y_i = \beta_0 + \beta_1 \text{GEST}_i + \beta_2 \text{male}_i + \beta_3 \text{GEST}_i\text{male_i}+\epsilon_i \]

What is the slope for males? Females?
How do you determine if gender is related to birth weight? If gestational age is related to birth weight?

Interaction code

model2 = lm(BWTG~GEST+male+GEST*male, data=birth_data)
summary(model2)

## 
## Call:
## lm(formula = BWTG ~ GEST + male + GEST * male, data = birth_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2953.1  -294.9   -19.4   271.1  4843.0 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -4036.3885    13.8478 -291.483  < 2e-16 ***
## GEST          187.7374     0.3584  523.829  < 2e-16 ***
## male          -71.6643    19.0822   -3.756 0.000173 ***
## GEST:male       5.1325     0.4942   10.385  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 441.9 on 708695 degrees of freedom
## Multiple R-squared:  0.4616, Adjusted R-squared:  0.4616 
## F-statistic: 2.026e+05 on 3 and 708695 DF,  p-value: < 2.2e-16

Class Exercise:

Fit a regression model with birth weight as the outcome and predictors including biological sex, total pregnancies (dichotomized as 1 or >1), smoking status (dichotomized as any versus none), and gestational age (linear term in weeks). Explore whether interaction or polynomial terms improve model fit.