Birth weight is a commonly-used indicator of a newborn infant’s health status. Because this indicator is easily obtained, it is a critical component of population health reporting by the World Health Organization and governments around the globe. The distribution of birth weight has often been described as approximately normal, though the left tail in particular is inflated relative to that from a Gaussian distribution.
What are the approximate values of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) and their interpretation?
We will evaluate whether there is a relationship between the response, birth weight, and the predictors gestational age (measured in weeks) and sex.
#Read in birth data
o_data <- read.csv("~/Documents/TEACHING/vitalstats/Yr1116Birth.csv",
na.strings=c("9","99", "9999"))
#SEX=1 male, 2 female; male=1 male, 0 female
o_data$male=2-o_data$SEX #binary gender for interpretation
o_data$GEST_C=o_data$GEST; o_data$BWTG_C=o_data$BWTG
o_data$GEST_C[o_data$GEST_C>50]=NA
o_data$GEST_C[o_data$GEST_C<20]=NA
o_data$BWTG_C[o_data$BWTG_C<500]=NA
birth_data <- na.omit(o_data)
birth_data_2016=birth_data[which(birth_data$YOB==2016),]
BWTG
.Let \(y_i\) denote the observed response, ‘BWTG’, for observation \(i\), and \(x_i\) the observed covariates, ‘GEST’ and ‘male’. Then we aim to fit the model:
\[ y_i = \beta_0 + \beta_1 \text{GEST}_i + \beta_2 \text{male}_i + \epsilon_i \]
When we fit the line, we estimate the coefficients \(\mathbf{\widehat{\beta}}= \{\widehat{\beta}_0, \widehat{\beta}_1, \widehat{\beta}_2\}\), where \(\widehat{\beta} = (X^\prime X)^{-1} X^\prime Y\).
\[ \widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 \text{GEST}_i + \widehat{\beta}_2 \text{male}_i \] We denote residuals, \(e_i = y_i - \widehat{y}_i\) where \(\widehat{y}_i\) is the estimated predicted value for observation \(i\).
To make inference on the model parameters, \(\beta\), we check four assumptions for the linear regression model:
Or in summary, we assume the errors, \(\epsilon_i\), are independent, normally distributed with zero mean and constant variance: \[\epsilon_i \overset{\text{iid}}{\sim} \text{Normal}(0, \sigma^2).\]
Let’s fit a linear model using the function lm
.
model1 = lm(BWTG~GEST+male, data=birth_data)
summary(model1)
##
## Call:
## lm(formula = BWTG ~ GEST + male, data = birth_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2956.9 -294.7 -19.3 272.3 4879.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4140.5308 9.5509 -433.5 <2e-16 ***
## GEST 190.4367 0.2468 771.7 <2e-16 ***
## male 126.2132 1.0502 120.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 441.9 on 708696 degrees of freedom
## Multiple R-squared: 0.4616, Adjusted R-squared: 0.4616
## F-statistic: 3.038e+05 on 2 and 708696 DF, p-value: < 2.2e-16
How do we interpret the results?
What do the residuals tell us about model fit?
par(mfrow=c(2,2))
plot(model1)
RxP plot (residuals versus fitted): you want to see a “happy cloud” of residuals with no obvious trends or patterns, centered around 0
Q-Q plot (standardized residuals vs order stats from normal): want to see a straight line with slope 1
Scale-Location plot (are residuals spread equally across range of predictors?): want equal spacing of points with horizontal smoother
Residuals vs Leverage (are outliers influencing model fit?): look for points in top and bottom right side corners to see if outliers are also high leverage points that may affect fit of line
ggplot(data = birth_data, mapping = aes(x = model1$fitted.values, y = BWTG)) +
geom_point(alpha=1/20) + geom_abline(intercept=0,slope=1,col='red') +
xlab("Predicted birth weight (g)") + ylab("Observed birth weight (g)")
What do you notice about the predicted values?
Diagnostic plots of the model are essential for checking model fit. Some common remedies depending on the problem include the following.
In some cases, the association between a predictor and the outcome may depend on a third variable, indicating an interaction term is needed. Let’s see if the gender-birth weight relationship is constant over gestational age.
\[ y_i = \beta_0 + \beta_1 \text{GEST}_i + \beta_2 \text{male}_i + \beta_3 \text{GEST}_i\text{male_i}+\epsilon_i \]
model2 = lm(BWTG~GEST+male+GEST*male, data=birth_data)
summary(model2)
##
## Call:
## lm(formula = BWTG ~ GEST + male + GEST * male, data = birth_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2953.1 -294.9 -19.4 271.1 4843.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4036.3885 13.8478 -291.483 < 2e-16 ***
## GEST 187.7374 0.3584 523.829 < 2e-16 ***
## male -71.6643 19.0822 -3.756 0.000173 ***
## GEST:male 5.1325 0.4942 10.385 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 441.9 on 708695 degrees of freedom
## Multiple R-squared: 0.4616, Adjusted R-squared: 0.4616
## F-statistic: 2.026e+05 on 3 and 708695 DF, p-value: < 2.2e-16
Fit a regression model with birth weight as the outcome and predictors including biological sex, total pregnancies (dichotomized as 1 or >1), smoking status (dichotomized as any versus none), and gestational age (linear term in weeks). Explore whether interaction or polynomial terms improve model fit.