Logistic Regression

Logistic regression

Logistic regression is a type of generalized linear model, which generalizes the typical linear model to non-Gaussian data. The logistic regression model is linear on the log of the odds: \[\log\frac{\pi_i}{1-\pi_i}=\beta_0+\beta_1x_{1i}+\cdots+\beta_px_{pi},\] where \(\pi_i=Pr(y_i=1)\). If the parameter \(\beta_j>0\), then increasing levels of \(x_j\) are associated with higher probabilities that \(y=1\), and values of \(\beta_j<0\) are associated with lower probabilities that \(y=1\). \(\beta_j=0\) is consistent with no association between \(x_j\) and \(y\).

Question: where is the \(\varepsilon_i\) in this model?

Binary outcome: the basics

Suppose we have a binary outcome (e.g., \(y=1\) if diseased and \(y=0\) if not) and predictors on a variety of scales.

If the binary outcomes are independent, we can use the Bernoulli distribution for individual 0-1 data; when the predictors are discrete, observations can be grouped by covariate pattern, and the binomial distribution applies to the counts of successes in each group.

Models for binary outcomes

Contingency tables for continuous predictors (and more than a few categorical predictors) can quickly become unwieldy, so we need a new analytic method to model \(\pi=Pr(y=1)\).

One strategy might be to fit a linear regression model to the probabilities, e.g. model \[\pi_i=\beta_0+\beta_1x_i.\] The problem is that as a probability, \(\pi_i\) must be in the interval \([0,1]\), but there is nothing in the model that enforces this constraint, so that you could be estimating probabilities that are negative or that are greater than 1 – not a good thing!
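To see the problem concretely, here is a minimal simulated sketch in R (the data are made up, not from these notes):

set.seed(1)
x <- seq(-3, 3, length.out = 200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))  # true model is logistic
lpm <- lm(y ~ x)          # linear probability model
range(fitted(lpm))        # typically extends below 0 and above 1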

Models for binary outcomes

An alternative that is sometimes used is to fit the model \[\pi_i=\exp(\beta_0+\beta_1x_i),\] which is equivalent to \[\ln(\pi_i)=\beta_0+\beta_1x_i.\] While exponentiating the linear predictor (\(\beta_0+\beta_1x_i\)) does ensure our estimated values of \(\pi_i\) are not negative, they can still be greater than 1. This is not ideal either.

Note: statisticians often use \(\ln\) and \(\log\) interchangeably to mean the natural log. If base 10 logarithms are desired, the notation \(\log_{10}\) is typically used.

Logistic regression

An attractive solution is to model the log odds (also called the logit) using \[\ln\frac{\pi_i}{1-\pi_i}=\beta_0+\beta_1x_i,\] which is equivalent to

\[\frac{\pi_i}{1-\pi_i}=\exp(\beta_0+\beta_1x_i),\] which defines a multiplicative model for the odds. For example, if we change the \(j\)th predictor by one unit while holding the other variables constant, we multiply the odds by \(\exp(\beta_j)\) because \(\exp(\beta_j(x+1))=\exp(\beta_jx)\exp(\beta_j)\).
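A one-line numeric check, with arbitrary made-up values for \(\beta_j\) and \(x\), confirms the identity:

# Numeric check of exp(b*(x+1)) = exp(b*x) * exp(b), with made-up values
b <- 0.7; x <- 2.5
exp(b * (x + 1))       # contribution of x + 1 to the odds
exp(b * x) * exp(b)    # identical: the odds get multiplied by exp(b)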

Logistic regression

\[\log\frac{\pi_i}{1-\pi_i}=\beta_0+\beta_1x_{i}\] In this model, the odds of \(y=1\) for participants at covariate level \(x_i\) are given by \(e^{\beta_0+\beta_1x_i}\), and the odds of \(y=1\) for participants at covariate level \(x_i+1\) are given by \(e^{\beta_0+\beta_1(x_i+1)}\). We often describe the association between \(x\) and \(y\) in terms of the odds ratio (OR) given by \(\frac{e^{\beta_0+\beta_1x_i+\beta_1}}{e^{\beta_0+\beta_1x_i}}=e^{\beta_1}\).

Here, \(e^{\beta_1}\) is the odds ratio comparing participants who have \(x_i\) values one unit higher than counterparts who are otherwise similar.

Logistic regression model

In this model,

\[\pi_i=\frac{\exp(\beta_0+\beta_1x_i)}{1+\exp(\beta_0+\beta_1x_i)}.\]

The expression on the right is called a logistic function and cannot yield a value that is negative or a value that is \(>1\). A model of this form is known as a logistic regression model.
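A quick sketch in R (base R's plogis() is exactly this logistic function):

# The logistic (inverse-logit) function squeezes any real-valued
# linear predictor into (0, 1); plogis() computes it in base R.
eta <- seq(-6, 6, length.out = 100)   # values of the linear predictor
pi_hat <- exp(eta) / (1 + exp(eta))   # logistic function by hand
all.equal(pi_hat, plogis(eta))        # TRUE: same function
range(pi_hat)                         # strictly inside (0, 1)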

Other transformations (also called links) can be used to ensure the probabilities lie in \([0,1]\), including the probit (popular in econometrics) and complementary log-log. In practice these links all provide very similar fits to the data except at extreme values of \(\pi_i\).
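As a sketch, alternative links are requested through the family argument of R's glm(); shown here with a simplified model (age only) for the cuse data loaded later in these notes:

# Sketch (assumes the cuse data frame loaded below): the same binomial
# outcome under probit and complementary log-log links.
m_probit  <- glm(cbind(using, notUsing) ~ age, family = binomial(link = "probit"),  data = cuse)
m_cloglog <- glm(cbind(using, notUsing) ~ age, family = binomial(link = "cloglog"), data = cuse)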

Interpreting parameters in logistic regression

Typically we interpret functions of parameters in logistic regression rather than the parameters themselves. For the model \[\log\frac{\pi_i}{1-\pi_i}=\beta_0+\beta_1x_i,\] we note that the probability that \(y=1\) when \(x=0\) is \(\frac{\exp(\beta_0)}{1+\exp(\beta_0)}.\) When \(x=1\), \(Pr(y=1)=\frac{\exp(\beta_0+\beta_1)}{1+\exp(\beta_0+\beta_1)}.\)

Interpreting parameters in logistic regression

Suppose that \(x\) is a binary (0/1) variable, e.g. \(x=1\) for Duke and \(x=0\) for UNC. In this case, the coefficient \(\beta_1\) has a special interpretation: we interpret \(\exp(\beta_1)\) as the odds ratio of the response for the two possible levels of \(x\). For \(x\) on other scales, \(\exp(\beta_1)\) is interpreted as the odds ratio of the response comparing two values of \(x\) one unit apart.

Why? The log odds of response for \(x=1\) is given by \(\beta_0+\beta_1\), and the log odds of response for \(x=0\) is \(\beta_0\). So the odds ratio of response comparing \(x=1\) to \(x=0\) is given by \(\frac{\exp(\beta_0+\beta_1)}{\exp(\beta_0)}=\exp(\beta_1)\).

In a model with more than one predictor, this OR is interpreted conditionally on the values of the other predictors staying fixed at any given values.
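A numeric illustration with made-up coefficients:

# Made-up coefficients: the OR comparing x = 1 to x = 0 is exp(beta1),
# whatever the value of beta0.
b0 <- -1.4; b1 <- 0.9
plogis(b0); plogis(b0 + b1)        # Pr(y = 1) at x = 0 and at x = 1
exp(b0 + b1) / exp(b0)             # ratio of the odds ...
exp(b1)                            # ... equals exp(beta1)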

Hypothesis tests in logistic regression

Generally, we wish to know whether the OR=1 or equivalently whether the log OR (a \(\beta\) coefficient)=0. To test \(H_0:\beta_j=0\), we can compare the ratio of a parameter estimate to its standard error using \[z=\frac{\widehat{\beta}_j-0}{\sqrt{\widehat{\text{var}}(\widehat{\beta}_j)}},\] comparing this z-statistic to the standard normal distribution.
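A minimal sketch of this computation in R, using the model m1 fit later in these notes (summary() reports the same z-statistics directly):

# Sketch (assumes model m1 fit below): Wald z-statistics by hand.
tab <- coef(summary(m1))                    # estimates, SEs, z, p
z   <- tab[, "Estimate"] / tab[, "Std. Error"]
p   <- 2 * pnorm(-abs(z))                   # two-sided p-values
cbind(z, p)                                 # matches the summary(m1) output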

Confidence intervals

Confidence intervals for the effects on the logit scale, \(\widehat{\beta}_j \pm 1.96\sqrt{\widehat{\text{var}}(\widehat{\beta}_j)}\), are typically translated into confidence intervals for OR’s by exponentiating the lower and upper confidence limits as \(\exp\left(\widehat{\beta}_j \pm 1.96\sqrt{\widehat{\text{var}}(\widehat{\beta}_j)}\right)\).
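A minimal sketch, again using m1 from below (confint.default() performs essentially this Wald computation):

# Sketch (assumes model m1 fit below): Wald CIs, then exponentiate.
se       <- sqrt(diag(vcov(m1)))                      # standard errors
ci_logit <- cbind(coef(m1) - 1.96 * se, coef(m1) + 1.96 * se)
exp(ci_logit)                                         # CIs on the OR scale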

Bernoulli versus binomial

Recall that the sum of \(n\) iid Bernoulli(\(p\)) random variables is binomial(\(n\), \(p\)).

The logistic regression model can be used either for Bernoulli data or for data summarized as binomial counts.
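A small simulation sketch of the Bernoulli-binomial relationship:

# Simulated check: sums of n iid Bernoulli(p) draws behave like Binomial(n, p).
set.seed(2)
n <- 10; p <- 0.3
bern <- matrix(rbinom(5000 * n, size = 1, prob = p), ncol = n)
sums <- rowSums(bern)             # 5000 sums of n Bernoullis
mean(sums); n * p                 # sample mean is close to np
var(sums); n * p * (1 - p)        # sample variance is close to np(1 - p)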

Contraceptive use data

To illustrate, we consider data on contraceptive use as a function of age, education, and desire for more children from the Fiji Fertility Survey, provided online by German Rodriguez. The survey was administered in 1974 to women aged 15-49 who had been married at least once.

[Table from Pagano & Gauvreau]

Read data into R

cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat", header=TRUE)
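One caveat: this code predates R 4.0.0, where read.table() stopped converting strings to factors by default. On a current R installation we would convert explicitly before the relevel() calls used later; the level orderings below are inferred from the model output that follows.

# In R >= 4.0.0, convert the character columns to factors explicitly;
# these level orders reproduce the referent groups seen in the output below.
cuse$age       <- factor(cuse$age, levels = c("<25", "25-29", "30-39", "40-49"))
cuse$education <- factor(cuse$education, levels = c("high", "low"))
cuse$wantsMore <- factor(cuse$wantsMore, levels = c("no", "yes"))
str(cuse)   # grouped data: one row per covariate pattern, with counts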

Model

Researchers believe that contraceptive use is related to a woman’s age, her desire for more children, and her level of education. They expect an interaction between age and desire for more children due to the relationship between age and fertility.

Thus researchers will fit the model \[\begin{eqnarray*} \log\frac{\pi_i}{1-\pi_i}&=&\beta_0+\beta_1I(25\leq age_i \leq 29) + \beta_2I(30 \leq age_i \leq 39) \\ & & +\beta_3I(40 \leq age_i \leq 49) + \beta_4 I(morekids_i) \\ & & + \beta_5 I(lowedu_i) + \beta_6 I(25\leq age_i \leq 29)I(morekids_i) \\ & & + \beta_7I(30 \leq age_i \leq 39)I(morekids_i) \\ & & + \beta_8 I(40 \leq age_i \leq 49) I(morekids_i) \end{eqnarray*}\]

Understanding the model

\[\begin{eqnarray*} \log\frac{\pi_i}{1-\pi_i}&=&\beta_0+\beta_1I(25\leq age_i \leq 29) + \beta_2I(30 \leq age_i \leq 39) +\beta_3I(40 \leq age_i \leq 49) \\ & & + \beta_4 I(morekids_i) + \beta_5 I(lowedu_i) + \beta_6 I(25\leq age_i \leq 29)I(morekids_i) \\ & & + \beta_7I(30 \leq age_i \leq 39)I(morekids_i) + \beta_8 I(40 \leq age_i \leq 49) I(morekids_i) \end{eqnarray*}\]
Age     Wants more kids?   Education   Expected log odds
<25     no                 high        \(\beta_0\)
<25     no                 low         \(\beta_0+\beta_5\)
<25     yes                high        \(\beta_0+\beta_4\)
<25     yes                low         \(\beta_0+\beta_4+\beta_5\)
25-29   no                 high        \(\beta_0+\beta_1\)
25-29   no                 low         \(\beta_0+\beta_1+\beta_5\)
25-29   yes                high        \(\beta_0+\beta_1+\beta_4+\beta_6\)
25-29   yes                low         \(\beta_0+\beta_1+\beta_4+\beta_5+\beta_6\)
30-39   no                 high        \(\beta_0+\beta_2\)
30-39   no                 low         \(\beta_0+\beta_2+\beta_5\)
30-39   yes                high        \(\beta_0+\beta_2+\beta_4+\beta_7\)
30-39   yes                low         \(\beta_0+\beta_2+\beta_4+\beta_5+\beta_7\)
40-49   no                 high        \(\beta_0+\beta_3\)
40-49   no                 low         \(\beta_0+\beta_3+\beta_5\)
40-49   yes                high        \(\beta_0+\beta_3+\beta_4+\beta_8\)
40-49   yes                low         \(\beta_0+\beta_3+\beta_4+\beta_5+\beta_8\)

Fitting the model

When using summary counts, we can provide R with the outcome counts in a matrix, with the first column indicating counts of “successes” (these will be treated as \(y=1\) occurrences) and the second column indicating counts of “failures” (\(y=0\)). Note that when we specify family=binomial, the default link is the logit link.

m1 <- glm(cbind(using, notUsing) ~ age * wantsMore + education, family = binomial,
          data = cuse)

If we did not have summary counts but instead had a variable \(y\) taking 1 for using and 0 for not using for every individual woman, we would replace cbind(using,notUsing) with y.
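Once m1 is fit, a quick sketch: predict() recovers the fitted log odds from the table above (type = "link") or the fitted probabilities (type = "response") for each of the 16 covariate patterns.

# Sketch: fitted log odds and probabilities for the 16 covariate patterns.
cbind(cuse[, c("age", "education", "wantsMore")],
      logodds = predict(m1, type = "link"),
      prob    = round(predict(m1, type = "response"), 3))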

Output

summary(m1)
## 
## Call:
## glm(formula = cbind(using, notUsing) ~ age * wantsMore + education, 
##     family = binomial, data = cuse)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.30027  -0.66163  -0.03286   0.81945   1.73851  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.39630    0.29782  -4.688 2.75e-06 ***
## age25-29               0.65378    0.35700   1.831  0.06705 .  
## age30-39               1.65933    0.32207   5.152 2.58e-07 ***
## age40-49               1.94120    0.35076   5.534 3.13e-08 ***
## wantsMoreyes          -0.06622    0.33071  -0.200  0.84130    
## educationlow          -0.34065    0.12577  -2.709  0.00676 ** 
## age25-29:wantsMoreyes -0.25918    0.40975  -0.633  0.52704    
## age30-39:wantsMoreyes -1.11266    0.37404  -2.975  0.00293 ** 
## age40-49:wantsMoreyes -1.36167    0.48433  -2.811  0.00493 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.77  on 15  degrees of freedom
## Residual deviance:  12.63  on  7  degrees of freedom
## AIC: 102.14
## 
## Number of Fisher Scoring iterations: 4

Interpreting the model results

exp(cbind(OR = coef(m1), confint.default(m1)))
##                              OR      2.5 %     97.5 %
## (Intercept)           0.2475100 0.13806655  0.4437078
## age25-29              1.9228028 0.95511291  3.8709251
## age30-39              5.2557652 2.79568462  9.8806093
## age40-49              6.9670894 3.50335816 13.8553733
## wantsMoreyes          0.9359252 0.48948954  1.7895297
## educationlow          0.7113094 0.55591313  0.9101440
## age25-29:wantsMoreyes 0.7716841 0.34566818  1.7227400
## age30-39:wantsMoreyes 0.3286828 0.15790314  0.6841687
## age40-49:wantsMoreyes 0.2562315 0.09916853  0.6620505

For simplicity, we can start by interpreting education, which is not involved in the interaction terms. At a given age and level of desire for more children, women with low levels of education have only 0.71 (95% CI=(0.56, 0.91)) times the odds of using contraceptives as women with higher levels of education.

Interpreting the age terms

The main effect terms in age describe the association between age and contraceptive use in the subgroup of women who do not want more children (interaction term 0), comparing each older age group to women under 25.

Thus, given the same education level, a woman aged 40-49 who does not want more children has 6.97 (95% CI=(3.50, 13.86)) times the odds of using contraceptives as a woman under 25 who also does not want more children. Looking at the other age levels, women aged 30-39 also have significantly higher odds of using contraceptives than women under 25 who do not want more kids, though women aged 25-29 do not differ significantly in their odds of contraceptive use from their counterparts under 25.

Interpreting the age terms

We also may wish to make comparisons to an age group other than the under 25’s. For example, to obtain the odds ratio comparing a woman aged 40-49 to a woman aged 30-39, assuming both do not want more children and have the same educational level, we wish to calculate \[OR=\frac{e^{\beta_0+\beta_3+\beta_5I(lowedu_i)}}{e^{\beta_0+\beta_2+\beta_5I(lowedu_i)}}=e^{\beta_3-\beta_2}.\]

Because our estimates of \(\beta_3\) and \(\beta_2\) are likely not independent, we cannot obtain the appropriate standard error for our odds ratio from the two reported standard errors alone: \(\widehat{\text{var}}(\widehat{\beta}_3-\widehat{\beta}_2)=\widehat{\text{var}}(\widehat{\beta}_3)+\widehat{\text{var}}(\widehat{\beta}_2)-2\widehat{\text{cov}}(\widehat{\beta}_3,\widehat{\beta}_2)\), which involves the covariance of the two estimates.

Estimates not part of standard output

One way to do this is to use linear algebra to derive the right standard error from our covariance matrix of parameter estimates (R has various functions that do this, though they are not particularly intuitive). Perhaps an easier way is simply to fit the model again specifying a different referent group. If we change the referent group, we do change the equation used to fit the model (e.g., \(\beta_0\) will represent something different), so we just need to be sure we keep things straight in tables.
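For completeness, a minimal sketch of the linear-algebra route, using coef() and vcov() on m1 (with the under-25 referent): the contrast \(\widehat{\beta}_3-\widehat{\beta}_2\) and its Wald CI should reproduce the 1.33 (0.88, 2.00) obtained from the releveled fit below.

# Sketch: Wald CI for exp(beta3 - beta2) via the covariance matrix of m1.
b <- coef(m1)
V <- vcov(m1)
k <- setNames(rep(0, length(b)), names(b))
k[c("age40-49", "age30-39")] <- c(1, -1)        # contrast beta3 - beta2
est <- sum(k * b)
se  <- drop(sqrt(t(k) %*% V %*% k))
exp(est)                                        # point estimate: ~1.33
exp(est + c(-1, 1) * 1.96 * se)                 # 95% CI for the OR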

Changing the referent group

cuse$age <- relevel(cuse$age, ref = "30-39")
m2 <- glm(cbind(using, notUsing) ~ age * wantsMore + education, family = binomial,
          data = cuse)
summary(m2)
exp(cbind(OR = coef(m2), confint.default(m2)))

What does \(\beta_0\) represent in this formulation of the model?

Changing the referent group

## 
## Call:
## glm(formula = cbind(using, notUsing) ~ age * wantsMore + education, 
##     family = binomial, data = cuse)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.30027  -0.66163  -0.03286   0.81945   1.73851  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             0.2630     0.1328   1.981  0.04762 *  
## age<25                 -1.6593     0.3221  -5.152 2.58e-07 ***
## age25-29               -1.0055     0.2322  -4.331 1.49e-05 ***
## age40-49                0.2819     0.2095   1.345  0.17852    
## wantsMoreyes           -1.1789     0.1748  -6.744 1.55e-11 ***
## educationlow           -0.3406     0.1258  -2.709  0.00676 ** 
## age<25:wantsMoreyes     1.1127     0.3740   2.975  0.00293 ** 
## age25-29:wantsMoreyes   0.8535     0.2985   2.859  0.00425 ** 
## age40-49:wantsMoreyes  -0.2490     0.3946  -0.631  0.52803    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.77  on 15  degrees of freedom
## Residual deviance:  12.63  on  7  degrees of freedom
## AIC: 102.14
## 
## Number of Fisher Scoring iterations: 4
##                              OR     2.5 %    97.5 %
## (Intercept)           1.3008544 1.0027656 1.6875552
## age<25                0.1902673 0.1012083 0.3576941
## age25-29              0.3658464 0.2320885 0.5766920
## age40-49              1.3256090 0.8791712 1.9987450
## wantsMoreyes          0.3076225 0.2183804 0.4333338
## educationlow          0.7113094 0.5559131 0.9101440
## age<25:wantsMoreyes   3.0424469 1.4616278 6.3329961
## age25-29:wantsMoreyes 2.3478080 1.3078148 4.2148188
## age40-49:wantsMoreyes 0.7795707 0.3597115 1.6894944

Changing the referent group

So the woman aged 40-49 who wants no more kids has 1.33 (95% CI=(0.88, 2.00)) times the odds of using contraceptives as her counterpart aged 30-39 who also wants no more kids and has the same level of education.

A woman aged 25-29 who wants no more kids has 0.37 (95% CI=(0.23, 0.58)) times the odds of using contraceptives as her counterpart aged 30-39 who also wants no more kids and has the same level of education.

Now add some interaction terms!

To make comparisons among women who do want more children, we need to add the main effect and interaction terms. Again, this complicates our job of getting confidence intervals, because the variance of a sum of estimates involves their covariance, not just the individual standard errors. Alternatively, we can use the same trick as before, changing the referent group of the desire-for-more-children variable.

Comparisons among women who do want more kids

cuse$age <- relevel(cuse$age, ref = "<25") #change back to young referent
cuse$wantsMore <- relevel(cuse$wantsMore, ref = "yes") #make morekids referent
m3 <- glm(cbind(using, notUsing) ~ age * wantsMore + education, family = binomial,
          data = cuse)
summary(m3)
exp(cbind(OR = coef(m3), confint.default(m3)))

Comparisons among women wanting more

## 
## Call:
## glm(formula = cbind(using, notUsing) ~ age * wantsMore + education, 
##     family = binomial, data = cuse)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.30027  -0.66163  -0.03286   0.81945   1.73851  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.46252    0.14636  -9.993  < 2e-16 ***
## age30-39              0.54666    0.19842   2.755  0.00587 ** 
## age25-29              0.39460    0.20145   1.959  0.05013 .  
## age40-49              0.57952    0.34742   1.668  0.09530 .  
## wantsMoreno           0.06622    0.33071   0.200  0.84130    
## educationlow         -0.34065    0.12577  -2.709  0.00676 ** 
## age30-39:wantsMoreno  1.11266    0.37404   2.975  0.00293 ** 
## age25-29:wantsMoreno  0.25918    0.40975   0.633  0.52704    
## age40-49:wantsMoreno  1.36167    0.48433   2.811  0.00493 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.77  on 15  degrees of freedom
## Residual deviance:  12.63  on  7  degrees of freedom
## AIC: 102.14
## 
## Number of Fisher Scoring iterations: 4
##                             OR     2.5 %     97.5 %
## (Intercept)          0.2316509 0.1738806  0.3086148
## age30-39             1.7274796 1.1708926  2.5486420
## age25-29             1.4837964 0.9997685  2.2021615
## age40-49             1.7851877 0.9035700  3.5270040
## wantsMoreno          1.0684614 0.5588060  2.0429446
## educationlow         0.7113094 0.5559131  0.9101440
## age30-39:wantsMoreno 3.0424469 1.4616278  6.3329961
## age25-29:wantsMoreno 1.2958670 0.5804707  2.8929478
## age40-49:wantsMoreno 3.9027210 1.5104588 10.0838446

Finish describing the results

Using these derivations as a guide, describe the model results using language suitable for publication in a non-statistical journal. (Translation: you can use the terminology OR and CI but need to be specific about all the comparisons, not expecting the reader to be able to do them by hand.)