This post outlines the steps for performing a logistic regression in R. The data come from the 2016 American National Election Survey. Code for preparing the data can be found on our github page, and the cleaned data can be downloaded here.

The steps that will be covered are the following:

- Check variable codings and distributions
- Graphically review bivariate associations
- Fit the logit model
- Interpret results in terms of odds ratios
- Interpret results in terms of predicted probabilities

The variables we use will be:

`vote`

: Whether the respondent voted for Clinton or Trump`gender`

: Male or female`age`

: The age (in years) of the respondent`educ`

: The highest level of education attained

For simplicity, this demonstration will ignore the complex survey variables (weight, PSU, and strata).

## Univariate Summaries

Let’s first load the packages we will use and the data.

```
library(tidyverse)
library(sjlabelled)
library(haven)
library(knitr)
library(broom)
datafull <- read_dta("data/cleaned-anes.dta") %>%
haven::zap_labels()
tbl <- datafull %>%
select(c("vote", "gender", "age", "educ"))
```

The data are in Stata (.dta) format, so use `haven`

to read it in. Also, note that the `read_dta`

function assumes that the data are saved in the `data`

folder inside the current working directory. By default, `read_dta`

will import the variable and value labels from Stata as variable attributes in R. This sometimes creates conflicts with certain R functions, we `zap_labels()`

to remove them.

All of the variables we want to use are numeric. We will convert them to labeled factors to facilitate interpretation and the construction of graphs.

```
tbl <- tbl %>%
mutate(gender_factor = factor(gender,
levels = 1:2,
labels = c("Male", "Female")),
vote_factor = factor(vote,
levels = 1:2,
labels = c("Clinton", "Trump")),
educ_factor = factor(educ,
levels = 1:5,
labels = c("HS Not Completed",
"Completed HS",
"College <4 Years",
"College 4 Year Degree",
"Advanced Degree")))
```

The first step in any statistical analysis should be to perform a visual inspection of the data in order to check for coding errors, outliers, or funky distributions. The variable `vote`

is the dependent variable. We can check the distribution of responses using the `count`

function:

```
tbl %>%
count(vote_factor) %>%
mutate(prop = prop.table(n)) %>%
kable(align = c("l","c","c"),
col.names = c("Vote", "N", "Proportion"),
digits = 3)
```

Vote | N | Proportion |
---|---|---|

Clinton | 1269 | 0.52 |

Trump | 1171 | 0.48 |

Now review the distribution for the other categorical variables.

```
tbl %>%
count(gender_factor) %>%
mutate(prop = prop.table(n))%>%
kable(align = c("l","c","c"),
col.names = c("Gender", "N", "Proportion"),
digits = 3)
```

Gender | N | Proportion |
---|---|---|

Male | 1128 | 0.462 |

Female | 1312 | 0.538 |

```
tbl %>%
count(educ_factor) %>%
mutate(prop = prop.table(n))%>%
kable(align = c("l","c","c"),
col.names = c("Education", "N", "Proportion"),
digits = 3)
```

Education | N | Proportion |
---|---|---|

HS Not Completed | 102 | 0.042 |

Completed HS | 381 | 0.156 |

College <4 Years | 838 | 0.343 |

College 4 Year Degree | 624 | 0.256 |

Advanced Degree | 479 | 0.196 |

NA | 16 | 0.007 |

We can also check out the distribution of age. We will create a table of summary statistics using the `summarise`

function.

```
tbl %>%
summarise(Min = min(age, na.rm = T),
Max = max(age, na.rm = T),
Median = median(age, na.rm = T),
Mean = mean(age, na.rm = T),
SD = sd(age, na.rm = T)) %>%
mutate_all(~round(., 3)) %>%
kable(align = rep("c", 4))
```

Min | Max | Median | Mean | SD |
---|---|---|---|---|

18 | 90 | 54 | 51.998 | 17.19 |

The `Min`

value is the lowest observed age, which is 18. The `Max`

value is the largest, which is 90. The `Median`

age is 54, and the `Mean`

age is 52 with a standard deviation of 17.19.

Tables are useful, but often graphs are more informative. Bar graphs are the easiest for examining categorical variables. Start with the outcome variable.

```
tbl %>%
ggplot(aes(x=vote_factor, y = ..prop.., group = 1)) +
geom_bar(fill = 'firebrick', color = 'black') +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
xlab("Vote") +
ylab("Proportion") +
ylim(0,.55)
```

In the sample, Clinton received more votes than Trump, but not by a large amount.

Now the categorical independent variables.

```
tbl %>%
ggplot(aes(x=gender_factor, y = ..prop.., group = 1)) +
geom_bar(fill = 'firebrick', color = 'black') +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
xlab("Gender") +
ylab("Proportion") +
ylim(0,.55)
```

We see that our sample has more females than males.

```
tbl %>%
ggplot(aes(x=educ_factor, y = ..prop.., group = 1)) +
geom_bar(fill = 'firebrick', color = 'black') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
xlab("Education") +
ylab("Proportion") +
ylim(0,.4)
```

Within our sample, the modal respondent has some college, with the second most populated category being college educated.

For continuous variables, histograms allow us to determine the shape of the distribution and look for outliers. The following command produces the histogram.

```
tbl %>%
ggplot(aes(x=age)) +
geom_histogram(fill = 'firebrick', color = 'black', bins = 20) +
xlab("Age") +
ylab("Count")
```

```
tbl %>%
ggplot(aes(x=age, y = ..density.., group = 1)) +
stat_bin(binwidth = 5, fill = 'firebrick', color = 'black') +
xlab("Age") +
ylab("Proportion")
```

We now have a good sense as to what the distributions of all of our variables are and do not see any evidence that recodes are necessary.

## Bivariate Summaries

Prior to moving on to the fully specified model, it is advisable to first examine the simple associations between the outcome and each individual predictor. Doing so can help avoid surprises in the final model. For example, if there is no simple relationship apparent in the data, we shouldn’t be taken aback when that predictor is not significant in the model. If there is a simple association, but it disappears in the full model, then we have evidence that one of the other variables is a confounder. Upon controlling for that factor, the relationship we initially observed is explained away.

Graphs are again helpful. When the outcome is categorical and the predictor is also categorical, a grouped bar graph is informative. The following is the graph of vote choice and gender.

```
tbl %>%
ggplot(aes(x=gender_factor,
y = ..prop..,
group = vote,
fill = vote_factor)) +
geom_bar( position = 'dodge') +
geom_text(stat='count',
aes(label=..count..),
vjust=-1,
position = position_dodge(width = 1)) +
xlab("Gender") +
ylab("Proportion") +
ylim(0,.65) +
labs(fill = "Vote")
```

The figure shows that, within males, Trump support was higher. Within females, Clinton support was higher.

A similar figure can be made for education.

```
tbl %>%
ggplot(aes(x=educ_factor,
y = ..prop..,
group = vote,
fill = vote_factor)) +
geom_bar(position = 'dodge') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(stat='count',
aes(label=..count..),
vjust=-1,
position = position_dodge(width = 1),
size = 3) +
xlab("Education") +
ylab("Proportion") +
ylim(0,.45) +
labs(fill = "Vote")
```

The figure suggests that Trump was favored by those with a high school diploma and some college, whereas Clinton’s support was higher with those who finished college and especially among those with an advanced degree. Although Clinton was slightly preferred among those without a high school diploma, the figure overall favors an interpretation that Clinton’s support increases with education.

Boxplots are useful for examining the association between a categorical variable and a variable measured on an interval scale.

```
tbl %>%
ggplot(aes(x = vote_factor,
y = age,
group = vote_factor,
fill = vote_factor)) +
geom_boxplot() +
coord_flip() +
ylab("Age") +
xlab("Vote") +
theme(legend.position = "none")
```

The `coord_flip()`

function is used to keep the dependent variable on the y-axis.

There’s a lot of overlap between the two boxes, though the Trump box sits a little higher than the Clinton box. The interpretation is that older respondents tend to be more likely to vote for Trump.

Having carefully reviewed the data, we can now move to estimating the model.

## Fitting the Model

To fit a logistic regression in R, we will use the `glm`

function, which stands for Generalized Linear Model. Within this function, write the dependent variable, followed by `~`

, and then the independent variables separated by `+`

’s. When the family is specified as `binomial`

, R defaults to fitting a logit model. Note that we are also treating `educ`

as a numeric variable here.

```
mod_log <- tbl %>%
mutate(vote = as.factor(vote)) %>%
glm(formula = vote ~ gender_factor + educ + age, family = binomial)
```

The `broom`

package contains useful functions, like `tidy()`

, for viewing the output from common models. The `conf.int = T`

argument requests confidence intervals.

```
mod_log %>%
tidy(conf.int = T) %>%
mutate_if(is.numeric, ~ round(.,3)) %>%
kable(align = c("l", rep("c", 4)))
```

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|

(Intercept) | 0.275 | 0.196 | 1.406 | 0.16 | -0.108 | 0.660 |

gender_factorFemale | -0.356 | 0.084 | -4.230 | 0.00 | -0.521 | -0.191 |

educ | -0.252 | 0.039 | -6.507 | 0.00 | -0.328 | -0.176 |

age | 0.013 | 0.002 | 5.322 | 0.00 | 0.008 | 0.018 |

Because `gender_factor`

is a factor variable, R will automatically create dummy variables for us. `educ`

is an ordered categorical variable, we opt here to treat its effect as linear. The last variable is `age`

. We find that `gender`

, `age`

, and `educ`

have significant effects.

We can also look at our results in terms of odds ratios by setting `exponentiate = T`

in the `tidy()`

function.

```
mod_log %>%
tidy(exponentiate = T, conf.int = T) %>%
mutate_if(is.numeric, ~ round(.,3)) %>%
kable(align = c("l", rep("c", 6)))
```

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|

(Intercept) | 1.317 | 0.196 | 1.406 | 0.16 | 0.897 | 1.935 |

gender_factorFemale | 0.701 | 0.084 | -4.230 | 0.00 | 0.594 | 0.826 |

educ | 0.777 | 0.039 | -6.507 | 0.00 | 0.720 | 0.838 |

age | 1.013 | 0.002 | 5.322 | 0.00 | 1.008 | 1.018 |

This will produce a table where:

`estimate`

is the odds ratio`std.error`

is the standard error of the odds ratio`statistic`

is the \(z\)-statistic`p.value`

is the significance`conf.low`

is the lower level of the 95% confidence interval for the odds ratios`conf.high`

is the upper level of the 95% confidence interval for the odds ratios

Note that odds ratios are simply the exponentiated coefficients from the logit model. For example, the coefficient for `educ`

was -.252. The odds ratio is \(\exp(-.252) = .777\). The standard errors for the odds ratio are based on the delta method. The 95% confidence interval around the odds ratios are the exponentiated 95% confidence intervals from the logit model.

## Interpretation of Odds Ratios

The coefficients returned by our logit model are difficult to interpret intuitively, and hence it is common to report odds ratios instead. An odds ratio less than one means that an increase in \(x\) leads to a decrease in the odds that \(y = 1\). An odds ratio greater than one means that an increase in \(x\) leads to an increase in the odds that \(y = 1\). In general, the percent change in the odds given a one-unit change in the predictor can be determined as

\[ \% \text{ Change in Odds} = 100(OR - 1) \]

For example, the odds of voting for Trump are \(100(.701 - 1) = -29.9\%\) lower for females compared to males. In addition, each increase on the education scale leads to a \(100(.777 - 1) = -22.3\%\) decrease in the odds of voting for Trump. Finally, each one year increase in age leads to a \(100(1.013 - 1) = 1.3\%\) increase in the odds of voting for Trump. All of these are statistically significant at \(p < .05\).

## Interpretation in Terms of Predicted Probabilities

Odds ratios are commonly reported, but they are still somewhat difficult to intuit given that an odds ratio requires four separate probabilities:

\[ \text{Odds Ratio} = \left(\frac{p(y = 1 \mid x + 1)}{p(y = 0 \mid x + 1)}\right)\bigg/ \left(\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}\right) \]

It’s much easier to think directly in terms of probabilities. However, due to the nonlinearity of the model, it is not possible to talk about a one-unit change in an independent having a constant effect on the probability. Instead, predicted probabilities require us to also take into account the other variables in the model. For example, the difference in the probability of voting for Trump between males in females may be different depending on if we are talking about educated voters in their 30s or uneducated voters in their 60s.

In R we can find predicted probabilities using the `augment`

function from the `broom`

package, which will append predicted probabilities from our model to any data frame we provide it. Let’s say we wanted to get predicted probabilities for both genders across the range of ages 20-70, holding `educ = 4`

(college degree). We can specify:

```
at_vals <- expand.grid(age = seq(20, 75, by = 5),
gender_factor = c("Male", "Female"),
educ = 4)
pred_prob <- augment(mod_log, type.predict = 'response', newdata = at_vals) %>%
mutate(lower = .fitted - 1.96*.se.fit,
upper = .fitted + 1.96*.se.fit) %>%
mutate_if(is.numeric, ~ round(.,3))
```

The `expand.grid`

function will create a new object that contains every combination of age, gender, and education levels that are provided as arguments. Thus, it is possible to create a new data file with the specific combination of values in which we are interested. The `augment`

function will then append the predicted probability of voting for Trump to every one of the combinations that are specified. Note the variable names in `expand.grid`

must exactly match the names of the independent variables specified in the previous call to `glm`

. The resulting table is following:

```
pred_prob %>%
kable(col.names = c("Age",
"Gender",
"Education",
"Pred. Prob",
"SE",
"Lower",
"Upper"),
align = 'c')
```

Age | Gender | Education | Pred. Prob | SE | Lower | Upper |
---|---|---|---|---|---|---|

20 | Male | 4 | 0.385 | 0.024 | 0.337 | 0.432 |

25 | Male | 4 | 0.400 | 0.022 | 0.356 | 0.444 |

30 | Male | 4 | 0.416 | 0.021 | 0.376 | 0.456 |

35 | Male | 4 | 0.432 | 0.019 | 0.395 | 0.469 |

40 | Male | 4 | 0.448 | 0.018 | 0.413 | 0.483 |

45 | Male | 4 | 0.464 | 0.017 | 0.431 | 0.497 |

50 | Male | 4 | 0.481 | 0.016 | 0.448 | 0.513 |

55 | Male | 4 | 0.497 | 0.016 | 0.465 | 0.529 |

60 | Male | 4 | 0.513 | 0.017 | 0.480 | 0.547 |

65 | Male | 4 | 0.530 | 0.018 | 0.494 | 0.565 |

70 | Male | 4 | 0.546 | 0.020 | 0.507 | 0.584 |

75 | Male | 4 | 0.562 | 0.021 | 0.520 | 0.604 |

20 | Female | 4 | 0.304 | 0.021 | 0.263 | 0.346 |

25 | Female | 4 | 0.318 | 0.020 | 0.280 | 0.357 |

30 | Female | 4 | 0.333 | 0.018 | 0.297 | 0.369 |

35 | Female | 4 | 0.347 | 0.017 | 0.314 | 0.381 |

40 | Female | 4 | 0.362 | 0.016 | 0.331 | 0.394 |

45 | Female | 4 | 0.378 | 0.015 | 0.348 | 0.407 |

50 | Female | 4 | 0.393 | 0.015 | 0.364 | 0.422 |

55 | Female | 4 | 0.409 | 0.015 | 0.379 | 0.439 |

60 | Female | 4 | 0.425 | 0.016 | 0.393 | 0.456 |

65 | Female | 4 | 0.441 | 0.017 | 0.407 | 0.475 |

70 | Female | 4 | 0.457 | 0.019 | 0.420 | 0.494 |

75 | Female | 4 | 0.473 | 0.021 | 0.432 | 0.514 |

In the table:

- The first three columns are the values of the independent variables for which the predicted probability is calculated.
`Pred. Prob`

is the predicted probability.`SE`

is the standard error of the predicted probability.`Lower`

and`Upper`

give the 95% confidence interval around the predicted probability.

The probability that a 35-year-old, college-educated male votes for Trump is .432, 95% CI = [.395, .469].

This is a lot of output, so we can better represent it with a plot.

```
plotdata <- pred_prob %>%
rename(
predprob = .fitted,
se = .se.fit,
gender = gender_factor)
plotdata %>%
ggplot(aes(x = age,
y = predprob,
color = gender)) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin = lower,
ymax = upper),
width = .1) +
xlab("Age") +
ylab("Predicted Probability of Voting for Trump") +
labs(color = "Gender")
```

We get the predicted probabilities plotted across the range of ages, with separate lines for male and female, holding education constant at a college degree. At every five years, we also get error bars corresponding to the 95% confidence interval around the predicted probability. Based on the model, the probability of voting for Trump increases in age, but it is always higher for males than females.