This post outlines the steps for performing a logistic regression in R. The data come from the 2016 American National Election Survey. Code for preparing the data can be found on our github page, and the cleaned data can be downloaded here.
The steps that will be covered are the following:
- Check variable codings and distributions
- Graphically review bivariate associations
- Fit the logit model
- Interpret results in terms of odds ratios
- Interpret results in terms of predicted probabilities
The variables we use will be:
vote
: Whether the respondent voted for Clinton or Trump

gender
: Male or female

age
: The age (in years) of the respondent

educ
: The highest level of education attained
For simplicity, this demonstration will ignore the complex survey variables (weight, PSU, and strata).
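If you did want to account for the survey design, the survey package's svyglm() function can incorporate the weights, PSUs, and strata. Below is a minimal sketch, not run in this post; it assumes the cleaned file contains columns named weight, psu, and strata, so check the actual names before using it.

library(haven)
library(survey)

# Column names weight, psu, and strata are assumptions; adjust to match the file
anes <- haven::zap_labels(read_dta("data/cleaned-anes.dta"))

des <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight,
                 data = anes, nest = TRUE)

# vote is coded 1 = Clinton, 2 = Trump, so model the probability of a Trump vote
svyglm(I(vote == 2) ~ gender + educ + age, design = des, family = quasibinomial())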
Univariate Summaries
Let’s first load the packages we will use and the data.
library(tidyverse)
library(sjlabelled)
library(haven)
library(knitr)
library(broom)
datafull <- read_dta("data/cleaned-anes.dta") %>%
haven::zap_labels()
tbl <- datafull %>%
select(c("vote", "gender", "age", "educ"))
The data are in Stata (.dta) format, so we use haven to read them in. Note that the read_dta function assumes the data are saved in a data folder inside the current working directory. By default, read_dta will import the variable and value labels from Stata as variable attributes in R. Because this sometimes creates conflicts with certain R functions, we use zap_labels() to remove them.
All of the variables we want to use are numeric. We will convert them to labeled factors to facilitate interpretation and the construction of graphs.
tbl <- tbl %>%
mutate(gender_factor = factor(gender,
levels = 1:2,
labels = c("Male", "Female")),
vote_factor = factor(vote,
levels = 1:2,
labels = c("Clinton", "Trump")),
educ_factor = factor(educ,
levels = 1:5,
labels = c("HS Not Completed",
"Completed HS",
"College <4 Years",
"College 4 Year Degree",
"Advanced Degree")))
The first step in any statistical analysis should be to perform a visual inspection of the data in order to check for coding errors, outliers, or funky distributions. The variable vote
is the dependent variable. We can check the distribution of responses using the count
function:
tbl %>%
count(vote_factor) %>%
mutate(prop = prop.table(n)) %>%
kable(align = c("l","c","c"),
col.names = c("Vote", "N", "Proportion"),
digits = 3)
Vote | N | Proportion |
---|---|---|
Clinton | 1269 | 0.52 |
Trump | 1171 | 0.48 |
Now review the distribution for the other categorical variables.
tbl %>%
count(gender_factor) %>%
mutate(prop = prop.table(n))%>%
kable(align = c("l","c","c"),
col.names = c("Gender", "N", "Proportion"),
digits = 3)
Gender | N | Proportion |
---|---|---|
Male | 1128 | 0.462 |
Female | 1312 | 0.538 |
tbl %>%
count(educ_factor) %>%
mutate(prop = prop.table(n))%>%
kable(align = c("l","c","c"),
col.names = c("Education", "N", "Proportion"),
digits = 3)
Education | N | Proportion |
---|---|---|
HS Not Completed | 102 | 0.042 |
Completed HS | 381 | 0.156 |
College <4 Years | 838 | 0.343 |
College 4 Year Degree | 624 | 0.256 |
Advanced Degree | 479 | 0.196 |
NA | 16 | 0.007 |
We can also check out the distribution of age. We will create a table of summary statistics using the summarise
function.
tbl %>%
summarise(Min = min(age, na.rm = T),
Max = max(age, na.rm = T),
Median = median(age, na.rm = T),
Mean = mean(age, na.rm = T),
SD = sd(age, na.rm = T)) %>%
mutate_all(~round(., 3)) %>%
kable(align = rep("c", 5))
Min | Max | Median | Mean | SD |
---|---|---|---|---|
18 | 90 | 54 | 51.998 | 17.19 |
The Min
value is the lowest observed age, which is 18. The Max
value is the largest, which is 90. The Median
age is 54, and the Mean
age is 52 with a standard deviation of 17.19.
Tables are useful, but often graphs are more informative. Bar graphs are the easiest for examining categorical variables. Start with the outcome variable.
tbl %>%
ggplot(aes(x=vote_factor, y = ..prop.., group = 1)) +
geom_bar(fill = 'firebrick', color = 'black') +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
xlab("Vote") +
ylab("Proportion") +
ylim(0,.55)
In the sample, Clinton received more votes than Trump, but not by a large margin.
Now the categorical independent variables.
tbl %>%
ggplot(aes(x=gender_factor, y = ..prop.., group = 1)) +
geom_bar(fill = 'firebrick', color = 'black') +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
xlab("Gender") +
ylab("Proportion") +
ylim(0,.55)
We see that our sample has more females than males.
tbl %>%
ggplot(aes(x=educ_factor, y = ..prop.., group = 1)) +
geom_bar(fill = 'firebrick', color = 'black') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
xlab("Education") +
ylab("Proportion") +
ylim(0,.4)
Within our sample, the modal respondent has some college (fewer than four years), with the second most populated category being those with a four-year college degree.
For continuous variables, histograms allow us to determine the shape of the distribution and look for outliers. The following commands produce a histogram of counts and then one on the density scale.
tbl %>%
ggplot(aes(x=age)) +
geom_histogram(fill = 'firebrick', color = 'black', bins = 20) +
xlab("Age") +
ylab("Count")
tbl %>%
ggplot(aes(x=age, y = ..density.., group = 1)) +
stat_bin(binwidth = 5, fill = 'firebrick', color = 'black') +
xlab("Age") +
ylab("Density")
We now have a good sense of the distributions of all of our variables and see no evidence that any recodes are necessary.
Bivariate Summaries
Prior to moving on to the fully specified model, it is advisable to first examine the simple associations between the outcome and each individual predictor. Doing so can help avoid surprises in the final model. For example, if there is no simple relationship apparent in the data, we shouldn’t be taken aback when that predictor is not significant in the model. If there is a simple association, but it disappears in the full model, then we have evidence that one of the other variables is a confounder. Upon controlling for that factor, the relationship we initially observed is explained away.
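A quick numeric check can supplement the graphs that follow. For example, row proportions of vote within gender convey the same information as the first grouped bar graph below:

# Row proportions of vote within gender
tbl %>%
  count(gender_factor, vote_factor) %>%
  group_by(gender_factor) %>%
  mutate(prop = n / sum(n))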
Graphs are again helpful. When the outcome is categorical and the predictor is also categorical, a grouped bar graph is informative. The following is the graph of vote choice and gender.
tbl %>%
ggplot(aes(x=gender_factor,
y = ..prop..,
group = vote,
fill = vote_factor)) +
geom_bar( position = 'dodge') +
geom_text(stat='count',
aes(label=..count..),
vjust=-1,
position = position_dodge(width = 1)) +
xlab("Gender") +
ylab("Proportion") +
ylim(0,.65) +
labs(fill = "Vote")
The figure shows that Trump support was higher among males, while Clinton support was higher among females.
A similar figure can be made for education.
tbl %>%
ggplot(aes(x=educ_factor,
y = ..prop..,
group = vote,
fill = vote_factor)) +
geom_bar(position = 'dodge') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(stat='count',
aes(label=..count..),
vjust=-1,
position = position_dodge(width = 1),
size = 3) +
xlab("Education") +
ylab("Proportion") +
ylim(0,.45) +
labs(fill = "Vote")
The figure suggests that Trump was favored by those with a high school diploma and some college, whereas Clinton's support was higher among those who finished college, and especially among those with an advanced degree. Although Clinton was slightly preferred among those without a high school diploma, the figure overall favors an interpretation that Clinton's support increases with education.
Boxplots are useful for examining the association between a categorical variable and a variable measured on an interval scale.
tbl %>%
ggplot(aes(x = vote_factor,
y = age,
group = vote_factor,
fill = vote_factor)) +
geom_boxplot() +
coord_flip() +
ylab("Age") +
xlab("Vote") +
theme(legend.position = "none")
The coord_flip()
function is used to keep the dependent variable on the y-axis.
There’s a lot of overlap between the two boxes, though the Trump box sits a little higher than the Clinton box. The interpretation is that older respondents tend to be more likely to vote for Trump.
Having carefully reviewed the data, we can now move to estimating the model.
Fitting the Model
To fit a logistic regression in R, we will use the glm
function, which stands for Generalized Linear Model. Within this function, write the dependent variable, followed by ~
, and then the independent variables separated by +
’s. When the family is specified as binomial
, R defaults to fitting a logit model. Note that we are also treating educ
as a numeric variable here.
mod_log <- tbl %>%
mutate(vote = as.factor(vote)) %>%
glm(formula = vote ~ gender_factor + educ + age, family = binomial)
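If we preferred not to impose linearity, the educ_factor variable created earlier could be entered instead, giving each education level its own coefficient. A minimal sketch of that alternative specification:

# Alternative (not used below): education entered as a factor
mod_log_factor <- tbl %>%
  mutate(vote = as.factor(vote)) %>%
  glm(formula = vote ~ gender_factor + educ_factor + age, family = binomial)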
The broom
package contains useful functions, like tidy()
, for viewing the output from common models. The conf.int = T
argument requests confidence intervals.
mod_log %>%
tidy(conf.int = T) %>%
mutate_if(is.numeric, ~ round(.,3)) %>%
kable(align = c("l", rep("c", 6)))
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.275 | 0.196 | 1.406 | 0.16 | -0.108 | 0.660 |
gender_factorFemale | -0.356 | 0.084 | -4.230 | 0.00 | -0.521 | -0.191 |
educ | -0.252 | 0.039 | -6.507 | 0.00 | -0.328 | -0.176 |
age | 0.013 | 0.002 | 5.322 | 0.00 | 0.008 | 0.018 |
Because gender_factor
is a factor variable, R will automatically create dummy variables for us. educ
is an ordered categorical variable; we opt here to treat its effect as linear. The last variable is age
. We find that gender
, age
, and educ
have significant effects.
We can also look at our results in terms of odds ratios by setting exponentiate = T
in the tidy()
function.
mod_log %>%
tidy(exponentiate = T, conf.int = T) %>%
mutate_if(is.numeric, ~ round(.,3)) %>%
kable(align = c("l", rep("c", 6)))
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 1.317 | 0.196 | 1.406 | 0.16 | 0.897 | 1.935 |
gender_factorFemale | 0.701 | 0.084 | -4.230 | 0.00 | 0.594 | 0.826 |
educ | 0.777 | 0.039 | -6.507 | 0.00 | 0.720 | 0.838 |
age | 1.013 | 0.002 | 5.322 | 0.00 | 1.008 | 1.018 |
This will produce a table where:
- estimate is the odds ratio
- std.error is the standard error of the coefficient (note that it is unchanged from the previous table; tidy() leaves it on the log-odds scale)
- statistic is the \(z\)-statistic
- p.value is the significance
- conf.low is the lower bound of the 95% confidence interval for the odds ratio
- conf.high is the upper bound of the 95% confidence interval for the odds ratio
Note that odds ratios are simply the exponentiated coefficients from the logit model. For example, the coefficient for educ was -.252, so the odds ratio is \(\exp(-.252) = .777\). As the table shows, the standard errors are not transformed; they are carried over from the logit model on the log-odds scale. The 95% confidence intervals around the odds ratios, however, are the exponentiated 95% confidence intervals from the logit model.
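We can verify this directly:

# Odds ratios are the exponentiated logit coefficients
exp(coef(mod_log))

# The confidence intervals are the exponentiated (profile-likelihood) logit CIs
exp(confint(mod_log))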
Interpretation of Odds Ratios
The coefficients returned by our logit model are difficult to interpret intuitively, and hence it is common to report odds ratios instead. An odds ratio less than one means that an increase in \(x\) leads to a decrease in the odds that \(y = 1\). An odds ratio greater than one means that an increase in \(x\) leads to an increase in the odds that \(y = 1\). In general, the percent change in the odds given a one-unit change in the predictor can be determined as
\[ \% \text{ Change in Odds} = 100(OR - 1) \]
For example, \(100(.701 - 1) = -29.9\), so the odds of voting for Trump are 29.9% lower for females compared to males. In addition, each one-unit increase on the education scale leads to a \(100(.777 - 1) = -22.3\%\) change, that is, a 22.3% decrease, in the odds of voting for Trump. Finally, each one-year increase in age leads to a \(100(1.013 - 1) = 1.3\%\) increase in the odds of voting for Trump. All of these effects are statistically significant at \(p < .05\).
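The same arithmetic can be done in one step in R:

# Percent change in the odds for a one-unit increase in each predictor
or <- exp(coef(mod_log))[-1]  # odds ratios, dropping the intercept
100 * (or - 1)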
Interpretation in Terms of Predicted Probabilities
Odds ratios are commonly reported, but they are still somewhat difficult to intuit given that an odds ratio requires four separate probabilities:
\[ \text{Odds Ratio} = \left(\frac{p(y = 1 \mid x + 1)}{p(y = 0 \mid x + 1)}\right)\bigg/ \left(\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}\right) \]
It’s much easier to think directly in terms of probabilities. However, due to the nonlinearity of the model, it is not possible to talk about a one-unit change in an independent variable having a constant effect on the probability. Instead, predicted probabilities require us to also take into account the other variables in the model. For example, the difference in the probability of voting for Trump between males and females may differ depending on whether we are talking about college-educated voters in their 30s or less educated voters in their 60s.
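As an illustration of the formula above, the gender odds ratio can be recovered from two of the predicted probabilities computed below (35-year-olds with educ = 4; the values come from the prediction table later in the post):

# Predicted probabilities for 35-year-olds with a college degree (educ = 4)
p_male   <- 0.432
p_female <- 0.347

# The ratio of the two odds reproduces the gender odds ratio, ~.70, up to rounding
(p_female / (1 - p_female)) / (p_male / (1 - p_male))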
In R we can find predicted probabilities using the augment
function from the broom
package, which will append predicted probabilities from our model to any data frame we provide it. Let’s say we wanted to get predicted probabilities for both genders across the range of ages 20-75, holding educ = 4
(college degree). We can specify:
at_vals <- expand.grid(age = seq(20, 75, by = 5),
gender_factor = c("Male", "Female"),
educ = 4)
pred_prob <- augment(mod_log, type.predict = 'response', newdata = at_vals, se_fit = TRUE) %>%
mutate(lower = .fitted - 1.96*.se.fit,
upper = .fitted + 1.96*.se.fit) %>%
mutate_if(is.numeric, ~ round(.,3))
The expand.grid
function will create a new object that contains every combination of age, gender, and education levels that are provided as arguments. Thus, it is possible to create a new data file with the specific combination of values in which we are interested. The augment
function will then append the predicted probability of voting for Trump to every one of the combinations that are specified. Note the variable names in expand.grid
must exactly match the names of the independent variables specified in the previous call to glm
. The resulting table follows:
pred_prob %>%
kable(col.names = c("Age",
"Gender",
"Education",
"Pred. Prob",
"SE",
"Lower",
"Upper"),
align = 'c')
Age | Gender | Education | Pred. Prob | SE | Lower | Upper |
---|---|---|---|---|---|---|
20 | Male | 4 | 0.385 | 0.024 | 0.337 | 0.432 |
25 | Male | 4 | 0.400 | 0.022 | 0.356 | 0.444 |
30 | Male | 4 | 0.416 | 0.021 | 0.376 | 0.456 |
35 | Male | 4 | 0.432 | 0.019 | 0.395 | 0.469 |
40 | Male | 4 | 0.448 | 0.018 | 0.413 | 0.483 |
45 | Male | 4 | 0.464 | 0.017 | 0.431 | 0.497 |
50 | Male | 4 | 0.481 | 0.016 | 0.448 | 0.513 |
55 | Male | 4 | 0.497 | 0.016 | 0.465 | 0.529 |
60 | Male | 4 | 0.513 | 0.017 | 0.480 | 0.547 |
65 | Male | 4 | 0.530 | 0.018 | 0.494 | 0.565 |
70 | Male | 4 | 0.546 | 0.020 | 0.507 | 0.584 |
75 | Male | 4 | 0.562 | 0.021 | 0.520 | 0.604 |
20 | Female | 4 | 0.304 | 0.021 | 0.263 | 0.346 |
25 | Female | 4 | 0.318 | 0.020 | 0.280 | 0.357 |
30 | Female | 4 | 0.333 | 0.018 | 0.297 | 0.369 |
35 | Female | 4 | 0.347 | 0.017 | 0.314 | 0.381 |
40 | Female | 4 | 0.362 | 0.016 | 0.331 | 0.394 |
45 | Female | 4 | 0.378 | 0.015 | 0.348 | 0.407 |
50 | Female | 4 | 0.393 | 0.015 | 0.364 | 0.422 |
55 | Female | 4 | 0.409 | 0.015 | 0.379 | 0.439 |
60 | Female | 4 | 0.425 | 0.016 | 0.393 | 0.456 |
65 | Female | 4 | 0.441 | 0.017 | 0.407 | 0.475 |
70 | Female | 4 | 0.457 | 0.019 | 0.420 | 0.494 |
75 | Female | 4 | 0.473 | 0.021 | 0.432 | 0.514 |
In the table:
- The first three columns are the values of the independent variables for which the predicted probability is calculated.
- Pred. Prob is the predicted probability.
- SE is the standard error of the predicted probability.
- Lower and Upper give the 95% confidence interval around the predicted probability.
The probability that a 35-year-old, college-educated male votes for Trump is .432, 95% CI = [.395, .469].
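As a sanity check, the same value can be computed by hand from the model coefficients (males are the reference category, so the gender term drops out):

# Inverse logit of the linear predictor for a 35-year-old male with educ = 4
b <- coef(mod_log)
plogis(b["(Intercept)"] + b["educ"] * 4 + b["age"] * 35)  # ~ .432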
This is a lot of output, so we can better represent it with a plot.
plotdata <- pred_prob %>%
rename(
predprob = .fitted,
se = .se.fit,
gender = gender_factor)
plotdata %>%
ggplot(aes(x = age,
y = predprob,
color = gender)) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin = lower,
ymax = upper),
width = .1) +
xlab("Age") +
ylab("Predicted Probability of Voting for Trump") +
labs(color = "Gender")
We get the predicted probabilities plotted across the range of ages, with separate lines for males and females, holding education constant at a college degree. At every five years, we also get error bars corresponding to the 95% confidence interval around the predicted probability. Based on the model, the probability of voting for Trump increases with age, but it is always higher for males than for females.