# Logistic Regression in R

Nikki Kamouneh

Posted on
logisitic regression logit R

This post outlines the steps for performing a logistic regression in R. The data come from the 2016 American National Election Survey. Code for preparing the data can be found on our github page, and the cleaned data can be downloaded here.

The steps that will be covered are the following:

1. Check variable codings and distributions
2. Graphically review bivariate associations
3. Fit the logit model
4. Interpret results in terms of odds ratios
5. Interpret results in terms of predicted probabilities

The variables we use will be:

• vote: Whether the respondent voted for Clinton or Trump
• gender: Male or female
• age: The age (in years) of the respondent
• educ: The highest level of education attained

For simplicity, this demonstration will ignore the complex survey variables (weight, PSU, and strata).

## Univariate Summaries

Let’s first load the packages we will use and the data.

library(tidyverse)
library(sjlabelled)
library(haven)
library(knitr)
library(broom)

haven::zap_labels()

tbl      <- datafull %>%
select(c("vote", "gender", "age", "educ"))

The data are in Stata (.dta) format, so use haven to read it in. Also, note that the read_dta function assumes that the data are saved in the data folder inside the current working directory. By default, read_dta will import the variable and value labels from Stata as variable attributes in R. This sometimes creates conflicts with certain R functions, we zap_labels() to remove them.

All of the variables we want to use are numeric. We will convert them to labeled factors to facilitate interpretation and the construction of graphs.

tbl <- tbl %>%
mutate(gender_factor = factor(gender,
levels = 1:2,
labels = c("Male", "Female")),
vote_factor   = factor(vote,
levels = 1:2,
labels = c("Clinton", "Trump")),
educ_factor   = factor(educ,
levels = 1:5,
labels = c("HS Not Completed",
"Completed HS",
"College <4 Years",
"College 4 Year Degree",
"Advanced Degree")))

The first step in any statistical analysis should be to perform a visual inspection of the data in order to check for coding errors, outliers, or funky distributions. The variable vote is the dependent variable. We can check the distribution of responses using the count function:

tbl %>%
count(vote_factor) %>%
mutate(prop = prop.table(n)) %>%
kable(align = c("l","c","c"),
col.names = c("Vote", "N", "Proportion"),
digits = 3)
Vote N Proportion
Clinton 1269 0.52
Trump 1171 0.48

Now review the distribution for the other categorical variables.

tbl %>%
count(gender_factor) %>%
mutate(prop = prop.table(n))%>%
kable(align = c("l","c","c"),
col.names = c("Gender", "N", "Proportion"),
digits = 3)
Gender N Proportion
Male 1128 0.462
Female 1312 0.538
tbl %>%
count(educ_factor) %>%
mutate(prop = prop.table(n))%>%
kable(align = c("l","c","c"),
col.names = c("Education", "N", "Proportion"),
digits = 3)
Education N Proportion
HS Not Completed 102 0.042
Completed HS 381 0.156
College <4 Years 838 0.343
College 4 Year Degree 624 0.256
NA 16 0.007

We can also check out the distribution of age. We will create a table of summary statistics using the summarise function.

tbl %>%
summarise(Min    = min(age, na.rm = T),
Max    = max(age, na.rm = T),
Median = median(age, na.rm = T),
Mean   = mean(age, na.rm = T),
SD     = sd(age, na.rm = T)) %>%
mutate_all(~round(., 3)) %>%
kable(align = rep("c", 4))
Min Max Median Mean SD
18 90 54 51.998 17.19

The Min value is the lowest observed age, which is 18. The Max value is the largest, which is 90. The Median age is 54, and the Mean age is 52 with a standard deviation of 17.19.

Tables are useful, but often graphs are more informative. Bar graphs are the easiest for examining categorical variables. Start with the outcome variable.

tbl %>%
ggplot(aes(x=vote_factor, y = ..prop.., group = 1)) +
geom_bar(fill = 'firebrick', color = 'black') +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
xlab("Vote") +
ylab("Proportion") +
ylim(0,.55)

In the sample, Clinton received more votes than Trump, but not by a large amount.

Now the categorical independent variables.

tbl %>%
ggplot(aes(x=gender_factor, y = ..prop.., group = 1)) +
geom_bar(fill = 'firebrick', color = 'black')  +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
xlab("Gender") +
ylab("Proportion") +
ylim(0,.55)

We see that our sample has more females than males.

tbl %>%
ggplot(aes(x=educ_factor, y = ..prop.., group = 1)) +
geom_bar(fill = 'firebrick', color = 'black') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
xlab("Education") +
ylab("Proportion") +
ylim(0,.4)

Within our sample, the modal respondent has some college, with the second most populated category being college educated.

For continuous variables, histograms allow us to determine the shape of the distribution and look for outliers. The following command produces the histogram.

tbl %>%
ggplot(aes(x=age)) +
geom_histogram(fill = 'firebrick', color = 'black', bins = 20) +
xlab("Age") +
ylab("Count")

tbl %>%
ggplot(aes(x=age, y = ..density.., group = 1)) +
stat_bin(binwidth = 5,  fill = 'firebrick', color = 'black') +
xlab("Age") +
ylab("Proportion")

We now have a good sense as to what the distributions of all of our variables are and do not see any evidence that recodes are necessary.

## Bivariate Summaries

Prior to moving on to the fully specified model, it is advisable to first examine the simple associations between the outcome and each individual predictor. Doing so can help avoid surprises in the final model. For example, if there is no simple relationship apparent in the data, we shouldn’t be taken aback when that predictor is not significant in the model. If there is a simple association, but it disappears in the full model, then we have evidence that one of the other variables is a confounder. Upon controlling for that factor, the relationship we initially observed is explained away.

Graphs are again helpful. When the outcome is categorical and the predictor is also categorical, a grouped bar graph is informative. The following is the graph of vote choice and gender.

tbl %>%
ggplot(aes(x=gender_factor,
y = ..prop..,
group = vote,
fill = vote_factor)) +
geom_bar( position = 'dodge') +
geom_text(stat='count',
aes(label=..count..),
vjust=-1,
position = position_dodge(width = 1)) +
xlab("Gender") +
ylab("Proportion") +
ylim(0,.65) +
labs(fill = "Vote")

The figure shows that, within males, Trump support was higher. Within females, Clinton support was higher.

A similar figure can be made for education.

tbl %>%
ggplot(aes(x=educ_factor,
y = ..prop..,
group = vote,
fill = vote_factor)) +
geom_bar(position = 'dodge') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(stat='count',
aes(label=..count..),
vjust=-1,
position = position_dodge(width = 1),
size = 3) +
xlab("Education") +
ylab("Proportion") +
ylim(0,.45) +
labs(fill = "Vote")

The figure suggests that Trump was favored by those with a high school diploma and some college, whereas Clinton’s support was higher with those who finished college and especially among those with an advanced degree. Although Clinton was slightly preferred among those without a high school diploma, the figure overall favors an interpretation that Clinton’s support increases with education.

Boxplots are useful for examining the association between a categorical variable and a variable measured on an interval scale.

tbl %>%
ggplot(aes(x = vote_factor,
y = age,
group = vote_factor,
fill = vote_factor)) +
geom_boxplot() +
coord_flip() +
ylab("Age") +
xlab("Vote") +
theme(legend.position = "none")

The coord_flip() function is used to keep the dependent variable on the y-axis.

There’s a lot of overlap between the two boxes, though the Trump box sits a little higher than the Clinton box. The interpretation is that older respondents tend to be more likely to vote for Trump.

Having carefully reviewed the data, we can now move to estimating the model.

## Fitting the Model

To fit a logistic regression in R, we will use the glm function, which stands for Generalized Linear Model. Within this function, write the dependent variable, followed by ~, and then the independent variables separated by +’s. When the family is specified as binomial, R defaults to fitting a logit model. Note that we are also treating educ as a numeric variable here.

mod_log <-  tbl %>%
mutate(vote = as.factor(vote)) %>%
glm(formula = vote ~ gender_factor + educ + age, family = binomial)

The broom package contains useful functions, like tidy(), for viewing the output from common models. The conf.int = T argument requests confidence intervals.

mod_log %>%
tidy(conf.int = T) %>%
mutate_if(is.numeric, ~ round(.,3)) %>%
kable(align = c("l", rep("c", 4)))
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.275 0.196 1.406 0.16 -0.108 0.660
gender_factorFemale -0.356 0.084 -4.230 0.00 -0.521 -0.191
educ -0.252 0.039 -6.507 0.00 -0.328 -0.176
age 0.013 0.002 5.322 0.00 0.008 0.018

Because gender_factor is a factor variable, R will automatically create dummy variables for us. educ is an ordered categorical variable, we opt here to treat its effect as linear. The last variable is age. We find that gender, age, and educ have significant effects.

We can also look at our results in terms of odds ratios by setting exponentiate = T in the tidy() function.

mod_log %>%
tidy(exponentiate = T, conf.int = T) %>%
mutate_if(is.numeric, ~ round(.,3)) %>%
kable(align = c("l", rep("c", 6)))
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.317 0.196 1.406 0.16 0.897 1.935
gender_factorFemale 0.701 0.084 -4.230 0.00 0.594 0.826
educ 0.777 0.039 -6.507 0.00 0.720 0.838
age 1.013 0.002 5.322 0.00 1.008 1.018

This will produce a table where:

• estimate is the odds ratio
• std.error is the standard error of the odds ratio
• statistic is the $$z$$-statistic
• p.value is the significance
• conf.low is the lower level of the 95% confidence interval for the odds ratios
• conf.high is the upper level of the 95% confidence interval for the odds ratios

Note that odds ratios are simply the exponentiated coefficients from the logit model. For example, the coefficient for educ was -.252. The odds ratio is $$\exp(-.252) = .777$$. The standard errors for the odds ratio are based on the delta method. The 95% confidence interval around the odds ratios are the exponentiated 95% confidence intervals from the logit model.

## Interpretation of Odds Ratios

The coefficients returned by our logit model are difficult to interpret intuitively, and hence it is common to report odds ratios instead. An odds ratio less than one means that an increase in $$x$$ leads to a decrease in the odds that $$y = 1$$. An odds ratio greater than one means that an increase in $$x$$ leads to an increase in the odds that $$y = 1$$. In general, the percent change in the odds given a one-unit change in the predictor can be determined as

$\% \text{ Change in Odds} = 100(OR - 1)$

For example, the odds of voting for Trump are $$100(.701 - 1) = -29.9\%$$ lower for females compared to males. In addition, each increase on the education scale leads to a $$100(.777 - 1) = -22.3\%$$ decrease in the odds of voting for Trump. Finally, each one year increase in age leads to a $$100(1.013 - 1) = 1.3\%$$ increase in the odds of voting for Trump. All of these are statistically significant at $$p < .05$$.

## Interpretation in Terms of Predicted Probabilities

Odds ratios are commonly reported, but they are still somewhat difficult to intuit given that an odds ratio requires four separate probabilities:

$\text{Odds Ratio} = \left(\frac{p(y = 1 \mid x + 1)}{p(y = 0 \mid x + 1)}\right)\bigg/ \left(\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}\right)$

It’s much easier to think directly in terms of probabilities. However, due to the nonlinearity of the model, it is not possible to talk about a one-unit change in an independent having a constant effect on the probability. Instead, predicted probabilities require us to also take into account the other variables in the model. For example, the difference in the probability of voting for Trump between males in females may be different depending on if we are talking about educated voters in their 30s or uneducated voters in their 60s.

In R we can find predicted probabilities using the augment function from the broom package, which will append predicted probabilities from our model to any data frame we provide it. Let’s say we wanted to get predicted probabilities for both genders across the range of ages 20-70, holding educ = 4 (college degree). We can specify:

at_vals <- expand.grid(age = seq(20, 75, by = 5),
gender_factor = c("Male", "Female"),
educ = 4)

pred_prob <- augment(mod_log, type.predict = 'response', newdata = at_vals) %>%
mutate(lower = .fitted - 1.96*.se.fit,
upper = .fitted + 1.96*.se.fit) %>%
mutate_if(is.numeric, ~ round(.,3)) 

The expand.grid function will create a new object that contains every combination of age, gender, and education levels that are provided as arguments. Thus, it is possible to create a new data file with the specific combination of values in which we are interested. The augment function will then append the predicted probability of voting for Trump to every one of the combinations that are specified. Note the variable names in expand.grid must exactly match the names of the independent variables specified in the previous call to glm. The resulting table is following:

pred_prob %>%
kable(col.names = c("Age",
"Gender",
"Education",
"Pred. Prob",
"SE",
"Lower",
"Upper"),
align = 'c')
Age Gender Education Pred. Prob SE Lower Upper
20 Male 4 0.385 0.024 0.337 0.432
25 Male 4 0.400 0.022 0.356 0.444
30 Male 4 0.416 0.021 0.376 0.456
35 Male 4 0.432 0.019 0.395 0.469
40 Male 4 0.448 0.018 0.413 0.483
45 Male 4 0.464 0.017 0.431 0.497
50 Male 4 0.481 0.016 0.448 0.513
55 Male 4 0.497 0.016 0.465 0.529
60 Male 4 0.513 0.017 0.480 0.547
65 Male 4 0.530 0.018 0.494 0.565
70 Male 4 0.546 0.020 0.507 0.584
75 Male 4 0.562 0.021 0.520 0.604
20 Female 4 0.304 0.021 0.263 0.346
25 Female 4 0.318 0.020 0.280 0.357
30 Female 4 0.333 0.018 0.297 0.369
35 Female 4 0.347 0.017 0.314 0.381
40 Female 4 0.362 0.016 0.331 0.394
45 Female 4 0.378 0.015 0.348 0.407
50 Female 4 0.393 0.015 0.364 0.422
55 Female 4 0.409 0.015 0.379 0.439
60 Female 4 0.425 0.016 0.393 0.456
65 Female 4 0.441 0.017 0.407 0.475
70 Female 4 0.457 0.019 0.420 0.494
75 Female 4 0.473 0.021 0.432 0.514

In the table:

• The first three columns are the values of the independent variables for which the predicted probability is calculated.
• Pred. Prob is the predicted probability.
• SE is the standard error of the predicted probability.
• Lower and Upper give the 95% confidence interval around the predicted probability.

The probability that a 35-year-old, college-educated male votes for Trump is .432, 95% CI = [.395, .469].

This is a lot of output, so we can better represent it with a plot.

plotdata <- pred_prob %>%
rename(
predprob = .fitted,
se       = .se.fit,
gender   = gender_factor)

plotdata %>%
ggplot(aes(x = age,
y = predprob,
color = gender)) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin = lower,
ymax = upper),
width = .1) +
xlab("Age") +
ylab("Predicted Probability of Voting for Trump") +
labs(color = "Gender")

We get the predicted probabilities plotted across the range of ages, with separate lines for male and female, holding education constant at a college degree. At every five years, we also get error bars corresponding to the 95% confidence interval around the predicted probability. Based on the model, the probability of voting for Trump increases in age, but it is always higher for males than females.