# Using R to Estimate a Logistic Regression Model

Nikki Kamouneh

This post outlines the steps for performing a logistic regression in R. The data come from the 2016 American National Election Studies (ANES). Code for preparing the data can be found on our GitHub page, and the cleaned data can be downloaded here.

The steps that will be covered are the following:

1. Check variable codings and distributions
2. Graphically review bivariate associations
3. Fit the logit model
4. Interpret results in terms of odds ratios
5. Interpret results in terms of predicted probabilities

The variables we use will be:

• vote: Whether the respondent voted for Clinton or Trump
• gender: Male or female
• age: The age (in years) of the respondent
• educ: The highest level of education attained

For simplicity, this demonstration will ignore the complex survey variables (weight, PSU, and strata).

## Univariate Summaries

Let’s first load the packages we will use and the data.

library(tidyverse)
library(sjlabelled)
library(haven)
library(knitr)
library(broom)

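# Read the Stata file from the data folder and strip the Stata labels.
# The file name is not given in this post: "data/FILE.dta" is a placeholder
# to replace with the path to the downloaded data.
datafull <- read_dta("data/FILE.dta") %>%
  haven::zap_labels()

# Keep only the variables used in this demonstration
tbl <- datafull %>%
  select(c("vote", "gender", "age", "educ"))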

The data are in Stata (.dta) format, so we use the read_dta function from haven to read them in. Also, note that the read_dta call assumes that the data are saved in the data folder inside the current working directory. By default, read_dta imports the variable and value labels from Stata as variable attributes in R. These attributes sometimes conflict with other R functions, so we use zap_labels() to remove them.
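To make the effect of zap_labels() concrete, here is a minimal sketch on a toy vector. The vector `gender_lbl` and its labels are made up for illustration and only mimic what an imported Stata variable might look like; they are not taken from the ANES file.

```r
library(haven)

# Hypothetical labelled vector, similar in spirit to an imported Stata variable
gender_lbl <- labelled(c(1, 2, 2, 1),
                       labels = c(Male = 1, Female = 2),
                       label  = "Respondent gender")

class(gender_lbl)              # includes "haven_labelled"
attr(gender_lbl, "labels")     # the value labels carried over from Stata

# zap_labels() drops the value labels and the labelled class,
# leaving the underlying numeric codes untouched
gender_plain <- zap_labels(gender_lbl)

class(gender_plain)
attr(gender_plain, "labels")   # NULL
```

The codes themselves (1 and 2) are unchanged, which is why the variables are later converted to factors by hand.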

All of the variables we want to use are numeric. We will convert them to labeled factors to facilitate interpretation and the construction of graphs.

tbl <- tbl %>%
  mutate(gender_factor = factor(gender,
                                levels = 1:2,
                                labels = c("Male", "Female")),
         vote_factor   = factor(vote,
                                levels = 1:2,
                                labels = c("Clinton", "Trump")),
         educ_factor   = factor(educ,
                                levels = 1:5,
                                labels = c("HS Not Completed",
                                           "Completed HS",
                                           "College <4 Years",
                                           "College 4 Year Degree",
                                           "Advanced Degree")))
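One detail of factor() worth knowing when recoding this way: any value that is not listed in `levels` is silently converted to NA. The sketch below uses a made-up vector `educ_codes`, where the code 9 is a hypothetical stand-in for an undocumented or missing-value code; it is not a claim about how the ANES data are coded.

```r
# Hypothetical education codes; 9 is not among the declared levels 1:5
educ_codes <- c(1, 3, 5, 9, 2)

educ_demo <- factor(educ_codes,
                    levels = 1:5,
                    labels = c("HS Not Completed",
                               "Completed HS",
                               "College <4 Years",
                               "College 4 Year Degree",
                               "Advanced Degree"))

educ_demo                          # the fourth element is NA
table(educ_demo, useNA = "ifany")  # tabulate, showing the NA count as well
```

Tabulating the new factor with `useNA = "ifany"` is a quick way to see whether the recode produced any missing values.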

The first step in any statistical analysis should be to perform a visual inspection of the data in order to check for coding errors, outliers, or funky distributions. The variable vote is the dependent variable. We can check the distribution of responses using the count function:

tbl %>%
  count(vote_factor) %>%
  mutate(prop = prop.table(n)) %>%
  kable(align = c("l","c","c"),
        col.names = c("Vote", "N", "Proportion"),
        digits = 3)

| Vote    |  N   | Proportion |
|:--------|:----:|:----------:|
| Clinton | 1269 |    0.52    |
| Trump   | 1171 |    0.48    |
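Since one goal of this step is to catch coding errors, it can also help to cross-tabulate an original numeric variable against its factor version and confirm that each code maps to exactly one label. A minimal sketch, using a made-up tibble `toy` in place of the real data:

```r
library(dplyr)

# Hypothetical stand-in for tbl, recoded the same way as vote above
toy <- tibble(vote = c(1, 2, 2, 1, 1)) %>%
  mutate(vote_factor = factor(vote,
                              levels = 1:2,
                              labels = c("Clinton", "Trump")))

# One row per (code, label) pair; a code paired with NA or with
# more than one label would point to a recoding problem
check <- toy %>%
  count(vote, vote_factor)

check
```

With a clean recode the result has as many rows as there are distinct codes.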

Now review the distribution for the other categorical variables.

tbl %>%
  count(gender_factor) %>%
  mutate(prop = prop.table(n)) %>%
  kable(align = c("l","c","c"),
        col.names = c("Gender", "N", "Proportion"),
        digits = 3)

| Gender |  N   | Proportion |
|:-------|:----:|:----------:|
| Male   | 1128 |   0.462    |
| Female | 1312 |   0.538    |

tbl %>%
  count(educ_factor) %>%
  mutate(prop = prop.table(n)) %>%
  kable(align = c("l","c","c"),
        col.names = c("Education", "N", "Proportion"),
        digits = 3)

| Education             |  N  | Proportion |
|:----------------------|:---:|:----------:|
| HS Not Completed      | 102 |   0.042    |
| Completed HS          | 381 |   0.156    |
| College <4 Years      | 838 |   0.343    |
| College 4 Year Degree | 624 |   0.256    |
| Advanced Degree       | 479 |   0.196    |
| NA                    | 16  |   0.007    |
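Because the education table has an NA row, its proportions are computed over all respondents, including those with missing education. If proportions among valid responses are wanted instead, one option is to drop the missing values before counting. A minimal sketch, assuming a made-up tibble `toy_educ` with hypothetical categories "A" and "B" rather than the ANES data:

```r
library(dplyr)

# Hypothetical data with one missing value
toy_educ <- tibble(educ_factor = factor(c("A", "B", "B", NA, "A", "B")))

# Drop missing values first, then compute proportions among valid responses
valid_props <- toy_educ %>%
  filter(!is.na(educ_factor)) %>%
  count(educ_factor) %>%
  mutate(prop = n / sum(n))   # same result as prop.table(n)

valid_props
```

Which denominator is appropriate depends on the question being asked; either way it is worth reporting how many cases are missing.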

We can also check out the distribution of age. We will create a table of summary statistics using the summarise function.

tbl %>%
  summarise(Min    = min(age, na.rm = TRUE),
            Max    = max(age, na.rm = TRUE),
            Median = median(age, na.rm = TRUE),
            Mean   = mean(age, na.rm = TRUE),
            SD     = sd(age, na.rm = TRUE)) %>%
  mutate_all(~round(., 3)) %>%
  kable(align = rep("c", 5))   # one alignment entry per column

| Min | Max | Median |  Mean  |  SD   |
|:---:|:---:|:------:|:------:|:-----:|
| 18  | 90  |   54   | 51.998 | 17.19 |

The Min value is the lowest observed age, which is 18. The Max value is the highest, which is 90. The Median age is 54, and the Mean age is about 52, with a standard deviation of 17.19.

Tables are useful, but often graphs are more informative. Bar graphs are the easiest for examining categorical variables. Start with the outcome variable.

tbl %>%
  # after_stat() replaces the deprecated ..prop.. / ..count.. notation
  ggplot(aes(x = vote_factor, y = after_stat(prop), group = 1)) +
  geom_bar(fill = 'firebrick', color = 'black') +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -1) +
  xlab("Vote") +
  ylab("Proportion") +
  ylim(0, .55)
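An alternative way to build the same kind of plot is to compute the proportions first and then draw them with geom_col(), which some readers find easier to follow than computed aesthetics. This is only a sketch of that alternative, not the approach used in this post; the tibble `toy_vote` is made up and stands in for the real data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for tbl
toy_vote <- tibble(
  vote_factor = factor(c("Clinton", "Trump", "Clinton", "Clinton", "Trump"),
                       levels = c("Clinton", "Trump"))
)

# Step 1: counts and proportions as an ordinary data frame
vote_props <- toy_vote %>%
  count(vote_factor) %>%
  mutate(prop = n / sum(n))

# Step 2: plot the precomputed proportions, labelling each bar with its count
p <- ggplot(vote_props, aes(x = vote_factor, y = prop)) +
  geom_col(fill = "firebrick", color = "black") +
  geom_text(aes(label = n), vjust = -1) +
  labs(x = "Vote", y = "Proportion") +
  ylim(0, 0.7)   # headroom so the count labels are not clipped

p
```

A side benefit is that `vote_props` can be printed or passed to kable(), so the table and the graph come from the same numbers.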