# Using Stata to Perform Logistic Regression

Jeremy Albright

Tags: Stata, logistic regression, logit, odds ratios, predicted probabilities

This post outlines the steps for performing a logistic regression in Stata. The data come from the 2016 American National Election Studies (ANES) survey. Code for preparing the data can be found on our GitHub page, and the cleaned data can be downloaded here.

The steps that will be covered are the following:

1. Check variable codings and distributions
2. Graphically review bivariate associations
3. Fit the logit model in Stata
4. Interpret results in terms of odds ratios
5. Interpret results in terms of predicted probabilities

The variables we use will be:

• vote: Whether the respondent voted for Clinton or Trump
• gender: Male or Female
• age: The age (in years) of the respondent
• educ: The highest level of education attained

For simplicity, this demonstration will ignore the complex survey variables (weight, PSU, and strata).

## Univariate Summaries

The first step in any statistical analysis should be to inspect the data for coding errors, outliers, or unusual distributions. Note that Stata's logit command treats a value of zero as the negative outcome and any other nonmissing value as the positive outcome, so a binary outcome should be coded as zero and one. The variable vote is the dependent variable. How is it coded? We can check using the tab command:

tab vote

       vote |      Freq.     Percent        Cum.
------------+-----------------------------------
    Clinton |      1,269       52.01       52.01
      Trump |      1,171       47.99      100.00
------------+-----------------------------------
      Total |      2,440      100.00


The problem is that we don’t see the numeric value, just the label. There are a few workarounds, for example using the nolab option to tab or looking at label list. Here is what these look like:

tab vote, nolab

       vote |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,269       52.01       52.01
          2 |      1,171       47.99      100.00
------------+-----------------------------------
      Total |      2,440      100.00

label list vote

vote:
           1 Clinton
           2 Trump

A handy alternative is the add-on command fre, which can be installed by simply typing:

ssc install fre


Running the command, we get the following useful output:

fre vote

vote
---------------------------------------------------------------
                  |      Freq.    Percent      Valid       Cum.
------------------+--------------------------------------------
Valid   1 Clinton |       1269      52.01      52.01      52.01
        2 Trump   |       1171      47.99      47.99     100.00
        Total     |       2440     100.00     100.00
---------------------------------------------------------------
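
If installing a user-written command is not an option, built-in commands can also show the numeric codes next to their labels. The following is a minimal sketch, assuming the cleaned data are already in memory and that the value label attached to vote is itself named vote (as the label list output suggests):

```stata
* Show the numeric codes alongside their labels using built-in commands only.
codebook vote

* Temporarily prefix each label with its numeric code, tabulate, then undo.
numlabel vote, add
tab vote
numlabel vote, remove
```

Because numlabel edits the value label itself, the remove step restores the original label text once the check is done.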

We will need to recode the variable. Our interest is in modeling the probability of voting for Trump, so Trump needs to be coded as 1 and Clinton as 0. Overwriting or dropping an original variable is rarely a good idea, because you may want to return to the original coding later, so we will create a new variable and add value labels.

gen vote_2 = vote - 1

label define vote_2 0 "Clinton" 1 "Trump"

label val vote_2 vote_2

label var vote_2 "2016 Vote (1 = Trump, 0 = Clinton)"

Check our recode:

tab vote vote_2

           | 2016 Vote (1 = Trump,
           |     0 = Clinton)
      vote |   Clinton      Trump |     Total
-----------+----------------------+----------
   Clinton |     1,269          0 |     1,269
     Trump |         0      1,171 |     1,171
-----------+----------------------+----------
     Total |     1,269      1,171 |     2,440
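
The same check can be automated rather than read off a table. The sketch below assumes vote and vote_2 exist as created above; vote_alt is a hypothetical name used only for illustration. The assert command stops with an error if its condition fails for any observation, and recode with the generate() option is a one-step alternative to gen followed by label define.

```stata
* Confirm every nonmissing vote_2 is 0 or 1 and matches the original coding.
assert inlist(vote_2, 0, 1) if !missing(vote_2)
assert vote_2 == vote - 1 if !missing(vote)

* Alternative: recode and label in one step (vote_alt is a hypothetical name).
recode vote (1 = 0 "Clinton") (2 = 1 "Trump"), generate(vote_alt)
assert vote_alt == vote_2

* Clean up the illustrative variable and the value label recode created.
drop vote_alt
label drop vote_alt
```

If every assert passes, Stata prints nothing, which makes this pattern convenient inside a do-file that should halt when the data change unexpectedly.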


Adding variable labels to our other variables will make Stata graphs and output easier to read.

label var educ   "Education"
label var age    "Age"
label var gender "Gender"


Take a look at how the categorical variables are coded:

fre educ

educ -- Education
-----------------------------------------------------------------------------
                                |      Freq.    Percent      Valid       Cum.
--------------------------------+--------------------------------------------
Valid   1 HS Not Completed      |        102       4.18       4.21       4.21
        2 Completed HS          |        381      15.61      15.72      19.93
        3 College < 4 Years     |        838      34.34      34.57      54.50
        4 College 4 Year Degree |        624      25.57      25.74      80.24
        5 Advanced Degree       |        479      19.63      19.76     100.00
        Total                   |       2424      99.34     100.00
Missing .                       |         16       0.66
Total                           |       2440     100.00
-----------------------------------------------------------------------------

fre gender

gender -- Gender
--------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
-----------------+--------------------------------------------
Valid   1 Male   |       1128      46.23      46.23      46.23
        2 Female |       1312      53.77      53.77     100.00
        Total    |       2440     100.00     100.00
--------------------------------------------------------------
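
The fre output shows 16 missing values on educ. Stata's estimation commands drop any observation that is missing on a model variable, so it can help to summarize missingness across all of the variables at once. This is a minimal sketch, assuming the four variables are in memory; n_missing is a hypothetical helper name.

```stata
* Report missing-value counts for the model variables
* (only variables that actually have missing values are listed).
misstable summarize vote_2 gender age educ

* Count observations with complete data on all four variables.
egen n_missing = rowmiss(vote_2 gender age educ)
count if n_missing == 0
drop n_missing
```

The final count is the number of observations that would remain for a model that includes all four variables.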

We can also check a summary of the distribution of age. The detail option to the sum command gives us a fuller sense of the distribution.

sum age, detail

                             Age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           19             18
 5%           24             18
10%           28             18       Obs               2,384
25%           37             18       Sum of Wgt.       2,384

50%           54                      Mean           51.99832
                        Largest       Std. Dev.      17.19011
75%           65             90
90%           74             90       Variance       295.4998
95%           79             90       Skewness      -.0547124
99%           88             90       Kurtosis       2.113201


The Percentiles column gives the values at selected percentiles of the distribution. For example, the median (50th percentile) age is 54, and the interquartile range (25th to 75th percentiles) runs from 37 to 65. The Smallest values are the four lowest observed ages, which are all 18. The Largest values are the four highest, which are all 90. These numbers are based on 2,384 observations, so 56 of the 2,440 respondents are missing age. The mean age is 52 with a standard deviation of 17.19. Variance is the standard deviation squared. Skewness measures how asymmetric the distribution is (values close to zero mean minimal skew). Kurtosis measures how heavy the tails are relative to a normal distribution, which has a kurtosis of 3. The observed value of 2.11 is below 3, meaning the tails are lighter than those of a normal distribution.
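
These quantities are also saved by summarize as stored results, so the arithmetic can be checked directly. The sketch below assumes age is in memory; r(p25), r(p75), r(sd), and r(Var) are stored results of summarize with the detail option.

```stata
quietly summarize age, detail

* Interquartile range: 75th minus 25th percentile (65 - 37 = 28 here).
display "IQR      = " (r(p75) - r(p25))

* Variance equals the squared standard deviation.
display "SD^2     = " (r(sd)^2)
display "Variance = " (r(Var))

* Number of observations with a missing age.
count if missing(age)
```

Typing return list after summarize shows every stored result, including r(skewness) and r(kurtosis).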

Tables are useful, but often graphs are more informative. Bar graphs are the easiest for examining categorical variables. Start with the outcome variable.

graph bar, over(vote_2)