This post outlines the steps for performing a logistic regression in Stata. The data come from the 2016 American National Election Survey. Code for preparing the data can be found on our GitHub page, and the cleaned data can be downloaded here.
The steps that will be covered are the following:
- Check variable codings and distributions
- Graphically review bivariate associations
- Fit the logit model in Stata
- Interpret results in terms of odds ratios
- Interpret results in terms of predicted probabilities
The variables we use will be:
- vote: Whether the respondent voted for Clinton or Trump
- gender: Male or Female
- age: The age (in years) of the respondent
- educ: The highest level of education attained
For simplicity, this demonstration will ignore the complex survey variables (weight, PSU, and strata).
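If you did want to account for the design, the standard approach in Stata is to declare it with svyset and then prefix estimation commands with svy:. A minimal sketch, assuming the design variables are named weight, psu, and strata in the data file (check your own file for the actual names):
* Declare the survey design (variable names assumed; adjust to your data)
svyset psu [pweight = weight], strata(strata)
* Fit design-adjusted models with the svy: prefix, e.g.:
svy: logit vote_2 i.gender educ age
We set this aside for the remainder of the post.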
Univariate Summaries
The first step in any statistical analysis should be to perform a visual inspection of the data in order to check for coding errors, outliers, or funky distributions. Note that in Stata, a binary outcome modeled using logistic regression needs to be coded as zero and one. The variable vote is the dependent variable. How is this coded? We can check using the tab command:
tab vote
vote | Freq. Percent Cum.
------------+-----------------------------------
Clinton | 1,269 52.01 52.01
Trump | 1,171 47.99 100.00
------------+-----------------------------------
Total | 2,440 100.00
The problem is that we don’t see the numeric values, just the labels. There are a few workarounds, for example using the nolab option to tab or looking at label list. Here is what these look like:
tab vote, nolab
vote | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,269 52.01 52.01
2 | 1,171 47.99 100.00
------------+-----------------------------------
Total | 2,440 100.00
label list vote
vote:
1 Clinton
2 Trump
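Yet another option is the built-in numlabel command, which prepends the numeric codes to the value labels themselves. Note that numlabel takes the name of the value label, which here happens to match the variable name:
* prefix the numeric codes onto the "vote" value label
numlabel vote, add
tab vote
After this, every tabulation of vote shows both the code and the label.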
A handy alternative is the add-on command fre, which can be installed by simply typing:
ssc install fre
Running the command, we get the following useful output:
fre vote
vote
---------------------------------------------------------------
| Freq. Percent Valid Cum.
------------------+--------------------------------------------
Valid 1 Clinton | 1269 52.01 52.01 52.01
2 Trump | 1171 47.99 47.99 100.00
Total | 2440 100.00 100.00
---------------------------------------------------------------
We will need to recode the variable. Our interest is in modeling the probability of voting for Trump, so Trump needs to be coded as 1; Clinton will be coded as 0. It is never a good idea to overwrite or delete the original variable, in case you want to return to the original coding later, so we will create a new variable and add value labels.
gen vote_2 = vote - 1
label define vote_2 0 "Clinton" 1 "Trump"
label val vote_2 vote_2
label var vote_2 "2016 Vote (1 = Trump, 0 = Clinton)"
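As an aside, the same recode can be done in a single step with a logical expression; a minimal equivalent that keeps missing values missing:
* (vote == 2) evaluates to 1 for Trump and 0 for Clinton;
* the if clause preserves missing values
gen vote_2 = (vote == 2) if !missing(vote)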
Check our recode:
tab vote vote_2
| 2016 Vote (1 = Trump,
| 0 = Clinton)
vote | Clinton Trump | Total
-----------+----------------------+----------
Clinton | 1,269 0 | 1,269
Trump | 0 1,171 | 1,171
-----------+----------------------+----------
Total | 1,269 1,171 | 2,440
Adding variable labels to our other variables will make Stata graphs and output easier to read.
label var educ "Education"
label var age "Age"
label var gender "Gender"
Take a look at how the categorical variables are coded:
fre educ
educ -- Education
-----------------------------------------------------------------------------
| Freq. Percent Valid Cum.
--------------------------------+--------------------------------------------
Valid 1 HS Not Completed | 102 4.18 4.21 4.21
2 Completed HS | 381 15.61 15.72 19.93
3 College < 4 Years | 838 34.34 34.57 54.50
4 College 4 Year Degree | 624 25.57 25.74 80.24
5 Advanced Degree | 479 19.63 19.76 100.00
Total | 2424 99.34 100.00
Missing . | 16 0.66
Total | 2440 100.00
-----------------------------------------------------------------------------
fre gender
gender -- Gender
--------------------------------------------------------------
| Freq. Percent Valid Cum.
-----------------+--------------------------------------------
Valid 1 Male | 1128 46.23 46.23 46.23
2 Female | 1312 53.77 53.77 100.00
Total | 2440 100.00 100.00
--------------------------------------------------------------
We can also check a summary of the distribution of age. The detail option to the sum command gives us a fuller sense of the distribution.
sum age, detail
Age
-------------------------------------------------------------
Percentiles Smallest
1% 19 18
5% 24 18
10% 28 18 Obs 2,384
25% 37 18 Sum of Wgt. 2,384
50% 54 Mean 51.99832
Largest Std. Dev. 17.19011
75% 65 90
90% 74 90 Variance 295.4998
95% 79 90 Skewness -.0547124
99% 88 90 Kurtosis 2.113201
The Percentiles column gives us the values at different percentiles of the distribution. For example, the median (50th percentile) age is 54, and the interquartile range (25th to 75th percentiles) runs from 37 to 65. The Smallest values are the five lowest observed ages, which are all 18; the Largest values are the five highest, which are all 90. These numbers are based on 2,384 observations. The mean age is 52 with a standard deviation of 17.19. The variance is the standard deviation squared. Skewness measures how non-symmetric the distribution is (values close to zero indicate minimal skew), and kurtosis measures how heavy the tails are relative to a normal distribution (values close to 3 indicate approximate normality). Here the kurtosis is below 3, meaning the tails are somewhat shorter than those of a typical normal.
Tables are useful, but often graphs are more informative. Bar graphs are the easiest for examining categorical variables. Start with the outcome variable.
graph bar, over(vote_2)
In the sample, Clinton received more votes than Trump, but not by a large amount.
Now for the categorical independent variables, starting with gender. (Note the options used on the education graph below to make the x-axis labels more readable.)
graph bar, over(gender)
We see that our sample has more females than males.
graph bar, over(educ, label(angle(45) labsize(small)))
Within our sample, the modal respondent has some college but less than a four-year degree, with the second most populated category being those with a four-year college degree.
For continuous variables, histograms allow us to determine the shape of the distribution and look for outliers. The following command produces the histogram, sets the number of bins to 20, and puts the percent of the sample falling into each bin on the y-axis.
histogram age, bin(20) percent
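If you also want a rough visual check against normality, histogram’s normal option overlays a normal density with the sample mean and standard deviation:
histogram age, bin(20) percent normal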
We now have a good sense as to the shape of the variable distributions and do not see any evidence that recodes are necessary.
Bivariate Summaries
Prior to moving on to the fully specified model, it is advisable to first examine the simple associations between the outcome and each individual predictor. Doing so can help avoid surprises in the final model. For example, if there is no simple relationship apparent in the data, we shouldn’t be taken aback when that predictor is not significant in the model. If there is a simple association, but it disappears in the full model, then we have evidence that one of the other variables is a confounder. Upon controlling for that factor, the relationship we initially observed is explained away.
Graphs are again helpful. When the outcome is categorical and the predictor is also categorical, a grouped bar graph is informative. The following is the graph of vote choice and gender.
graph bar, over(vote_2) over(gender)
The figure shows that, within males, Trump support was higher. Within females, Clinton support was higher.
A similar figure can be made for education. Again, we format for readability.
graph bar, over(vote_2, label(labsize(small))) ///
over(educ, label(labsize(vsmall)))
The figure suggests that Trump was favored by those with a high school diploma or some college, whereas Clinton’s support was higher among those who finished college and especially among those with an advanced degree. Although Clinton was slightly preferred among those without a high school diploma, the figure overall favors an interpretation that Clinton’s support increases with education.
Boxplots are useful for examining the association between a categorical variable and a variable measured on an interval scale. To keep it clear that the vote is the outcome variable, we’ll set the orientation as horizontal.
graph hbox age, over(vote_2)
There’s a lot of overlap between the two boxes, though the Trump box sits a little to the right of the Clinton box. The interpretation is that older respondents tend to be more likely to vote for Trump.
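To put numbers on that visual impression, we can ask tabulate for a summary of age within each vote group:
* mean, standard deviation, and frequency of age by vote choice
tab vote_2, summarize(age)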
Having carefully reviewed the data, we can now move to estimating the model.
Fitting the Model
Stata has two commands for fitting a logistic regression, logit and logistic. The difference is only in the default output: the logit command reports coefficients on the log-odds scale, whereas logistic reports odds ratios. The syntax for the logit command is the following:
logit vote_2 i.gender educ age
After specifying logit, the dependent variable is listed first, followed by the independent variables. The i. operator preceding gender tells Stata that the variable is categorical, and Stata will create the dummy variables for us automatically. Although educ is an ordered categorical variable, we opt here to treat its effect as linear. The last variable is age.
The output is the following:
Iteration 0: log likelihood = -1638.9088
Iteration 1: log likelihood = -1592.1454
Iteration 2: log likelihood = -1592.0911
Iteration 3: log likelihood = -1592.0911
Logistic regression Number of obs = 2,368
LR chi2(3) = 93.64
Prob > chi2 = 0.0000
Log likelihood = -1592.0911 Pseudo R2 = 0.0286
------------------------------------------------------------------------------
vote_2 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
Female | -.3559033 .0841313 -4.23 0.000 -.5207977 -.1910089
educ | -.2518405 .0387012 -6.51 0.000 -.3276934 -.1759876
age | .0130781 .0024576 5.32 0.000 .0082614 .0178949
_cons | .2754293 .1959139 1.41 0.160 -.1085549 .6594136
------------------------------------------------------------------------------
The syntax for logistic is the same except that we swap out the name of the command.
logistic vote_2 i.gender educ age
The output is the following:
Logistic regression Number of obs = 2,368
LR chi2(3) = 93.64
Prob > chi2 = 0.0000
Log likelihood = -1592.0911 Pseudo R2 = 0.0286
------------------------------------------------------------------------------
vote_2 | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
Female | .7005404 .0589374 -4.23 0.000 .5940465 .8261252
educ | .7773687 .0300851 -6.51 0.000 .7205839 .8386284
age | 1.013164 .0024899 5.32 0.000 1.008296 1.018056
_cons | 1.317096 .2580375 1.41 0.160 .8971296 1.933658
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
Both commands estimate exactly the same model. Consequently, the output summarizing the number of observations, the likelihood ratio test of the model, and the pseudo \(R^2\) are all the same. The significance tests for the individual coefficients are also the same, as both are based on the coefficients on the logit scale (the output from logit). The odds ratios presented by logistic are simply the exponentiated coefficients from logit. For example, the coefficient for educ was -.2518405, so the odds ratio is \(\exp(-.2518405) = .7774\). The standard errors for the odds ratios are based on the delta method. The 95% confidence intervals around the odds ratios are the exponentiated 95% confidence intervals from logit. For example, the 95% confidence interval for educ based on the logit results was \([-.3276934, -.1759876]\). We get the odds ratio version as:
\[ \begin{align} 95\% \text{ CI} &= [\text{exp}(-.3276934), \text{exp}(-.1759876)] \\ &= [.7205839, .8386284] \end{align} \]
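We can verify these transformations from the stored estimation results. Run directly after the logit command, the following reproduces the odds ratio and its confidence interval for educ:
* exponentiate the educ coefficient to get the odds ratio
display exp(_b[educ])
* exponentiate the endpoints of the logit-scale 95% confidence interval
display exp(_b[educ] - invnormal(.975)*_se[educ])
display exp(_b[educ] + invnormal(.975)*_se[educ])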
Note also that we can still get odds ratios from logit if we specify the or option:
logit vote_2 i.gender educ age, or
The output will be nearly identical to logistic:
Iteration 0: log likelihood = -1638.9088
Iteration 1: log likelihood = -1592.1454
Iteration 2: log likelihood = -1592.0911
Iteration 3: log likelihood = -1592.0911
Logistic regression Number of obs = 2,368
LR chi2(3) = 93.64
Prob > chi2 = 0.0000
Log likelihood = -1592.0911 Pseudo R2 = 0.0286
------------------------------------------------------------------------------
vote_2 | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
Female | .7005404 .0589374 -4.23 0.000 .5940465 .8261252
educ | .7773687 .0300851 -6.51 0.000 .7205839 .8386284
age | 1.013164 .0024899 5.32 0.000 1.008296 1.018056
_cons | 1.317096 .2580375 1.41 0.160 .8971296 1.933658
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
Interpretation of Odds Ratios
The coefficients returned by logit are difficult to interpret intuitively, and hence it is common to report odds ratios instead. An odds ratio less than one means that an increase in \(x\) leads to a decrease in the odds that \(y = 1\). An odds ratio greater than one means that an increase in \(x\) leads to an increase in the odds that \(y = 1\). In general, the percent change in the odds given a one-unit change in the predictor can be determined as
\[ \% \text{ Change in Odds} = 100(OR - 1) \]
For example, the odds of voting for Trump are \(100(.7005 - 1) = -29.95\%\), i.e., about 30% lower, for females compared to males. Each one-step increase on the education scale changes the odds of voting for Trump by \(100(.7774 - 1) = -22.26\%\), a decrease of roughly 22%. Finally, each additional year of age changes the odds of voting for Trump by \(100(1.0132 - 1) = 1.32\%\), a small increase per year. All of these effects are statistically significant at \(p < .05\).
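These percent changes can also be computed from the stored estimates after logit. (The Female coefficient is referenced as _b[2.gender] because Female is level 2 of the factor variable.)
display 100*(exp(_b[2.gender]) - 1)  // females vs. males
display 100*(exp(_b[educ]) - 1)      // per step up the education scale
display 100*(exp(_b[age]) - 1)       // per additional year of age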
Interpretation in Terms of Predicted Probabilities
Odds ratios are commonly reported, but they are still somewhat difficult to intuit given that an odds ratio requires four separate probabilities:
\[ \text{Odds Ratio} = \left(\frac{p(y = 1 \mid x + 1)}{p(y = 0 \mid x + 1)}\right)\bigg/ \left(\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}\right) \]
It’s much easier to think directly in terms of probabilities. However, due to the nonlinearity of the model, it is not possible to talk about a one-unit change in an independent variable having a constant effect on the probability. Instead, predicted probabilities depend on the values of the other variables in the model. For example, the difference in the probability of voting for Trump between males and females may differ depending on whether we are talking about educated voters in their 30s or uneducated voters in their 60s.
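Recall how a predicted probability is computed: the linear predictor \(x\beta\) is passed through the inverse logit function,
\[ p(y = 1 \mid x) = \frac{\exp(x\beta)}{1 + \exp(x\beta)} \]
so the effect of changing any one variable depends on where the other variables place us on this curve.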
Stata makes it easy to determine predicted probabilities for any combination of independent variables using the margins command. For example, say we want to know the probability that a 35-year-old, college-educated male votes for Trump. The syntax would be:
margins, at(gender = 1 age = 35 educ = 4)
Adjusted predictions Number of obs = 2,368
Model VCE : OIM
Expression : Pr(vote_2), predict()
at : gender = 1
educ = 4
age = 35
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | .4318766 .0189669 22.77 0.000 .3947023 .469051
------------------------------------------------------------------------------
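As a sanity check, this margin can be reproduced by hand from the logit coefficients. For a male (the base category of gender, contributing 0 to the linear predictor) with educ = 4 and age = 35:
\[ \begin{align} x\beta &= .2754293 + 4(-.2518405) + 35(.0130781) = -.2742 \\ p &= \frac{\exp(-.2742)}{1 + \exp(-.2742)} = .432 \end{align} \]
which matches the Margin column above.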
Since gender is a factor variable, we could get the probabilities for both females and males if we specify the syntax as:
margins gender, at(age = 35 educ = 4)
Adjusted predictions Number of obs = 2,368
Model VCE : OIM
Expression : Pr(vote_2), predict()
at : educ = 4
age = 35
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
Male | .4318766 .0189669 22.77 0.000 .3947023 .469051
Female | .3474874 .0169877 20.46 0.000 .3141921 .3807827
------------------------------------------------------------------------------
Factor variables can be specified after the margins command and before the options, but covariates (non-categorical predictors) must be specified using the at() option. For both outputs, the value in the Margin column is the predicted probability. The delta-method standard error provides a measure of uncertainty around the estimate and is used to calculate the \(z\)-statistic and confidence interval. The \(z\)-statistic and \(p\)-value test the null hypothesis that the probability is zero, which for our needs is not a meaningful test. The 95% confidence interval, on the other hand, is useful for understanding how much uncertainty we have in our predicted probabilities. The probability that a 35-year-old, college-educated male votes for Trump is .432, 95% CI = [.395, .469].
Let’s say we wanted to get predicted probabilities for both genders across the range of ages 20 to 70, holding educ = 4 (college degree). We can run the following code:
margins gender, at(age = (20(5)70) educ = 4)
The age = (20(5)70) syntax specifies that probabilities will be calculated from age 20 to age 70 in increments of 5. The output is the following:
Adjusted predictions Number of obs = 2,368
Model VCE : OIM
Expression : Pr(vote_2), predict()
1._at : educ = 4
age = 20
2._at : educ = 4
age = 25
3._at : educ = 4
age = 30
4._at : educ = 4
age = 35
5._at : educ = 4
age = 40
6._at : educ = 4
age = 45
7._at : educ = 4
age = 50
8._at : educ = 4
age = 55
9._at : educ = 4
age = 60
10._at : educ = 4
age = 65
11._at : educ = 4
age = 70
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at#gender |
1#Male | .3845286 .0241139 15.95 0.000 .3372662 .431791
1#Female | .3044336 .0212145 14.35 0.000 .2628539 .3460132
2#Male | .4001165 .022273 17.96 0.000 .3564622 .4437707
2#Female | .3184546 .0197262 16.14 0.000 .2797919 .3571172
3#Male | .4159093 .0205304 20.26 0.000 .3756703 .4561482
3#Female | .3328123 .0182915 18.19 0.000 .2969617 .368663
4#Male | .4318766 .0189669 22.77 0.000 .3947023 .469051
4#Female | .3474874 .0169877 20.46 0.000 .3141921 .3807827
5#Male | .4479868 .0176794 25.34 0.000 .4133358 .4826378
5#Female | .362458 .0159139 22.78 0.000 .3312674 .3936487
6#Male | .4642069 .0167739 27.67 0.000 .4313307 .4970832
6#Female | .3777003 .0151853 24.87 0.000 .3479376 .4074629
7#Male | .4805031 .0163451 29.40 0.000 .4484673 .5125389
7#Female | .3931882 .0149141 26.36 0.000 .3639572 .4224193
8#Male | .4968409 .016448 30.21 0.000 .4646035 .5290783
8#Female | .4088939 .0151765 26.94 0.000 .3791484 .4386394
9#Male | .5131855 .0170764 30.05 0.000 .4797163 .5466547
9#Female | .4247878 .0159854 26.57 0.000 .3934571 .4561186
10#Male | .5295019 .0181667 29.15 0.000 .4938958 .565108
10#Female | .4408388 .017289 25.50 0.000 .406953 .4747246
11#Male | .5457555 .0196217 27.81 0.000 .5072976 .5842133
11#Female | .4570144 .0189972 24.06 0.000 .4197806 .4942482
------------------------------------------------------------------------------
The top of the output provides a key for interpreting the table. For example, where the table reads 3#Female, we have the probability of voting for Trump among 30-year-old, college-educated females (the third at specification sets age = 30). This is a lot of output, so Stata provides the extraordinarily useful marginsplot command, which can be called after running any margins command.
marginsplot
We get the predicted probabilities plotted across the range of ages, with separate lines for male and female, holding education constant at a college degree. At every five years, we also get error bars corresponding to the 95% confidence interval around the predicted probability. Based on the model, the probability of voting for Trump increases with age, but it is always higher for males than females.
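The default plot can be restyled with standard marginsplot options. For example, one common tweak recasts the point estimates as lines and the confidence intervals as shaded bands:
marginsplot, recast(line) recastci(rarea)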
Still have questions? Contact us!