Logistic Regression in Stata

Jeremy Albright


This post outlines the steps for performing a logistic regression in Stata. The data come from the 2016 American National Election Studies (ANES). Code for preparing the data can be found on our GitHub page, and the cleaned data can be downloaded here.

The steps that will be covered are the following:

  1. Check variable codings and distributions
  2. Graphically review bivariate associations
  3. Fit the logit model in Stata
  4. Interpret results in terms of odds ratios
  5. Interpret results in terms of predicted probabilities

The variables we use will be:

  • vote: Whether the respondent voted for Clinton or Trump
  • gender: Male or Female
  • age: The age (in years) of the respondent
  • educ: The highest level of education attained

For simplicity, this demonstration will ignore the complex survey variables (weight, PSU, and strata).

Univariate Summaries

The first step in any statistical analysis should be to perform a visual inspection of the data in order to check for coding errors, outliers, or funky distributions. Note that Stata's logistic regression commands treat zero as the negative outcome and any other nonmissing value as the positive outcome, so a binary outcome should be coded as zero and one. The variable vote is the dependent variable. How is it coded? We can check using the tab command:

tab vote
       vote |      Freq.     Percent        Cum.
------------+-----------------------------------
    Clinton |      1,269       52.01       52.01
      Trump |      1,171       47.99      100.00
------------+-----------------------------------
      Total |      2,440      100.00

The problem is that we don’t see the numeric value, just the label. There are a few workarounds, for example using the nolab option to tab or looking at label list. Here is what these look like:

tab vote, nolab
       vote |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,269       52.01       52.01
          2 |      1,171       47.99      100.00
------------+-----------------------------------
      Total |      2,440      100.00
label list vote
vote:
           1 Clinton
           2 Trump

A handy alternative is the add-on command fre, which can be installed by simply typing:

ssc install fre

Running the command, we get the following useful output:

fre vote
vote
---------------------------------------------------------------
                  |      Freq.    Percent      Valid       Cum.
------------------+--------------------------------------------
Valid   1 Clinton |       1269      52.01      52.01      52.01
        2 Trump   |       1171      47.99      47.99     100.00
        Total     |       2440     100.00     100.00           
---------------------------------------------------------------

We will need to recode the variable. Our interest is in modeling the probability of voting for Trump, so Trump needs to be coded as 1; Clinton will be coded as 0. It is never a good idea to overwrite an original variable, since you may want to return to the original coding later, so we will create a new variable and add value labels.

gen vote_2 = vote - 1

label define vote_2 0 "Clinton" 1 "Trump"

label val vote_2 vote_2

label var vote_2 "2016 Vote (1 = Trump, 0 = Clinton)"

Check our recode:

tab vote vote_2
           | 2016 Vote (1 = Trump,
           |     0 = Clinton)
      vote |   Clinton      Trump |     Total
-----------+----------------------+----------
   Clinton |     1,269          0 |     1,269 
     Trump |         0      1,171 |     1,171 
-----------+----------------------+----------
     Total |     1,269      1,171 |     2,440 
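The cross-tabulation can also be turned into an automated check. The following is a minimal sketch using Stata's assert command, which stops with an error if a condition fails for any observation and returns silently otherwise. It assumes vote and vote_2 exist as defined above.

```stata
* Stop with an error if the recode deviates from vote - 1 anywhere
assert vote_2 == vote - 1 if !missing(vote)

* Confirm the new variable takes only the values 0 and 1 where it is observed
assert inlist(vote_2, 0, 1) if !missing(vote_2)
```

A check like this is convenient in a do-file, because a broken recode halts the script instead of passing unnoticed.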

Adding variable labels to our other variables will make Stata graphs and output easier to read.

label var educ   "Education"
label var age    "Age"
label var gender "Gender"

Take a look at how the categorical variables are coded:

fre educ
educ -- Education
-----------------------------------------------------------------------------
                                |      Freq.    Percent      Valid       Cum.
--------------------------------+--------------------------------------------
Valid   1 HS Not Completed      |        102       4.18       4.21       4.21
        2 Completed HS          |        381      15.61      15.72      19.93
        3 College < 4 Years     |        838      34.34      34.57      54.50
        4 College 4 Year Degree |        624      25.57      25.74      80.24
        5 Advanced Degree       |        479      19.63      19.76     100.00
        Total                   |       2424      99.34     100.00           
Missing .                       |         16       0.66                      
Total                           |       2440     100.00                      
-----------------------------------------------------------------------------
fre gender
gender -- Gender
--------------------------------------------------------------
                 |      Freq.    Percent      Valid       Cum.
-----------------+--------------------------------------------
Valid   1 Male   |       1128      46.23      46.23      46.23
        2 Female |       1312      53.77      53.77     100.00
        Total    |       2440     100.00     100.00           
--------------------------------------------------------------

We can also check a summary of the distribution of age. The detail option to the sum command gives us a fuller sense of the distribution.

sum age, detail
                             Age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           19             18
 5%           24             18
10%           28             18       Obs               2,384
25%           37             18       Sum of Wgt.       2,384

50%           54                      Mean           51.99832
                        Largest       Std. Dev.      17.19011
75%           65             90
90%           74             90       Variance       295.4998
95%           79             90       Skewness      -.0547124
99%           88             90       Kurtosis       2.113201

The Percentiles column gives us the values at different percentiles of the distribution. For example, the median (50th percentile) age is 54, and the interquartile range (25th to 75th percentiles) runs from 37 to 65. The Smallest values are the four lowest observed ages, which are all 18. The Largest values are the four highest, which are all 90. These numbers are based on 2,384 observations. The mean age is 52 with a standard deviation of 17.19. Variance is the standard deviation squared. Skewness is a measure of how asymmetric the distribution is (values close to zero mean minimal skew). Kurtosis measures how heavy the tails are relative to a normal distribution (a value of 3 corresponds to the normal). The observed value of 2.11 is less than 3, which means that the tails are lighter than those of a normal distribution.
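The statistics described above are also saved in Stata's stored results after summarize, which makes them available for further calculation. The sketch below assumes the age variable from this post and uses the r() names documented for summarize with the detail option.

```stata
* Run summarize quietly, then pull statistics from the stored results
quietly summarize age, detail

display "Median:              " r(p50)
display "Interquartile range: " r(p75) - r(p25)
display "Variance:            " r(Var)
display "SD squared:          " r(sd)^2
display "Skewness:            " r(skewness)
display "Kurtosis:            " r(kurtosis)
```

Given the table above, the median should display as 54 and the interquartile range as 65 - 37 = 28. The variance and the squared standard deviation should agree.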

Tables are useful, but often graphs are more informative. Bar graphs are the easiest for examining categorical variables. Start with the outcome variable.

graph bar, over(vote_2)

Bar plot of votes by candidate

In the sample, Clinton received more votes than Trump, but not by a large amount.

Now the categorical independent variables. Note the options for the education graph to make the x-axis labels more readable.

graph bar, over(gender)

Bar plot of votes by gender

We see that our sample has more females than males.

graph bar, over(educ, label(angle(forty_five) labsize(small)))

Bar plot of votes by education

Within our sample, the modal respondent has some college, with the second most populated category being a four-year college degree.

For continuous variables, histograms allow us to determine the shape of the distribution and look for outliers. The following command produces the histogram, sets the number of bins to be 20, and requests the y-axis to be percent of the sample falling into each bin.

histogram age, bin(20) percent 

Histogram of age

We now have a good sense as to the shape of the variable distributions and do not see any evidence that recodes are necessary.
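The tables above also show some missing data: educ has 16 missing values, and the age summary is based on 2,384 of the 2,440 respondents. Because logit drops any observation with a missing value on a model variable, it can help to check in advance how many complete cases remain. The following is a sketch of one way to do that, not part of the original analysis.

```stata
* Tabulate missing values for the model variables
misstable summarize vote_2 gender educ age

* Count observations with complete data on all four variables
count if !missing(vote_2, gender, educ, age)
```

The count should equal the number of observations Stata reports when the model is fit.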

Bivariate Summaries

Prior to moving on to the fully specified model, it is advisable to first examine the simple associations between the outcome and each individual predictor. Doing so can help avoid surprises in the final model. For example, if there is no simple relationship apparent in the data, we shouldn’t be taken aback when that predictor is not significant in the model. If there is a simple association, but it disappears in the full model, then we have evidence that one of the other variables is a confounder. Upon controlling for that factor, the relationship we initially observed is explained away.

Graphs are again helpful. When the outcome is categorical and the predictor is also categorical, a grouped bar graph is informative. The following is the graph of vote choice and gender.

graph bar, over(vote_2) over(gender)

Bar plot of votes by candidate and gender

The figure shows that, within males, Trump support was higher. Within females, Clinton support was higher.

A similar figure can be made for education. Again, we format for readability.

graph bar, over(vote_2, label(labsize(small))) ///
    over(educ, label(labsize(vsmall)))

Bar plot of votes by candidate and education

The figure suggests that Trump was favored by those with a high school diploma and some college, whereas Clinton’s support was higher with those who finished college and especially among those with an advanced degree. Although Clinton was slightly preferred among those without a high school diploma, the figure overall favors an interpretation that Clinton’s support increases in education.

Boxplots are useful for examining the association between a categorical variable and a variable measured on an interval scale. To keep it clear that the vote is the outcome variable, we’ll set the orientation as horizontal.

graph hbox age, over(vote_2)

Box plot of age by candidate

There’s a lot of overlap between the two boxes, though the Trump box sits a little to the right of the Clinton box. The interpretation is that older respondents tend to be more likely to vote for Trump.
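The graphs can be complemented with numeric summaries of the same bivariate associations. As a sketch using the variables defined in this post, cross-tabulations with row percentages show the share voting for each candidate within each group, and tabstat summarizes age by vote choice.

```stata
* Row percentages of vote choice within each gender, with a chi-squared test
tabulate gender vote_2, row chi2

* Row percentages of vote choice within each education level
tabulate educ vote_2, row chi2

* Age summaries by vote choice
tabstat age, by(vote_2) statistics(n mean median sd)
```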

Having carefully reviewed the data, we can now move to estimating the model.

Fitting the Model

Stata has two commands for fitting a logistic regression, logit and logistic. The difference is only in the default output. The logit command reports coefficients on the log-odds scale, whereas logistic reports odds ratios. The syntax for the logit command is the following:

logit vote_2 i.gender educ age

After specifying logit, the dependent variable is listed first followed by the independent variables. The i. operator preceding gender tells Stata that the variable is categorical, and Stata will automatically create the dummy variables for us. Although educ is an ordered categorical variable, we opt here to treat its effect as linear. The last variable is age.

The output is the following:

Iteration 0:   log likelihood = -1638.9088  
Iteration 1:   log likelihood = -1592.1454  
Iteration 2:   log likelihood = -1592.0911  
Iteration 3:   log likelihood = -1592.0911  

Logistic regression                             Number of obs     =      2,368
                                                LR chi2(3)        =      93.64
                                                Prob > chi2       =     0.0000
Log likelihood = -1592.0911                     Pseudo R2         =     0.0286

------------------------------------------------------------------------------
      vote_2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |
     Female  |  -.3559033   .0841313    -4.23   0.000    -.5207977   -.1910089
        educ |  -.2518405   .0387012    -6.51   0.000    -.3276934   -.1759876
         age |   .0130781   .0024576     5.32   0.000     .0082614    .0178949
       _cons |   .2754293   .1959139     1.41   0.160    -.1085549    .6594136
------------------------------------------------------------------------------

The syntax for logistic is the same except that we swap out the name of the command.

logistic vote_2 i.gender educ age

The output is the following:

Logistic regression                             Number of obs     =      2,368
                                                LR chi2(3)        =      93.64
                                                Prob > chi2       =     0.0000
Log likelihood = -1592.0911                     Pseudo R2         =     0.0286

------------------------------------------------------------------------------
      vote_2 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |
     Female  |   .7005404   .0589374    -4.23   0.000     .5940465    .8261252
        educ |   .7773687   .0300851    -6.51   0.000     .7205839    .8386284
         age |   1.013164   .0024899     5.32   0.000     1.008296    1.018056
       _cons |   1.317096   .2580375     1.41   0.160     .8971296    1.933658
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Both commands estimate exactly the same model. Consequently, the output summarizing the number of observations, the likelihood ratio test of the model, and the pseudo \(R^2\) are all the same. The significance tests for the individual coefficients are also the same, as they are both based on the coefficients on the logit scale (the output from logit). The odds ratios presented by logistic are simply the exponentiated coefficients from logit. For example, the coefficient for educ was -.2518405. The odds ratio is \(\exp(-.2518405) = .7774\). The standard errors for the odds ratios are based on the delta method. The 95% confidence intervals around the odds ratios are the exponentiated 95% confidence limits from logit. For example, the 95% confidence interval for educ based on the logit results was \([-.3276934, -.1759876]\). We get the odds ratio version as:

\[ \begin{aligned} 95\% \text{ CI} &= [\exp(-.3276934), \exp(-.1759876)] \\ &= [.7205839, .8386284] \end{aligned} \]
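These relationships can be checked directly in Stata. The sketch below refits the model quietly and rebuilds the educ row of the logistic output from the logit-scale results, using the _b[] and _se[] system variables. The delta-method standard error for an exponentiated coefficient is the odds ratio multiplied by the logit-scale standard error.

```stata
* Refit quietly so the coefficients are available in _b[] and _se[]
quietly logit vote_2 i.gender educ age

* Odds ratio: exponentiated coefficient
display exp(_b[educ])

* Delta-method standard error: odds ratio times the logit-scale standard error
display exp(_b[educ]) * _se[educ]

* 95% confidence limits: exponentiated logit-scale limits
display exp(_b[educ] - invnormal(0.975) * _se[educ])
display exp(_b[educ] + invnormal(0.975) * _se[educ])
```

With the estimates reported above, these lines should reproduce the educ row from logistic (.7773687, .0300851, .7205839, and .8386284) up to rounding.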

Note also that we can still get odds ratios from logit if we specify the or option:

logit vote_2 i.gender educ age, or

The output will be nearly identical to logistic:

Iteration 0:   log likelihood = -1638.9088  
Iteration 1:   log likelihood = -1592.1454  
Iteration 2:   log likelihood = -1592.0911  
Iteration 3:   log likelihood = -1592.0911  

Logistic regression                             Number of obs     =      2,368
                                                LR chi2(3)        =      93.64
                                                Prob > chi2       =     0.0000
Log likelihood = -1592.0911                     Pseudo R2         =     0.0286

------------------------------------------------------------------------------
      vote_2 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |
     Female  |   .7005404   .0589374    -4.23   0.000     .5940465    .8261252
        educ |   .7773687   .0300851    -6.51   0.000     .7205839    .8386284
         age |   1.013164   .0024899     5.32   0.000     1.008296    1.018056
       _cons |   1.317096   .2580375     1.41   0.160     .8971296    1.933658
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
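The models above enter educ as a single linear term. One way to probe that choice, offered here as a sketch and not as part of the original analysis, is a likelihood-ratio test against a model that enters educ as a factor variable. The names linear and categorical are arbitrary labels chosen for illustration.

```stata
* Model with educ treated as linear (as in this post)
quietly logit vote_2 i.gender educ age
estimates store linear

* Model with educ treated as categorical (one indicator per level)
quietly logit vote_2 i.gender i.educ age
estimates store categorical

* Likelihood-ratio test of the linear restriction
lrtest linear categorical

* Make the linear model the active estimates again for later commands
estimates restore linear
```

A small p-value would suggest that the linear specification is too restrictive. The final line matters because postestimation commands such as margins operate on whichever model is currently active.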

Interpretation of Odds Ratios

The coefficients returned by logit are difficult to interpret intuitively, and hence it is common to report odds ratios instead. An odds ratio less than one means that an increase in \(x\) leads to a decrease in the odds that \(y = 1\). An odds ratio greater than one means that an increase in \(x\) leads to an increase in the odds that \(y = 1\). In general, the percent change in the odds given a one-unit change in the predictor can be determined as

\[ \% \text{ Change in Odds} = 100(OR - 1) \]

For example, the odds of voting for Trump change by \(100(.7005 - 1) = -29.95\%\) for females compared to males; that is, they are 29.95% lower. In addition, each one-unit increase on the education scale leads to a \(100(.7774 - 1) = -22.26\%\) change, or a 22.26% decrease, in the odds of voting for Trump. Finally, each one-year increase in age leads to a \(100(1.0132 - 1) = 1.32\%\) increase in the odds of voting for Trump. All of these are statistically significant at \(p < .05\).
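The percent-change formula applies to a one-unit change in the predictor. For a continuous variable such as age, a larger change is often more informative. As a sketch, the odds ratio for a ten-year difference is the exponentiated coefficient multiplied by ten, which lincom can compute along with a confidence interval. The choice of ten years is only an illustration.

```stata
quietly logit vote_2 i.gender educ age

* Percent change in odds for a one-year increase in age
display 100 * (exp(_b[age]) - 1)

* Odds ratio for a ten-year increase in age, with a confidence interval
lincom 10*age, or

* Same point estimate by hand: the one-year odds ratio raised to the tenth power
display exp(_b[age])^10
```

Given the age coefficient of .0130781 reported above, the ten-year odds ratio should be roughly 1.14.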

Interpretation in Terms of Predicted Probabilities

Odds ratios are commonly reported, but they are still somewhat difficult to intuit given that an odds ratio requires four separate probabilities:

\[ \text{Odds Ratio} = \left(\frac{p(y = 1 \mid x + 1)}{p(y = 0 \mid x + 1)}\right)\bigg/ \left(\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}\right) \]

It’s much easier to think directly in terms of probabilities. However, due to the nonlinearity of the model, it is not possible to talk about a one-unit change in an independent variable having a constant effect on the probability. Instead, predicted probabilities require us to also take into account the other variables in the model. For example, the difference in the probability of voting for Trump between males and females may differ depending on whether we are talking about educated voters in their 30s or uneducated voters in their 60s.

Stata makes it easy to determine predicted probabilities for any combination of independent variables using the margins command. For example, say we want to know the probability that a 35-year-old college-educated male votes for Trump. The syntax would be:

margins, at(gender = 1 age = 35 educ = 4)
Adjusted predictions                            Number of obs     =      2,368
Model VCE    : OIM

Expression   : Pr(vote_2), predict()
at           : gender          =           1
               educ            =           4
               age             =          35

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .4318766   .0189669    22.77   0.000     .3947023     .469051
------------------------------------------------------------------------------
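To see what margins is doing here, the same probability can be computed by hand from the coefficients: build the linear predictor on the log-odds scale, then apply the inverse logit. The sketch below assumes the model fit earlier in the post. The scalar name xb_male is a hypothetical label chosen for illustration.

```stata
quietly logit vote_2 i.gender educ age

* Linear predictor for a 35-year-old male (the base gender category) with educ = 4
scalar xb_male = _b[_cons] + 4 * _b[educ] + 35 * _b[age]

* Inverse logit converts log-odds to a probability
display invlogit(xb_male)

* For a female with the same age and education, add the Female coefficient
display invlogit(xb_male + _b[2.gender])
```

The first value should match the Margin of .4318766 reported above. margins additionally supplies the delta-method standard error and confidence interval.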

Since gender is a factor, we could get the probabilities for both females and males if we specify the syntax as:

margins gender, at(age = 35 educ = 4)
Adjusted predictions                            Number of obs     =      2,368
Model VCE    : OIM

Expression   : Pr(vote_2), predict()
at           : educ            =           4
               age             =          35

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |
       Male  |   .4318766   .0189669    22.77   0.000     .3947023     .469051
     Female  |   .3474874   .0169877    20.46   0.000     .3141921    .3807827
------------------------------------------------------------------------------

Factor variables can be specified after the margins command and before the options, but covariates (non-categorical predictors) must be specified using the at option. For both outputs, the value in the Margin column is the predicted probability. The delta-method standard error provides a measure of uncertainty around the estimate and is used to calculate the \(z\)-statistic and confidence interval. The \(z\)-statistic and \(p\)-value test the null hypothesis that the probability is zero, which for our needs is not a meaningful test. The 95% confidence interval, on the other hand, is useful for understanding how much uncertainty we have in our predicted probabilities. The probability that a 35-year-old, college-educated male votes for Trump is .432, 95% CI = [.395, .469].
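The two predicted probabilities also tie back to the odds ratio for gender. As a worked check using the margins output above, with intermediate values rounded to four decimals, converting each probability to odds and taking the ratio recovers the Female odds ratio reported by logistic:

\[ \begin{aligned} \text{Odds}_{\text{Male}} &= \frac{.4318766}{1 - .4318766} \approx .7602 \\ \text{Odds}_{\text{Female}} &= \frac{.3474874}{1 - .3474874} \approx .5325 \\ \text{Odds Ratio} &= \frac{.5325}{.7602} \approx .7005 \end{aligned} \]

The difference in probabilities, \(.4319 - .3475 = .0844\), applies only at this age and education level. The odds ratio, by contrast, is the same at every combination of the other predictors because the model contains no interactions.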

Let’s say we wanted to get predicted probabilities for both genders across the range of ages 20-70, holding educ = 4 (college degree). We can run the following code:

margins gender, at(age = (20(5)70) educ = 4)

The age = (20(5)70) specifies that probabilities will be calculated from 20 to 70, incrementing by 5. The output is the following:

Adjusted predictions                            Number of obs     =      2,368
Model VCE    : OIM

Expression   : Pr(vote_2), predict()

1._at        : educ            =           4
               age             =          20

2._at        : educ            =           4
               age             =          25

3._at        : educ            =           4
               age             =          30

4._at        : educ            =           4
               age             =          35

5._at        : educ            =           4
               age             =          40

6._at        : educ            =           4
               age             =          45

7._at        : educ            =           4
               age             =          50

8._at        : educ            =           4
               age             =          55

9._at        : educ            =           4
               age             =          60

10._at       : educ            =           4
               age             =          65

11._at       : educ            =           4
               age             =          70

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _at#gender |
     1#Male  |   .3845286   .0241139    15.95   0.000     .3372662     .431791
   1#Female  |   .3044336   .0212145    14.35   0.000     .2628539    .3460132
     2#Male  |   .4001165    .022273    17.96   0.000     .3564622    .4437707
   2#Female  |   .3184546   .0197262    16.14   0.000     .2797919    .3571172
     3#Male  |   .4159093   .0205304    20.26   0.000     .3756703    .4561482
   3#Female  |   .3328123   .0182915    18.19   0.000     .2969617     .368663
     4#Male  |   .4318766   .0189669    22.77   0.000     .3947023     .469051
   4#Female  |   .3474874   .0169877    20.46   0.000     .3141921    .3807827
     5#Male  |   .4479868   .0176794    25.34   0.000     .4133358    .4826378
   5#Female  |    .362458   .0159139    22.78   0.000     .3312674    .3936487
     6#Male  |   .4642069   .0167739    27.67   0.000     .4313307    .4970832
   6#Female  |   .3777003   .0151853    24.87   0.000     .3479376    .4074629
     7#Male  |   .4805031   .0163451    29.40   0.000     .4484673    .5125389
   7#Female  |   .3931882   .0149141    26.36   0.000     .3639572    .4224193
     8#Male  |   .4968409    .016448    30.21   0.000     .4646035    .5290783
   8#Female  |   .4088939   .0151765    26.94   0.000     .3791484    .4386394
     9#Male  |   .5131855   .0170764    30.05   0.000     .4797163    .5466547
   9#Female  |   .4247878   .0159854    26.57   0.000     .3934571    .4561186
    10#Male  |   .5295019   .0181667    29.15   0.000     .4938958     .565108
  10#Female  |   .4408388    .017289    25.50   0.000      .406953    .4747246
    11#Male  |   .5457555   .0196217    27.81   0.000     .5072976    .5842133
  11#Female  |   .4570144   .0189972    24.06   0.000     .4197806    .4942482
------------------------------------------------------------------------------

The top of the output provides a key for interpreting the table. For example, where the table reads 3#Female, we have the probability of voting for Trump among 30-year-old females with a college degree. This is a lot of output, so Stata provides the extraordinarily useful marginsplot command, which can be called after running any margins command.

marginsplot

Stata margins plot

We get the predicted probabilities plotted across the range of ages, with separate lines for male and female, holding education constant at a college degree. At every five years, we also get error bars corresponding to the 95% confidence interval around the predicted probability. Based on the model, the probability of voting for Trump increases with age, but it is always higher for males than females.
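The gap between the two lines can also be estimated directly. The sketch below, which goes slightly beyond the original analysis, uses the dydx() option of margins to compute the female-minus-male difference in predicted probability at each age, then plots it. The axis titles are illustrative choices.

```stata
* Female-male difference in predicted probability at each age, educ held at 4
margins, dydx(gender) at(age = (20(5)70) educ = 4)

* Plot the differences as a line with a shaded confidence band
marginsplot, recast(line) recastci(rarea) ///
    xtitle("Age") ytitle("Female - Male difference in Pr(Trump)")
```

In the table above, the gap between males and females is about .080 at age 20 and about .089 at age 70. This illustrates how a constant odds ratio translates into probability differences that vary with the other predictors.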