Logistic Regression in SPSS

Nikki Kamouneh

SPSS logit logistic regression

This post outlines the steps for performing a logistic regression in SPSS. The data come from the 2016 American National Election Survey. Code for preparing the data can be found on our github page, and the cleaned data can be downloaded here.

The steps that will be covered are the following:

  1. Check variable codings and distributions
  2. Graphically review bivariate associations
  3. Fit the logit model in SPSS
  4. Interpret results in terms of odds ratios
  5. Interpret results in terms of predicted probabilities

The variables used will be:

  • vote: Whether the respondent voted for Clinton or Trump
  • gender: Male or female
  • age: The age (in years) of the respondent
  • educ: The highest level of education attained

For simplicity, this demonstration will ignore the complex survey variables (weight, PSU, and strata).

Univariate Summaries

The first step in any statistical analysis should be to perform a visual inspection of the data in order to check for coding errors, outliers, or funky distributions.

Click Analyze \(\rightarrow\) Descriptive Statistics \(\rightarrow\) Frequencies.

The Frequencies window will pop up.

Select vote, educ and gender as our variables and click OK. This gives us the following output:

Note that frequencies are the preferred summary for categorical (nominal and ordinal) variables. The first table provides the number of nonmissing observations for each variable we selected. vote has N = 2,440, educ has N = 2,424 with 16 missing values, and gender has N = 2,440. The next three tables provide frequencies for each variable. In each table:

  • Frequency is the number of observations in that level
  • Percent is that number out of the total
  • Valid Percent recalculates the percentages without missing values
  • Cumulative Percent is the running total of the percentages, reaching 100 at the last level
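
If you prefer working from syntax, the same tables can be produced with the FREQUENCIES command; this is a minimal equivalent of the dialog choices above:

FREQUENCIES VARIABLES=vote educ gender
  /ORDER=ANALYSIS.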

We can also check a summary of the distribution of age. We do this by clicking Analyze \(\rightarrow\) Descriptive Statistics \(\rightarrow\) Descriptives…

Add age as the variable and click OK.

The following output appears:

The Minimum value is the lowest observed age, which is 18. The Maximum value is the largest, which is 90. These numbers are based on 2,384 observations. The mean age is 52 with a standard deviation of 17.19.
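
The equivalent syntax, for those who want it, is:

DESCRIPTIVES VARIABLES=age
  /STATISTICS=MEAN STDDEV MIN MAX.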

Tables are useful, but often graphs are more informative. Bar graphs are the easiest for examining categorical variables. We will do this one at a time for each variable using the SPSS Chart Builder. Go to Graphs \(\rightarrow\) Chart Builder…

Select a Simple Bar type, and select the variable vote as the x-axis variable.

Change the Statistic from count to percentage.

Click OK. We get the following output:

In the sample, Clinton received more votes than Trump, but not by a large amount.
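
For syntax users, the legacy GRAPH command produces an equivalent percentage bar chart (the Chart Builder itself pastes lengthier GGRAPH/GPL code):

GRAPH
  /BAR(SIMPLE)=PCT BY vote.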

Now turn to the categorical independent variables. We repeat the same process using educ and gender as the x-axis variables and get the following plots:

We see that our sample has more females than males.

Within our sample, the modal respondent has some college, with the second most populated category being college educated.

For continuous variables, histograms allow us to determine the shape of the distribution and look for outliers.

We will do this using the Chart Builder again. In the chart options select Histogram.

Click and drag age onto the x-axis.

Click OK.
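
The equivalent legacy syntax is a one-liner:

GRAPH
  /HISTOGRAM=age.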

We now have a good sense as to what the distributions of all of our variables are and do not see any evidence that recodes are necessary.

Bivariate Summaries

Prior to moving on to the fully specified model, it is advisable to first examine the simple associations between the outcome and each individual predictor. Doing so can help avoid surprises in the final model. For example, if there is no simple relationship apparent in the data, we shouldn’t be taken aback when that predictor is not significant in the model. If there is a simple association, but it disappears in the full model, then we have evidence that one of the other variables is a confounder. Upon controlling for that factor, the relationship we initially observed is explained away.

Graphs are again helpful. When the outcome is categorical and the predictor is also categorical, a grouped bar graph is informative. We will do this in the Chart Builder. Under Bar, select the clustered bar graph option. Select gender as the x-axis variable and vote as the cluster on X variable. Again, change the Statistic from count to percentage. Click OK.

The following is the graph of vote choice and gender.

The figure shows that, within males, Trump support was higher. Within females, Clinton support was higher.
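
A clustered percentage bar chart can also be sketched with the legacy GRAPH command; here the first BY variable is intended as the category axis and the second as the cluster variable, but check the resulting chart and swap the two if the roles come out reversed on your system:

GRAPH
  /BAR(GROUPED)=PCT BY gender BY vote.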

A similar figure can be made for education. This time select educ as the x-axis variable.

The figure suggests that Trump was favored by those with a high school diploma and some college, whereas Clinton’s support was higher among those who finished college and especially among those with an advanced degree. Although Clinton was slightly preferred among those without a high school diploma, the overall pattern suggests that Clinton’s support increases with education.

Boxplots are useful for examining the association between a categorical variable and a variable measured on an interval scale. We will once again use the Chart Builder for this. Under Boxplot, select a Simple Boxplot. Add age as our y-axis variable and vote as the x-axis. Under Basic Elements, select Transpose so that the dependent variable is on the y-axis. Click OK.

There’s a lot of overlap between the two boxes, though the Trump box sits a little higher than the Clinton box, suggesting that older respondents were somewhat more likely to vote for Trump.
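
In syntax, boxplots of age by vote can be requested through the EXAMINE command:

EXAMINE VARIABLES=age BY vote
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.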

Having carefully reviewed the data, we can now move to estimating the model.

Fitting the Model

To fit a logistic regression in SPSS, go to Analyze \(\rightarrow\) Regression \(\rightarrow\) Binary Logistic…

Select vote as the Dependent variable and educ, gender and age as Covariates.

Click Categorical. Select gender as a categorical covariate. SPSS will automatically create dummy variables for any variable specified as categorical, defaulting to the last (highest) value as the reference. The data are coded such that 1 = Male and 2 = Female, which means that Female is the reference; the gender coefficient will therefore compare males to females. Click Continue.

Click Options. Check the CI for exp(B) box to request confidence intervals around the odds ratios. Click Continue, then click OK.
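
Clicking Paste instead of OK reveals the underlying syntax, which looks essentially like the following (SPSS will also paste some convergence-criteria defaults, omitted here):

LOGISTIC REGRESSION VARIABLES vote
  /METHOD=ENTER educ gender age
  /CONTRAST (gender)=Indicator
  /PRINT=CI(95).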

We will get the following output:

The first box reports an omnibus test for the whole model and indicates that our predictors are jointly significant. We are usually interested in the individual variables, so the omnibus test is not our primary interest. The Step, Block, and Model rows all show the same values because only a single model was estimated. They would differ had we instead requested a stepwise model (that is, fitting a sequence of models, adding or removing independent variables at each step).

The second box provides overall model fit information. The two \(R^2\) measures (Cox & Snell and Nagelkerke) are different attempts at approximating the \(R^2\) from linear regression in the context of a binary outcome.

The next box provides the model estimates. B is the coefficient, S.E. is the standard error corresponding to B, Wald is the chi-square distributed test statistic, and Sig. is the corresponding \(p\)-value. Note that the odds ratios are simply the exponentiated coefficients from the logit model. For example, the coefficient for educ was -.252. The odds ratio is \(\exp(-.252) = .777\). The 95% confidence intervals around the odds ratios are also presented.

Interpretation of Odds Ratios

The coefficients returned by our logit model are difficult to interpret intuitively, and hence it is common to report odds ratios instead. An odds ratio less than one means that an increase in \(x\) leads to a decrease in the odds that \(y = 1\). An odds ratio greater than one means that an increase in \(x\) leads to an increase in the odds that \(y = 1\). In general, the percent change in the odds given a one-unit change in the predictor can be determined as

\[ \% \text{ Change in Odds} = 100(OR - 1) \]

For example, the odds of voting for Trump are \(100(1.427 - 1) = 42.7\%\) higher for males compared to females. In addition, each step up the education scale changes the odds of voting for Trump by \(100(.777 - 1) = -22.3\%\), that is, a 22.3% decrease. Finally, each one-year increase in age leads to a \(100(1.013 - 1) = 1.3\%\) increase in the odds of voting for Trump. All of these are statistically significant at \(p < .05\).

Interpretation in Terms of Predicted Probabilities

Odds ratios are commonly reported, but they are still somewhat difficult to intuit given that an odds ratio requires four separate probabilities:

\[ \text{Odds Ratio} = \left(\frac{p(y = 1 \mid x + 1)}{p(y = 0 \mid x + 1)}\right)\bigg/ \left(\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}\right) \]
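
The connection to the coefficients is direct: the model sets the log odds equal to the linear predictor, so a one-unit increase in a predictor multiplies the odds by the exponentiated coefficient, which is exactly the ratio above:

\[ \frac{p(y = 1 \mid x + 1)/p(y = 0 \mid x + 1)}{p(y = 1 \mid x)/p(y = 0 \mid x)} = \frac{e^{\beta_0 + \beta_1 (x + 1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1} \]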

It’s much easier to think directly in terms of probabilities. However, due to the nonlinearity of the model, it is not possible to talk about a one-unit change in an independent variable having a constant effect on the probability. Instead, predicted probabilities require us to also take into account the other variables in the model. For example, the difference in the probability of voting for Trump between males and females may differ depending on whether we are talking about educated voters in their 30s or uneducated voters in their 60s.
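
Formally, the predicted probability is the inverse logit of the linear predictor, so the effect of changing one variable depends on where the other variables place a respondent along this curve:

\[ p(y = 1 \mid \mathbf{x}) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)} \]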

We can look at predicted probabilities using a combination of windows and syntax. Begin by fitting the regression model. This time, go to Analyze \(\rightarrow\) Generalized Linear Models \(\rightarrow\) Generalized Linear Models…. It is necessary to use the Generalized Linear Models command because the Logistic command does not support syntax for requesting predicted probabilities.

Select Binary Logistic for Type of Model.

For Response, select vote as the dependent variable. SPSS will default to treating the higher category as the reference. The data are coded so that Clinton = 1 and Trump = 2, which means that the default will be to estimate the log odds of voting for Clinton. Our preference is to interpret the model in terms of the odds of voting for Trump, which makes it necessary to change the default. This can be done by clicking Reference Category.

Select First (lowest value) as the reference category, then click Continue.

For Predictors, select age and educ as covariates. Select gender as a factor (categorical) variable. SPSS will automatically create dummy variables for any variable specified as a factor, treating the last category as the reference. Changing the category order for factors to descending (available under Options) makes Male, coded 1, the last category and hence the reference; this choice appears in the pasted syntax below as ORDER=DESCENDING.

In the Model tab, add age, educ, and gender as main effects to the model.

Finally, in the Statistics tab, check the box to include exponential parameter estimates. This requests that odds ratios be reported in the output. Then click Paste.

This will paste the syntax into a new syntax window.

Add the following line of code:

/EMMEANS TABLES = gender control = age (35) educ (4).

This requests that SPSS return a table with the predicted probabilities for males and females, holding age constant at 35 and education constant at 4 (college degree).
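
The CONTROL keyword accepts any values of the covariates, and a GENLIN command may contain multiple /EMMEANS subcommands. For example, assuming a value of 2 on the education scale corresponds to a high school diploma, a second table for 60-year-old high school graduates could be requested as follows (we stick with the single table at age 35 and educ 4 for the rest of this walkthrough):

/EMMEANS TABLES = gender control = age (35) educ (4)
/EMMEANS TABLES = gender control = age (60) educ (2).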

The complete syntax will be:

GENLIN vote (REFERENCE=FIRST) BY gender (ORDER=DESCENDING) WITH educ age
  /MODEL gender educ age INTERCEPT=YES
    DISTRIBUTION=BINOMIAL LINK=LOGIT
  /CRITERIA METHOD=FISHER(1) SCALE=1 COVB=MODEL MAXITERATIONS=100 MAXSTEPHALVING=5 
    PCONVERGE=1E-006(ABSOLUTE) SINGULAR=1E-012 ANALYSISTYPE=3(WALD) CILEVEL=95 CITYPE=WALD 
    LIKELIHOOD=FULL
  /MISSING CLASSMISSING=EXCLUDE
  /PRINT CPS DESCRIPTIVES MODELINFO FIT SUMMARY SOLUTION (EXPONENTIATED)
  /EMMEANS TABLES = gender control = age (35) educ (4).

Then select everything and run. We will get the following output:

The first four tables give descriptive information about the variables in the model.

The next table presents the value of the likelihood function at its optimum as well as different statistics based on the likelihood value. These are typically used to compare different models and thus are not relevant here.

The omnibus test is a test that the model as a whole is significant (that is, that gender, age, and education jointly have a significant effect). It will generally be significant if at least one of the predictors is significant, which is the case for this model.

The Tests of Model Effects table shows that gender, age, and educ each have a significant effect. It will display the same p-values as the Parameter Estimates table below, except when a factor variable has more than two levels. For categorical variables with three or more levels, the Tests of Model Effects table reports whether all of the dummy indicators for that factor are jointly significant.

Next, we get the logit table.

In this table:

  • B is the change in the log odds given a unit increase in the variable
  • Std. Error gives the standard error for B
  • Lower and Upper give the 95% Wald confidence interval for B
  • Wald Chi-Square will be the same as in the prior table, except for factor variables with more than two levels
  • df gives the chi-squared test degrees of freedom
  • Sig. gives the p-value from that test
  • Exp(B) gives the results in terms of odds ratios (the exponentiated coefficients)
  • Lower and Upper give the 95% Wald confidence interval for the odds ratios

Finally, the predicted probabilities table:

The values in the Mean column are the predicted probabilities for males or females holding age constant at 35 and education constant at 4 (college degree). The delta-method standard errors provide a measure of uncertainty around the estimates. The 95% confidence interval is useful for understanding how much uncertainty we have in our predicted probabilities. The probability that a 35-year-old, college-educated male votes for Trump is .43, 95% CI = [.40, .47], and the probability that a 35-year-old, college-educated female votes for Trump is .35, 95% CI = [.31, .38].