Logistic Regression in SAS

Nikki Kamouneh

Posted on
logit logistic regression SAS

This post outlines the steps for performing a logistic regression in SAS. The data come from the 2016 American National Election Studies (ANES). Code for preparing the data can be found on our GitHub page, and the cleaned data can be downloaded here.

The steps that will be covered are the following:

  1. Check variable codings and distributions
  2. Graphically review bivariate associations
  3. Fit the logit model
  4. Interpret results in terms of odds ratios
  5. Interpret results in terms of predicted probabilities

The variables we use will be:

  • vote: Whether the respondent voted for Clinton or Trump
  • gender: Male or female
  • age: The age (in years) of the respondent
  • educ: The highest level of education attained

For simplicity, this demonstration will ignore the complex survey variables (weight, PSU, and strata).

Univariate Summaries

We can assign labels to our data to make interpretation easier using the following syntax:

proc format;
  value votecode
    1 = 'Clinton'
    2 = 'Trump';
  value gendercode
    1 = 'Male'
    2 = 'Female';
  value educcode
    1 = 'HS Not Completed'
    2 = 'Completed HS'
    3 = 'College <4 Years'
    4 = 'College 4 Year Degree'
    5 = 'Advanced Degree';
run;

data cleaned_anes;
  set cleaned_anes;
  format vote votecode. gender gendercode. educ educcode.;
run;

The first step in any statistical analysis should be to perform a visual inspection of the data in order to check for coding errors, outliers, or funky distributions. We can take a look at the frequencies of our categorical variables using:

proc freq data = cleaned_anes;
  tables vote gender educ;
run;

This will give us the following output:

SAS Frequencies

We can also check a summary of the distribution of age.

proc means data = cleaned_anes;
  var age;
run;

We get a table where N = 2,384 is the number of observations, the mean is 52, the standard deviation is 17.19, and the minimum and maximum ages are 18 and 90 respectively.

Summary of the distribution of age
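Since the point of this step is to catch coding errors, it can also help to ask for missing-value counts explicitly. The following is a minimal sketch, assuming the same cleaned_anes data set; the statistic keywords and the missing option are standard, but whether any values are actually missing in these data is not something the output above tells us.

```sas
/* Request the count of missing ages alongside the usual summary statistics */
proc means data = cleaned_anes n nmiss mean std min max maxdec = 2;
  var age;
run;

/* Show missing values as their own row in each frequency table */
proc freq data = cleaned_anes;
  tables vote gender educ / missing;
run;
```

If nmiss is zero and no missing rows appear, the N reported above applies to every variable in the model.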

Tables are useful, but often graphs are more informative. Bar graphs are the easiest for examining categorical variables. Start with the outcome variable.

proc sgplot data = cleaned_anes;
  vbar vote;
run;

Bar plot of Vote

In the sample, Clinton received more votes than Trump, but not by a large amount.

Now the categorical independent variables.

proc sgplot data = cleaned_anes;
  vbar gender;
run;

Bar plot of gender

We see that our sample has more females than males.

proc sgplot data = cleaned_anes;
  vbar educ;
run;

Bar plot of education

Within our sample, the modal respondent has some college (less than four years), and the second most common category is a four-year college degree.

For continuous variables, histograms allow us to determine the shape of the distribution and look for outliers. The following command produces the histogram.

proc sgplot data = cleaned_anes;
  histogram age;
run;

Histogram of Age

We now have a good sense as to what the distributions of all of our variables are and do not see any evidence that recodes are necessary.

Bivariate Summaries

Prior to moving on to the fully specified model, it is advisable to first examine the simple associations between the outcome and each individual predictor. Doing so can help avoid surprises in the final model. For example, if there is no simple relationship apparent in the data, we shouldn’t be taken aback when that predictor is not significant in the model. If there is a simple association, but it disappears in the full model, then we have evidence that one of the other variables is a confounder. Upon controlling for that factor, the relationship we initially observed is explained away.

Graphs are again helpful. When the outcome is categorical and the predictor is also categorical, a grouped bar graph is informative. The following is the graph of vote choice and gender.

proc sgplot data = cleaned_anes;
  vbar gender / group = vote groupdisplay = cluster;
run;

Bar plot of gender by vote

The figure shows that, within males, Trump support was higher. Within females, Clinton support was higher.

A similar figure can be made for education.

proc sgplot data = cleaned_anes;
  vbar educ / group = vote groupdisplay = cluster;
run;

Bar plot of education by vote

The figure suggests that Trump was favored by those with a high school diploma and some college, whereas Clinton’s support was higher with those who finished college and especially among those with an advanced degree. Although Clinton was slightly preferred among those without a high school diploma, the figure overall favors an interpretation that Clinton’s support increases with education.
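The grouped bar graphs can be backed up with numbers. As a rough sketch (not part of the original analysis), a cross-tabulation with row percentages gives the vote share within each gender and education level, and the chisq option adds a test of association:

```sas
proc freq data = cleaned_anes;
  /* Row percentages give the Clinton/Trump split within each level;
     nocol and nopercent suppress the other percentages to keep the table readable */
  tables (gender educ) * vote / chisq nocol nopercent;
run;
```

The row percentages should tell the same story as the bars, with a p-value attached to each bivariate association.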

Boxplots are useful for examining the association between a categorical variable and a variable measured on an interval scale.

proc sgplot data = cleaned_anes;
  hbox age / category = vote;
run;

Boxplot of age by vote

There’s a lot of overlap between the two boxes, though the Trump box is shifted slightly toward older ages relative to the Clinton box. The interpretation is that older respondents tend to be more likely to vote for Trump.
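To put numbers on the boxplot, one option is to summarize age separately for each vote choice. This is a small sketch using the same data set; the choice of statistics is ours and can be changed freely.

```sas
/* Summary of age within each level of vote */
proc means data = cleaned_anes n mean median std maxdec = 1;
  class vote;
  var age;
run;
```

Comparing the two medians shows how large the shift seen in the boxplot actually is.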

Having carefully reviewed the data, we can now move to estimating the model.

Fitting the Model

To fit a logistic regression in SAS, we will use the following code:

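/* descending models the probability of the higher-ordered outcome level (Trump) */
proc logistic data = cleaned_anes descending;
  /* only categorical predictors are listed here; param=glm requests 0/1 dummy coding */
  class gender / param=glm;
  /* age and educ are not in class, so they enter as linear terms */
  model vote = gender age educ;
run;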

SAS will automatically create dummy variables for the variables we specified under class if the param option is set equal to either ref or glm. (Without specifying param, the default is effect coding, which codes a two-level variable as -1, 1 rather than the 0, 1 we prefer.) We opt for glm here so that we can later add lsmeans statements to the syntax to estimate predicted probabilities.

Note that educ is an ordered categorical variable; we opt here to treat its effect as linear, which is why it is not listed in the class statement. The last variable, age, enters as a continuous predictor. The results of running this syntax are the following:
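Treating educ as linear is a modeling choice, and one way to probe it is to refit the model with educ as a categorical predictor and compare the two fits. The following is a sketch of that check, not something the original analysis reports:

```sas
/* Alternative specification: educ gets one dummy per level instead of a single slope */
proc logistic data = cleaned_anes descending;
  class gender educ / param=glm;
  model vote = gender age educ;
run;
```

If the AIC in the Model Fit Statistics table is similar across the two specifications and the educ coefficients change roughly evenly from level to level, the linear treatment is a reasonable simplification. Which education level serves as the reference depends on how SAS orders the levels, so check the Class Level Information table before reading the coefficients.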

Analysis of Maximum Likelihood estimates, MLE

We find that gender, age, and educ are all statistically significant predictors of vote choice.

We can also look at our results in terms of odds ratios. This is output automatically as well.

Odds Ratio Estimates

This will produce a table where:

  • Point Estimate is the odds ratio
  • 95% Wald Confidence Limits gives the lower and upper levels of the 95% confidence interval for the odds ratios

Note that the odds ratios are simply the exponentiated coefficients from the prior table.
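In symbols, writing \(\hat{\beta}_j\) for an estimated coefficient and \(SE_j\) for its standard error (notation introduced here for illustration), the odds ratio and its Wald confidence limits are obtained by exponentiating the estimate and the endpoints of its interval on the logit scale:

\[ OR_j = e^{\hat{\beta}_j}, \qquad 95\% \text{ CI} = \left[ e^{\hat{\beta}_j - 1.96\, SE_j},\; e^{\hat{\beta}_j + 1.96\, SE_j} \right] \]

Because the interval is built on the logit scale and then exponentiated, it is not symmetric around the odds ratio.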

Interpretation of Odds Ratios

The coefficients returned by our logit model are difficult to interpret intuitively, and hence it is common to report odds ratios instead. An odds ratio less than one means that an increase in \(x\) leads to a decrease in the odds that \(y = 1\). An odds ratio greater than one means that an increase in \(x\) leads to an increase in the odds that \(y = 1\). In general, the percent change in the odds given a one-unit change in the predictor can be determined as

\[ \% \text{ Change in Odds} = 100(OR - 1) \]

For example, the odds of voting for Trump are 29.9% lower for females than for males, since \(100(.701 - 1) = -29.9\%\). In addition, each one-step increase on the education scale leads to a 22.3% decrease in the odds of voting for Trump, since \(100(.777 - 1) = -22.3\%\). Finally, each one-year increase in age leads to a \(100(1.013 - 1) = 1.3\%\) increase in the odds of voting for Trump. All of these are statistically significant at \(p < .05\).
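A one-year change in age is a small step, so it can be more informative to report the odds ratio for a larger increment. As a sketch, the units statement in proc logistic requests an odds ratio for a custom change in a continuous predictor; the ten-year increment here is our choice for illustration.

```sas
proc logistic data = cleaned_anes descending;
  class gender / param=glm;
  model vote = gender age educ;
  /* Additionally report the odds ratio for a 10-year increase in age */
  units age = 10;
run;
```

Because odds ratios multiply, the ten-year odds ratio is the one-year odds ratio raised to the tenth power. Using the rounded value above, \(1.013^{10} \approx 1.14\), or roughly a 14% increase in the odds of voting for Trump per decade of age; the value SAS reports may differ slightly because it uses the unrounded coefficient.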

Interpretation in Terms of Predicted Probabilities

Odds ratios are commonly reported, but they are still somewhat difficult to intuit given that an odds ratio requires four separate probabilities:

\[ \text{Odds Ratio} = \left(\frac{p(y = 1 \mid x + 1)}{p(y = 0 \mid x + 1)}\right)\bigg/ \left(\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}\right) \]

It’s much easier to think directly in terms of probabilities. However, due to the nonlinearity of the model, it is not possible to talk about a one-unit change in an independent variable having a constant effect on the probability. Instead, predicted probabilities require us to also take into account the other variables in the model. For example, the difference in the probability of voting for Trump between males and females may differ depending on whether we are talking about educated voters in their 30s or uneducated voters in their 60s.
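The nonlinearity can be seen from the form of the model itself. Writing \(\beta_0, \dots, \beta_k\) for the coefficients and \(x_1, \dots, x_k\) for the predictors (generic notation introduced here), the predicted probability and its rate of change with respect to one predictor are

\[ p(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}, \qquad \frac{\partial p}{\partial x_j} = \beta_j \, p \, (1 - p) \]

The effect of \(x_j\) on the probability is scaled by \(p(1 - p)\), which depends on every predictor in the model. It is largest when \(p = .5\) and shrinks as \(p\) approaches 0 or 1.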

We can look at predicted probabilities using the following syntax.

proc logistic data = cleaned_anes descending;
  class gender / param=glm;
  model vote = gender age educ;
  lsmeans gender / at (age educ) = (35 4) ilink cl;
run;

Note that param = glm is necessary for lsmeans. lsmeans will give us predicted values for each level of gender, with the at option specifying that age is held constant at 35 and education is held constant at 4 (college degree). The ilink argument returns predictions on the probability scale, and cl requests confidence intervals.

We will get the following output:

Table of Least Squares Means

  • Estimate: the prediction on the logit scale.
  • Standard Error: predicted logit’s standard error.
  • z Value: tests whether the predicted logit differs significantly from zero.
  • Pr > |z|: p-value from the z-statistic. Note that this isn’t a hypothesis we care about here; we only want the predicted probabilities and their confidence intervals from the final columns.
  • Lower and Upper: lower and upper limits of the 95% confidence interval of the prediction on the logit scale.
  • Mean: the predicted probabilities for males or females holding age constant at 35 and education constant at 4 (college degree).
  • Standard Error of Mean: the delta-method standard error of the predicted probability.
  • Lower Mean and Upper Mean: the 95% confidence interval around the predicted probability.

The probability that a 35-year-old, college-educated male votes for Trump is .43, 95% CI = [.40, .47], and the probability that a 35-year-old, college-educated female votes for Trump is .35, 95% CI = [.31, .38]. SAS automatically plots these predictions for us.

LS Means for Gender plot
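To see how the gender gap depends on the other covariates, the same model can be evaluated at a second profile. The sketch below adds a second lsmeans statement for 65-year-olds who completed high school (educ = 2); the profile is our choice for illustration, and if your SAS version does not accept two lsmeans statements in one call, the procedure can simply be run twice.

```sas
proc logistic data = cleaned_anes descending;
  class gender / param=glm;
  model vote = gender age educ;
  /* Profile 1: age 35, four-year college degree (as in the example above) */
  lsmeans gender / at (age educ) = (35 4) ilink cl;
  /* Profile 2: age 65, completed high school */
  lsmeans gender / at (age educ) = (65 2) ilink cl;
run;
```

Comparing the Mean column across the two tables shows whether the male-female difference in predicted probability is the same size at both profiles. Because the model is nonlinear, it generally will not be.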