Multiple Regression in SAS


This tutorial shows how to fit a multiple regression model (that is, a linear regression with more than one independent variable) using SAS. The details of the underlying calculations can be found in our multiple regression tutorial. The data used in this post come from the study More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior (DiGrazia J, McKelvey K, Bollen J, Rojas F, 2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and actual vote results. The replication data in SAS format can be downloaded from our GitHub repo.

The variables of interest are:

  • vote_share (dependent variable): The percent of votes for a Republican candidate
  • mshare (independent variable): The percent of social media posts for a Republican candidate
  • pct_white (independent variable): The percent of white voters in a given Congressional district

All three variables are measured as percentages ranging from zero to 100.
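A quick way to confirm these ranges before plotting is to request summary statistics. The following is a minimal sketch, assuming the replication data have already been loaded into a data set named twitter_data (the name used in the rest of this post).

```sas
/* Summary statistics to confirm each variable lies between 0 and 100 */
proc means data = twitter_data n nmiss min max mean;
var vote_share mshare pct_white;
run;
```

A minimum or maximum outside 0 to 100, or a nonzero missing count, would be worth investigating before fitting the model.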

Data Visualization

It is always a good idea to begin any statistical modeling with a graphical assessment of the data. This allows you to quickly examine the distributions of the variables and check for possible outliers. The following code returns a histogram for the vote_share variable, our outcome of interest.

proc sgplot data = twitter_data;
histogram vote_share;
density vote_share / type=kernel;
title "Vote Share Distribution";
/* set the axis label and range in a single xaxis statement */
xaxis label="Vote Share" min=0 max=100;
run;

SAS Histogram and Kernel Density Plot for Vote Share

We start with the proc sgplot statement, which tells SAS to use the sgplot procedure, and set the input data set to twitter_data. Next, the histogram statement creates the histogram of vote_share; by default the y-axis shows percent. The density statement overlays a density curve, and type=kernel requests a kernel density estimate, which is a smoothed version of the histogram. We provide a chart title and an x-axis label so that the figure displays informative text rather than the variable name, and we fix the x-axis range at 0 to 100. The variable’s values fall within the range we expect, and the distribution shows some negative skew.

We can do the same thing for both of our independent variables.

proc sgplot data = twitter_data;
histogram mshare;
density mshare / type=kernel;
title "Tweet Share Distribution";
/* set the axis label and range in a single xaxis statement */
xaxis label="Tweet Share" min=0 max=100;
run;

SAS Histogram and Kernel Density Plot for Tweet Share

We again see that the values fall into the range we expect. Note that there are also spikes at zero and 100. These indicate races where a single candidate received either all of the share of Tweets or none of the share of Tweets.
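To see how many races sit in those two spikes, one option is to count them directly. This is only a sketch: the output column names n_zero and n_hundred are made up for illustration, and it assumes mshare is stored as exactly 0 or 100 in those races.

```sas
/* Count races where the Republican tweet share is exactly 0 or exactly 100 */
/* n_zero and n_hundred are illustrative column names, not part of the data */
proc sql;
select sum(mshare = 0) as n_zero,
       sum(mshare = 100) as n_hundred
from twitter_data;
quit;
```

Each comparison evaluates to 1 when true and 0 otherwise, so summing it gives the number of matching races.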

proc sgplot data = twitter_data;
histogram pct_white;
density pct_white / type=kernel;
title "Percent White Distribution";
/* set the axis label and range in a single xaxis statement */
xaxis label="Percent White" min=0 max=100;
run;

The following figure shows the distribution of the percentage white variable.

SAS Histogram and Kernel Density Plot for Percent White

Again, the values fall in the range we would expect. There is a negative skew in the distribution.

It is also helpful to look at the bivariate association between the dependent variable and each of the independent variables. This allows us to see whether there is visual evidence of a relationship, which will help us assess whether the regression results we ultimately get make sense given what we see in the data. The syntax to use is the following.

proc sgplot data = twitter_data;
reg x = mshare y = vote_share;
title "Scatterplot with Linear Fit";
xaxis label = "Tweet Share";
yaxis label = "Vote Share";
run;

SAS Scatterplot and Linear Fit

proc sgplot data = twitter_data;
reg x = pct_white y = vote_share;
title "Scatterplot with Linear Fit";
xaxis label = "Percent White";
yaxis label = "Vote Share";
run;

SAS Scatterplot and Linear Fit

Here we use the reg statement within proc sgplot to draw a scatterplot of the observations together with the best linear fit (i.e., the regression line). Both plots show a clear, positive association between the independent variable and vote share.
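The visual impression can be backed up with a number. As a rough sketch, again assuming the data set is named twitter_data, proc corr reports the Pearson correlation of vote share with each independent variable.

```sas
/* Pearson correlations of the outcome with each independent variable */
proc corr data = twitter_data;
var mshare pct_white;
with vote_share;
run;
```

Positive correlations here would agree with the upward-sloping fit lines in the two scatterplots.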

Running the Regression

The following syntax runs the regression.

proc reg data=twitter_data;
title "Linear Regression Model";
model vote_share = mshare pct_white;
run;
quit;

This returns quite a few tables and figures.

SAS Regression Tables

The first table tells us that there were 406 observations in the data, and all 406 were used in the analysis.

The next table provides us with an ANOVA table that gives 1) the sum of squares for the model, often called the regression sum of squares, 2) the error, or residual, sum of squares, and 3) the total sum of squares. Dividing the Sum of Squares column by the DF (degrees of freedom) column returns the mean squares in the Mean Square column. These values go into calculating the \(F\)-statistic, which is 250.42. The \(F\)-statistic tests the null hypothesis that the independent variables together do not help explain any variance in the outcome. We clearly reject the null hypothesis with \(p < 0.001\), as seen by the Pr > F value.
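The degrees of freedom and the \(F\)-statistic in this table follow from the standard definitions. As a sketch, write \(n\) for the number of observations and \(k\) for the number of independent variables (notation introduced here for illustration; it does not appear in the SAS output):

\[
\mathrm{DF}_{\text{model}} = k = 2, \qquad \mathrm{DF}_{\text{error}} = n - k - 1 = 406 - 2 - 1 = 403, \qquad \mathrm{DF}_{\text{total}} = n - 1 = 405
\]

\[
F = \frac{\mathrm{MS}_{\text{model}}}{\mathrm{MS}_{\text{error}}} = \frac{\mathrm{SS}_{\text{model}} / 2}{\mathrm{SS}_{\text{error}} / 403} = 250.42
\]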

Looking at the next table, the R-squared value tells us that the independent variables together explain 55.41% of the variation in the outcome. The adjusted \(R^2\), which penalizes the number of independent variables in the model, provides a slightly more conservative estimate of the percentage of variance explained, 55.19%. The Root MSE is the square root of the residual mean square from the previous table, \(\sqrt{128.81} = 11.35\). This value summarizes how much the observed values vary around the predicted values, with better models having a lower Root MSE. It is also used in the formula for the standard error of the coefficient estimates, shown in the next table. The table also displays the dependent variable’s mean and the coefficient of variation (the Root MSE divided by the mean and multiplied by 100).
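The adjusted value can be reproduced by hand from the reported \(R^2\). As a sketch, with \(n = 406\) observations and \(k = 2\) independent variables (symbols introduced here for illustration):

\[
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} = 1 - (1 - 0.5541) \times \frac{405}{403} \approx 0.5519
\]

Because \(n\) is large relative to \(k\), the adjustment is small, which is why the two values differ only in the third decimal place.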

The final table tells us the results of the regression model. The intercept in the multiple regression equation is the vote share we expect when Tweet share and percent white both equal zero. Here we see that the predicted value is 0.865. The estimated intercept is not significantly different from zero, \(p = 0.724\), though this test is of less interest to us than the significance of the independent variable estimates. The estimate for mshare is 0.178. This means that for each one-percentage-point increase in mshare, the expected vote share increases by 0.178 percentage points, holding percent white constant. The estimate for pct_white is 0.55. This means that for each one-percentage-point increase in pct_white, the expected vote share increases by 0.55 percentage points, holding Tweet share constant. The standard error tells us how much sample-to-sample variability we should expect in the coefficient estimates. Dividing the coefficient by the standard error gives us the \(t\)-statistic used to calculate the \(p\)-value. Here we see that the mshare and pct_white coefficient estimates are both highly significant, \(p < 0.001\).
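Putting the three estimates together gives the fitted equation. The district used below is hypothetical, chosen only to illustrate the arithmetic, and the coefficients are the rounded values quoted above, so the result is approximate.

\[
\widehat{\text{vote\_share}} = 0.865 + 0.178 \times \text{mshare} + 0.55 \times \text{pct\_white}
\]

For a district with a Tweet share of 50 and a white population of 80 percent:

\[
\widehat{\text{vote\_share}} = 0.865 + 0.178 \times 50 + 0.55 \times 80 = 0.865 + 8.9 + 44 = 53.765 \approx 53.8
\]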

SAS also creates fit diagnostics figures, but we will not cover those here. For more information on assessing model fit, see our tutorial here.