This tutorial shows how to fit a multiple regression model (that is, a linear regression with more than one independent variable) using SPSS. The details of the underlying calculations can be found in our multiple regression tutorial. The data used in this post come from the study "More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior" by DiGrazia J, McKelvey K, Bollen J, Rojas F (2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and actual vote results. The replication data in SPSS format can be downloaded from our GitHub repo.
The variables of interest are:
vote_share (dependent variable): The percent of votes for a Republican candidate
mshare (independent variable): The percent of social media posts for a Republican candidate
pct_white (independent variable): The percent of white voters in a given Congressional district
All three variables are measured as percentages ranging from zero to 100.
We can run the following line of syntax to delete all other variables.
DELETE VARIABLES
eshare to median_age pct_college to mccain_tert.
Data Visualization
It is always a good idea to begin any statistical modeling with a graphical assessment of the data. This allows you to quickly examine the distributions of the variables and check for possible outliers. Go to Graphs \(\rightarrow\) Chart Builder…
Then select Simple Histogram as the chart type, and click and drag vote_share to the x-axis.
We can clean up the x-axis label in Element Properties on the right-hand side.
Then click OK.
This creates the following figure:
The variable’s values (x-axis) fall within the range we expect. There is some negative skew in the distribution.
We can do the same thing for our tweet share and percent white variables and get the following figures:
For tweet share, we again see that the values fall into the range we expect. Note that there are also spikes at zero and 100. These indicate races in which the Republican candidate received either all or none of the share of Tweets.
The following figure shows the distribution of the percentage white variable.
Again, the values fall in the range we’d expect. There is a negative skew in the distribution.
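The same histograms can also be requested with syntax rather than through Chart Builder. The following is a minimal sketch using the legacy GRAPH command; the charts it produces may differ cosmetically from the Chart Builder figures shown here, and the axis labels would still need to be cleaned up in the Chart Editor.

```
* Histograms for the three variables of interest.
GRAPH
  /HISTOGRAM=vote_share.
GRAPH
  /HISTOGRAM=mshare.
GRAPH
  /HISTOGRAM=pct_white.
```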
It is also helpful to look at the bivariate association between the variables. This allows us to see whether there is visual evidence of a relationship, which will help us assess whether the regression results we ultimately get make sense given what we see in the data. We will do this using Chart Builder, selecting Simple Scatter with Fit Line. This returns a scatterplot of the variables along with the best linear fit (i.e., the regression line), which makes the positive relationship easier to see.
Then click OK.
We do the same thing for the percent white variable and get the following plot:
There is a clear, positive association between these variables.
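As a syntax alternative, the legacy GRAPH command can draw the two scatterplots. This is a sketch rather than the exact syntax Chart Builder generates: it places each independent variable on the x-axis and vote_share on the y-axis, and it does not draw the fit line, which we assume would be added afterwards in the Chart Editor.

```
* Scatterplots of vote share against each independent variable.
* The first variable goes on the x-axis and the second on the y-axis.
GRAPH
  /SCATTERPLOT(BIVAR)=mshare WITH vote_share
  /MISSING=LISTWISE.
GRAPH
  /SCATTERPLOT(BIVAR)=pct_white WITH vote_share
  /MISSING=LISTWISE.
```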
Running the Regression
To run the regression, go to Analyze \(\rightarrow\) Regression \(\rightarrow\) Linear…
Select vote_share as the dependent variable and mshare and pct_white as the independent variables. Then click OK.
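The same model can be run from a syntax window. The sketch below approximates what the Paste button in the Linear Regression dialog produces with default settings; the exact subcommands may vary by SPSS version, so treat it as an illustration rather than a verbatim copy of the dialog's output.

```
* Multiple regression of vote share on tweet share and percent white.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT vote_share
  /METHOD=ENTER mshare pct_white.
```

Keeping the analysis in syntax makes it easy to rerun and to document exactly which variables entered the model.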
We get the following output:
The first table lists the variables in the model.
The second table provides the model summary. The \(R\) value is given, though the \(R^2\) value is more commonly used in interpretation. The R Square value tells us that the independent variables together explain 55.4% of the variation in the outcome. The adjusted \(R^2\) provides a slightly more conservative estimate of the percentage of variance explained, 55.2%. The Std. Error of the Estimate summarizes how much the observed values vary around the predicted values, with better-fitting models having lower values.
The third table is an ANOVA table that gives 1) the sum of squares for the regression model, 2) the residual sum of squares, and 3) the total sum of squares. Dividing the Sum of Squares column by the df (degrees of freedom) column returns the values in the Mean Square column. These values go into calculating the \(R^2\), adjusted \(R^2\), and Std. Error of the Estimate shown in the previous table. The \(F\)-statistic tests the null hypothesis that the independent variables together do not help explain any variance in the outcome. We clearly reject the null hypothesis with \(p < 0.001\). SPSS displays this as Sig. = 0.000, which means the \(p\)-value rounds to zero at three decimal places, not that it is exactly zero.
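For reference, the relationships among these quantities can be sketched with the standard definitions. The notation is ours, not SPSS's: \(SS_{reg}\), \(SS_{res}\), and \(SS_{tot}\) are the regression, residual, and total sums of squares, \(n\) is the number of cases used in the model, and \(k\) is the number of independent variables (here \(k = 2\)).

\[
R^2 = \frac{SS_{reg}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}},
\qquad
R^2_{adj} = 1 - \frac{SS_{res}/(n-k-1)}{SS_{tot}/(n-1)}
\]

\[
F = \frac{SS_{reg}/k}{SS_{res}/(n-k-1)} = \frac{MS_{reg}}{MS_{res}},
\qquad
\text{Std. Error of the Estimate} = \sqrt{MS_{res}}
\]

The degrees of freedom in the ANOVA table correspond to \(k\) for the regression row, \(n-k-1\) for the residual row, and \(n-1\) for the total row.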
The final table gives us the results of the regression model. The Unstandardized B column gives the coefficients used in the regression equation. The (Constant) line is the estimate for the intercept in the multiple regression equation. This is the vote share we expect when Tweet share and percent white both equal zero. Here we see that the predicted value is 0.865. This value is of less interest to us than the coefficients for mshare and pct_white. We can see that for each one-percentage-point increase in mshare, the predicted vote share increases by 0.178 percentage points, holding percent white constant. For each one-percentage-point increase in percent white, the predicted vote share increases by 0.55 percentage points, holding tweet share constant.
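Putting the reported coefficients together gives the fitted regression equation. In the sketch below, \(\hat{y}\) is the predicted vote share, \(x_1\) is mshare, and \(x_2\) is pct_white; these symbols are our own shorthand.

\[
\hat{y} = 0.865 + 0.178\,x_1 + 0.55\,x_2
\]

As a purely hypothetical illustration (these input values are not taken from the data), a district with a tweet share of \(x_1 = 50\) and a percent white of \(x_2 = 80\) would have a predicted Republican vote share of

\[
\hat{y} = 0.865 + 0.178(50) + 0.55(80) = 0.865 + 8.9 + 44 = 53.765 .
\]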
The Coefficients Std. Error column tells us how much sample-to-sample variability we should expect in each estimate. Dividing the coefficient by its standard error gives us the \(t\)-statistic used to calculate the \(p\)-value. Here we see that both the mshare and pct_white coefficient estimates are statistically significant, \(p < 0.001\), while the (Constant) is not, \(p=0.865\). In this application we don’t especially care about the constant. The standardized coefficients give us the association between the independent variables and the dependent variable in standard deviation units. A one standard deviation increase in mshare is associated with a change of 0.338 standard deviations in vote_share, holding percent white constant.
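The two calculations described above can be written out as follows. This is a sketch using standard definitions with our own notation: \(B_j\) is the unstandardized coefficient for the \(j\)-th independent variable, \(SE(B_j)\) is its standard error, \(s_{x_j}\) is the sample standard deviation of that variable, and \(s_y\) is the sample standard deviation of vote_share.

\[
t_j = \frac{B_j}{SE(B_j)},
\qquad
\beta_j = B_j \cdot \frac{s_{x_j}}{s_y}
\]

Under these definitions, the standardized coefficient of 0.338 for mshare is its unstandardized coefficient of 0.178 rescaled by the ratio of the standard deviation of mshare to that of vote_share.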