This tutorial shows how to fit a multiple regression model (that is, a linear regression with more than one independent variable) using Stata. The details of the underlying calculations can be found in our multiple regression tutorial. The data used in this post come from the study "More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior" by DiGrazia J, McKelvey K, Bollen J, Rojas F (2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and actual vote results. The replication data in Stata format can be downloaded from our GitHub repo.
The variables of interest are:
vote_share (dependent variable): The percent of votes for a Republican candidate
mshare (independent variable): The percent of social media posts for a Republican candidate
pct_white (independent variable): The percent of white voters in a given Congressional district
All three variables are measured as percentages ranging from zero to 100.
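Before plotting anything, the stated 0 to 100 range can be checked directly. The following is a minimal sketch, assuming the replication data are already loaded in memory and the variables carry the names listed above.

```stata
* Assumes the replication data are already loaded in memory
* Minimum and maximum should both lie between 0 and 100
summarize vote_share mshare pct_white

* Stop with an error if any non-missing value falls outside 0-100
assert inrange(vote_share, 0, 100) if !missing(vote_share)
assert inrange(mshare, 0, 100)     if !missing(mshare)
assert inrange(pct_white, 0, 100)  if !missing(pct_white)
```

If every assert passes silently, all non-missing values are inside the expected range.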
Data Visualization
It is always a good idea to begin any statistical modeling with a graphical assessment of the data. This allows you to quickly examine the distributions of the variables and check for possible outliers. The following code returns a histogram for the vote_share variable, our outcome of interest.
label var vote_share "Vote Share"
label var mshare "Tweet Share"
label var pct_white "Percent White"
histogram vote_share, freq kdensity
We start off by labeling our variables so that the figure displays informative labels rather than the variable names. The freq option requests that the y-axis show the frequency with which the binned values occur in the data; the default is to show the density. The kdensity option adds a kernel density line, which is a smoothed version of the histogram. The variable's values (x-axis) fall within the range we expect. There is some negative skew in the distribution.
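The visual impression of skew can be backed up with a number. As a small sketch, the detail option of summarize reports the skewness statistic, where a negative value indicates a longer left tail.

```stata
* Detailed summary statistics, including percentiles, skewness and kurtosis
summarize vote_share, detail

* The skewness is also left behind as a stored result
display "Skewness = " r(skewness)
```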
We can do the same thing for both of our independent variables.
histogram mshare, freq kdensity
We again see that the values fall into the range we expect. Note that there are also spikes at zero and 100. These indicate races where a single candidate received either all of the share of Tweets or none of the share of Tweets.
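To see how many races make up the spikes, the boundary values can be counted. This is a sketch that assumes the boundary cases are stored as exactly 0 and 100 in the data.

```stata
* Number of races where the Republican candidate received none of the Tweets
count if mshare == 0

* Number of races where the Republican candidate received all of the Tweets
count if mshare == 100
```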
histogram pct_white, freq kdensity
The following figure shows the distribution of the percent white variable.
Again, the values fall in the range we would expect. There is a negative skew in the distribution.
It is also helpful to look at the bivariate association between the dependent variable and each of the independent variables. This allows us to see whether there is visual evidence of a relationship, which will help us assess whether the regression results we ultimately get make sense given what we see in the data. The syntax to use is the following.
graph twoway (scatter vote_share mshare, msize(vsmall)) ///
(lfit vote_share mshare), xtitle("Tweet Share") ytitle("Vote Share")
graph twoway (scatter vote_share pct_white, msize(vsmall)) ///
(lfit vote_share pct_white), xtitle("Percent White") ytitle("Vote Share")
Here we are looking at a scatterplot of our observations, and we've also requested the best linear fit (i.e. the regression line) to better see the positive relationship. The msize option makes the dots in the figure "very small", which arguably looks a little nicer given the number of observations. The /// allows a command to span multiple lines when we run it from a do-file. There is a clear, positive association between these variables.
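The scatterplots can be complemented with pairwise correlations, which put a number on each bivariate association. This is an optional check, not part of the original workflow.

```stata
* Pairwise correlations among the three variables
* obs shows the number of observations behind each correlation
* sig shows the p-value for each correlation
pwcorr vote_share mshare pct_white, obs sig
```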
Running the Regression
The following syntax runs the regression.
regress vote_share mshare pct_white
This returns the following:
      Source |       SS           df       MS      Number of obs   =       406
-------------+----------------------------------   F(2, 403)       =    250.42
       Model |  64511.7549         2  32255.8775   Prob > F        =    0.0000
    Residual |  51908.7685       403  128.805877   R-squared       =    0.5541
-------------+----------------------------------   Adj R-squared   =    0.5519
       Total |  116420.523       405  287.458083   Root MSE        =    11.349

------------------------------------------------------------------------------
  vote_share |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      mshare |   .1783636   .0184005     9.69   0.000     .1421906    .2145367
   pct_white |   .5500785    .033677    16.33   0.000     .4838739     .616283
       _cons |   .8649705   2.448847     0.35   0.724    -3.949139     5.67908
------------------------------------------------------------------------------
The box at the top left provides us with an ANOVA table that gives 1) the sum of squares (SS) for the model, often called the regression sum of squares, 2) the residual sum of squares, and 3) the total sum of squares. Dividing the SS column by the df (degrees of freedom) column returns the mean squares in the MS column. These values go into calculating the \(F\)-statistic, \(R^2\), adjusted \(R^2\), and Root Mean Square Error shown in the top right of the output.
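As a worked check, the top-right statistics can be reproduced by hand from the ANOVA table, using the standard formulas and the values printed in the output:

\[ F = \frac{MS_{\text{model}}}{MS_{\text{residual}}} = \frac{32255.8775}{128.805877} \approx 250.42 \]

\[ R^2 = \frac{SS_{\text{model}}}{SS_{\text{total}}} = \frac{64511.7549}{116420.523} \approx 0.5541 \]

\[ \text{Adjusted } R^2 = 1 - \frac{MS_{\text{residual}}}{MS_{\text{total}}} = 1 - \frac{128.805877}{287.458083} \approx 0.5519 \]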
Looking at the top right, we see that the number of observations used to fit the model was 406. The \(F\)-statistic tests the null hypothesis that the independent variables together do not help explain any variance in the outcome. We clearly reject the null hypothesis with \(p < 0.001\), as seen by Prob > F = 0.0000. The R-squared value tells us that the independent variables explain 55.41% of the variation in the outcome. The adjusted \(R^2\) provides a slightly more conservative estimate of the percentage of variance explained, 55.19%. The Root MSE is the square root of the residual MS from the top left table, \(\sqrt{128.805877} = 11.349\). This value gives a summary of how much the observed values vary around the predicted values, with better models having lower RMSEs. It is also used in the formula for the standard error of the coefficient estimates, shown in the next table.
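The same quantities can be recomputed inside Stata from the results that regress stores in e(). The sketch below assumes the model has just been fit on the replication data; the displayed values should match the top right of the output.

```stata
regress vote_share mshare pct_white

* e(mss) = model SS, e(rss) = residual SS
* e(df_m) = model df, e(df_r) = residual df, e(N) = number of observations
display "F        = " ((e(mss)/e(df_m)) / (e(rss)/e(df_r)))
display "R2       = " (e(mss) / (e(mss) + e(rss)))
display "Adj R2   = " (1 - (e(rss)/e(df_r)) / ((e(mss) + e(rss))/(e(N) - 1)))
display "Root MSE = " (sqrt(e(rss)/e(df_r)))
```

Stata also stores the finished statistics directly as e(F), e(r2), e(r2_a), and e(rmse), so the manual versions are useful mainly as a way to see how the pieces fit together.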
The final table tells us the results of the regression model. The estimate for mshare is 0.178. This means that for each one-percentage-point increase in mshare, the predicted vote share increases by 0.178 percentage points, holding percent white constant. The estimate for pct_white is 0.55. This means that for each one-percentage-point increase in pct_white, the predicted vote share increases by 0.55 percentage points, holding Tweet share constant. The standard error tells us how much sample-to-sample variability we should expect in the coefficient estimates. Dividing the coefficient by the standard error gives us the \(t\)-statistic used to calculate the \(p\)-value. Here we see that the mshare and pct_white coefficient estimates are easily significant, \(p < 0.001\). The 95% confidence intervals for the coefficients are also presented.
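To make the link between the columns concrete, the \(t\)-statistic, \(p\)-value, and confidence interval for mshare can be rebuilt from the coefficient and its standard error. This is a sketch using Stata's _b[], _se[], ttail(), and invttail(); the scalar names are arbitrary choices for illustration.

```stata
regress vote_share mshare pct_white

* t-statistic: coefficient divided by its standard error
scalar t_mshare = _b[mshare] / _se[mshare]

* Two-sided p-value from the t distribution with residual df
scalar p_mshare = 2 * ttail(e(df_r), abs(t_mshare))

* 95% confidence interval: coefficient +/- critical value * standard error
scalar crit  = invttail(e(df_r), 0.025)
scalar lower = _b[mshare] - crit * _se[mshare]
scalar upper = _b[mshare] + crit * _se[mshare]

display "t = " t_mshare "   p = " p_mshare
display "95% CI: [" lower ", " upper "]"
```

The displayed values should agree with the mshare row of the coefficient table, up to rounding.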
The constant, _cons, is the vote share we expect when Tweet share and percent white both equal zero. Here we see that the predicted value is 0.865, with a 95% confidence interval of [-3.949, 5.679]. The estimated constant is not significantly different from zero, \(p = 0.724\), though this test is of less interest to us than the significance of the independent variable estimates.
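Putting the three estimates together gives the fitted equation, which can be used to compute a predicted vote share for any combination of the independent variables:

\[ \widehat{\text{vote\_share}} = 0.865 + 0.178 \times \text{mshare} + 0.550 \times \text{pct\_white} \]

As an illustration with hypothetical values (not a district from the data), a race with a Tweet share of 50 in a district that is 70 percent white has a predicted vote share, using the unrounded coefficients, of

\[ 0.8649705 + 0.1783636 \times 50 + 0.5500785 \times 70 \approx 48.29 \]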