Multiple Regression in R

This tutorial shows how to fit a multiple regression model (that is, a linear regression with more than one independent variable) using R. The details of the underlying calculations can be found in our multiple regression tutorial. The data used in this post come from the study More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior by DiGrazia, McKelvey, Bollen, and Rojas (2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and actual vote results. The replication data in R format (.rds) can be downloaded from our GitHub repo.

The variables of interest are:

  • vote_share (dependent variable): The percent of votes for a Republican candidate
  • mshare (independent variable): The percent of social media posts for a Republican candidate
  • pct_white (independent variable): The percent of white voters in a given Congressional district

All three variables are measured as percentages ranging from zero to 100.

We will begin by loading the packages we will use:

library(tidyverse)
library(knitr)
library(readr)

Next we will load the data. We use the select function from dplyr to keep only the variables of interest.

twitter_data <- read_rds("data/twitter_data.rds") %>%
  select(vote_share, mshare, pct_white)
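
Before moving on, it never hurts to glance at the data. As a quick optional check, glimpse() (loaded with the tidyverse) shows each variable's type and first few values, and summary() reports the ranges, which should match the zero-to-100 scale described above:

# Inspect variable types and example values
glimpse(twitter_data)

# Five-number summaries; each variable should run from roughly 0 to 100
summary(twitter_data)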

Data Visualization

It is always a good idea to begin any statistical modeling with a graphical assessment of the data. This allows you to quickly examine the distributions of the variables and check for possible outliers. The following code returns a histogram for the vote_share variable, our outcome of interest.

twitter_data %>%
  ggplot(aes(x=vote_share)) +
  geom_histogram(aes(y = ..density..),
                 color = 'black', 
                 fill = 'firebrick') +
  geom_line(stat="density") +
  labs(x ="Vote Share",  y = "Density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

R Histogram and Kernel Density Plot for Vote Share

We start off with a histogram from geom_histogram. ..density.. tells R to plot the density of each bin; the default for geom_histogram is to plot counts, but we switch to densities so that the density curve we add next is on the same scale as the bars. The color and fill arguments set the outline and fill colors of the bars. Next, we add the kernel density line, a smoothed version of the histogram, using geom_line(stat="density"). Finally, we label the axes so that the figure displays informative labels rather than the raw variable names. The variable’s values (x-axis) fall within the range we expect, and there is some negative skew in the distribution.
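
The stat_bin() message printed above is simply a reminder that we accepted the default of 30 bins. If you prefer explicit control, you can set the bin width yourself; since vote share runs from zero to 100, a width of 5 percentage points is one reasonable choice (the value here is an illustration, not a recommendation from the study):

twitter_data %>%
  ggplot(aes(x=vote_share)) +
  geom_histogram(aes(y = ..density..),
                 binwidth = 5,
                 color = 'black', 
                 fill = 'firebrick') +
  geom_line(stat="density") +
  labs(x ="Vote Share",  y = "Density")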

We can do the same thing for our tweet share and percent white variables.

twitter_data %>%
  ggplot(aes(x=mshare)) +
  geom_histogram(aes(y = ..density..),
                 color = 'black', 
                 fill = 'firebrick') +
  geom_line(stat="density") +
  labs(x = "Tweet Share", y = "Density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

R Histogram and Kernel Density Plot for Tweet Share

We again see that the values fall within the range we expect. Note that there are also spikes at zero and 100. These indicate races in which the Republican candidate received either all or none of the tweet share.
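
If you want to see how common these all-or-nothing races are, a quick count does the job (a simple sketch using dplyr's filter() and count()):

# Count districts where the Republican candidate had all or none of the tweet share
twitter_data %>%
  filter(mshare %in% c(0, 100)) %>%
  count(mshare)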

The following figure shows the distribution of the percentage white variable.

twitter_data %>%
  ggplot(aes(x=pct_white)) +
  geom_histogram(aes(y = ..density..),
                 color = 'black', 
                 fill = 'firebrick') +
  geom_line(stat="density") +
  labs(x = "Percent White", y = "Density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

R Histogram and Kernel Density Plot for Percent White

Again, the values fall in the range we’d expect. There is a negative skew in the distribution.
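
To put a number on the skew rather than eyeballing it, the sample skewness (the third standardized moment) can be computed in a few lines of base R; several add-on packages also provide a skewness() function, but the sketch below avoids the extra dependency. Negative values confirm the left skew visible in the plots:

# Sample skewness: negative values indicate a left (negative) skew
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

skew(twitter_data$vote_share)
skew(twitter_data$pct_white)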

It is also helpful to look at the bivariate associations between the outcome and each independent variable. This allows us to see whether there is visual evidence of a relationship, which will help us assess whether the regression results we ultimately get make sense given what we see in the data. The following code produces the scatterplots.

twitter_data %>%
  ggplot(aes(x=mshare, y = vote_share)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(x = "Tweet Share", y = "Vote Share")
## `geom_smooth()` using formula 'y ~ x'

R Scatterplot and Linear Fit for Vote Share by Tweet Share

twitter_data %>%
  ggplot(aes(x=pct_white, y = vote_share)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(x = "Percent White", y = "Vote Share")
## `geom_smooth()` using formula 'y ~ x'

R Scatterplot and Linear Fit for Vote Share by Percent White

In each plot we see a scatterplot of our observations along with the best linear fit (i.e., the regression line), which makes the relationship easier to see. The method = 'lm' option fits the linear model, using the default formula y ~ x. By default, geom_smooth also includes a 95% confidence interval, shown as the shaded area. There is a clear, positive association in both plots.
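
We can also quantify these associations with a correlation matrix. Passing use = "pairwise.complete.obs" tells cor() to compute each correlation from the observations available for that pair of variables, so any missing values do not derail the calculation:

# Pearson correlations among all three variables
cor(twitter_data, use = "pairwise.complete.obs")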

Running the Regression

The following syntax runs the regression.

linearmod <- lm(vote_share ~ mshare + pct_white, data = twitter_data)

We use the lm function to run the regression. The formula takes the form y ~ x1 + x2, and the data argument specifies the data frame containing the variables.

This returns:

summary(linearmod) 
## 
## Call:
## lm(formula = vote_share ~ mshare + pct_white, data = twitter_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.813  -7.745  -0.135   7.573  31.502 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.86497    2.44885   0.353    0.724    
## mshare       0.17836    0.01840   9.693   <2e-16 ***
## pct_white    0.55008    0.03368  16.334   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.35 on 403 degrees of freedom
##   (29 observations deleted due to missingness)
## Multiple R-squared:  0.5541, Adjusted R-squared:  0.5519 
## F-statistic: 250.4 on 2 and 403 DF,  p-value: < 2.2e-16

First we have the Call, which is the code used to generate the results.

Next are the residuals, the differences between the observed values of the dependent variable and the values predicted by the model. The minimum, 1st quartile, median, 3rd quartile, and maximum are presented.
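
If you want to work with the residuals directly, the residuals() extractor returns them as a numeric vector, and summary() reproduces the quantiles shown above (plus the mean, which is essentially zero by construction):

# Five-number summary of the residuals, matching the printout above
summary(residuals(linearmod))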

We then see the results of the regression model. The intercept is the vote share we would expect when tweet share and percent white both equal zero; here the predicted value is 0.865. The estimated constant is not significantly different from zero, \(p = 0.724\), though this test is of less interest to us than the tests on the independent variable estimates.

The estimate for mshare is 0.178. This means that for each one-percentage-point increase in tweet share, the expected vote share increases by 0.178 percentage points, holding percent white constant. The estimate for pct_white is 0.55, so each one-percentage-point increase in percent white is associated with a 0.55 percentage-point increase in expected vote share, holding tweet share constant.

The standard error tells us how much sample-to-sample variability we should expect in the coefficient estimates. Dividing a coefficient by its standard error gives the \(t\)-statistic used to calculate the \(p\)-value. Here we see that the mshare and pct_white coefficient estimates are easily significant, \(p < 0.001\).
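
One way to make these interpretations concrete is to generate a prediction at chosen values of the predictors. The values below (a tweet share of 50 in a district that is 70 percent white) are arbitrary illustrations, not quantities from the study:

# Predicted Republican vote share at illustrative predictor values
new_district <- tibble(mshare = 50, pct_white = 70)
predict(linearmod, newdata = new_district)

# The same prediction by hand from the coefficient table
0.86497 + 0.17836 * 50 + 0.55008 * 70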

At the bottom of the output we see the residual standard error, which is used in determining model fit, is 11.35. The \(R^2\) value tells us that the independent variables together explain 55.41% of the variation in the outcome. The adjusted \(R^2\) provides a slightly more conservative estimate of the percentage of variance explained, 55.19%. Finally, the \(F\)-statistic tests the null hypothesis that the independent variables together do not help explain any variance in the outcome; we clearly reject this null hypothesis, \(p < 0.001\), as seen by p-value: < 2.2e-16. Note also that 29 observations were dropped because of missing values on at least one of the variables, leaving 406 complete cases for the model (403 residual degrees of freedom plus the three estimated parameters).
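
Should you need any of these quantities programmatically, they can be pulled straight from the model object, and confint() supplies 95% confidence intervals for the coefficients:

# 95% confidence intervals for the coefficient estimates
confint(linearmod)

# R-squared and adjusted R-squared as plain numbers
summary(linearmod)$r.squared
summary(linearmod)$adj.r.squared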

Still have questions? Contact us!