Simple Regression in R

This tutorial shows how to fit a simple regression model (that is, a linear regression with a single independent variable) using R. The details of the underlying calculations can be found in our simple regression tutorial. The data used in this post come from the study More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior by DiGrazia J, McKelvey K, Bollen J, Rojas F (2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and actual vote results. The replication data in R format (.rds) can be downloaded from our GitHub repo.

In this example, we will assess the relationship between the percentage of social media posts that mention a Congressional candidate and how well the candidates did in the next election. The variables of interest are:

  • vote_share (dependent variable): The percent of votes for a Republican candidate
  • mshare (independent variable): The percent of social media posts for a Republican candidate

Both variables are measured as percentages ranging from zero to 100.

We will begin by loading the packages we will use:

library(tidyverse)
library(knitr)
library(readr)

Next we will load the data. We use the drop_na function from tidyr to drop any cases with missing values on either variable, and the select function from dplyr to keep only the variables of interest.

twitter_data <- read_rds("data/twitter_data.rds") %>%
  drop_na(vote_share, mshare) %>%
  select(vote_share, mshare)
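To see what drop_na and select do in this pipeline, here is a minimal sketch on a small made-up data frame. The toy object, its values, and the extra district column are hypothetical and are not part of the replication data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (NOT the study's data): two rows have a missing value
toy <- data.frame(
  vote_share = c(55.2, NA, 61.0, 48.3),
  mshare     = c(40.0, 25.5, NA, 70.1),
  district   = c("A", "B", "C", "D")
)

toy_clean <- toy %>%
  drop_na(vote_share, mshare) %>%   # drop rows missing either variable
  select(vote_share, mshare)        # keep only the two columns of interest

dim(toy_clean)  # number of remaining rows and columns
```

Only the rows that are complete on both variables survive, and the district column is dropped.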

Data Visualization

It is always a good idea to begin any statistical modeling with a graphical assessment of the data. This allows you to quickly examine the distributions of the variables and check for possible outliers. The following code returns a histogram for the vote_share variable, our outcome of interest.

twitter_data %>%
  ggplot(aes(x=vote_share)) +
  geom_histogram(aes(y = ..density..),
                 color = 'black', 
                 fill = 'firebrick') +
  geom_line(stat="density") +
  labs(x ="Vote Share",  y = "Density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

R Histogram and Kernel Density Plot for Vote Share

We start off with a histogram from geom_histogram. ..density.. tells R to plot the density of each bin rather than the default, which is the count of observations in each bin; we make this change so that a kernel density curve can be superimposed on the same scale. (Recent versions of ggplot2 prefer the equivalent after_stat(density) notation.) The color and fill arguments set the outline and fill colors of the bars. Next, we add the kernel density line, which is a smoothed version of the histogram, using geom_line(stat="density"). Finally, we label the axes so that the figure displays informative labels rather than the variable names. The message printed after the code is a reminder that we accepted the default of 30 bins. The variable’s values (x-axis) fall within the range we expect. There is some negative skew in the distribution.
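If you would rather choose the bins yourself than accept the default of 30, one option is to set binwidth explicitly. The sketch below uses simulated percentages, not the study's data, and the 5-point bin width is an arbitrary choice for illustration. It also uses after_stat(density), which assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Simulated percentages for illustration only (NOT the study's data)
set.seed(1)
sim <- data.frame(vote_share = 100 * rbeta(400, 5, 4))

p <- ggplot(sim, aes(x = vote_share)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 5,      # 5-percentage-point bins (arbitrary choice)
                 boundary = 0,      # align bin edges at 0, 5, 10, ...
                 color = "black",
                 fill = "firebrick") +
  geom_line(stat = "density") +     # kernel density curve on the same scale
  labs(x = "Vote Share", y = "Density")

p
```

With binwidth supplied, the stat_bin() message about bins = 30 should no longer appear.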

We can do the same thing for our independent variable.

twitter_data %>%
  ggplot(aes(x=mshare)) +
  geom_histogram(aes(y = ..density..),
                 color = 'black', 
                 fill = 'firebrick') +
  geom_line(stat="density") +
  labs(x = "Tweet Share", y = "Density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

R Histogram and Kernel Density Plot for Tweet Share

We again see that the values fall into the range we expect. Note that there are also spikes at zero and 100. These indicate races where the Republican candidate received either none or all of the Tweets.

It is also helpful to look at the bivariate association between the two variables. This allows us to see whether there is visual evidence of a relationship, which will help us assess whether the regression results we ultimately get make sense given what we see in the data. The syntax to use is the following.

twitter_data %>%
  ggplot(aes(x=mshare, y = vote_share)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(x = "Tweet Share", y = "Vote Share")
## `geom_smooth()` using formula 'y ~ x'

R Scatterplot and Linear Fit

Here we are looking at a scatterplot of our observations, and we’ve also requested the best linear fit (i.e., the regression line) to better see the positive relationship. The method = 'lm' option fits a linear model. The default formula is y ~ x, that is, vote share regressed on Tweet share. By default, geom_smooth also draws a 95% confidence interval around the line, shown as the shaded area. There is a clear, positive association between these variables.
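The line and shaded band can also be computed by hand with lm and predict. The sketch below uses simulated data, not the study's data, and assumes the band is the pointwise 95% confidence interval for the fitted mean. The variable names x and y and the grid of x values are made up for illustration.

```r
# Simulated data for illustration only (NOT the study's data)
set.seed(42)
x <- runif(200, min = 0, max = 100)
y <- 35 + 0.3 * x + rnorm(200, sd = 15)
sim <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = sim)

# Fitted line and 95% confidence interval for the mean of y at chosen x values
grid <- data.frame(x = c(0, 25, 50, 75, 100))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

cbind(grid, band)  # columns: x, fit (line), lwr and upr (edges of the band)
```

The interval is narrowest near the mean of x and widens toward the extremes, which is why the shaded area in such plots flares at both ends.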

Running the Regression

The following syntax runs the regression.

linearmod <- lm(vote_share ~ mshare, data = twitter_data)

Calling summary on the fitted model returns:

summary(linearmod) 
## 
## Call:
## lm(formula = vote_share ~ mshare, data = twitter_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.278  -8.477   1.792   8.554  44.480 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.0424     1.3451   27.54   <2e-16 ***
## mshare        0.2685     0.0226   11.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.61 on 404 degrees of freedom
## Multiple R-squared:  0.2589, Adjusted R-squared:  0.2571 
## F-statistic: 141.2 on 1 and 404 DF,  p-value: < 2.2e-16

First we have the Call, which is the code used to generate the results.

Next are the residuals, that is, the differences between each observed value of the dependent variable and the value predicted by the model. The minimum, 1st quartile, median, 3rd quartile and maximum of the residuals are presented.
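As a quick check on this definition, the sketch below fits a model to simulated data (not the study's data) and confirms that the residuals are the observed values minus the fitted values. The names sim, x, and y are hypothetical.

```r
# Simulated data for illustration only (NOT the study's data)
set.seed(123)
sim <- data.frame(x = runif(100, 0, 100))
sim$y <- 40 + 0.25 * sim$x + rnorm(100, sd = 10)

fit <- lm(y ~ x, data = sim)

# A residual is the observed value minus the fitted (predicted) value
manual_resid <- sim$y - fitted(fit)
all.equal(unname(manual_resid), unname(resid(fit)))

# Five-number summary, the same quantities shown in the Residuals block
quantile(resid(fit))
```

Because the model includes an intercept, the residuals also sum to zero up to rounding error.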

Then we see the results of the regression model. The intercept in the simple regression equation is the vote share we expect when Tweet share equals zero. Here we see that the predicted value is 37.04, which coincides with what we saw above in the scatterplot. The estimated intercept is significantly different from zero, \(p < 0.001\), though this test is of less interest to us than the test of the regression line slope. The slope of the regression line is 0.269. This means that for each one-percentage-point increase in mshare, the predicted vote share increases by 0.269 percentage points. The standard error tells us how much sample-to-sample variability we should expect in the coefficient estimate. Dividing the coefficient by its standard error gives us the \(t\)-statistic used to calculate the \(p\)-value. Here we see that the mshare coefficient estimate is easily significant, \(p < 0.001\).
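The arithmetic behind these statements can be reproduced from the printed output. The sketch below copies the rounded estimates from the summary table, so results may differ slightly from R's full-precision values. The Tweet share of 50 used for the prediction is an arbitrary example value.

```r
# Values copied from the summary(linearmod) output (rounded as printed)
b0       <- 37.0424   # intercept estimate
b1       <- 0.2685    # slope estimate for mshare
se_b1    <- 0.0226    # standard error of the slope
resid_df <- 404       # residual degrees of freedom

# Predicted vote share when Tweet share is 50 (arbitrary example value)
pred_50 <- b0 + b1 * 50

# t-statistic and two-sided p-value for the slope
t_b1 <- b1 / se_b1
p_b1 <- 2 * pt(abs(t_b1), df = resid_df, lower.tail = FALSE)

# Approximate 95% confidence interval for the slope
ci_b1 <- b1 + c(-1, 1) * qt(0.975, df = resid_df) * se_b1

round(c(prediction = pred_50, t = t_b1), 2)
ci_b1
```

The ratio of the slope to its standard error comes out at about 11.88, matching the t value column, and the confidence interval for the slope lies entirely above zero.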

The residual standard error, which estimates the standard deviation of the residuals and is used in assessing model fit, is 14.61. The \(R^2\) value tells us that the independent variable explains 25.89% of the variation in the outcome. The adjusted \(R^2\) provides a slightly more conservative estimate of the percentage of variance explained, 25.71%. Finally, the \(F\)-statistic tests the null hypothesis that the independent variable does not help explain any variance in the outcome. We clearly reject the null hypothesis with \(p < 0.001\), as seen by p-value: < 2.2e-16.
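The adjusted \(R^2\) and the \(F\)-statistic can both be recovered from the \(R^2\) and the degrees of freedom. The sketch below uses the rounded values printed in the output, so the \(F\)-statistic it produces is close to, but not exactly, the reported 141.2. The sample size of 406 is inferred from the 404 residual degrees of freedom plus the two estimated coefficients.

```r
r2 <- 0.2589   # Multiple R-squared from the output (rounded)
n  <- 406      # observations: 404 residual df + 2 estimated coefficients
k  <- 1        # number of predictors

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Model F-statistic and its p-value on k and n - k - 1 degrees of freedom
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_f    <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

round(adj_r2, 4)
f_stat
p_f
```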

Fun Facts about Simple Regression

In a simple regression only (that is, when there is just a single independent variable), the \(R^2\) is exactly equal to the square of the Pearson correlation between the two variables. To see this, run:

cor(twitter_data$mshare, twitter_data$vote_share)
## [1] 0.5088673

The correlation between Tweet share and vote share is 0.5089. If we square the unrounded value, we get

\[ 0.5088673^2 \approx 0.2589, \]

which is the same as the \(R^2\) value from the regression.
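This identity is not specific to these data. As a minimal sketch, the following checks it on simulated data (not the study's data); the variable names and the simulated coefficients are arbitrary.

```r
# Simulated data for illustration only (NOT the study's data)
set.seed(2024)
x <- rnorm(150)
y <- 2 + 0.5 * x + rnorm(150)

fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared   # R-squared from the regression
r_pearson <- cor(x, y)                # Pearson correlation

all.equal(r_squared, r_pearson^2)
```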

Also, in simple regression only, the model \(F\)-test is the same as the test for the single independent variable. The square of a \(t\)-statistic with \(k\) degrees of freedom is equal to an \(F\)-statistic with 1 and \(k\) degrees of freedom. When there are no other predictors in the model, the square root of \(F\) will equal the absolute value of the \(t\) for our coefficient,

\[ \sqrt{141.17} \approx 11.88. \]
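The same relationship can be checked on any simple regression. The sketch below uses simulated data (not the study's data) with a deliberately weak slope, so that the \(p\)-values are not vanishingly small and the comparison is meaningful. All names and simulated values are hypothetical.

```r
# Simulated data for illustration only (NOT the study's data)
set.seed(99)
x <- runif(120, 0, 100)
y <- 30 + 0.1 * x + rnorm(120, sd = 12)

s <- summary(lm(y ~ x))
t_slope <- s$coefficients["x", "t value"]
f_model <- s$fstatistic[["value"]]

# The squared t-statistic for the slope equals the model F-statistic
all.equal(t_slope^2, f_model)

# The two tests also give the same p-value
p_t <- s$coefficients["x", "Pr(>|t|)"]
p_f <- pf(f_model, s$fstatistic[["numdf"]], s$fstatistic[["dendf"]],
          lower.tail = FALSE)
all.equal(p_t, p_f)
```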

For more detailed information on where these numbers come from, consult our simple regression tutorial.