This tutorial shows how to fit a multiple regression model (that is, a linear regression with more than one independent variable) using R. The details of the underlying calculations can be found in our multiple regression tutorial. The data used in this post come from the study *More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior* by DiGrazia J, McKelvey K, Bollen J, Rojas F (2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and actual vote results. The replication data in R format (.rds) can be downloaded from our GitHub repo.

The variables of interest are:

- `vote_share` (*dependent variable*): The percent of votes for a Republican candidate
- `mshare` (*independent variable*): The percent of social media posts for a Republican candidate
- `pct_white` (*independent variable*): The percent of white voters in a given Congressional district

All three variables are measured as percentages ranging from zero to 100.

We will begin by loading the packages we will use:

```
library(tidyverse)
library(knitr)
library(readr)
```

Next we will load the data. We use the `select` function from `dplyr` to keep only the variables of interest.

```
twitter_data <- read_rds("data/twitter_data.rds") %>%
  select(vote_share, mshare, pct_white)
```
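Before plotting, it can help to confirm that the variables loaded as expected and to see how many values are missing. The sketch below uses a small made-up data frame, `toy_data`, as a stand-in for `twitter_data` (the name and values are invented for illustration); the same calls should work on the real data.

```r
# Hypothetical stand-in for twitter_data; values are invented for illustration
toy_data <- data.frame(
  vote_share = c(51.2, 48.7, NA, 63.0, 35.4),
  mshare     = c(60.1, NA, 22.5, 100, 0),
  pct_white  = c(70.3, 81.9, 64.2, 90.5, 55.0)
)

# Range, quartiles, and NA count for each variable
summary(toy_data)

# Number of missing values per column
missing_per_column <- colSums(is.na(toy_data))
missing_per_column

# Number of rows with no missing values on any variable
n_complete <- sum(complete.cases(toy_data))
n_complete
```

By default, `lm` drops any row with a missing value on a model variable, so the count of complete rows is the number of observations a regression on these three variables would use.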

## Data Visualization

It is always a good idea to begin any statistical modeling with a graphical assessment of the data. This allows you to quickly examine the distributions of the variables and check for possible outliers. The following code returns a histogram for the `vote_share` variable, our outcome of interest.

```
twitter_data %>%
  ggplot(aes(x = vote_share)) +
  geom_histogram(aes(y = ..density..),
                 color = 'black',
                 fill = 'firebrick') +
  geom_line(stat = "density") +
  labs(x = "Vote Share", y = "Density")
```

``## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.``

We start off with a histogram from `geom_histogram`. `..density..` tells R to plot the density of each bin; the default of `geom_histogram` is to plot the count of observations in each bin, but we changed this so that a density curve can be superimposed on top. The color and fill are specified to make the graph nicer. Next, we add the kernel density line, which is a smoothed version of the histogram, using `geom_line(stat="density")`. Finally, we label our axes so that the figure displays informative labels rather than the variable names. The variable’s values (x-axis) fall within the range we expect. There is some negative skew in the distribution.
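To make the density scaling concrete, the sketch below uses base R's `hist()` on simulated values (not the study data) to show that a bin's density is its count divided by the total number of observations times the bin width, so the bar areas sum to one. That shared scale is what lets a density curve be drawn over the bars. The bin width of 5 is an arbitrary choice for the illustration.

```r
# Simulated percentages between 0 and 100 (not the study data)
set.seed(1)
x <- runif(200, min = 0, max = 100)

# Bin the values into intervals of width 5 without drawing a plot
h <- hist(x, breaks = seq(0, 100, by = 5), plot = FALSE)
bin_widths <- diff(h$breaks)

# Density of a bin = count / (total count * bin width)
manual_density <- h$counts / (sum(h$counts) * bin_widths)

# The hand-computed densities match what hist() reports
all.equal(manual_density, h$density)

# Bar areas (density * width) add up to 1
total_area <- sum(h$density * bin_widths)
total_area
```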

We can do the same thing for our tweet share and percent white variables.

```
twitter_data %>%
  ggplot(aes(x = mshare)) +
  geom_histogram(aes(y = ..density..),
                 color = 'black',
                 fill = 'firebrick') +
  geom_line(stat = "density") +
  labs(x = "Tweet Share", y = "Density")
```

``## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.``

We again see that the values fall into the range we expect. Note that there are also spikes at zero and 100. These indicate races where a single candidate received either all of the share of Tweets or none of the share of Tweets.

The following figure shows the distribution of the percentage white variable.

```
twitter_data %>%
  ggplot(aes(x = pct_white)) +
  geom_histogram(aes(y = ..density..),
                 color = 'black',
                 fill = 'firebrick') +
  geom_line(stat = "density") +
  labs(x = "Percent White", y = "Density")
```

``## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.``

Again, the values fall in the range we’d expect. There is a negative skew in the distribution.
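Because the three histograms differ only in the variable and the axis label, the repeated code could be wrapped in a small helper. In the sketch below, `plot_density_hist()` is a hypothetical name introduced here, and the `binwidth = 5` default is an arbitrary choice that also avoids the `stat_bin()` message. `after_stat(density)` is the newer ggplot2 spelling of `..density..` and assumes ggplot2 3.3.0 or later. The demo data are simulated.

```r
library(ggplot2)

# Hypothetical helper: density-scaled histogram with a kernel density line
plot_density_hist <- function(data, var, label, binwidth = 5) {
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(aes(y = after_stat(density)),
                   binwidth = binwidth,
                   color = "black",
                   fill = "firebrick") +
    geom_line(stat = "density") +
    labs(x = label, y = "Density")
}

# Simulated stand-in for twitter_data (not the study data)
set.seed(42)
toy_data <- data.frame(
  vote_share = runif(100, 0, 100),
  mshare     = runif(100, 0, 100),
  pct_white  = runif(100, 0, 100)
)

# One call per variable replaces the three copies of the plotting code
p_vote  <- plot_density_hist(toy_data, "vote_share", "Vote Share")
p_tweet <- plot_density_hist(toy_data, "mshare", "Tweet Share")
p_white <- plot_density_hist(toy_data, "pct_white", "Percent White")
```

Printing any of the returned objects (for example `p_vote`) draws the figure.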

It is also helpful to look at the bivariate association between the variables. This allows us to see whether there is visual evidence of a relationship, which will help us assess whether the regression results we ultimately get make sense given what we see in the data. The syntax to use is the following.

```
twitter_data %>%
  ggplot(aes(x = mshare, y = vote_share)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(x = "Tweet Share", y = "Vote Share")
```

``## `geom_smooth()` using formula 'y ~ x'``

```
twitter_data %>%
  ggplot(aes(x = pct_white, y = vote_share)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(x = "Percent White", y = "Vote Share")
```

``## `geom_smooth()` using formula 'y ~ x'``

Here we are looking at scatterplots of our observations, and we’ve also requested the best linear fit (i.e. the regression line) to better see the positive relationship. The `method = 'lm'` option fits a linear model. The default formula is `y ~ x`. The default is for `geom_smooth` to include a 95% confidence interval, which is the shaded area. There is a clear, positive association between these variables.
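A numeric companion to the scatterplots is the pairwise correlation. The sketch below uses simulated data (the variable names mirror the tutorial, but the values and the data-generating coefficients are invented). It also checks that the slope of a simple `lm()` fit, the kind of line `geom_smooth(method = 'lm')` draws with the default `y ~ x` formula, equals the correlation rescaled by the ratio of standard deviations.

```r
# Simulated data; coefficients below are invented for illustration
set.seed(123)
n <- 200
sim <- data.frame(
  mshare    = runif(n, 0, 100),
  pct_white = runif(n, 20, 100)
)
sim$vote_share <- 5 + 0.2 * sim$mshare + 0.5 * sim$pct_white + rnorm(n, sd = 10)

# Pairwise Pearson correlations; "complete.obs" drops rows with any NA
cor_matrix <- cor(sim, use = "complete.obs")
round(cor_matrix, 2)

# Simple regression of vote share on Tweet share
simple_fit <- lm(vote_share ~ mshare, data = sim)
slope_from_lm <- unname(coef(simple_fit)["mshare"])

# Same slope recovered from the correlation: r * sd(y) / sd(x)
slope_from_cor <- cor(sim$mshare, sim$vote_share) *
  sd(sim$vote_share) / sd(sim$mshare)
```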

## Running the Regression

The following syntax runs the regression.

`linearmod <- lm(vote_share ~ mshare + pct_white, data = twitter_data)`

We use the `lm` function to run the regression. We define the formula as `y ~ x1 + x2`, and then we specify the dataset that the variables are in.

We then view the results with `summary(linearmod)`, which returns:

```
##
## Call:
## lm(formula = vote_share ~ mshare + pct_white, data = twitter_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -35.813  -7.745  -0.135   7.573  31.502
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.86497    2.44885   0.353    0.724
## mshare       0.17836    0.01840   9.693   <2e-16 ***
## pct_white    0.55008    0.03368  16.334   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.35 on 403 degrees of freedom
##   (29 observations deleted due to missingness)
## Multiple R-squared:  0.5541, Adjusted R-squared:  0.5519
## F-statistic: 250.4 on 2 and 403 DF,  p-value: < 2.2e-16
```

First we have the `Call`, which is the code used to generate the results.

Next are the residuals, that is, the differences between each observed value of the dependent variable and the value predicted by the model. The minimum, 1st quartile, median, 3rd quartile and maximum are presented.
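As a sketch of what the residual summary reports, the code below fits a model to simulated data (invented values, same variable names as the tutorial) and confirms that each residual is the observed value minus the fitted value. The five numbers in the `Residuals` block are then the quantiles of those residuals.

```r
# Simulated data; the generating coefficients are invented for illustration
set.seed(2024)
n <- 150
sim <- data.frame(
  mshare    = runif(n, 0, 100),
  pct_white = runif(n, 20, 100)
)
sim$vote_share <- 1 + 0.18 * sim$mshare + 0.55 * sim$pct_white + rnorm(n, sd = 11)

fit <- lm(vote_share ~ mshare + pct_white, data = sim)

# Residual = observed outcome minus model prediction
manual_resid <- sim$vote_share - fitted(fit)

# Min, 1Q, Median, 3Q, Max of the residuals
resid_summary <- quantile(resid(fit))
round(resid_summary, 3)
```

Because the model includes an intercept, the residuals average to zero, which is why the median in output like this usually sits close to zero as well.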

We then see the results of the regression model. The intercept in the multiple regression equation is the vote share we expect when Tweet share and percent white both equal zero. Here we see that the predicted value is 0.865. The estimated intercept is not significantly different from zero, \(p = 0.724\), though this test is of less interest to us than the tests of the independent variable estimates. The estimate for `mshare` is 0.178. This means that for each one-percentage-point increase in `mshare`, the predicted vote share increases by 0.178 percentage points, holding percent white constant. The estimate for `pct_white` is 0.55. This means that for each one-percentage-point increase in `pct_white`, the predicted vote share increases by 0.55 percentage points, holding Tweet share constant. The standard error tells us how much sample-to-sample variability we should expect in the coefficient estimates. Dividing the coefficient by the standard error gives us the \(t\)-statistic used to calculate the \(p\)-value. Here we see that the `mshare` and `pct_white` coefficient estimates are both statistically significant, \(p < 0.001\).
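These relationships can be checked by hand from the printed output. The sketch below copies the rounded estimates and standard errors from the coefficient table, so the recomputed \(t\)-statistics match the printed ones only up to rounding. The district used for the prediction (50% Tweet share, 80% white) is a hypothetical example, not a case from the data.

```r
# Rounded values copied from the coefficient table in the output
estimates  <- c(intercept = 0.86497, mshare = 0.17836, pct_white = 0.55008)
std_errors <- c(intercept = 2.44885, mshare = 0.01840, pct_white = 0.03368)
df_resid   <- 403  # residual degrees of freedom from the output

# t-statistic: estimate divided by its standard error
t_values <- estimates / std_errors

# Two-sided p-value from the t distribution
p_values <- 2 * pt(-abs(t_values), df = df_resid)

round(t_values, 2)
signif(p_values, 3)

# Predicted vote share for a hypothetical district:
# 50% Tweet share and 80% white
predicted <- estimates[["intercept"]] +
  estimates[["mshare"]] * 50 +
  estimates[["pct_white"]] * 80
predicted
```

With these inputs the model predicts a Republican vote share of about 53.8%.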

At the bottom of the output we see the residual standard error, which is used in determining model fit, is 11.35. The note beneath it tells us that 29 observations were dropped because they had missing values on at least one of the model variables. The `R-squared` value tells us that the independent variables together explain 55.41% of the variation in the outcome. The adjusted \(R^2\) provides a slightly more conservative estimate of the percentage of variance explained, 55.19%.
Finally, the \(F\)-statistic tests the null hypothesis that the independent variables together do not help explain any variance in the outcome. We clearly reject the null hypothesis with \(p < 0.001\), as seen by `p-value: < 2.2e-16`.
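The fit statistics are also linked by simple formulas. As a check, the sketch below recomputes the adjusted \(R^2\) and the \(F\)-statistic from the printed \(R^2\) and degrees of freedom. Because the inputs are rounded, the results agree with the output only to the digits shown.

```r
r_squared <- 0.5541  # Multiple R-squared from the output
df_model  <- 2       # number of independent variables
df_resid  <- 403     # residual degrees of freedom
n_obs     <- df_resid + df_model + 1  # observations used in the fit

# Adjusted R-squared penalizes for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Overall F-statistic expressed in terms of R-squared
f_stat <- (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Upper-tail p-value for the F-test
f_p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

round(adj_r_squared, 4)
round(f_stat, 1)
```

The 406 observations implied by the degrees of freedom are the cases that remained after the 29 incomplete ones were dropped.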
