# Multiple Regression in R

This tutorial shows how to fit a multiple regression model (that is, a linear regression with more than one independent variable) using R. The details of the underlying calculations can be found in our multiple regression tutorial. The data used in this post come from the study "More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior" by DiGrazia J, McKelvey K, Bollen J, Rojas F (2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and the actual vote results. The replication data in R format (.rds) can be downloaded from our GitHub repo.

The variables of interest are:

• vote_share (dependent variable): The percent of votes for a Republican candidate
• mshare (independent variable): The percent of social media posts for a Republican candidate
• pct_white (independent variable): The percent of white voters in a given Congressional district

All three variables are measured as percentages ranging from zero to 100.
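With these variables, the model fitted in this tutorial can be written out explicitly. The equation below is a sketch that assumes the standard linear specification with an additive error term; the coefficient symbols are our notation, not taken from the original study.

$$
\text{vote\_share}_i = \beta_0 + \beta_1\,\text{mshare}_i + \beta_2\,\text{pct\_white}_i + \varepsilon_i
$$

Here $i$ indexes the observations, $\beta_0$ is the intercept, $\beta_1$ and $\beta_2$ are the slopes for the two independent variables, and $\varepsilon_i$ is the error term.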

We will begin by loading the packages we will use:

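```r
library(tidyverse)  # attaches dplyr, ggplot2, readr and the %>% pipe
library(knitr)
library(readr)      # already attached by tidyverse; kept to make read_rds() explicit
```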

Next we will load the data. We use the select function from dplyr to keep only the variables of interest.

```r
# Read the replication data and keep only the three variables of interest
twitter_data <- read_rds("data/twitter_data.rds") %>%
  select(vote_share, mshare, pct_white)
```
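Before plotting, it can help to confirm that the loaded variables really do lie between zero and 100 and to see how many values are missing. The following is a minimal sketch in base R; `check_pct_range()` is a hypothetical helper written for this post, not a function from any package.

```r
# Hypothetical helper: for each percentage variable, report the number of
# missing values, the observed minimum and maximum, and whether every
# non-missing value falls inside [0, 100].
check_pct_range <- function(data,
                            vars = c("vote_share", "mshare", "pct_white")) {
  rows <- lapply(vars, function(v) {
    x <- data[[v]]
    data.frame(
      variable  = v,
      n_missing = sum(is.na(x)),
      min       = min(x, na.rm = TRUE),
      max       = max(x, na.rm = TRUE),
      in_range  = all(x >= 0 & x <= 100, na.rm = TRUE)
    )
  })
  do.call(rbind, rows)  # one row per variable
}
```

Calling `check_pct_range(twitter_data)` returns one row per variable; a `FALSE` in the `in_range` column would point to a coding problem worth investigating before fitting the model.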

## Data Visualization

It is always a good idea to begin any statistical modeling with a graphical assessment of the data. This allows you to quickly examine the distributions of the variables and check for possible outliers. The following code returns a histogram for the vote_share variable, our outcome of interest.

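```r
twitter_data %>%
  ggplot(aes(x = vote_share)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)),
                 color = "black",
                 fill = "firebrick") +
  geom_line(stat = "density") +  # overlay a kernel density estimate
  labs(x = "Vote Share", y = "Density")
```

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```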
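A histogram gives a visual impression of possible outliers; a numeric check can complement it. The sketch below assumes Tukey's 1.5 × IQR rule, which is a common convention rather than anything prescribed by the original study, and `flag_outliers()` is a hypothetical helper name.

```r
# Hypothetical helper: flag values lying more than k * IQR below the first
# quartile or above the third quartile (Tukey's rule, k = 1.5 by default).
flag_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr  # NA stays NA
}
```

For example, `sum(flag_outliers(twitter_data$vote_share), na.rm = TRUE)` counts the flagged observations. A flagged value is only a candidate for closer inspection, not automatically an error.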