How to Estimate a Simple Linear Regression Model

Jeremy Albright


Regression is a basic method for predicting values of some dependent variable $$(Y)$$ as a function of one or more independent variables $$(X_i)$$. Simple regression describes the case when there is only one predictor, whereas multiple regression has multiple predictors. This tutorial will focus solely on simple regression.

The data used in this tutorial are from the article More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior by DiGrazia, McKelvey, Bollen, and Rojas (2013). This study examined the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and actual vote results. The authors have helpfully provided replication materials. The results presented here are for pedagogical purposes only.

The variables used in this tutorial are the following:

• vote_share (dependent variable): The percent of votes for a Republican candidate
• mshare (independent variable): The percent of social media posts for a Republican candidate

There are 406 observations in the data. Take a look at the first six rows:

Vote Share Tweet Share
51.09377 26.26263
59.48065 0.00000
57.93567 65.07937
27.51530 33.33333
69.32448 79.41176
53.20280 40.32258

Correlation

Calculation of Pearson’s $$r$$

There is a close relationship between simple regression and correlation. In fact, the $$p$$-value for a correlation (Pearson's $$r$$) will be the same as the $$p$$-value for the slope estimate in a simple regression. We will estimate both.

All of the observations can be visualized with a scatterplot.
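To make the calculation concrete, here is a sketch of computing Pearson's $$r$$ by hand. It uses only the six example rows shown above, so the resulting value will not match the full-sample $$r$$ reported below; with all 406 observations the same code would reproduce it.

```python
# Pearson's r computed from sums of deviations, using only the six
# example rows shown above (illustrative; the full data set has 406 rows).
import math

vote_share = [51.09377, 59.48065, 57.93567, 27.51530, 69.32448, 53.20280]
tweet_share = [26.26263, 0.00000, 65.07937, 33.33333, 79.41176, 40.32258]

n = len(vote_share)
mean_x = sum(tweet_share) / n
mean_y = sum(vote_share) / n

# r = covariance / (sd_x * sd_y), written in terms of sums of deviations
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tweet_share, vote_share))
sxx = sum((x - mean_x) ** 2 for x in tweet_share)
syy = sum((y - mean_y) ** 2 for y in vote_share)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))
```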

Significance Tests for Correlation

We want to determine whether the $$r$$ value we found is actually significant based on a $$t$$-statistic, the calculation of which requires first determining the standard error for the correlation. When the null hypothesis is $$r=0$$, we can use the following formula:

$SE_r=\sqrt{\frac{1-r^2}{n-2}}$

The formula for the $$t$$-statistic is the following

$t = \frac{r}{SE_r} = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}$

Plugging in our values, the standard error is:

$SE_r = \sqrt{\frac{1-0.509^2}{406-2}}=0.0428$

The $$t$$-statistic is thus:

$t = \frac{0.509}{.0428} = 11.89$

We can compare that to a $$t$$-distribution with $$df=404$$ and get a $$p$$ value of $$\lt .001$$. We therefore reject the null hypothesis that $$r=0$$.
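The arithmetic above can be verified with a few lines of Python, plugging in $$r = 0.509$$ and $$n = 406$$ from the tutorial:

```python
# Verify the standard error and t-statistic computed above,
# using r = 0.509 and n = 406 from the text.
import math

r, n = 0.509, 406

se_r = math.sqrt((1 - r**2) / (n - 2))  # SE_r = sqrt((1 - r^2) / (n - 2))
t = r / se_r                            # t-statistic with df = n - 2

print(round(se_r, 4))  # 0.0428
print(round(t, 2))     # 11.89
```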

Correlations give us a sense of the magnitude of a relationship on a scale that always ranges from -1 to +1. However, a correlation does not help much if we want to predict $$y$$ from a value of $$x$$. For that, we need to use regression.

Simple Regression Model

We can make a prediction for the value of $$y$$, which we denote $$\hat{y}$$ (pronounced “y-hat”), on the basis of the following regression equation:

$\hat{y} = a + b x$

Here $$b$$ is the estimate of how much $$y$$ changes for a one-unit increase in $$x$$. The intercept $$a$$ is the estimated value of $$y$$ when $$x=0$$. Note that this is often not a meaningful quantity. For example, if height were the independent variable, one would not expect to encounter someone of height zero. $$a$$ is only interpreted when zero is a meaningful value for $$x$$.

How can we get the right values to use for $$a$$ and $$b$$ in the model? The method of least squares is used to calculate the regression equation. It chooses the values of $$a$$ and $$b$$ that minimize the sum of the squared errors, where each error (also called a residual) is the vertical distance from an observation to the regression line. The following figure illustrates the definition of the error.
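The least-squares estimates have simple closed forms: $$b = \sum(x_i - \bar{x})(y_i - \bar{y}) / \sum(x_i - \bar{x})^2$$ and $$a = \bar{y} - b\bar{x}$$. Here is a sketch applying them to the six example rows shown earlier (illustrative only; the full analysis would use all 406 observations):

```python
# Least-squares estimates of a and b computed by hand on the six
# example rows shown earlier in the tutorial.
x = [26.26263, 0.00000, 65.07937, 33.33333, 79.41176, 40.32258]  # tweet share
y = [51.09377, 59.48065, 57.93567, 27.51530, 69.32448, 53.20280]  # vote share

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Slope: b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x  # the fitted line always passes through (x_bar, y_bar)

y_hat = [a + b * xi for xi in x]                      # predicted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]     # observed minus predicted
print(round(a, 3), round(b, 3))
```

A useful check: with least-squares estimates, the residuals always sum to (numerically) zero.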