Simple Regression

Jeremy Albright

Posted on
regression correlation r-squared

Regression is a basic method for predicting values of some dependent variable \((Y)\) as a function of one or more independent variables \((X_i)\). Simple regression describes the case when there is only one predictor, whereas multiple regression has multiple predictors. This tutorial will focus solely on simple regression.

The data used in this tutorial are from the article More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior by DiGrazia, McKelvey, Bollen, and Rojas (2013). This study examined the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and actual vote results. The authors have helpfully provided replication materials. The results presented here are for pedagogical purposes only.

The variables used in this tutorial are the following:

  • vote_share (dependent variable): The percent of votes for a Republican candidate
  • mshare (independent variable): The percent of social media posts for a Republican candidate

There are 406 observations in the data. Take a look at the first six rows:

Vote Share Tweet Share
51.09377 26.26263
59.48065 0.00000
57.93567 65.07937
27.51530 33.33333
69.32448 79.41176
53.20280 40.32258
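
If you want to follow along in R, the data can be read in and inspected as sketched below. The file name mtmv_data_10_12.csv and the data frame name tweets are placeholders for wherever you have saved the replication materials; the columns vote_share and mshare are the variables described above.

    # Read the replication data (file name is a placeholder) and peek at it
    tweets <- read.csv("mtmv_data_10_12.csv")

    nrow(tweets)                                # 406 observations
    head(tweets[, c("vote_share", "mshare")])   # first six rows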

Correlation

Calculation of Pearson’s \(r\)

There is a close relationship between simple regression and correlation. In fact, the \(p\)-value for a correlation (Pearson’s \(r\)) will be the same as the \(p\)-value for the simple regression slope estimate. We will estimate both.

All of the observations can be visualized with a scatterplot.

Scatterplot of Vote to Tweet Share

A correlation is represented by a value, here denoted \(r\), that ranges between \(-1\) and \(1\) (\(-1 \leq r \leq 1\)). A value close to zero represents a weak or non-existent relationship, while values close to \(\pm 1\) represent a strong relationship. Negative values represent an inverse relationship (\(y\) gets larger as \(x\) gets smaller), and positive values a positive relationship (\(y\) gets larger as \(x\) gets larger). In the above figure, there is a tendency for vote share to increase as the share of tweets received by a candidate increases. How strong is the relationship?

The formula to find \(r\) is given by:

\[ r = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^n(y_i-\bar{y})^2}} \]

To illustrate how this is calculated, we will perform the steps that go into the formula and display them for the first six observations. The independent variable is Tweet share and will be referenced with \(x_i\), where the subscript means the value for the \(i\)th observation. The dependent variable is vote share and will be denoted by \(y_i\). The notation \(\bar{x}\), read “x-bar”, refers to the mean value of \(x\) across all observations (including those not in the table). \(\bar{y}\) is similarly interpreted for the dependent variable.

\(x\) \(y\) \(\bar{x}\) \(\bar{y}\) \((x_i-\bar{x})\) \((y_i-\bar{y})\) \((x_i-\bar{x})^2\) \((y_i-\bar{y})^2\) \((x_i-\bar{x})(y_i-\bar{y})\)
26.26 51.09 50.12 50.5 -23.86 0.59 569.22 0.35 -14.13
0.00 59.48 50.12 50.5 -50.12 8.98 2512.10 80.62 -450.04
65.08 57.94 50.12 50.5 14.96 7.43 223.76 55.27 111.20
33.33 27.52 50.12 50.5 -16.79 -22.99 281.82 528.37 385.88
79.41 69.32 50.12 50.5 29.29 18.82 857.96 354.30 551.34
40.32 53.20 50.12 50.5 -9.80 2.70 96.01 7.30 -26.47

The formula for \(r\) requires the sums of several of these quantities, which means adding up all of the values in the respective columns. Doing so for the entire data set, we would find:

\(\sum(x_i-\bar{x})(y_i-\bar{y})\) \(\sum(x_i-\bar{x})^2\) \(\sum(y_i-\bar{y})^2\)
112,263.4 418,059.5 116,420.5

\[ r = \frac{112,263.4}{\sqrt{418,059.5*116,420.5}} = 0.509 \]
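
The same arithmetic can be sketched in R, continuing with the (placeholder) tweets data frame from above. The deviation sums can then be checked against the built-in cor() function.

    # Deviation sums that enter the formula for r
    x <- tweets$mshare        # Tweet share (independent variable)
    y <- tweets$vote_share    # vote share (dependent variable)

    sxy <- sum((x - mean(x)) * (y - mean(y)))   # ~112,263.4
    sxx <- sum((x - mean(x))^2)                 # ~418,059.5
    syy <- sum((y - mean(y))^2)                 # ~116,420.5

    sxy / sqrt(sxx * syy)                       # ~0.509

    cor(x, y)                                   # same value from base R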

That is, the correlation between Tweet share and vote share is .509, which is a fairly strong association. Can we reject the null hypothesis that the true value is zero (that is, that there is no association)?

Significance Tests for Correlation

We want to determine whether the \(r\) value we found is statistically significant. This is done with a \(t\)-statistic, the calculation of which first requires the standard error of the correlation. When the null hypothesis is \(r=0\), we can use the following formula:

\[ SE_r=\sqrt{\frac{1-r^2}{n-2}} \]

The formula for the \(t\)-statistic is the following

\[ t = \frac{r}{SE_r} = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} \]

Plugging in our values, the standard error is:

\[ SE_r = \sqrt{\frac{1-0.509^2}{406-2}}=0.0428 \]

The \(t\)-statistic is thus:

\[ t = \frac{0.509}{.0428} = 11.89 \]

We can compare that to a \(t\)-distribution with \(df=404\) and get a \(p\)-value of \(\lt .001\). We therefore reject the null hypothesis that \(r=0\).
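
Continuing the R sketch, the standard error, \(t\)-statistic, and \(p\)-value can be computed directly from the formulas; cor.test() reports the same test.

    # t-test for the correlation under the null hypothesis r = 0
    n    <- length(x)                   # 406
    r    <- cor(x, y)                   # ~0.509
    se_r <- sqrt((1 - r^2) / (n - 2))   # ~0.0428
    t_r  <- r / se_r                    # ~11.89
    2 * pt(-abs(t_r), df = n - 2)       # two-sided p-value, < .001

    cor.test(x, y)                      # same t and p-value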

Correlations give us a sense of the magnitude of a relationship on a scale that always ranges from -1 to +1. However, a correlation does not help much if we want to make a prediction for \(y\) on the basis of a value of \(x\). For that, we need regression.

Simple Regression Model

We can make a prediction for the value of \(y\), which we denote \(\hat{y}\) (pronounced “y-hat”), on the basis of the following regression equation:

\[ \hat{y} = a + b x \]

Here \(b\) represents the estimate of how much \(y\) changes for a one-unit increase in \(x\). \(a\), the intercept, is the estimated value of \(y\) if \(x=0\). Note that this is often not a relevant value. For example, if height were the independent variable, one would not expect to encounter someone of height zero. \(a\) is only interpreted when zero is a meaningful value for \(x\).

How can we get the right values to use for \(a\) and \(b\) in the model? The method of least squares is used to estimate the regression equation. It uses the error of each point (the vertical distance from the observation to the regression line) to find the values of \(a\) and \(b\) that minimize the total squared error. The following figure illustrates the definition of the error (also called the residual).

Definition of Error, or Residuals, in Simple Regression

The regression line minimizes the total squared error between the observed values \(y_i\) and the line. A little bit of calculus (elided here) leads to the following formulas for the estimates of \(a\) and \(b\) that minimize the sum of squared errors:

\[ \begin{align} b &= \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{\sum(x_i - \bar{x})^2} \\ a &=\bar{y} - b \bar{x} \end{align} \]

We will calculate this manually using the following variables, which we previously created:

  • \((x_i-\bar{x})^2\)
  • \((x_i-\bar{x})(y_i-\bar{y})\)

Recall that \(\sum(x_i-\bar{x})^2 = 418,059.5\) and \(\sum(x_i-\bar{x})(y_i-\bar{y})=112,263.4\). Then,

\[ b = \frac{112,263.4}{418,059.5}=0.269. \]

Also recall that \(\bar{y}=50.5\) and \(\bar{x} = 50.12\), so \(a = 50.5 - 0.269*50.12 = 37.02\).
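
Continuing the R sketch, the least-squares formulas can be applied to the sums computed earlier, and the estimates checked against lm(), R's built-in least-squares routine.

    # Slope and intercept from the least-squares formulas
    b <- sxy / sxx                # ~0.269
    a <- mean(y) - b * mean(x)    # ~37.02

    # Same estimates from lm()
    fit <- lm(vote_share ~ mshare, data = tweets)
    coef(fit)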

The regression equation for our data is \(\hat{y} = 37.02 + 0.269x\). That is, the predicted vote share for a candidate with a Tweet share of zero is 37.02. For each 1 percentage point increase in Tweet share, the expected vote share increases by .269 percentage points. To visualize, the regression line is added to the scatterplot below:


Scatter plot with regression line

Can we reject the null hypothesis that the true slope is zero? As was the case with the correlation, we will use a \(t\)-test. The formula to calculate \(t\) for \(b\), given a null hypothesis that \(b = 0\), is

\[ t = \frac{b - 0}{SE_b} = \frac{b}{SE_b} \]

We divide \(b\) by its standard error to get the \(t\)-statistic, so we need the formula for \(SE_b\). It turns out to be:

\[ SE_b = \frac{SE_R}{\sqrt{\sum(x_i-\bar{x})^2}} \]

The denominator is simply the sum of squared deviations from the mean for the independent variable. The numerator is another standard error, the residual standard error. The residual is often used interchangeably with the error, shown in the prior figure to be the vertical distance from the observation to the regression line. Letting \(e\) denote this error, \(SE_R\) is calculated as:

\[ SE_R = \sqrt{\frac{\sum e_i^2}{n-2}} \]

where \(\sum e_i^2 = \sum(y_i - \hat{y})^2\). Note that \(\hat{y}\) is the value predicted by the regression equation. To illustrate, the following table presents these values for the first six observations:

  • predy: \(\hat{y}\)
  • e: \(y-\hat{y}\)
  • e2: \((y-\hat{y})^2\)

    \(y\) \(\hat{y}\) \((y-\hat{y})\) \((y-\hat{y})^2\)
    51.094 44.094 7.000 48.998
    59.481 37.042 22.438 503.475
    57.936 54.516 3.419 11.693
    27.515 45.992 -18.477 341.403
    69.324 58.364 10.960 120.122
    53.203 47.869 5.334 28.449

Then we can sum all of the e2 values across all 406 observations to find:

\(\sum e^2\)
86,273.9

Consequently,

\[ SE_R = \sqrt{\frac{86,273.9}{404}} = 14.61 \]
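
In R, continuing the sketch above, the residuals from the fitted model give the same residual standard error, which summary() reports directly.

    # Residual standard error: sqrt of the summed squared residuals over n - 2
    e    <- resid(fit)                 # y - y-hat
    se_R <- sqrt(sum(e^2) / (n - 2))   # ~14.61

    summary(fit)$sigma                 # same value, labeled "Residual standard error"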

Now return to the formula for \(t\):

\[ t = \frac{b}{SE_b} \]

where \(SE_b\) is the standard error of \(b\):

\[ SE_b = \frac{SE_R}{\sqrt{\sum(x_i-\bar{x})^2}} \]

Recall from the calculations above that \(\sum(x_i-\bar{x})^2 = 418,059.5\). Then,

\[ SE_b = \frac{14.61}{\sqrt{418,059.5}} = 0.0226 \]

Thus,

\[ t = \frac{0.269}{0.0226} = 11.88 \]

If we compare this to a \(t\)-distribution with \(n-2\) degrees of freedom we get a \(p\)-value of \(\lt .001\). Thus, we can conclude that the slope estimate is statistically significant.
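
These calculations can again be sketched in R; the coefficient table from summary() contains the same standard error, \(t\)-statistic, and \(p\)-value for the slope.

    # Standard error, t-statistic, and p-value for the slope
    se_b <- se_R / sqrt(sxx)        # ~0.0226
    t_b  <- b / se_b                # ~11.88
    2 * pt(-abs(t_b), df = n - 2)   # < .001

    summary(fit)$coefficients       # matching row for mshare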

Compare the \(t\)-statistic to the test statistic for the correlation. They are the same. This will be the case so long as the null hypothesis for both \(r\) and \(b\) is that each equals zero. There is another connection between Pearson’s \(r\) and simple regression, as illustrated in the next section.

\(F\)-test, \(R^2\), and adjusted \(R^2\)

Working with sums of squares will be familiar to anybody who has experience with analysis of variance (ANOVA). Just as with ANOVA, it is possible to partition the variability in a regression model into different sums of squares.

  • The residual sum of squares is \(\sum(y_i - \hat{y})^2\), abbreviated as RSS.
  • The regression sum of squares is \(\sum(\hat{y}_i - \bar{y})^2\), abbreviated as RegSS.
  • The total sum of squares is \(\sum(y_i- \bar{y})^2\), abbreviated as TSS.

The residual sum of squares refers to the amount of variability in the observed values around the predicted values, while the regression sum of squares refers to how much variability the predicted values show around the overall mean. The total sum of squares is the amount of variability in the observed values around the overall mean. Note that TSS = RegSS + RSS.

The following table illustrates the calculations for the first six observations.

\(x\) \(y\) \(\bar{y}\) \((y_i-\bar{y})\) \((y_i-\bar{y})^2\) \(\hat{y}\) \((\hat{y}-\bar{y})\) \((\hat{y}-\bar{y})^2\) \((y_i-\hat{y})\) \((y_i-\hat{y})^2\)
26.26263 51.09377 50.50157 0.5921987 0.3506993 44.09392 6.999856 48.99799 -6.407658 41.058076
0.00000 59.48065 50.50157 8.9790780 80.6238413 37.04240 22.438251 503.47509 -13.459173 181.149330
65.07937 57.93567 50.50157 7.4340970 55.2657986 54.51621 3.419460 11.69271 4.014637 16.117309
33.33333 27.51530 50.50157 -22.9862747 528.3688229 45.99240 -18.477102 341.40330 -4.509173 20.332639
79.41176 69.32448 50.50157 18.8229065 354.3018094 58.36446 10.960020 120.12205 7.862886 61.824978
40.32258 53.20280 50.50157 2.7012247 7.2966151 47.86901 5.333785 28.44926 -2.632560 6.930371

Note that the columns in the table are:

  • \(\bar{y}\): Mean of Y
  • \(\hat{y}\): Predicted value of Y
  • \(\hat{y}-\bar{y}\): Predicted minus mean of Y
  • \((\hat{y}-\bar{y})^2\): (Predicted minus mean of Y)\(^2\)
  • \(y_i - \bar{y}\): Actual value minus mean of Y
  • \((y_i - \bar{y})^2\): (Actual value minus mean of Y)\(^2\)
  • \(y_i-\hat{y}\): Actual value minus predicted value
  • \((y_i-\hat{y})^2\): (Actual value minus predicted value)\(^2\)

The sums of squares are found by taking the sum across all observations for the corresponding column. That is:

  • The residual sum of squares is equal to the sum of the \((y_i-\hat{y})^2\) column.
  • The regression sum of squares is equal to the sum of the \((\hat{y}-\bar{y})^2\) column.
  • The total sum of squares is equal to the sum of the \((y_i - \bar{y})^2\) column.

    TSS RSS RegSS
    116,420.5 86,273.9 30,138.85
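
Continuing the R sketch, the three sums of squares can be computed from the fitted model and checked against the identity TSS = RegSS + RSS.

    # Sums of squares from the fitted model
    yhat  <- fitted(fit)               # predicted values
    RSS   <- sum((y - yhat)^2)         # ~86,273.9
    RegSS <- sum((yhat - mean(y))^2)   # ~30,138.85
    TSS   <- sum((y - mean(y))^2)      # ~116,420.5

    all.equal(TSS, RegSS + RSS)        # TRUE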

Just like with ANOVA, it is possible to calculate an omnibus \(F\)-test based on how much variance is explained by the model relative to the error variance. Define:

  • Regression Mean Square (MSReg) = \(\frac{RegSS}{k-1}\)
  • Residual Mean Square (MSR) = \(\frac{RSS}{n - k}\)
  • Total Mean Square (MST) = \(\frac{TSS}{n - 1}\)

where \(n\) is the number of observations and \(k\) is the number of terms to be estimated. Here, \(k = 2\) because we are estimating the intercept and the slope. The \(F\)-statistic is calculated as:

\[ F = \frac{MSReg}{MSR} \]

Given our calculations from above:

\[ F = \left. \frac{30,138.85}{1} \middle/ \frac{86,273.9}{404} \right. = 141.12 \]

The \(F\)-test tells us whether we can do a better job predicting the outcome based on our model compared to just knowing the mean of \(y\). For simple regression (i.e. one predictor), the \(F\)-test turns out to be redundant to the test of \(b\). This is because any \(F\)-test with one numerator degree of freedom and \(n-k\) denominator degrees of freedom can be written as a \(t\)-test with \(n-k\) degrees of freedom by taking the square root of \(F\). That is,

\[ \begin{align} t &= \sqrt{F} \\ &= \sqrt{141.12} \\ &= 11.88 \end{align} \]

which was the value of our \(t\)-statistic.
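
In R, the mean squares and the \(F\)-statistic follow directly from the sums of squares above; summary() stores the same \(F\)-statistic along with its degrees of freedom.

    # Omnibus F-test from the mean squares
    k      <- 2                   # intercept and slope
    MSReg  <- RegSS / (k - 1)
    MSR    <- RSS / (n - k)
    F_stat <- MSReg / MSR         # ~141.1
    sqrt(F_stat)                  # ~11.88, the slope t-statistic

    summary(fit)$fstatistic       # F value with 1 and 404 df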

Another summary statistic for the model as a whole is \(R^2\), which is interpreted as the amount of variability in the outcome that is explained by the model. Like \(F\), \(R^2\) is calculated using the sums of squares calculated above.

\[ R^2 = \frac{RegSS}{TSS} \]

Based on our results,

\[ R^2 = \frac{30,138.85}{116,420.5}= 0.259 \]

This tells us that the Tweet share explains about 25.9% of the variability in a candidate’s vote share. Note that, in the case of simple regression (just one predictor), \(R^2\) is equal to the correlation (Pearson’s \(r\)), squared. Recall above that we estimated \(r = .509\). Squaring:

\[ R^2 = .509^2 = .259 \]
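
The equivalence is easy to verify in R, continuing the sketch: \(R^2\) from the sums of squares, from the squared correlation, and from summary() all agree.

    # R-squared three ways
    RegSS / TSS              # ~0.259
    cor(x, y)^2              # same value
    summary(fit)$r.squared   # same value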

The adjusted \(R^2\) is another tool to determine goodness of model fit. It is typically used in the case of multiple regression to account for the addition of possibly irrelevant predictors. It is given by:

\[ R^2_{adj} = \frac{MST-MSR}{MST} \]

Based on our numbers calculated above,

\[ R^2_{adj} = \frac{\frac{116,420.5}{405}-\frac{86,273.9}{404}}{\frac{116,420.5}{405}} = 0.257 \]

\(R^2_{adj}\) is interpreted similarly to \(R^2\), but its value will generally be lower, representing a more conservative description of the variance explained. The motivation for the adjustment is more evident in the case of multiple regression, when a larger set of independent variables is included in the model. The idea is that adding irrelevant predictors may inflate the unadjusted \(R^2\) simply due to random covariation between the irrelevant predictor and the outcome. Using mean squares rather than sums of squares penalizes the addition of terms that are not truly explanatory.
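
As a final check in R, the adjusted \(R^2\) computed from the mean squares matches the value stored in summary().

    # Adjusted R-squared from the mean squares
    MST <- TSS / (n - 1)
    (MST - MSR) / MST            # ~0.257

    summary(fit)$adj.r.squared   # same value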