How to Estimate a Multiple Linear Regression Model


Multiple Regression

A prior tutorial described simple regression as a mapping of a single predictor to an outcome variable. This tutorial covers the case when there is more than one independent variable, also known as multiple regression. Although simple regression is a useful tool for extracting information about bivariate relationships that goes beyond what we get from a correlation or t-test, the real power of regression comes from its ability to incorporate multiple independent variables. This tutorial will build on the concepts discussed in our simple regression tutorial to explain multiple regression.

The data used in this tutorial are again from the More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior study by DiGrazia J, McKelvey K, Bollen J, Rojas F (2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and the actual vote results. The authors have helpfully provided replication materials. The results presented here are for pedagogical purposes only.

The variables used in this tutorial are:

  • vote_share (dependent variable): The percentage of the vote received by the Republican candidate
  • mshare (independent variable): The percentage of social media posts mentioning the Republican candidate
  • pct_white (independent variable): The percentage of white voters in a given Congressional district

Take a look at the first six observations in the data:

Vote Share | Tweet Share | Percent White
51.09 | 26.26 | 64.2
59.48 | 0.00 | 64.3
57.94 | 65.08 | 75.7
27.52 | 33.33 | 34.6
69.32 | 79.41 | 66.8
53.20 | 40.32 | 70.8
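
If you would like to follow along in code, a minimal pandas sketch for loading the replication data and viewing these six rows might look like the following. The file name and exact column labels are assumptions; substitute whatever the replication materials actually use.

```python
import pandas as pd

# Hypothetical file name -- use the file provided in the replication materials.
df = pd.read_csv("mtmv_data.csv")

# Peek at the first six observations of the three variables used in this tutorial.
print(df[["vote_share", "mshare", "pct_white"]].head(6))
```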

The simple regression post established an association between Republican tweets and Republican votes, but is this relationship spurious? In other words, is there some other variable, for example race, that is confounding this relationship?

In multiple regression, we partial out the independent effect that each independent variable has on the dependent variable. The following graphic demonstrates the sources of variation that are used to estimate \(y\) in a model with two independent variables. Think of \(x_1\) as the variable “Tweet Share” and \(x_2\) as the variable “Percent White.”

Sources of variation in multiple regression

The covariance of \(x_1\) with \(y\) is equal to the blue area plus the orange area, but some of this covariance overlaps with \(x_2\). The partial or independent effect of \(x_1\) is only the blue area.

The covariance of \(x_2\) with \(y\) is equal to the green area plus the orange area, but the partial or independent effect of \(x_2\) is only the green area.

The total effect of \(x_1\) and \(x_2\) on \(y\) is equal to the blue plus orange plus green areas.

The regression model with two variables looks like:

\[ y = a + b_1 x_1 + b_2 x_2 \]

Think of \(b_1\) as the blue area in the above figure and \(b_2\) as the green area. The model’s \(R^2\) (defined below) is the sum of the blue, green, and orange areas.

Where do we get our estimates of \(b_k\)? When there was just one variable, we found the line that minimized the sum of squared errors. That is, we could draw a line through the data points that minimized the squared distance from the line to each observed value.

Simple Regression Illustration

When we have two independent variables, we move into three dimensions.

Regression plot in 3 dimensions

Instead of a line that minimizes the sum of squared errors, we find the plane that does so.

Regression plot in 3 dimensions with plane of best fit

The slant of the plane shows us that Republican vote share increases with both percent white and tweet share.

Assuming we have a large enough sample size so that we have sufficient information to estimate our effects, we can continue adding independent variables. We quickly move beyond three dimensions, though, so we cannot visualize the surface we are estimating.

Our Model

Calculating each \(b_k\) in a multiple regression model by hand quickly becomes complicated because it involves matrix algebra: the coefficients solve the normal equations, \(\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\). Carrying out these calculations for our data yields the following regression equation:

\[ y = 0.865 + 0.178x_1 + 0.55x_2 \]

where \(y\) is vote_share, \(x_1\) is mshare, and \(x_2\) is pct_white.
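
As a sketch of how this model might be estimated in practice, the following fits the two-predictor regression with statsmodels, assuming the data frame df from the loading sketch above:

```python
import statsmodels.api as sm

# Design matrix: an intercept column plus the two predictors.
X = sm.add_constant(df[["mshare", "pct_white"]])
fit = sm.OLS(df["vote_share"], X).fit()

print(fit.params)  # the intercept (a) and the slopes b1 (mshare) and b2 (pct_white)
```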

The coefficient \(0.865\) is the value for \(y\) when \(x_1\) and \(x_2\) both are zero. In this case, it is unlikely that there are any candidates from districts that were zero percent white with zero Tweets mentioning the Republican candidate, so we won’t spend time interpreting that number. More interesting is how much the outcome varies when the predictors are changed.

From the model, we gather that a one percentage point increase in tweet share increases vote share by \(0.178\) percentage points, holding percent white constant. A one percentage point increase in percent white increases vote share by \(0.55\) percentage points, holding tweet share constant.
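
For example, plugging the first observation (tweet share 26.26, percent white 64.2) into the equation gives

\[ \hat{y} = 0.865 + 0.178(26.26) + 0.55(64.2) \approx 40.85, \]

which is the first fitted value \(\hat{y}\) that appears in the tables later in this tutorial.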

We know that, even if the true values for our coefficients were zero, we’d still see values that are larger or smaller than zero simply due to sample-to-sample variability. Are our estimates large enough to be unlikely under the null hypothesis? We can assess this using \(t\)-tests.

Hypothesis Tests for Coefficients

We can create a separate hypothesis for each of our \(b_k\) values:

\[ H_0: b_k=0 \\ H_a: b_k \ne 0 \]

where \(k\) indexes a specific independent variable. Can we reject the null hypothesis that \(b_1 = 0\)? What about \(b_2 = 0\)? We use a \(t\)-statistic to perform our hypothesis test. It is calculated as:

\[ t = \frac{b}{SE_b} \]

The standard error for \(b\) is:

\[ SE_b = \frac{SE_R}{\sqrt{\sum(x-\bar{x})^2}} \]

The numerator is called the residual standard error (equivalently, it is sometimes called the standard error of the residual). The formula to find the residual standard error is:

\[ SE_R = \sqrt{\frac{RSS}{N-k}} \]

where \(k\) is the number of model parameters, including the intercept. RSS is the error, or residual, sum of squares, and can be found using:

\[ RSS = \sum_{i=1}^n(y_i-\hat{y})^2 \]

where \(\hat{y}\) is the estimate for \(y\) given our regression equation. To illustrate, consider the first six observations in the data.

\(y\) | \(\hat{y}\) | \((y-\hat{y})\) | \((y-\hat{y})^2\)
51.09 | 40.85 | 10.24 | 104.94
59.48 | 36.23 | 23.25 | 540.59
57.94 | 54.08 | 3.85 | 14.83
27.52 | 25.83 | 1.69 | 2.85
69.32 | 51.74 | 17.58 | 309.20
53.20 | 46.98 | 6.22 | 38.69
  • \(y\) is the observed vote_share value from the data
  • \(\hat{y}\) is the predicted value from the regression equation
  • \((y - \hat{y})\) is the observed value minus the predicted value
  • \((y - \hat{y})^2\) is the squared difference between the observed and predicted values

Then we sum the \((y - \hat{y})^2\) column across all observations in the data to find that the RSS is:

\(RSS = 51,909.06\)
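
In code, assuming numpy arrays y and y_hat that hold the observed and fitted values for all 406 observations (for example, y_hat = fit.fittedvalues from the statsmodels sketch above), the RSS is a one-liner:

```python
import numpy as np

rss = np.sum((y - y_hat) ** 2)  # residual sum of squares
```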

Now we can find \(SE_R\):

\[ SE_R = \sqrt{\frac{51,909.06}{406-3}}=11.35 \]

We can calculate \(\sum(x-\bar{x})^2\) for the tweet share variable and find:

\(x\) | \(\bar{x}\) | \((x-\bar{x})\) | \((x-\bar{x})^2\)
26.26 | 50.12 | -23.86 | 569.22
0.00 | 50.12 | -50.12 | 2512.10
65.08 | 50.12 | 14.96 | 223.76
33.33 | 50.12 | -16.79 | 281.82
79.41 | 50.12 | 29.29 | 857.96
40.32 | 50.12 | -9.80 | 96.01

The sum over all 406 observations is:

\(\sum(x-\bar{x})^2 = 418,059.5\)

Thus,

\[ SE_b = \frac{11.35}{\sqrt{418,059.5}}=0.018 \]

and the \(t\) statistic to test the significance of the “Tweet Share” variable can be found as

\[ t = \frac{0.178}{0.018}=9.89 \]

We compare this to a \(t\) distribution with \(N-k = 403\) degrees of freedom and find that \(p \lt .001\). In other words, we reject the null hypothesis: the effect of tweet share, holding district race constant, is significantly different from zero.
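
A short sketch of this arithmetic, mirroring the rounded hand calculation above and using scipy only for the p-value:

```python
from scipy import stats

se_r = (51909.06 / (406 - 3)) ** 0.5   # residual standard error, about 11.35
se_b = 11.35 / 418059.5 ** 0.5         # standard error of the coefficient, about 0.018
t = 0.178 / 0.018                      # about 9.89, using the rounded values from the text
p = 2 * stats.t.sf(t, df=403)          # two-sided p-value, far below .001
```

The exact value reported by statistical software may differ slightly from this hand calculation, but the conclusion is the same.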

The same process can be repeated with the percent white variable. We already found \(SE_R = 11.35\), so we will use that to find \(SE_b\).

We can calculate \(\sum(x-\bar{x})^2\) for the percent white variable and find:

\(\sum(x-\bar{x})^2 = 124,805\)

Thus,

\[ SE_b = \frac{11.35}{\sqrt{124,805}}=0.032 \]

and the \(t\) statistic for the percent white variable can be found as

\[ t = \frac{0.55}{0.032}=17.19 \]

We compare this to a \(t\) distribution with \(N-k = 403\) degrees of freedom and find that \(p \lt .001\). That is, the effect of district race, holding tweet share constant, is significantly different from zero.

Sum of Squares

What is nice about regression is that, no matter how many predictors there are, the sums of squares calculations are always the same. Just as was done in the simple regression tutorial, it is possible to partition out the types of variability in a regression model into the following:

  • The residual sum of squares is \(\sum(y_i - \hat{y}_i)^2\), abbreviated as RSS.
  • The regression sum of squares is \(\sum(\hat{y}_i - \bar{y})^2\), abbreviated as RegSS.
  • The total sum of squares is \(\sum(y_i - \bar{y})^2\), abbreviated as TSS.

The only thing that is different from simple regression is that our predictions, \(\hat{y}\), come from a model with more than one independent variable. The calculation of RSS was shown in the prior section, as it is an essential part of calculating the residual standard error used in the hypothesis tests for the coefficients.

To illustrate the calculation of RegSS, take the first six observations’ values for \(y\) and add columns for the model prediction and the overall average of \(y\). The deviations are then easy to calculate:

\(y\) | \(\hat{y}\) | \(\bar{y}\) | \((\hat{y}-\bar{y})\) | \((\hat{y}-\bar{y})^2\)
51.09 | 40.85 | 50.5 | -9.65 | 93.16
59.48 | 36.23 | 50.5 | -14.27 | 203.68
57.94 | 54.08 | 50.5 | 3.58 | 12.83
27.52 | 25.83 | 50.5 | -24.67 | 608.77
69.32 | 51.74 | 50.5 | 1.24 | 1.53
53.20 | 46.98 | 50.5 | -3.52 | 12.38

Summing over all 406 observations, the RegSS is:

\(RegSS = 64,417.71\)

The total sum of squares, TSS, is equal to the sum of squared differences between each observed value and the mean of \(y\).

\[ TSS = \sum_{i=1}^n(y_i-\bar{y})^2 \]

Illustrating with the first six observations:

\(y\) | \(\bar{y}\) | \((y-\bar{y})\) | \((y-\bar{y})^2\)
51.09 | 50.5 | 0.59 | 0.35
59.48 | 50.5 | 8.98 | 80.62
57.94 | 50.5 | 7.43 | 55.27
27.52 | 50.5 | -22.99 | 528.37
69.32 | 50.5 | 18.82 | 354.30
53.20 | 50.5 | 2.70 | 7.30

Summing over all 406 observations, the TSS is:

\(TSS = 116,420.5\)

Finally, we can partition the total sum of squares into the part explained by the regression, \(RegSS\), and the residual sum of squares, \(RSS\). That is, the total sum of squares is equal to the regression sum of squares plus the residual sum of squares:

\[ TSS = RegSS + RSS \]
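
As a sketch, using the same numpy arrays y and y_hat as before, the three quantities and the identity can be checked directly:

```python
import numpy as np

tss = np.sum((y - np.mean(y)) ** 2)        # total sum of squares
regss = np.sum((y_hat - np.mean(y)) ** 2)  # regression sum of squares
rss = np.sum((y - y_hat) ** 2)             # residual sum of squares

print(np.isclose(tss, regss + rss))        # True for an OLS model with an intercept
```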

\(F\)-test, \(R^2\), and \(R^2_{adj}\)

The regression \(F\) statistic is found similarly to the simple regression model:

\[ F = \frac{MSReg}{MSR} \]

Where

  • Regression Mean Square, \(MSReg = \frac{RegSS}{k-1}\)
  • Residual Mean Square, \(MSR = \frac{RSS}{n-k}\)

Note that \(k\) is the number of parameters estimated in the regression model; here \(k = 3\) for two independent variables plus the constant \(a\). The null hypothesis of the regression \(F\)-test is that the independent variables together do not explain any variability in the dependent variable.

\[ F = \left. \frac{64,417.71}{2} \middle/ \frac{51,909.06}{403} \right. = 250.06 \]

We can compare the calculated \(F\) statistic against an \(F\) distribution with degrees of freedom equal to \(k-1=2\) and \(N-k=403\). We find that the model is significant with a \(p\)-value \(\lt .001\). Note that, if even one predictor is significant, the \(F\)-test is nearly always significant as well, so it is not always interpreted in publications.
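
The same calculation in code, plugging in the sums of squares reported above and using scipy's \(F\) distribution for the p-value:

```python
from scipy import stats

ms_reg = 64417.71 / 2      # regression mean square, df = k - 1 = 2
ms_res = 51909.06 / 403    # residual mean square, df = N - k = 403
f = ms_reg / ms_res        # about 250.06
p = stats.f.sf(f, 2, 403)  # upper-tail p-value, far below .001
```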

The sums of squares can also get us to our \(R^2\), which is now the amount of variance explained in our dependent variable by all of the independent variables simultaneously.

The steps to getting the \(R^2\) are the same as before. We take the ratio of the regression sum of squares to the total sum of squares.

\[ R^2 = \frac{RegSS}{TSS} \]

The value ranges from zero (nothing in the dependent variable is explained by the two variables together) to one (the dependent variable is completely explained by the two variables together).

\[ R^2 = \frac{64,417.71}{116,420.5} = 0.553 \]

This tells us that tweet share and percent white together explain about 55.3% of the variability in a candidate’s vote share.

A more conservative estimate of the variance explained is the adjusted \(R^2\).

\[ Adj\ R^2 = \frac{MST-MSR}{MST} \]

Where \(MST = \frac{TSS}{n-1}\)

For our model,

\[ Adj \ R^2 = \frac{\frac{116,326.8}{405}-\frac{51,909.06}{403}}{\frac{116,326.8}{405}}=0.551 \]

\(R^2_{adj}\) is interpreted similarly to \(R^2\), but its value will generally be lower, representing a more conservative description of variance explained. The idea is that adding irrelevant predictors may inflate the first \(R^2\) simply due to random covariation between the irrelevant predictor and the outcome. Using mean squares rather than sums of squares provides penalization for adding terms that are not truly explanatory.
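
If the model was fit with statsmodels as sketched earlier, none of these quantities need to be computed by hand; the fitted results object exposes them directly (the values will differ slightly from the rounded hand calculations above):

```python
print(fit.rsquared)              # R-squared
print(fit.rsquared_adj)          # adjusted R-squared
print(fit.fvalue, fit.f_pvalue)  # regression F statistic and its p-value
print(fit.summary())             # full table, including the coefficient t-tests
```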

Still have questions? Contact us!