What is R-squared?

Caleb Scheidel


R-squared (\(R^2\)) is one of the most commonly used goodness-of-fit measures for linear regression. It ranges from zero to one and reflects how well the independent variables in a model explain the variability in the outcome variable. Also called the coefficient of determination, an \(R^2\) of 0 indicates that the regression model explains none of the variation in the outcome variable, while an \(R^2\) of 1 indicates that the model explains all of it. This post walks through how to calculate \(R^2\), how to judge whether your model has a “good” \(R^2\) value, and some of the limitations of using \(R^2\) to assess model fit.

The data used in this tutorial are again from the More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior study by DiGrazia J, McKelvey K, Bollen J, Rojas F (2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and their actual vote results. The authors have helpfully provided replication materials. The results presented here are for pedagogical purposes only.

The variables used in this tutorial are:

  • vote_share (dependent variable): The percent of voters for a Republican candidate
  • mshare (independent variable): The percent of social media posts for a Republican candidate
  • mccain_tert (independent variable): The vote share John McCain received in the 2008 election in the district, divided into tertiles.

Take a look at the first six observations in the data:

| Vote Share | Tweet Share | McCain Vote Share Tertile |
|---|---|---|
| 51.09 | 26.26 | Top Tertile |
| 59.48 | 0.00 | Top Tertile |
| 57.94 | 65.08 | Top Tertile |
| 27.52 | 33.33 | Bottom Tertile |
| 69.32 | 79.41 | Top Tertile |
| 53.20 | 40.32 | Top Tertile |
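
If you would like to follow along in Python, a minimal loading sketch might look like the following. The file name is hypothetical (point it at the replication data you downloaded), and the column names are the ones listed above.

```python
import pandas as pd

# Hypothetical file name -- substitute the path to the replication data you downloaded
tweets = pd.read_csv("more_tweets_more_votes.csv")

# The three variables used in this tutorial (column names as listed above)
print(tweets[["vote_share", "mshare", "mccain_tert"]].head(6))
```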

How to calculate R-squared

Let’s first focus on the context of simple regression, with one continuous predictor variable in the model. We will use vote share as the outcome variable, and tweet share as the lone predictor variable. To get an idea of the relationship between these two variables, we can visualize them using a scatterplot and regression line:

The relationship between vote share and tweet share is positive and strong, with a Pearson’s \(r\) correlation of 0.509, though the observed values are not very tightly clustered around the regression line. The regression equation for our data is \(\hat{y} = 37.02 + 0.269x\). That is, the expected vote share for a candidate with a tweet share of zero is 37.02 percent, and for each one-percentage-point increase in tweet share, the expected vote share increases by 0.269 percentage points. Our tutorial on simple regression walks through how the correlation and regression equation were calculated from the data in this example.
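
As a rough sketch of how the correlation and regression equation could be reproduced (assuming the tweets DataFrame from the loading sketch above, with mshare as the predictor and vote_share as the outcome):

```python
import numpy as np

x = tweets["mshare"].to_numpy()      # tweet share (predictor)
y = tweets["vote_share"].to_numpy()  # vote share (outcome)

# Pearson's r between tweet share and vote share
r = np.corrcoef(x, y)[0, 1]

# Least-squares slope and intercept for y-hat = a + b * x
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()

print(round(r, 3), round(a, 2), round(b, 3))  # should be close to 0.509, 37.02, 0.269
```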

The calculation of \(R^2\) builds off of the sums of squares framework. We can partition out the types of variability in a regression model into different types of sums of squares:

  • The residual sum of squares is \(\sum(y_i - \hat{y}_i)^2\), abbreviated as RSS.
  • The regression sum of squares is \(\sum(\hat{y}_i - \bar{y})^2\), abbreviated as RegSS.
  • The total sum of squares is \(\sum(y_i - \bar{y})^2\), abbreviated as TSS.

The residual sum of squares refers to the amount of variability in the observed values around the predicted values, while the regression sum of squares refers to how much variability the predicted values show around the overall mean. The total sum of squares is the amount of variability in the observed values around the overall mean. Note that TSS = RegSS + RSS.

\(R^2\) is calculated using the formula:

\[ R^2 = \frac{RegSS}{TSS} \]

Regression models with higher \(R^2\) values will have more tightly clustered observed values around the regression line (fitted values) as compared to models with lower \(R^2\) values.
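
Continuing the same sketch, the three sums of squares and \(R^2\) can be computed directly from the fitted values (this reuses the x, y, a, and b defined above):

```python
# Fitted values from the simple regression above
y_hat = a + b * x
y_bar = y.mean()

rss    = np.sum((y - y_hat) ** 2)      # residual sum of squares
reg_ss = np.sum((y_hat - y_bar) ** 2)  # regression sum of squares
tss    = np.sum((y - y_bar) ** 2)      # total sum of squares

r_squared = reg_ss / tss               # equivalently, 1 - rss / tss
print(round(r_squared, 3))             # should be close to 0.259
```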

The following table illustrates the calculations for the first six observations in the dataset.

| \(x\) | \(y\) | \(\bar{y}\) | \((y_i-\bar{y})\) | \((y_i-\bar{y})^2\) | \(\hat{y}\) | \((\hat{y}-\bar{y})\) | \((\hat{y}-\bar{y})^2\) | \((y_i-\hat{y})\) | \((y_i-\hat{y})^2\) |
|---|---|---|---|---|---|---|---|---|---|
| 26.26263 | 51.09377 | 50.50157 | 0.5921987 | 0.3506993 | 44.09392 | 6.999856 | 48.99799 | -6.407658 | 41.058076 |
| 0.00000 | 59.48065 | 50.50157 | 8.9790780 | 80.6238413 | 37.04240 | 22.438251 | 503.47509 | -13.459173 | 181.149330 |
| 65.07937 | 57.93567 | 50.50157 | 7.4340970 | 55.2657986 | 54.51621 | 3.419460 | 11.69271 | 4.014637 | 16.117309 |
| 33.33333 | 27.51530 | 50.50157 | -22.9862747 | 528.3688229 | 45.99240 | -18.477102 | 341.40330 | -4.509173 | 20.332639 |
| 79.41176 | 69.32448 | 50.50157 | 18.8229065 | 354.3018094 | 58.36446 | 10.960020 | 120.12205 | 7.862886 | 61.824978 |
| 40.32258 | 53.20280 | 50.50157 | 2.7012247 | 7.2966151 | 47.86901 | 5.333785 | 28.44926 | -2.632560 | 6.930371 |

Note that the variables in the data are:

  • \(\bar{y}\): Mean of Y
  • \(\hat{y}\): Predicted value of Y
  • \(\hat{y}-\bar{y}\): Predicted minus mean of Y
  • \((\hat{y}-\bar{y})^2\): (Predicted minus mean of Y)\(^2\)
  • \(y_i - \bar{y}\): Actual value minus mean of Y
  • \((y_i - \bar{y})^2\): (Actual value minus mean of Y)\(^2\)
  • \(y_i-\hat{y}\): Actual value minus predicted value
  • \((y_i-\hat{y})^2\): (Actual value minus predicted value)\(^2\)

The sums of squares are found by taking the sum across all observations for the corresponding column. That is:

  • The residual sum of squares is equal to the sum of the \((y_i-\hat{y})^2\) column.
  • The regression sum of squares is equal to the sum of the \((\hat{y}-\bar{y})^2\) column.
  • The total sum of squares is equal to the sum of the \((y_i - \bar{y})^2\) column.

| TSS | RSS | RegSS |
|---|---|---|
| 116,420.5 | 86,273.9 | 30,138.85 |

Based on our results,

\[ R^2 = \frac{30,138.85}{116,420.5}= 0.259 \]

This tells us that the Tweet share explains about 25.9% of the variability in a candidate’s vote share. Note that, in the case of simple regression (just one predictor), \(R^2\) is equal to the correlation (Pearson’s \(r\)), squared. Recall above that we estimated \(r = .509\). Squaring:

\[ R^2 = .509^2 = .259 \]

A similar concept applies in the multiple regression context, where \(R^2\) is equal to the square of the multiple correlation between \(y\) and the combination of independent variables. The sums of squares are calculated using the same formulas as in the simple regression context, with the only difference being that our predictions, \(\hat{y}\), come from a model with more than one independent variable. Our tutorial on multiple regression shows a full example of how to calculate \(R^2\) from a multiple regression model.
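
In practice you would rarely compute these sums by hand. As a hedged sketch, statsmodels’ formula interface can fit a multiple regression and report \(R^2\) directly; the model specification here is illustrative and assumes the same tweets DataFrame as above:

```python
import statsmodels.formula.api as smf

# C() tells statsmodels to dummy-code the tertile variable automatically
fit = smf.ols("vote_share ~ mshare + C(mccain_tert)", data=tweets).fit()

print(fit.rsquared)      # R-squared for the multiple regression
print(fit.rsquared_adj)  # adjusted R-squared, discussed later in this post
```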

What is a “good” R-squared value?

How can we assess whether our model has a “good” or acceptable \(R^2\) value? In the simple regression case, we can use Cohen’s (1988) heuristics for evaluating correlations. A correlation coefficient of .10 (\(R^2\) = 0.01) is generally considered to be a weak or small association; a correlation coefficient of .30 (\(R^2\) = 0.09) is considered a moderate association; and a correlation coefficient of .50 (\(R^2\) = 0.25) or larger is thought to represent a strong or large association.

In the multiple regression context, we can similarly assess Cohen’s \(f^2\) effect size measure, which is defined as:

\[ f^2 = \frac{R^2}{1 - R^2} \]

An \(f^2\) of 0.02 (\(R^2\) = 0.02) is generally considered to be a weak or small effect; an \(f^2\) of 0.15 (\(R^2\) = 0.13) is considered a moderate effect; and an \(f^2\) of 0.35 (\(R^2\) = 0.26) is thought to represent a strong or large effect.
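
As a quick worked example, plugging the simple-regression \(R^2\) of 0.259 from earlier into this formula gives an \(f^2\) of roughly 0.35:

```python
def cohens_f2(r_squared: float) -> float:
    """Cohen's f^2 effect size computed from an R-squared value."""
    return r_squared / (1 - r_squared)

print(round(cohens_f2(0.259), 2))  # about 0.35, a "large" effect by Cohen's benchmarks
```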

Is there a significance test for R-squared?

The F-test from a regression is used to assess the statistical significance of \(R^2\). As an example, let’s fit a linear regression model with a single polytomous (multi-category) independent variable. In our data, we will use the following variables:

  • vote_share (dependent variable): The percent of voters for a Republican candidate
  • mccain_tert (independent variable): The vote share John McCain received in the 2008 election in the district, divided into tertiles.

Although we think of the tertile as a single variable with three levels, we will actually have to recode it into what are called dummy variables. A dummy variable is a variable that takes on only one of two values. It is coded one if an observation belongs to a certain category and zero if the observation does not.

We can code a dummy variable for the bottom tertile that equals one if the district is in the bottom tertile and zero if it is not.

We code a second dummy variable for the middle tertile that equals one if the district is in the middle tertile and zero if it is not.

Note that we do not need to code a third dummy variable for the top tertile: if a district is zero on both bottom and middle, it must be in the top tertile. Whenever there are \(k\) categories, only \(k − 1\) dummy variables are needed. To illustrate,

| Vote Share | Bottom | Middle |
|---|---|---|
| 51.09 | 0 | 0 |
| 59.48 | 0 | 0 |
| 57.94 | 0 | 0 |
| 27.52 | 1 | 0 |
| 69.32 | 0 | 0 |
| 53.20 | 0 | 0 |

The fourth observation in the data is a district whose McCain vote was in the bottom tertile. None of these observations were in the middle tertile, which means all but the fourth were in the top tertile.
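
One way to build these dummy variables in code is pandas’ get_dummies; note that which tertile ends up as the dropped reference category depends on how the categories are ordered, so treat this as a sketch rather than an exact reproduction of the table above:

```python
import pandas as pd

# 0/1 indicator columns for the tertiles; drop_first removes the reference category.
# Which tertile becomes the reference depends on how the categories are ordered.
dummies = pd.get_dummies(tweets["mccain_tert"], drop_first=True).astype(int)

# Attach the dummy columns so they can be used as regression predictors
tweets = pd.concat([tweets, dummies], axis=1)
print(tweets.head(6))
```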

To estimate a regression model with dummy variables, we add each of the \(k − 1\) dummy variables as separate predictors.

\[ \text{Vote Share}=a +b_1(\text{Bottom}) + b_2(\text{Middle}) \]

If we estimate the model, we get a separate \(b\) coefficient for each of the tertiles. Our estimates turn out to be:

\[ \hat{y} = 66.49 - 34.23(\text{Bottom}) - 10.35(\text{Middle}) \]

This model has an \(R^2\) of 0.6968. To assess whether the independent variable, McCain tertile, explains a significant amount of variation in the vote share outcome variable, we would look to the \(F\) test. The regression \(F\) statistic is found using the following formula:

\[ F = \frac{MSReg}{MSR} \]

Where

  • Regression Mean Square, \(MSReg\) = \(\frac{RegSS}{k-1}\)
  • Residual Mean Square, \(MSR\) = \(\frac{RSS}{n-k}\)

Note that \(k\) is the number of parameters estimated in the regression model. Here \(k = 3\): two dummy-variable coefficients plus an intercept.

\(RegSS\) is the regression sum of squares, given by:

\[ \sum(\hat{y}_i - \bar{y})^2 \]

The regression sum of squares sums over the squared difference between an observation’s predicted value (\(\hat{y}\)) and the overall average (\(\bar{y}\)).

| \(y\) | \(\hat{y}\) | \(\bar{y}\) | \((\hat{y}-\bar{y})\) | \((\hat{y}-\bar{y})^2\) |
|---|---|---|---|---|
| 51.09 | 66.49 | 50.5 | 15.99 | 255.74 |
| 59.48 | 66.49 | 50.5 | 15.99 | 255.74 |
| 57.94 | 66.49 | 50.5 | 15.99 | 255.74 |
| 27.52 | 32.26 | 50.5 | -18.24 | 332.81 |
| 69.32 | 66.49 | 50.5 | 15.99 | 255.74 |
| 53.20 | 66.49 | 50.5 | 15.99 | 255.74 |

  • \(y\) is vote_share
  • \(\hat{y}\) is the predicted value
  • \(\bar{y}\) is the mean of vote_share
  • \((\hat{y} - \bar{y})\) is the predicted minus the mean value
  • \((\hat{y} - \bar{y})^2\) is the predicted minus the mean value squared

Summing the \((\hat{y}-\bar{y})^2\) column over all 406 observations gives \(RegSS = 81{,}117.6\).

\(RSS\) is the residual sum of squares, given by:

\[ \sum(y_i - \hat{y})^2 \]

To illustrate, consider again the first six observations:

| \(y\) | \(\hat{y}\) | \((y-\hat{y})\) | \((y-\hat{y})^2\) |
|---|---|---|---|
| 51.09 | 66.49 | -15.40 | 237.15 |
| 59.48 | 66.49 | -7.01 | 49.18 |
| 57.94 | 66.49 | -8.56 | 73.24 |
| 27.52 | 32.26 | -4.74 | 22.50 |
| 69.32 | 66.49 | 2.83 | 8.01 |
| 53.20 | 66.49 | -13.29 | 176.64 |

  • \(y\) is vote_share
  • \(\hat{y}\) is the predicted value
  • \((y - \hat{y})\) is the actual minus predicted value
  • \((y - \hat{y})^2\) is the actual minus predicted value squared

We then sum the \((y - \hat{y})^2\) column over all 406 observations to find \(RSS = 35{,}303.06\).

The null hypothesis is that the independent variables together do not explain any variability in the dependent variable.

\[ F = \left. \frac{81,117.6}{2} \middle/ \frac{35,303.06}{403} \right. = 463.00 \]

We can compare the calculated \(F\) statistic against an \(F\) distribution with \(k - 1 = 2\) numerator and \(N - k = 403\) denominator degrees of freedom. We find that the model is significant with a \(p\)-value of \(\lt .001\). In other words, we can reject the null hypothesis that there are no differences between McCain support tertiles in Republican vote share.
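
To verify the arithmetic, here is a small sketch using scipy’s \(F\) distribution, plugging in the sums of squares quoted above:

```python
from scipy import stats

reg_ss, rss = 81117.6, 35303.06
n, k = 406, 3  # k counts the intercept plus the two dummy coefficients

f_stat = (reg_ss / (k - 1)) / (rss / (n - k))
p_value = stats.f.sf(f_stat, k - 1, n - k)  # upper-tail probability

print(round(f_stat, 1), p_value)  # about 463.0, p far below .001
```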

Note that, for a single categorical variable, this is exactly the same as the \(F\) test from a one-way ANOVA. However, regression allows for any combination of categorical and continuous variables. ANOVA, therefore, can be considered a special case of multiple regression.

| Term | DF | Sum of Squares | Mean Square | F Statistic | p-value |
|---|---|---|---|---|---|
| McCain Tertile | 2 | 81117.47 | 40558.73 | 463 | < 0.001 |
| Residuals | 403 | 35303.06 | 87.60 | | |

What is the difference between R-squared and adjusted R-squared?

A more conservative estimate of the variance explained by a model is the adjusted R-squared (\(R^2_{adj}\)). It is given by:

\[ R^2_{adj} = \frac{MST-MSR}{MST} \]

where MST = \(\frac{TSS}{n-1}\) and MSR = \(\frac{RSS}{n-k}\), as defined above.

\(R^2_{adj}\) is interpreted similarly to \(R^2\), but its value will generally be lower, representing a more conservative description of variance explained. The motivation for the adjustment is more evident in the case of multiple regression, when a larger set of independent variables is included in the model. The idea is that adding irrelevant predictors may inflate the unadjusted \(R^2\) simply due to random covariation between the irrelevant predictor and the outcome. Using mean squares rather than sums of squares provides penalization for adding terms that are not truly explanatory.

For a model that includes both the McCain tertile dummy variables and tweet share as predictors (so \(k = 4\) estimated parameters), we find:

\[ R^2_{adj} = \frac{\frac{116,421.1}{405}-\frac{31,068.04}{402}}{\frac{116,421.1}{405}} = 0.7311 \]

as compared to the \(R^2\) value of 0.7331.
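
Plugging the quoted sums of squares into the formula, as a quick check (small differences in the last decimal place are expected because the sums of squares above are themselves rounded):

```python
# Sums of squares quoted above for the tertile-plus-tweet-share model
tss, rss = 116421.1, 31068.04
n, k = 406, 4  # intercept, two tertile dummies, and tweet share

mst = tss / (n - 1)
msr = rss / (n - k)

adj_r2 = (mst - msr) / mst
r2 = 1 - rss / tss

print(round(r2, 4), round(adj_r2, 4))  # roughly 0.733 and 0.731
```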

What are the limitations of R-squared?

\(R^2\) has some limitations and can easily be misinterpreted, particularly if your model overfits the sample you are using. With enough predictors relative to the number of observations, a model can produce an \(R^2\) close to one even when the relationship between the predictors and the outcome is entirely random. A common mistake is to include too many predictors in a model relative to the number of observations. When this happens, the regression coefficients do not accurately represent the actual relationships in the population, yet \(R^2\) can still be high. The goal, therefore, should not be to maximize \(R^2\), but to focus on the theoretical hypotheses behind your model. If the goal of your analysis is prediction rather than causal inference, use cross-validation and assess model fit with \(R^2\) on a hold-out set to avoid overfitting.
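
As one hedged sketch of that workflow, scikit-learn’s cross-validation can estimate an out-of-sample \(R^2\); the predictor set here is illustrative only and again assumes the tweets DataFrame from earlier:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = tweets[["mshare"]]   # illustrative predictor set: tweet share only
y = tweets["vote_share"]

# Five-fold cross-validated R-squared: each fold is scored on data the model never saw
cv_r2 = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=5)
print(cv_r2.mean())
```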