R-squared (\(R^2\)) is one of the most commonly used goodness-of-fit measures for linear regression. It ranges from zero to one and reflects how well the independent variables in a model explain the variability in the outcome variable. Also called the coefficient of determination, an \(R^2\) of 0 indicates that the regression model explains none of the variation in the outcome variable, while an \(R^2\) of 1 indicates that the model explains all of it. This post walks through how to calculate \(R^2\), how to assess whether your model has a “good” \(R^2\) value, and some of the limitations of using \(R^2\) to assess model fit.

The data used in this tutorial are again from the *More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior* study by DiGrazia J, McKelvey K, Bollen J, Rojas F (2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and the actual vote results. The authors have helpfully provided replication materials. The results presented here are for pedagogical purposes only.

The variables used in this tutorial are:

`vote_share` (*dependent variable*): The percent of voters for a Republican candidate.

`mshare` (*independent variable*): The percent of social media posts for a Republican candidate.

`mccain_tert` (*independent variable*): The vote share John McCain received in the 2008 election in the district, divided into tertiles.

Take a look at the first six observations in the data:

Vote Share | Tweet Share | McCain Vote Share Tertile |
---|---|---|
51.09 | 26.26 | Top Tertile |
59.48 | 0.00 | Top Tertile |
57.94 | 65.08 | Top Tertile |
27.52 | 33.33 | Bottom Tertile |
69.32 | 79.41 | Top Tertile |
53.20 | 40.32 | Top Tertile |

## How to calculate R-squared

Let’s first focus on the context of simple regression, with one continuous predictor variable in the model. We will use vote share as the outcome variable, and tweet share as the lone predictor variable. To get an idea of the relationship between these two variables, we can visualize them with a scatterplot and a fitted regression line.

The relationship between vote share and tweet share is positive and strong, with a Pearson’s *r* correlation of 0.509, though the observed values are not very tightly clustered around the regression line. The regression equation for our data is \(\hat{y} = 37.02 + 0.269x\). That is, the expected vote share for a candidate with a tweet share of zero is 37.02. For each one percentage point increase in tweet share, the expected vote share increases by 0.269 percentage points. Our tutorial on simple regression walks through how the correlation and regression equation were calculated from the data in this example.

The calculation of \(R^2\) builds off of the sums of squares framework. We can partition out the types of variability in a regression model into different types of sums of squares:

- The residual sum of squares is \(\sum(y_i - \hat{y}_i)^2\), abbreviated as RSS.
- The regression sum of squares is \(\sum(\hat{y}_i - \bar{y})^2\), abbreviated as RegSS.
- The total sum of squares is \(\sum(y_i- \bar{y})^2\), abbreviated as TSS.

The residual sum of squares refers to the amount of variability in the observed values around the predicted values, while the regression sum of squares refers to how much variability the predicted values show around the overall mean. The total sum of squares is the amount of variability in the observed values around the overall mean. Note that TSS = RegSS + RSS.

\(R^2\) is calculated using the formula:

\[ R^2 = \frac{RegSS}{TSS} \]

Regression models with higher \(R^2\) values will have more tightly clustered observed values around the regression line (fitted values) as compared to models with lower \(R^2\) values.
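The partition above can be computed directly. As a minimal sketch in NumPy, using a small hypothetical dataset (not the study’s data):

```python
import numpy as np

# Hypothetical illustration, not the study's dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # outcome

# Fit the least-squares line y = a + b*x
b, a = np.polyfit(x, y, 1)                 # slope, then intercept
y_hat = a + b * x                          # fitted values

rss = np.sum((y - y_hat) ** 2)             # residual sum of squares
reg_ss = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
tss = np.sum((y - y.mean()) ** 2)          # total sum of squares

r_squared = reg_ss / tss                   # TSS = RegSS + RSS, so 0 <= R^2 <= 1
```

For a least-squares fit, the identity TSS = RegSS + RSS holds up to floating-point error, which is why \(R^2\) is bounded between zero and one.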

The following table illustrates the calculations for the first six observations in the dataset.

\(x\) | \(y\) | \(\bar{y}\) | \((y_i-\bar{y})\) | \((y_i-\bar{y})^2\) | \(\hat{y}\) | \((y_i-\hat{y})\) | \((y_i-\hat{y})^2\) | \((\hat{y}-\bar{y})\) | \((\hat{y}-\bar{y})^2\) |
---|---|---|---|---|---|---|---|---|---|
26.26263 | 51.09377 | 50.50157 | 0.5921987 | 0.3506993 | 44.09392 | 6.999856 | 48.99799 | -6.407658 | 41.058076 |
0.00000 | 59.48065 | 50.50157 | 8.9790780 | 80.6238413 | 37.04240 | 22.438251 | 503.47509 | -13.459173 | 181.149330 |
65.07937 | 57.93567 | 50.50157 | 7.4340970 | 55.2657986 | 54.51621 | 3.419460 | 11.69271 | 4.014637 | 16.117309 |
33.33333 | 27.51530 | 50.50157 | -22.9862747 | 528.3688229 | 45.99240 | -18.477102 | 341.40330 | -4.509173 | 20.332639 |
79.41176 | 69.32448 | 50.50157 | 18.8229065 | 354.3018094 | 58.36446 | 10.960020 | 120.12205 | 7.862886 | 61.824978 |
40.32258 | 53.20280 | 50.50157 | 2.7012247 | 7.2966151 | 47.86901 | 5.333785 | 28.44926 | -2.632560 | 6.930371 |

Note that the columns in the table are:

- \(\bar{y}\): Mean of Y
- \(\hat{y}\): Predicted value of Y
- \(\hat{y}-\bar{y}\): Predicted minus mean of Y
- \((\hat{y}-\bar{y})^2\): (Predicted minus mean of Y)\(^2\)
- \(y_i - \bar{y}\): Actual value minus mean of Y
- \((y_i - \bar{y})^2\): (Actual value minus mean of Y)\(^2\)
- \(y_i-\hat{y}\): Actual value minus predicted value
- \((y_i-\hat{y})^2\): (Actual value minus predicted value)\(^2\)

The sums of squares are found by taking the sum across all observations for the corresponding column. That is:

- The residual sum of squares is equal to the sum of the \((y_i-\hat{y})^2\) column.
- The regression sum of squares is equal to the sum of the \((\hat{y}-\bar{y})^2\) column.
- The total sum of squares is equal to the sum of the \((y_i - \bar{y})^2\) column.

TSS | RSS | RegSS |
---|---|---|
116,420.5 | 86,273.9 | 30,138.85 |

Based on our results,

\[ R^2 = \frac{30,138.85}{116,420.5}= 0.259 \]

This tells us that the Tweet share explains about 25.9% of the variability in a candidate’s vote share. Note that, in the case of simple regression (just one predictor), \(R^2\) is equal to the correlation (Pearson’s \(r\)), squared. Recall above that we estimated \(r = .509\). Squaring:

\[ R^2 = .509^2 = .259 \]

A similar concept applies in the multiple regression context, where \(R^2\) is equal to the *multiple* correlation between \(y\) and the combination of independent variables. The sums of squares are calculated using the same formula as the simple regression context, with the only difference being that our predictions, \(\hat{y}\), come from a model with more than one independent variable. Our tutorial on multiple regression shows a full example of how to calculate \(R^2\) from a multiple regression model.
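The identity \(R^2 = r^2\) for simple regression can be checked numerically. A sketch on simulated data (the coefficients below are made up for illustration, not taken from the study):

```python
import numpy as np

# Simulated simple-regression data (hypothetical coefficients)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 37.0 + 0.27 * x + rng.normal(scale=0.5, size=200)

r = np.corrcoef(x, y)[0, 1]              # Pearson's r

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
r_squared = (np.sum((y_hat - y.mean()) ** 2)
             / np.sum((y - y.mean()) ** 2))

# In simple regression, R^2 equals the squared correlation
```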

## What is a “good” R-squared value?

How can we assess whether our model has a “good” or acceptable \(R^2\) value? In the simple regression case, we can use Cohen’s (1988) heuristics for evaluating correlations. A correlation coefficient of .10 (\(R^2\) = 0.01) is generally considered to be a weak or small association; a correlation coefficient of .30 (\(R^2\) = 0.09) is considered a moderate association; and a correlation coefficient of .50 (\(R^2\) = 0.25) or larger is thought to represent a strong or large association.

In the multiple regression context, we can similarly assess Cohen’s \(f^2\) effect size measure, which is defined as:

\[ f^2 = \frac{R^2}{1 - R^2} \]

An \(f^2\) of 0.02 (\(R^2\) = 0.02) is generally considered to be a weak or small effect; an \(f^2\) of 0.15 (\(R^2\) = 0.13) is considered a moderate effect; and an \(f^2\) of 0.35 (\(R^2\) = 0.26) is thought to represent a strong or large effect.
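As a small helper (a sketch, not part of the study’s materials), the conversion from \(R^2\) to \(f^2\) is:

```python
def f_squared(r_squared):
    """Cohen's f^2 effect size computed from R^2."""
    return r_squared / (1.0 - r_squared)

# The benchmarks above: R^2 values of 0.02, 0.13, and 0.26 correspond
# (to two decimal places) to f^2 values of 0.02, 0.15, and 0.35
```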

## Is there a significance test for R-squared?

The F-test from a regression is used to assess the statistical significance of \(R^2\). As an example, let’s fit a simple linear regression model with a polytomous independent variable. In our data, we will use the following variables:

`vote_share` (*dependent variable*): The percent of voters for a Republican candidate.

`mccain_tert` (*independent variable*): The vote share John McCain received in the 2008 election in the district, divided into tertiles.

Although we think of the tertile as a single variable with three levels, we will actually have to recode it into what are called *dummy variables*. A dummy variable is a variable that takes on only one of two values. It is coded one if an observation belongs to a certain category and zero if the observation does not.

We can code a dummy variable for the bottom tertile where the variable equals one if the person is in it and zero if they are not.

We code a second dummy variable for the middle tertile where the variable equals one if they are in the middle and zero if they’re not.

Note that we do not need to code a third dummy variable for the top tertile, since if a given voter is zero for bottom and middle, they must be in the top. Whenever there are \(k\) categories, it is necessary to create \(k − 1\) dummy variables. To illustrate,

Vote Share | Bottom | Middle |
---|---|---|
51.09 | 0 | 0 |
59.48 | 0 | 0 |
57.94 | 0 | 0 |
27.52 | 1 | 0 |
69.32 | 0 | 0 |
53.20 | 0 | 0 |

The fourth observation in the data is a district whose McCain vote was in the bottom tertile. None of these observations were in the middle tertile, which means all but the fourth were in the top tertile.
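The recoding above can be sketched with pandas; `get_dummies` is one of several ways to do it (formula interfaces in R or statsmodels handle the recoding automatically). The six labels mirror the table:

```python
import pandas as pd

# The six tertile labels from the table above; declaring all three
# categories ensures a "Middle" column exists even with no middle cases
tert = pd.Categorical(
    ["Top", "Top", "Top", "Bottom", "Top", "Top"],
    categories=["Bottom", "Middle", "Top"],
)

# Keep k - 1 = 2 dummies; "Top" is the omitted reference category
dummies = pd.get_dummies(tert, dtype=int)[["Bottom", "Middle"]]
```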

To estimate a regression model with dummy variables, we add each of the \(k − 1\) dummy variables as separate predictors.

\[ \text{Vote Share}=a +b_1(\text{Bottom}) + b_2(\text{Middle}) \]

If we estimate the model, we get a separate \(b\) coefficient for each of the tertiles. Our estimates turn out to be:

\[ \hat{y} = 66.49 - 34.23(\text{Bottom}) - 10.35(\text{Middle}) \]

This model has an \(R^2\) of 0.6968. To assess whether the independent variable, McCain tertile, explains a significant amount of variation in the vote share outcome variable, we would look to the \(F\) test. The regression \(F\) statistic is found using the following formula:

\[ F = \frac{MSReg}{MSR} \]

Where

- Regression Mean Square, \(MSReg\) = \(\frac{RegSS}{k-1}\)
- Residual Mean Square, \(MSR\) = \(\frac{RSS}{n-k}\)

Note that \(k\) is the number of parameters estimated in the regression model. Here \(k = 3\): two dummy variables plus an intercept.

\(RegSS\) is the regression sum of squares, given by:

\[ \sum(\hat{y}_i - \bar{y})^2 \]

The regression sum of squares sums over the squared difference between an observation’s predicted value (\(\hat{y}\)) and the overall average (\(\bar{y}\)).

\(y\) | \(\hat{y}\) | \(\bar{y}\) | \((\hat{y}-\bar{y})\) | \((\hat{y}-\bar{y})^2\) |
---|---|---|---|---|
51.09 | 66.49 | 50.5 | 15.99 | 255.74 |
59.48 | 66.49 | 50.5 | 15.99 | 255.74 |
57.94 | 66.49 | 50.5 | 15.99 | 255.74 |
27.52 | 32.26 | 50.5 | -18.24 | 332.81 |
69.32 | 66.49 | 50.5 | 15.99 | 255.74 |
53.20 | 66.49 | 50.5 | 15.99 | 255.74 |

- \(y\) is `vote_share`
- \(\hat{y}\) is the predicted value
- \(\bar{y}\) is the mean of `vote_share`
- \((\hat{y} - \bar{y})\) is the predicted value minus the mean
- \((\hat{y} - \bar{y})^2\) is the predicted value minus the mean, squared

Summing over all 406 observations, the RegSS is:

\(RegSS\) |
---|
81,117.6 |

\(RSS\) is the residual sum of squares, given by:

\[ \sum(y_i - \hat{y})^2 \]

To illustrate, consider again the first six observations:

\(y\) | \(\hat{y}\) | \((y-\hat{y})\) | \((y-\hat{y})^2\) |
---|---|---|---|
51.09 | 66.49 | -15.40 | 237.15 |
59.48 | 66.49 | -7.01 | 49.18 |
57.94 | 66.49 | -8.56 | 73.24 |
27.52 | 32.26 | -4.74 | 22.50 |
69.32 | 66.49 | 2.83 | 8.01 |
53.20 | 66.49 | -13.29 | 176.64 |

- \(y\) is `vote_share`
- \(\hat{y}\) is the predicted value
- \((y - \hat{y})\) is the actual value minus the predicted value
- \((y - \hat{y})^2\) is the actual value minus the predicted value, squared

We then sum the \((y - \hat{y})^2\) column to find the RSS is:

\(RSS\) |
---|
35,303.06 |

The null hypothesis is that the independent variables *together* do not explain any variability in the dependent variable.

\[ F = \left. \frac{81,117.6}{2} \middle/ \frac{35,303.06}{403} \right. = 463.00 \]

We can compare the calculated \(F\) statistic against an \(F\) distribution with degrees of freedom equal to \(k-1=2\) and \(N-k=403\). We find that the model is significant with a \(p\)-value of \(\lt .001\). In other words, we can reject the null hypothesis that there are no differences between McCain support tertiles in Republican vote share.
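The F computation can be reproduced in a few lines using the sums of squares from the tables above; `scipy.stats.f.sf` (an assumed SciPy dependency) gives the upper-tail p-value:

```python
from scipy.stats import f as f_dist

reg_ss, rss = 81117.6, 35303.06   # RegSS and RSS from the tables above
n, k = 406, 3                     # 406 districts; intercept plus two dummies

ms_reg = reg_ss / (k - 1)         # regression mean square, df = 2
ms_res = rss / (n - k)            # residual mean square, df = 403
F = ms_reg / ms_res
p_value = f_dist.sf(F, k - 1, n - k)  # upper-tail probability
```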

Note that, for a single categorical variable, this is exactly the same as the \(F\) test from a one-way ANOVA. However, regression allows for any combination of categorical and continuous variables. ANOVA, therefore, can be considered a special case of multiple regression.

Term | DF | Sum of Squares | Mean Square | F Statistic | p-value |
---|---|---|---|---|---|
McCain Tertile | 2 | 81117.47 | 40558.73 | 463 | < 0.001 |
Residuals | 403 | 35303.06 | 87.60 | NA | NA |

## What is the difference between R-squared and adjusted R-squared?

A more conservative estimate of the variance explained by a model is the adjusted R-squared (\(R^2_{adj}\)). It is given by:

\[ R^2_{adj} = \frac{MST-MSR}{MST} \]

where MST = \(\frac{TSS}{n-1}\).

\(R^2_{adj}\) is interpreted similarly to \(R^2\), but its value will generally be lower, representing a more conservative description of variance explained. The motivation for the adjustment is more evident in the case of multiple regression, when a larger set of independent variables is included in the model. The idea is that adding irrelevant predictors may inflate the unadjusted \(R^2\) simply due to random covariation between the irrelevant predictor and the outcome. Using mean squares rather than sums of squares provides penalization for adding terms that are not truly explanatory.

Based on our numbers calculated above for the model with dummy variables for McCain tertile and tweet share as predictors, we find:

\[ R^2_{adj} = \frac{\frac{116,421.1}{405}-\frac{31,068.04}{402}}{\frac{116,421.1}{405}} = 0.7311 \]

This compares to an unadjusted \(R^2\) of 0.7331.
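The adjustment is straightforward to reproduce in code, using the sums of squares quoted above (here \(n = 406\) observations and \(k = 4\) parameters: an intercept plus three predictors):

```python
tss, rss = 116421.1, 31068.04   # total and residual sums of squares
n, k = 406, 4                   # observations; intercept plus three predictors

mst = tss / (n - 1)             # total mean square
msr = rss / (n - k)             # residual mean square

r2 = 1 - rss / tss              # unadjusted R^2
r2_adj = (mst - msr) / mst      # adjusted R^2 (always <= r2)
```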

## What are the limitations of R-squared?

\(R^2\) has some limitations and can easily be misinterpreted, particularly if your model overfits the sample you are using. \(R^2\) can be misleading in this case: a model can have an entirely random relationship between the predictor variables and the outcome in the population and still produce a high \(R^2\) close to one in the sample. A common mistake is to include too many predictors in a model relative to the number of observations. When this happens, the regression coefficients do not accurately represent the actual relationships in the population, but the model can still show a high \(R^2\). The goal should therefore not be to maximize \(R^2\), but to focus on the theoretical hypotheses of your model. If the goal of your analysis is prediction rather than causal inference, make sure to use cross-validation and assess model fit using \(R^2\) on a hold-out set to avoid overfitting.
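A hedged sketch of that cross-validated check, using scikit-learn (an assumed dependency) on simulated data rather than the study’s dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Simulated data: three predictors, one of which is irrelevant
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([0.5, -0.3, 0.0]) + rng.normal(scale=1.0, size=100)

# scoring="r2" reports out-of-sample R^2 for each held-out fold;
# comparing it to the in-sample R^2 helps flag overfitting
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
```

Unlike in-sample \(R^2\), the fold scores can be negative when the model predicts worse than the hold-out mean, which is itself a useful diagnostic.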