An important part of regression modeling is performing diagnostics to verify that assumptions behind the model are met and that there are no problems with the data that are skewing the results. This tutorial builds on prior posts covering simple and multiple regression as well as regression with nominal independent variables. The same data will be used here.

The variables used in this tutorial are:

`vote_share` (*dependent variable*): The percent of voters for a Republican candidate.

`mshare` (*independent variable*): The percent of social media posts for a Republican candidate.

`pct_white` (*independent variable*): The percent of white voters in a given Congressional district.

`mccain_tert` (*independent variable*): The vote share John McCain received in the district in the 2008 election, divided into tertiles.

`rep_inc` (*independent variable*): Whether the Republican candidate was an incumbent or not.

The `mccain_tert` variable will be treated as a categorical predictor and hence will be entered into the regression model as dummy variables, with the lowest tertile used as the reference category. Likewise, `rep_inc` is a categorical variable that takes on two values: 1 = incumbent, 0 = non-incumbent (reference category).

Take a look at the first six observations in the data:

Vote Share | Tweet Share | Republican Incumbent | McCain Tertile | Percent White |
---|---|---|---|---|
51.09 | 26.26 | Republican not Incumbent | Top Tertile | 64.2 |
59.48 | 0.00 | Republican Incumbent | Top Tertile | 64.3 |
57.94 | 65.08 | Republican not Incumbent | Top Tertile | 75.7 |
27.52 | 33.33 | Republican not Incumbent | Bottom Tertile | 34.6 |
69.32 | 79.41 | Republican Incumbent | Top Tertile | 66.8 |
53.20 | 40.32 | Republican not Incumbent | Top Tertile | 70.8 |

If we run the linear regression, we get the following model:

Term | Estimate | Std. Error | t-statistic | p-value |
---|---|---|---|---|
(Intercept) | 13.229 | 1.487 | 8.893 | < 0.001 |
Percent Republican Tweets | 0.041 | 0.012 | 3.504 | < 0.001 |
Republican Incumbent | 12.322 | 0.867 | 14.205 | < 0.001 |
% White Voters | 0.292 | 0.023 | 12.768 | < 0.001 |
Middle McCain Support Tertile | 11.882 | 0.964 | 12.332 | < 0.001 |
Top McCain Support Tertile | 18.721 | 1.103 | 16.975 | < 0.001 |

All of our estimates are significant. Now it’s time to assess the assumptions in our model. The core assumptions underlying multiple regression are:

- Normality of Residuals
- Linearity
- Homoskedasticity
- No Perfect Multicollinearity
- No Outliers
- Independence of Observations

The remainder of this post explains each of these, the consequences of their violation, how to assess if the assumption is met, and what to do if any is violated.

### Normality of Residuals

Recall that a residual is the difference between the actual and predicted values of the dependent variable. The statistical theory underlying hypothesis tests assumes that the distribution of the residuals is normal, though in relatively large samples the Central Limit Theorem kicks in and violations from normality are less consequential.

On the other hand, non-normality can signify other problems with our model specification. For example, it may be evidence that the assumption of linearity is violated for one or more independent variables (discussed below). Alternatively, highly skewed residuals may be indicative that one of the variables in the model itself is highly skewed and would benefit from a transformation.

Another reason to explore the distribution of residuals is because they can sometimes reveal outlying observations. A residual more than two or three standard deviations from the mean (zero) may represent an observation that is having disproportionate influence on the model estimates. The subsection below on outliers explains more comprehensive ways to assess potential outlier problems, but the problem may first appear in your assessment of the distribution of the residuals.

The easiest way to assess normality is through the use of graphs. First, calculate the residual for each subject as the difference between the observed value and the model predicted value. Next, standardize the residuals to have a mean of zero and standard deviation of one (any statistical software will do these two steps for you). Finally, examine a histogram and/or qq-plot to assess how normal the distribution is.
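The first two steps are simple enough to do by hand. A minimal sketch in Python (the tutorial's own analysis uses R; the residual values below are hypothetical):

```python
import statistics

def standardize(residuals):
    """Rescale residuals to mean zero and standard deviation one."""
    mean = statistics.fmean(residuals)
    sd = statistics.stdev(residuals)
    return [(r - mean) / sd for r in residuals]

# Hypothetical residuals (observed minus predicted values)
z = standardize([3.1, -0.5, 2.0, -4.2, -0.4])
```

The standardized values can then be passed to any histogram or qq-plot routine.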

The following is a histogram of standardized residuals for our model of candidate performance. The standardization is based on the standard deviation of the residuals using all of the values (later this post will discuss leave-one-out residuals). A true normal curve is also added to the figure as a reference.

The residuals are not perfectly normal, but they will never be. Given our sample size (\(n = 406\)) and the lack of enormous skew or peakedness to the distribution, we can have greater confidence in our hypothesis tests.

Another popular graphical display of residuals is a qq-plot, which compares the quantiles of a true normal distribution against the actual quantiles of the residuals. The resulting points will follow perfectly along the diagonal of the plot if the standardized residuals are exactly normal (or distributed \(t\), depending on the distribution your software uses for comparison). The following shows the qq-plot for our model.

There is some snaking around the diagonal line, and the higher residuals in particular tend to be greater than their theoretical value under true normality. At the same time, in finite samples we expect some deviations from the true normal distribution, and these deviations are not large (you’ll know a bad qq-plot when you see it!). We again proceed treating the normality assumption as met.
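The construction behind a qq-plot is straightforward: sort the standardized residuals and pair them with evenly spaced quantiles of a reference distribution. A Python sketch using the standard normal and one common plotting-position rule (software packages differ slightly in the rule they use):

```python
from statistics import NormalDist

def qq_points(std_residuals):
    """Pair each sorted residual with its theoretical normal quantile."""
    n = len(std_residuals)
    observed = sorted(std_residuals)
    # (i + 0.5) / n is one common plotting-position convention
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))
```

Points from a perfectly normal sample fall on the 45-degree diagonal.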

Many people are uncomfortable with a purely visual assessment of normality and want a test to provide a pass/fail verdict. Such tests do exist, such as the Kolmogorov-Smirnov or Shapiro-Wilk tests, but they tend to be too pessimistic and cause unnecessary worry for students working with linear models for the first time. In reasonably sized samples, the assumption of normality is less consequential for inferences. At the same time, graphical examinations of normality can point you towards reassessing the linearity assumption, transforming variables, or identifying outliers.

### Linearity

The linearity assumption is primarily relevant for interval-level (i.e. non-dummy) variables. It is the assumption that the effect of an independent variable can be summarized with a straight line. An easy way to begin assessing this assumption is by plotting each independent variable against the dependent variable. Most statistical software allows you to add several different types of fit lines for comparison. A particularly helpful fit line is known as a *loess* fit, which is based on predictions from a series of locally weighted regressions across the range of the predictor. The following figure shows such a plot for the `mshare` independent variable.

We want the plot to show a straight (i.e. *linear*) loess line, which is what we see in the prior figure. The same figure can be used for the other interval-level variable in the model, percentage of whites in the Congressional district.

The line appears pretty straight until higher levels of percent white, at which point it levels off.

Both figures represent a good starting point for looking at our data, and they suggest that linearity is appropriate for our interval-level predictors. However, the shortcoming of these figures is that they do not account for other variables in the model. We consequently need to consider a few more figures.

First, we will plot the residuals against the observed values of the predictor (separate graphs for each independent variable). We would have evidence for non-linearity if the residuals tend upwards, or downwards, or snake back and forth. The following figure shows the results of plotting `mshare` against the residuals.

The residuals look randomly distributed in the figure, so we are still confident linearity applies for the Tweet share variable. The next plot is for the `pct_white` variable.

This figure shows a U-shaped relationship between percent white and the residuals, which suggests that there may be nonlinearity in the association between `pct_white` and `vote_share`. We may wish to test a model in which the effect of `pct_white` is specified as quadratic.

An alternative figure that may be more informative is the *added variable plot*, also called a *partial regression plot*. These figures show the relationship between \(x\) and \(y\) after removing the effects of the other variables. First, \(y\) is regressed on all of the \(x\)s except for \(x_j\), and the residuals are saved. Next, \(x_j\) is regressed on all of the other \(x\)s, and the residuals are saved. These residuals are then plotted against each other. The interpretation is the relationship between the part of \(x_j\) that is independent of the other \(x\)s and the part of \(y\) not explained by the other \(x\)s.
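The two auxiliary regressions behind an added variable plot can be sketched directly. A minimal Python version using least squares via `numpy` (the tutorial's own figures come from R's `car` package); by the Frisch-Waugh-Lovell theorem, the slope of a line fit through these points equals the full-model coefficient for \(x_j\):

```python
import numpy as np

def added_variable_points(X, y, j):
    """Residual pairs for an added-variable plot of predictor column j.

    X is the n x k predictor matrix without an intercept; one is added here.
    """
    n = X.shape[0]
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    # Part of y not explained by the other predictors
    e_y = y - others @ np.linalg.lstsq(others, y, rcond=None)[0]
    # Part of x_j independent of the other predictors
    e_x = X[:, j] - others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    return e_x, e_y
```

Plotting `e_x` against `e_y` (and adding a loess line) then reproduces the figure for any predictor.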

The following figure, created using the `car` package in R, shows the added variable plots for all predictors (including the categorical ones).

Unfortunately, the software output defaults to adding a straight line rather than a loess line, which makes it a bit more difficult to see nonlinearities. In the graph of `pct_white | others` you can see that the straight line tends to yield predictions of `vote_share | others` that are too high, suggesting some possible nonlinearities. However, the overall relationship looks more linear than in the previous figure that did not account for the other variables in the model, and hence we may choose to continue with the linear specification.

Yet another option for examining the functional form is a *partial residual plot*, also called a *component plus residual plot*. In a partial residual plot, the \(y\)-axis represents the residual plus the linear component of the respective independent variable, \(e_i^{(j)} = e_i + b_jx_{ij}\), for the \(i\)th subject and the \(j\)th predictor. A separate plot is again made for each independent variable. These figures can also help identify outliers (discussed below), and we can also plot the distribution of component plus residuals across levels of our categorical variables with boxplots. The `crPlots` function from the `car` package in R produces both. Example output is the following:

The figures include both a dashed, straight line indicating a perfectly linear relationship and a solid loess line. The `pct_white` result deviates from the straight line in a manner that indicates some type of transformation may be beneficial. The variable is negatively skewed, hence it may make more sense to flip the scale (becoming percent non-white) and take the log.

Getting the correct functional form is a matter of substantive theory and the goals of the model. A polynomial or logarithmic specification may better reflect the expected association between \(x\) and \(y\) if, for example, it is expected that \(x\) has a larger or smaller effect on \(y\) at higher values. In addition, getting the functional form correct may improve model fit as summarized with the \(R^2\). On the other hand, if the goal is simply to come up with the best linear approximation to the outcome, or if over-fitting to the sample is a concern, the analyst may choose to treat everything as linear even if there is evidence for nonlinearity.

### Homoskedasticity

*Homoskedasticity* refers to the residuals having constant variance. In other words, the model does not do a better job of predicting the outcome for some observations than for others. *Heteroskedasticity* is when this is not the case. Correct determination of statistical significance requires homoskedasticity. In addition, violations of homoskedasticity are another indication that the model is not a good fit for the data, including but not limited to cases where the linearity assumption should be abandoned. Heteroskedasticity may also indicate that the model is missing an important predictor that should be added.

Once again, graphical summaries are a good means of assessing whether heteroskedasticity is a problem. A common starting point is a *residuals versus fitted plot*. In this figure, the standardized residuals are plotted on the \(y\)-axis, and the predicted values from the model are plotted on the \(x\)-axis. From our model we get the following:

What we hope to see is pure noise, which is indeed what the graph shows. A violation would occur if we noted that the residuals clump closely around the zero line in one part of the graph and spread widely around zero in another part. The *dreaded trumpet* is an example of a heteroskedastic plot that occurs when the spread around zero increases as the fitted values increase.

A variation of the residuals versus fitted plot is the *scale-location plot*, which has the square root of the absolute value of the standardized residuals on the \(y\)-axis.

We hope not to see evidence of the variance decreasing (or increasing) across the range of the fitted values. We are satisfied again that this is the case.

There is also a statistical test that can be performed to determine if there is evidence for heteroskedasticity. The *Breusch-Pagan test* is based on regressing the squared and standardized residuals on the independent variables in the model and using the sums of squares to estimate a chi-square distributed variable. If the value of the chi-square exceeds the threshold (equivalently, if \(p < 0.05\)), we reject the null hypothesis that the residuals are homoskedastic. That is, for this test we prefer a non-significant result.
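The mechanics are easy to sketch for the studentized (Koenker) variant of the test, which uses \(n \times R^2\) from the auxiliary regression as the chi-square statistic; the classic Breusch-Pagan form instead scales the squared residuals and takes the explained sum of squares divided by two. A hedged Python sketch (the tutorial's analysis uses R):

```python
import numpy as np

def breusch_pagan_koenker(X, residuals):
    """n * R^2 from regressing squared residuals on the predictors.

    The statistic is compared against a chi-square with k degrees of
    freedom, where k is the number of predictor columns in X.
    """
    n = len(residuals)
    u2 = residuals ** 2
    Z = np.column_stack([np.ones(n), X])
    fitted = Z @ np.linalg.lstsq(Z, u2, rcond=None)[0]
    r2 = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return n * r2, X.shape[1]
```

A large statistic (small \(p\)-value) means the squared residuals are predictable from the \(x\)s, i.e. evidence of heteroskedasticity.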

Running the Breusch-Pagan test on our model we get \(\chi^2 = 33.87\), \(df = 5\), \(p < 0.001\).

We therefore have conflicting results. The plots did not concern us greatly, though in the section on linearity we did see that `pct_white` may be better modeled with a nonlinear term. We can try re-running the model with an appropriate transformation and see if the Breusch-Pagan test improves. More generally, we can fit just about any regression model using so-called heteroskedasticity-robust standard errors, also called Huber-White or “sandwich” standard errors. The intuition is that, because heteroskedasticity does not bias the coefficient estimates and only affects the standard errors, we can adjust our standard errors (and thus the \(p\)-values) to address the non-constant variance. Most software provides options to request robust standard errors.

What do we do if this assumption is violated? If you have not yet performed the tests, check the linearity assumption. Also consider adding in variables that may be relevant, if available, or transforming skewed variables in the model. You can use the heteroskedasticity-robust standard errors if these remedies still fail.

### Multicollinearity

Multicollinearity refers to the situation when one (or more) predictors are highly correlated with one (or more) other predictors. The intuition is illustrated in the following figure showing an outcome, \(y\), that is regressed on two independent variables, \(x_1\) and \(x_2\). Our regression estimate for \(x_1\) represents the area marked Blue, and the estimate for \(x_2\) is the area marked Green. The Orange and Red areas overlap, and the model cannot determine how much of the covariation with \(y\) gets attributed to either independent variable. When the two variables overlap substantially, such that the Blue and Green areas are small, our results may end up being non-significant because the *independent* effect of each predictor is too small. In extreme cases, we may get coefficient estimates that are enormous because of the lack of precision in the estimates. If the two variables overlap completely, the model will not be estimable at all.

In regression, we talk about *multicollinearity* because one predictor may covary substantially with a combination of multiple other variables. Think of regressing the predictor \(x_1\) on three other predictors in a multiple regression, \(x_2\), \(x_3\), and \(x_4\). A large \(R^2\) in this regression of \(x_1\) on the other \(x\)s would indicate substantial multicollinearity. That is, \(x_1\) may not have a large bivariate correlation with any of the other three by themselves, but together the other covariates have a substantial *multiple* correlation with \(x_1\).

In fact, the most common way to diagnose multicollinearity is through regressions of each independent variable on all of the others. Two statistics, the *Tolerance* and the *Variance Inflation Factor* (VIF) are typically evaluated. Tolerance is defined as:

\[ \text{Tolerance} = 1 - R^2_{x_j} \]

where \(R^2_{x_j}\) is the \(R^2\) from regressing \(x_j\) on the other \(x_{k \neq j}\) independent variables. Thus, there is a separate value for each of the \(k\) predictors. The VIF is defined as

\[ \text{VIF} = \frac{1}{1 - R^2_{x_j}} = \frac{1}{\text{Tolerance}} \]

The VIF gets its name because it is a measure of how much the variance of regression estimates has been increased due to multicollinearity. Often, researchers prefer to consider the square root of the VIF, which is interpretable as how much the standard error is inflated. For example, an independent variable whose VIF is 4.25 has its standard error more than doubled (\(\sqrt{4.25} = 2.06\)) due to multicollinearity.
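Both statistics follow directly from the auxiliary \(R^2\), so a quick sketch needs only arithmetic (the numbers here are the illustrative ones from the text):

```python
import math

def tolerance(r2_aux):
    """1 - R^2 from regressing x_j on the other predictors."""
    return 1 - r2_aux

def vif(r2_aux):
    """Variance inflation factor: the reciprocal of the tolerance."""
    return 1 / tolerance(r2_aux)

# The example from the text: a VIF of 4.25 roughly doubles the SE
se_inflation = math.sqrt(4.25)
```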

Another approach sometimes used for assessing multicollinearity is to perform a type of Principal Components Analysis (PCA) to assess how much overlapping variability there is in the predictors. On the basis of the PCA we get a set of \(k\) eigenvalues, where \(k\) equals the number of independent variables in our model. The *condition index* is the square root of the ratio of the largest eigenvalue to each successive eigenvalue.
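Given the eigenvalues, the condition indices are a one-liner. A sketch using the eigenvalues reported for this model later in the section (small discrepancies come from the rounding of the published eigenvalues):

```python
import math

def condition_indices(eigenvalues):
    """sqrt(largest eigenvalue / each eigenvalue), one per dimension."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / ev) for ev in eigenvalues]

# Eigenvalues from this model's predictors
ci = condition_indices([4.112, 1.042, 0.479, 0.197, 0.150, 0.020])
```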

Like most regression diagnostics, interpretation of multicollinearity diagnostic statistics is more of an art than a science. Generally, a VIF greater than eight (some say ten) indicates a problem. The developers of SPSS recommend that condition indices not exceed 15, with major problems occurring at 30. If you find results approaching these values, but your coefficient estimates are plausible and significant, you probably do not need to worry (though you may want to verify that the variables in your model truly are all separate concepts).

The tolerance, VIFs, and square root of the VIFs are the following for our model:

Variable | Tolerance | VIF | Square Root VIF |
---|---|---|---|
Percent Republican Tweets | 0.729 | 1.373 | 1.172 |
Republican Incumbent | 0.620 | 1.614 | 1.270 |
% White Voters | 0.641 | 1.561 | 1.249 |
Middle McCain Support Tertile | 0.475 | 2.104 | 1.450 |
Top McCain Support Tertile | 0.427 | 2.344 | 1.531 |

Note that there are two rows for the McCain Support variable, one for each tertile dummy. The *generalized VIF* is an extension of the VIF that provides a single value for McCain Support, accounting for the fact that it is really a single variable that, by being nominal, has to be represented by more than one dummy variable. The generalized VIF is interpreted similarly and is equal to the VIF for non-nominal variables. When the square root is calculated, it is necessary to account for the degrees of freedom, which will be the number of dummies required for that variable.

Variable | GVIF | df | \(GVIF^{1/(2 \cdot df)}\) |
---|---|---|---|
Percent Republican Tweets | 1.373 | 1 | 1.172 |
Republican Incumbent | 1.614 | 1 | 1.270 |
% White Voters | 1.561 | 1 | 1.249 |
McCain Support Tertile | 2.054 | 2 | 1.197 |

The following table shows the condition indices:

Dimension | Eigenvalue | Condition Index |
---|---|---|
1 | 4.112 | 1.000 |
2 | 1.042 | 1.987 |
3 | 0.479 | 2.929 |
4 | 0.197 | 4.574 |
5 | 0.150 | 5.235 |
6 | 0.020 | 14.264 |

None of these statistics raise red flags.

If you do find multicollinearity to be a problem, it may be due to one or more variables that are badly coded or that do not vary much (e.g. for a sample of \(n = 435\), a dummy variable that equals one for two subjects and zero for the remaining 433). Alternatively, it may be that you have multiple variables indicating the same thing, and it is a good idea to convert them to a single scale. In some models, such as regressions with interactions between continuous predictors, it may help to convert the variables to z-scores and re-run the model.

### Outliers

Outliers are values that are very far from the rest of the data; they can be either on \(x\), on \(\hat{y}\), or on both. Note that we say \(\hat{y}\) and not \(y\), since we are concerned about values far from the *predicted* value. Not all outliers are a cause for concern, so it is important to identify what types we have in our data.

The following figures visualize different types of outliers. In addition, the figures show the regression line fit to the hypothetical data. The first figure shows an example of an outlier on \(x\).

The next figure shows an outlier on \(\hat{y}\):

Finally, we have an example of an outlier on both \(x\) and \(\hat{y}\):

It is this last case that is most problematic, and the figure shows why. When an observation is an outlier on both \(x\) and \(\hat{y}\), the slope of the regression line can be biased towards the outlier. The solid line is the regression fit to all of the data. The dashed line is the regression fit to all observations except that outlier.

We therefore need to come up with measures of “outlyingness”, and especially measures that quantify how much the regression estimates have been affected by outliers.

The statistic that summarizes whether we have an outlier on the \(x\) values is known as *leverage*. Each observation can be assigned a leverage value, \(h_{i}\), which indicates how much of an outlier the subject is from the centroid of all \(x_k\). We indicate the leverage value with an \(h\) rather than \(l\) because it comes from the *hat matrix*, \(H = \boldsymbol{X}(\boldsymbol{X}^{\prime}\boldsymbol{X})^{-1}\boldsymbol{X}^{\prime}\). Here \(\boldsymbol{X}\) is the matrix formed by the independent variables; the dependent variable is not involved. Hence, leverage is a measure of outliers among the independent variables, which, as we saw, does not necessarily mean that high leverage observations are biasing the regression slope.
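Given the design matrix, the leverage values are simply the diagonal of \(H\). A numpy sketch (production software computes this via a QR decomposition rather than an explicit inverse, but the result is the same):

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'.

    X should already contain the intercept column.
    """
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)
```

A handy sanity check: the leverages always sum to the number of model parameters.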

We also need a way to quantify how much of an outlier an observation is on \(\hat{y}\). A seemingly obvious way would be to just look at the residuals - the distance of each observed value from its prediction - like we did already above when we looked at the distribution of residuals. However, if an observation is a true outlier that pulls the regression line towards itself, that observation’s residual will be artificially too small. Instead, we can take a look at *studentized (deleted) residuals* which are the result of estimating the regression without that observation, generating the prediction, and then calculating the difference between the actual observed and new predicted value. Intuitively, we do this as many times as we have observations, fitting the model each time without the respective observation.
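The leave-one-out procedure can be written as a literal loop. A brute-force Python sketch (real software uses a closed-form shortcut instead of refitting \(n\) times, but the loop makes the idea concrete):

```python
import numpy as np

def studentized_deleted_residuals(X, y):
    """Refit without each observation and studentize its prediction error.

    X should include the intercept column.
    """
    n, p = X.shape
    out = []
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        b = np.linalg.lstsq(Xi, yi, rcond=None)[0]
        resid = yi - Xi @ b
        s2 = resid @ resid / (n - 1 - p)  # leave-one-out error variance
        # Variance factor for predicting the held-out point
        v = 1 + X[i] @ np.linalg.inv(Xi.T @ Xi) @ X[i]
        out.append((y[i] - X[i] @ b) / np.sqrt(s2 * v))
    return np.array(out)
```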

The following figure shows the leverage for each observation. The \(x\)-axis is simply the observation number. The data can be sorted in any order, as we are simply looking for values that stick out from the cloud of other observations.

There is one value that sticks out higher than the others, observation 112. A high leverage observation, however, may not necessarily affect the slope estimates unless it also yields a large studentized residual. Because the studentized (deleted) residuals are standardized (using the leave-one-out variance estimate), they should follow a \(t\) distribution and hence can be plotted as a histogram centered at zero.

We see most of these residuals fall between -2 and +2, like we would expect from a \(t\)-distributed random variable. On the low end there are a few larger outliers, but nothing that we would not expect given the distribution with \(n = 406\).

We have defined leverage (outliers on \(\boldsymbol{X}\)) and studentized (deleted) residuals (outliers on \(\hat{y}\) based on leave-one-out calculations). Yet we need both to determine if any cases are having a disproportionate influence on the results. Combining these leads to measures called *dfBetas*, *Cook’s d*, and *dfFits*.

The intuition for dfBetas is simple. Just like with studentized (deleted) residuals, we can effectively fit the model \(n\) times, once for each observation, leaving that observation out. Then we determine the amount of change between the coefficient with all of the data versus the coefficient after deleting observation \(i\). These changes are then standardized by the coefficient's standard error (based on the model without \(i\)). For \(k\) variables and \(n\) observations, we end up with \(kn\) values. This is a lot of information, so we can rely on graphs. The following graph has the ID again on the \(x\)-axis and the absolute value of dfBeta on the \(y\)-axis. The lines are colored based on the variable, and values greater than 0.3 are labeled.
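The deletion loop above translates almost line for line into code. A brute-force Python sketch (software computes the same quantities with a closed-form shortcut):

```python
import numpy as np

def dfbetas(X, y):
    """n x p matrix of standardized coefficient changes from deleting
    each observation in turn. X should include the intercept column."""
    n, p = X.shape
    b_full = np.linalg.lstsq(X, y, rcond=None)[0]
    out = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        b = np.linalg.lstsq(Xi, yi, rcond=None)[0]
        resid = yi - Xi @ b
        s2 = resid @ resid / (n - 1 - p)
        se = np.sqrt(s2 * np.diag(np.linalg.inv(Xi.T @ Xi)))
        out[i] = (b_full - b) / se
    return out
```

Row \(i\) answers: in standard-error units, how much does each coefficient move when observation \(i\) is dropped?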

We see observations 112 and 257 are affecting the Intercept and Percent White estimates, while observation 107 affects the McCain Top Tertile estimate and observation 119 is affecting the Tweet share estimate.

Cook’s distance, or Cook’s *d*, combines outliers on \(\boldsymbol{X}\) (leverage) and outliers on \(\hat{y}\) into a single value for each observation. That is, there are \(n\) Cook’s *d* values compared to the \(nk\) values produced when considering dfBetas. The following graph again shows the observation number along the x-axis but now with Cook’s *d* on the y-axis. We look for observations that stand out from the rest. The following figure labels observation numbers that are greater than 0.02.
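Cook's *d* has a convenient closed form that combines the leverage \(h_i\) with the (internally) standardized residual, so no refitting is needed. A numpy sketch:

```python
import numpy as np

def cooks_d(X, y):
    """Cook's distance for each observation from a single full-model fit.

    X should include the intercept column.
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y                # ordinary residuals
    s2 = e @ e / (n - p)         # residual mean square
    r2 = e**2 / (s2 * (1 - h))   # squared standardized residuals
    return r2 * h / (p * (1 - h))
```

The \(h_i / (1 - h_i)\) factor is what folds the leverage into the measure: a large residual at a low-leverage point produces a much smaller *d* than the same residual at a high-leverage point.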

The results show the same cases that were problematic in the prior plot.

A final measure of outliers is *dfFit* (“difference in fit”), which compares the predicted value \(\hat{y}_i\) from the model fit with all \(n\) observations to the \(\hat{y}_{-i}\) prediction from the model fit without the \(i\)th observation. The difference is then divided by the square root of the leave-one-out mean square error multiplied by the leverage. The following replicates the Cook’s *d* plot but with the absolute value of the dfFits on the y-axis. Values greater than 0.35 are labeled with the observation number.
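Like Cook's *d*, dfFits can be computed from a single fit, since it equals the studentized deleted residual times \(\sqrt{h_i / (1 - h_i)}\). A numpy sketch:

```python
import numpy as np

def dffits(X, y):
    """Scaled change in each fitted value when that observation is dropped.

    X should include the intercept column.
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2_loo = (e @ e - e**2 / (1 - h)) / (n - 1 - p)  # leave-one-out MSE
    t = e / np.sqrt(s2_loo * (1 - h))                # studentized deleted residual
    return t * np.sqrt(h / (1 - h))
```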

The cut-offs the prior graphs used for labeling cases were arbitrary. Some suggestions for cut-offs to label outliers can be found in the literature, but the problem with using such heuristics is that the nature of the outlying cases needs to be explored qualitatively before deciding how to proceed. The first step is to check to make sure that the outlier has all of its variables correctly coded. Mistakes in coding (such as forgetting that -999 represents a missing value) can be a cause that is easily remedied by fixing the data. If the value is correct, it is necessary to determine if the case may be too different from the rest of the sample to be representative of the population of interest. The observation can then be dropped if such a determination is made.

Most commonly, however, the variable is coded correctly, and the observation belongs in the sample. The best way to proceed is to leave the observation(s) in, but run the model without the outlier(s) to determine how sensitive the results are. In very large data files, an outlier probably will not matter much. In smaller samples, however, care must be taken to avoid over-interpreting a result that is driven by one or two cases.

A final note on outliers. This discussion has focused on the effect an outlier can have on regression estimates. High leverage by itself does not necessarily mean estimates will be biased, as there must also be a large discrepancy in the studentized (deleted) residuals. However, high leverage *can* have an impact on inferences by making standard errors smaller (and hence yielding smaller \(p\)-values). This is because the standard error for an estimate is calculated based in part on the variance in \(x_k\),

\[ SE_{\beta_k} = \frac{SE_r}{\sqrt{\sum_i (x_{ik} - \bar{x}_k)^2}} \] The denominator is the square root of the sum of squared deviations of the independent variable, which grows with its spread. A high leverage observation inflates this sum, making the denominator larger, so the standard error gets smaller and causes us to increase our confidence in the estimate. In this manner, high leverage can lead inferences astray even without considering outlying \(\hat{y}\) values.
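The effect is easy to see numerically in the simple-regression case, where the slope's standard error is the residual SE over the root sum of squared deviations of \(x\). A sketch with hypothetical numbers, holding the residual SE fixed for illustration:

```python
import math

def slope_se(x, residual_se):
    """SE of the slope in simple regression: residual SE divided by
    the root of the sum of squared deviations of x."""
    mean = sum(x) / len(x)
    ss_x = sum((xi - mean) ** 2 for xi in x)
    return residual_se / math.sqrt(ss_x)

# One far-out x value inflates ss_x, shrinking the reported SE
base = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = base + [25.0]
```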

### Independence

Independence means that one observation should not be related to any other observation. This is commonly violated with time series data (i.e. what I did yesterday can predict what I do today) and with clustered data (i.e. in a study combining multiple classrooms, where students in the same classroom have similar experiences). There are tests available for violations of independence, but it is typically the research design (e.g. cluster sampling or longitudinal analysis) that determines whether or not these are concerns to address. Cluster-robust standard errors exist for clustered data that improve inferences, but generally non-independence is best addressed through statistical methods that are different from multiple regression.