# Description of Common Regression Diagnostic Tests


An important part of regression modeling is performing diagnostics to verify that assumptions behind the model are met and that there are no problems with the data that are skewing the results. This tutorial builds on prior posts covering simple and multiple regression as well as regression with nominal independent variables. The same data will be used here.

The variables used in this tutorial are:

• vote_share (dependent variable): The percent of voters for a Republican candidate.
• mshare (independent variable): The percent of social media posts for a Republican candidate.
• pct_white (independent variable): The percent of white voters in a given Congressional district.
• mccain_tert (independent variable): The vote share John McCain received in the 2008 election in the district, divided into tertiles.
• rep_inc (independent variable): Whether the Republican candidate was an incumbent or not.

The mccain_tert variable will be treated as a categorical predictor, so it enters the regression model as dummy variables, with the lowest tertile as the reference category. Likewise, rep_inc is a categorical variable with two values: 1 = incumbent and 0 = non-incumbent (the reference category).
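
To make the dummy coding concrete, here is a minimal sketch in plain Python. The function name `dummy_code` and the example values are made up for illustration; statistical software normally does this step for you when a variable is declared categorical.

```python
def dummy_code(values, levels, reference):
    """Return one 0/1 indicator column for every level except the reference."""
    if reference not in levels:
        raise ValueError("reference must be one of the levels")
    return {
        level: [int(v == level) for v in values]
        for level in levels
        if level != reference
    }

# Illustrative tertile labels for four hypothetical districts
example_tertiles = ["Bottom", "Middle", "Top", "Top"]

dummies = dummy_code(
    example_tertiles,
    levels=["Bottom", "Middle", "Top"],
    reference="Bottom",
)

for name, column in dummies.items():
    print(name, column)
```

A district in the bottom tertile gets a 0 on both indicators, which is why the coefficients on the other two tertiles are read as differences from the bottom tertile.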

Take a look at the first six observations in the data:

| Vote Share | Tweet Share | Republican Incumbent | McCain Tertile | Percent White |
|---|---|---|---|---|
| 51.09 | 26.26 | Republican not Incumbent | Top Tertile | 64.2 |
| 59.48 | 0.00 | Republican Incumbent | Top Tertile | 64.3 |
| 57.94 | 65.08 | Republican not Incumbent | Top Tertile | 75.7 |
| 27.52 | 33.33 | Republican not Incumbent | Bottom Tertile | 34.6 |
| 69.32 | 79.41 | Republican Incumbent | Top Tertile | 66.8 |
| 53.20 | 40.32 | Republican not Incumbent | Top Tertile | 70.8 |

If we run the linear regression, we get the following model:

| Term | Estimate | Std. Error | t-statistic | p-value |
|---|---|---|---|---|
| (Intercept) | 13.229 | 1.487 | 8.893 | < 0.001 |
| Percent Republican Tweets | 0.041 | 0.012 | 3.504 | < 0.001 |
| Republican Incumbent | 12.322 | 0.867 | 14.205 | < 0.001 |
| % White Voters | 0.292 | 0.023 | 12.768 | < 0.001 |
| Middle McCain Support Tertile | 11.882 | 0.964 | 12.332 | < 0.001 |
| Top McCain Support Tertile | 18.721 | 1.103 | 16.975 | < 0.001 |
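
Written out as an equation, the fitted model is approximately the following. This assumes the row labels map onto the variables listed earlier (Percent Republican Tweets is mshare, % White Voters is pct_white), and the symbols $D_{\text{middle}}$ and $D_{\text{top}}$ are notation introduced here for the two tertile dummy variables.

$$
\widehat{\text{vote share}} = 13.229 + 0.041\,\text{mshare} + 12.322\,\text{rep inc} + 0.292\,\text{pct white} + 11.882\,D_{\text{middle}} + 18.721\,D_{\text{top}}
$$

For a district in the bottom tertile both dummies are zero, so the tertile terms drop out of the prediction.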

All of our estimates are statistically significant (p < 0.001). Now it’s time to assess the assumptions in our model. The core assumptions underlying multiple regression are:

• Normality of Residuals
• Linearity
• Homoskedasticity
• No Perfect Multicollinearity
• No Outliers
• Independence of Observations

The remainder of this post explains each of these, the consequences of their violation, how to assess if the assumption is met, and what to do if any is violated.

### Normality of Residuals

Recall that a residual is the difference between the actual and predicted values of the dependent variable. The statistical theory underlying hypothesis tests assumes that the distribution of the residuals is normal, though in relatively large samples the Central Limit Theorem kicks in and departures from normality are less consequential.
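
As a rough illustration, the sketch below computes residuals (observed minus predicted) for the six observations shown earlier, using the coefficients from the regression table. Because those coefficients are rounded to three decimals, the results are approximate. The function name `predict` and the mapping of table rows to variable names are assumptions made for this example.

```python
# Coefficients copied from the regression table (rounded, so results are approximate)
COEF = {
    "intercept": 13.229,
    "mshare": 0.041,
    "rep_inc": 12.322,
    "pct_white": 0.292,
    "tert_middle": 11.882,
    "tert_top": 18.721,
}

def predict(mshare, rep_inc, pct_white, tertile):
    """Predicted vote share; tertile is 'Bottom' (reference), 'Middle', or 'Top'."""
    y_hat = (
        COEF["intercept"]
        + COEF["mshare"] * mshare
        + COEF["rep_inc"] * rep_inc
        + COEF["pct_white"] * pct_white
    )
    if tertile == "Middle":
        y_hat += COEF["tert_middle"]
    elif tertile == "Top":
        y_hat += COEF["tert_top"]
    return y_hat

# (vote_share, mshare, rep_inc, tertile, pct_white) for the six rows shown earlier
rows = [
    (51.09, 26.26, 0, "Top", 64.2),
    (59.48, 0.00, 1, "Top", 64.3),
    (57.94, 65.08, 0, "Top", 75.7),
    (27.52, 33.33, 0, "Bottom", 34.6),
    (69.32, 79.41, 1, "Top", 66.8),
    (53.20, 40.32, 0, "Top", 70.8),
]

# Residual = observed value minus model-predicted value
residuals = [y - predict(m, inc, w, t) for (y, m, inc, t, w) in rows]

for (y, *_), r in zip(rows, residuals):
    print(f"observed={y:6.2f}  predicted={y - r:6.2f}  residual={r:6.2f}")
```

A negative residual means the model over-predicted the Republican vote share for that district; a positive one means it under-predicted.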

On the other hand, non-normality can signify other problems with our model specification. For example, it may be evidence that the assumption of linearity is violated for one or more independent variables (discussed below). Alternatively, highly skewed residuals may indicate that one of the variables in the model is itself highly skewed and would benefit from a transformation.

Another reason to explore the distribution of residuals is that it can sometimes reveal outlying observations. A residual more than two or three standard deviations from the mean (zero) may represent an observation that is having disproportionate influence on the model estimates. The subsection below on outliers explains more comprehensive ways to assess potential outlier problems, but the problem may first appear in your assessment of the distribution of the residuals.
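
The two-or-three standard deviation rule of thumb can be sketched as a simple screen. The residual values below are invented for illustration, and `flag_large_residuals` is a hypothetical helper, not part of any library.

```python
from statistics import mean, stdev

def flag_large_residuals(residuals, threshold=2.0):
    """Return the positions of residuals more than `threshold` SDs from the mean."""
    center = mean(residuals)
    spread = stdev(residuals)
    return [
        i for i, r in enumerate(residuals)
        if abs(r - center) / spread > threshold
    ]

# Made-up residuals: eleven ordinary values and one that stands out
example_residuals = [-1.2, 0.4, 0.9, -0.3, 1.1, -0.8, 0.2, 9.5, -0.6, 0.1, -0.9, 0.7]

flagged = flag_large_residuals(example_residuals, threshold=2.0)
print("Observations worth a closer look:", flagged)
```

A flag here is only a prompt to investigate. A large outlier also inflates the standard deviation it is measured against, which is one reason more careful approaches exist.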

The easiest way to assess normality is through the use of graphs. First, calculate the residual for each observation as the difference between the observed value and the model-predicted value. Next, standardize the residuals to have a mean of zero and a standard deviation of one (any statistical software will do these two steps for you). Finally, examine a histogram and/or Q-Q plot to assess how close to normal the distribution is.
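
The steps above can be sketched with the Python standard library alone. The residuals here are invented, the function names are hypothetical, and the plotting positions (i + 0.5) / n are just one common convention for Q-Q plots; statistical packages may use a slightly different one.

```python
from statistics import NormalDist, mean, stdev

def standardize(residuals):
    """Rescale residuals to have mean zero and standard deviation one."""
    center, spread = mean(residuals), stdev(residuals)
    return [(r - center) / spread for r in residuals]

def qq_points(standardized):
    """Pair each sorted standardized residual with a theoretical normal quantile."""
    n = len(standardized)
    normal = NormalDist()  # standard normal: mean 0, SD 1
    theoretical = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(standardized)))

# Made-up residuals for illustration only
example_residuals = [-2.1, -0.7, 0.3, 1.4, -1.0, 0.8, 2.2, -0.4, 0.1, -0.6]

z = standardize(example_residuals)
points = qq_points(z)

for theoretical, observed in points:
    print(f"theoretical={theoretical:6.2f}  observed={observed:6.2f}")
```

If the residuals are roughly normal, the observed values track the theoretical quantiles closely, so the points fall near a 45-degree line when plotted.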

The following is a histogram of standardized residuals for our model of candidate performance. The standardization is based on the standard deviation of the residuals using all of the values (later this post will discuss leave-one-out residuals). A true normal curve is also added to the figure as a reference.