Estimating Regression Models with Binary Independent Variables

In our previous tutorials, we discussed simple regression and multiple regression with continuous variables, but what happens when our independent variable is nominal rather than interval?

The data used in this tutorial are again from the More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior study by DiGrazia, McKelvey, Bollen, and Rojas (2013), which investigated the relationship between social media mentions of candidates in the 2010 and 2012 US House elections and actual vote results. The authors have helpfully provided replication materials. The results presented here are for pedagogical purposes only.

Binary Independent Variables

First we will take a look at regression with a binary independent variable. The variables used are:

  • vote_share (dependent variable): The percentage of the vote received by the Republican candidate
  • rep_inc (independent variable): Whether the Republican candidate was an incumbent or not

We will code an incumbent, a candidate who is currently in office, as one, and a non-incumbent as zero. Take a look at the first six observations in the data:

| Vote Share | Rep Incumbent |
|-----------:|--------------:|
| 51.09 | 0 |
| 59.48 | 1 |
| 57.94 | 0 |
| 27.52 | 0 |
| 69.32 | 1 |
| 53.20 | 0 |
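
If you want to follow along in R, a minimal sketch of loading and inspecting the data might look like the following. The file name `mtmv_data.csv` and the data frame name `tweets` are assumptions; substitute the file from the authors' replication materials.

```r
# Read the replication data (file name is hypothetical)
tweets <- read.csv("mtmv_data.csv")

# Inspect the first six observations of the two variables used here
head(tweets[, c("vote_share", "rep_inc")])
```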

Plotting our observations, we see the points cluster together at the two possible values of the nominal variable.

[Figure: scatterplot of vote share against the binary incumbency variable]

The model will be:

\[ \text{Vote Share} = a + b(\text{Republican Incumbent}) \]

Using our least squares criterion, we can fit a line that minimizes the sum of the squared residuals. We find that the line of best fit is:

\[ y = 42.12 + 25.40x \]

Adding the regression line to our plot we get:

[Figure: scatterplot of vote share against incumbency with the fitted regression line]

What does this mean? Recall that we coded rep_inc such that a non-incumbent is zero and an incumbent candidate is one. Thus, when we are talking about a non-incumbent:

\[ \begin{eqnarray} y &=& 42.12 + 25.40(0) \\ &=& 42.12 \end{eqnarray} \]

In other words, the predicted vote share for a non-incumbent is equal to the \(y\)-intercept, \(a\). When we are considering an incumbent candidate:

\[ \begin{eqnarray} y &=& 42.12 + 25.40(1) \\ &=& 67.52 \end{eqnarray} \]
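
This fit is easy to reproduce with R's built-in `lm()` function, using the hypothetical `tweets` data frame from above:

```r
# Regress vote share on the binary incumbency indicator
fit <- lm(vote_share ~ rep_inc, data = tweets)
coef(fit)  # intercept ~42.12, slope ~25.40, as reported above

# Predicted vote share for a non-incumbent (0) and an incumbent (1)
predict(fit, newdata = data.frame(rep_inc = c(0, 1)))
```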

Let’s take a look at the descriptive statistics for vote share by the Republican incumbent variable:

| rep_inc | N | Min | Max | Mean | Var | SD |
|--------:|----:|------:|------:|------:|-------:|------:|
| 0 | 272 | 4.50 | 76.35 | 42.12 | 191.84 | 13.85 |
| 1 | 134 | 34.13 | 84.81 | 67.52 | 49.08 | 7.01 |
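
A table like this can be produced with dplyr; a sketch, again assuming the hypothetical `tweets` data frame:

```r
library(dplyr)

# Descriptive statistics for vote share within each incumbency group
tweets %>%
  group_by(rep_inc) %>%
  summarise(N    = n(),
            Min  = min(vote_share),
            Max  = max(vote_share),
            Mean = mean(vote_share),
            Var  = var(vote_share),
            SD   = sd(vote_share))
```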

From our regression, we were able to recover the group means. This tells us that the value of \(b\) is the difference in means between an incumbent and a non-incumbent candidate. To determine the significance of this variable we use a test of the difference in means, which works out to be a \(t\)-test.

Significance Test for Binary Predictor

To get a test statistic for \(b\), we have to divide it by its standard error.

\[ SE_b = \sqrt{\frac{\sum(y-\hat{y})^2/(n-2)}{\sum(x-\bar{x})^2}} \]

To illustrate these calculations, consider the first six observations.

| \(y\) | \(x\) | \(\hat{y}\) | \(y-\hat{y}\) | \((y-\hat{y})^2\) | \(\bar{x}\) | \(x-\bar{x}\) | \((x-\bar{x})^2\) |
|------:|------:|------------:|--------------:|------------------:|------------:|--------------:|------------------:|
| 51.09 | 0 | 42.12 | 8.97 | 80.55 | 0.33 | -0.33 | 0.11 |
| 59.48 | 1 | 67.52 | -8.04 | 64.57 | 0.33 | 0.67 | 0.45 |
| 57.94 | 0 | 42.12 | 15.82 | 250.17 | 0.33 | -0.33 | 0.11 |
| 27.52 | 0 | 42.12 | -14.60 | 213.27 | 0.33 | -0.33 | 0.11 |
| 69.32 | 1 | 67.52 | 1.81 | 3.27 | 0.33 | 0.67 | 0.45 |
| 53.20 | 0 | 42.12 | 11.08 | 122.85 | 0.33 | -0.33 | 0.11 |

The \(\bar{x}\) value is the mean of the zero-one coded incumbent variable, calculated over all 406 observations. Summing the \((y-\hat{y})^2\) and \((x-\bar{x})^2\) columns over all observations, we find:

| \(\sum(y-\hat{y})^2\) | \(\sum(x-\bar{x})^2\) |
|----------------------:|----------------------:|
| 58,515.34 | 89.77 |

Then,

\[ SE_b = \sqrt{\frac{58,515.34/404}{89.77}}= 1.27 \]

We will divide the estimate by the standard error to get a \(t\)-statistic.

\[ t= \frac{25.40}{1.27}=20.00 \]

Evaluating this test statistic against a \(t\)-distribution with \(N - 2 = 404\) degrees of freedom, we find a \(p\)-value of \(\lt .001\). The mean vote share is significantly higher for incumbents than for non-incumbents.
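
The hand calculation can be checked in R against both the regression output and an explicit pooled-variance \(t\)-test (the sketch again uses the hypothetical `fit` and `tweets` objects):

```r
# Estimate, standard error, t statistic, and p-value for the slope
summary(fit)$coefficients["rep_inc", ]

# Equivalent test of the difference in means; var.equal = TRUE requests
# the pooled-variance t-test that matches the regression. The sign flips
# because t.test() subtracts the incumbent mean from the non-incumbent mean.
t.test(vote_share ~ rep_inc, data = tweets, var.equal = TRUE)
```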

Polytomous Independent Variables

We often have variables like race, which have more than two categories. This scenario can also be handled with regression. To illustrate, the variables used here are:

  • vote_share (dependent variable): The percentage of the vote received by the Republican candidate
  • mccain_tert (independent variable): The vote share John McCain received in the district in the 2008 presidential election, divided into tertiles.

Although we think of the tertile as a single variable with three levels, we will actually have to recode it into what are called dummy variables. A dummy variable takes on only the values zero and one: it is coded one if an observation belongs to a certain category and zero if it does not.

We can code a dummy variable for the bottom tertile that equals one if a district is in the bottom tertile and zero if it is not.

We code a second dummy variable for the middle tertile that equals one if a district is in the middle tertile and zero if it is not.

Note that we do not need to code a third dummy variable for the top tertile: if a district is coded zero on both the bottom and middle dummies, it must be in the top tertile. Whenever there are \(k\) categories, it is necessary to create only \(k - 1\) dummy variables. To illustrate,

| Vote Share | Bottom | Middle |
|-----------:|-------:|-------:|
| 51.09 | 0 | 0 |
| 59.48 | 0 | 0 |
| 57.94 | 0 | 0 |
| 27.52 | 1 | 0 |
| 69.32 | 0 | 0 |
| 53.20 | 0 | 0 |

The fourth observation in the data is a district whose McCain vote was in the bottom tertile. None of these observations were in the middle tertile, which means all but the fourth were in the top tertile.
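
A sketch of the recoding in R, assuming `mccain_tert` is stored with the level labels "Bottom Tertile", "Middle Tertile", and "Top Tertile" (the labels are assumptions; adjust them to your data):

```r
# Create k - 1 = 2 dummy variables from the three-category tertile variable
tweets$bottom <- ifelse(tweets$mccain_tert == "Bottom Tertile", 1, 0)
tweets$middle <- ifelse(tweets$mccain_tert == "Middle Tertile", 1, 0)

# A district with bottom = 0 and middle = 0 is in the top tertile,
# so no third dummy is needed
head(tweets[, c("vote_share", "bottom", "middle")])
```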

To estimate a regression model with dummy variables, we add each of the \(k − 1\) dummy variables as separate predictors.

\[ \text{Vote Share}=a +b_1(\text{Bottom}) + b_2(\text{Middle}) \]

If we estimate the model, we get a separate \(b\) coefficient for each of the tertiles. Our estimates turn out to be:

\[ \hat{y} = 66.49 - 34.23(\text{Bottom}) - 10.35(\text{Middle}) \]
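
In R this model can be fit with the hand-coded dummies, or by handing `lm()` the factor directly and letting it build the dummies:

```r
# Using the hand-coded dummy variables (top tertile is the reference)
fit_tert <- lm(vote_share ~ bottom + middle, data = tweets)
coef(fit_tert)

# Equivalent: pass the factor and let lm() create the dummies. Note that
# R uses the first factor level as the reference, which may not be the
# top tertile; see relevel() below for changing the reference.
fit_factor <- lm(vote_share ~ mccain_tert, data = tweets)
```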

How do we interpret these numbers? First, if we want to make a prediction about a district in the top tertile, the equation becomes:

\[ \begin{eqnarray} \hat{y} &=& 66.49 - 34.23(0) - 10.35(0) \\ &=& 66.49 \end{eqnarray} \]

A district in the middle tertile:

\[ \begin{eqnarray} \hat{y} &=& 66.49 - 34.23(0) - 10.35(1) \\ &=& 56.14 \end{eqnarray} \]

A district in the bottom tertile:

\[ \begin{eqnarray} \hat{y} &=& 66.49 - 34.23(1) - 10.35(0) \\ &=& 32.26 \end{eqnarray} \]

Let’s compare this to the summary statistics for vote_share by tertile:

| mccain_tert | N | Min | Max | Mean | Var | SD |
|-------------|----:|------:|------:|------:|-------:|------:|
| Bottom Tertile | 144 | 4.50 | 51.08 | 32.26 | 105.20 | 10.26 |
| Middle Tertile | 151 | 37.14 | 74.20 | 56.14 | 75.84 | 8.71 |
| Top Tertile | 111 | 41.11 | 84.81 | 66.49 | 80.76 | 8.99 |

The least squares estimates can be used to recover the group means: the intercept equals the mean of the excluded group (here, the top tertile), and each dummy coefficient is the difference between the mean of its group (bottom or middle) and the mean of the excluded group.

In other words, the coefficients are testing:

  • \(b_1\): bottom-tertile districts are, on average, different from top-tertile districts in their voting habits
  • \(b_2\): middle-tertile districts are, on average, different from top-tertile districts in their voting habits

In the case of each dummy variable, a comparison is being made against the category that was not turned into a dummy. The excluded category is therefore the reference category.

The null hypothesis for \(b_1\) and \(b_2\) is not that tertile has no effect. It is that the Bottom (\(b_1\)) or Middle (\(b_2\)) are not significantly different from the Top. If we wished to compare the Bottom to the Middle tertiles, we would have to recode the dummy variables such that one of those two is the reference.
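
In R, recoding the reference category is a one-line `relevel()` call on the factor (again assuming the level labels from above):

```r
# Make the middle tertile the reference, so the coefficients compare
# bottom vs. middle and top vs. middle
tweets$mccain_tert <- relevel(factor(tweets$mccain_tert),
                              ref = "Middle Tertile")
fit_mid <- lm(vote_share ~ mccain_tert, data = tweets)
summary(fit_mid)
```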

If we wished to test the hypothesis that tertile in general matters, we would look to the \(F\) test. The regression \(F\) statistic is found similarly to the simple and multiple regression models from the prior tutorials:

\[ F = \frac{MSReg}{MSR} \]

Where

  • Regression Mean Square, \(MSReg\) = \(\frac{RegSS}{k-1}\)
  • Residual Mean Square, \(MSR\) = \(\frac{RSS}{n-k}\)

Note that \(k\) is the number of parameters estimated in the regression model. Here \(k = 3\): two dummy variables plus an intercept.

\(RegSS\) is the regression sum of squares, given by:

\[ \sum(\hat{y}_i - \bar{y})^2 \]

The regression sum of squares sums over the squared difference between an observation’s predicted value (\(\hat{y}\)) and the overall average (\(\bar{y}\)).

| \(y\) | \(\hat{y}\) | \(\bar{y}\) | \((\hat{y}-\bar{y})\) | \((\hat{y}-\bar{y})^2\) |
|------:|------------:|------------:|----------------------:|------------------------:|
| 51.09 | 66.49 | 50.5 | 15.99 | 255.74 |
| 59.48 | 66.49 | 50.5 | 15.99 | 255.74 |
| 57.94 | 66.49 | 50.5 | 15.99 | 255.74 |
| 27.52 | 32.26 | 50.5 | -18.24 | 332.81 |
| 69.32 | 66.49 | 50.5 | 15.99 | 255.74 |
| 53.20 | 66.49 | 50.5 | 15.99 | 255.74 |

  • \(y\) is vote_share
  • \(\hat{y}\) is the predicted value
  • \(\bar{y}\) is the mean of vote_share
  • \((\hat{y} - \bar{y})\) is the predicted minus the mean value
  • \((\hat{y} - \bar{y})^2\) is the predicted minus the mean value squared

Summing the \((\hat{y}-\bar{y})^2\) column over all 406 observations, we find \(RegSS = 81{,}117.6\).

\(RSS\) is the residual sum of squares, given by:

\[ \sum(y_i - \hat{y})^2 \]

To illustrate, consider again the first six observations:

| \(y\) | \(\hat{y}\) | \((y-\hat{y})\) | \((y-\hat{y})^2\) |
|------:|------------:|----------------:|------------------:|
| 51.09 | 66.49 | -15.40 | 237.15 |
| 59.48 | 66.49 | -7.01 | 49.18 |
| 57.94 | 66.49 | -8.56 | 73.24 |
| 27.52 | 32.26 | -4.74 | 22.50 |
| 69.32 | 66.49 | 2.83 | 8.01 |
| 53.20 | 66.49 | -13.29 | 176.64 |

  • \(y\) is vote_share
  • \(\hat{y}\) is the predicted value
  • \((y - \hat{y})\) is the actual minus predicted value
  • \((y - \hat{y})^2\) is the actual minus predicted value squared

We then sum the \((y-\hat{y})^2\) column over all 406 observations to find \(RSS = 35{,}303.06\).

The null hypothesis is that the independent variables together do not explain any variability in the dependent variable.

\[ F = \left. \frac{81,117.6}{2} \middle/ \frac{35,303.06}{403} \right. = 463.00 \]

We can compare the calculated \(F\) statistic against an \(F\) distribution with degrees of freedom equal to \(k - 1 = 2\) and \(N - k = 403\). We find that the model is significant with a \(p\)-value of \(\lt .001\). In other words, we can reject the null hypothesis that there are no differences between McCain support tertiles in Republican vote share.
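
Both sums of squares and the overall \(F\) statistic can be recovered from the fitted model in R; a sketch using the hypothetical `fit_tert` object from above:

```r
# Overall F statistic, with numerator and denominator degrees of freedom
summary(fit_tert)$fstatistic

# Or build the pieces by hand
reg_ss <- sum((fitted(fit_tert) - mean(tweets$vote_share))^2)  # RegSS
rss    <- sum(residuals(fit_tert)^2)                           # RSS
f_stat <- (reg_ss / 2) / (rss / (nrow(tweets) - 3))
pf(f_stat, df1 = 2, df2 = nrow(tweets) - 3, lower.tail = FALSE)  # p-value
```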
