One-Way ANOVA (Analysis of Variance)

Nikki Kamouneh


You should recall from our previous tutorials how to conduct a two-sample t-test to compare two means. However, what do we do when we have more than two groups? ANOVA (Analysis of Variance) lets us make these comparisons.

The Logic of ANOVA

The total variability we see in our dependent variable can have two sources.

• Within-groups variance: what is the variability of the dependent variable inside a particular group?
• Between-groups variance: how much variability is caused because a subject is in group A and not group B or C?

Using ANOVA, we compare the ratio of between-groups variability to within-groups variability. If this ratio is large, we conclude that the groups are significantly different. If the groups were not different, the only variability we would see is what arises from differences at the individual level. We can visualize this below with the plot changing from no between-groups variability to having between-groups variability.

Calculating F

To calculate our F statistic, we need to first find the two types of variance.

Recall the formula for finding the variance of a sample.

$s^2 = \frac{\sum_{i = 1}^N (x_i-\bar{x})^2}{N-1}$

The numerator is referred to as the Sum of Squares. The term in the parentheses is “error”. This comes from treating the mean $$\bar{x}$$ as a “best guess”, which means that the difference between the observed value and the mean is how large our “error” in guessing is.

The denominator we will call the degrees of freedom. To determine if group membership matters, we have to calculate the amount of variability between groups and compare it to the variance within groups. The sum of squares will get us there. We can write:

$s^2 = \frac{SS}{df}$

So, of course, the sample standard deviation is just:

$s = \sqrt{\frac{SS}{df}}$
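As a quick check of the $$SS/df$$ decomposition, here is a minimal Python sketch. The sample values and the helper name `sum_of_squares` are made up for illustration; the result is compared against the standard library's `statistics.variance`.

```python
import math
import statistics

def sum_of_squares(values):
    """Sum of squared deviations from the sample mean (the numerator of s^2)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

# Hypothetical sample, chosen only for illustration.
sample = [4.0, 7.0, 6.0, 3.0, 5.0]

ss = sum_of_squares(sample)    # SS
df = len(sample) - 1           # N - 1
variance = ss / df             # s^2 = SS / df
std_dev = math.sqrt(variance)  # s = sqrt(SS / df)

print(f"SS = {ss}, df = {df}, s^2 = {variance}, s = {std_dev:.3f}")
print("matches statistics.variance:", math.isclose(variance, statistics.variance(sample)))
```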

The between-groups sum of squares refers to the (weighted) difference between each group mean and the overall mean.

$SS_B = \sum_{j = 1}^k n_j(\bar{x_j}-\bar{x})^2$

Note that we have $$k$$ groups. We use the subscript $$j$$ to refer to a specific group. So if $$j = 1$$, it is the first of all $$k$$ groups.

This should look a little like the sum of squares we used for the variance, but there are two changes.

1. We subtract the grand mean $$\bar{x}$$ from each group mean $$\bar{x_j}$$. The grand mean is the mean of all subjects in all groups.
2. We weight each group by the sample size for the group.

Groups with larger sample sizes give us more information, and hence each group is weighted by $$n_j$$. The between-groups variance is therefore telling us how much the means vary.

• When the means vary a lot, then $$SS_B$$ is large.
• When the means do not vary, then $$SS_B$$ is small.
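To make the two bullet points concrete, the sketch below computes $$SS_B$$ for two made-up data sets that have the same spread inside each group but very different group means. The function name `between_groups_ss` and the data are illustrative assumptions.

```python
def between_groups_ss(groups):
    """SS_B: squared distance of each group mean from the grand mean, weighted by n_j."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    return sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Hypothetical data: identical spread within groups, different spacing of the means.
similar_means = [[4, 5, 6], [5, 6, 7], [4, 6, 5]]
distinct_means = [[4, 5, 6], [9, 10, 11], [14, 15, 16]]

ss_b_similar = between_groups_ss(similar_means)
ss_b_distinct = between_groups_ss(distinct_means)

print(f"SS_B with similar means:  {ss_b_similar:.2f}")
print(f"SS_B with distinct means: {ss_b_distinct:.2f}")
```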

The within-groups sum of squares is:

$SS_W = \sum_{j = 1}^k \sum_{i=1}^{n_j}(x_{ij}-\bar{x_j})^2$

The double summation simply means that we first get the sum of squares from each of the $$k$$ groups separately, then we add them together. Think of this as calculating the numerator of the variance for each group separately, then adding them together. We are talking about overall variability in our observations.

When we calculated the variance previously, we divided the sum of squares by the degrees of freedom. We can do that here as well. When we divide the between and within sums of squares by the appropriate degrees of freedom, the result is called Mean Squares. First, the mean square between estimate divides by the number of groups minus one.

$MS_B = \frac{SS_B}{df_B}$

where $$df_B = k - 1$$.

The mean square within estimate divides by the total sample size N minus the number of groups.

$MS_W = \frac{SS_W}{df_W}$

where $$df_W = N - k$$.

Our $$F$$ statistic, which we use to determine if the between-groups variability is sufficiently large to assert that treatment affects the dependent variable, is calculated as the ratio of $$MS_B$$ to $$MS_W$$.

$F = \frac{MS_B}{MS_W}$
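Putting the pieces together, a minimal Python sketch of the whole calculation might look like this. The function name `one_way_f` and the three small groups are assumptions made for illustration only.

```python
def one_way_f(groups):
    """Return the sums of squares, degrees of freedom, mean squares and F."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between groups: weighted squared distance of group means from the grand mean.
    ss_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: squared distance of each observation from its own group mean.
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_b, df_w = k - 1, n_total - k
    ms_b, ms_w = ss_b / df_b, ss_w / df_w
    return {"SS_B": ss_b, "SS_W": ss_w, "df_B": df_b, "df_W": df_w,
            "MS_B": ms_b, "MS_W": ms_w, "F": ms_b / ms_w}

# Hypothetical data: three groups of three observations.
result = one_way_f([[2, 4, 6], [5, 7, 9], [8, 10, 12]])
for name, value in result.items():
    print(name, value)
```

For these made-up groups the means are 4, 7 and 10, so $$SS_B = 54$$, $$SS_W = 24$$, and $$F = 27/4 = 6.75$$ on 2 and 6 degrees of freedom.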

The resulting $$F$$ statistic is then compared to the $$F$$ distribution. In repeated samples, if the null hypothesis is true and we calculate the between/within ratio for each sample, the results would follow an $$F$$ distribution.

Like $$t$$, the shape of $$F$$ is determined by the degrees of freedom. Unlike $$t$$, $$F$$ is determined by two different degrees of freedom.
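The repeated-sampling idea can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: every group is drawn from the same standard normal population (so the null hypothesis is true), the design is a hypothetical 4 groups of 5, and the cutoff 3.24 is the tabled 5% critical value for $$F$$ with 3 and 16 degrees of freedom rather than something computed here.

```python
import random

def f_statistic(groups):
    """Between/within mean-square ratio for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_b / (k - 1)) / (ss_w / (n_total - k))

random.seed(1)                 # fixed seed so the run is repeatable
k, n_per_group, n_sims = 4, 5, 5000
critical_f = 3.24              # tabled 5% cutoff for F(3, 16); assumption, not computed

f_values = []
for _ in range(n_sims):
    # All groups come from the same population, so any spread in means is chance.
    groups = [[random.gauss(0, 1) for _ in range(n_per_group)] for _ in range(k)]
    f_values.append(f_statistic(groups))

share_above = sum(f > critical_f for f in f_values) / n_sims
print(f"smallest simulated F: {min(f_values):.4f}")
print(f"share of simulated F above {critical_f}: {share_above:.3f}")
```

If the simulated ratios behave like an $$F$$ distribution, the share above the cutoff should land near 0.05, and no simulated value should be negative.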

Notes about the $$F$$ Distribution

• It can only take on non-negative values; variances are never negative (recall that we square the deviations in the numerator).
• Since the lower bound is set at zero, we only do a one-sided test. We set $$\alpha$$ as the area under the curve in the right side of the distribution.
• When we use a table to look up critical values, we need to find the critical value given both degrees of freedom.
• In other words, significance is determined both by how many subjects we have and by how many different groups we divide them into.

Steps to Doing an ANOVA

1. Assert the null hypothesis: all means are equal.
2. Calculate the mean squares between.
3. Calculate the mean squares within.
4. Calculate $$F$$.
5. Compare to the $$F$$ distribution with $$df_B$$ and $$df_W$$.
6. An $$F$$ in the right tail of the distribution (beyond the critical value) means we reject the null hypothesis.

Example

In an experiment to compare the cooking times of four different brands of pasta, five boxes of each brand (A-D) were selected and the cook time (in minutes) of each was recorded.

A B C D
9 15 14 17
6 12 17 15
8 16 11 12
11 8 15 22
9 9 19 15

We will set our hypotheses as:

• $$H_0: \mu_A = \mu_B = \mu_C = \mu_D$$
• $$H_A:$$ At least two $$\mu_i$$’s are unequal

First, we will visualize the data:
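No plotting library is assumed here. As a stand-in, the sketch below prints a plain-text strip of each brand's cook times and then works through the steps listed earlier for these data. It is an illustrative reconstruction, not the original analysis code, and the critical value 3.24 is the tabled 5% cutoff for $$F$$ with 3 and 16 degrees of freedom.

```python
pasta = {
    "A": [9, 6, 8, 11, 9],
    "B": [15, 12, 16, 8, 9],
    "C": [14, 17, 11, 15, 19],
    "D": [17, 15, 12, 22, 15],
}

# Plain-text strip: one column per minute from the shortest to the longest time.
# Repeated values within a brand show up as a single mark.
all_times = [x for times in pasta.values() for x in times]
lo, hi = min(all_times), max(all_times)
for brand, times in pasta.items():
    strip = "".join("*" if minute in times else "." for minute in range(lo, hi + 1))
    print(f"{brand} {strip}  mean = {sum(times) / len(times):.1f}")

# Steps 2-4: mean squares and F.
k = len(pasta)
n_total = len(all_times)
grand_mean = sum(all_times) / n_total
means = {b: sum(times) / len(times) for b, times in pasta.items()}
ss_b = sum(len(times) * (means[b] - grand_mean) ** 2 for b, times in pasta.items())
ss_w = sum((x - means[b]) ** 2 for b, times in pasta.items() for x in times)
df_b, df_w = k - 1, n_total - k
ms_b, ms_w = ss_b / df_b, ss_w / df_w
f_stat = ms_b / ms_w

# Steps 5-6: compare with the tabled critical value (a lookup, not computed here).
critical_f = 3.24
print(f"SS_B = {ss_b:.1f}, SS_W = {ss_w:.1f}, MS_B = {ms_b:.3f}, MS_W = {ms_w:.3f}")
print(f"F({df_b}, {df_w}) = {f_stat:.3f}, reject H0: {f_stat > critical_f}")
```

The group means come out to 8.6, 12.0, 15.2 and 16.2, with $$MS_W = 9.675$$ and $$F \approx 6.11$$, which exceeds the critical value. If SciPy is available, `scipy.stats.f_oneway` applied to the four lists should report the same $$F$$.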

Contrasts

So we’ve found F to be significant. Great! One problem: We do not know which of the differences are significant. We need to dig a little deeper and make possibly multiple pairwise comparisons. The problem is that conducting more tests increases the likelihood of a Type I error. If you do enough tests, you will eventually get a significant result due simply to random variability. It gets even trickier. The manner in which we control for our error rates will depend on:

• Whether we know in advance which comparisons we want to make.
• Whether we decide to make comparisons after we see our $$F$$ statistic.

Planned Comparisons

Let’s start out with a planned contrast. We use the term contrast to describe how we compare different groups. Contrasts can be simple, meaning we compare the mean of one group to the mean of another ($$H_0: \mu_1 = \mu_2$$), or they can be complex, meaning that we compare one mean (or a combination) to any combination of other means ($$H_0: \frac{1}{2}(\mu_1 + \mu_2) = \mu_3$$).

Testing contrasts requires creating contrast weights, $$c_j$$. To determine weights:

• List each treatment level.
• For each treatment not considered, assign a zero weight.
• For each treatment on one side of the comparison, assign numbers that sum to one.
• For each treatment on the other side of the comparison, assign numbers that sum to negative one.

Whichever contrast we decide on, we use the weights to derive our test statistic.

Let $\psi = \sum_{j=1}^k c_j\mu_j = c_1\mu_1 + c_2\mu_2 + \dots + c_k\mu_k$

for $$k$$ treatment levels, with $$c_j$$ being the weight assigned to each group. We do not know the population values, so we use our estimates of $$\mu$$.

$\hat\psi = \sum_{j=1}^k c_j\bar{x}_j = c_1\bar{x}_1 + c_2\bar{x}_2 + \dots + c_k\bar{x}_k$

The $$\hat\psi$$ (read “psi-hat”) means that the value of the summation is based on estimated parameters (means). It is the estimated value of the contrast. Dividing by its standard error yields a $$t$$-statistic with degrees of freedom equal to $$df_W$$.

The formula for the standard error is: $SE = \sqrt{MS_W \sum_j \frac{c_j^2}{n_j}}$

where $$MS_W$$ is the denominator from the ANOVA $$F$$ ratio. To determine if the comparison is significant, calculate

$t = \frac{\hat\psi}{\sqrt{MS_W \sum_j \frac{c_j^2}{n_j}}}$

If we’re doing multiple contrasts, we need to adjust for the increased probability of a Type I error. Specifically, we’ll carry out a Bonferroni adjustment. With a Bonferroni adjustment, we take our nominal alpha value (that is, our acceptable level of risk of committing a Type I error) and divide it by the number of contrasts that we are carrying out.

For example, if we set 0.05 as our $$\alpha$$ level, and we carry out 5 contrasts, we would require our $$F$$ or $$t$$ statistic to have a p-value less than $$\frac{0.05}{5}= 0.01$$.
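The adjustment is simple enough to sketch in a few lines of Python. The function names and the five p-values below are hypothetical and serve only to show the per-contrast threshold in use.

```python
import math

def bonferroni_threshold(alpha, n_contrasts):
    """Per-contrast significance threshold: nominal alpha divided by the number of contrasts."""
    return alpha / n_contrasts

def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p-value that falls below the Bonferroni-adjusted threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]

# Hypothetical p-values for five contrasts.
p_values = [0.003, 0.020, 0.008, 0.049, 0.300]

threshold = bonferroni_threshold(0.05, len(p_values))
flags = significant_after_bonferroni(p_values)

print(f"adjusted threshold: {threshold:.3f}")
for p, flag in zip(p_values, flags):
    print(f"p = {p:.3f} -> significant: {flag}")
```

Note that 0.020 and 0.049 would pass an unadjusted 0.05 cutoff but do not pass the adjusted one.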

Example

Let’s take another look at our pasta example from before. Say we want to compare groups A and B to groups C and D. We will set up a planned contrast, and then conduct a $$t$$-test to see if there is any difference between the sets of groups. Our hypotheses will be:

• $$H_0 : \frac{1}{2}(\mu_A + \mu_B) = \frac{1}{2}(\mu_C + \mu_D)$$
• $$H_A : \frac{1}{2}(\mu_A + \mu_B) \ne \frac{1}{2}(\mu_C + \mu_D)$$

For each group being tested, we assign a weight. The weights on one side of the comparison must sum to 1, and the weights on the other side must sum to -1. We can use the following table of weights for this contrast:

Group Weight
A 0.5
B 0.5
C -0.5
D -0.5

Now we have to find $$\hat\psi$$. We will use the following equation:

$\hat\psi = c_A\bar{x}_A + c_B\bar{x}_B + c_C\bar{x}_C + c_D\bar{x}_D \\ = 0.5*8.6 + 0.5*12 - 0.5* 15.2 - 0.5*16.2 \\ = -5.4$

Next, we will find the standard error:

$SE = \sqrt{MS_W \sum_j^k \frac{c_j^2}{n_j}}$

where,

$\sum_j^k \frac{c_j^2}{n_j} = 4(\frac{0.25}{5}) = 0.2$ and $$MS_W = 9.675$$ from before, so,

$SE = \sqrt{9.675*0.2} = \sqrt{1.935} \approx 1.391$

To determine if the comparison is significant, we calculate

$t = \frac{\hat\psi}{SE} = \frac{-5.4}{1.391} \approx -3.88$

We will compare this to the critical $$t$$ value with $$df_W = 16$$ degrees of freedom, which is $$\pm 2.12$$. Since our result is outside of that range, we have a significant result and we reject the null hypothesis.
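The contrast can be recomputed from the group means, the weights, and $$MS_W$$ with a short Python sketch. The function name `contrast_t` is an assumption made for illustration; the inputs are the values used in this example, and the square root in the standard error is taken over the whole product.

```python
import math

def contrast_t(means, weights, ns, ms_w):
    """Return (psi_hat, SE, t) for a contrast with the given weights."""
    psi_hat = sum(c * m for c, m in zip(weights, means))
    se = math.sqrt(ms_w * sum(c ** 2 / n for c, n in zip(weights, ns)))
    return psi_hat, se, psi_hat / se

means = [8.6, 12.0, 15.2, 16.2]    # brands A, B, C, D
weights = [0.5, 0.5, -0.5, -0.5]   # (A, B) versus (C, D)
ns = [5, 5, 5, 5]                  # boxes per brand
ms_w = 9.675                       # mean square within from the ANOVA
critical_t = 2.12                  # two-sided 5% cutoff for 16 df, as quoted in the text

psi_hat, se, t = contrast_t(means, weights, ns, ms_w)
print(f"psi_hat = {psi_hat:.2f}, SE = {se:.3f}, t = {t:.2f}")
print("reject H0:", abs(t) > critical_t)
```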

Post Hoc Comparisons

Perhaps it is more likely that you do not have any a priori expectations about differences in means. In this case, you carry out post hoc comparisons. Say you wanted to check all pairwise comparisons. That is, you want to compare every mean to every other mean. Obviously, you run an elevated risk of committing a Type I error. There are many different approaches to maintaining the desired $$\alpha$$ level. Two widely taught methods are Tukey’s WSD and the Scheffe test.

Among the most common approaches to post hoc pairwise comparisons is Tukey’s WSD (Wholly Significant Difference) test, also known as Tukey’s HSD (Honestly Significant Difference) test. Tukey relies on something called the studentized range statistic, usually denoted by $$q$$. This statistic has its own sampling distribution, which we do not cover here. Just know that, by altering the sampling distribution used to declare significance, we maintain our desired $$\alpha$$ level despite the multiple contrasts.

The Scheffe test is also common, and more flexible than Tukey, since Tukey can only handle pairwise comparisons, while Scheffe can handle pairwise comparisons and complex post hoc contrasts. Scheffe makes an adjustment to the $$F$$ (or $$t$$) statistic that is used to declare a comparison statistically significant. The adjustment is based on $$df_B$$ for the entire study (that is, all groups, not just those involved in the comparison). While flexible, Scheffe is used less than Tukey for making all pairwise comparisons because its power is lower.
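As a rough sketch of how the Scheffe adjustment works in practice, the Python below applies it to all pairwise comparisons in the pasta example. It assumes the usual form of the test, in which a contrast's $$t$$ is declared significant when $$|t|$$ exceeds $$\sqrt{df_B \cdot F_{crit}}$$, and it takes $$F_{crit} = 3.24$$ as the tabled 5% value for 3 and 16 degrees of freedom rather than computing it. The means and $$MS_W$$ are the values from the example.

```python
import math
from itertools import combinations

means = {"A": 8.6, "B": 12.0, "C": 15.2, "D": 16.2}
n_per_group = 5
ms_w = 9.675
df_b = len(means) - 1
critical_f = 3.24   # tabled 5% cutoff for F(3, 16); a lookup, not computed here

# Scheffe cutoff on the t scale: sqrt(df_B * F_crit).
scheffe_cutoff = math.sqrt(df_b * critical_f)

# Pairwise contrast weights are +1 and -1, so sum(c^2 / n) = 2 / n.
se = math.sqrt(ms_w * (1 / n_per_group + 1 / n_per_group))

significant = []
for a, b in combinations(means, 2):
    t = (means[a] - means[b]) / se
    if abs(t) > scheffe_cutoff:
        significant.append((a, b))
    print(f"{a} vs {b}: t = {t:6.2f}, significant: {abs(t) > scheffe_cutoff}")

print(f"Scheffe cutoff for |t|: {scheffe_cutoff:.3f}")
```

Under these assumptions only A versus C and A versus D clear the cutoff. The cutoff of about 3.12 is noticeably stricter than the unadjusted 2.12, which is the loss of power mentioned above.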

Effect Size

The final element of a one-way ANOVA to report is the effect size. The effect size for ANOVA is referred to as either $$\eta^2$$ or $$R^2$$. They are the same, but $$\eta^2$$ is more commonly written in the context of ANOVA and $$R^2$$ is commonly written in the context of regression. $$\eta^2$$ can range from 0 to 1. It is a summary of how much variability in the dependent variable we have explained with our nominal independent variable. If it is 0, we have explained nothing, and if it is 1 we have explained everything. Usually we will be somewhere in between, and probably closer to 0.

Calculation of $$\eta^2$$ is easy and follows intuitively once one knows the interpretation. First, we can easily get the total sum of squares:

$SS_T = SS_B + SS_W$

Since $$\eta^2$$ is the proportion of total variability explained by the treatment,

$\eta^2 = \frac{SS_B}{SS_T}$
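For the pasta data, the calculation can be sketched in Python as follows. The function name `eta_squared` is an assumption for illustration; it returns the sums of squares alongside the effect size so the decomposition is visible.

```python
def eta_squared(groups):
    """Return (SS_B, SS_W, SS_T, eta^2) for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ss_t = ss_b + ss_w               # total variability
    return ss_b, ss_w, ss_t, ss_b / ss_t

# Cook times for brands A-D from the example.
pasta_groups = [
    [9, 6, 8, 11, 9],
    [15, 12, 16, 8, 9],
    [14, 17, 11, 15, 19],
    [17, 15, 12, 22, 15],
]

ss_b, ss_w, ss_t, eta2 = eta_squared(pasta_groups)
print(f"SS_B = {ss_b:.1f}, SS_W = {ss_w:.1f}, SS_T = {ss_t:.1f}, eta^2 = {eta2:.3f}")
```

Here $$SS_B = 177.2$$ and $$SS_T = 332$$, so brand accounts for roughly 53% of the variability in cook time.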

What is a Big Effect?

According to Cohen:

• $$\eta^2$$ = .01 is a small effect.
• $$\eta^2$$ = .06 is a medium effect.
• $$\eta^2$$ = .14 is a large effect.

We can use these effect sizes to carry out a power analysis. Note, though, that power will depend on the number of subjects and the number of treatment levels.