Running t-Tests in R


t-Tests in R

All three types of \(t\)-tests can be performed using the same t.test function in R. The primary arguments are the following (a schematic call spelling out the defaults appears after the list):

  • x and (optionally) y, or a formula such as y ~ x. When vectors are supplied, x and y are the numeric samples to compare; in the formula, y is the interval-level outcome variable and x is the two-level factor variable. The formula syntax can be used for the independent samples \(t\)-test. If a formula is specified, the data argument can also be specified so that it is not necessary to refer to the variables using df$x and df$y notation.
  • alternative, which specifies whether a two-sided test (the default) or a one-sided test will be used.
  • mu, which is the null hypothesized difference between means. This can be set for the one-sample test (see example below) but is usually left at its default value of 0 for differences in means (paired or independent).
  • paired, which specifies, when two means are compared, whether the observations are paired or independent. The default is paired = FALSE, or the independent samples \(t\)-test.
  • var.equal is set for independent samples \(t\)-tests to determine if an adjustment should be made for unequal variances between the groups. It defaults to FALSE, meaning equal variances are not assumed.
  • conf.level, the confidence level. By default this is 0.95, corresponding to \(\alpha = 0.05\).
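To make these defaults concrete, here is a schematic call that spells out each argument with its default value. This is a sketch only: x, y, and df are placeholder names, not variables from the tutorial data.

# Schematic only -- x, y, and df are placeholders
t.test(x, y,
       alternative = "two.sided",  # or "less" / "greater" for a one-sided test
       mu = 0,                     # null-hypothesized mean (or difference in means)
       paired = FALSE,             # TRUE for the paired samples t-test
       var.equal = FALSE,          # TRUE to assume equal group variances
       conf.level = 0.95)          # corresponds to alpha = 0.05

# Formula interface for the independent samples test:
# t.test(y ~ x, data = df)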

The data used in this tutorial can be downloaded from here. The one-sample and independent samples examples will use the iq_long.sav data, and the paired samples example will use iq_wide.sav. These are SPSS files and can be read in using the haven package. Assuming the data are saved in a local folder data inside the current working directory, the following syntax can be run:

library(tidyverse)
library(haven)
library(knitr)
library(broom)

iq_long <- read_sav("data/iq_long.sav") %>%
  mutate(gender = as_factor(gender))

iq_wide <- read_sav("data/iq_wide.sav")

Notice that we load four packages that will be used in this tutorial: tidyverse, haven, knitr, and broom. Also note that read_sav imports gender as a labelled numeric variable; the mutate call converts it to a factor so that R treats it as a categorical variable.
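A quick way to confirm that the conversion worked is to check the class and levels of the new variable:

class(iq_long$gender)   # should now be "factor"
levels(iq_long$gender)  # the value labels from the SPSS file: Male and Female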

One Sample \(t\)-Test

Say we have data from 200 subjects who have taken an IQ test. We know in the general population the mean IQ is 100. We want to test the hypothesis that our sample comes from a different population, e.g. one that is more gifted than the general population. We will first look at the distribution of scores to determine if there are any outliers or if the distribution is highly skewed. Then we will test the null hypothesis that our sample comes from a population where \(\mu = 100\) against the two-sided alternative that \(\mu \neq 100\).

First, let’s use ggplot to look at our data using a histogram.

iq_long %>%
  ggplot(aes(x = iq)) + geom_histogram(color = "black", fill = "firebrick")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
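The message is just ggplot telling us that geom_histogram fell back to its default of 30 bins. If we wanted to silence it, or to control the binning ourselves, we could set a binwidth explicitly; a binwidth of 5 IQ points is an arbitrary but reasonable choice:

iq_long %>%
  ggplot(aes(x = iq)) +
  geom_histogram(binwidth = 5, color = "black", fill = "firebrick")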

For now, though, we'll stick with the default of 30 bins, since this is a reasonable choice for our data. The observations look like they may be centered above 100, and the distribution looks roughly symmetric. What are the mean and standard deviation?

iq_long %>%
  summarise(Mean = mean(iq),
            SD   = sd(iq)) %>%
  kable(align = c("c", "c"))
Mean       SD
105.0351   15.78354

The call to kable produces a nicely formatted table in our output file.
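kable also accepts a digits argument if rounded values are preferred; for example, to show two decimal places:

iq_long %>%
  summarise(Mean = mean(iq),
            SD   = sd(iq)) %>%
  kable(align = c("c", "c"), digits = 2)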

Is this mean significantly different from 100? We can run a one-sample \(t\)-test to determine the answer. The syntax is the following:

t.test(iq_long$iq, mu = 100)
  • iq_long is our sample data, and iq is the variable we are testing.
  • mu = 100 means that we are testing our sample against a population mean of 100.

Leaving the other arguments as their defaults results in a two-sided test and confidence level of 0.95. If we run the code, we get the following output:

## 
##  One Sample t-test
## 
## data:  iq_long$iq
## t = 4.5115, df = 199, p-value = 1.099e-05
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
##  102.8343 107.2359
## sample estimates:
## mean of x 
##  105.0351

The output is a little ugly; we can get nicer output by piping the result into the tidy function from the broom package.

t.test(iq_long$iq, mu = 100) %>%
  tidy() %>%
  kable()
estimate   statistic   p.value   parameter   conf.low   conf.high   method              alternative
105.0351   4.511477    1.1e-05   199         102.8343   107.2359    One Sample t-test   two.sided

The estimate column gives us our sample mean. The statistic column tells us that our \(t\)-statistic is equal to 4.511. When compared to a \(t\)-distribution with 199 degrees of freedom (from the parameter column), we get a \(p\)-value that is less than .001. We also get a 95% confidence interval around our sample mean of [102.83, 107.24]. Since the \(p\)-value that we found is less than 0.05, and the 95% confidence interval does not include 100, we reject the null hypothesis.
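As a sanity check on the arithmetic, the one-sample \(t\)-statistic is just the sample mean minus the hypothesized mean, divided by the standard error of the mean. Here is a minimal sketch that reproduces the values above directly from the data:

# Reproduce the one-sample t-test by hand
n      <- nrow(iq_long)                      # 200 subjects
x_bar  <- mean(iq_long$iq)                   # sample mean, about 105.04
se     <- sd(iq_long$iq) / sqrt(n)           # standard error of the mean
t_stat <- (x_bar - 100) / se                 # should be about 4.51
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value, about 1.1e-05
c(t = t_stat, p = p_val)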

If we wanted to test the null hypothesis that our sample comes from a population with a mean of 103, we would run another \(t\)-test that changes the mu argument.

t.test(iq_long$iq, mu = 103) %>%
  broom::tidy() %>%
  knitr::kable()
estimate   statistic   p.value     parameter   conf.low   conf.high   method              alternative
105.0351   1.823461    0.0697339   199         102.8343   107.2359    One Sample t-test   two.sided

This gives us a \(p\)-value of 0.0697, so we would not reject the null hypothesis.

Independent Samples \(t\)-Test

Say we wanted to test whether there is a significant difference in the IQs of males and females in our sample of 200 subjects. We’ll start out by visualizing the differences between groups using boxplots. This will give us an initial sense of whether differences exist and allow us to look for major outliers, skew in the distributions, or dramatically unequal variability between the two groups.

iq_long %>%
  ggplot(aes(x = gender, y = iq)) + geom_boxplot() +
  labs(x = "Gender", y = "IQ")

The median IQ for females looks somewhat higher than the median for males. The two distributions have similar spread (consistent with the equal variance assumption) and are roughly symmetric. We can get the specific descriptive statistics for each gender as follows:

iq_long %>%
  group_by(gender) %>%
  summarise(Mean = mean(iq),
            SD   = sd(iq)) %>%
  kable(align = c("c", "c"))
gender   Mean       SD
Male     103.5565   13.88069
Female   106.4000   17.31133

The mean IQ for males is 103.6 (SD = 13.9), and the mean IQ for females is 106.4 (SD = 17.3). Are these differences statistically significant? The following syntax performs the independent samples \(t\)-test. Note that, for the independent samples \(t\)-test, we can use the formula syntax.

t.test(iq ~ gender, data = iq_long, var.equal = TRUE)

In this syntax, iq is the interval-level outcome variable, and gender is the two-level factor variable. By default, R will conduct a two-sided test at the 95% confidence level. Also by default, R will run the Welch version of the \(t\)-test, which adjusts for unequal variances. We saw that our two groups had similar variances, so we set the var.equal argument to TRUE. We could leave it at the default FALSE if we wanted to be more conservative.
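If we did want the more conservative Welch version, the call is simply the one above without the var.equal argument (shown here only for comparison; the rest of the tutorial uses the pooled test):

# Welch two-sample t-test: the default when var.equal is not set
t.test(iq ~ gender, data = iq_long)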

We’ll again run the syntax and pipe the results into the tidy function to get nicer output.

t.test(iq ~ gender, data = iq_long, var.equal = TRUE) %>%
  tidy() %>%
  kable()
estimate1   estimate2   statistic   p.value    parameter   conf.low   conf.high   method              alternative
103.5565    106.4       -1.274893   0.203841   198         -7.24196   1.554876    Two Sample t-test   two.sided

The first two estimate columns give the means for each group. statistic is the value of the \(t\)-statistic. When evaluated against a \(t\)-distribution with 198 degrees of freedom (listed in the parameter column), we get a \(p\)-value of .204. This is greater than 0.05, so we fail to reject the null hypothesis of no difference. We also see that the 95% confidence interval around the mean difference is [-7.242, 1.555]. Because this interval includes zero (equivalent to \(p > 0.05\)), we do not reject the null.
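To connect this output back to the formula, here is a minimal sketch that reproduces the pooled-variance \(t\)-statistic from the per-group summaries (the group sizes are taken from the data, since only the total of 200 subjects is reported above):

# Reproduce the pooled (equal variances) t-statistic by hand
grp <- iq_long %>%
  group_by(gender) %>%
  summarise(n = n(), m = mean(iq), s = sd(iq))

sp2 <- sum((grp$n - 1) * grp$s^2) / (sum(grp$n) - 2)   # pooled variance
se  <- sqrt(sp2 * sum(1 / grp$n))                      # SE of the mean difference
(grp$m[1] - grp$m[2]) / se                             # about -1.27 (first level minus second)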

Paired Samples \(t\)-Test

Finally, say we have IQ data collected on 100 individuals at two points in time. We want to know if an intervention that occurs between the two measurements (say, forming a test study group) increases IQ scores. The null hypothesis is that the mean change (\(IQ_{t2} - IQ_{t1}\)) is zero.

To conduct a dependent (or paired) samples \(t\)-test in R, the data must be in wide format. That is, the \(t_1\) measures are in one column, the \(t_2\) measures are in another, and each row represents one subject.

head(iq_wide) %>%
  kable(align = rep("c", 3))
id   Time_1   Time_2
1    93.89    83.71
2    131.22   116.23
3    102.80   110.93
4    107.27   95.90
5    89.94    101.37
6    104.17   99.24

First, we’ll visualize the differences between the two time points. However, this requires the data be in long format (\(t_1\) is stacked on \(t_2\) in a single column). We can use the gather function from the tidyr package.

iq_for_graph <- iq_wide %>%
  gather(Time, IQ, Time_1:Time_2) 

The first argument names the new column that will contain the original variable names (Time_1 and Time_2). The second argument provides the name of the column that will contain the values. The remaining arguments simply name the variables from the wide format data that will be stacked. The output looks like this:

head(iq_for_graph) %>%
  kable(align = rep("c", 3))
id   Time     IQ
1    Time_1   93.89
2    Time_1   131.22
3    Time_1   102.80
4    Time_1   107.27
5    Time_1   89.94
6    Time_1   104.17
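As an aside, gather has since been superseded by pivot_longer in newer versions of tidyr. Assuming the same column names, an equivalent call would look something like this:

# Equivalent reshape with pivot_longer
iq_for_graph <- iq_wide %>%
  pivot_longer(Time_1:Time_2, names_to = "Time", values_to = "IQ")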

Now we’ll create the graph.

iq_for_graph %>%
  ggplot(aes(x = Time, y = IQ)) + geom_boxplot(color = "black", fill = "firebrick")

If we were going to publish this, we’d probably want to take the time to clean up the x-axis tick labels by removing the underscore. We can do this directly in ggplot as follows:

iq_for_graph %>%
  ggplot(aes(x = Time, y = IQ)) + geom_boxplot(color = "black", fill = "firebrick") +
  scale_x_discrete(labels = c("Time 1", "Time 2"))

Looking at the figure, the \(t_2\) scores appear to be a little higher than the \(t_1\) scores, though not by much. Is this difference statistically significant? We run the paired samples \(t\)-test using the wide format of the data as follows:

t.test(iq_wide$Time_2, iq_wide$Time_1, paired = TRUE) %>%
  tidy() %>%
  kable()
estimate   statistic   p.value     parameter   conf.low     conf.high   method          alternative
3.5234     1.55743     0.1225596   99          -0.9655275   8.012327    Paired t-test   two.sided

The first argument is the column containing the \(t_2\) measures, and the second argument is the \(t_1\) measures. Note that we can’t use the formula syntax here because R needs to know which \(t_1\) observations go with which \(t_2\) observations, and it can only do so if the data are in wide format. This means we have to use the syntax in which the data frame iq_wide is prepended to the variable name with the $ operator.

We see that the difference in means is 3.52, which results in a \(t\)-statistic equal to 1.56. Evaluating this against a \(t\)-distribution with 99 degrees of freedom, we get a (two-sided) \(p\)-value of .123, not enough to be statistically significant. The 95% confidence interval around the estimated mean difference is [-0.966, 8.012]. Since this interval includes zero, and because \(p > 0.05\), we do not reject the null hypothesis.

Note that the paired samples \(t\)-test is equivalent to creating a difference score, \(D = IQ_{t2} - IQ_{t1}\), and then testing if the mean difference score is significantly different from zero in a one-sample \(t\)-test. To see this, first create the difference score:

iq_wide <- iq_wide %>%
  mutate(Difference = Time_2 - Time_1)

Now perform the one-sample \(t\)-test with the null hypothesis set to be \(\mu_D = 0\).

t.test(iq_wide$Difference, mu = 0) %>%
  tidy() %>%
  kable()
estimate   statistic   p.value     parameter   conf.low     conf.high   method              alternative
3.5234     1.55743     0.1225596   99          -0.9655275   8.012327    One Sample t-test   two.sided

Other than the method column in the output table, the results are identical to the prior table using the paired sample \(t\)-test.
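If you want to verify this equivalence programmatically, you can compare the two tidied results directly; for example:

# The test statistics and p-values match
paired_res     <- tidy(t.test(iq_wide$Time_2, iq_wide$Time_1, paired = TRUE))
one_sample_res <- tidy(t.test(iq_wide$Difference, mu = 0))

all.equal(paired_res$statistic, one_sample_res$statistic)
all.equal(paired_res$p.value,   one_sample_res$p.value)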