# Running t-Tests in R

Nikki Kamouneh

## t-Tests in R

All three types of $$t$$-tests can be performed using the same t.test function in R. The primary arguments are the following:

• x and (optionally) y, or a formula, e.g. y ~ x. When x and y are supplied directly, each is a numeric vector of interval-level scores (x alone for a one-sample test, x and y for a two-sample or paired test). When a formula is supplied, y is the interval-level outcome variable and x is the two-level factor variable that defines the groups. The formula syntax can be used for the independent samples $$t$$-test. If a formula is specified, the data argument can name the data frame so that the variables do not have to be referenced using df$x and df$y notation.
• alternative, which specifies whether the test is two-tailed ("two.sided", the default) or one-tailed ("less" or "greater").
• mu, which is the null hypothesized value of the mean (one-sample test) or of the difference between means (two-sample tests). It is set for the one-sample test (see example below) but is usually left at its default value of 0 for differences in means (paired or independent).
• paired, which specifies, when two means are compared, whether the observations are paired or independent. The default is paired = FALSE, or the independent samples $$t$$-test.
• var.equal, which specifies, for independent samples $$t$$-tests, whether the two groups are assumed to have equal variances. It defaults to FALSE, meaning equal variances are not assumed and the Welch adjustment to the degrees of freedom is applied.
• conf.level, the confidence level. By default this is 0.95, corresponding to $$\alpha = 0.05$$.
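
As a rough illustration of how these arguments fit together, the sketch below runs t.test on simulated data. The data frame sim and its columns group and score are made up for this example and are not part of the tutorial data.

```r
set.seed(42)

# Hypothetical simulated data: two groups of 50 scores each
sim <- data.frame(
  group = factor(rep(c("a", "b"), each = 50)),
  score = c(rnorm(50, mean = 100, sd = 15), rnorm(50, mean = 106, sd = 15))
)

# x / y interface: pass the two numeric vectors directly
res_xy <- t.test(x = sim$score[sim$group == "a"],
                 y = sim$score[sim$group == "b"])

# Formula interface: outcome ~ two-level factor, with the data argument
res_formula <- t.test(score ~ group, data = sim)

# Both interfaces describe the same test, so the t statistics match
all.equal(unname(res_xy$statistic), unname(res_formula$statistic))

# One-sided alternative with a 99% confidence level
res_less <- t.test(score ~ group, data = sim,
                   alternative = "less", conf.level = 0.99)
res_less$conf.int
```

With a one-sided alternative the confidence interval is also one-sided, so one of its bounds is infinite.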

The data used in this tutorial can be downloaded from this GitHub repository. The one-sample and independent samples examples will use the iq_long.sav data, and the paired samples example will use iq_wide.sav. These are SPSS files and can be read in using the haven package. Assuming the data are saved in a local folder data inside the current working directory, the following syntax can be run:

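library(tidyverse)
library(haven)
library(knitr)
library(broom)

# Read the long-format data and convert gender to a factor
iq_long <- read_sav("data/iq_long.sav") %>%
  mutate(gender = as_factor(gender))

# Read the wide-format data for the paired samples example
iq_wide <- read_sav("data/iq_wide.sav")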

Notice that we load the four packages used in this tutorial: tidyverse, haven, knitr, and broom. Also note that read_sav reads the gender variable in as numeric codes rather than as a factor. The mutate call with as_factor makes sure R treats it as a factor (categorical) variable.

## One Sample $$t$$-Test

Say we have data from 200 subjects who have taken an IQ test. We know that in the general population the mean IQ is 100. We want to test the hypothesis that our sample comes from a different population, e.g. one that is more gifted than the general population. We will first look at the distribution of scores to determine if there are any outliers or if the distribution is highly skewed. Then we will test the null hypothesis that our sample comes from a population where $$\mu = 100$$ against the two-sided alternative $$\mu \neq 100$$.

First, let’s use ggplot to draw a histogram of our data.

iq_long %>%
  ggplot(aes(x = iq)) +
  geom_histogram(color = "black", fill = "firebrick")
## stat_bin() using bins = 30. Pick better value with binwidth.
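
The test itself can be sketched as follows. Because the data file is not bundled here, the example uses a simulated vector iq_sim as a hypothetical stand-in for iq_long$iq, so the numbers it produces are not the tutorial's results. The statistic is $$t = (\bar{x} - \mu_0) / (s / \sqrt{n})$$ with $$n - 1$$ degrees of freedom.

```r
set.seed(1)

# Hypothetical stand-in for iq_long$iq: 200 simulated scores
iq_sim <- rnorm(200, mean = 103, sd = 15)

# Two-sided test of H0: mu = 100
two_sided <- t.test(iq_sim, mu = 100)

# One-sided test of H0: mu = 100 against mu > 100 ("more gifted")
greater <- t.test(iq_sim, mu = 100, alternative = "greater")

# The same t statistic computed by hand
t_by_hand <- (mean(iq_sim) - 100) / (sd(iq_sim) / sqrt(length(iq_sim)))

two_sided$statistic   # t statistic
two_sided$parameter   # degrees of freedom, n - 1
two_sided$p.value     # two-sided p-value
two_sided$conf.int    # 95% confidence interval for the mean
```

With the tutorial data loaded, the corresponding call would be t.test(iq_long$iq, mu = 100).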

## Independent Samples $$t$$-Test

Say we wanted to test whether there is a significant difference in the IQs of males and females in our sample of 200 subjects. We’ll start out by visualizing the differences between groups using boxplots. This will give us an initial sense of whether differences exist and allow us to look for major outliers, skew in the distributions, or dramatically unequal variability between the two groups.

iq_long %>%
  ggplot(aes(x = gender, y = iq)) +
  geom_boxplot(color = "black", fill = "firebrick") +
  labs(x = "Gender", y = "IQ")
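
A minimal sketch of the independent samples test using the formula interface is shown below. The data frame iq_sim is simulated and only mimics the structure of iq_long (a two-level gender factor and a numeric iq column), so its output is illustrative rather than the tutorial's actual result.

```r
set.seed(7)

# Hypothetical stand-in for iq_long: 100 simulated scores per group
iq_sim <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = 100)),
  iq = rnorm(200, mean = 100, sd = 15)
)

# Welch test (the default, var.equal = FALSE)
welch <- t.test(iq ~ gender, data = iq_sim)

# Pooled-variance test, which assumes equal variances in the two groups
pooled <- t.test(iq ~ gender, data = iq_sim, var.equal = TRUE)

# The Welch degrees of freedom never exceed the pooled n1 + n2 - 2
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))

welch$estimate   # the two group means
welch$conf.int   # confidence interval for the difference in means
welch$p.value
```

If the boxplots show similar spread in the two groups, the two versions will usually give very similar answers; the Welch version is the safer default when the variances look different.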

## Paired Samples $$t$$-Test

Finally, say we have IQ data collected on 100 individuals at two points in time. We want to know if an intervention that occurs between the two measures, say forming a test study group, increases IQ scores. The null hypothesis is that the mean change ($$IQ_{t2} - IQ_{t1}$$) is zero.

To conduct a dependent (or paired) samples $$t$$-test in R, the data must be in wide format. That is, the $$t_1$$ measures are in one column, the $$t_2$$ measures are in another, and each row represents one subject.

head(iq_wide) %>%
  kable(align = rep("c", 3))

| id | Time_1 | Time_2 |
|:--:|:------:|:------:|
| 1 | 93.89 | 83.71 |
| 2 | 131.22 | 116.23 |
| 3 | 102.80 | 110.93 |
| 4 | 107.27 | 95.90 |
| 5 | 89.94 | 101.37 |
| 6 | 104.17 | 99.24 |

First, we’ll visualize the differences between the two time points. However, this requires the data be in long format ($$t_1$$ is stacked on $$t_2$$ in a single column). We can use the gather function from the tidyr package.

iq_for_graph <- iq_wide %>%
  gather(Time, IQ, Time_1:Time_2)

The first argument names the new column that will contain the original variable names (Time_1 and Time_2). The second argument provides the name of the column that will contain the values. The remaining arguments simply name the variables from the wide format data that will be stacked. The output looks like this:

head(iq_for_graph) %>%
  kable(align = rep("c", 3))

| id | Time | IQ |
|:--:|:------:|:------:|
| 1 | Time_1 | 93.89 |
| 2 | Time_1 | 131.22 |
| 3 | Time_1 | 102.80 |
| 4 | Time_1 | 107.27 |
| 5 | Time_1 | 89.94 |
| 6 | Time_1 | 104.17 |
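
In tidyr 1.0.0 and later, gather is superseded by pivot_longer, although gather still works. The sketch below shows what an equivalent reshape might look like. The small data frame iq_wide_sim is a hypothetical stand-in built from the first three rows printed above.

```r
library(tidyr)

# Hypothetical stand-in: the first three rows of iq_wide shown earlier
iq_wide_sim <- data.frame(
  id = 1:3,
  Time_1 = c(93.89, 131.22, 102.80),
  Time_2 = c(83.71, 116.23, 110.93)
)

# The gather call used in this tutorial
long_gather <- gather(iq_wide_sim, Time, IQ, Time_1:Time_2)

# A pivot_longer call that produces the same columns
long_pivot <- pivot_longer(iq_wide_sim,
                           cols = Time_1:Time_2,
                           names_to = "Time",
                           values_to = "IQ")

long_pivot
```

The two results contain the same values but in a different row order: gather stacks all Time_1 rows above the Time_2 rows, while pivot_longer keeps each subject's two rows together. The row order does not affect the boxplot.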

Now we’ll create the graph.

iq_for_graph %>%
  ggplot(aes(x = Time, y = IQ)) +
  geom_boxplot(color = "black", fill = "firebrick")
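
The paired test itself takes the two wide-format columns as x and y with paired = TRUE. The sketch below uses a hypothetical data frame iq_wide_sim holding only the six rows printed by head(iq_wide) earlier, so its result illustrates the mechanics and says nothing about the full sample of 100.

```r
# Hypothetical stand-in: the six rows of iq_wide printed earlier
iq_wide_sim <- data.frame(
  id = 1:6,
  Time_1 = c(93.89, 131.22, 102.80, 107.27, 89.94, 104.17),
  Time_2 = c(83.71, 116.23, 110.93, 95.90, 101.37, 99.24)
)

# Paired test of H0: mean(Time_2 - Time_1) = 0
paired <- t.test(iq_wide_sim$Time_2, iq_wide_sim$Time_1, paired = TRUE)

# Equivalent one-sample test on the difference scores
change <- iq_wide_sim$Time_2 - iq_wide_sim$Time_1
one_sample <- t.test(change, mu = 0)

# Both approaches give the same t statistic and degrees of freedom
all.equal(unname(paired$statistic), unname(one_sample$statistic))
paired$parameter   # degrees of freedom: number of pairs minus 1
paired$estimate    # mean of the differences
```

Putting Time_2 first means the estimate is the mean of $$IQ_{t2} - IQ_{t1}$$, matching the direction of change in the null hypothesis. To test specifically for an increase, alternative = "greater" could be added to the call.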