Caleb Scheidel

The first step of an analysis in R is to read in your data. Data files come in many formats, but they are commonly plain-text rectangular files such as .csv, .tsv, or .txt. The readr package reads such files. Most of readr’s functions turn a flat file into a tibble, which is the tidyverse’s modern take on R’s data frame. A tibble can then be manipulated to create summary tables or plots, run statistical tests, or perform other common analysis tasks.

### Setup

The readr package is part of the tidyverse suite of packages, developed by RStudio. If you do not already have these packages installed in your R environment, run the following:

install.packages("tidyverse")

Then load the readr package.

library(readr)

### General Usage

Most import functions in readr follow the same general syntax: read_*(file, ...). The function you use depends on the format of the file you are working with. The most common formats are delimited files and fixed width files. A delimited file uses a single character to separate the columns on each line. For example, a .csv file uses a comma and a .tsv file uses a tab. The read_delim() function is readr’s general function for reading any delimited file, such as a pipe (|) delimited file or a colon (:) delimited file. A file whose fields are defined by a fixed number of characters is known as a fixed width file. For instance, the first column could be 10 characters wide, the second 3 characters, the third 12 characters, and so on.

readr has functions that support the following file formats:

• read_csv(): comma separated (.csv) files
• read_tsv(): tab separated (.tsv) files
• read_delim(): general delimited files
• read_fwf(): fixed width files
• read_table(): tabular files where columns are separated by white space
• read_log(): web log files
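As a rough sketch of the two less familiar cases, the snippet below reads a pipe-delimited file with read_delim() and a fixed width file with read_fwf(). The file contents, the temporary paths, and the column names (name, state, team) are made up for illustration; only the widths of 10, 3, and 12 characters come from the description above.

```r
library(readr)

# Hypothetical pipe-delimited data, written to a temporary file.
pipe_path <- tempfile(fileext = ".txt")
writeLines(c(
  "department|year|rd_budget",
  "DOD|1976|35696000000",
  "NASA|1976|12513000000"
), pipe_path)

# read_delim() takes the separator through its delim argument.
pipe_df <- read_delim(pipe_path, delim = "|")

# Hypothetical fixed width data: fields of 10, 3, and 12 characters,
# padded on the right with spaces by sprintf().
fwf_path <- tempfile(fileext = ".txt")
writeLines(sprintf(
  "%-10s%-3s%-12s",
  c("Alice", "Bob"),
  c("NY", "CA"),
  c("Engineering", "Marketing")
), fwf_path)

# fwf_widths() describes the width and name of each column.
fwf_df <- read_fwf(
  fwf_path,
  col_positions = fwf_widths(c(10, 3, 12), c("name", "state", "team"))
)
```

Writing to a temporary file keeps the sketch self-contained; with real data you would pass your own path or URL as the first argument.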

As an example, we can use a dataset from the popular Tidy Tuesday challenge put on by the R for Data Science online learning community. The dataset from the week of 2019-02-12 is related to federal R&D spending by agency. The file is in the .csv format, so to read it in we will use read_csv(). The dataset is hosted on the Tidy Tuesday GitHub repo, so it can be read directly from there without downloading to your local machine.

fed_rd <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/fed_r_d_spending.csv")
## Parsed with column specification:
## cols(
##   department = col_character(),
##   year = col_double(),
##   rd_budget = col_double(),
##   total_outlays = col_double(),
##   discretionary_outlays = col_double(),
##   gdp = col_double()
## )
# view the first few lines
head(fed_rd)
## # A tibble: 6 x 6
##   department  year   rd_budget total_outlays discretionary_outl…        gdp
##   <chr>      <dbl>       <dbl>         <dbl>               <dbl>      <dbl>
## 1 DOD         1976 35696000000  371800000000        175600000000    1.79e12
## 2 NASA        1976 12513000000  371800000000        175600000000    1.79e12
## 3 DOE         1976 10882000000  371800000000        175600000000    1.79e12
## 4 HHS         1976  9226000000  371800000000        175600000000    1.79e12
## 5 NIH         1976  8025000000  371800000000        175600000000    1.79e12
## 6 NSF         1976  2372000000  371800000000        175600000000    1.79e12

Notice that the object returned from read_csv() is a tibble, and that the column types are already assigned. The function also printed a message describing how readr parsed each column. This is because read_csv(), like the other readr import functions, guesses the appropriate data type (e.g. character, double, integer, date) for each column as it reads the file, based on the values the column contains. In most cases readr’s guesses are sufficient, but sometimes you may need to specify a column type manually via the col_types = ... argument.
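For illustration, here is a minimal sketch of two ways to pass col_types, using a small made-up file rather than the Tidy Tuesday data. The compact string form gives one letter per column, while the cols() form names only the columns you want to override.

```r
library(readr)

# Hypothetical three-column file used only for illustration.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "department,year,rd_budget",
  "DOD,1976,35696000000",
  "NASA,1976,12513000000"
), path)

# Compact form: one letter per column
# (c = character, i = integer, d = double).
compact <- read_csv(path, col_types = "cid")

# cols() form: override year only; .default applies to every other column.
explicit <- read_csv(
  path,
  col_types = cols(.default = col_guess(), year = col_integer())
)
```

Both calls read year as an integer instead of the double that readr would guess on its own.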

In this case, most column types were guessed correctly, but what if we preferred year to be treated as a date, not a double? Let’s read the file in again, this time setting year = col_date(format = "%Y") in the col_types argument to override readr’s default guess. Because the format supplies only a year, each value is parsed as January 1 of that year.

fed_rd <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/fed_r_d_spending.csv",
  col_types = cols(
    department = col_character(),
    year = col_date(format = "%Y"),
    rd_budget = col_double(),
    total_outlays = col_double(),
    discretionary_outlays = col_double(),
    gdp = col_double()
  )
)

head(fed_rd)
## # A tibble: 6 x 6
##   department year        rd_budget total_outlays discretionary_ou…      gdp
##   <chr>      <date>          <dbl>         <dbl>             <dbl>    <dbl>
## 1 DOD        1976-01-01    3.57e10  371800000000      175600000000  1.79e12
## 2 NASA       1976-01-01    1.25e10  371800000000      175600000000  1.79e12
## 3 DOE        1976-01-01    1.09e10  371800000000      175600000000  1.79e12
## 4 HHS        1976-01-01    9.23e 9  371800000000      175600000000  1.79e12
## 5 NIH        1976-01-01    8.02e 9  371800000000      175600000000  1.79e12
## 6 NSF        1976-01-01    2.37e 9  371800000000      175600000000  1.79e12

That looks better. The same general syntax can be used for other file types. Another quick example is massey-rating.txt, an example dataset that ships with readr. This file is white space delimited, so read it in using read_table().

massey_rating <- read_table("https://raw.githubusercontent.com/tidyverse/readr/master/inst/extdata/massey-rating.txt")
## Parsed with column specification:
## cols(
##   UCC = col_double(),
##   PAY = col_double(),
##   LAZ = col_double(),
##   KPK = col_double(),
##   RT = col_double(),
##   COF = col_double(),
##   BIH = col_double(),
##   DII = col_double(),
##   ENG = col_double(),
##   ACU = col_double(),
##   Rank = col_double(),
##   Team = col_character(),
##   Conf = col_character()
## )
head(massey_rating)
## # A tibble: 6 x 13
##     UCC   PAY   LAZ   KPK    RT   COF   BIH   DII   ENG   ACU  Rank Team
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1     1     1     1     1     1     1     1     1     1     1     1 Ohio St
## 2     2     2     2     2     2     2     2     2     4     2     2 Oregon
## 3     3     4     3     4     3     4     3     4     2     3     3 Alabama
## 4     4     3     4     3     4     3     5     3     3     4     4 TCU
## 5     6     6     6     5     5     7     6     5     6    11     5 Michig…
## 6     7     7     7     6     7     6    11     8     7     8     6 Georgia
## # … with 1 more variable: Conf <chr>

Note that in both of these examples, the data file is stored on a web server and is accessed by including the full URL. If the file is local to a user’s machine, the pathname to the file should be used instead. For instance, if we had downloaded the massey-rating.txt file to our local machine’s Desktop folder, we would specify the file path argument to point to the file saved in that directory:

massey_rating <- read_table("~/Desktop/massey-rating.txt")

Note that “~” is your home directory on Mac and Linux. If you are using R on a Windows machine, write path names with forward slashes (/) or with doubled backslashes (\\), because a single backslash is treated as an escape character inside an R string.
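A short sketch of these path rules, using base R only. The Windows user name "me" and the folder layout are hypothetical and stand in for whatever your own machine uses.

```r
# file.path() joins path pieces with forward slashes on every platform.
local_path <- file.path("~", "Desktop", "massey-rating.txt")

# path.expand() replaces the leading "~" with R's notion of the home directory.
full_path <- path.expand(local_path)

# A hypothetical Windows path: backslashes must be doubled inside an R string,
# or the path can be written with forward slashes instead.
win_escaped <- "C:\\Users\\me\\Desktop\\massey-rating.txt"
win_forward <- "C:/Users/me/Desktop/massey-rating.txt"

# Both spellings name the same file once the separators are normalized.
same_file <- identical(gsub("\\", "/", win_escaped, fixed = TRUE), win_forward)
```

Any of these strings can be passed as the file argument of a readr function, provided the file actually exists at that location.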

### Tips to Remember

• What if a data file has unnecessary rows at the top of the file (e.g. headers or notes)? You can use the skip = n argument in any of the readr functions to skip the first n rows of the file before reading in the data. On the other hand, if you only want to read in the first n rows of a file, you can use the n_max = n argument.

• Sometimes files have missing data. Use the na argument to identify which values represent missingness in the file. For example, if “.” represents a missing value, set na = "." in the readr function you are using.

• If a data file does not have column names in the first row, set col_names = FALSE to generate generic column names in your output tibble (e.g. X1, X2, etc.). Alternatively, set the column names manually, for example: col_names = c("department", "year", ...).

• By default, readr functions guess the column types from the first 1000 rows. If you are reading in a file with many rows, there may be errors or inconsistencies in the data that do not appear near the top of the file (e.g. a letter suddenly appears in row 1001 after the first 1000 rows in that column were all integers). In that case readr settles on a type that does not fit the later values and returns warnings about the values it could not parse. To avoid problems like this, use the guess_max argument to raise the maximum number of rows used for guessing column types.
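The tips above can be sketched together on two small made-up files. The file contents, column names, and the value "A7" are assumptions chosen only to trigger each behavior; the arguments themselves (skip, n_max, na, col_names, guess_max) are the ones described in the list.

```r
library(readr)

# Hypothetical file: two note lines, no header row, "." marks a missing value.
notes_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Exported for illustration",
  "Values are made up",
  "DOD,1976,35.7",
  "NASA,1976,.",
  "DOE,1977,10.9"
), notes_path)

spending <- read_csv(
  notes_path,
  skip = 2,                                         # drop the two note lines
  col_names = c("department", "year", "rd_budget"), # the file has no header
  na = ".",                                         # treat "." as missing
  n_max = 2                                         # read only two data rows
)

# Hypothetical file where a non-numeric value first appears in row 1500,
# beyond the default window of rows used for type guessing.
late_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,code", paste0(1:1500, ",", c(rep("1", 1499), "A7"))),
  late_path
)

# Raising guess_max lets readr see the late value and choose character.
late <- read_csv(late_path, guess_max = 2000)
```

Reading the second file without guess_max would instead guess a numeric type for code and warn about the last row.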

### Other Resources

For further information on data import using readr, check out the following: