Reading and Writing Excel Files With R Using readxl and writexl

Caleb Scheidel

Posted on
R data import readxl writexl Excel

This tutorial walks you through how to use the readxl package to read Microsoft Excel .xls and .xlsx file formats into R, and how to export data from R back out to an Excel format using the writexl package. readxl’s functions are related to importing Excel files into a tibble object, which is modern R’s internal data format. A tibble can then be manipulated to create summary tables or plots, run statistical tests, or perform other common analysis tasks.

Setup

To install the readxl and writexl packages into your R environment run the following:

install.packages("readxl")

# writexl is not currently on CRAN, so install it from GitHub
devtools::install_github("ropensci/writexl")

Then load the packages.

library(readxl)
library(writexl)

General Usage - readxl

For this tutorial, we will work with an .xls file related to climate change data from the World Bank’s website. To download the file, click here. If you manually open this Excel file to view the data, you will see there are three sheets: Data, Country, and Series. To list the sheet names of this file in R with the readxl package, use the excel_sheets() function.

# Note: change the file path to where you saved the downloaded file, e.g. "~/Downloads/climate_change_download_0.xls"
excel_sheets("data/climate_change_download_0.xls")
## [1] "Data"    "Country" "Series"

Read each sheet in to separate tibbles using the sheet argument:

data <- read_excel("data/climate_change_download_0.xls", sheet = "Data")
data
## # A tibble: 13,512 x 28
##    `Country code` `Country name` `Series code` `Series name` SCALE Decimals
##    <chr>          <chr>          <chr>         <chr>         <dbl>    <dbl>
##  1 ABW            Aruba          AG.LND.EL5M.… Land area be…     0        1
##  2 ADO            Andorra        AG.LND.EL5M.… Land area be…     0        1
##  3 AFG            Afghanistan    AG.LND.EL5M.… Land area be…     0        1
##  4 AGO            Angola         AG.LND.EL5M.… Land area be…     0        1
##  5 ALB            Albania        AG.LND.EL5M.… Land area be…     0        1
##  6 ARE            United Arab E… AG.LND.EL5M.… Land area be…     0        1
##  7 ARG            Argentina      AG.LND.EL5M.… Land area be…     0        1
##  8 ARM            Armenia        AG.LND.EL5M.… Land area be…     0        1
##  9 ASM            American Samoa AG.LND.EL5M.… Land area be…     0        1
## 10 ATG            Antigua and B… AG.LND.EL5M.… Land area be…     0        1
## # … with 13,502 more rows, and 22 more variables: `1990` <chr>,
## #   `1991` <chr>, `1992` <chr>, `1993` <chr>, `1994` <chr>, `1995` <chr>,
## #   `1996` <chr>, `1997` <chr>, `1998` <chr>, `1999` <chr>, `2000` <chr>,
## #   `2001` <chr>, `2002` <chr>, `2003` <chr>, `2004` <chr>, `2005` <chr>,
## #   `2006` <chr>, `2007` <chr>, `2008` <chr>, `2009` <chr>, `2010` <chr>,
## #   `2011` <chr>
country <- read_excel("data/climate_change_download_0.xls", sheet = "Country")
country
## # A tibble: 233 x 6
##    `Country code` `Country name` `Capital city` Region `Income group`
##    <chr>          <chr>          <chr>          <chr>  <chr>         
##  1 EAP            East Asia & P… ..             Aggre… Aggregates    
##  2 ECA            Europe & Cent… ..             Aggre… Aggregates    
##  3 EMU            Euro area      ..             Aggre… Aggregates    
##  4 HIC            High income    ..             Aggre… Aggregates    
##  5 LAC            Latin America… ..             Aggre… Aggregates    
##  6 LIC            Low income     ..             Aggre… Aggregates    
##  7 LMC            Lower middle … ..             Aggre… Aggregates    
##  8 LMY            Low & middle … ..             Aggre… Aggregates    
##  9 MIC            Middle income  ..             Aggre… Aggregates    
## 10 MNA            Middle East &… ..             Aggre… Aggregates    
## # … with 223 more rows, and 1 more variable: `Lending category` <chr>
series <- read_excel("data/climate_change_download_0.xls", sheet = "Series")
series
## # A tibble: 58 x 8
##    `Series code` `Series name` Scale Decimals Order Topic Definition Source
##    <chr>         <chr>         <chr> <chr>    <dbl> <chr> <chr>      <chr> 
##  1 SP.POP.TOTL   Population    0     0            1 Size… Populatio… (1) U…
##  2 SP.POP.GROW   Population g… 0     1            2 Size… Annual po… Deriv…
##  3 NY.GDP.MKTP.… GDP ($)       0     0            3 Size… GDP is gr… World…
##  4 NY.GNP.PCAP.… GNI per capi… 0     0            4 Size… GNI per c… World…
##  5 EN.CLC.MMDT.C Average dail… Text  Text         5 Clim… Average d… Mitch…
##  6 EN.CLC.PCAT.C Projected an… Text  Text         6 Clim… Projected… Inter…
##  7 EN.CLC.HPPT.… Average annu… 0     0            7 Clim… Average a… Mitch…
##  8 EN.CLC.PCPT.… Projected an… Text  Text         8 Clim… Projected… Modif…
##  9 EN.CLC.PCHW   Projected ch… Text  Text         9 Clim… "Projecte… Clima…
## 10 EN.CLC.PCCC   Projected ch… Text  Text        10 Clim… "Projecte… Clima…
## # … with 48 more rows

If you are only interested in a certain subset of the data, the range argument can be specified to select which cells to read in. For example, to read in the columns related to cereal yield for the years 1990 through 1995 for country names that start with “B” in the Data tab, we would only want to read in the rectangle of cells from G481 to L498:

data <- read_excel("data/climate_change_download_0.xls",
                   sheet     = "Data",
                   range     = "G481:L498",
                   col_names = as.character(c(1990:1995)),
                   na        = "..")
data
## # A tibble: 18 x 6
##    `1990` `1991` `1992` `1993` `1994` `1995`
##     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1  1348.  1369.  1391.  1387.  1289.  1332.
##  2    NA     NA     NA     NA     NA     NA 
##  3   848.   878.   913.   919.   957.  1086.
##  4   600.   876.   872.   936.   788.   851 
##  5  2490.  2586.  2626.  2648.  2476.  2593.
##  6  3954.  3990   2954.  2512.  2799.  3069.
##  7    NA     NA     NA     NA     NA     NA 
##  8  2032.  1582.  1667.  1600   1793.  1793.
##  9    NA     NA   3553.  3516.  3606.  3586.
## 10    NA     NA   2741.  2796.  2271.  2065.
## 11  2014.  2276.  2098.  2131.  1754.  1777 
## 12    NA     NA     NA     NA     NA     NA 
## 13  1361.  1358   1400.  1537.  1427.  1561.
## 14  1755   1851.  2142.  2354.  2284.  2513.
## 15  2674.  2625   2571.  2667.  2600   2750 
## 16  1550   1600   1233.  2257.  1670.  1639.
## 17  1288.  1261.  1318.  1381.  1318.  1438.
## 18   266.   382.   289.   423.   227.   470.

Note that since we are not reading in the column names in the first row of the file, we need to explicitly name the columns using the col_names argument so that they are more informative. We also noticed that the value to represent missing values on this tab is “..”, so to read these in to R as NA values, we set the na argument.

General Usage - writexl

After reading in and cleaning the original Excel file, now we would like to export the tibble we created in R back out to an Excel format. This is simple to do using the write_xlsx function in the writexl package:

write_xlsx(data, "data/cereal-yield-1990-to-1995.xlsx")

Tips to Remember

  • What if a data file has unnecessary rows at the top of the file (e.g. headers or notes)? You can use the skip = n argument in the read_excel function to skip the first n rows of the file before reading in the data. On the other hand, if you only want to read in the first n rows of a file, you can use the n_max = n argument.

  • By default, the read_excel function will make a guess on the column types based on the first 1000 rows. If you are reading in a large file with a lot of rows, there may be some errors or inconsistencies in the data format that are not seen at the very top of the file (e.g. a letter suddenly appears in row 1001 after the first 1000 rows in that column were all integers). If this is the case, the read_excel function will return warnings. To avoid problems like this, use the guess_max argument to set a higher value for the maximum number of rows to use for guessing column types.

  • The writexl package is currently still under development and is designed to be very minimalistic. Other packages such as xlsx and openxlsx allow for much more flexible custom formatting of output Excel files through R.

Other Resources

For further information on data import using readxl, check out the following:

For further information on data export using writexl, check out the following: