Reading SAS, SPSS, or Stata files into R using haven

Caleb Scheidel

Posted on
R haven tidyverse data import SAS SPSS Stata

One of the biggest benefits of using R is its flexibility in working with various types of data used by other statistical software. If you are collaborating with other researchers, they may be working with data produced by SAS (.sas7bdat), SPSS (.sav), or Stata (.dta). The haven package in R was developed specifically to import and export data in these formats. Similar to readr for rectangular text data, haven’s functions read files in these formats into a tibble object in R. A tibble can then be manipulated to create summary tables or plots, run statistical tests, or perform other common analysis tasks.

Setup

The haven package is a part of the tidyverse suite of packages, developed by RStudio. If you do not already have these packages installed to your R environment, run the following:

install.packages("tidyverse")

Then load the haven package.

library(tidyverse)
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
library(haven)

General Usage

The main data import functions in haven are the following:

  • read_sas(): reads .sas7bdat and .sas7bcat files from SAS
  • read_sav(): reads .sav files from SPSS
  • read_dta(): reads .dta files from Stata.

Each file format also has a corresponding write_*() function for exporting data out of R to that file format.

SAS Example

Let’s start with an example of reading in a SAS dataset. We will use the National Youth Tobacco Survey (NYTS) data from 2017, which can be downloaded from the CDC website here. The data comes in two files, the raw data in a .sas7bdat file and the formats/labels in a .sas7bcat file. Both will be specified as arguments in read_sas().

nyts <- read_sas(data_file    = "data/nyts2017.sas7bdat",
                 catalog_file = "data/formats.sas7bcat") %>% 
  select(Month:Q3)  # keep only first few cols for example

head(nyts)
## # A tibble: 6 x 5
##   Month     Day       Q1        Q2        Q3       
##   <chr+lbl> <chr+lbl> <chr+lbl> <chr+lbl> <chr+lbl>
## 1 3         06        05        2         2        
## 2                     04        2         2        
## 3                     04        2         2        
## 4                     04        2         2        
## 5                     05        2         2        
## 6                     04        2         1

Notice that each column is a chr+lbl type, meaning the character value from the .sas7bdat file is what is shown in the tibble by default, but the data also contain a label from the .sas7bcat file that is stored behind the scenes. These chr+lbl columns imported by haven are labeled categorical variables in SAS and treated as categorical variables in R, but show up as numeric values (e.g. 1, 2, 3) in the R output. If you want to convert the chr+lbl variables to a traditional categorical variable that is internally stored and represented in the output as the labels, use the as_factor() function. Factor (fct) variables are what R calls categorical variables.

Another thing to clean up are the blank cells. We would prefer these to be showing as NA in R, instead of an empty character. To clean these up, use zap_empty().

nyts <- nyts %>% 
  mutate_all(zap_empty) %>%                   # convert any blank cells to NA
  mutate_at(vars(Month, Q1:Q3), as_factor)    # convert Month, Q1, Q2 and Q3 from chr+lbl to fct column type

head(nyts)
## # A tibble: 6 x 5
##   Month Day       Q1           Q2     Q3   
##   <fct> <chr+lbl> <fct>        <fct>  <fct>
## 1 March 06        13 years old Female 7th  
## 2 <NA>  NA        12 years old Female 7th  
## 3 <NA>  NA        12 years old Female 7th  
## 4 <NA>  NA        12 years old Female 7th  
## 5 <NA>  NA        13 years old Female 7th  
## 6 <NA>  NA        12 years old Female 6th

Great. Now we have the value labels showing up, but what do Q1, Q2, and Q3 mean? These have a label attribute that we can extract and show in our tibble using the get_label function from the sjlabelled package.

# install.packages("sjlabelled")

# print variable labels 
nyts %>% 
  sjlabelled::get_label() %>%  # pull column names from tibble
  enframe()                    # convert list of column names to a new tibble
## # A tibble: 5 x 2
##   name  value                  
##   <chr> <chr>                  
## 1 Month Month of administration
## 2 Day   Day of administration  
## 3 Q1    How old are you?       
## 4 Q2    What is your sex?      
## 5 Q3    What grade are you in?
# convert column names to variable label using purrr::set_names()
nyts <- nyts %>% 
  set_names(nyts %>% sjlabelled::get_label() %>% enframe() %>% pull(value))

nyts %>% 
  head()
## # A tibble: 6 x 5
##   `Month of admin… `Day of adminis… `How old are yo… `What is your s…
##   <fct>            <chr+lbl>        <fct>            <fct>           
## 1 March            06               13 years old     Female          
## 2 <NA>             NA               12 years old     Female          
## 3 <NA>             NA               12 years old     Female          
## 4 <NA>             NA               12 years old     Female          
## 5 <NA>             NA               13 years old     Female          
## 6 <NA>             NA               12 years old     Female          
## # … with 1 more variable: `What grade are you in?` <fct>

Even better. Now the tibble in R contains the descriptive variable names and values from the labels stored in the original SAS data files. To write the file back out to SAS, use write_sas().

write_sas(nyts, "nyts.sas7bdat")

Note that at the time of this writing, the write_sas() function is still experimental and only works for limited datasets.

SPSS Example

To read in an SPSS .sav file, use read_sav(). As an example, we will use the titanic.sav dataset, which can be downloaded here.

titanic <- read_sav("data/titanic.sav")

head(titanic)
## # A tibble: 6 x 5
##   AGE       SEX       CLASS     SURVIVED  W_C      
##   <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
## 1 1         0         1         1         0        
## 2 1         0         1         1         0        
## 3 1         0         1         1         0        
## 4 1         0         1         1         0        
## 5 1         0         1         1         0        
## 6 1         0         1         1         0

Notice that each variable is a dbl+lbl type. We can use as_factor() to convert these variables to fct variables keeping and showing only the labels. We’ll also convert the variable names to their more descriptive labels if applicable, using sjlabelled::get_label, and clean up the names to be all lowercase and replacing the spaces with underscores using janitor::clean_names().

titanic <- titanic %>% 
  mutate_all(as_factor) %>%   # convert all variables from chr+lbl to fct type          
  set_names(titanic %>%       # convert variable names to informative labels
              sjlabelled::get_label() %>% # pull labels from original tibble
              enframe() %>%               # convert list of column names to a new tibble
              na_if("") %>%               # if variable label is empty string, convert to NA
              mutate(value = coalesce(value, name)) %>%  # fill in NA with original if missing
              pull(value)) %>%            # extract new variable name into a character vector
  janitor::clean_names()      # clean names to be all lowercase, replacing spaces with "_"     

head(titanic)
## # A tibble: 6 x 5
##   age_of_passaenger sex   travel_class survival_state women_children_vs_men
##   <fct>             <fct> <fct>        <fct>          <fct>                
## 1 Adult             Male  First        Alive          Man                  
## 2 Adult             Male  First        Alive          Man                  
## 3 Adult             Male  First        Alive          Man                  
## 4 Adult             Male  First        Alive          Man                  
## 5 Adult             Male  First        Alive          Man                  
## 6 Adult             Male  First        Alive          Man

To export the file back out to SPSS, use write_sav().

write_sav(titanic, "titanic.sav")
Stata Example

To read in a Stata .dta file, use read_dta(). As an example, we will use data from the British Election Study 2017 Face-to-face Post-election Survey Version 1.3 bes_f2f_2017_v1.3.dta dataset, which can be downloaded here. We will trim the dataset size for this example using select() and specifying we only want to keep the first six columns (1:6) and the first 10 rows with head(n = 10)

bes <- read_dta("data/bes_f2f_2017_v1.3.dta") %>% 
  select(1:6) %>%   # keep column numbers 1 through 6
  head(n = 10)      # keep only first 10 rows

bes
## # A tibble: 10 x 6
##    finalserialno serial   a01                          a02    a03    m02_1 
##            <dbl> <chr>    <chr>                        <dbl+> <dbl+> <dbl+>
##  1         10115 0000003… nhs                           1     2       4    
##  2         10119 0000003… brexit                        0     2       3    
##  3         10125 0000004… society                      -1     2       4    
##  4         10215 0000003… immigration                  -1     2       4    
##  5         10216 0000003… brexit                        1     2       3    
##  6         10217 0000003… getting brexit sorted        -1     2       2    
##  7         10218 0000004… brexit                        2     2       4    
##  8         10220 0000004… brexit                       -1     2       4    
##  9         10225 0000003… defence                       2     3       3    
## 10         10308 0000014… imigration and the nhs, num…  2     3      -1

Notice that some variables are a dbl+lbl type. We can use as_factor() to convert these variables to fct variables keeping and showing only the labels. We’ll also convert the variable names to their more descriptive labels if applicable, using sjlabelled::get_label, and clean up the names to be all lowercase and replacing the spaces with underscores using janitor::clean_names().

bes <- bes %>% 
  mutate_at(vars(a02:m02_1), as_factor) %>% # convert a02, a03 and m02_1 from dbl+lbl to fct
  set_names(bes %>%                       # convert variable names to informative labels
              sjlabelled::get_label() %>% # pull labels from original tibble
              enframe() %>%               # convert list of column names to a new tibble
              na_if("") %>%               # if variable label is empty string, convert to NA
              mutate(value = coalesce(value, name)) %>% # fill in NA with original if missing
              pull(value)) %>%            # extract new variable name into a character vector
  janitor::clean_names()         # clean names to be all lowercase, replacing spaces with "_"
  
bes
## # A tibble: 10 x 6
##    finalserialno serial a01   best_party_on_m… interest_in_pol…
##            <dbl> <chr>  <chr> <fct>            <fct>           
##  1         10115 00000… nhs   Labour           Fairly interest…
##  2         10119 00000… brex… None/No party    Fairly interest…
##  3         10125 00000… soci… Don`t know       Fairly interest…
##  4         10215 00000… immi… Don`t know       Fairly interest…
##  5         10216 00000… brex… Labour           Fairly interest…
##  6         10217 00000… gett… Don`t know       Fairly interest…
##  7         10218 00000… brex… Conservatives    Fairly interest…
##  8         10220 00000… brex… Don`t know       Fairly interest…
##  9         10225 00000… defe… Conservatives    Not very intere…
## 10         10308 00000… imig… Conservatives    Not very intere…
## # … with 1 more variable:
## #   politicians_dont_care_what_people_like_me_think <fct>

To export the file back out to Stata, use write_dta().

write_dta(bes, "bes.dta")

Tips to Remember

  • If you wish to remove attributes (e.g. +lbl) from a tibble, use the zap_*() functions from haven. In our example, we converted empty strings into missing values using zap_empty(), but there are more zap_*() functions available to eliminate other attributes. To get rid of all labels from labelled vectors, use zap_labels().

  • haven has functions related to manipulating labelled vectors. To create a labelled vector from a numeric or character vector use haven::labelled(). To quickly show the labels for a given labelled vector, use print_labels().

  • Different operating systems may have different character encodings. For example, Stata on Mac and Linux use a different default encoding, “latin1”. If you encounter an error such as “Unable to convert string to the requested encoding”, try setting the encoding = "latin1" argument in read_dta().

Other Resources

For further information on data import using haven, check out the following: