Reading and Writing Data

Students often get hung up at the start of a quantitative research project with the seemingly simple task of getting their data loaded into the statistical software package they are using. This is frequently a problem because the agencies that release data distribute it in inconsistent and often confusing ways. Knowing how to work with raw data in multiple formats is an important skill that will let you get up and running quickly with the more important parts of your analysis.

Data Formats

Data typically come to us in one of two general formats: (1) plain text, or (2) binary. R also has tools for accessing data stored in databases (e.g. SQL, MS Access) and for “scraping” your own data from online sources. However, the vast majority of data comes in either text or binary format, so that is what we will focus on in this class.

Base R comes with several helpful functions for reading in plain text data, such as read.csv and read.fwf. For a long time, binary data from a variety of sources could be read in using the foreign library. However, more recent packages that are part of the tidyverse provide much easier and more efficient ways to read in a variety of plain text and binary formats. I will use those libraries for all of the examples here. These libraries are:

  • readr for reading in plain-text rectangular data.
  • haven for reading in binary data files from SAS, Stata, and SPSS.
  • readxl for reading in Excel files.

All three of these packages read data into an object called a tibble. Tibbles are the tidyverse upgrade to the venerable data.frame. In general, they operate just like data.frames with some nice additional features, so you shouldn’t really need to worry about the difference. In a pinch, a tibble can be recast as a data.frame with the as.data.frame command:
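# a quick sketch: my_tibble stands in for any tibble you are working with
my_df <- as.data.frame(my_tibble)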

Plain text files

Plain text files (also known as ASCII files), in contrast to binary files, are easily readable on any platform without specialized (and often proprietary) software. When you write a document in WordPad on Windows or TextEdit on macOS, you are writing a plain text file. When you write a document in Microsoft Word, you are writing a binary file. Plain text files are easily transportable across a variety of different programs, usually take up less space, and are better for tracking changes in version control systems. If you want to share data with others, plain text is the best format to use because it is accessible regardless of statistical software or computer platform.

While plain text has the advantage of accessibility and portability, it has the disadvantage of lacking any way to store metadata about your data. Let’s say that you coded an ordinal variable for highest degree earned in your data using a factor in R. When you output this data as plain text, the values of this variable will simply show up as character strings with no information about their proper ordering. When re-loaded into R or another statistical software package, the ordering of the categories that you so carefully specified will not be respected and they will simply be ordered in some default manner (e.g. alphabetically in R). When organizations release their data as plain text, they typically code all variables as numeric and provide a codebook that can be used to convert things like categorical variables and missing value codes. This is a good practice for data distribution, but not terribly helpful when we want to save the analytical data that we worked so hard to organize for our own project.

Data in plain text files usually come in one of two formats: comma-separated values (CSV) files or fixed-width files. In both cases, one line of text corresponds to an observation, or a row of data. The difference between the two formats is how they distinguish the values for different variables (i.e. the columns) within a line of text.

Working with CSV files

In a comma-separated values format, the columns in the data are separated by commas. Here is an example of a very small CSV data file named “data.csv”.

name,location,race,gender,yrsed
"Bobbi","Eugene,OR","Black","Female",12
"Jim","Seattle,WA","White","Male",11
"Mary","Oakland,CA","White","Female",16

There are a couple of things to note here about the format. First, the top line is the “header” line that gives the names of the variables. It’s important for your statistical program to know whether the data file has a header line or not in order to properly process the file. Second, notice that character strings representing category labels are surrounded by quotes. This is good practice because character strings may sometimes include commas within them, as the location variable does in this case, and the program will treat those commas as delimiters if the strings are not surrounded by quotation marks.

We can read this data into R with the read_csv command in the readr library:
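library(readr)
# assumes "data.csv" is saved in the current working directory
dat <- read_csv("data.csv")
dat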

## # A tibble: 3 x 5
##   name  location   race  gender yrsed
##   <chr> <chr>      <chr> <chr>  <dbl>
## 1 Bobbi Eugene,OR  Black Female    12
## 2 Jim   Seattle,WA White Male      11
## 3 Mary  Oakland,CA White Female    16

Notice that the read_csv command tried to guess what type of variable each column should be. It correctly guessed that yrsed was a quantitative variable. You can also give more explicit information about column types with the col_types argument. The read_csv command also assumed that the first line was a header row. If that were not true, I could change that with the col_names=FALSE argument.
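For example, here is a sketch of how the column types for this file could be spelled out explicitly:

dat <- read_csv("data.csv",
                col_types = cols(name = col_character(),
                                 location = col_character(),
                                 race = col_character(),
                                 gender = col_character(),
                                 yrsed = col_double()))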

The use of the comma as the delimiter between columns is pretty standard today, but you will occasionally find other delimiters used. The next most common delimiter is the tab (“\t” in R speak). The readr package provides a handy function called read_tsv, but we can also use the read_delim function to do the same thing, and this will also show how easy it is to “roll your own” read function. Here is the same data as above, but this time separated by tabs and named “data_tab.txt”:

name    location    race    gender  yrsed
"Bobbi" "Eugene,OR" "Black" "Female"    12
"Jim"   "Seattle,WA"    "White" "Male"  11
"Mary"  "Oakland,CA"    "White" "Female"    16

I can read it in using the read_delim function:
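# same file, but with the tab specified as the delimiter
dat <- read_delim("data_tab.txt", delim = "\t")
dat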

## # A tibble: 3 x 5
##   name  location   race  gender yrsed
##   <chr> <chr>      <chr> <chr>  <dbl>
## 1 Bobbi Eugene,OR  Black Female    12
## 2 Jim   Seattle,WA White Male      11
## 3 Mary  Oakland,CA White Female    16

All I had to do was specify what character served as the delimiter. In actuality, read_csv and read_tsv are just convenience functions that both call read_delim. If you access the help file for read_delim, you will see that there are many different arguments for dealing with specific problems that might arise in your dataset. Let’s say for example that I had a data file saved as “data_messy.csv” that looked like this:

*Some data that I collected
*I cant remember when
name of person,location of person,racial category,gender of person,years of education
name,location,race,gender,yrsed
"Bobbi","Eugene,OR","Black","Female",12
"Jim","Seattle,WA","na","Male",11
"Mary","Oakland,CA","White","Female",16

There are several complications here. First, there are a couple of comment lines at the top, marked with the comment symbol “*”, that I don’t want processed. I could handle this with either the skip argument, to skip a certain number of rows before reading the data, or the comment argument, to define which lines to skip by the character they start with. Second, there is a line above the proper headers with a description of each variable. I could use the skip argument again here to skip this line. Finally, the lower-case “na” won’t be recognized by default as a missing value by R, but the na argument allows me to specify additional character strings like “na” that should be interpreted as missing values. All together, I use the command:
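# a sketch; note that exactly how skip counts comment lines has varied
# across readr versions
dat <- read_csv("data_messy.csv",
                skip = 1,
                comment = "*",
                na = c("", "NA", "na"))
dat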

## # A tibble: 4 x 5
##   `name of person` `location of pe… `racial categor… `gender of pers…
##   <chr>            <chr>            <chr>            <chr>           
## 1 name             location         race             gender          
## 2 Bobbi            Eugene,OR        Black            Female          
## 3 Jim              Seattle,WA       <NA>             Male            
## 4 Mary             Oakland,CA       White            Female          
## # … with 1 more variable: `years of education` <chr>

Note that because I used the comment argument, I only needed to specify skip=1.

Working with fixed-width text files

The second form that plain text data can take is “fixed-width” format, where each variable occupies a specified number of characters. For example, here is the same dataset in fixed-width format, saved as “data_fw.txt”:

BobbiEugene,OR BlackFemale12
Jim  Seattle,WAWhiteMale  11
Mary Oakland,CAWhiteFemale16

Notice that the actual starting location of each variable is the same within each row. If you count up the characters, you will see that the first variable has a width of 5 characters, the second variable has a width of 10 characters, and so on. Note also that this file does not contain headers, which is pretty typical of fixed-width files. We can read in this data with the read_fwf command by explicitly specifying the starting and ending positions of each variable:
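# positions counted off from the raw file above
dat <- read_fwf("data_fw.txt",
                col_positions = fwf_positions(start = c(1, 6, 16, 21, 27),
                                              end = c(5, 15, 20, 26, 28),
                                              col_names = c("name", "location",
                                                            "race", "gender", "yrsed")))
dat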

## # A tibble: 3 x 5
##   name  location   race  gender yrsed
##   <chr> <chr>      <chr> <chr>  <dbl>
## 1 Bobbi Eugene,OR  Black Female    12
## 2 Jim   Seattle,WA White Male      11
## 3 Mary  Oakland,CA White Female    16

I have to do a little more work here than I did with CSV files. First, I need to define a col_positions argument and feed in the results from the function fwf_positions. This fwf_positions function needs vectors of the starting and ending positions for each variable. I also use the col_names argument in this function to assign names to each of my variables.

Now I want to show you an example of fixed-width data that is more realistic. The data I will use here were created from an extract of the 2018 American Community Survey data downloaded from IPUMS. The data come zipped up as a gzip file with suffix “gz”, which is a standard method of compressing data files. You can download the file here. You can also view the codebook that IPUMS provides, which shows the starting and ending positions for each variable. Here is the relevant part of that codebook:

File Type:                    rectangular
Case Selection:               No
  Variable               Columns        Len    2018   
  YEAR               H   1-4            4      X 
  SAMPLE             H   5-10           6      X 
  SERIAL             H  11-18           8      X 
  CBSERIAL           H  19-31          13      X 
  HHWT               H  32-41          10      X 
  CLUSTER            H  42-54          13      X 
  STATEFIP           H  55-56           2      X 
  STRATA             H  57-68          12      X 
  GQ                 H  69              1      X 
  PERNUM             P  70-73           4      X 
  PERWT              P  74-83          10      X 
  SEX                P  84              1      X 
  AGE                P  85-87           3      X 
  YRMARR             P  88-91           4      X 
  EDUC               P  92-93           2      X 
  EDUCD              P  94-96           3      X 

I don’t need all of these variables. Some of them are just defaults that come with every IPUMS extract, like HHWT, CLUSTER, and STRATA. For my purposes, let’s say I only want the variables STATEFIP, YEAR, SEX, AGE, YRMARR, and EDUCD. Using read_fwf, I can just skip the positions of the variables I don’t need. Another nice feature of the readr package is that all of its functions can read directly from zipped data, which saves room on my computer and in my git repository (GitHub has a limit of 100MB for files). Here is my command to read in the data:
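# "usa_extract.dat.gz" is a stand-in name for the downloaded IPUMS extract;
# the column positions come from the codebook above
dat <- read_fwf("usa_extract.dat.gz",
                col_positions = fwf_positions(start = c(1, 55, 84, 85, 88, 94),
                                              end = c(4, 56, 84, 87, 91, 96),
                                              col_names = c("year", "statefip", "sex",
                                                            "age", "yrmarr", "educd")),
                col_types = cols(.default = col_integer()),
                progress = FALSE)
dat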

## # A tibble: 3,214,539 x 6
##     year statefip   sex   age yrmarr educd
##    <int>    <int> <int> <int>  <int> <int>
##  1  2018        1     2    19      0    65
##  2  2018        1     2    18      0    65
##  3  2018        1     1    53      0    64
##  4  2018        1     1    28      0    71
##  5  2018        1     2    25      0    30
##  6  2018        1     2    30      0    63
##  7  2018        1     2    66   1992    71
##  8  2018        1     1    38      0    30
##  9  2018        1     1    41      0    63
## 10  2018        1     2    18      0    65
## # … with 3,214,529 more rows

I am also using the col_types argument to force read_fwf to read in all variables as integers, because this is how all the data come from IPUMS. I also set progress=FALSE for the output here, but I can change that to progress=TRUE to get a progress bar as my data is read in that lets me know how long I can expect to wait.

Data in binary format

As a result of initiatives to make science more open, data is increasingly becoming available in simple text format, which improves its portability and accessibility. However, there are still many cases where data is available in a binary format that is readable only by a specific statistical software program. For example, the quick download page for the General Social Survey provides the comprehensive GSS data files but only in Stata and SPSS formats. That’s fine if you have purchased that software, but you are out of luck if you have not. This approach is Bad For Science, but you will still run into it quite a bit.

Luckily, R can usually still read those files. For aeons upon aeons, the preferred package for this was foreign, but the tidyverse has changed the game. The haven package can easily read in data from Stata, SAS, and SPSS. The readxl package can easily read in data from an Excel file.

To test these libraries out, I have loaded the example data I have been using into Stata and saved it as a binary Stata dataset (*.dta). I can use the read_dta function in haven to read in this data.
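Assuming the Stata file is saved as “data.dta” (a name I am making up here):

library(haven)
dat <- read_dta("data.dta")
dat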

## # A tibble: 3 x 5
##   name  location   gender yrsed      race
##   <chr> <chr>      <chr>  <dbl> <dbl+lbl>
## 1 Bobbi Eugene,OR  Female    12 1 [Black]
## 2 Jim   Seattle,WA Male      11 2 [White]
## 3 Mary  Oakland,CA Female    16 2 [White]

Everything looks good except that, because I didn’t encode location and gender in Stata as categorical variables, those variables show up in R as character strings rather than factor variables. If I wanted to turn either variable into a proper factor (categorical) variable, I can use the factor command:
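# convert gender from character to factor and tabulate it
dat$gender <- factor(dat$gender)
table(dat$gender)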

## Female   Male 
##      2      1

Saving data

R has its own binary format for keeping track of data. You can save any object or set of objects to your filesystem with the save command. This will save a file in a binary *.RData format. These objects can then be loaded back into R with the load command:
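# "data.RData" is a file name of my own choosing
save(dat, file = "data.RData")
load("data.RData")
dat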

## # A tibble: 3 x 5
##   name  location   gender yrsed      race
##   <chr> <chr>      <fct>  <dbl> <dbl+lbl>
## 1 Bobbi Eugene,OR  Female    12 1 [Black]
## 2 Jim   Seattle,WA Male      11 2 [White]
## 3 Mary  Oakland,CA Female    16 2 [White]

I recommend saving your own analytical data as RData files because these files are lightweight and contain all of the added meta-information that you have created for categorical variables and such. However, for sharing data more widely I recommend that you provide data in plain text format. The readr package has a variety of write_* functions, including write_csv, write_delim, and write_tsv, for outputting data as delimited plain text. For example, I could write my data to a CSV with:
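# the output file name here is up to you
write_csv(dat, "data_out.csv")

The resulting file looks like this: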

name,location,gender,yrsed,race
Bobbi,"Eugene,OR",Female,12,1
Jim,"Seattle,WA",Male,11,2
Mary,"Oakland,CA",Female,16,2

The haven package also has several write_* functions for outputting data in other binary formats. Here are the most common:

  • write_dta for a Stata data file
  • write_sav for an SPSS data file
  • write_xpt for a SAS transport file and write_sas for a SAS data file