Cleaning Data

Cleaning data is a major component of any quantitative project. Rarely will data come to you in exactly the form that you want for your research question. In this section, we will focus on three of the most important components of data cleaning: (1) assigning and handling missing values, (2) recoding variables, and (3) aggregating data.

For this analysis, we will use the same data extract from the American Community Survey that we used for the reading and writing data lab. I downloaded this extract from IPUMS. You can download the file here. You can also view the codebook that IPUMS provides.

The data is a sample of individuals in the US. In the read_fwf command below, I am only going to read in a few variables that we will use for our data cleaning examples.
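Here is a runnable sketch of what that `read_fwf` call could look like. The file, column positions, and widths below are all made up for a two-line toy file; the real start and end positions for each variable come from the IPUMS codebook for your extract:

```r
library(readr)

# a two-line stand-in for the raw fixed-width IPUMS file
# (layout invented for this sketch: year 1-4, statefip 5-6, sex 7,
#  age 8-9, yrmarr 10-13, educd 14-16)
tmp <- tempfile()
writeLines(c("2018012190000065",
             "2018012661992071"), tmp)

acs <- read_fwf(tmp,
                col_positions = fwf_positions(
                  start     = c(1, 5, 7, 8, 10, 14),
                  end       = c(4, 6, 7, 9, 13, 16),
                  col_names = c("year", "statefip", "sex", "age",
                                "yrmarr", "educd")),
                col_types = "iiiiii")  # read every column as integer
acs
```

The `col_types = "iiiiii"` shorthand forces all six columns to integers, matching the `<int>` columns in the output below.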

## # A tibble: 3,214,539 x 6
##     year statefip   sex   age yrmarr educd
##    <int>    <int> <int> <int>  <int> <int>
##  1  2018        1     2    19      0    65
##  2  2018        1     2    18      0    65
##  3  2018        1     1    53      0    64
##  4  2018        1     1    28      0    71
##  5  2018        1     2    25      0    30
##  6  2018        1     2    30      0    63
##  7  2018        1     2    66   1992    71
##  8  2018        1     1    38      0    30
##  9  2018        1     1    41      0    63
## 10  2018        1     2    18      0    65
## # … with 3,214,529 more rows

All of the variables shown here use integer values to record data, even in cases where the variable is categorical in nature. This is a common approach with many datasets, because it makes datasets smaller and more portable. Part of the cleaning process will be to use information in the codebook to convert from integer values to proper categorical variables.

The Most Important Rule: Check yourself before you wreck yourself

The number one rule of cleaning and processing data is to always check your code to make sure it is producing what you think it should be producing. In the examples below, I run several diagnostic checks to ensure that what I am producing looks correct. There are a variety of ways you can check your code, but it’s important to take your time here and be extra scrupulous. If you don’t properly clean your data, then everything you produce in the later analysis will be questionable at best. You don’t want this to happen to you.

Assigning missing values

One of the first things you should check in your data is how missing values are coded. In many cases, missing values will be coded with a numeric value rather than a proper missing value. If you don’t correct this, then all of your statistics will be incorrect. Codebooks that are provided with the data will normally identify the codes used for missing values. Usually, missing values are coded with numeric values that are impossible or well outside the range of the data, so that they can easily be identified. For example, the missing value for the year of marriage variable yrmarr in our dataset is zero. Since yrmarr is the calendar year of marriage, zero is a nonsensical value. However, 1,332,999 individuals have a value of zero. These are all individuals who should be NA on yrmarr because they have never been married. If you leave the zeros in, then you will drastically change any calculation you make with this variable.

Bracket and Boolean Approach

In R, we can easily code in missing values by identifying a subset of the variable with a boolean statement and then assigning the special NA value to that subset. I refer to this approach to replacing values in a variable as the bracket-and-boolean approach.
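Here is a sketch of that approach, using a tiny toy version of the `acs` data in place of the full extract (the values are made up):

```r
# toy stand-in for the acs data: yrmarr == 0 marks never-married respondents
acs <- data.frame(yrmarr = c(0, 0, 1992, 2005, 0, 1971))

summary(acs$yrmarr)  # the zeros drag down the mean and median

# bracket-and-boolean: select the zeros with a boolean, assign NA to them
acs$yrmarr[acs$yrmarr == 0] <- NA

summary(acs$yrmarr)  # now an NA count is reported instead
```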

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    1971    1166    1997    2018
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1937    1979    1993    1992    2006    2018 1332999

The mean and median change dramatically once I code in all of the missing values because I had a lot of never-married individuals without a proper value for yrmarr who were being reported as zero.

Ifelse approach

The bracket-and-boolean approach is one of the standard ways to re-code certain values for a variable in R, but another approach that I prefer for its greater flexibility is the ifelse command. An ifelse command takes three arguments. The first argument is a boolean statement. The second argument is what value should be returned if the first argument evaluates to TRUE and the third argument is what value should be returned if the first argument evaluates to FALSE. Here is how I can use the ifelse command in this example:
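A runnable sketch, again with a toy `acs` data frame standing in for the full extract:

```r
acs <- data.frame(yrmarr = c(0, 1992, 0, 2005))

# when yrmarr is 0 return NA, otherwise return the original yrmarr value
acs$yrmarr <- ifelse(acs$yrmarr == 0, NA, acs$yrmarr)
```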

When acs$yrmarr==0 is TRUE, the ifelse command returns NA, and when acs$yrmarr==0 is FALSE, the ifelse command returns the original value of yrmarr. This command accomplishes the same thing as the bracket-and-boolean approach above. Where the ifelse command really shines is when you can string together multiple “cascading” ifelse commands as I will demonstrate below.

Recoding

Recoding variables is an integral part of preparing your data for analysis. Rarely will all the variables come in exactly the form that you want for the research question at hand. In many cases, you will want to collapse and/or combine categorical variables and transform quantitative variables. It will also frequently be useful to use sensitivity analysis to try out multiple approaches to coding your variables to see how sensitive your results are to coding decisions. When you recode variables, it is usually a good idea to generate a new version of the variable with a different name. This allows you to check the new variable against the old variable to ensure that your code is working correctly.

Coding Values for Categorical Variables

As I noted above, all of the data from IPUMS come to us in numeric form. For example, let’s run a table command on the sex variable.

## 
##       1       2 
## 1574618 1639921

I have two unique values here: 1 and 2. How do these values compare to the categories that I expected? We can look at the codebook to see how the numeric values map onto the categories we want. Here is the relevant excerpt from the codebook:

SEX     Sex
1       Male
2       Female

I can accomplish this re-code using either the bracket-and-boolean approach or the ifelse approach. I am going to show the ifelse approach here first, because I can demonstrate the use of cascading ifelse statements. In both cases, I will assign my results to a new variable called gender.
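The cascading ifelse recode looks like this (sketched with a toy `acs` data frame in place of the full extract):

```r
acs <- data.frame(sex = c(1, 2, 2, 1))

# cascading ifelse: 1 -> "Male", 2 -> "Female", anything else -> NA
acs$gender <- ifelse(acs$sex == 1, "Male",
                     ifelse(acs$sex == 2, "Female", NA))
```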

What exactly is going on here? I have a second ifelse statement embedded inside the first statement for the final argument of that statement. What this approach does is create a cascade of ifelse statements that should pick up all of the possible cases. Anytime the boolean evaluates to TRUE, then I will assign the given value and pop out. Otherwise, I will continue on to the next statement. In simple terms, here is what the process looks like:

  1. Is the value a 1? If yes, then assign a value of “Male” to the gender variable. If no, then continue to (2).
  2. Is the value a 2? If so then assign a value of “Female” to the gender variable. If no, then move to (3).
  3. We have exhausted all possible logical cases. If you are still here, then assign a missing value.

Note that since I don’t actually have any cases other than a 1 or a 2, I could have done this with a single ifelse statement. However, by always making the final category an NA, I ensure that I pick up any missing values for cases that I don’t explicitly assign. This is good practice and helps us to avoid mistakes.

Since gender is a categorical variable, I really should make it a factor variable. The command above, however, did not produce a factor variable but rather a character string variable. To “factor it up” I need to run the factor command and specify my categories:
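A sketch of that step on a toy character vector:

```r
acs <- data.frame(gender = c("Male", "Female", "Female"))  # character, not factor

# convert to a factor, specifying the category order explicitly
acs$gender <- factor(acs$gender, levels = c("Male", "Female"))
```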

If I really want to be efficient, I could have wrapped my cascading ifelse statements in a factor command and done it all at once.
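That all-at-once version might look like this (again sketched on toy data):

```r
acs <- data.frame(sex = c(1, 2, 2))

# cascading ifelse wrapped directly in factor(), in a single step
acs$gender <- factor(ifelse(acs$sex == 1, "Male",
                            ifelse(acs$sex == 2, "Female", NA)),
                     levels = c("Male", "Female"))
```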

Now we need to check ourselves before we wreck ourselves. The most straightforward way to do this is to run a table command on the original and new variable to make sure all of the observations are being coded correctly. When I run this table command, I want to include the argument exclude=NULL so that missing value codes show up in the table.
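The check looks like this (a sketch on toy data; the real table on the full extract is shown below):

```r
acs <- data.frame(sex = c(1, 2, 2, 1))
acs$gender <- factor(ifelse(acs$sex == 1, "Male",
                            ifelse(acs$sex == 2, "Female", NA)),
                     levels = c("Male", "Female"))

# cross-tabulate the old code against the new variable;
# exclude=NULL keeps any NA values visible in the table
table(acs$sex, acs$gender, exclude = NULL)
```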

##    
##        Male  Female
##   1 1574618       0
##   2       0 1639921

Everything looks good. All the 1s are “Male” and all the 2s are “Female.”

What if I wanted to do this the bracket-and-boolean way? Well, typically the first step is to set up a new variable that is all missing values and then plug in the values one by one. Here is what it would look like:
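Sketched on the same toy data:

```r
acs <- data.frame(sex = c(1, 2, 2, 1))

# start with a variable that is all missing values
acs$gender <- NA

# then plug in the values one by one
acs$gender[acs$sex == 1] <- "Male"
acs$gender[acs$sex == 2] <- "Female"

# and factor it up
acs$gender <- factor(acs$gender, levels = c("Male", "Female"))
```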

##    
##        Male  Female
##   1 1574618       0
##   2       0 1639921

Collapsing Categorical Variables

Collapsing categorical variables means reducing the number of categories to a smaller set of categories, by lumping some categories together. As an example, take a look at the full set of categories for the detailed educational attainment category educd:

000     N/A or no schooling
001     N/A
002     No schooling completed
010     Nursery school to grade 4
011     Nursery school, preschool
012     Kindergarten
013     Grade 1, 2, 3, or 4
014     Grade 1
015     Grade 2
016     Grade 3
017     Grade 4
020     Grade 5, 6, 7, or 8
021     Grade 5 or 6
022     Grade 5
023     Grade 6
024     Grade 7 or 8
025     Grade 7
026     Grade 8
030     Grade 9
040     Grade 10
050     Grade 11
060     Grade 12
061     12th grade, no diploma
062     High school graduate or GED
063     Regular high school diploma
064     GED or alternative credential
065     Some college, but less than 1 year
070     1 year of college
071     1 or more years of college credit, no degree
080     2 years of college
081     Associate's degree, type not specified
082     Associate's degree, occupational program
083     Associate's degree, academic program
090     3 years of college
100     4 years of college
101     Bachelor's degree
110     5+ years of college
111     6 years of college (6+ in 1960-1970)
112     7 years of college
113     8+ years of college
114     Master's degree
115     Professional degree beyond a bachelor's degree
116     Doctoral degree
999     Missing

There are a lot of categories here. It’s unlikely that this fine level of detail will be helpful in any analyses that we will perform. Let’s also take a look at the case counts for these categories:

## 
##      1      2     11     12     14     15     16     17     22     23     25 
##  95147  97571  45463  37691  35302  38104  41690  41318  44435  56815  45653 
##     26     30     40     50     61     63     64     65     71     81    101 
##  66566  69213  76684  85612  52431 602113 103161 190344 381192 213684 486531 
##    114    115    116 
## 217891  53833  36095

A lot of categories are not even showing up in our data. The reason for these zero categories is that the set of categories was created to capture all of the ways that educational attainment was recorded across censuses since 1850, but many of the codes are not used in the particular 2018 data that we have.

Many scholars use a parsimonious four-category coding of educational attainment: less than high school diploma, high school diploma, some college including associate’s degree, four-year college degree or more. We can collapse our full set of categories into this four-category system fairly easily using cascading ifelse statements:
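Here is a sketch of that collapse on a toy subset of educd codes. The range breakpoints below are my reading of the crosstab that follows (1 is missing; 2–61 is less than high school; 62–64 is a high school diploma; 65–90 is some college; 101–116 is a four-year degree or more):

```r
acs <- data.frame(educd = c(1, 2, 61, 63, 64, 65, 81, 101, 116))

# cascading ifelse with >= & <= range booleans; code 1 falls through to NA
acs$edattain <- factor(
  ifelse(acs$educd >= 2   & acs$educd <= 61,  "Less than HS",
  ifelse(acs$educd >= 62  & acs$educd <= 64,  "HS Diploma",
  ifelse(acs$educd >= 65  & acs$educd <= 90,  "Some College",
  ifelse(acs$educd >= 101 & acs$educd <= 116, "College",
         NA)))),
  levels = c("Less than HS", "HS Diploma", "Some College", "College"))

table(acs$educd, acs$edattain, exclude = NULL)
```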

##      
##       Less than HS HS Diploma Some College College   <NA>
##   1              0          0            0       0  95147
##   2          97571          0            0       0      0
##   11         45463          0            0       0      0
##   12         37691          0            0       0      0
##   14         35302          0            0       0      0
##   15         38104          0            0       0      0
##   16         41690          0            0       0      0
##   17         41318          0            0       0      0
##   22         44435          0            0       0      0
##   23         56815          0            0       0      0
##   25         45653          0            0       0      0
##   26         66566          0            0       0      0
##   30         69213          0            0       0      0
##   40         76684          0            0       0      0
##   50         85612          0            0       0      0
##   61         52431          0            0       0      0
##   63             0     602113            0       0      0
##   64             0     103161            0       0      0
##   65             0          0       190344       0      0
##   71             0          0       381192       0      0
##   81             0          0       213684       0      0
##   101            0          0            0  486531      0
##   114            0          0            0  217891      0
##   115            0          0            0   53833      0
##   116            0          0            0   36095      0

Notice how I am pairing a >= with a <= in an AND boolean statement in order to define a range of values. The statement x>=A & x<=B will return true for all values of x between A and B.

Looking at the table, it’s clear that all of the numeric values are showing up in the right category, and all the cases coded as 1 (these are primarily youths) are correctly coded as missing values.

In some cases, recoding may involve incorporating information on multiple variables in the construction of one new variable. The process is no different from the cases discussed above except for requiring possibly more complex boolean statements.

Transforming Quantitative Variables

Quantitative variables can also be recoded in R, and this is typically much easier, as you just need to apply the right mathematical transformation. Let’s say, for example, that I want to create a new variable from yrmarr that is the age of the respondent at the start of the current marriage for all currently married respondents. I can get this information simply by subtracting the yrmarr variable from the year variable to get the duration of marriage and then subtracting this value from the respondent’s age.
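In code, that is one line of arithmetic (sketched on toy data; note the third respondent is never-married, so age_marr is NA):

```r
acs <- data.frame(year   = c(2018, 2018, 2018),
                  age    = c(66, 45, 19),
                  yrmarr = c(1992, 2005, NA))

# duration of marriage = year - yrmarr; age at marriage = age - duration
acs$age_marr <- acs$age - (acs$year - acs$yrmarr)
```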

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    14.0    22.0    26.0    29.5    34.0    95.0 1332999

Notice that this coding only works correctly because I already assigned missing values to the yrmarr variable above.

Creating Interval Categories From a Quantitative Variable

In some cases, I may want to turn a quantitative variable into a categorical variable that measures intervals of data. For example, I might want to turn age into ten-year age groups (e.g. 20-29, 30-39). I can do this sort of coding using the same bracket-and-boolean or cascading ifelse approaches outlined above. However, R also provides a built-in function called cut that makes this pretty easy to do:
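A sketch of the cut call on toy ages:

```r
acs <- data.frame(age = c(5, 19, 25, 38, 66, 90))

# ten-year age groups; right=FALSE makes intervals like [20,30)
# that include the lower bound and exclude the upper bound
acs$age_group <- cut(acs$age,
                     breaks = seq(from = 0, to = 100, by = 10),
                     right = FALSE)
table(acs$age_group)
```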

##   [0,10)  [10,20)  [20,30)  [30,40)  [40,50)  [50,60)  [60,70)  [70,80) 
##   338958   403401   378691   388612   380621   450316   443250   279682 
##  [80,90) [90,100) 
##   120200    30808

The most important argument to the cut function is the breakpoints for each interval. Here I have used the seq function to define a sequence of numbers from 0 to 100 at intervals of 10. I also specify right=FALSE so that the intervals are inclusive of the lower value and exclusive of the upper value. The age_group variable is now a proper factor variable and can be used like any other factor variable.

After Cleaning You Still Need to Tidy

Because you are adding new variables as you recode, the size of your dataset will often grow as you clean, organize, and re-code. This can quickly lead to a lot of clutter and this clutter can have consequences. When you have multiple variables with similar names, it is easy to mistakenly use the wrong one. To avoid confusion, it is always best to limit your final analytical dataset to just the variables that you want. The easiest way to do this in R is to put a subset command at the bottom of your organizational script and use the select argument to add variables you want to keep to it as you clean them. So, for example my command here might look like:
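Sketched on a one-row toy data frame carrying the cleaned variables:

```r
acs <- data.frame(year = 2018, statefip = 1, sex = 2, age = 19,
                  gender = "Female", edattain = "Some College",
                  age_marr = NA)

# keep only the cleaned variables for the final analytical dataset
acs <- subset(acs, select = c(statefip, gender, edattain, age_marr, age))
```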

## # A tibble: 3,214,539 x 5
##    statefip gender edattain     age_marr   age
##       <int> <fct>  <fct>           <int> <int>
##  1        1 Female Some College       NA    19
##  2        1 Female Some College       NA    18
##  3        1 Male   HS Diploma         NA    53
##  4        1 Male   Some College       NA    28
##  5        1 Female Less than HS       NA    25
##  6        1 Female HS Diploma         NA    30
##  7        1 Female Some College       40    66
##  8        1 Male   Less than HS       NA    38
##  9        1 Male   HS Diploma         NA    41
## 10        1 Female Some College       NA    18
## # … with 3,214,529 more rows

People often keep a bunch of cluttered variables in their data frame out of a misguided fear that they will lose them if they are dropped from the current dataset. But that is the wrong way to think about your data. If you discover that you actually want a variable you dropped, just add it to the subset command and re-source your data cleaning script. In most cases this will take minutes to process at most. That’s a reasonable price to pay for the clarity of a small analytical dataset.

Aggregating Data

It is not uncommon to want to aggregate data at one unit of analysis up to a higher unit of analysis. For example, the individual-level ACS data that we have contains a variable that identifies the state of residence for each respondent (statefip). What if I wanted to aggregate my individual-level data up to the state level? As a simple example, let’s say that I wanted to calculate the mean age of marriage of married individuals for each state.

The aggregate command is the easiest way to perform such tasks and it has the flexibility to do aggregations on multiple variables or multiple indices at the same time. There are two different syntax forms for aggregate but we will focus on the formula syntax. This syntax works like agg_var~by_var1+by_var2 where the agg_var is the variable that you want to aggregate on and by_var1 and by_var2 are the variables that you want to aggregate by. Here is a simple example of getting the mean age of marriage by state.
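In code (sketched on a four-row toy dataset with two states):

```r
acs <- data.frame(statefip = c(1, 1, 2, 2),
                  age_marr = c(28, 30, 25, 31))

# mean age at marriage by state; the formula method drops NA rows by default
state_agemarr <- aggregate(age_marr ~ statefip, data = acs, FUN = mean)
state_agemarr
```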

##   statefip age_marr
## 1        1 29.27548
## 2        2 29.92585
## 3        4 30.36038
## 4        5 29.39313
## 5        6 30.02206
## 6        8 29.72513

The results here are output as a data.frame aggregated to the level of states. The dataset contains an id for the state and the mean of age_marr for that state.

The example above only aggregated one variable, but we can also get the mean of multiple variables at once by using the cbind command within the formula:
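Sketched on the same kind of toy data:

```r
acs <- data.frame(statefip = c(1, 1, 2, 2),
                  age_marr = c(28, 30, 25, 31),
                  age      = c(50, 60, 40, 44))

# aggregate two variables at once by wrapping them in cbind()
state_means <- aggregate(cbind(age_marr, age) ~ statefip, data = acs,
                         FUN = mean)
state_means
```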

##   statefip age_marr      age
## 1        1 29.27548 56.28057
## 2        2 29.92585 53.06636
## 3        4 30.36038 56.75826
## 4        5 29.39313 55.65785
## 5        6 30.02206 55.22376
## 6        8 29.72513 54.17558

Now I get two variables for each state, mean age of marriage and mean age.

We can also aggregate across multiple dimensions at the same time. Let’s get mean age of marriage again, but this time by state and educational category.
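The call just adds a second by-variable to the formula (sketched on toy data with one state and two education levels):

```r
acs <- data.frame(statefip = c(1, 1, 1, 1),
                  edattain = c("HS Diploma", "HS Diploma",
                               "College", "College"),
                  age_marr = c(28, 30, 31, 33))

# one row of output per state/education combination ("long" format)
state_ed <- aggregate(age_marr ~ edattain + statefip, data = acs,
                      FUN = mean)
state_ed
```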

##       edattain statefip age_marr
## 1 Less than HS        1 29.63777
## 2   HS Diploma        1 29.20512
## 3 Some College        1 29.28409
## 4      College        1 29.18011
## 5 Less than HS        2 28.22311
## 6   HS Diploma        2 28.90491

Now I get four means for each state, one for each educational attainment category. It’s worth thinking a little about the structure of the output here, because it is different from what we are used to. I don’t have a single line for each observation (state) giving me the age of marriage as four separate variables. Instead, I have four separate lines for each state. This data format is called the long format. We will learn more in our next lab about how we can manipulate this format to get what we want.

What if we want to aggregate the proportion of respondents in each category of a categorical variable by state? The syntax here is a little messier, but we can do this by creating booleans for each category and then calculating means:
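A sketch of that boolean trick on toy data (the mean of a TRUE/FALSE variable is the proportion TRUE; I show three of the four categories here to keep it short, and the prop_* names inside cbind are just labels for the output columns):

```r
acs <- data.frame(statefip = c(1, 1, 1, 2, 2),
                  edattain = c("Less than HS", "HS Diploma", "College",
                               "HS Diploma", "HS Diploma"))

# a boolean indicator for each category
acs$lhs          <- acs$edattain == "Less than HS"
acs$hs_grad      <- acs$edattain == "HS Diploma"
acs$college_grad <- acs$edattain == "College"

# taking the mean of each boolean by state gives proportions
props <- aggregate(cbind(prop_lhs = lhs, prop_hs_grad = hs_grad,
                         prop_college_grad = college_grad) ~ statefip,
                   data = acs, FUN = mean)
props
```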

##   statefip  prop_lhs prop_hs_grad prop_scollege prop_college_grad
## 1        1 0.2784240    0.2580242     0.2653052         0.1982465
## 2        2 0.2907537    0.2587413     0.2665113         0.1839938
## 3        4 0.2812910    0.2038612     0.2814826         0.2333653
## 4        5 0.2969432    0.2821164     0.2503300         0.1706103
## 5        6 0.2980686    0.1712014     0.2549658         0.2757642
## 6        8 0.2389480    0.1763708     0.2509729         0.3337083

It’s also important to note that I have been calculating the mean in all of these examples, but I could also easily substitute a different function. For example, I could use sum on the above codes to get counts of individuals by education:
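Sketched on toy data (summing a boolean counts the TRUEs; the n_* names inside cbind are just output column labels):

```r
acs <- data.frame(statefip = c(1, 1, 1, 2),
                  edattain = c("HS Diploma", "HS Diploma", "College",
                               "College"))

acs$hs_grad      <- acs$edattain == "HS Diploma"
acs$college_grad <- acs$edattain == "College"

# swap FUN = mean for FUN = sum to get counts instead of proportions
counts <- aggregate(cbind(n_hs_grad = hs_grad,
                          n_college_grad = college_grad) ~ statefip,
                    data = acs, FUN = sum)
counts
```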

##   statefip  n_lhs n_hs_grad n_scollege n_college_grad
## 1        1  12925     11978      12316           9203
## 2        2   1871      1665       1715           1184
## 3        4  19087     13833      19100          15835
## 4        5   8772      8334       7395           5040
## 5        6 109649     62979      93793         101444
## 6        8  12956      9563      13608          18094