Cleaning Data
Cleaning data is a major component of any quantitative project. Rarely will data come to you in exactly the form that you want for your research question. In this section, we will focus on three of the most important components of data cleaning: (1) assigning and handling missing values, (2) recoding variables, and (3) aggregating data.
For this analysis, we will use the same data extract from the American Community Survey that we used for the reading and writing data lab. I downloaded this extract from IPUMS. You can download the file here. You can also view the codebook that IPUMS provides
The data is a sample of individuals in the US. In the read_fwf
command below, I am only going to read in a few variables that we will use for our data cleaning examples.
acs <- read_fwf("resources/usa_00084.dat.gz",
col_positions = fwf_positions(start = c(1,55,84,85,88,94),
end = c(4,56,84,87,91,96),
col_names = c("year","statefip","sex",
"age","yrmarr","educd")),
col_types = cols(.default = "i"),
progress = FALSE)
acs
## # A tibble: 3,214,539 x 6
## year statefip sex age yrmarr educd
## <int> <int> <int> <int> <int> <int>
## 1 2018 1 2 19 0 65
## 2 2018 1 2 18 0 65
## 3 2018 1 1 53 0 64
## 4 2018 1 1 28 0 71
## 5 2018 1 2 25 0 30
## 6 2018 1 2 30 0 63
## 7 2018 1 2 66 1992 71
## 8 2018 1 1 38 0 30
## 9 2018 1 1 41 0 63
## 10 2018 1 2 18 0 65
## # … with 3,214,529 more rows
All of the variables shown here use integer values to record data, even in cases where the variable is categorical in nature. This is a common approach with many datasets, because it makes datasets smaller and more portable. Part of the cleaning process will be to use information in the codebook to convert from integer values to proper categorical variables.
The Most Important Rule: Check yourself before you wreck yourself
The number one rule of cleaning and processing data is to always check your code to make sure it is producing what you think it should be producing. In the examples below, I run several diagnostic checks to ensure that what I am producing looks correct. There are a variety of ways you can check your code, but its important to take your time here and be extra scrupulous. If you don’t properly clean your data, then everything you produce in the later analysis will be questionable at best. You don’t want this to happen to you.
Assigning missing values
One of the first things you should check in your data is how missing values are coded. In many cases, missing values will be coded with a numeric value rather than a proper missing value. If you don’t correct this, then all of your statistics will be incorrect. Codebooks that are provided with the data will normally identify the codes used for missing values. Usually, missing values are coded with numeric values that are impossible or well outside the range of the data, so that they can easily be identified. For example, the missing value for the year of marriage variable yrmarr
in our dataset is zero. Since yrmarr
is the calendar year of marriage, zero is a non-sensical value. However, 1,332,999 individuals have a value of zero. These are all individuals who should be NA
on yrmarr
because they have never been married. If you leave the zeros in, then you will drastically change any calculation you make with this variable.
Bracket and Boolean Approach
In R, we can easily code in missing values by identifying a subset of the variable with a boolean statement and then assigning the special NA
value to that subset. I refer to this approach to replacing values in a variabl as the bracket-and-boolean approach.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 1971 1166 1997 2018
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1937 1979 1993 1992 2006 2018 1332999
The mean and median change dramatically once I code in all of the missing values because I had a lot of never-married individuals without a proper value for yrmarr
who were being reported as zero.
Ifelse approach
The bracket-and-boolean approach is one of the standard ways to re-code certain values for a variable in R, but another approach that I prefer for its greater flexibility is the ifelse
command. An ifelse
command takes three arguments. The first argument is a boolean statement. The second argument is what value should be returned if the first argument evaluates to TRUE and the third argument is what value should be returned if the first argument evaluates to FALSE. Here is how I can use the ifelse
command in this example:
When acs$yrmarr==0
is TRUE, the ifelse
command returns NA
and when acs$yrmarr==0
is false, the ifelse
command returns the original value of yrmarr
. This command accomplishes the same thing as the bracket-and-boolean approach above. Where the ifelse
command really shines is when you can string together multiple “cascading” ifelse
commands as I will demonstrate below.
Recoding
Recoding variables is an integral part of preparing your data for analysis. Rarely will all the variables come in exactly the form that you want for the reserch question at hand. In many cases, you will want to collapse and/or combine categorical variables and transform quantitative variables. It will also frequently be useful to use sensitivity analysis to try out muliple approaches to coding your variables to see how sensitive your results are to coding decisions. When you recode variables, it is usually a good idea to generate a new version of the variable with a different name. This allows you to check the new variable against the old variable to ensure that your code is working correctly.
Coding Values for Categorical Variables
As I noted above, all of the data from IPUMS come to us in numeric form. For example, lets run a table command on the sex
variable.
##
## 1 2
## 1574618 1639921
I have two unique values here: 1 and 2. How do these values compare to the categories that I expected? We can look at the codebook to see how the numeric values map onto the categories we want. Here is the relevant excerpt from the codebook:
SEX Sex
1 Male
2 Female
I can accomplish this re-code using either the bracket-and-boolean approach or the ifelse approach. I am going to show the ifelse approach here first, because I can demonstrate the use of cascading ifelse
statements. In both cases, I will assign my results to a new variable called gender
.
What exactly is going on here? I have a second ifelse
statement embedded inside the first statement for the final argument of that statement. What this approach does is create a cascade of ifelse
statements that should pick up all of the possible cases. Anytime the boolean evaluates to TRUE, then I will assign the given value and pop out. Otherwise, I will continue on to the next statement. In simple terms, here is what the process looks like:
- Is the value a 1? If yes, then assign a value of “Male” to the
gender
variable. If no, then continue to (2). - Is the value a 2? If so then assign a value of “Female” to the
gender
variable. If no, then move to (3). - We have exhausted all possible logical cases. If you are still here, then assign a missing value.
Note that since I don’t actually have any cases other than a 1 or a 2, I could have done this with a single ifelse
statement. However, by always making the final category an NA
I ensure that pick up any missing values for cases that I don’t explicitly assign. This is good practice and helps us to avoid mistakes.
Since gender
is a categorical variable, I really should make it a factor variable, but the command above did not turn it into a factor variable, but rather a character string variable. To “factor it up” I need to run the factor command and specify my categories:
If I really want to be efficient, I could have wrapped my cascading ifelse
statements in a factor
command and done it all at once.
acs$gender <- factor(ifelse(acs$sex==1, "Male",
ifelse(acs$sex==2, "Female", NA)),
levels=c("Male","Female"))
Now we need to check ourselves before we wreck ourselves. The most straightforward way to do this is to run a table
command on the original and new variable to make sure all of the observations are being coded correctly. When I run this table command, I want to include the argument exclude=NULL
so that missing value codes show up in the table.
##
## Male Female
## 1 1574618 0
## 2 0 1639921
Everything looks good. All the 1s are “Male” and all the 2s are “Female.”
What if I wanted to do this the bracket-and-boolean way? Well, typically the first step is to set up a new variable that is all missing values and then plug in the values one by one. Here is what it would look like:
acs$gender <- NA
acs$gender[acs$sex==1] <- "Male"
acs$gender[acs$sex==2] <- "Female"
acs$gender <- factor(acs$gender,
levels=c("Male","Female"))
table(acs$sex, acs$gender, exclude=NULL)
##
## Male Female
## 1 1574618 0
## 2 0 1639921
Collapsing Categorical Variables
Collapsing categorical variables means reducing the number of categories to a smaller set of categories, by lumping some categories together. As an example, take a look at the full set of categories for the detailed educational attainment category educd
:
000 N/A or no schooling
001 N/A
002 No schooling completed
010 Nursery school to grade 4
011 Nursery school, preschool
012 Kindergarten
013 Grade 1, 2, 3, or 4
014 Grade 1
015 Grade 2
016 Grade 3
017 Grade 4
020 Grade 5, 6, 7, or 8
021 Grade 5 or 6
022 Grade 5
023 Grade 6
024 Grade 7 or 8
025 Grade 7
026 Grade 8
030 Grade 9
040 Grade 10
050 Grade 11
060 Grade 12
061 12th grade, no diploma
062 High school graduate or GED
063 Regular high school diploma
064 GED or alternative credential
065 Some college, but less than 1 year
070 1 year of college
071 1 or more years of college credit, no degree
080 2 years of college
081 Associate's degree, type not specified
082 Associate's degree, occupational program
083 Associate's degree, academic program
090 3 years of college
100 4 years of college
101 Bachelor's degree
110 5+ years of college
111 6 years of college (6+ in 1960-1970)
112 7 years of college
113 8+ years of college
114 Master's degree
115 Professional degree beyond a bachelor's degree
116 Doctoral degree
999 Missing
There are a a lot of categories here. Its unlikely that this fine level of detail will be helpful in any analyses that we will perform. Lets also take a look at the case counts for these categories:
##
## 1 2 11 12 14 15 16 17 22 23 25
## 95147 97571 45463 37691 35302 38104 41690 41318 44435 56815 45653
## 26 30 40 50 61 63 64 65 71 81 101
## 66566 69213 76684 85612 52431 602113 103161 190344 381192 213684 486531
## 114 115 116
## 217891 53833 36095
A lot of categories are not even showing up in our data. The reason for these zero categories is that the set of categories was created to capture all of the ways that educational attainment was recorded across censuses since 1850, but many of the codes are not used in the particular 2018 data that we have.
Many scholars use a parsimonious four-category coding of educational attainment: less than high school diploma, high school diploma, some college including associate’s degree, four-year college degree or more. We can collapse our full set of categories into this four-category system fairly easily using cascading ifelse
statements:
acs$edattain <- factor(ifelse(acs$educd>=2 & acs$educd<=61, "Less than HS",
ifelse(acs$educd>=62 & acs$educd<=64, "HS Diploma",
ifelse(acs$educd>=65 & acs$educd<=100, "Some College",
ifelse(acs$educd>=101 & acs$educd<=116, "College",
NA)))),
levels=c("Less than HS","HS Diploma","Some College","College"))
table(acs$educd, acs$edattain, exclude=NULL)
##
## Less than HS HS Diploma Some College College <NA>
## 1 0 0 0 0 95147
## 2 97571 0 0 0 0
## 11 45463 0 0 0 0
## 12 37691 0 0 0 0
## 14 35302 0 0 0 0
## 15 38104 0 0 0 0
## 16 41690 0 0 0 0
## 17 41318 0 0 0 0
## 22 44435 0 0 0 0
## 23 56815 0 0 0 0
## 25 45653 0 0 0 0
## 26 66566 0 0 0 0
## 30 69213 0 0 0 0
## 40 76684 0 0 0 0
## 50 85612 0 0 0 0
## 61 52431 0 0 0 0
## 63 0 602113 0 0 0
## 64 0 103161 0 0 0
## 65 0 0 190344 0 0
## 71 0 0 381192 0 0
## 81 0 0 213684 0 0
## 101 0 0 0 486531 0
## 114 0 0 0 217891 0
## 115 0 0 0 53833 0
## 116 0 0 0 36095 0
Notice how I am pairing a >=
with a <=
in an AND boolean statement in order to define a range of values. The statement x>=A & x<=B
will return true for all values of x between A and B.
Looking at the table, its clear that all of the numeric values are showing up in the right category and all the missing values coded at 1 (these are primarily youths) are coded as missing values correctly.
In some cases, recoding may involve incorporating information on multiple variables in the construction of one new variable. The process is no different from the cases discussed above except for requiring possibly more complex boolean statements.
Transforming Quantitative Variables
Quantitative variables can also be recoded in R and this is typically much easier as you just need to apply the right mathematical equation to it. Lets say for example that I want to create a new variable from yrmarr
that is the age of the respondent for the current marriage for all currently married respondents. I can get this information simply by subtracting the yrmarr
variable from the year
variable to get the duration of marriage and then subtracting this value from the respondent’s dage..
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.0 22.0 26.0 29.5 34.0 95.0 1332999
Notice that this coding only works correctly because I already assigned missing values to the yrmarr
variable above.
Creating Interval Categories From a Quantitative Variable
In some cases, I may want to turn a quantitative variable into a categorical variable that measures intervals of data. For example, I might want to turn age into ten-year age groups (e.g. 20-29, 30-39). I can do this sort of coding using the same bracket-and-boolean or cascading ifelse appoaches outlined above. However, R also provides a built-in function called cut
that makes this pretty easy to do:
acs$age_group <- cut(acs$age,
breaks=seq(from=0, to=100, by=10),
right=FALSE)
summary(acs$age_group)
## [0,10) [10,20) [20,30) [30,40) [40,50) [50,60) [60,70) [70,80)
## 338958 403401 378691 388612 380621 450316 443250 279682
## [80,90) [90,100)
## 120200 30808
The most important argument to the cut
function is the breakpoints for each interval. Here I have used the seq
function to define a sequence of numbers from 0 to 100 at intervals of 10. I also specify right=FALSE
so that the intervals are inclusive of the lower value and exclusive of the upper value. The age_group
variable is now a proper factor variable and can be used like any other factor variable.
After Cleaning You Still Need to Tidy
Because you are adding new variables as you recode, the size of your dataset will often grow as you clean, organize, and re-code. This can quickly lead to a lot of clutter and this clutter can have consequences. When you have multiple variables with similar names, it is easy to mistakenly use the wrong one. To avoid confusion, it is always best to limit your final analytical dataset to just the variables that you want. The easiest way to do this in R is to put a subset
command at the bottom of your organizational script and use the select argument to add variables you want to keep to it as you clean them. So, for example my command here might look like:
## # A tibble: 3,214,539 x 5
## statefip gender edattain age_marr age
## <int> <fct> <fct> <int> <int>
## 1 1 Female Some College NA 19
## 2 1 Female Some College NA 18
## 3 1 Male HS Diploma NA 53
## 4 1 Male Some College NA 28
## 5 1 Female Less than HS NA 25
## 6 1 Female HS Diploma NA 30
## 7 1 Female Some College 40 66
## 8 1 Male Less than HS NA 38
## 9 1 Male HS Diploma NA 41
## 10 1 Female Some College NA 18
## # … with 3,214,529 more rows
People often keep a bunch of cluttered variables in their data frame out of a misguided fear that they will lose them if they are dropped from the current dataset. But that is the wrong way to think about your data. If you discover that you actually want a variable you dropped, just add it to the subset command and re-source your cleaning data script. In most cases this will take minutes to process at most. Thats a reasonable price to pay for the clarity of a small analytical dataset.
Aggregating Data
It is not uncommon to want to aggregate data at one unit of analysis up to a higher unit of analysis. For example, the individual-level ACS data that we have contains a variable that identifies the state of residence for each respondent (statefip
). What if I wanted to aggregate my individual level data up to the state-level? As a simple example, lets say that I wanted to calculate the mean age of marriage of married individuals for each state.
The aggregate
command is the easiest way to perform such tasks and it has the flexibility to do aggregations on multiple variables or multiple indices at the same time. There are two different syntax forms for aggregate
but we will focus on the formula syntax. This syntax works like agg_var~by_var1+by_var2
where the agg_var
is the variable that you want to aggregate on and by_var1
and by_var2
are the variables that you want to aggregate by. Here is a simple example of getting the mean for marriage duration by state.
## statefip age_marr
## 1 1 29.27548
## 2 2 29.92585
## 3 4 30.36038
## 4 5 29.39313
## 5 6 30.02206
## 6 8 29.72513
The results here are output as a data.frame
aggregated to the level of states. The dataset contains an id for the state and the mean of age_marr
for that state.
The example above only aggregated one variable, but we can also get the mean of multiple variables at once by using the cbind
command within the formula:
## statefip age_marr age
## 1 1 29.27548 56.28057
## 2 2 29.92585 53.06636
## 3 4 30.36038 56.75826
## 4 5 29.39313 55.65785
## 5 6 30.02206 55.22376
## 6 8 29.72513 54.17558
Now I get two variables for each state, mean age of marriage and mean age.
We can also aggregate across multiple dimensions at the same time. Lets get mean age of marriage again, but this time by state and educational category.
## edattain statefip age_marr
## 1 Less than HS 1 29.63777
## 2 HS Diploma 1 29.20512
## 3 Some College 1 29.28409
## 4 College 1 29.18011
## 5 Less than HS 2 28.22311
## 6 HS Diploma 2 28.90491
Now I get four means for each state, one for each educational attainment category. Its worth thinking a little about the structure of the output here. This data structure is a little different than what we are used to. I don’t have a single line for each observation (state) giving me the age of marriage as four separate variables. Instead I have four separate lines for each state. This data format is called the long format and we are not used to working with it. We will learn more in our next lab about how we can manipulate this format to get what we want.
What if we want to aggregate the proportion of respondents in each category of a categorical variable by state? The syntax here is a little messier, but we can do this by creating booleans for each category and then calculating means:
temp <- aggregate(cbind(prop_lhs=edattain=="Less than HS",
prop_hs_grad=edattain=="HS Diploma",
prop_scollege=edattain=="Some College",
prop_college_grad=edattain=="College")~statefip,
data=acs, mean, na.rm=TRUE)
head(temp)
## statefip prop_lhs prop_hs_grad prop_scollege prop_college_grad
## 1 1 0.2784240 0.2580242 0.2653052 0.1982465
## 2 2 0.2907537 0.2587413 0.2665113 0.1839938
## 3 4 0.2812910 0.2038612 0.2814826 0.2333653
## 4 5 0.2969432 0.2821164 0.2503300 0.1706103
## 5 6 0.2980686 0.1712014 0.2549658 0.2757642
## 6 8 0.2389480 0.1763708 0.2509729 0.3337083
Its also important to note that I have been calculating the mean in all of these examples, but I could also easily substitute a different function. For example, I could use sum
on the above codes to get counts of individuals by education:
temp <- aggregate(cbind(n_lhs=edattain=="Less than HS",
n_hs_grad=edattain=="HS Diploma",
n_scollege=edattain=="Some College",
n_college_grad=edattain=="College")~statefip,
data=acs, sum, na.rm=TRUE)
head(temp)
## statefip n_lhs n_hs_grad n_scollege n_college_grad
## 1 1 12925 11978 12316 9203
## 2 2 1871 1665 1715 1184
## 3 4 19087 13833 19100 15835
## 4 5 8772 8334 7395 5040
## 5 6 109649 62979 93793 101444
## 6 8 12956 9563 13608 18094