Reshaping and Merging Data
In this lab, I want to cover two important components of data wrangling: reshaping data from one format to another and merging two different datasets together. As an example for this section, we are going to use some data from the World Bank's World Development Indicators and the International Labour Organization. Both of these sites have nice designs that allow users to filter an extract of their choosing by country, year, and variable type. They also allow download of the data in a variety of formats, including CSV. The labor data provides information on unemployment rates by age and sex, and I have limited the years to 2014 or later. I then pulled an extract of the World Bank data from 2014 on all countries for the variables of GDP per capita, life expectancy at birth, and population size. You can download the World Bank extract here and the ILO data here.1
Ultimately, I want to create a final dataset that combines these sources. The final dataset should have an observation for each country with the variables of GDP per capita, life expectancy at birth, population size, and unemployment rate. In order to get there, though, I am going to have to deal with some data issues.
Reshaping
All of the data that we have looked at so far have been in what is called a wide format. Each row is an observation and each variable is a column. This is usually the easiest way for us to work with data, but sometimes data can come in a long format instead. In the long format, each row gives the value of a particular variable for a particular observation. The observation and variable are identified by id columns, and the values themselves are in a third column.
To give a simple example, here is a very small dataset in wide form:
name | age | height |
---|---|---|
Bob | 25 | 71 |
Susie | 37 | 65 |
Linda | 28 | 68 |
Here is that same dataset in long form:
name | variable | value |
---|---|---|
Bob | age | 25 |
Bob | height | 71 |
Susie | age | 37 |
Susie | height | 65 |
Linda | age | 28 |
Linda | height | 68 |
Notice that in the long format each person shows up once for each variable from the wide dataset. In this case, the “name” variable is my id variable that identifies unique observations. It’s possible to have more than one id variable in the long format. For example, I might also have a year variable which identifies an observation-year case.
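To make this concrete, here is a small sketch that builds the wide dataset above as a data.frame and then uses the melt function from the reshape2 library (which we will meet again below) to produce the long version:
library(reshape2)
# the small wide-format dataset from the table above
people <- data.frame(name=c("Bob","Susie","Linda"),
                     age=c(25,37,28),
                     height=c(71,65,68))
# melt to long format: "name" is the id variable, and age and height become
# rows identified by a "variable" column with the values in a "value" column
people_long <- melt(people, id.vars="name")
people_long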
Let’s take a look at the World Bank data that I have, and you will see another example of the long format.
library(readr)
worldbank <- read_csv("resources/worldbank.csv")
#better variable names
colnames(worldbank) <- c("country_name","country_code","series_name","series_code",
"time","time_code","value")
head(worldbank)
## # A tibble: 6 x 7
## country_name country_code series_name series_code time time_code value
## <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Afghanistan AFG GDP per capita … NY.GDP.PCAP… 2014 YR2014 6.34e2
## 2 Afghanistan AFG Life expectancy… SP.DYN.LE00… 2014 YR2014 6.04e1
## 3 Afghanistan AFG Population, tot… SP.POP.TOTL 2014 YR2014 3.16e7
## 4 Albania ALB GDP per capita … NY.GDP.PCAP… 2014 YR2014 4.57e3
## 5 Albania ALB Life expectancy… SP.DYN.LE00… 2014 YR2014 7.78e1
## 6 Albania ALB Population, tot… SP.POP.TOTL 2014 YR2014 2.89e6
You can see that this data is in long format, with the actual value for each of my variables given in the “value” column. The “country_code” variable identifies the country and the “series_code” variable identifies the specific variable. I also have a “time_code” variable, but in my case that is not particularly relevant since I only downloaded 2014 data. This is the standard format of the World Bank’s World Development Indicators data.
In order to create my analytical dataset, I want to reshape this data from long to wide. Given what we know about data.frames in R, it is actually possible to manipulate this dataset with our existing skills to get it into the right format, but this is not the best approach because it opens the door to significant errors. We could, for example, use a subset command to grab each variable’s values and save those as separate vectors. We could then use a data.frame command or a cbind command to combine those together into a new data.frame. However, we would be assuming that the ordering and existence of countries was consistent across all of the vectors that we are combining - and this assumption could prove dangerous if it turned out that in one vector country values were reversed or some were missing. We would then have incorrect values for some countries.
A better approach would be to save subsets that include the country code and then use the merge command that we will learn below to merge them together into a single dataset by country code. This will ensure that the values are correct for each country, but it still involves quite a bit of code.
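As a quick sketch of what that second approach might look like for just two of the three variables (the object names gdp, life, and temp here are only for illustration):
# grab each series separately, keeping the country code for matching
gdp <- subset(worldbank, series_code=="NY.GDP.PCAP.CD",
              select=c("country_code","value"))
life <- subset(worldbank, series_code=="SP.DYN.LE00.IN",
               select=c("country_code","value"))
colnames(gdp)[2] <- "gdp_cap"
colnames(life)[2] <- "life_exp"
# merge by country code so that values line up correctly across countries
temp <- merge(gdp, life, by="country_code", all=TRUE)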
An even better solution is to use one of the functions in R that can do this reshaping for us. Base R comes with a reshape command, but we will not be using that command here. Instead we are going to use the dcast function from the reshape2 library. You will need to install the reshape2 library to gain access to it.
The dcast function in the reshape2 library is one of the “cast” functions that will turn a dataset in long format into a dataset in wide format. There is an analogous function called melt to go from wide to long format (melting and casting, get it?). To use dcast we need to feed in three arguments:
- The name of the data.frame object in long form.
- A formula that defines the id variable(s) and the variable that identifies distinct variables for the wide format.
- A value.var argument that identifies which variable contains the actual values.
In my case, it should look like:
library(reshape2)
worldbank_wide <- dcast(worldbank, country_code + country_name + time ~ series_code,
value.var = "value")
head(worldbank_wide)
## country_code country_name time NY.GDP.PCAP.CD SP.DYN.LE00.IN
## 1 ABW Aruba 2014 NA 75.45110
## 2 ADO Andorra 2014 NA NA
## 3 AFG Afghanistan 2014 633.9479 60.37446
## 4 AGO Angola 2014 5232.7623 52.26688
## 5 ALB Albania 2014 4568.5688 77.83046
## 6 ARE United Arab Emirates 2014 44238.5998 77.36817
## SP.POP.TOTL
## 1 103441
## 2 72786
## 3 31627506
## 4 24227524
## 5 2893654
## 6 9086139
Just like that, we have data in wide format. Notice how the formula works here. My id variable(s) go to the left of the tilde. In this case, I didn’t really need to specify “time” because I only have one year, but I wanted to show you how this can work with multiple id variables. I also didn’t need both “country_code” and “country_name”, but by doing it this way I get both variables in my wide data format, which can be useful if I need to look up which country goes with which code. The series_code variable that identifies the variable names for the wide dataset goes to the right of the tilde.
My only problem now is that the variable names are ugly, but I can fix that up pretty easily:
colnames(worldbank_wide) <- c("country_code","country_name","year",
"gdp_cap","life_exp","pop")
head(worldbank_wide)
## country_code country_name year gdp_cap life_exp pop
## 1 ABW Aruba 2014 NA 75.45110 103441
## 2 ADO Andorra 2014 NA NA 72786
## 3 AFG Afghanistan 2014 633.9479 60.37446 31627506
## 4 AGO Angola 2014 5232.7623 52.26688 24227524
## 5 ALB Albania 2014 4568.5688 77.83046 2893654
## 6 ARE United Arab Emirates 2014 44238.5998 77.36817 9086139
Now let’s take a look at the ILO data and see what needs to be done with it. Note that I needed to specify a few column types that were giving readr parsing problems, although we don’t really care about those variables anyway. I also recast the tibble to a standard data.frame for the head command so you can see all of the variables.
ilo <- read_csv("resources/ilostat.csv",
col_types=cols(obs_status=col_character(),
obs_status.label=col_character()))
head(as.data.frame(ilo))
## collection collection.label ref_area ref_area.label source
## 1 YI Yearly indicators ALB Albania BA:480
## 2 YI Yearly indicators ALB Albania BA:480
## 3 YI Yearly indicators ALB Albania BA:480
## 4 YI Yearly indicators ALB Albania BA:480
## 5 YI Yearly indicators ALB Albania BA:480
## 6 YI Yearly indicators ALB Albania BA:480
## source.label indicator indicator.label
## 1 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## 2 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## 3 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## 4 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## 5 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## 6 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## sex sex.label classif1 classif1.label classif2
## 1 SEX_T Sex: Total AGE_10YRBANDS_TOTAL 10-year age bands: Total NA
## 2 SEX_T Sex: Total AGE_10YRBANDS_Y15-24 10-year age bands: 15-24 NA
## 3 SEX_M Sex: Male AGE_10YRBANDS_TOTAL 10-year age bands: Total NA
## 4 SEX_M Sex: Male AGE_10YRBANDS_Y15-24 10-year age bands: 15-24 NA
## 5 SEX_F Sex: Female AGE_10YRBANDS_TOTAL 10-year age bands: Total NA
## 6 SEX_F Sex: Female AGE_10YRBANDS_Y15-24 10-year age bands: 15-24 NA
## classif2.label classif3 classif3.label classif4 classif4.label classif5
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## classif5.label time obs_value obs_status obs_status.label note_classif
## 1 NA 2014 17.49 <NA> <NA> <NA>
## 2 NA 2014 39.05 <NA> <NA> <NA>
## 3 NA 2014 19.16 <NA> <NA> <NA>
## 4 NA 2014 42.55 <NA> <NA> <NA>
## 5 NA 2014 15.21 <NA> <NA> <NA>
## 6 NA 2014 32.55 <NA> <NA> <NA>
## note_classif.label note_indicator note_indicator.label notes_source
## 1 <NA> <NA> <NA> S3:5_T3:104
## 2 <NA> <NA> <NA> S3:5_T3:104
## 3 <NA> <NA> <NA> S3:5_T3:104
## 4 <NA> <NA> <NA> S3:5_T3:104
## 5 <NA> <NA> <NA> S3:5_T3:104
## 6 <NA> <NA> <NA> S3:5_T3:104
## notes_source.label
## 1 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
## 2 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
## 3 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
## 4 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
## 5 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
## 6 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
This data is also in a long format, but there is a lot going on here. The ILO breaks down unemployment rates by sex and age within each country over time, with detailed notes on data quality and sources. This is all interesting data, but I don’t need all of it, so let me use a subset command to pull out just what I need: the total unemployment rate across both sexes and all ages in 2014.
ilo <- subset(ilo, sex=="SEX_T" & classif1=="AGE_AGGREGATE_TOTAL" & time==2014,
select=c("ref_area", "ref_area.label","source","source.label",
"time","obs_value","note_classif","note_classif.label",
"notes_source","notes_source.label"))
head(ilo)
## # A tibble: 6 x 10
## ref_area ref_area.label source source.label time obs_value note_classif
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 ALB Albania BA:480 Labour forc… 2014 17.5 <NA>
## 2 ARG Argentina BA:153 Labour forc… 2014 7.03 <NA>
## 3 ARG Argentina BA:150 Labour forc… 2014 7.27 <NA>
## 4 ARM Armenia BA:867 Labour forc… 2014 17.6 <NA>
## 5 AUS Australia BA:518 Labour forc… 2014 6.07 <NA>
## 6 AUT Austria BA:536 Labour forc… 2014 5.62 <NA>
## # … with 3 more variables: note_classif.label <chr>, notes_source <chr>,
## # notes_source.label <chr>
You will notice right away that we have multiple entries for at least one country (Argentina). We can check this more formally by running a table command on ref_area.label or ref_area.
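Something like the following will do it, keeping only the countries that appear more than once:
country_tab <- table(ilo$ref_area.label)
country_tab[country_tab > 1]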
##
## Argentina Belarus Brazil China Kazakhstan Singapore
## 2 2 3 2 2 2
A few cases have more than one entry. Let’s check out Brazil more closely, as it has three entries:
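# pull up the Brazil rows, recast as a data.frame so that all columns print
as.data.frame(subset(ilo, ref_area=="BRA"))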
## ref_area ref_area.label source source.label time obs_value
## 1 BRA Brazil BA:3047 Labour force survey 2014 4.85
## 2 BRA Brazil BA:356 Labour force survey 2014 6.91
## 3 BRA Brazil BA:6355 Labour force survey 2014 6.81
## note_classif note_classif.label notes_source
## 1 <NA> <NA> S4:33_T2:239
## 2 <NA> <NA> S3:20_T2:239
## 3 C6:1059 Nonstandard age group: Including age 14 T2:83
## notes_source.label
## 1 Geographical coverage: Main cities or metropolitan areas | Age coverage - minimum age: 10 years old
## 2 Data reference period: September | Age coverage - minimum age: 10 years old
## 3 Age coverage - minimum age: 14 years old
It seems that some countries have unemployment rates from multiple sources. The notes provide some indication of how these different numbers were developed. There is no easy programmatic way to make a decision about which source to choose. If I were doing a real research project, I would carefully study the documentation here and reach a decision about each country separately. For our purposes, it seems like the broadest-scope measure is generally the first source listed for a country, so I will just take the first row for countries with multiple rows. I can do this easily in R using the duplicated command to remove repeat country entries. The duplicated command will return TRUE for values in a vector that duplicate earlier entries. I will also restrict this data to the variable that I actually care about, along with the country names and codes.
ilo <- subset(ilo, !duplicated(ref_area),
select=c("ref_area","ref_area.label","obs_value"))
head(ilo)
## # A tibble: 6 x 3
## ref_area ref_area.label obs_value
## <chr> <chr> <dbl>
## 1 ALB Albania 17.5
## 2 ARG Argentina 7.03
## 3 ARM Armenia 17.6
## 4 AUS Australia 6.07
## 5 AUT Austria 5.62
## 6 AZE Azerbaijan 4.91
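Re-running the table command on the country codes confirms that each country now shows up exactly once:
table(ilo$ref_area)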
##
## ALB ARG ARM AUS AUT AZE BEL BGR BIH BLR BMU BOL BRA BRB BTN CAN CHE CHL CHN COL
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## CRI CUB CUW CYM CYP CZE DEU DNK DOM DZA ECU EGY ESP EST ETH FIN FRA GBR GEO GRC
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## GTM GUM HKG HRV HUN IDN IMN IND IRL IRN ISL ISR ITA JAM JEY JOR JPN KAZ KGZ KHM
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## KOR KOS LKA LTU LUX LVA MAC MAR MDA MDV MEX MKD MLT MNE MNG MOZ MUS MYS NLD NOR
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## NPL NZL PAK PAN PER PHL POL PRI PRT PRY PSE QAT ROU RUS SAU SGP SMR SRB SUR SVK
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## SVN SWE THA TTO TUR TWN TZA UKR URY USA VEN VNM WSM ZAF ZWE
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Now, everything looks good. This dataset is still technically in a long format, but since I only have one variable (total unemployment rate), the difference between a long and a wide format is now just in the variable names. Changing the variable names gives me the format I want:
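# rename the columns to match the naming used in the World Bank data
colnames(ilo) <- c("country_code","country_name","unemployment")
head(ilo)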
## # A tibble: 6 x 3
## country_code country_name unemployment
## <chr> <chr> <dbl>
## 1 ALB Albania 17.5
## 2 ARG Argentina 7.03
## 3 ARM Armenia 17.6
## 4 AUS Australia 6.07
## 5 AUT Austria 5.62
## 6 AZE Azerbaijan 4.91
I now have the World Bank and ILO data in similar formats, with a country code, country name, and then the variables of interest. I just need to combine them into a single dataset. For that, I need to learn how to merge data.
Merging Data
In order to merge datasets, you need some kind of id variable(s) that can be used to link observations across datasets. In my data here, both sources use a three letter capitalized code to identify countries. I am crossing my fingers that they are using the same coding system and that I can use this code to link countries across the two datasets.
Generally speaking, there are two different kinds of merges one can perform. In a one-to-one merge, each observation in one dataset should be linked to one and only one observation in the other dataset. In this case, the identifier used to link the two datasets should be unique (not repeated) within each dataset. In this example, I am performing a one-to-one merge. I can check that my identifiers are unique by running the duplicated command:
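# count duplicated country codes in each dataset - both should be zero
sum(duplicated(worldbank_wide$country_code))
sum(duplicated(ilo$country_code))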
## [1] 0
## [1] 0
No duplicates, so we are off to a good start.
The other kind of merge that can be performed is a many-to-one merge. In this case, multiple observations in one dataset are all linked to a single observation in the other dataset. This is most common for multilevel data in which you are linking lower-level observations to the higher-level unit they are nested within. For example, I might want to link the academic records of elementary school students to information about their teachers. In this case, multiple students would be linked to the same teacher data. We will see more examples of this kind of merging next term, when we delve into multilevel models. In practice, R handles both of these merge types with the same basic syntax.
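Here is a toy sketch of a many-to-one merge with made-up student and teacher data (using the merge command explained next):
# hypothetical student-level data, each student linked to a teacher
students <- data.frame(student_id=1:4,
                       teacher_id=c("T1","T1","T2","T2"),
                       score=c(88,92,75,81))
# hypothetical teacher-level data
teachers <- data.frame(teacher_id=c("T1","T2"),
                       years_exp=c(10,3))
# many-to-one merge: teacher information is repeated for each student
merge(students, teachers, by="teacher_id")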
The merge command in R will take two different data.frame objects and merge them together by matching the columns identified in the by, by.x, or by.y options. Notice that I renamed the country codes in my datasets to both be country_code in order to simplify my merging by allowing me to use the by option rather than separate by.x and by.y options. It is also better to match on these three-letter codes than on the country_name variable, which is more likely to vary between the two datasets (one dataset may refer to “Ivory Coast” while another refers to “Cote d’Ivoire”, for example). You can also choose not to specify any id variable for the matching, in which case merge will try to match on every variable with the same name in the two datasets - this is not recommended.
Here is the basic merge command.
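# merge by country code, keeping rows from both datasets even when they
# have no match on the other side
test <- merge(worldbank_wide, ilo, by="country_code", all.x=TRUE, all.y=TRUE)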
Merging can be tricky and you should always check and double-check your results to make sure the merge worked properly. In this case, I have saved my merged data.frame as a new object named test. The all.x and all.y options indicate that the final merged object should include rows from both the first and second data.frame even if they didn’t find a match in the other data.frame.
Let’s begin our checks by examining the number of observations in our datasets.
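# number of countries in each source and in the merged result
nrow(worldbank_wide)
nrow(ilo)
nrow(test)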
## [1] 217
## [1] 115
## [1] 223
There is something odd going on here. The World Bank data included 217 countries, while the ILO data only had 115 countries. So the ILO data is presumably missing a lot of countries that are in the World Bank data. We expect this because the World Bank data has more coverage than the ILO data. However, when we merge the data together, we get 223 total countries. How did we end up with six more countries than are in the World Bank data?
There are three possible outcomes for each observation in our merged test dataset. In some cases, a match was made and there should be valid data for that observation. In other cases, an observation from the World Bank data failed to find a match in the ILO data. As I noted above, this is not problematic because we know the ILO data has lower coverage than the World Bank data. Finally, an observation from the ILO data may fail to find a match in the World Bank data. This last case is the problematic one, and it is where the extra six cases are coming from.
How can we find out what these extra six cases are? R has a very nice operator called %in% which can be used to identify cases of overlap and non-overlap between two variables. It returns a TRUE/FALSE value, which we can use in this case to identify the cases in the ILO data that were not matched to the World Bank data.
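# ILO countries whose codes never appear among the World Bank country codes
subset(ilo, !(country_code %in% worldbank_wide$country_code))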
## # A tibble: 6 x 3
## country_code country_name unemployment
## <chr> <chr> <dbl>
## 1 IMN Isle of Man 1.98
## 2 JEY Jersey 4.6
## 3 KOS Kosovo 35.2
## 4 PSE Occupied Palestinian Territory 27.0
## 5 ROU Romania 6.8
## 6 TWN Taiwan, China 3.96
Notice that by wrapping the entire %in% statement in a !, I basically turned this into “not in.” In colloquial terms, we are identifying all country codes in the ILO data that are not in the World Bank country codes.
Why did these six countries fail to find a match? If I look for the corresponding countries in the World Bank data, I can see that three of these cases (Isle of Man, Kosovo, and Romania) have different codes in the two datasets. The three remaining cases don’t exist in the World Bank data (Taiwan and Palestine for political reasons).
I can fix the code for Isle of Man, Kosovo, and Romania by changing these codes in the ILO dataset.
ilo$country_code[ilo$country_code=="IMN"] <- "IMY"
ilo$country_code[ilo$country_code=="KOS"] <- "KSV"
ilo$country_code[ilo$country_code=="ROU"] <- "ROM"
Now, I should be able to re-run my merge command and pick up those three matches. Note that I am setting all.y=FALSE because I can’t do anything about the last three cases of Palestine, Taiwan, and Jersey, so I might as well drop them.
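# re-run the merge, keeping all World Bank rows but dropping unmatched ILO rows
combined <- merge(worldbank_wide, ilo, by="country_code",
                  all.x=TRUE, all.y=FALSE)
nrow(combined)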
## [1] 217
The total size is now what I expected. I can run a final summary command to check that everything else looks right.
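# summary statistics for the final merged dataset
summary(combined)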
## country_code country_name.x year gdp_cap
## Length:217 Length:217 Min. :2014 Min. : 286
## Class :character Class :character 1st Qu.:2014 1st Qu.: 1827
## Mode :character Mode :character Median :2014 Median : 5469
## Mean :2014 Mean : 15755
## 3rd Qu.:2014 3rd Qu.: 16676
## Max. :2014 Max. :178713
## NA's :27
## life_exp pop country_name.y unemployment
## Min. :48.93 Min. :9.893e+03 Length:217 Min. : 0.180
## 1st Qu.:65.90 1st Qu.:7.647e+05 Class :character 1st Qu.: 4.888
## Median :73.51 Median :6.061e+06 Mode :character Median : 6.855
## Mean :71.43 Mean :3.348e+07 Mean : 8.732
## 3rd Qu.:77.47 3rd Qu.:2.231e+07 3rd Qu.:10.738
## Max. :83.98 Max. :1.364e+09 Max. :35.150
## NA's :18 NA's :1 NA's :105
Note that I now have a country_name.x and a country_name.y variable. This is because there was a country_name variable in both datasets. Rather than overwrite them, R has kept both in the final dataset with a suffix identifying which original dataset each came from. Sometimes this can be useful for testing whether the merge worked correctly. To clean things up, I can remove one of these variables with a subset command and then rename the other one.
combined <- subset(combined,
select=c("country_code","country_name.x","gdp_cap",
"life_exp","pop","unemployment"))
colnames(combined)[2] <- "country_name"
head(combined)
## country_code country_name gdp_cap life_exp pop unemployment
## 1 ABW Aruba NA 75.45110 103441 NA
## 2 ADO Andorra NA NA 72786 NA
## 3 AFG Afghanistan 633.9479 60.37446 31627506 NA
## 4 AGO Angola 5232.7623 52.26688 24227524 NA
## 5 ALB Albania 4568.5688 77.83046 2893654 17.49
## 6 ARE United Arab Emirates 44238.5998 77.36817 9086139 NA
I should note that the World Bank CSV file had notes at the bottom that weren’t part of the data. I had to open up the CSV file in Excel in order to remove this junk, and then re-save it to get data that would be machine-readable by R. This is a Very Bad but Unfortunately Common Practice.↩