Reshaping and Merging Data
In this lab, I want to cover two important components of data wrangling: reshaping data from one format to another and merging two different datasets together. As an example for this section, we are going to use some data from the World Bank's World Development Indicators and the International Labour Organization. Both of these sites have nice designs that allow users to filter an extract of their choosing by country, year, and variable type. They also allow download of the data in a variety of formats, including CSV. The labor data provides information on unemployment rates by age and sex, and I have limited the years to 2014 or later. I then pulled an extract of the World Bank data from 2014 on all countries for the variables of GDP per capita, life expectancy at birth, and population size. You can download the World Bank extract here and the ILO data here.1
Ultimately, I want to create a final dataset that combines these sources. The final dataset should have an observation for each country with the variables of GDP per capita, life expectancy at birth, population size, and unemployment rate. In order to get there, though, I am going to have to deal with some data issues.
Reshaping
All of the data that we have looked at so far have been in what is called a wide format. Each row is an observation and each variable is a column. This is usually the easiest way for us to work with data, but sometimes data can come in a long format instead. In the long format, each row gives the value of a particular variable for a particular observation. The observation and variable are identified by id columns, and the values themselves are in a third column.
To give a simple example, here is a very small dataset in wide form:
name | age | height |
---|---|---|
Bob | 25 | 71 |
Susie | 37 | 65 |
Linda | 28 | 68 |
Here is that same dataset in long form:
name | variable | value |
---|---|---|
Bob | age | 25 |
Bob | height | 71 |
Susie | age | 37 |
Susie | height | 65 |
Linda | age | 28 |
Linda | height | 68 |
Notice that in the long format each person shows up once for each variable from the wide dataset. In this case, the “name” variable is my id variable that identifies unique observations. It’s possible to have more than one id variable in the long format. For example, I might also have a year variable which identifies an observation-year case.
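To make this concrete, here is a small sketch that builds the wide dataset above as a data.frame and then uses the melt function from the reshape2 library (which we will meet again below) to produce the long version:
library(reshape2)
# the small wide-format dataset from the table above
people <- data.frame(name=c("Bob","Susie","Linda"),
                     age=c(25,37,28),
                     height=c(71,65,68))
# melt to long format: "name" is the id variable, and age and height become
# rows identified by a "variable" column with the values in a "value" column
people_long <- melt(people, id.vars="name")
people_long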
Let’s take a look at the World Bank data that I have, and you will see another example of the long format.
library(readr)
worldbank <- read_csv("resources/worldbank.csv")
#better variable names
colnames(worldbank) <- c("country_name","country_code","series_name","series_code",
"time","time_code","value")
head(worldbank)
## # A tibble: 6 x 7
## country_name country_code series_name series_code time time_code value
## <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Afghanistan AFG GDP per capita … NY.GDP.PCAP… 2014 YR2014 6.34e2
## 2 Afghanistan AFG Life expectancy… SP.DYN.LE00… 2014 YR2014 6.04e1
## 3 Afghanistan AFG Population, tot… SP.POP.TOTL 2014 YR2014 3.16e7
## 4 Albania ALB GDP per capita … NY.GDP.PCAP… 2014 YR2014 4.57e3
## 5 Albania ALB Life expectancy… SP.DYN.LE00… 2014 YR2014 7.78e1
## 6 Albania ALB Population, tot… SP.POP.TOTL 2014 YR2014 2.89e6
You can see that this data is in long format, with the actual value for each of my variables given in the “value” column. The “country_code” variable identifies the country and the “series_code” variable identifies the specific variable. I also have a “time_code” variable, but in my case that is not particularly relevant since I only downloaded 2014 data. This is the standard format of the World Bank’s World Development Indicators data.
In order to create my analytical dataset, I want to reshape this data from long to wide. Given what we know about data.frames in R, it is actually possible to manipulate this dataset with our existing skills to get it into the right format, but this is not the best approach because it opens the door to significant errors. We could, for example, use a subset command to grab each variable’s values and save those as separate vectors. We could then use a data.frame command or a cbind command to combine those together into a new data.frame. However, we would be assuming that the ordering and existence of countries was consistent across all of the vectors that we are combining - and this assumption could prove dangerous if it turned out that in one vector country values were reversed or some were missing. We would then have incorrect values for some countries.
A better approach would be to save subsets that include the country code and then use the merge command that we will learn below to merge them together into a single dataset by country code. This will ensure that the values are correct for each country, but it still involves quite a bit of code.
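As a quick sketch of what that second approach might look like for just two of the three variables (the object names gdp, life, and temp here are only for illustration):
# grab each series separately, keeping the country code for matching
gdp <- subset(worldbank, series_code=="NY.GDP.PCAP.CD",
              select=c("country_code","value"))
life <- subset(worldbank, series_code=="SP.DYN.LE00.IN",
               select=c("country_code","value"))
colnames(gdp)[2] <- "gdp_cap"
colnames(life)[2] <- "life_exp"
# merge by country code so that values line up correctly across countries
temp <- merge(gdp, life, by="country_code", all=TRUE)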
An even better solution is to use one of the functions in R that can do this reshaping for us. Base R comes with a reshape command, but we will not be using that command here. Instead we are going to use the dcast function from the reshape2 library. You will need to install the reshape2 library to gain access to it.
The dcast function in the reshape2 library is one of the “cast” functions that will turn a dataset in long format into a dataset in wide format. There is an analogous function called melt to go from wide to long format (melting and casting, get it?). To use dcast we need to feed in three arguments:
- The name of the data.frame object in long form.
- A formula that defines the id variable(s) and the variable that identifies distinct variables for the wide format.
- A value.var argument that identifies which variable contains the actual values.
In my case, it should look like:
library(reshape2)
worldbank_wide <- dcast(worldbank, country_code + country_name + time ~ series_code,
value.var = "value")
head(worldbank_wide)
## country_code country_name time NY.GDP.PCAP.CD SP.DYN.LE00.IN
## 1 ABW Aruba 2014 NA 75.45110
## 2 ADO Andorra 2014 NA NA
## 3 AFG Afghanistan 2014 633.9479 60.37446
## 4 AGO Angola 2014 5232.7623 52.26688
## 5 ALB Albania 2014 4568.5688 77.83046
## 6 ARE United Arab Emirates 2014 44238.5998 77.36817
## SP.POP.TOTL
## 1 103441
## 2 72786
## 3 31627506
## 4 24227524
## 5 2893654
## 6 9086139
Just like that, we have data in wide format. Notice how the formula works here. My id variable(s) go to the left of the tilde. In this case, I didn’t really need to specify “time” because I only have one year, but I wanted to show you how this can work with multiple id variables. I also didn’t need both “country_code” and “country_name”, but by doing it this way I get both variables in my wide data format, which can be useful if I need to look up which country goes with which code. The series_code variable that identifies the variable names for the wide dataset goes to the right of the tilde.
My only problem now is that the variable names are ugly, but I can fix that up pretty easily:
colnames(worldbank_wide) <- c("country_code","country_name","year",
"gdp_cap","life_exp","pop")
head(worldbank_wide)
## country_code country_name year gdp_cap life_exp pop
## 1 ABW Aruba 2014 NA 75.45110 103441
## 2 ADO Andorra 2014 NA NA 72786
## 3 AFG Afghanistan 2014 633.9479 60.37446 31627506
## 4 AGO Angola 2014 5232.7623 52.26688 24227524
## 5 ALB Albania 2014 4568.5688 77.83046 2893654
## 6 ARE United Arab Emirates 2014 44238.5998 77.36817 9086139
Now let’s take a look at the ILO data and see what needs to be done with it. Note that I needed to specify a few column types that were giving readr parsing problems, although we don’t really care about those variables anyway. I also recast the tibble to a standard data.frame for the head command so you can see all of the variables.
ilo <- read_csv("resources/ilostat.csv",
col_types=cols(obs_status=col_character(),
obs_status.label=col_character()))
head(as.data.frame(ilo))
## collection collection.label ref_area ref_area.label source
## 1 YI Yearly indicators ALB Albania BA:480
## 2 YI Yearly indicators ALB Albania BA:480
## 3 YI Yearly indicators ALB Albania BA:480
## 4 YI Yearly indicators ALB Albania BA:480
## 5 YI Yearly indicators ALB Albania BA:480
## 6 YI Yearly indicators ALB Albania BA:480
## source.label indicator indicator.label
## 1 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## 2 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## 3 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## 4 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## 5 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## 6 Labour force survey UNE_DEAP_SEX_AGE_RT Unemployment rate by sex and age (%)
## sex sex.label classif1 classif1.label classif2
## 1 SEX_T Sex: Total AGE_10YRBANDS_TOTAL 10-year age bands: Total NA
## 2 SEX_T Sex: Total AGE_10YRBANDS_Y15-24 10-year age bands: 15-24 NA
## 3 SEX_M Sex: Male AGE_10YRBANDS_TOTAL 10-year age bands: Total NA
## 4 SEX_M Sex: Male AGE_10YRBANDS_Y15-24 10-year age bands: 15-24 NA
## 5 SEX_F Sex: Female AGE_10YRBANDS_TOTAL 10-year age bands: Total NA
## 6 SEX_F Sex: Female AGE_10YRBANDS_Y15-24 10-year age bands: 15-24 NA
## classif2.label classif3 classif3.label classif4 classif4.label classif5
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## classif5.label time obs_value obs_status obs_status.label note_classif
## 1 NA 2014 17.49 <NA> <NA> <NA>
## 2 NA 2014 39.05 <NA> <NA> <NA>
## 3 NA 2014 19.16 <NA> <NA> <NA>
## 4 NA 2014 42.55 <NA> <NA> <NA>
## 5 NA 2014 15.21 <NA> <NA> <NA>
## 6 NA 2014 32.55 <NA> <NA> <NA>
## note_classif.label note_indicator note_indicator.label notes_source
## 1 <NA> <NA> <NA> S3:5_T3:104
## 2 <NA> <NA> <NA> S3:5_T3:104
## 3 <NA> <NA> <NA> S3:5_T3:104
## 4 <NA> <NA> <NA> S3:5_T3:104
## 5 <NA> <NA> <NA> S3:5_T3:104
## 6 <NA> <NA> <NA> S3:5_T3:104
## notes_source.label
## 1 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
## 2 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
## 3 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
## 4 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
## 5 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
## 6 Data reference period: Annual or annual average | Age coverage - maximum age: 74 years old
This data is also in a long format, but there is a lot going on here. The ILO breaks down unemployment rates by sex and age within each country over time, with detailed notes on data quality and sources. This is all interesting data, but I don’t need all of it, so let me use a subset command to pull out just what I need: the total unemployment rate across both sexes and all ages in 2014.
ilo <- subset(ilo, sex=="SEX_T" & classif1=="AGE_AGGREGATE_TOTAL" & time==2014,
select=c("ref_area", "ref_area.label","source","source.label",
"time","obs_value","note_classif","note_classif.label",
"notes_source","notes_source.label"))
head(ilo)
## # A tibble: 6 x 10
## ref_area ref_area.label source source.label time obs_value note_classif
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 ALB Albania BA:480 Labour forc… 2014 17.5 <NA>
## 2 ARG Argentina BA:153 Labour forc… 2014 7.03 <NA>
## 3 ARG Argentina BA:150 Labour forc… 2014 7.27 <NA>
## 4 ARM Armenia BA:867 Labour forc… 2014 17.6 <NA>
## 5 AUS Australia BA:518 Labour forc… 2014 6.07 <NA>
## 6 AUT Austria BA:536 Labour forc… 2014 5.62 <NA>
## # … with 3 more variables: note_classif.label <chr>, notes_source <chr>,
## # notes_source.label <chr>
You will notice right away that we have multiple entries for at least one country (Argentina). We can check this more formally by running a table command on ref_area.label or ref_area.
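Something like the following will do it, keeping only the countries that appear more than once:
country_tab <- table(ilo$ref_area.label)
country_tab[country_tab > 1]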
##
## Argentina Belarus Brazil China Kazakhstan Singapore
## 2 2 3 2 2 2
A few cases have more than one entry. Let’s check out Brazil more closely, as it has three entries:
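# pull up the Brazil rows, recast as a data.frame so that all columns print
as.data.frame(subset(ilo, ref_area=="BRA"))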
## ref_area ref_area.label source source.label time obs_value
## 1 BRA Brazil BA:3047 Labour force survey 2014 4.85
## 2 BRA Brazil BA:356 Labour force survey 2014 6.91
## 3 BRA Brazil BA:6355 Labour force survey 2014 6.81
## note_classif note_classif.label notes_source
## 1 <NA> <NA> S4:33_T2:239
## 2 <NA> <NA> S3:20_T2:239
## 3 C6:1059 Nonstandard age group: Including age 14 T2:83
## notes_source.label
## 1 Geographical coverage: Main cities or metropolitan areas | Age coverage - minimum age: 10 years old
## 2 Data reference period: September | Age coverage - minimum age: 10 years old
## 3 Age coverage - minimum age: 14 years old
It seems that some countries have unemployment rates from multiple sources. The notes provide some indication of how these different numbers were developed. There is no easy programmatic way to make a decision about which source to choose. If I were doing a real research project, I would carefully study the documentation here and reach a decision about each country separately. For our purposes, it seems like the broadest-scope measure is generally the first source listed for a country, so I will just take the first row for countries with multiple rows. I can do this easily in R using the duplicated command to remove repeat country entries. The duplicated command will return TRUE for values in a vector that duplicate earlier entries. I will also restrict this data to the variable that I actually care about, along with the country names and codes.
ilo <- subset(ilo, !duplicated(ref_area),
select=c("ref_area","ref_area.label","obs_value"))
head(ilo)
## # A tibble: 6 x 3
## ref_area ref_area.label obs_value
## <chr> <chr> <dbl>
## 1 ALB Albania 17.5
## 2 ARG Argentina 7.03
## 3 ARM Armenia 17.6
## 4 AUS Australia 6.07
## 5 AUT Austria 5.62
## 6 AZE Azerbaijan 4.91
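Re-running the table command on the country codes confirms that each country now shows up exactly once:
table(ilo$ref_area)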
##
## ALB ARG ARM AUS AUT AZE BEL BGR BIH BLR BMU BOL BRA BRB BTN CAN CHE CHL CHN COL
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## CRI CUB CUW CYM CYP CZE DEU DNK DOM DZA ECU EGY ESP EST ETH FIN FRA GBR GEO GRC
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## GTM GUM HKG HRV HUN IDN IMN IND IRL IRN ISL ISR ITA JAM JEY JOR JPN KAZ KGZ KHM
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## KOR KOS LKA LTU LUX LVA MAC MAR MDA MDV MEX MKD MLT MNE MNG MOZ MUS MYS NLD NOR
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## NPL NZL PAK PAN PER PHL POL PRI PRT PRY PSE QAT ROU RUS SAU SGP SMR SRB SUR SVK
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## SVN SWE THA TTO TUR TWN TZA UKR URY USA VEN VNM WSM ZAF ZWE
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Now, everything looks good. This dataset is still technically in a long format, but since I only have one variable (total unemployment rate), the difference between a long and a wide format is now just in the variable names. Changing the variable names gives me the format I want:
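# rename the columns to match the naming used in the World Bank data
colnames(ilo) <- c("country_code","country_name","unemployment")
head(ilo)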
## # A tibble: 6 x 3
## country_code country_name unemployment
## <chr> <chr> <dbl>
## 1 ALB Albania 17.5
## 2 ARG Argentina 7.03
## 3 ARM Armenia 17.6
## 4 AUS Australia 6.07
## 5 AUT Austria 5.62
## 6 AZE Azerbaijan 4.91
I now have the World Bank and ILO data in similar formats, with a country code, country name, and then the variables of interest. I just need to combine them into a single dataset. For that, I need to learn how to merge data.
Merging Data
In order to merge datasets, you need some kind of id variable(s) that can be used to link observations across datasets. In my data here, both sources use a three letter capitalized code to identify countries. I am crossing my fingers that they are using the same coding system and that I can use this code to link countries across the two datasets.
Generally speaking, there are two different kinds of merges one can perform. In a one-to-one merge, each observation in one dataset should be linked to one and only one observation in the other dataset. In this case, the identifier used to link the two datasets should be unique (not repeated) within each dataset. In this example, I am performing a one-to-one merge. I can check that my identifiers are unique by running the duplicated command:
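# count duplicated country codes in each dataset - both should be zero
sum(duplicated(worldbank_wide$country_code))
sum(duplicated(ilo$country_code))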
## [1] 0
## [1] 0
No duplicates, so we are off to a good start.
The other kind of merge that can be performed is a many-to-one merge. In this case, multiple observations in one dataset are all linked to a single observation in the other dataset. This is most common for multilevel data in which you are linking lower-level observations to the higher-level unit they are nested within. For example, I might want to link the academic records of elementary school students to information about their teachers. In this case, multiple students would be linked to the same teacher data. We will see more examples of this kind of merging next term, when we delve into multilevel models. In practice, R handles both of these merge types with the same basic syntax.
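Here is a toy sketch of a many-to-one merge with made-up student and teacher data (using the merge command explained next):
# hypothetical student-level data, each student linked to a teacher
students <- data.frame(student_id=1:4,
                       teacher_id=c("T1","T1","T2","T2"),
                       score=c(88,92,75,81))
# hypothetical teacher-level data
teachers <- data.frame(teacher_id=c("T1","T2"),
                       years_exp=c(10,3))
# many-to-one merge: teacher information is repeated for each student
merge(students, teachers, by="teacher_id")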
The merge command in R will take two different data.frame objects and merge them together by matching the columns identified in the by, by.x, or by.y options. Notice that I renamed the country codes in my datasets to both be country_code in order to simplify my merging by allowing me to use the by option rather than separate by.x and by.y options. It is also better to match on these three-letter codes than on the country_name variable, which is more likely to vary between the two datasets (one dataset may refer to “Ivory Coast” while another refers to “Cote d’Ivoire”, for example). You can also choose not to specify any id variable for the matching, in which case merge will try to match on every variable with the same name in the two datasets - this is not recommended.
Here is the basic merge command.
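# merge by country code, keeping rows from both datasets even when they
# have no match on the other side
test <- merge(worldbank_wide, ilo, by="country_code", all.x=TRUE, all.y=TRUE)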
Merging can be tricky and you should always check and double-check your results to make sure the merge worked properly. In this case, I have saved my merged data.frame as a new object named test. The all.x and all.y options indicate that the final merged object should include rows from both the first and second data.frame even if they didn’t find a match in the other data.frame.
Let’s begin our checks by examining the number of observations in our datasets.
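# number of countries in each source and in the merged result
nrow(worldbank_wide)
nrow(ilo)
nrow(test)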
## [1] 217
## [1] 115
## [1] 223
There is something odd going on here. The World Bank data included 217 countries, while the ILO data only had 115 countries. So the ILO data is presumably missing a lot of countries that are in the World Bank data. We expect this because the World Bank data has more coverage than the ILO data. However, when we merge the data together, we get 223 total countries. How did we end up with six more countries than are in the World Bank data?
There are three possible outcomes for each observation in our merged test dataset. In some cases, a match was made and there should be valid data for that observation. In other cases, an observation from the World Bank data failed to find a match in the ILO data. As I noted above, this is not problematic because we know the ILO data has lower coverage than the World Bank data. Finally, an observation from the ILO data may fail to find a match in the World Bank data. This last case is the problematic one, and it is where the extra six cases are coming from.
How can we find out what these extra six cases are? R has a very nice operator called %in% which can be used to identify cases of overlap and non-overlap between two variables. It returns a TRUE/FALSE value, which we can use in this case to identify the cases in the ILO data that were not matched to the World Bank data.
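# ILO countries whose codes never appear among the World Bank country codes
subset(ilo, !(country_code %in% worldbank_wide$country_code))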
## # A tibble: 6 x 3
## country_code country_name unemployment
## <chr> <chr> <dbl>
## 1 IMN Isle of Man 1.98
## 2 JEY Jersey 4.6
## 3 KOS Kosovo 35.2
## 4 PSE Occupied Palestinian Territory 27.0
## 5 ROU Romania 6.8
## 6 TWN Taiwan, China 3.96
Notice that by wrapping the entire %in% statement in a !, I basically turned this into “not in.” In colloquial terms, we are identifying all country codes in the ILO data that are not in the World Bank country codes.
Why did these six countries fail to find a match? If I look for the corresponding countries in the World Bank data, I can see that three of these cases (Isle of Man, Kosovo, and Romania) have different codes in the two datasets. The three remaining cases don’t exist in the World Bank data (Taiwan and Palestine for political reasons).
I can fix the code for Isle of Man, Kosovo, and Romania by changing these codes in the ILO dataset.
ilo$country_code[ilo$country_code=="IMN"] <- "IMY"
ilo$country_code[ilo$country_code=="KOS"] <- "KSV"
ilo$country_code[ilo$country_code=="ROU"] <- "ROM"
Now, I should be able to re-run my merge command and pick up those three matches. Note that I am setting all.y=FALSE because I can’t do anything about the last three cases of Palestine, Taiwan, and Jersey, so I might as well drop them.
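# re-run the merge, keeping all World Bank rows but dropping unmatched ILO rows
combined <- merge(worldbank_wide, ilo, by="country_code",
                  all.x=TRUE, all.y=FALSE)
nrow(combined)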
## [1] 217
The total size is now what I expected. I can run a final summary command to check that everything else looks right.
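# summary statistics for the final merged dataset
summary(combined)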
## country_code country_name.x year gdp_cap
## Length:217 Length:217 Min. :2014 Min. : 286
## Class :character Class :character 1st Qu.:2014 1st Qu.: 1827
## Mode :character Mode :character Median :2014 Median : 5469
## Mean :2014 Mean : 15755
## 3rd Qu.:2014 3rd Qu.: 16676
## Max. :2014 Max. :178713
## NA's :27
## life_exp pop country_name.y unemployment
## Min. :48.93 Min. :9.893e+03 Length:217 Min. : 0.180
## 1st Qu.:65.90 1st Qu.:7.647e+05 Class :character 1st Qu.: 4.888
## Median :73.51 Median :6.061e+06 Mode :character Median : 6.855
## Mean :71.43 Mean :3.348e+07 Mean : 8.732
## 3rd Qu.:77.47 3rd Qu.:2.231e+07 3rd Qu.:10.738
## Max. :83.98 Max. :1.364e+09 Max. :35.150
## NA's :18 NA's :1 NA's :105
Note that I now have a country_name.x and a country_name.y variable. This is because there was a country_name variable in both datasets. Rather than overwrite them, R has kept both in the final dataset with a suffix identifying which original dataset each came from. Sometimes this can be useful for testing whether the merge worked correctly. To clean things up, I can remove one of these variables with a subset command and then rename the other one.
combined <- subset(combined,
select=c("country_code","country_name.x","gdp_cap",
"life_exp","pop","unemployment"))
colnames(combined)[2] <- "country_name"
head(combined)
## country_code country_name gdp_cap life_exp pop unemployment
## 1 ABW Aruba NA 75.45110 103441 NA
## 2 ADO Andorra NA NA 72786 NA
## 3 AFG Afghanistan 633.9479 60.37446 31627506 NA
## 4 AGO Angola 5232.7623 52.26688 24227524 NA
## 5 ALB Albania 4568.5688 77.83046 2893654 17.49
## 6 ARE United Arab Emirates 44238.5998 77.36817 9086139 NA
I should note that the World Bank CSV file had notes at the bottom that weren’t part of the data. I had to open up the CSV file in Excel in order to remove this junk, and then re-save it to get data that would be machine-readable by R. This is a Very Bad but Unfortunately Common Practice.↩