Programming
R is the free and open-source version of an older program called Splus. Splus is so named because its syntax was designed to be very similar to C++, one of the most powerful programming languages used in real software development. R carries forward all of this programming capability. Because it was built as a programming language from the ground up, you can do almost anything you want with R if you know enough. We are only going to scratch the surface of the programming capabilities of R in this lab. I want to show you two important programming features in R:
- The ability to create your own custom functions
- The ability to repeat the same piece of code on new data iteratively.
An Example: Theil’s H
As an example, we are going to calculate a measure of segregation called the Information Theory Index or sometimes just Theil’s H, after its creator. Theil’s H provides a measure of the diversity in areas within a region (e.g. tracts within a city, counties within a state), relative to the overall diversity of the region. When Theil’s H equals one, there is no diversity in the subregions of a region, and when Theil’s H equals zero, the diversity in each subregion is equal to the overall diversity of the region.
In order to measure Theil’s H, one has to first calculate a measure of diversity for each sub-region and the total region. This measure of diversity is called entropy and is based on the proportions of different groups within the area. If \(p_j\) is the proportion of group \(j\) in the region and there are \(J\) total groups, then the formula for entropy (\(E\)) is given by:
\[E=\sum_{j=1}^J p_j\log(1/p_j)\] Entropy will be at its maximum value when the proportion \(p_j\) is the same for each group, and entropy will be at its minimum value of zero when the area is made up entirely of one group. Lets take a simple example where we have three groups and the first group is 60% of the population of an area and the remaining two groups are 20% each. Entropy would be:
\[E=(0.6)*\log(1/0.6)+0.2*\log(1/0.2)+0.2*log(1/0.2)=0.95\] With the natural log used here for three groups, the maximum value of entropy is \(\log(3)=1.0986123\), so this area would be considered fairly diverse.
In order to calculate Theil’s H, one has to first calculate entropy for each sub-region \(i\) (\(E_i\)) as well as the overall entropy for the whole region (\(E\)). One also needs the population totals for each sub-region (\(t_i\)) as well as the total population of the region (\(T\)). Theil’s H is then given by:
\[H=1-\sum_{i=1}^n \frac{t_i*E_i}{T*E}\]
Theil’s H is a weighted average of of how much the diversity of each sub-region varies from the total region. Higher values of H indicate more segregation in the sense that the diversity of the sub-regions is low relatively to the overall diversity of the region.
Our Data
For this example, I have data on the number of racial groups (white, black, Latino, Asian, indigenous, and other) in 2010 for each census tract in all 50 states (no Washington DC). Here is a glimpse of the data.
## # A tibble: 6 x 8
## statename tractid white latino black asian indigenous other
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 20100 1601 44 217 14 13 23
## 2 Alabama 20200 844 75 1214 5 5 27
## 3 Alabama 20300 2538 87 647 17 14 70
## 4 Alabama 20400 4030 85 191 18 14 48
## 5 Alabama 20500 8438 355 1418 295 50 210
## 6 Alabama 20600 2672 176 738 6 10 66
Each observation here is a census tract and we have census tracts identified by state. To see the number of census tracts within each state:
##
## Alabama Alaska Arizona Arkansas California
## 1178 167 1520 686 8024
## Colorado Connecticut Delaware Florida Georgia
## 1242 828 214 4182 1957
## Hawaii Idaho Illinois Indiana Iowa
## 321 298 3115 1507 823
## Kansas Kentucky Louisiana Maine Maryland
## 766 1110 1129 351 1390
## Massachusetts Michigan Minnesota Mississippi Missouri
## 1467 2756 1334 659 1391
## Montana Nebraska Nevada New Hampshire New Jersey
## 271 532 680 292 2002
## New Mexico New York North Carolina North Dakota Ohio
## 498 4870 2175 205 2943
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 1046 826 3210 241 1091
## South Dakota Tennessee Texas Utah Vermont
## 222 1489 5238 585 183
## Virginia Washington West Virginia Wisconsin Wyoming
## 1886 1445 484 1392 131
In order to calculate Theil’s H for each state, we will need to:
- Create code that calculates Theil’s H for the tract data from a single state.
- Iteratively run that code for each state.
Calculating Theil’s H for a single state
Clearly we need to do some work here that is going to be iterative. We want to calculate Theil’s H for each state, so however we do that calculation we will need to repeat it for all fifty states. Below, we will learn easy ways to make our work iterative, but the first step here is to make sure we understand how to do this calculation for a single case. Once we are satisfied with how we handled that case, we can learn how to generalize our code and make it iterative.
To get started, I am going to use Alabama as an example. Let me first pull out these census tracts and save them to a different object.
## # A tibble: 6 x 8
## statename tractid white latino black asian indigenous other
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 20100 1601 44 217 14 13 23
## 2 Alabama 20200 844 75 1214 5 5 27
## 3 Alabama 20300 2538 87 647 17 14 70
## 4 Alabama 20400 4030 85 191 18 14 48
## 5 Alabama 20500 8438 355 1418 295 50 210
## 6 Alabama 20600 2672 176 738 6 10 66
The first step is to calculate the proportion of each racial group within each census tract. There are number of ways to do this, but the most straightforward approach would be to just use a prop.table
on a subset of the data that just includes the counts, being sure to specify that I want proportions within rows. There is one caveat to this approach. The prop.table
command expects a matrix or an array object not a data.frame so I need to recast my subset into a matrix with the as.matrix
command.
props <- prop.table(as.matrix(alabama_tracts[,c("white","latino","black","asian","indigenous","other")]), 1)
head(props)
## white latino black asian indigenous other
## [1,] 0.8373431 0.02301255 0.11349372 0.007322176 0.006799163 0.01202929
## [2,] 0.3889401 0.03456221 0.55944700 0.002304147 0.002304147 0.01244240
## [3,] 0.7524459 0.02579306 0.19181737 0.005040024 0.004150608 0.02075304
## [4,] 0.9188326 0.01937984 0.04354765 0.004103967 0.003191974 0.01094391
## [5,] 0.7837637 0.03297418 0.13171094 0.027401077 0.004644250 0.01950585
## [6,] 0.7284624 0.04798255 0.20119956 0.001635769 0.002726281 0.01799346
Lets make sure that the proportions add up to one in each row:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
The next step is to use these proportions to calculate entropy for each tract. The only hard part about this part is summing up across groups within the same tract. I can do this again with the apply
command:
## # A tibble: 6 x 9
## statename tractid white latino black asian indigenous other entropy
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 20100 1601 44 217 14 13 23 0.606
## 2 Alabama 20200 844 75 1214 5 5 27 0.891
## 3 Alabama 20300 2538 87 647 17 14 70 0.755
## 4 Alabama 20400 4030 85 191 18 14 48 0.381
## 5 Alabama 20500 8438 355 1418 295 50 210 0.771
## 6 Alabama 20600 2672 176 738 6 10 66 0.798
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4405 0.6734 0.6376 0.8293 1.3033
The na.rm=TRUE
is important here because groups with a proportion of zero need to be ignored since the log of zero is not defined.
In order to calculate Theil’s H, I also need a couple of other pieces of information. First, I need \(t_i\) which is the total population of each tract. I can get this with the apply
command.
alabama_tracts$total <- apply(alabama_tracts[,c("white","latino","black","asian","indigenous","other")],
1, sum)
head(alabama_tracts)
## # A tibble: 6 x 10
## statename tractid white latino black asian indigenous other entropy total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 20100 1601 44 217 14 13 23 0.606 1912
## 2 Alabama 20200 844 75 1214 5 5 27 0.891 2170
## 3 Alabama 20300 2538 87 647 17 14 70 0.755 3373
## 4 Alabama 20400 4030 85 191 18 14 48 0.381 4386
## 5 Alabama 20500 8438 355 1418 295 50 210 0.771 10766
## 6 Alabama 20600 2672 176 738 6 10 66 0.798 3668
Now I just need to calculate entropy for the whole state of Alabama. I can do this by using apply
again to sum up tract populations by race and then using prop.table
again.
props_alabama <- prop.table(apply(alabama_tracts[,c("white","latino","black",
"asian","indigenous","other")],
2, sum))
sum(props_alabama)
## [1] 1
## [1] 0.8825423
Ok, now I have all the pieces that I need to calculate Theil’s H for alabama:
## [1] 0.2675507
So, to re-capitulate, here is the code altogether that will calculate entropy for Alabama:
state_tracts <- subset(tracts, statename=="Alabama")
props <- prop.table(as.matrix(state_tracts[,c("white","latino","black",
"asian","indigenous","other")]), 1)
state_tracts$entropy <- apply(props*log(1/props), 1, sum, na.rm=TRUE)
state_tracts$total <- apply(state_tracts[,c("white","latino","black","asian",
"indigenous","other")], 1, sum)
props_state <- prop.table(apply(state_tracts[,c("white","latino","black",
"asian","indigenous","other")], 2, sum))
entropy_state <- sum(props_state*log(1/props_state))
theil_h <- 1-sum(state_tracts$total * state_tracts$entropy)/
(sum(state_tracts$total)*entropy_state)
theil_h
## [1] 0.2675507
Notice that I have changed all of the names of objects I create here that contain “alabama” to a more general “state” name. The only place where “Alabama” is referenced is in the first line of the code. This makes the code very re-usable. I can just change the state name that is called in the first subset
command and the code will produce a Theil’s H for that state. Lets try that with Oregon:
state_tracts <- subset(tracts, statename=="Oregon")
props <- prop.table(as.matrix(state_tracts[,c("white","latino","black",
"asian","indigenous","other")]), 1)
state_tracts$entropy <- apply(props*log(1/props), 1, sum, na.rm=TRUE)
state_tracts$total <- apply(state_tracts[,c("white","latino","black","asian",
"indigenous","other")], 1, sum)
props_state <- prop.table(apply(state_tracts[,c("white","latino","black",
"asian","indigenous","other")], 2, sum))
entropy_state <- sum(props_state*log(1/props_state))
theil_h <- 1-sum(state_tracts$total * state_tracts$entropy)/
(sum(state_tracts$total)*entropy_state)
theil_h
## [1] 0.1135007
Creating Functions
In cases like this where you will end up re-using the same code on different but identically structured objects, I strongly recommend that you create your own custom function. By creating a custom function, you will only have to write the code you need in one place. Also, if you realize there are bugs in the code later, you only need to fix it in one place.
Custom functions can be created with the following syntax:
myfunctionname <- function(argument1, argument2, ...) {
#a bunch of code to run on the objects argument1 and argument2
return(value)
}
In the call to function
you can define how many arguments you wish to feed into the function. Then within the curly brackets you can declare code to run on the objects as you named them. Any object referenced within the curly brackets must be an object identified in the initial function call. The last call within the curly brackets is usually the return
command which will return whatever value or object you want the function to produce. Here is my code to create a function named calculateTheilH
.
calculateTheilH <- function(state_tracts) {
props <- prop.table(as.matrix(state_tracts[,c("white","latino","black",
"asian","indigenous","other")]),
1)
state_tracts$entropy <- apply(props*log(1/props), 1, sum, na.rm=TRUE)
state_tracts$total <- apply(state_tracts[,c("white","latino","black","asian",
"indigenous","other")],
1, sum)
props_state <- prop.table(apply(state_tracts[,c("white","latino","black",
"asian","indigenous","other")],
2, sum))
entropy_state <- sum(props_state*log(1/props_state))
theil_h <- 1-sum(state_tracts$total * state_tracts$entropy)/(sum(state_tracts$total)*entropy_state)
return(theil_h)
}
The function feeds in an object called state_tracts
which should be the data.frame of tracts for a given state. The function then runs the same lines of code that I created above. Lets try it out on a couple of states.
## [1] 0.2675507
## [1] 0.1135007
## [1] 0.2356094
## [1] 0.3633379
I have now simplified my code considerably. Without the function I would have had to copy and paste the code for calculating entropy fifty different times. If I then changed or corrected something later in that code, it would have been a nightmare to fix it for all fifty versions. Now I can keep all that code in one place and I just need to type one line for each state to calculate Theil’s H. But I can improve this even more by iteration.
Iteration
Basically, I want to iterate over states and calculate Theil’s H for each state. Computers are great at this kind of iterative process. Iterative methods are very common in real programming languages because they can handle problems where a lot of repitition is necessary and computers can typically execute them quickly. The way this is handled in most programming languages is by looping. In R, we can solve our iteration problem with a for-loop.
A for-loop will sequentially move through all of the values in a vector and run the code within the curly brackets each time. Usually, this code will reference the particular value of the vector.
Here is a very simple example of a for-loop that iterates through all the unique names of states and prints them out.
## [1] "Alabama"
## [1] "Alaska"
## [1] "Arizona"
## [1] "Arkansas"
## [1] "California"
## [1] "Colorado"
## [1] "Connecticut"
## [1] "Delaware"
## [1] "Florida"
## [1] "Georgia"
## [1] "Hawaii"
## [1] "Idaho"
## [1] "Illinois"
## [1] "Indiana"
## [1] "Iowa"
## [1] "Kansas"
## [1] "Kentucky"
## [1] "Louisiana"
## [1] "Maine"
## [1] "Maryland"
## [1] "Massachusetts"
## [1] "Michigan"
## [1] "Minnesota"
## [1] "Mississippi"
## [1] "Missouri"
## [1] "Montana"
## [1] "Nebraska"
## [1] "Nevada"
## [1] "New Hampshire"
## [1] "New Jersey"
## [1] "New Mexico"
## [1] "New York"
## [1] "North Carolina"
## [1] "North Dakota"
## [1] "Ohio"
## [1] "Oklahoma"
## [1] "Oregon"
## [1] "Pennsylvania"
## [1] "Rhode Island"
## [1] "South Carolina"
## [1] "South Dakota"
## [1] "Tennessee"
## [1] "Texas"
## [1] "Utah"
## [1] "Vermont"
## [1] "Virginia"
## [1] "Washington"
## [1] "West Virginia"
## [1] "Wisconsin"
## [1] "Wyoming"
This is not very useful, but it does give us a hint of a for-loop that would be more useful. We could iterate through unique state names and then use each state name to define a subset of tracts from that state and apply our calculateTheilH function:
## [1] 0.2675507
## [1] 0.1682601
## [1] 0.2372527
## [1] 0.2664347
## [1] 0.2356094
## [1] 0.1620947
## [1] 0.2333971
## [1] 0.1489769
## [1] 0.2653847
## [1] 0.2448094
## [1] 0.1145708
## [1] 0.104742
## [1] 0.3701819
## [1] 0.2661208
## [1] 0.1608354
## [1] 0.1873829
## [1] 0.2045492
## [1] 0.2501826
## [1] 0.08692125
## [1] 0.2872802
## [1] 0.2363536
## [1] 0.3383812
## [1] 0.185217
## [1] 0.2272864
## [1] 0.291448
## [1] 0.2111407
## [1] 0.2229126
## [1] 0.1324474
## [1] 0.08889904
## [1] 0.2874875
## [1] 0.2334191
## [1] 0.3633379
## [1] 0.2035033
## [1] 0.2375026
## [1] 0.2930075
## [1] 0.1489998
## [1] 0.1135007
## [1] 0.3331475
## [1] 0.2328076
## [1] 0.1739929
## [1] 0.2732335
## [1] 0.3013521
## [1] 0.2681015
## [1] 0.1120612
## [1] 0.0518904
## [1] 0.2046499
## [1] 0.1501689
## [1] 0.1359242
## [1] 0.3002023
## [1] 0.1187653
That worked! We just have one problem. These results are just being printed to the console. We want to save them to a new vector. We can do this by initializing a NULL
vector and then adding each calculated value to this vector with the concatenate command:
theil_h <- NULL
for(state in unique(tracts$statename)) {
theil_h <- c(theil_h, calculateTheilH(subset(tracts, statename==state)))
}
length(theil_h)
## [1] 50
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05189 0.15284 0.23005 0.21467 0.26727 0.37018
Ok, so now we have a vector of length 50 that gives the Theil’s H for each state. As a final touch-up step, we can go ahead and put this back into a data frame with a state id.
## statename theil_h
## 1 Alabama 0.2675507
## 2 Alaska 0.1682601
## 3 Arizona 0.2372527
## 4 Arkansas 0.2664347
## 5 California 0.2356094
## 6 Colorado 0.1620947
The lapply
command
It turns out that R is not very efficient with iterative loops. So, while iterative looping can help solve problems, its better to look for a different solution before you start using iteration in R. That different solution is usually the lapply
command or its closely related cousin sapply
.
The lapply
command will apply a given function to every object in a list. If we were to put all of our state-specific tract datasets into a list, we could run an lapply
command calling calculateTheilH
on the entire list at once. Generally, this will be much faster than a for-loop. The code would look something like:
The only trick is figuring out how to create a list of state tract data where each object in the list is a data frame of tract data for a specific state. It turns out that the split
command is just what we need. It will split an object into a list of the same objects by some factor variable. In our case:
## [1] 50
## # A tibble: 6 x 8
## statename tractid white latino black asian indigenous other
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 20100 1601 44 217 14 13 23
## 2 Alabama 20200 844 75 1214 5 5 27
## 3 Alabama 20300 2538 87 647 17 14 70
## 4 Alabama 20400 4030 85 191 18 14 48
## 5 Alabama 20500 8438 355 1418 295 50 210
## 6 Alabama 20600 2672 176 738 6 10 66
The state_list
object is a list where each object is the tract data.frame for a given state. Now we can run our lapply
command from above:
## [1] "list"
## [1] 0.2675507
The only downside to the lapply
function is that it outputs the results as a list as well. In this case, where the output of our function is a just a simple number, we would like the output to just be a simple vector. It turns out the sapply
function can do this. The sapply
function is just a wrapper to lapply
that converts the output when possible to a simple vector or matrix format.
## [1] "numeric"
## Alabama Alaska Arizona Arkansas California Colorado
## 0.2675507 0.1682601 0.2372527 0.2664347 0.2356094 0.1620947
Putting It All Together
Without the use of functions and iterative methods, calculating this measure would have been a nightmare. We would have had to copy and paste code fifty different times for each state. Instead, we have been able to reduce the entire process to just a few lines of code, most of which is just code for the function:
calculateTheilH <- function(state_tracts) {
props <- prop.table(as.matrix(state_tracts[,c("white","latino","black",
"asian","indigenous","other")]),
1)
state_tracts$entropy <- apply(props*log(1/props), 1, sum, na.rm=TRUE)
state_tracts$total <- apply(state_tracts[,c("white","latino","black","asian",
"indigenous","other")],
1, sum)
props_state <- prop.table(apply(state_tracts[,c("white","latino","black",
"asian","indigenous","other")],
2, sum))
entropy_state <- sum(props_state*log(1/props_state))
theil_h <- 1-sum(state_tracts$total * state_tracts$entropy)/
(sum(state_tracts$total)*entropy_state)
return(theil_h)
}
tracts <- read_csv("resources/tracts_state.csv")
theil_h <- sapply(split(tracts, tracts$statename), calculateTheilH)
theil_h <- data.frame(statename=names(theil_h), theil_h)
rownames(theil_h) <- NULL
head(theil_h)
## statename theil_h
## 1 Alabama 0.2675507
## 2 Alaska 0.1682601
## 3 Arizona 0.2372527
## 4 Arkansas 0.2664347
## 5 California 0.2356094
## 6 Colorado 0.1620947