Programming

R is the free and open-source version of an older program called Splus. Splus is so named because its syntax was designed to be very similar to C++, one of the most powerful programming languages used in real software development. R carries forward all of this programming capability. Because it was built as a programming language from the ground up, you can do almost anything you want with R if you know enough. We are only going to scratch the surface of the programming capabilities of R in this lab. I want to show you two important programming features in R:

  1. The ability to create your own custom functions
  2. The ability to repeat the same piece of code on new data iteratively.

An Example: Theil’s H

As an example, we are going to calculate a measure of segregation called the Information Theory Index or sometimes just Theil’s H, after its creator. Theil’s H provides a measure of the diversity in areas within a region (e.g. tracts within a city, counties within a state), relative to the overall diversity of the region. When Theil’s H equals one, there is no diversity in the subregions of a region, and when Theil’s H equals zero, the diversity in each subregion is equal to the overall diversity of the region.

In order to measure Theil’s H, one has to first calculate a measure of diversity for each sub-region and the total region. This measure of diversity is called entropy and is based on the proportions of different groups within the area. If \(p_j\) is the proportion of group \(j\) in the region and there are \(J\) total groups, then the formula for entropy (\(E\)) is given by:

\[E=\sum_{j=1}^J p_j\log(1/p_j)\] Entropy will be at its maximum value when the proportion \(p_j\) is the same for each group, and entropy will be at its minimum value of zero when the area is made up entirely of one group. Lets take a simple example where we have three groups and the first group is 60% of the population of an area and the remaining two groups are 20% each. Entropy would be:

\[E=(0.6)*\log(1/0.6)+0.2*\log(1/0.2)+0.2*log(1/0.2)=0.95\] With the natural log used here for three groups, the maximum value of entropy is \(\log(3)=1.0986123\), so this area would be considered fairly diverse.

In order to calculate Theil’s H, one has to first calculate entropy for each sub-region \(i\) (\(E_i\)) as well as the overall entropy for the whole region (\(E\)). One also needs the population totals for each sub-region (\(t_i\)) as well as the total population of the region (\(T\)). Theil’s H is then given by:

\[H=1-\sum_{i=1}^n \frac{t_i*E_i}{T*E}\]

Theil’s H is a weighted average of of how much the diversity of each sub-region varies from the total region. Higher values of H indicate more segregation in the sense that the diversity of the sub-regions is low relatively to the overall diversity of the region.

Our Data

For this example, I have data on the number of racial groups (white, black, Latino, Asian, indigenous, and other) in 2010 for each census tract in all 50 states (no Washington DC). Here is a glimpse of the data.

## # A tibble: 6 x 8
##   statename tractid white latino black asian indigenous other
##   <chr>       <dbl> <dbl>  <dbl> <dbl> <dbl>      <dbl> <dbl>
## 1 Alabama     20100  1601     44   217    14         13    23
## 2 Alabama     20200   844     75  1214     5          5    27
## 3 Alabama     20300  2538     87   647    17         14    70
## 4 Alabama     20400  4030     85   191    18         14    48
## 5 Alabama     20500  8438    355  1418   295         50   210
## 6 Alabama     20600  2672    176   738     6         10    66

Each observation here is a census tract and we have census tracts identified by state. To see the number of census tracts within each state:

## 
##        Alabama         Alaska        Arizona       Arkansas     California 
##           1178            167           1520            686           8024 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##           1242            828            214           4182           1957 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##            321            298           3115           1507            823 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##            766           1110           1129            351           1390 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##           1467           2756           1334            659           1391 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##            271            532            680            292           2002 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##            498           4870           2175            205           2943 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##           1046            826           3210            241           1091 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##            222           1489           5238            585            183 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##           1886           1445            484           1392            131

In order to calculate Theil’s H for each state, we will need to:

  1. Create code that calculates Theil’s H for the tract data from a single state.
  2. Iteratively run that code for each state.

Calculating Theil’s H for a single state

Clearly we need to do some work here that is going to be iterative. We want to calculate Theil’s H for each state, so however we do that calculation we will need to repeat it for all fifty states. Below, we will learn easy ways to make our work iterative, but the first step here is to make sure we understand how to do this calculation for a single case. Once we are satisfied with how we handled that case, we can learn how to generalize our code and make it iterative.

To get started, I am going to use Alabama as an example. Let me first pull out these census tracts and save them to a different object.

## # A tibble: 6 x 8
##   statename tractid white latino black asian indigenous other
##   <chr>       <dbl> <dbl>  <dbl> <dbl> <dbl>      <dbl> <dbl>
## 1 Alabama     20100  1601     44   217    14         13    23
## 2 Alabama     20200   844     75  1214     5          5    27
## 3 Alabama     20300  2538     87   647    17         14    70
## 4 Alabama     20400  4030     85   191    18         14    48
## 5 Alabama     20500  8438    355  1418   295         50   210
## 6 Alabama     20600  2672    176   738     6         10    66

The first step is to calculate the proportion of each racial group within each census tract. There are number of ways to do this, but the most straightforward approach would be to just use a prop.table on a subset of the data that just includes the counts, being sure to specify that I want proportions within rows. There is one caveat to this approach. The prop.table command expects a matrix or an array object not a data.frame so I need to recast my subset into a matrix with the as.matrix command.

##          white     latino      black       asian  indigenous      other
## [1,] 0.8373431 0.02301255 0.11349372 0.007322176 0.006799163 0.01202929
## [2,] 0.3889401 0.03456221 0.55944700 0.002304147 0.002304147 0.01244240
## [3,] 0.7524459 0.02579306 0.19181737 0.005040024 0.004150608 0.02075304
## [4,] 0.9188326 0.01937984 0.04354765 0.004103967 0.003191974 0.01094391
## [5,] 0.7837637 0.03297418 0.13171094 0.027401077 0.004644250 0.01950585
## [6,] 0.7284624 0.04798255 0.20119956 0.001635769 0.002726281 0.01799346

Lets make sure that the proportions add up to one in each row:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1

The next step is to use these proportions to calculate entropy for each tract. The only hard part about this part is summing up across groups within the same tract. I can do this again with the apply command:

## # A tibble: 6 x 9
##   statename tractid white latino black asian indigenous other entropy
##   <chr>       <dbl> <dbl>  <dbl> <dbl> <dbl>      <dbl> <dbl>   <dbl>
## 1 Alabama     20100  1601     44   217    14         13    23   0.606
## 2 Alabama     20200   844     75  1214     5          5    27   0.891
## 3 Alabama     20300  2538     87   647    17         14    70   0.755
## 4 Alabama     20400  4030     85   191    18         14    48   0.381
## 5 Alabama     20500  8438    355  1418   295         50   210   0.771
## 6 Alabama     20600  2672    176   738     6         10    66   0.798
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4405  0.6734  0.6376  0.8293  1.3033

The na.rm=TRUE is important here because groups with a proportion of zero need to be ignored since the log of zero is not defined.

In order to calculate Theil’s H, I also need a couple of other pieces of information. First, I need \(t_i\) which is the total population of each tract. I can get this with the apply command.

## # A tibble: 6 x 10
##   statename tractid white latino black asian indigenous other entropy total
##   <chr>       <dbl> <dbl>  <dbl> <dbl> <dbl>      <dbl> <dbl>   <dbl> <dbl>
## 1 Alabama     20100  1601     44   217    14         13    23   0.606  1912
## 2 Alabama     20200   844     75  1214     5          5    27   0.891  2170
## 3 Alabama     20300  2538     87   647    17         14    70   0.755  3373
## 4 Alabama     20400  4030     85   191    18         14    48   0.381  4386
## 5 Alabama     20500  8438    355  1418   295         50   210   0.771 10766
## 6 Alabama     20600  2672    176   738     6         10    66   0.798  3668

Now I just need to calculate entropy for the whole state of Alabama. I can do this by using apply again to sum up tract populations by race and then using prop.table again.

## [1] 1
## [1] 0.8825423

Ok, now I have all the pieces that I need to calculate Theil’s H for alabama:

## [1] 0.2675507

So, to re-capitulate, here is the code altogether that will calculate entropy for Alabama:

## [1] 0.2675507

Notice that I have changed all of the names of objects I create here that contain “alabama” to a more general “state” name. The only place where “Alabama” is referenced is in the first line of the code. This makes the code very re-usable. I can just change the state name that is called in the first subset command and the code will produce a Theil’s H for that state. Lets try that with Oregon:

## [1] 0.1135007

Creating Functions

In cases like this where you will end up re-using the same code on different but identically structured objects, I strongly recommend that you create your own custom function. By creating a custom function, you will only have to write the code you need in one place. Also, if you realize there are bugs in the code later, you only need to fix it in one place.

Custom functions can be created with the following syntax:

In the call to function you can define how many arguments you wish to feed into the function. Then within the curly brackets you can declare code to run on the objects as you named them. Any object referenced within the curly brackets must be an object identified in the initial function call. The last call within the curly brackets is usually the return command which will return whatever value or object you want the function to produce. Here is my code to create a function named calculateTheilH.

The function feeds in an object called state_tracts which should be the data.frame of tracts for a given state. The function then runs the same lines of code that I created above. Lets try it out on a couple of states.

## [1] 0.2675507
## [1] 0.1135007
## [1] 0.2356094
## [1] 0.3633379

I have now simplified my code considerably. Without the function I would have had to copy and paste the code for calculating entropy fifty different times. If I then changed or corrected something later in that code, it would have been a nightmare to fix it for all fifty versions. Now I can keep all that code in one place and I just need to type one line for each state to calculate Theil’s H. But I can improve this even more by iteration.

Iteration

Basically, I want to iterate over states and calculate Theil’s H for each state. Computers are great at this kind of iterative process. Iterative methods are very common in real programming languages because they can handle problems where a lot of repitition is necessary and computers can typically execute them quickly. The way this is handled in most programming languages is by looping. In R, we can solve our iteration problem with a for-loop.

A for-loop will sequentially move through all of the values in a vector and run the code within the curly brackets each time. Usually, this code will reference the particular value of the vector.

Here is a very simple example of a for-loop that iterates through all the unique names of states and prints them out.

## [1] "Alabama"
## [1] "Alaska"
## [1] "Arizona"
## [1] "Arkansas"
## [1] "California"
## [1] "Colorado"
## [1] "Connecticut"
## [1] "Delaware"
## [1] "Florida"
## [1] "Georgia"
## [1] "Hawaii"
## [1] "Idaho"
## [1] "Illinois"
## [1] "Indiana"
## [1] "Iowa"
## [1] "Kansas"
## [1] "Kentucky"
## [1] "Louisiana"
## [1] "Maine"
## [1] "Maryland"
## [1] "Massachusetts"
## [1] "Michigan"
## [1] "Minnesota"
## [1] "Mississippi"
## [1] "Missouri"
## [1] "Montana"
## [1] "Nebraska"
## [1] "Nevada"
## [1] "New Hampshire"
## [1] "New Jersey"
## [1] "New Mexico"
## [1] "New York"
## [1] "North Carolina"
## [1] "North Dakota"
## [1] "Ohio"
## [1] "Oklahoma"
## [1] "Oregon"
## [1] "Pennsylvania"
## [1] "Rhode Island"
## [1] "South Carolina"
## [1] "South Dakota"
## [1] "Tennessee"
## [1] "Texas"
## [1] "Utah"
## [1] "Vermont"
## [1] "Virginia"
## [1] "Washington"
## [1] "West Virginia"
## [1] "Wisconsin"
## [1] "Wyoming"

This is not very useful, but it does give us a hint of a for-loop that would be more useful. We could iterate through unique state names and then use each state name to define a subset of tracts from that state and apply our calculateTheilH function:

## [1] 0.2675507
## [1] 0.1682601
## [1] 0.2372527
## [1] 0.2664347
## [1] 0.2356094
## [1] 0.1620947
## [1] 0.2333971
## [1] 0.1489769
## [1] 0.2653847
## [1] 0.2448094
## [1] 0.1145708
## [1] 0.104742
## [1] 0.3701819
## [1] 0.2661208
## [1] 0.1608354
## [1] 0.1873829
## [1] 0.2045492
## [1] 0.2501826
## [1] 0.08692125
## [1] 0.2872802
## [1] 0.2363536
## [1] 0.3383812
## [1] 0.185217
## [1] 0.2272864
## [1] 0.291448
## [1] 0.2111407
## [1] 0.2229126
## [1] 0.1324474
## [1] 0.08889904
## [1] 0.2874875
## [1] 0.2334191
## [1] 0.3633379
## [1] 0.2035033
## [1] 0.2375026
## [1] 0.2930075
## [1] 0.1489998
## [1] 0.1135007
## [1] 0.3331475
## [1] 0.2328076
## [1] 0.1739929
## [1] 0.2732335
## [1] 0.3013521
## [1] 0.2681015
## [1] 0.1120612
## [1] 0.0518904
## [1] 0.2046499
## [1] 0.1501689
## [1] 0.1359242
## [1] 0.3002023
## [1] 0.1187653

That worked! We just have one problem. These results are just being printed to the console. We want to save them to a new vector. We can do this by initializing a NULL vector and then adding each calculated value to this vector with the concatenate command:

## [1] 50
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05189 0.15284 0.23005 0.21467 0.26727 0.37018

Ok, so now we have a vector of length 50 that gives the Theil’s H for each state. As a final touch-up step, we can go ahead and put this back into a data frame with a state id.

##    statename   theil_h
## 1    Alabama 0.2675507
## 2     Alaska 0.1682601
## 3    Arizona 0.2372527
## 4   Arkansas 0.2664347
## 5 California 0.2356094
## 6   Colorado 0.1620947

The lapply command

It turns out that R is not very efficient with iterative loops. So, while iterative looping can help solve problems, its better to look for a different solution before you start using iteration in R. That different solution is usually the lapply command or its closely related cousin sapply.

The lapply command will apply a given function to every object in a list. If we were to put all of our state-specific tract datasets into a list, we could run an lapply command calling calculateTheilH on the entire list at once. Generally, this will be much faster than a for-loop. The code would look something like:

The only trick is figuring out how to create a list of state tract data where each object in the list is a data frame of tract data for a specific state. It turns out that the split command is just what we need. It will split an object into a list of the same objects by some factor variable. In our case:

## [1] 50
## # A tibble: 6 x 8
##   statename tractid white latino black asian indigenous other
##   <chr>       <dbl> <dbl>  <dbl> <dbl> <dbl>      <dbl> <dbl>
## 1 Alabama     20100  1601     44   217    14         13    23
## 2 Alabama     20200   844     75  1214     5          5    27
## 3 Alabama     20300  2538     87   647    17         14    70
## 4 Alabama     20400  4030     85   191    18         14    48
## 5 Alabama     20500  8438    355  1418   295         50   210
## 6 Alabama     20600  2672    176   738     6         10    66

The state_list object is a list where each object is the tract data.frame for a given state. Now we can run our lapply command from above:

## [1] "list"
## [1] 0.2675507

The only downside to the lapply function is that it outputs the results as a list as well. In this case, where the output of our function is a just a simple number, we would like the output to just be a simple vector. It turns out the sapply function can do this. The sapply function is just a wrapper to lapply that converts the output when possible to a simple vector or matrix format.

## [1] "numeric"
##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##  0.2675507  0.1682601  0.2372527  0.2664347  0.2356094  0.1620947