class: center, middle, inverse, title-slide

# Distribution of a Variable
## Sociology 312
### Aaron Gullickson
### University of Oregon
### 2019-08-09

---
class: inverse, center, middle
background-image: url(images/eaters-collective-rS1GogPLVHk-unsplash.jpg)
background-size: cover

# What is a Distribution?

---

## The concept of a distribution

When we refer to the **distribution** of a variable, we are referring to how the different values of that variable are distributed across the given observations.

--

.pull-left[
### Look at it

- We can make a plot that shows the distribution.
- We make different kinds of plots for categorical and quantitative variables.
    * **Barplots** for categorical variables
    * **Histograms** for quantitative variables
]

--

.pull-left[
### Measure it

- We can calculate summary measures of the **center** and **spread** of the distribution.
- We can only calculate summary measures for quantitative variables.
]

---

## Calculating frequencies

In order to display the distribution of a **categorical variable**, we first need to calculate the **frequency**, which is the number of observations that belong to each possible category. We can do this easily in *R* with the `table` command:

```r
table(politics$party)
```

```
## 
##    Democrat  Republican Independent       Other 
##        1465        1241        1381         151 
```

--

Let's convert these frequencies into **proportions** by dividing through by the total number of observations. We can also do this easily in *R* by adding the `sum` command to the previous command:

```r
prop <- table(politics$party)/sum(table(politics$party))
prop
```

```
## 
##    Democrat  Republican Independent       Other 
##  0.34568193  0.29282681  0.32586126  0.03563001 
```

---

## Proportions and percents

*R* also has a built-in function called `prop.table` that will calculate proportions automatically. We just need to feed the output of the `table` command into it.

```r
prop.table(table(politics$party))
```

```
## 
##    Democrat  Republican Independent       Other 
##  0.34568193  0.29282681  0.32586126  0.03563001 
```

--

We can employ this "wrapping" feature of *R* to do some more tidying up. In this case, I want to `round` the number of digits and multiply by 100 to turn my proportions into percents. I also use the `sort` command to sort values from highest to lowest.

```r
percent <- sort(round(100*prop.table(table(politics$party)),1), decreasing=TRUE)
percent
```

```
## 
##    Democrat Independent  Republican       Other 
##        34.6        32.6        29.3         3.6 
```

---

## How can we display a distribution graphically?
--

.pull-left[
### Do not use a pie chart

<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" />
]

--

.pull-right[
### Use a barplot

<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Constructing a barplot in ggplot

We will use `ggplot` to construct graphs. Here is the full code for the barplot.

```r
ggplot(politics, aes(x=party, y=..prop.., group=1))+
  geom_bar()+
  scale_y_continuous(label=scales::percent)+
  labs(x="party affiliation", y=NULL,
       title="Distribution of party affiliation",
       caption="Source: ANES 2016")+
  theme_bw()
```

1. Multiple commands are linked together with `+` signs.

---

## Constructing a barplot in ggplot

We will use `ggplot` to construct graphs. Here is the full code for the barplot.

```r
*ggplot(politics, aes(x=party, y=..prop.., group=1))+
  geom_bar()+
  scale_y_continuous(label=scales::percent)+
  labs(x="party affiliation", y=NULL,
       title="Distribution of party affiliation",
       caption="Source: ANES 2016")+
  theme_bw()
```

1. Multiple commands are linked together with `+` signs.
2. The first command of `ggplot` takes two arguments. The first argument is the data we want to use (in this case, the politics dataset). The second argument is the `aes` command that defines *aesthetics* for the full plot.

---

## Constructing a barplot in ggplot

We will use `ggplot` to construct graphs. Here is the full code for the barplot.

```r
ggplot(politics, aes(x=party, y=..prop.., group=1))+
* geom_bar()+
  scale_y_continuous(label=scales::percent)+
  labs(x="party affiliation", y=NULL,
       title="Distribution of party affiliation",
       caption="Source: ANES 2016")+
  theme_bw()
```

1. Multiple commands are linked together with `+` signs.
2. The first command of `ggplot` takes two arguments. The first argument is the data we want to use (in this case, the politics dataset). The second argument is the `aes` command that defines *aesthetics* for the full plot.
3. The second command is `geom_bar`. In general, all plots require some kind of "geometry" command defined as `geom_something`. This tells `ggplot` that I want to plot bars. At this point, my basic plot is done. The remaining commands just add more bells and whistles for a nicer plot.

---

## Constructing a barplot in ggplot

We will use `ggplot` to construct graphs. Here is the full code for the barplot.

```r
ggplot(politics, aes(x=party, y=..prop.., group=1))+
  geom_bar()+
* scale_y_continuous(label=scales::percent)+
  labs(x="party affiliation", y=NULL,
       title="Distribution of party affiliation",
       caption="Source: ANES 2016")+
  theme_bw()
```

1. Multiple commands are linked together with `+` signs.
2. The first command of `ggplot` takes two arguments. The first argument is the data we want to use (in this case, the politics dataset). The second argument is the `aes` command that defines *aesthetics* for the full plot.
3. The second command is `geom_bar`. In general, all plots require some kind of "geometry" command defined as `geom_something`. This tells `ggplot` that I want to plot bars. At this point, my basic plot is done. The remaining commands just add more bells and whistles for a nicer plot.
4. `scale_y_continuous(label=scales::percent)` causes my proportions on the y-axis to be reported as percents.

---

## Constructing a barplot in ggplot

We will use `ggplot` to construct graphs. Here is the full code for the barplot.

```r
ggplot(politics, aes(x=party, y=..prop.., group=1))+
  geom_bar()+
  scale_y_continuous(label=scales::percent)+
* labs(x="party affiliation", y=NULL,
*      title="Distribution of party affiliation",
*      caption="Source: ANES 2016")+
  theme_bw()
```

1. Multiple commands are linked together with `+` signs.
2. The first command of `ggplot` takes two arguments. The first argument is the data we want to use (in this case, the politics dataset). The second argument is the `aes` command that defines *aesthetics* for the full plot.
3. The second command is `geom_bar`. In general, all plots require some kind of "geometry" command defined as `geom_something`. This tells `ggplot` that I want to plot bars. At this point, my basic plot is done. The remaining commands just add more bells and whistles for a nicer plot.
4. `scale_y_continuous(label=scales::percent)` causes my proportions on the y-axis to be reported as percents.
5. The `labs` command can be used to add nice labeling of axes and to create titles and captions.

---

## Constructing a barplot in ggplot

We will use `ggplot` to construct graphs. Here is the full code for the barplot.

```r
ggplot(politics, aes(x=party, y=..prop.., group=1))+
  geom_bar()+
  scale_y_continuous(label=scales::percent)+
  labs(x="party affiliation", y=NULL,
       title="Distribution of party affiliation",
       caption="Source: ANES 2016")+
* theme_bw()
```

1. Multiple commands are linked together with `+` signs.
2. The first command of `ggplot` takes two arguments. The first argument is the data we want to use (in this case, the politics dataset). The second argument is the `aes` command that defines *aesthetics* for the full plot.
3. The second command is `geom_bar`. In general, all plots require some kind of "geometry" command defined as `geom_something`. This tells `ggplot` that I want to plot bars. At this point, my basic plot is done. The remaining commands just add more bells and whistles for a nicer plot.
4. `scale_y_continuous(label=scales::percent)` causes my proportions on the y-axis to be reported as percents.
5. The `labs` command can be used to add nice labeling of axes and to create titles and captions.
6. `theme_bw` defines a theme for the overall plot. I prefer `theme_bw` to the default theme in `ggplot`.
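---

## Checking the bar heights by hand

With `group=1`, `..prop..` is each category's count divided by the total count, so the bar heights should match what `prop.table` returns. Here is a minimal sketch of that calculation using a made-up vector (`party_toy` and `prop_toy` are hypothetical names, not the ANES data):

```r
# Made-up party affiliations (hypothetical, not the ANES data)
party_toy <- c("Democrat", "Republican", "Democrat", "Independent",
               "Democrat", "Independent", "Republican", "Other")

# Count of each category divided by the total number of observations
prop_toy <- prop.table(table(party_toy))
prop_toy
```

The proportions sum to one, just as the bars in the barplot sum to 100%.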
---

## Code and output

.pull-left[
```r
ggplot(politics, aes(x=party, y=..prop.., group=1))+
  geom_bar()+
  scale_y_continuous(label=scales::percent)+
  labs(x="party affiliation", y=NULL,
       title="Distribution of party affiliation",
       caption="Source: ANES 2016")+
  theme_bw()
```
]

.pull-right[
<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Visualize quantitative variables with a histogram

<img src="module2_slides_distribution_variable_files/figure-html/hist-example-1.png" width="864" style="display: block; margin: auto;" />

---

## How a histogram is created

--

1. We break the variable into equal-width intervals called **bins**. For a histogram of movie runtime length, we might use bins that are 5 minutes wide, so our bins would look like 0-5 minutes, 5-10 minutes, 10-15 minutes, 15-20 minutes, etc.

--

2. We calculate the frequency of observations that fall into each bin. Technically, we need to decide which bin gets cases that fall exactly on the boundary between two bins (e.g. exactly 5 minutes). *R* defaults to putting these cases in the lower bin.

--

3. We make a barplot of these frequencies, but we put no space between the bars.

---

## Code and output for making a histogram

.pull-left[
```r
ggplot(movies, aes(x=Runtime))+
  geom_histogram(fill="skyblue", color="black", binwidth=5)+
  labs(x="runtime in minutes",
       title="Distribution of movie runtime")+
  theme_bw()
```

* You only need an `x` aesthetic for a histogram, which is just your variable.
* Use `binwidth` to specify the width of the bins.
* Use `fill` and `color` to specify the fill and border color, respectively, for your bars.
]

.pull-right[
<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: center, middle

<iframe src="https://aarongullickson.shinyapps.io/histogram/"> </iframe>

---

## What are we looking for in a histogram?

.pull-left[
### Shape

Is it symmetric or skewed?

### Center

Where is the center or peak of the distribution and is there only one?

### Spread

How spread out are the values around the center?

### Outliers

Are there any observations that have relatively very high or low values?
]

.pull-right[
<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-16-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle
background-image: url(images/willie-fineberg-64iuIOektb4-unsplash.jpg)
background-size: cover

# The Center of a Distribution

---

## What does "center" mean?

--

### Mean

The mean is the **balancing point** of a distribution. Imagine trying to put a column underneath a histogram so that it does not tip one direction or the other. This balancing point is the mean.

--

### Median

The median is the **midpoint** of the distribution. At this point, 50% of the observations have lower values, and 50% have higher values.

--

### Mode

The mode is the **high point** of the distribution, or the peak. It is typically much less useful than the other two measures.

---

## The mean and the median

<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-17-1.png" width="864" style="display: block; margin: auto;" />

---

## Calculating the mean

The mean (represented mathematically as `\(\bar{x}\)`) is calculated by taking the sum of the variable divided by the number of observations, or in math speak:

`$$\bar{x}=\frac{\sum_{i=1}^n x_i}{n}$$`

--

### 😱 Equations??!!

Don't panic! We will walk through what these symbols mean.

--

* `\(x_i\)`: We use a lower-case letter like `\(x\)` or `\(y\)` to refer to a generic variable. The subscript indicates a particular observation. So, `\(x_1\)` means the value of variable `\(x\)` for the first observation. The subscript `\(i\)` in `\(x_i\)` refers to some generic observation's value of `\(x\)`.

--

* `\(n\)`: We use `\(n\)` to refer generically to the number of observations. So, `\(x_n\)` gives the value of `\(x\)` for the last observation.

--

* We use the `\(\sum (something)\)` term to say sum something up. In this case, `\(\sum_{i=1}^n x_i\)` means to "sum the variable `\(x\)` from the first observation to the last."

---

## Calculate the mean in *R*

`$$\bar{x}=\frac{\sum_{i=1}^n x_i}{n}$$`

To calculate the mean we just sum up all the values of `\(x\)` and divide by the number of observations. The `sum` command will sum up a variable and the `nrow` command will give us the number of observations, so:

```r
sum(movies$Runtime)/nrow(movies)
```

```
## [1] 105.2256
```

The mean movie runtime is 105.2 minutes.

--

Alternatively, we could just use the `mean` command in R: 😎

```r
mean(movies$Runtime)
```

```
## [1] 105.2256
```

---

## Calculating the median

.pull-left[
### The Midpoint

* We just need to sort the observations from smallest to largest and pick the exact middle point of the distribution.
* If there is an odd number of observations, there will always be an exact midpoint.
* If we have an even number of observations, we have to take the two values closest to the midpoint and take their mean.
]

--

.pull-right[
```r
nrow(movies)
```

```
## [1] 2553
```

With an odd number of 2553 movies, the exact midpoint is the 1277th movie. We can use the `sort` command to sort and then extract the 1277th movie by using square brackets:

```r
sort(movies$Runtime)[1277]
```

```
## [1] 102
```

Alternatively, we can use the `median` command:

```r
median(movies$Runtime)
```

```
## [1] 102
```
]

---

## Why are the mean and median different?

.pull-left[
<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-23-1.png" width="504" style="display: block; margin: auto;" />
]

--

.pull-right[
* In perfectly symmetric distributions, the mean and the median will be the same. In other words, the balancing point will be at the midpoint.
* Skewness will "pull" the mean in the direction of the skew, but not the median. This is because the mean will need to move in that direction to maintain balance.
* In highly skewed distributions, this can lead to significant differences between the median and the mean.
]

---

## Skewness can create large differences

<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-24-1.png" width="864" style="display: block; margin: auto;" />

---

## Modal Categories

.pull-left[
* We don't generally calculate measures of "center" for categorical variables.
* Since the mode is the most common observation, we can make an exception for this case.
* The **modal category** is the most frequent category in a categorical variable.
]

.pull-right[
<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-25-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle
background-image: url(images/victoria-strukovskaya-OhL_qEqpef4-unsplash.jpg)
background-size: cover

# Percentiles and the Five-Number Summary

---

## Percentiles/Quantiles

.pull-left[
* A given percentile tells you what percent of the distribution is below that number.
* We have already seen one example of a percentile: the median. The median is the 50th percentile. 50% of the observations are below this value.
* Percentiles are sometimes also called **quantiles**, but I will use the term percentile in this course.
]

--

.pull-right[
### Calculating percentiles in *R*

The `quantile` command in R will calculate a given percentile. To calculate the percentile with this command, we need to add a second argument called `probs` where we feed in a vector of proportions. So if we wanted to calculate the 13th and 76th percentile of movie runtime:

```r
quantile(movies$Runtime, probs=c(0.13, 0.76))
```

```
## 13% 76% 
##  89 114 
```

13% of movies are 89 minutes or shorter and 76% of movies are 114 minutes or shorter.
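
As a quick check on what a percentile means, here is a minimal sketch using a made-up variable (`x_toy`, `cutoffs`, and `share_below` are hypothetical names, not part of the course data). It calculates two percentiles and then the share of observations at or below each one:

```r
# Toy data (hypothetical): the integers 1 through 100
x_toy <- 1:100

# The 13th and 76th percentiles of the toy data
cutoffs <- quantile(x_toy, probs=c(0.13, 0.76))

# Share of observations at or below each cutoff
share_below <- sapply(cutoffs, function(cutoff) mean(x_toy <= cutoff))
share_below
```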
]

---

## The five-number summary

.pull-left[
If I run the `quantile` command without the `probs` argument, I get:

```r
quantile(movies$Runtime)
```

```
##   0%  25%  50%  75% 100% 
##   45   93  102  114  219 
```

* The 25th percentile, 50th percentile, and 75th percentile are called the **quartiles** because they split the data into four equal quarters.
* When combined with the minimum (0%) and maximum (100%), they create the **five-number summary**.
]

--

.pull-right[
<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-28-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Anatomy of the boxplot

<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-29-1.png" width="864" style="display: block; margin: auto;" />

???

* Use a single vertical number line.
* Draw a box with a bottom and top at the 25th and 75th percentiles.
* Inside the box, draw a thick line at the median.
* The height of the box is the difference between the 25th and 75th percentile, which is called the **Interquartile range** or **IQR** for short.
* Draw "whiskers" above and below the box. These whiskers should be drawn out to the maximum and minimum values or a total distance of 1.5 x IQR, whichever is shorter.
* Plot any points that are beyond the 1.5 x IQR limit individually.

---

## Code and output for boxplot

.pull-left[
```r
ggplot(movies, aes(x="", y=Runtime))+
  geom_boxplot(fill="grey", outlier.color="red")+
  labs(x=NULL, y="runtime in minutes",
       title="Boxplot of movie runtime")+
  theme_bw()
```

* The `x=""` in the aesthetics is not necessary but does create a nicer looking x axis.
* You can use `fill` to determine the color of the box and `outlier.color` to determine the color of individual points.
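
The 1.5 x IQR rule that decides which points get plotted individually can be sketched by hand. This is a minimal illustration on made-up values (`x_toy`, `lower_limit`, `upper_limit`, and `outliers` are hypothetical names), not the code `geom_boxplot` runs internally:

```r
# Toy data (hypothetical), with one unusually large value
x_toy <- c(2, 4, 5, 6, 7, 8, 9, 11, 30)

# Bottom and top of the box, and its height
q1 <- quantile(x_toy, 0.25)
q3 <- quantile(x_toy, 0.75)
iqr <- IQR(x_toy)

# Whiskers cannot extend beyond these limits
lower_limit <- q1 - 1.5*iqr
upper_limit <- q3 + 1.5*iqr

# Observations beyond the limits are plotted as individual points
outliers <- x_toy[x_toy < lower_limit | x_toy > upper_limit]
outliers
```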
]

.pull-right[
<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-30-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Detecting skewness in boxplots

.pull-left[
<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-31-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-32-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: center, middle

<iframe src="https://aarongullickson.shinyapps.io/percentiles/"> </iframe>

---
class: inverse, center, middle
background-image: url(images/priscilla-du-preez-3H0fHhhefsA-unsplash.jpg)
background-size: cover

# Measuring the Spread of a Distribution

---

## Distributions can vary in their spread

<img src="module2_slides_distribution_variable_files/figure-html/unnamed-chunk-33-1.png" width="864" style="display: block; margin: auto;" />

---

## Measures of spread

### Interquartile Range

The **Interquartile Range** (IQR) is the distance between the 25th and 75th percentile. It can be calculated by the `IQR` command in *R*:

```r
IQR(movies$Runtime)
```

```
## [1] 21
```

The 75th percentile of movie runtime is 21 minutes longer than the 25th percentile.

--

### Variance and Standard Deviation

**Standard deviation** (SD) is the most common measure of spread in a variable. Loosely, standard deviation measures the average distance of observations from the mean.

**Variance** is the square of the standard deviation.

---

## Calculating the standard deviation

The standard deviation `\((s)\)` is calculated with the following formula:

`$$s=\sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}$$`

--

### 😮 More Equations??!!

Don't panic! We will go through it one step at a time.

--

`\((x_i-\bar{x})\)`: The distance between each observation's value and the mean. Some of these values are positive and some are negative. If we summed these values up across all observations, the sum would equal zero by definition.

```r
distance <- movies$Runtime-mean(movies$Runtime)
```

--

`\((x_i-\bar{x})^2\)`: We square this distance to get rid of the negative values.

```r
distance_sq <- distance^2
```

---

## Calculating the standard deviation

`\(\sum_{i=1}^n(x_i-\bar{x})^2\)`: The sum of the squared distance, sometimes abbreviated *SSX*.

```r
ssx <- sum(distance_sq)
```

--

`\(\sum_{i=1}^n(x_i-\bar{x})^2/(n-1)\)`: Dividing through by the number of observations minus one gives us (approximately) the "average" squared distance from the mean. This number is the **variance**.

```r
variance <- ssx/(nrow(movies)-1)
```

---

## Calculating the standard deviation

`\(\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2/(n-1)}\)`: Squared distance is a hard thing to conceptualize, so we take the square root to get back to distance.

```r
sd <- sqrt(variance)
sd
```

```
## [1] 16.72411
```

The average movie is about 16.7 minutes away from the mean movie runtime.

--

We can also just use the `sd` command: 😎

```r
sd(movies$Runtime)
```

```
## [1] 16.72411
```
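
---

## Putting the steps together

The steps from the previous slides can be collected into one small function. This is a minimal sketch: `sd_by_hand` and `x_toy` are hypothetical names, and the runtimes are made up rather than taken from the movies data.

```r
# Hypothetical helper that repeats the steps above for any numeric vector
sd_by_hand <- function(x) {
  distance <- x - mean(x)          # distance of each observation from the mean
  ssx <- sum(distance^2)           # sum of squared distances
  variance <- ssx/(length(x)-1)    # divide by n-1 to get the variance
  sqrt(variance)                   # square root to get back to distance
}

# Made-up runtimes in minutes (not the movies data)
x_toy <- c(90, 95, 100, 105, 110)

sd_by_hand(x_toy)
sd(x_toy)
```

Both commands should return the same value, because the function follows the same formula as the built-in `sd` command.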