class: center, middle, inverse, title-slide # Building Models ## Sociology 312 ### Aaron Gullickson ### University of Oregon ### 2019-08-09 --- class: inverse, center, middle background-image: url(images/ridham-nagralawala-kuJkUTxR0z4-unsplash.jpg) background-size: cover # The OLS Regression Line --- ## Drawing straight lines <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-2-1.png" width="864" style="display: block; margin: auto;" /> --- ## Drawing straight lines <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-3-1.png" width="864" style="display: block; margin: auto;" /> --- ## Elements of a straight line <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-4-1.png" width="864" style="display: block; margin: auto;" /> --- ## The formula for a straight line -- .pull-left[ ### in high school `$$y=a+bx$$` * `\(a\)` is the **y-intercept**: the value of `\(y\)` when `\(x\)` is zero. * `\(b\)` is the **slope**: the change in `\(y\)` for a one-unit increase in `\(x\)` (the rise over the run). * In this kind of set-up the constant values of `\(a\)` and `\(b\)` are called **coefficients** - a constant value that is multiplied by a variable. ] -- .pull-right[ ### How we do it in statistics `$$\hat{y}_i=b_0+b_1x_i$$` * `\(\hat{y}_i\)`: The predicted value of `\(y\)` for `\(i\)`th observation from the linear formula. * `\(b_0\)`: The predicted value of `\(y\)` when `\(x\)` is zero. * `\(b_1\)`: The predicted change in `\(y\)` for a one-unit increase in `\(x\)`. ] --- ## How do we know which line is best? -- .pull-left[ <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" /> ] -- .pull-right[ #### We choose the line that minimizes the error in our prediction For a given observation `\(i\)`, the value `\(y_i-\hat{y}_i\)` gives the residual or error in the prediction. 
To get the total error in prediction, we can calculate the sum of squared residuals: `$$SSR=\sum_{i=1}^n (y_i-\hat{y}_i)^2=\sum_{i=1}^n (y_i-b_0-b_1x_i)^2$$` The best-fitting line is the one with the smallest possible sum of squared residuals. This is called the **Ordinary Least Squares (OLS) regression line**. ] --- class: center, middle <iframe src="https://aarongullickson.shinyapps.io/reducerss/"> </iframe> --- ## Formulas for the best-fitting line `$$b_1=r * \frac{s_y}{s_x}$$` `$$b_0=\bar{y}-b_1*\bar{x}$$` -- We can calculate these by hand in R, although we will learn an easier way later: ```r slope <- cor(crimes$Property,crimes$Unemployment)*sd(crimes$Property)/sd(crimes$Unemployment) slope ``` ``` ## [1] 148.6814 ``` ```r mean(crimes$Property)-slope*mean(crimes$Unemployment) ``` ``` ## [1] 1628.35 ``` -- `$$\hat{\texttt{property_crimes}}_i=1628.4+148.7(\texttt{unemployment_rate}_i)$$` --- ## The OLS regression line as a model .left-column[ ![model plane](images/scarbor-siu-pKGvVjAp0P8-unsplash.jpg) ] .right-column[ The OLS regression line is often called a **linear model** because we are measuring the relationship between two variables by applying a **linear function** to characterize the relationship. The `lm` command can be used to create a model object in R: ```r model <- lm(Property~Unemployment, data=crimes) ``` The tilde (~) is used to indicate the relationship between the two variables, with the dependent variable on the left-hand side. I can then use the `coef` command on this model to get my coefficients (i.e. intercept and slope). ```r coef(model) ``` ``` ## (Intercept) Unemployment ## 1628.3503 148.6814 ``` ] --- ## Use `summary` for model TMI ```r summary(model) ``` ``` ## ## Call: ## lm(formula = Property ~ Unemployment, data = crimes) ## ## Residuals: ## Min 1Q Median 3Q Max ## -987.7 -453.2 -100.8 453.7 1588.5 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) *## (Intercept) 1628.35 373.50 4.360 6.67e-05 *** *## Unemployment 148.68 42.82 3.472 0.00109 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 580.5 on 49 degrees of freedom *## Multiple R-squared: 0.1974, Adjusted R-squared: 0.181 ## F-statistic: 12.05 on 1 and 49 DF, p-value: 0.001089 ``` --- ## Add the best-fitting line to your scatterplot .pull-left[ ```r ggplot(crimes, aes(x=Unemployment, y=Property))+ geom_point()+ geom_smooth(method="lm", se=FALSE)+ labs(x="unemployment rate", y="property crimes (per 100,000)")+ theme_bw() ``` * `geom_smooth` with the argument `method="lm"` will add the OLS regression line to your scatterplot. * `se=FALSE` will suppress a confidence band, which I will show later. ] .pull-right[ <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## Interpreting the results `$$\hat{\texttt{property_crimes}}_i=1628.4+148.7(\texttt{unemployment_rate}_i)$$` -- .pull-left[ ### Intercept **The model predicts** that states with **no unemployment** will have a property crime rate of 1628 crimes per 100,000, **on average**. ] -- .pull-right[ ### Slope **The model predicts** that a **one percentage point increase** in the unemployment rate **is associated with** an increase of 149 property crimes per 100,000, **on average**. ] --- ## Try interpreting these numbers Try interpreting these numbers from a regression model where the dependent variable is box office returns (in millions of dollars) and the independent variable is the Tomato Meter (from 0 to 100). `$$\hat{\texttt{box_office}}_i=18.32+0.56(\texttt{meter}_i)$$` -- .pull-left[ ### Intercept The model predicts that movies that receive a zero on the Tomato Meter will make $18.32 million, on average. 
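As a quick arithmetic check on this reading, the equation can be written as a small R function. This is only a sketch: `box_office_hat` is a hypothetical helper name, and the two coefficients are copied from the equation on this slide rather than estimated from data.

```r
# Hypothetical helper: predicted box office (in millions of dollars)
# using the coefficients shown in the equation on this slide
box_office_hat <- function(meter) {
  18.32 + 0.56 * meter
}

# A Tomato Meter of zero returns the intercept
box_office_hat(0)

# Two movies one point apart differ by the slope (about 0.56 million)
box_office_hat(51) - box_office_hat(50)
```

The second call gives the same difference for any pair of scores one point apart, which is what a constant slope means.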
] -- .pull-right[ ### Slope The model predicts that a one percentage point increase in the Tomato Meter is associated with a $560,000 increase in box office returns, on average. ] --- ## Nonsensical Intercepts Try interpreting these numbers from a regression model where the dependent variable is sexual frequency (sexual encounters per year) and the independent variable is age in years. `$$\hat{\texttt{sex}}_i=107.96-1.30(\texttt{age}_i)$$` -- .pull-left[ ### Intercept The model predicts that newborns will have sex 107.96 times per year, on average. 😮 Say what??!! ] -- .pull-right[ <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## Getting meaningful intercepts .pull-left[ Let's subtract some constant `\(a\)` from the variable `\(x\)`: `$$x^*=x-a$$` The value for zero on our new re-centered `\(x^*\)` will be `\(a\)` on the original scale. In the formula of the `lm` command in R, we can do this easily by surrounding our math with `I()`, which tells R to evaluate the arithmetic inside and treat the result as a new variable: ```r model <- lm(sexf~I(age-18), data=sex) round(coef(model), 2) ``` ``` ## (Intercept) I(age - 18) ## 84.58 -1.30 ``` ] -- .pull-right[ <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## Now interpret these numbers `$$\hat{\texttt{sex}}_i=84.58-1.30(\texttt{age}_i-18)$$` -- .pull-left[ ### Intercept The model predicts that 18-year-old individuals have sex 84.58 times per year, on average. ] -- .pull-right[ ### Slope The model predicts that a one-year increase in age is associated with 1.3 fewer sexual encounters per year, on average. ] --- ## How good is `\(x\)` as a predictor of `\(y\)`? I pick a random observation from the dataset and ask you to guess the value of `\(y\)`. What is your best guess? 
-- .pull-left[ ### Choose `\(\bar{y}\)` * Because it is the balancing point, the mean will give you the smallest squared error, on average. * If you repeat this procedure, your typical error in prediction will be about `\(s_y\)`. ] .pull-right[ <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## How good is `\(x\)` as a predictor of `\(y\)`? I pick a random observation from the dataset and *tell you its value of `\(x\)`*, and then ask you to guess the value of `\(y\)`. What is your best guess? -- .pull-left[ ### Choose `\(\hat{y}_i\)` from the linear model * Assuming that a linear model is reasonable, the predicted value from this model will be your best guess. * The average error in your prediction will be equal to the average size of the residuals from the model, `\(|\hat{y}_i-y_i|\)`. ] .pull-right[ <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## How much did we reduce the error? .pull-left[ <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-16-1.png" width="504" style="display: block; margin: auto;" /> ] .pull-right[ #### On average, what proportion of the red line is the green line across all observations? Red: `\(\sum_{i=1}^n (y_i-\bar{y})^2\)` Green: `\(\sum_{i=1}^n (y_i-\hat{y}_i)^2\)` Proportion: `\(\frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{\sum_{i=1}^n (y_i-\bar{y})^2}\)` It turns out that the proportion of the error we removed is: `$$1-\frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{\sum_{i=1}^n (y_i-\bar{y})^2}=r^2$$` ```r cor(crimes$Unemployment, crimes$Property)^2 ``` ``` ## [1] 0.1974281 ``` ] --- ## R-squared is a measure of goodness of fit .pull-left[ ```r cor(crimes$Unemployment, crimes$Property)^2 ``` ``` ## [1] 0.1974281 ``` About 19.7% of the variation in property crime rates across states can be accounted for by variation in unemployment rates across states. 
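One way to see where a number like this comes from is to compute the sums of squares by hand: one minus the residual sum of squares over the total sum of squares should match the squared correlation. The sketch below uses R's built-in `mtcars` data as a stand-in, since the `crimes` data are not reproduced here; the object names `fit`, `ss_total`, and `ss_resid` are my own.

```r
# Stand-in example with built-in data (not the crimes data from these slides)
fit <- lm(mpg ~ wt, data = mtcars)

# Squared error when guessing the mean of y for every observation
ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Squared error left over when guessing the model's predicted value
ss_resid <- sum(residuals(fit)^2)

# Proportional reduction in squared error
r2_by_hand <- 1 - ss_resid / ss_total
r2_by_hand

# The squared correlation and the value reported by summary() agree with it
cor(mtcars$wt, mtcars$mpg)^2
summary(fit)$r.squared
```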
] -- .pull-right[ ```r summary(lm(Property~Unemployment, data=crimes)) ``` ``` ## ## Call: ## lm(formula = Property ~ Unemployment, data = crimes) ## ## Residuals: ## Min 1Q Median 3Q Max ## -987.7 -453.2 -100.8 453.7 1588.5 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1628.35 373.50 4.360 6.67e-05 *** ## Unemployment 148.68 42.82 3.472 0.00109 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 580.5 on 49 degrees of freedom *## Multiple R-squared: 0.1974, Adjusted R-squared: 0.181 ## F-statistic: 12.05 on 1 and 49 DF, p-value: 0.001089 ``` ] --- ## Statistical inference for linear models .pull-left[ The population model is: `$$\hat{y}_i=\beta_0+\beta_1(x_i)$$` The null hypothesis of no relationship is given by: `$$H_0: \beta_1=0$$` How do we test? ] -- .pull-right[ ```r summary(lm(Property~Unemployment, data=crimes)) ``` ``` ## ## Call: ## lm(formula = Property ~ Unemployment, data = crimes) ## ## Residuals: ## Min 1Q Median 3Q Max ## -987.7 -453.2 -100.8 453.7 1588.5 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1628.35 373.50 4.360 6.67e-05 *** *## Unemployment 148.68 42.82 3.472 0.00109 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 580.5 on 49 degrees of freedom ## Multiple R-squared: 0.1974, Adjusted R-squared: 0.181 ## F-statistic: 12.05 on 1 and 49 DF, p-value: 0.001089 ``` Just look at a `summary` of the model! 
😎 ] --- ## ⚠️ Linear models only fit straight lines <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-21-1.png" width="864" style="display: block; margin: auto;" /> --- ## ⚠️ Outliers can be influential <iframe src="https://aarongullickson.shinyapps.io/influentialpoints/"> </iframe> --- ## ⚠️ Don't extrapolate beyond range of data <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-22-1.png" width="864" style="display: block; margin: auto;" /> --- class: inverse, center, middle background-image: url(images/jens-johnsson-DHJ71drt-ug-unsplash.jpg) background-size: cover # The Power of Controlling for Other Variables --- ## Does getting educated also get you laid? -- .pull-left[ ```r model <- lm(sexf~educ, data=sex) coef(model) ``` ``` ## (Intercept) educ ## 49.7295901 0.0266939 ``` -- * The model predicts that a one year increase in education is associated with 0.027 more instances of sex per year. * Going from a high school diploma to a bachelor's degree gets you laid 0.108 ( `\(0.027*4\)` ) more times per year. ] -- .pull-right[ ### ⚠️ Potential spuriousness detected! ```r cor(sex$age, sex$sexf) ``` ``` ## [1] -0.3974668 ``` ```r cor(sex$age, sex$educ) ``` ``` ## [1] -0.06018569 ``` * Younger people have more sex than older people. * Younger people have more education, on average. * What if the positive relationship between sexual frequency and education is because younger people have more sex and younger people are more educated? 
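A small simulation can show how a third variable produces this kind of pattern. The sketch below uses made-up data: every variable name and coefficient is hypothetical and is not an estimate from the `sex` data. Age lowers both education and sexual frequency, and education itself is given a slope of -1.

```r
# Simulated (hypothetical) data, not the survey data used in these slides
set.seed(312)
n <- 2000
age_sim  <- runif(n, 18, 80)
# Younger people get slightly more schooling
educ_sim <- 16 - 0.05 * age_sim + rnorm(n, sd = 2)
# Both age and education lower frequency; the true education slope is -1
sexf_sim <- 100 - 1.3 * age_sim - 1 * educ_sim + rnorm(n, sd = 5)

# Ignoring age: the education slope absorbs part of the age effect
slope_ignoring_age <- coef(lm(sexf_sim ~ educ_sim))["educ_sim"]

# Adding age to the formula: the slope lands near the true value of -1
slope_given_age <- coef(lm(sexf_sim ~ educ_sim + age_sim))["educ_sim"]

slope_ignoring_age
slope_given_age
```

In this setup the first slope comes out positive even though education was built to have a negative effect, because more educated people in the simulated data also tend to be younger.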
] --- ## Age might be a confounding variable .pull-left[ ### Causal <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-25-1.png" width="504" style="display: block; margin: auto;" /> ] .pull-right[ ### Spurious <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## Account for a confounding variable .pull-left[ Just add the potential confounder to the model: `$$\hat{\texttt{frequency}}_i=b_0+b_1(\texttt{education}_i)+b_2(\texttt{age}_i)$$` 😮 That's right, you can have more than one independent variable in a linear model. But what does it mean? ] -- .pull-right[
] --- ##
Calculating the model ```r model <- lm(sexf~I(educ-12)+I(age-18), data=sex) coef(model) ``` ``` ## (Intercept) I(educ - 12) I(age - 18) ## 85.929115 -0.427747 -1.303385 ``` -- .pull-left[ ### Why these numbers? The slopes and intercept are chosen to minimize the sum of the squared residuals, just as for a bivariate OLS regression model. ] -- .pull-right[ ### Interpretation * The model predicts that 18-year-old individuals with 12 years of education have sex about 85.9 times per year, on average. * The model predicts that, **holding constant age**, a one-year increase in education is associated with 0.43 *fewer* instances of sex per year, on average. * The model predicts that, **holding education constant**, a one-year increase in age is associated with 1.30 fewer instances of sex per year, on average. ] --- ## 🤔 Holding Constant? Because both independent variables are in the model at the same time, the effect of each variable is net of the indirect effect of the other variable. We can say this in different ways: -- * The model predicts that, **among individuals who are the same age**, a one-year increase in education is associated with 0.43 fewer instances of sex per year, on average. -- * The model predicts that, **holding constant age**, a one-year increase in education is associated with 0.43 fewer instances of sex per year, on average. -- * The model predicts that, **controlling for age**, a one-year increase in education is associated with 0.43 fewer instances of sex per year, on average. --- ## What is the effect of education on sexual frequency? -- .pull-left[ ### The relationship seemed positive... <img src="module5_slides_building_models_files/figure-html/unnamed-chunk-29-1.png" width="504" style="display: block; margin: auto;" /> ] -- .pull-right[ ### but it was negative once we controlled for age! 
<img src="module5_slides_building_models_files/figure-html/unnamed-chunk-30-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## How to present multiple linear models .pull-left[ .stargazer[ <table cellspacing="0" align="center" style="border: none;"> <caption align="top" style="margin-bottom:0.3em;">OLS regression models predicting sexual frequency</caption> <tr> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b></b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 1</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 2</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 3</b></th> </tr> <tr> <td style="padding-right: 12px; border: none;">Intercept</td> <td style="padding-right: 12px; border: none;">50.05<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">84.58<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">85.93<sup style="vertical-align: 0px;">***</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(1.67)</td> <td style="padding-right: 12px; border: none;">(2.04)</td> <td style="padding-right: 12px; border: none;">(2.37)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">years of educ.</td> <td style="padding-right: 12px; border: none;">0.03</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">-0.43</td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.41)</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: 
none;">(0.38)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">age</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">-1.30<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-1.30<sup style="vertical-align: 0px;">***</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.07)</td> <td style="padding-right: 12px; border: none;">(0.07)</td> </tr> <tr> <td style="border-top: 1px solid black;">R-squared</td> <td style="border-top: 1px solid black;">0.00</td> <td style="border-top: 1px solid black;">0.16</td> <td style="border-top: 1px solid black;">0.16</td> </tr> <tr> <td style="border-bottom: 2px solid black;">N</td> <td style="border-bottom: 2px solid black;">2103</td> <td style="border-bottom: 2px solid black;">2103</td> <td style="border-bottom: 2px solid black;">2103</td> </tr> <tr> <td style="padding-right: 12px; border: none;" colspan="5"><span style="font-size:0.8em"><sup style="vertical-align: 0px;">***</sup>p < 0.001, <sup style="vertical-align: 0px;">**</sup>p < 0.01, <sup style="vertical-align: 0px;">*</sup>p < 0.05. Standard errors in parentheses. Age centered on 18 years. Education centered on 12 years.</span></td> </tr> </table> ] ] -- .pull-right[ * The dependent variable is identified in the caption. 
] --- ## How to present multiple linear models .pull-left[ .stargazer[ <table cellspacing="0" align="center" style="border: none;"> <caption align="top" style="margin-bottom:0.3em;">OLS regression models predicting sexual frequency</caption> <tr> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b></b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 1</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 2</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 3</b></th> </tr> <tr> <td style="padding-right: 12px; border: none;">Intercept</td> <td style="padding-right: 12px; border: none;">50.05<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">84.58<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">85.93<sup style="vertical-align: 0px;">***</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(1.67)</td> <td style="padding-right: 12px; border: none;">(2.04)</td> <td style="padding-right: 12px; border: none;">(2.37)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">years of educ.</td> <td style="padding-right: 12px; border: none;">0.03</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">-0.43</td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.41)</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.38)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">age</td> <td style="padding-right: 12px; border: none;"></td> <td 
style="padding-right: 12px; border: none;">-1.30<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-1.30<sup style="vertical-align: 0px;">***</sup></td> </tr>
<tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.07)</td> <td style="padding-right: 12px; border: none;">(0.07)</td> </tr>
<tr> <td style="border-top: 1px solid black;">R-squared</td> <td style="border-top: 1px solid black;">0.00</td> <td style="border-top: 1px solid black;">0.16</td> <td style="border-top: 1px solid black;">0.16</td> </tr>
<tr> <td style="border-bottom: 2px solid black;">N</td> <td style="border-bottom: 2px solid black;">2103</td> <td style="border-bottom: 2px solid black;">2103</td> <td style="border-bottom: 2px solid black;">2103</td> </tr>
<tr> <td style="padding-right: 12px; border: none;" colspan="5"><span style="font-size:0.8em"><sup style="vertical-align: 0px;">***</sup>p < 0.001, <sup style="vertical-align: 0px;">**</sup>p < 0.01, <sup style="vertical-align: 0px;">*</sup>p < 0.05. Standard errors in parentheses. Age centered on 18 years. Education centered on 12 years.</span></td> </tr>
</table>
]
]

.pull-right[
* The dependent variable is identified in the caption.
* Each model is shown in a column.
* Independent variables are on the rows. If a cell is blank, then the given variable is not in the model.
* Within each cell:
    * The top number is the slope.
    * The bottom number in parentheses is the standard error for the slope.
    * The asterisks give benchmarks of the p-value for rejecting the null hypothesis that a slope is zero.
* Summary statistics are shown at the bottom. The `\(R^2\)` value is the proportion of the variance in the dependent variable accounted for by all the independent variables collectively.
]

---

## Why stop at two variables?
.pull-left[
```r
model1 <- lm(BoxOffice~I(Runtime-90), data=movies)
model2 <- lm(BoxOffice~I(Runtime-90)+I(Year-2001), data=movies)
model3 <- lm(BoxOffice~I(Runtime-90)+I(Year-2001)
             +TomatoMeter, data=movies)
```
]

.pull-right[
.stargazer[
<table cellspacing="0" align="center" style="border: none;">
<caption align="top" style="margin-bottom:0.3em;">OLS regression models predicting box office returns</caption>
<tr> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b></b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 1</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 2</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 3</b></th> </tr>
<tr> <td style="padding-right: 12px; border: none;">Intercept</td> <td style="padding-right: 12px; border: none;">24.00<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">20.54<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">6.59<sup style="vertical-align: 0px;">*</sup></td> </tr>
<tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(1.67)</td> <td style="padding-right: 12px; border: none;">(2.67)</td> <td style="padding-right: 12px; border: none;">(3.26)</td> </tr>
<tr> <td style="padding-right: 12px; border: none;">Runtime</td> <td style="padding-right: 12px; border: none;">1.39<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">1.39<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">1.25<sup style="vertical-align: 0px;">***</sup></td> </tr>
<tr> <td style="padding-right: 12px; border: none;"></td> <td
style="padding-right: 12px; border: none;">(0.07)</td> <td style="padding-right: 12px; border: none;">(0.07)</td> <td style="padding-right: 12px; border: none;">(0.08)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Year of release</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">0.55</td> <td style="padding-right: 12px; border: none;">0.43</td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.34)</td> <td style="padding-right: 12px; border: none;">(0.33)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Tomato meter</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">0.35<sup style="vertical-align: 0px;">***</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.05)</td> </tr> <tr> <td style="border-top: 1px solid black;">R-squared</td> <td style="border-top: 1px solid black;">0.12</td> <td style="border-top: 1px solid black;">0.12</td> <td style="border-top: 1px solid black;">0.14</td> </tr> <tr> <td style="border-bottom: 2px solid black;">N</td> <td style="border-bottom: 2px solid black;">2553</td> <td style="border-bottom: 2px solid black;">2553</td> <td style="border-bottom: 2px solid black;">2553</td> </tr> <tr> <td style="padding-right: 12px; border: none;" colspan="5"><span style="font-size:0.8em"><sup style="vertical-align: 0px;">***</sup>p < 0.001, <sup style="vertical-align: 0px;">**</sup>p < 0.01, <sup style="vertical-align: 0px;">*</sup>p < 0.05. 
Standard errors in parentheses</span></td> </tr>
</table>
]
]

---
class: inverse, center, middle
background-image: url(images/v2osk-c9OfrVeD_tQ-unsplash.jpg)
background-size: cover

# Including Categorical Variables as Predictors

---

## Gender and sexual frequency

```r
tapply(sex$sexf, sex$gender, mean)
```

```
##     Male   Female
## 53.27521 47.42087
```

```r
47.421-53.275
```

```
## [1] -5.854
```

--

Women report 5.854 fewer sexual encounters per year than men. Note that I use the term *report* here because it's not exactly clear why these numbers would be different. The difference could reflect differences by sexual orientation, or it could just be that either men over-report or women under-report sexual frequency. It could also be sampling error.

---

## Make an indicator variable

`$$\texttt{female}_i=\begin{cases} 1 & \text{if female}\\ 0 & \text{otherwise} \end{cases}$$`

* male is the **reference** category.
* female is the **indicated** category.
* It operates like an on/off switch.

--

```r
sex$female <- as.numeric(sex$gender=="Female")
table(sex$gender, sex$female)
```

```
##
##             0    1
##   Male    972    0
##   Female    0 1131
```

---

# Make a scatterplot with indicator

<img src="module5_slides_building_models_files/figure-html/unnamed-chunk-41-1.png" width="864" style="display: block; margin: auto;" />

---

# Make a scatterplot with indicator

<img src="module5_slides_building_models_files/figure-html/unnamed-chunk-42-1.png" width="864" style="display: block; margin: auto;" />

---

# Make a scatterplot with indicator

<img src="module5_slides_building_models_files/figure-html/unnamed-chunk-43-1.png" width="864" style="display: block; margin: auto;" />

---

## `tapply` vs. `lm`

### Two separate means

```r
tapply(sex$sexf, sex$female, mean)
```

```
##        0        1
## 53.27521 47.42087
```

--

.pull-left[
### Mean and mean difference

```r
model <- lm(sexf~female, data=sex)
coef(model)
```

```
## (Intercept)      female
##   53.275206   -5.854339
```

* The intercept is the mean for the reference category (men).
* The slope is the mean difference, which in this case tells us that women report 5.854 fewer instances of sex per year than men.
]

--

.pull-right[
### Reverse reference

```r
sex$male <- as.numeric(sex$gender=="Male")
model <- lm(sexf~male, data=sex)
coef(model)
```

```
## (Intercept)        male
##   47.420866    5.854339
```

What changes and why?
]

---

## Categorical variables in `lm`

There is no need to create indicator variables. Just feed in categorical variables directly:

```r
model <- lm(sexf~gender, data=sex)
coef(model)
```

```
##  (Intercept) genderFemale
##    53.275206    -5.854339
```

--

* *R* knows what to do with the variable. It creates its own indicator variable.

--

* The reference for the categorical variable is already set as the first category, which in this case is male. You can use the `relevel` command to change the reference:

```r
model <- lm(sexf~relevel(gender, "Female"), data=sex)
coef(model)
```

```
##                   (Intercept) relevel(gender, "Female")Male
##                     47.420866                      5.854339
```

---

## More than two categories

```r
model <- lm(sexf~marital, data=sex)
coef(model)
```

```
##          (Intercept)       maritalWidowed      maritalDivorced
##           56.0941065          -46.8714787          -14.7053870
##     maritalSeparated maritalNever married
##           -0.4413287           -2.4764022
```

--

* Each category gets an indicator variable, **except for one.** Which one is missing here?

--

* Married is the reference category. The category not included is always the reference category.

--

* Each coefficient gives the mean difference between the indicated category and the reference category.
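
The claim that each coefficient is a mean difference can be checked directly. The sketch below uses a small made-up data frame (`toy`, with invented values; it is not the `sex` data used in these slides) and compares the `lm` coefficients with group means computed by `tapply`:

```r
# Hypothetical toy data: invented values, four marital categories
toy <- data.frame(
  sexf = c(60, 50, 10, 6, 40, 44, 52, 48),
  marital = factor(
    rep(c("Married", "Widowed", "Divorced", "Never married"), each = 2),
    levels = c("Married", "Widowed", "Divorced", "Never married")
  )
)

# Group means, one per category
group_means <- tapply(toy$sexf, toy$marital, mean)

# Model with the categorical variable; "Married" is first, so it is the reference
model <- lm(sexf ~ marital, data = toy)

# Difference between each non-reference mean and the reference mean
mean_diffs <- group_means[-1] - group_means[["Married"]]

coef(model)  # intercept = reference mean, other coefficients = mean differences
mean_diffs
```

The intercept should match the mean for the reference category, and the remaining coefficients should match `mean_diffs`, which is the same logic as in the gender example but with more than one indicator variable.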
---

## Interpretations

```r
model <- lm(sexf~marital, data=sex)
coef(model)
```

```
##          (Intercept)       maritalWidowed      maritalDivorced
##           56.0941065          -46.8714787          -14.7053870
##     maritalSeparated maritalNever married
##           -0.4413287           -2.4764022
```

--

* Married individuals have sex 56.1 times per year, on average.

--

* Widowed individuals have sex 46.9 fewer times per year than **married individuals**, on average.

--

* Divorced individuals have sex 14.7 fewer times per year than **married individuals**, on average.

--

* Separated individuals have sex 0.4 fewer times per year than **married individuals**, on average.

--

* Never-married individuals have sex 2.5 fewer times per year than **married individuals**, on average.

---

## Why?

We can already calculate mean differences. Why do it in a model?

--

In a model, we can control for other variables. For example, how much of the differences in sexual frequency by marital status result from differences in age?

--

```r
model <- lm(sexf~marital+I(age-18), data=sex)
coef(model)
```

```
##          (Intercept)       maritalWidowed      maritalDivorced
##            98.014498           -15.279868           -10.461043
##     maritalSeparated maritalNever married          I(age - 18)
##            -8.189547           -24.339338            -1.471529
```

---

## Compare the models

.pull-left[
.stargazer[
<table cellspacing="0" align="center" style="border: none;">
<caption align="top" style="margin-bottom:0.3em;">OLS regression models predicting sexual frequency</caption>
<tr> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b></b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 1</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 2</b></th> </tr>
<tr> <td style="padding-right: 12px; border: none;">Intercept</td> <td style="padding-right: 12px; border: none;">56.09<sup style="vertical-align: 0px;">***</sup></td> <td
style="padding-right: 12px; border: none;">98.01<sup style="vertical-align: 0px;">***</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(1.61)</td> <td style="padding-right: 12px; border: none;">(2.65)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Widowed</td> <td style="padding-right: 12px; border: none;">-46.87<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-15.28<sup style="vertical-align: 0px;">**</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(4.75)</td> <td style="padding-right: 12px; border: none;">(4.69)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Divorced</td> <td style="padding-right: 12px; border: none;">-14.71<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-10.46<sup style="vertical-align: 0px;">***</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(3.31)</td> <td style="padding-right: 12px; border: none;">(3.06)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Separated</td> <td style="padding-right: 12px; border: none;">-0.44</td> <td style="padding-right: 12px; border: none;">-8.19</td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(6.37)</td> <td style="padding-right: 12px; border: none;">(5.90)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Never married</td> <td style="padding-right: 12px; border: none;">-2.48</td> <td style="padding-right: 12px; border: none;">-24.34<sup style="vertical-align: 0px;">***</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(2.81)</td> <td style="padding-right: 12px; border: none;">(2.84)</td> </tr> <tr> <td 
style="padding-right: 12px; border: none;">Age</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">-1.47<sup style="vertical-align: 0px;">***</sup></td> </tr>
<tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.08)</td> </tr>
<tr> <td style="border-top: 1px solid black;">R-squared</td> <td style="border-top: 1px solid black;">0.05</td> <td style="border-top: 1px solid black;">0.19</td> </tr>
<tr> <td style="border-bottom: 2px solid black;">N</td> <td style="border-bottom: 2px solid black;">2103</td> <td style="border-bottom: 2px solid black;">2103</td> </tr>
<tr> <td style="padding-right: 12px; border: none;" colspan="4"><span style="font-size:0.8em"><sup style="vertical-align: 0px;">***</sup>p < 0.001, <sup style="vertical-align: 0px;">**</sup>p < 0.01, <sup style="vertical-align: 0px;">*</sup>p < 0.05. Standard errors in parentheses. Age centered on 18 years. Married is the reference category for marital status.</span></td> </tr>
</table>
]
]

--

.pull-right[
#### Controlling for age gets rid of the bias between marital groups due to age differences.

* The difference between widowed and married was reduced from -46.9 to -15.3 once we controlled for age because the widowed are much older than the married.
* The difference between never married and married increased from -2.5 to -24.3 once we controlled for age because the never married are much younger than the married.
]

---
class: inverse, center, middle
background-image: url(images/denys-nevozhai-7nrsVjvALnA-unsplash.jpg)
background-size: cover

# Interaction Terms

---

## Adding context to a relationship

.pull-left[
#### What if the relationship between number of children and hourly wages varies by gender?
```r
ggplot(earnings, aes(x=nchild, y=wages,
*                    color=gender))+
  geom_jitter(alpha=0.1)+
  geom_smooth(method="lm", se=FALSE)+
  labs(x="number of children", y="hourly wages in USD")+
  theme_bw()
```
]

.pull-right[
<img src="module5_slides_building_models_files/figure-html/unnamed-chunk-53-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Additive models will miss context

```r
model_add <- lm(wages~nchild+gender, data=earnings)
coef(model_add)
```

```
##  (Intercept)       nchild genderFemale
##    25.239547     1.155477    -3.974105
```

`$$\hat{\texttt{wages}}_i=25.24+1.16(\texttt{nchild}_i)-3.97(\texttt{female}_i)$$`

What is the relationship between wages and number of children for men and women?

--

.pull-left[
### Men

The `\(\texttt{female}_i\)` variable is an indicator variable that is zero for men, so:

`$$\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 25.24+1.16(\texttt{nchild}_i)-3.97(0)\\ \hat{\texttt{wages}}_i & = & 25.24+1.16(\texttt{nchild}_i) \end{eqnarray*}$$`
]

--

.pull-right[
### Women

The `\(\texttt{female}_i\)` variable is an indicator variable that is one for women, so:

`$$\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 25.24+1.16(\texttt{nchild}_i)-3.97(1)\\ \hat{\texttt{wages}}_i & = & (25.24-3.97)+1.16(\texttt{nchild}_i)\\ \hat{\texttt{wages}}_i & = & 21.27+1.16(\texttt{nchild}_i)\\ \end{eqnarray*}$$`
]

---

## Because additive models make parallel lines

.pull-left[
<img src="module5_slides_building_models_files/figure-html/unnamed-chunk-55-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
`$$\hat{\texttt{wages}}_i=25.24+1.16(\texttt{nchild}_i)-3.97(\texttt{female}_i)$$`

* The effect of number of children on wages ($1.16) is **forced** to be the same for men and women.
* The wage difference between men and women ($3.97) is **forced** to be the same at all values of number of children.
]

---

## We need a multiplicative model

```r
model_mult <- lm(wages~nchild*gender, data=earnings)
coef(model_mult)
```

```
##         (Intercept)              nchild        genderFemale
##           24.719748            1.778860           -2.839198
## nchild:genderFemale
##           -1.334728
```

`$$\hat{\texttt{wages}}_i=24.72+1.78(\texttt{nchild}_i)-2.84(\texttt{female}_i)-1.33(\texttt{nchild}_i)(\texttt{female}_i)$$`

What is the relationship between wages and number of children for men and women?

--

.pull-left[
### Men

The `\(\texttt{female}_i\)` variable is an indicator variable that is zero for men, so:

`$$\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i)-2.84(0)-1.33(\texttt{nchild}_i)(0)\\ \hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i) \end{eqnarray*}$$`
]

--

.pull-right[
### Women

The `\(\texttt{female}_i\)` variable is an indicator variable that is one for women, so:

`$$\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i)-2.84(1)-1.33(\texttt{nchild}_i)(1)\\ \hat{\texttt{wages}}_i & = & (24.72-2.84)+(1.78-1.33)(\texttt{nchild}_i)\\ \hat{\texttt{wages}}_i & = & 21.88+0.45(\texttt{nchild}_i)\\ \end{eqnarray*}$$`
]

---

## Multiplicative models give non-parallel lines

.pull-left[
<img src="module5_slides_building_models_files/figure-html/unnamed-chunk-57-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
`$$\begin{eqnarray*}\hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i)-2.84(\texttt{female}_i)\\ & & -1.33(\texttt{nchild}_i)(\texttt{female}_i)\\\end{eqnarray*}$$`

* This model shows that men and women get different returns to wages for the number of children, with men getting a much greater return ($1.78 vs. $0.45).
* This model shows that the wage gap starts at $2.84 when men and women have no children and grows by $1.33 for every child.
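
As a quick arithmetic check, the rounded coefficients from the multiplicative model can be plugged in to see how the gap changes with the number of children. This is only a sketch: the vector `b` is typed in by hand from the model output rather than taken from a fitted model, and the range `0:4` is an arbitrary choice for illustration.

```r
# Rounded coefficients copied by hand from the multiplicative model output
b <- c(intercept = 24.72, nchild = 1.78, female = -2.84, interaction = -1.33)

# Arbitrary illustrative range for number of children
nchild <- 0:4

# Predicted wages for men: female = 0, so only intercept and nchild matter
men <- b[["intercept"]] + b[["nchild"]] * nchild

# Predicted wages for women: female = 1 shifts both the intercept and the slope
women <- (b[["intercept"]] + b[["female"]]) +
  (b[["nchild"]] + b[["interaction"]]) * nchild

gap_table <- data.frame(nchild, men, women, gap = women - men)
gap_table
```

The `gap` column is the main effect of gender plus the interaction coefficient times the number of children, which is why the two lines are not parallel.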
]

---

## Two approaches

--

.pull-left[
### Separate models

```r
coef(lm(wages~nchild, data=subset(earnings, gender=="Female")))
```

```
## (Intercept)      nchild
##  21.8805499   0.4441326
```

```r
coef(lm(wages~nchild, data=subset(earnings, gender=="Male")))
```

```
## (Intercept)      nchild
##    24.71975     1.77886
```

* Two intercepts (women and men)
* Two slopes (women and men)
]

--

.pull-right[
### Interaction term

```r
coef(lm(wages~nchild*gender, data=earnings))
```

```
##         (Intercept)              nchild
##           24.719748            1.778860
##        genderFemale nchild:genderFemale
##           -2.839198           -1.334728
```

* One intercept (men) and one difference in intercept (women)
* One slope (men) and one difference in slope (women)
]

---

## Interaction terms give difference in slopes

| Value                                                          | Separate models | Interaction terms |
| :------------------------------------------------------------- | --------------: | ----------------: |
| **Intercept**                                                  |                 |                   |
| Men's wages with no children                                   |          $24.72 |            $24.72 |
| Women's wages with no children                                 |          $21.88 |                   |
| Difference in men's and women's wages with no children         |                 |            -$2.84 |
| **Slope**                                                      |                 |                   |
| Men's return for an additional child                           |           $1.78 |             $1.78 |
| Women's return for an additional child                         |           $0.45 |                   |
| Difference in men's and women's return for an additional child |                 |            -$1.33 |

---

## Interpretation

```r
coef(lm(wages~nchild*gender, data=earnings))
```

```
##         (Intercept)              nchild        genderFemale nchild:genderFemale
##           24.719748            1.778860           -2.839198           -1.334728
```

`$$\hat{\texttt{wages}}_i = 24.72+1.78(\texttt{nchild}_i)-2.84(\texttt{female}_i)-1.33(\texttt{nchild}_i)(\texttt{female}_i)$$`

--

* The model predicts that men with no children make $24.72/hour, on average.

--

* The model predicts that **among workers with no children**, women make $2.84 less than men, on average.

--

* The model predicts that **among men**, having an additional child at home is associated with a $1.78 increase in hourly wages.
--

* The model predicts that the gain in hourly wages from having an additional child at home is $1.33 smaller for women than it is for men.

--

* The **main effect** of each variable in the interaction term is only the effect when the other variable in the interaction term is zero/the reference category.

---

## Why not always separate models?

--

.pull-left[
.stargazer[
<table cellspacing="0" align="center" style="border: none;">
<caption align="top" style="margin-bottom:0.3em;">OLS regression models predicting hourly wages</caption>
<tr> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b></b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 1</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 2</b></th> </tr>
<tr> <td style="padding-right: 12px; border: none;">Intercept</td> <td style="padding-right: 12px; border: none;">24.72<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">24.79<sup style="vertical-align: 0px;">***</sup></td> </tr>
<tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.07)</td> <td style="padding-right: 12px; border: none;">(0.07)</td> </tr>
<tr> <td style="padding-right: 12px; border: none;">number of children</td> <td style="padding-right: 12px; border: none;">1.78<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">1.47<sup style="vertical-align: 0px;">***</sup></td> </tr>
<tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.05)</td> <td style="padding-right: 12px; border: none;">(0.05)</td> </tr>
<tr> <td style="padding-right: 12px; border: none;">woman</td> <td style="padding-right: 12px; border: none;">-2.84<sup
style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-3.13<sup style="vertical-align: 0px;">***</sup></td> </tr>
<tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.10)</td> <td style="padding-right: 12px; border: none;">(0.10)</td> </tr>
<tr> <td style="padding-right: 12px; border: none;">woman x number of children</td> <td style="padding-right: 12px; border: none;">-1.33<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-1.08<sup style="vertical-align: 0px;">***</sup></td> </tr>
<tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.07)</td> <td style="padding-right: 12px; border: none;">(0.07)</td> </tr>
<tr> <td style="padding-right: 12px; border: none;">age</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">0.27<sup style="vertical-align: 0px;">***</sup></td> </tr>
<tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.00)</td> </tr>
<tr> <td style="border-top: 1px solid black;">R-squared</td> <td style="border-top: 1px solid black;">0.02</td> <td style="border-top: 1px solid black;">0.07</td> </tr>
<tr> <td style="border-bottom: 2px solid black;">N</td> <td style="border-bottom: 2px solid black;">145647</td> <td style="border-bottom: 2px solid black;">145647</td> </tr>
<tr> <td style="padding-right: 12px; border: none;" colspan="4"><span style="font-size:0.8em"><sup style="vertical-align: 0px;">***</sup>p < 0.001, <sup style="vertical-align: 0px;">**</sup>p < 0.01, <sup style="vertical-align: 0px;">*</sup>p < 0.05. Standard errors in parentheses.
Age centered on 40 years.</span></td> </tr> </table> ] ] .pull-right[ ```r model1 <- lm(wages~nchild*gender, data=earnings) model2 <- lm(wages~nchild*gender+I(age-40), data=earnings) ``` * We can add in additional control variables without forcing everything to vary by context. * We can do a hypothesis test directly on whether the slopes really are different by looking at the p-value on the interaction term. ] --- ## Interacting two categorical variables An additive model: ```r coef(lm(wages~gender+education, data=earnings)) ``` ``` ## (Intercept) genderFemale educationHS Diploma ## 16.268045 -5.059960 4.711734 ## educationAA Degree educationBachelors Degree educationGraduate Degree ## 8.396302 17.126833 24.633114 ``` -- | | LHS | HS | AA | BA | Grad | | :---------------- | --: | --: | --: | --: | ---: | | Man | | | | | | | Woman | | | | | | | Gender difference | | | | | | --- ## Interacting two categorical variables An additive model: ```r coef(lm(wages~gender+education, data=earnings)) ``` ``` ## (Intercept) genderFemale educationHS Diploma ## 16.268045 -5.059960 4.711734 ## educationAA Degree educationBachelors Degree educationGraduate Degree ## 8.396302 17.126833 24.633114 ``` | | LHS | HS | AA | BA | Grad | | :---------------- | ----: | --: | --: | --: | ---: | | Man | 16.27 | | | | | | Woman | | | | | | | Gender difference | | | | | | --- ## Interacting two categorical variables An additive model: ```r coef(lm(wages~gender+education, data=earnings)) ``` ``` ## (Intercept) genderFemale educationHS Diploma ## 16.268045 -5.059960 4.711734 ## educationAA Degree educationBachelors Degree educationGraduate Degree ## 8.396302 17.126833 24.633114 ``` | | LHS | HS | AA | BA | Grad | | :---------------- | ---------: | --: | --: | --: | ---: | | Man | 16.27 | | | | | | Woman | 16.27-5.06 | | | | | | Gender difference | -5.06 | | | | | --- ## Interacting two categorical variables An additive model: ```r coef(lm(wages~gender+education, data=earnings)) ``` ``` ## 
(Intercept) genderFemale educationHS Diploma ## 16.268045 -5.059960 4.711734 ## educationAA Degree educationBachelors Degree educationGraduate Degree ## 8.396302 17.126833 24.633114 ``` | | LHS | HS | AA | BA | Grad | | :---------------- | ---------: | ---------: | --: | --: | ---: | | Man | 16.27 | 16.27+4.71 | | | | | Woman | 16.27-5.06 | | | | | | Gender difference | -5.06 | | | | | --- ## Interacting two categorical variables An additive model: ```r coef(lm(wages~gender+education, data=earnings)) ``` ``` ## (Intercept) genderFemale educationHS Diploma ## 16.268045 -5.059960 4.711734 ## educationAA Degree educationBachelors Degree educationGraduate Degree ## 8.396302 17.126833 24.633114 ``` | | LHS | HS | AA | BA | Grad | | :---------------- | ---------: | --------------: | --: | --: | ---: | | Man | 16.27 | 16.27+4.71 | | | | | Woman | 16.27-5.06 | 16.27+4.71-5.06 | | | | | Gender difference | -5.06 | -5.06 | | | | --- ## Interacting two categorical variables An additive model: ```r coef(lm(wages~gender+education, data=earnings)) ``` ``` ## (Intercept) genderFemale educationHS Diploma ## 16.268045 -5.059960 4.711734 ## educationAA Degree educationBachelors Degree educationGraduate Degree ## 8.396302 17.126833 24.633114 ``` | | LHS | HS | AA | BA | Grad | | :---------------- | ---------: | --------------: | --------------: | ---------------: | ---------------: | | Man | 16.27 | 16.27+4.71 | 16.27+8.40 | 16.27+17.13 | 16.27+24.63 | | Woman | 16.27-5.06 | 16.27+4.71-5.06 | 16.27+8.40-5.06 | 16.27+17.13-5.06 | 16.27+24.63-5.06 | | Gender difference | -5.06 | -5.06 | -5.06 | -5.06 | -5.06 | -- * Gender differences are **forced** to be the same at every education level * Returns to degree are **forced** to be the same for men and women --- ## Interacting two categorical variables: A multiplicative model: ```r coef(lm(wages~gender*education, data=earnings)) ``` ``` ## (Intercept) genderFemale ## 15.7856064 -3.8435512 ## educationHS Diploma educationAA Degree ## 
4.8249567 8.5837026 ## educationBachelors Degree educationGraduate Degree ## 18.3010614 25.8186025 ## genderFemale:educationHS Diploma genderFemale:educationAA Degree ## -0.4080405 -0.6740955 ## genderFemale:educationBachelors Degree genderFemale:educationGraduate Degree ## -2.5447988 -2.5189415 ``` | | LHS | HS | AA | BA | Grad | | :---------------- | --: | --: | --: | --: | ---: | | Man | | | | | | | Woman | | | | | | | Gender difference | | | | | | --- ## Interacting two categorical variables: A multiplicative model: ```r coef(lm(wages~gender*education, data=earnings)) ``` ``` ## (Intercept) genderFemale ## 15.7856064 -3.8435512 ## educationHS Diploma educationAA Degree ## 4.8249567 8.5837026 ## educationBachelors Degree educationGraduate Degree ## 18.3010614 25.8186025 ## genderFemale:educationHS Diploma genderFemale:educationAA Degree ## -0.4080405 -0.6740955 ## genderFemale:educationBachelors Degree genderFemale:educationGraduate Degree ## -2.5447988 -2.5189415 ``` | | LHS | HS | AA | BA | Grad | | :---------------- | ----: | --: | --: | --: | ---: | | Man | 15.79 | | | | | | Woman | | | | | | | Gender difference | | | | | | --- ## Interacting two categorical variables: A multiplicative model: ```r coef(lm(wages~gender*education, data=earnings)) ``` ``` ## (Intercept) genderFemale ## 15.7856064 -3.8435512 ## educationHS Diploma educationAA Degree ## 4.8249567 8.5837026 ## educationBachelors Degree educationGraduate Degree ## 18.3010614 25.8186025 ## genderFemale:educationHS Diploma genderFemale:educationAA Degree ## -0.4080405 -0.6740955 ## genderFemale:educationBachelors Degree genderFemale:educationGraduate Degree ## -2.5447988 -2.5189415 ``` | | LHS | HS | AA | BA | Grad | | :---------------- | ---------: | --: | --: | --: | ---: | | Man | 15.79 | | | | | | Woman | 15.79-3.84 | | | | | | Gender difference | -3.84 | | | | | --- ## Interacting two categorical variables: A multiplicative model: ```r coef(lm(wages~gender*education, data=earnings)) ``` ``` ## 
(Intercept) genderFemale
##                             15.7856064                             -3.8435512
##                    educationHS Diploma                     educationAA Degree
##                              4.8249567                              8.5837026
##              educationBachelors Degree               educationGraduate Degree
##                             18.3010614                             25.8186025
##       genderFemale:educationHS Diploma        genderFemale:educationAA Degree
##                             -0.4080405                             -0.6740955
## genderFemale:educationBachelors Degree  genderFemale:educationGraduate Degree
##                             -2.5447988                             -2.5189415
```

|                   |        LHS |         HS |  AA |  BA | Grad |
| :---------------- | ---------: | ---------: | --: | --: | ---: |
| Man               |      15.79 | 15.79+4.82 |     |     |      |
| Woman             | 15.79-3.84 |            |     |     |      |
| Gender difference |      -3.84 |            |     |     |      |

---

## Interacting two categorical variables

A multiplicative model:

```r
coef(lm(wages~gender*education, data=earnings))
```

```
##                            (Intercept)                           genderFemale
##                             15.7856064                             -3.8435512
##                    educationHS Diploma                     educationAA Degree
##                              4.8249567                              8.5837026
##              educationBachelors Degree               educationGraduate Degree
##                             18.3010614                             25.8186025
##       genderFemale:educationHS Diploma        genderFemale:educationAA Degree
##                             -0.4080405                             -0.6740955
## genderFemale:educationBachelors Degree  genderFemale:educationGraduate Degree
##                             -2.5447988                             -2.5189415
```

|                   |        LHS |                   HS |  AA |  BA | Grad |
| :---------------- | ---------: | -------------------: | --: | --: | ---: |
| Man               |      15.79 |           15.79+4.82 |     |     |      |
| Woman             | 15.79-3.84 | 15.79+4.82-3.84-0.41 |     |     |      |
| Gender difference |      -3.84 |                -4.25 |     |     |      |

---

## Interacting two categorical variables

A multiplicative model:

```r
coef(lm(wages~gender*education, data=earnings))
```

```
##                            (Intercept)                           genderFemale
##                             15.7856064                             -3.8435512
##                    educationHS Diploma                     educationAA Degree
##                              4.8249567                              8.5837026
##              educationBachelors Degree               educationGraduate Degree
##                             18.3010614                             25.8186025
##       genderFemale:educationHS Diploma        genderFemale:educationAA Degree
##                             -0.4080405                             -0.6740955
## genderFemale:educationBachelors Degree  genderFemale:educationGraduate Degree
##                             -2.5447988                             -2.5189415
```

|                   |        LHS |                   HS |                   AA |                    BA |                  Grad |
| :---------------- | ---------: | -------------------: |
-------------------: | --------------------: | --------------------: |
| Man               |      15.79 |           15.79+4.82 |           15.79+8.58 |           15.79+18.30 |           15.79+25.82 |
| Woman             | 15.79-3.84 | 15.79+4.82-3.84-0.41 | 15.79+8.58-3.84-0.67 | 15.79+18.30-3.84-2.54 | 15.79+25.82-3.84-2.52 |
| Gender difference |      -3.84 |                -4.25 |                -4.51 |                 -6.38 |                 -6.36 |

---

## Two ways to view it

--

.pull-left[

#### Differences in gender gap by degree

| Degree | Gender gap |
| :----- | ---------: |
| None   |      -3.84 |
| HS     | -3.84-0.41 |
| AA     | -3.84-0.67 |
| BA     | -3.84-2.54 |
| Grad   | -3.84-2.52 |

]

--

.pull-right[

#### Differences in returns for men and women

| Degree | Men's return | Women's return |
| :----- | -----------: | -------------: |
| HS     |         4.82 |      4.82-0.41 |
| AA     |         8.58 |      8.58-0.67 |
| BA     |        18.30 |     18.30-2.54 |
| Grad   |        25.82 |     25.82-2.52 |

]

---

## Same underlying reality

<img src="module5_slides_building_models_files/figure-html/unnamed-chunk-78-1.png" width="864" style="display: block; margin: auto;" />
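
---

## Computing the gaps from the coefficients

A minimal sketch of how the gender gap and the returns to each degree could be computed directly from the multiplicative model's coefficients. The coefficient values are copied from the output on the earlier slides so that the block runs without the `earnings` data; with the data loaded, `coef(lm(wages~gender*education, data=earnings))` could replace the hard-coded vector. The object names (`b`, `degrees`, `gap`, `men_return`, `women_return`) are illustrative and do not come from the original slides.

```r
# Coefficients copied from the multiplicative model output (hard-coded assumption:
# in practice these would come from coef() on the fitted model)
b <- c("(Intercept)"                            = 15.7856064,
       "genderFemale"                           = -3.8435512,
       "educationHS Diploma"                    =  4.8249567,
       "educationAA Degree"                     =  8.5837026,
       "educationBachelors Degree"              = 18.3010614,
       "educationGraduate Degree"               = 25.8186025,
       "genderFemale:educationHS Diploma"       = -0.4080405,
       "genderFemale:educationAA Degree"        = -0.6740955,
       "genderFemale:educationBachelors Degree" = -2.5447988,
       "genderFemale:educationGraduate Degree"  = -2.5189415)

degrees <- c("HS Diploma", "AA Degree", "Bachelors Degree", "Graduate Degree")

# Main effects of education: men's return to each degree relative to LHS
men_return <- b[paste0("education", degrees)]

# Interaction terms: how much women's return differs from men's
interaction <- b[paste0("genderFemale:education", degrees)]
women_return <- men_return + interaction

# Gender gap at each level: main effect of gender plus the matching interaction
gap <- c(b["genderFemale"], b["genderFemale"] + interaction)
names(gap) <- c("LHS", "HS", "AA", "BA", "Grad")

round(gap, 2)
round(women_return, 2)
```

Rounded to two decimals, the gaps are -3.84, -4.25, -4.52, -6.39, and -6.36. The AA and BA values differ by 0.01 from the table on the earlier slide because that table adds coefficients that had already been rounded.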