class: center, middle, inverse, title-slide # Statistical Inference ## Sociology 312 ### Aaron Gullickson ### University of Oregon ### 2019-08-09 --- class: inverse, center, middle background-image: url(images/tim-bennett-EoC_IuYmtug-unsplash.jpg) background-size: cover # The Problem of Statistical Inference --- ## What percent of Americans favor ending birthright citizenship? -- We can look at the results from our politics dataset: ```r 100*round(prop.table(table(politics$brcitizen)),3) ``` ``` ## ## Oppose Neither Favor ## 39.6 28.6 31.7 ``` About 31.7% of respondents to the American National Election Study (ANES) favored ending birthright citizenship. -- * Is the percent of *all* Americans also 31.7%? -- * The ANES is a **sample** of the US voting age **population**. How confident can we be that the sample percent is close to the true population percent? --- ## Drawing a statistical inference <img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-3-1.png" width="720" style="display: block; margin: auto;" /> --- ## Drawing a statistical inference <img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-4-1.png" width="720" style="display: block; margin: auto;" /> --- ## Drawing a statistical inference <img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-5-1.png" width="720" style="display: block; margin: auto;" /> --- ## Parameters and statistics .pull-left[ ### Parameters * Parameters represent unknown measures in the population, such as the population mean or proportion * Parameters are represented by Greek letters (e.g. the population mean is `\(\mu\)`) ### Statistics * Statistics represent known measurements from the sample that estimate the unknown population parameters. * Statistics are represented by roman letters (e.g. the sample mean `\(\bar{x}\)`) ] .pull-right[ | Measure | Parameter | Statistic | |:-------------------:|:---------:|:---------:| | mean | `\(\mu\)` | `\(\bar{x}\)` | | proportion | `\(\rho\)` | `\(\hat{p}\)` | | standard deviation | `\(\sigma\)` | `\(s\)` | ] --- ## When samples go bad 👿 -- .pull-left[ ### Systematic Bias .center[![systematic darts](images/systematic_darts.png)] Something about our data collection procedure biases our results systematically. * We made a mistake in our research design. * Statistical inference cannot fix this mistake. ] -- .pull-right[ ### Random Bias .center[![random darts](images/random_darts.png)] Just by random chance we happened to draw a sample that is very different from the population on the parameter we care about. * We didn't do anything wrong! We just had bad luck. * Statistical inference addresses this form of bias. ] ??? * Even though we know random bias *might* have caused our sample statistic to be different from the population mean, we can't know for sure if this happened or not because we don't know the true population parameter! * This is what statistical inference is all about. Even though we can never say for certain that our results are close or far away from the true value, we can quantify our uncertainty about how different the sample statistic could potentially be from the population parameter. --- class: inverse, center, middle background-image: url(images/kai-pilger-qHfJPxOnXi4-unsplash.jpg) background-size: cover # The Sampling Distribution --- ## Three kinds of distributions You draw a simple random sample of 100 people from the US population and calculate their mean years of education. 
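In R, the whole process might look like the sketch below. Here `pop_educ` is a hypothetical, simulated stand-in for the population; in practice we never observe it:

```r
# a made-up population of years of education (simulated for illustration only)
pop_educ <- rnorm(1000000, mean = 13.5, sd = 3)

# draw a simple random sample of 100 people and calculate the sample mean
my_sample <- sample(pop_educ, size = 100)
mean(my_sample)
```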
There are three kinds of distributions involved in this process:

--

.pull-left[
### Population Distribution
* The distribution of years of education for the whole US population.
* Its mean is given by `\(\mu\)`.
* The population mean and distribution are unknown.
]

--

.pull-right[
### Sample Distribution
* The distribution of years of education in your sample.
* The mean is given by `\(\bar{x}\)`.
* The mean and distribution are known and hopefully approximate the population distribution.
]

--

### Sampling Distribution
* The distribution of the sample mean `\(\bar{x}\)` in all possible samples of size 100.
* We can't know this distribution exactly, but it turns out that we know its general shape.

---

## Example: Height in our class

.pull-left[
* Let's treat our class of 42 students as the population. I want to estimate the average height of the class.
* In this case, I am omniscient: I know the population distribution because I collected data for the whole class on Canvas.
]

.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" />
]

---

## How many samples of size 2 are possible?

Let's say I wanted to sample two students to estimate class height. In a class of 42 students, how many unique samples of size 2 exist?

--

* On the first draw, I have 42 possibilities.

--

* On the second draw, I have 41 possibilities because I am not putting my first draw back.

--

* I therefore have `\(42*41=1722\)` possible samples.

--

* However, half of these samples are just duplicates of the other half, sampled in the opposite order. In one sample, I sampled John and then Kate, and in another I sampled Kate and then John.

--

* Therefore, the true number of unique samples is:

`$$42*41/2=861$$`

--

* What if I calculated the sample mean in all 861 samples and looked at the distribution of these sample means?

---

## The sampling distribution

<img src="module4_slides_statistical_inference_files/figure-html/hist-samplingd-size2-1.png" width="864" style="display: block; margin: auto;" />

---

## Sampling distributions for different `\(n\)`

<img src="module4_slides_statistical_inference_files/figure-html/compare-density-1.png" width="864" style="display: block; margin: auto;" />

---

## What is the mean of the sampling distributions?

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Distribution </th>
   <th style="text-align:right;"> Mean </th>
   <th style="text-align:right;"> Standard Deviation </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Population Distribution </td>
   <td style="text-align:right;"> 66.52 </td>
   <td style="text-align:right;"> 4.87 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sampling Distribution (n=2) </td>
   <td style="text-align:right;"> 66.52 </td>
   <td style="text-align:right;"> 3.36 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sampling Distribution (n=3) </td>
   <td style="text-align:right;"> 66.52 </td>
   <td style="text-align:right;"> 2.71 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sampling Distribution (n=4) </td>
   <td style="text-align:right;"> 66.52 </td>
   <td style="text-align:right;"> 2.32 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sampling Distribution (n=5) </td>
   <td style="text-align:right;"> 66.52 </td>
   <td style="text-align:right;"> 2.04 </td>
  </tr>
</tbody>
</table>

---

## It's the Central Limit Theorem!

.pull-left[
As the sample size increases, the sampling distribution of a sample mean becomes a **normal** distribution.
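You can verify the `\(n=2\)` row of the table on the previous slide by brute force. A sketch, assuming a hypothetical vector `heights` containing all 42 class heights:

```r
# every unique sample of size 2 from the class: choose(42, 2) = 861 of them
pairs <- combn(heights, 2)
sample_means <- colMeans(pairs)

mean(sample_means) # matches the population mean exactly
sd(sample_means)   # smaller than the population standard deviation
```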
* The normal distribution is a bell-shaped curve with two characteristics: center and spread.
* Centered on `\(\mu\)`, which is the true value in the population.
* With a spread (standard deviation) of `\(\sigma/\sqrt{n}\)`, where `\(\sigma\)` is the standard deviation in the population.
* The center of the sampling distribution is the true value of the parameter, and the spread of the sampling distribution shrinks as the sample grows larger.
]

.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />
]

---

## The Standard Error

There are three different kinds of standard deviations involved here, one corresponding to each type of distribution.

| Distribution | Notation | Description |
|----------------|----------|-------------|
| Population | `\(\sigma\)` | Unknown population standard deviation |
| Sample | `\(s\)` | Known sample standard deviation that hopefully approximates `\(\sigma\)` |
| Sampling | `\(\sigma/\sqrt{n}\)` | **Standard error**: standard deviation of the sampling distribution |

The standard error gives us an estimate of the strength of potential random bias in our sample.

---

## Sampling distributions are the 🔑 concept

.pull-left[
* When we draw a sample and calculate the sample mean, we are effectively drawing a value from the sampling distribution of the sample mean.
* If we know what that distribution looks like, then we can know the probability of drawing a sample close to or far from the true population parameter.
]

.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-9-1.png" width="504" style="display: block; margin: auto;" />
]

---

## But there is a catch!

* The shape of the sampling distribution is determined by:
    * the population mean, `\(\mu\)`
    * the population standard deviation, `\(\sigma\)`

--

* But these values are unknown! 😮

--

.pull-left[
### First Fix
* We can substitute the sample standard deviation `\(s\)` from our sample for the population standard deviation `\(\sigma\)`.
* This has consequences. Because we are using a sample value that can also be subject to random bias, this substitution creates greater uncertainty in our estimate, which we will address later.
]

--

.pull-right[
### Second Fix
* **Confidence Intervals**: Provide a range of values within which you feel confident that the true population mean resides.
* **Hypothesis tests**: Play a game of make-believe. If the true population mean were a given value, what is the probability that I would get the sample mean value that I actually did?
]

---

class: inverse, center, middle
background-image: url(images/yogi-purnama-en7G3hTSjBQ-unsplash.jpg)
background-size: cover

# Confidence Intervals

---

## Consider this statement

.pull-left[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" />
]

--

.pull-right[
### Reverse the Logic
If I construct the following interval:

`$$\bar{x}\pm1.96*\sigma/\sqrt{n}$$`

then for 95% of *all possible samples* that I could have drawn, this interval will contain the true population mean `\(\mu\)`.

<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Confidence?

We call the interval `\(\bar{x}\pm1.96*\sigma/\sqrt{n}\)` the **confidence interval**. What does it mean?
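Before answering, here is a simulation sketch of the "95% of all possible samples" claim from the previous slide, using a made-up population where `\(\mu\)` and `\(\sigma\)` are known:

```r
# draw 10,000 samples and check how often the interval captures mu
mu <- 13.5; sigma <- 3; n <- 100
covered <- replicate(10000, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  abs(xbar - mu) <= 1.96 * sigma / sqrt(n)
})
mean(covered) # very close to 0.95
```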
--

.pull-left[
### Not a probability
* It is tempting to claim that there is a 95% probability that the true population mean is in this interval, but according to the classical view of probability this is INCORRECT.
* The true population mean does not vary. It just is what it is, even if it is unknown. Either your interval contains it or it does not. There is no probability.
* The correct interpretation is that "95% of all possible confidence intervals will contain the true population mean."
]

--

.pull-right[
![super confused](images/super_confused.jpg)
]

---

## Calculating the confidence interval

The confidence interval is given by `\(\bar{x}\pm1.96*\sigma/\sqrt{n}\)`. But we don't know `\(\sigma\)` because this is the population standard deviation. What can we do?

--

### Substitute the sample standard deviation

We *can* calculate `\(\bar{x}\pm1.96*s/\sqrt{n}\)`. However, this equation is no longer correct, because we need to adjust for the added uncertainty of using a sample statistic where we should use a population parameter.

--

### Use the t-statistic as a fudge factor

The actual formula we want is:

`$$\bar{x} \pm t*s/\sqrt{n}$$`

where `\(t\)` is the **t-statistic** and will be a number somewhat larger than 1.96.

---

## Calculating the t-statistic
The t-statistic you get depends on two characteristics:

--

* What level of confidence you want. We will always use 95% confidence intervals for this class.

--

* The number of **degrees of freedom** for the statistic. This is largely a function of sample size. For the sample mean, the degrees of freedom are given by `\(n-1\)`.

--

In *R*, you can calculate the t-statistic with the `qt` command. Let's say we wanted the t-statistic for our crime data with 51 observations:

```r
qt(.975, 51-1)
```

```
## [1] 2.008559
```

* Although we want a 95% confidence interval, we put in 0.975 because we are getting the cutoff for the upper tail of the distribution, which has only 2.5% of the area above it.
* The second argument is the degrees of freedom.

---

## Example: Property crime rates

.pull-left[
First, calculate the statistics we need:

```r
mean(crimes$Property)
```

```
## [1] 2894.004
```

```r
sd(crimes$Property)
```

```
## [1] 641.5065
```

```r
nrow(crimes)
```

```
## [1] 51
```

Now we can calculate the t-statistic and standard error:

```r
tstat <- qt(.975, 51-1)
se <- 641.5/sqrt(51)
```
]

--

.pull-right[
The upper limit is given by:

```r
2894+tstat*se
```

```
## [1] 3074.425
```

The lower limit is given by:

```r
2894-tstat*se
```

```
## [1] 2713.575
```

We are 95% confident that the true mean property crime rate across states is between 2713.6 and 3074.4 crimes per 100,000.
]

---

## 🤔 Wait, does that even make sense?

> We are 95% confident that the true mean property crime rate across states is between 2713.6 and 3074.4 crimes per 100,000.

We did the math right, but this statement is still nonsense. Why?

--

* The crime data are **not a sample**.
* We have all fifty states plus the District of Columbia, so we have the entire population.
* There is nothing to infer. The mean crime rate across states of 2894 per 100,000 is already the population mean.

--

### Statistical inference only makes sense for samples

.pull-left[
#### Proper sample
* Popularity data (Add Health)
* Politics data (ANES)
* Sexual frequency data (GSS)
* Earnings data (CPS)
]

.pull-right[
#### Not a sample
* Titanic
* Crimes
* Movies
]

---

## Example: Sexual frequency

.pull-left[
Calculate the numbers that we need for later:

```r
xbar <- mean(sex$sexf)
s <- sd(sex$sexf)
n <- nrow(sex)
se <- s/sqrt(n)
t <- qt(.975, n-1)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> results </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> sample mean </td>
   <td style="text-align:right;"> 50.127 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sample standard deviation </td>
   <td style="text-align:right;"> 53.597 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sample size (n) </td>
   <td style="text-align:right;"> 2103.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> standard error </td>
   <td style="text-align:right;"> 1.169 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> t-statistic </td>
   <td style="text-align:right;"> 1.961 </td>
  </tr>
</tbody>
</table>
]

--

.pull-right[
Now calculate the interval:

```r
xbar+t*se
```

```
## [1] 52.41873
```

```r
xbar-t*se
```

```
## [1] 47.83472
```

I am 95% confident that the mean sexual frequency in the US is between 47.8 and 52.4 times per year.
]

---

## General form of the confidence interval

We can construct confidence intervals for any statistic whose sampling distribution is a normal distribution.
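This general recipe is small enough to write as an R helper. A sketch; `ci95` is a hypothetical name, not something we use elsewhere in the course:

```r
# 95% confidence interval: (sample statistic) +/- t * (standard error)
ci95 <- function(stat, se, df) {
  t <- qt(0.975, df)
  c(lower = stat - t * se, upper = stat + t * se)
}

# reproduces the sexual frequency interval from the previous slide
ci95(50.127, 1.169, 2103 - 1)
```

Every interval that follows is this same recipe with a different standard error and degrees of freedom.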
These statistics include:

* means
* mean differences
* proportions
* differences in proportions
* correlation coefficients

The general form of the confidence interval is given by:

`$$\texttt{(sample statistic)} \pm t*(\texttt{standard error})$$`

The only trick is knowing how to calculate the standard error and the degrees of freedom for the t-statistic for each particular statistic.

---

## Cheat sheet for SE and df

| Type | SE | df for `\(t\)` |
| :---------------------- | :------------------------------------------------------------------------------- | :---------------------- |
| Mean | `\(s/\sqrt{n}\)` | `\(n-1\)` |
| Proportion | `\(\sqrt\frac{\hat{p}*(1-\hat{p})}{n}\)` | `\(n-1\)` |
| Mean Difference | `\(\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\)` | min( `\(n_1-1\)`, `\(n_2-1\)` ) |
| Proportion Difference | `\(\sqrt{\frac{\hat{p}_1*(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2*(1-\hat{p}_2)}{n_2}}\)` | min( `\(n_1-1\)`, `\(n_2-1\)` ) |
| Correlation Coefficient | `\(\sqrt{\frac{1-r^2}{n-2}}\)` | `\(n-2\)` |

.center[### 😱 So many equations!?! Don't panic; we will go through examples.]

---

## Proportion example

What proportion of voters support removing birthright citizenship?

.pull-left[
Use `prop.table` to get the sample proportion:

```r
prop.table(table(politics$brcitizen))
```

```
## 
##    Oppose   Neither     Favor 
## 0.3964134 0.2864559 0.3171307
```

```r
p <- 0.3171307
```

Calculate values:

```r
n <- nrow(politics)
se <- sqrt(p*(1-p)/n)
t <- qt(0.975, n-1)
```
]

.pull-right[
Get the confidence interval:

```r
p+t*se
```

```
## [1] 0.3311453
```

```r
p-t*se
```

```
## [1] 0.3031161
```

I am 95% confident that the true proportion of American adults who support removing birthright citizenship is between 0.303 and 0.331 (that is, between 30.3% and 33.1%).
]

---

## Mean difference example

What is the difference in sexual frequency between married and never-married individuals?

Use `tapply` to get means by group and then calculate the difference you want:

```r
tapply(sex$sexf, sex$marital, mean)
```

```
##       Married       Widowed      Divorced     Separated Never married 
##     56.094106      9.222628     41.388720     55.652778     53.617704
```

```r
diff <- 56.094106-53.617704
```

Use `tapply` again to calculate standard deviations by group for the SE calculation:

```r
#use tapply again to get sd by group
tapply(sex$sexf, sex$marital, sd)
```

```
##       Married       Widowed      Divorced     Separated Never married 
##      49.95361      26.61206      55.40738      58.21485      58.81459
```

```r
sd1 <- 49.95361
sd2 <- 58.81459
```

---

## Mean difference example, continued

Use `table` to get the sample size of each group:

```r
table(sex$marital)
```

```
## 
##       Married       Widowed      Divorced     Separated Never married 
##          1052           137           328            72           514
```

```r
n1 <- 1052
n2 <- 514
```

.pull-left[

```r
se <- sqrt(sd1^2/n1+sd2^2/n2)
t <- qt(0.975,n2-1)
diff+t*se
```

```
## [1] 8.403469
```

```r
diff-t*se
```

```
## [1] -3.450665
```
]

.pull-right[
I am 95% confident that, among American adults, married individuals on average have sex between 3.5 times fewer and 8.4 times more per year than never-married individuals.
]

---

## Proportion difference example

What is the difference in support for removing birthright citizenship between those who have served in the military and those who have not?
.pull-left[
Use `prop.table` to calculate proportions for each group:

```r
prop.table(table(politics$brcitizen, politics$military), 2)
```

```
##          
##                  No       Yes
##   Oppose  0.4069860 0.3093682
##   Neither 0.2900238 0.2570806
*##   Favor   0.3029902 0.4335512
```

```r
p1 <- 0.3029902
p2 <- 0.4335512
diff <- p2-p1
```
]

.pull-right[
Use `table` to get the sample sizes of the groups:

```r
table(politics$military)
```

```
## 
##   No  Yes 
## 3779  459
```

```r
n1 <- 3779
n2 <- 459
```
]

---

## Proportion difference example, continued

Calculate the standard error and t-statistic:

```r
se <- sqrt(p1*(1-p1)/n1+p2*(1-p2)/n2)
t <- qt(0.975, n2-1)
```

Confidence interval:

```r
diff-t*se
```

```
## [1] 0.08279002
```

```r
diff+t*se
```

```
## [1] 0.178332
```

I am 95% confident that support for removing birthright citizenship is between 8.3 and 17.8 percentage points higher among those who have served in the military than among those who have not.

---

## Correlation coefficient example

What is the correlation between age and wages among US workers?

.pull-left[
Use the `cor` command to get the sample correlation coefficient:

```r
r <- cor(earnings$age, earnings$wages)
```

Use `nrow` to get the sample size, and then calculate the standard error and t-statistic:

```r
n <- nrow(earnings)
se <- sqrt((1-r^2)/(n-2))
t <- qt(0.975, n-2)
```
]

.pull-right[
Calculate the confidence interval:

```r
r - t*se
```

```
## [1] 0.2135195
```

```r
r + t*se
```

```
## [1] 0.2235427
```

I am 95% confident that the true correlation coefficient between age and wages among US workers is between 0.214 and 0.224.
]

---

class: inverse, center, middle
background-image: url(images/chuttersnap-UmncJq4KPcA-unsplash.jpg)
background-size: cover

# Hypothesis Tests

---

## Game of make-believe

.pull-left[
We know what the sampling distribution should look like, but we don't know its center (the true population parameter). So we set up a game of make-believe:

* Assume that the true parameter is some value.
* If that assumption is correct, what is the probability that I would have gotten the sample statistic that I got?
* If this probability is really low, then I reject my assumption.
]

.pull-right[
![Winnie the Pooh and Christopher Robin](images/winnie_pooh.jpg)
]

---

## An almost true story

.pull-left[
![Coke bottle](images/coke.png)
]

.pull-right[
Coca-Cola used to run promotions in which it claimed that 1 in 12 bottle caps (8.3%) would win a free Coke.

When I was a busy assistant professor trying to get tenure, I bought 100 bottles of Coke from the downstairs vending machine and only got 5 winners (5%). (The number is not true, but it is nice and round.)

Does the difference between my winning percentage and the one claimed by Coca-Cola show that they were lying?
]

---

## Let's set up a null hypothesis

--

.pull-left[
### In English
* The **null hypothesis** ( `\(H_0\)` ) is your assumption about the true parameter value. It is your prior assumption unless the data can prove you wrong.
* I assume that Coca-Cola is telling the truth until I can prove them wrong, so my null hypothesis is that the true percentage of winning bottle caps is 8.3%.
]

--

.pull-right[
### Mathematical symbols

`$$H_0: \rho=0.083$$`

I use the Greek `\(\rho\)` to indicate the population proportion of winners. I will use `\(\hat{p}\)` later to represent the proportion observed in my sample.
]

---

## Assuming the null hypothesis is true, what is the sampling distribution of my sample proportion?
.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-34-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Assuming the null hypothesis is true, what is the sampling distribution of my sample proportion?

.pull-left[
* With a sample size of 100, it should be normally distributed.
]

.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-35-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Assuming the null hypothesis is true, what is the sampling distribution of my sample proportion?

.pull-left[
* With a sample size of 100, it should be normally distributed.
* The center of the distribution is the true population parameter, assuming `\(H_0\)` is true. In this case, that is 0.083.
]

.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-36-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Assuming the null hypothesis is true, what is the sampling distribution of my sample proportion?

.pull-left[
* With a sample size of 100, it should be normally distributed.
* The center of the distribution is the true population parameter, assuming `\(H_0\)` is true. In this case, that is 0.083.
* As we learned in the previous section, the standard error is given by:

`$$\sqrt\frac{0.083*(1-0.083)}{100}=0.0276$$`
]

.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-37-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Is the actual sample proportion unusual?

.pull-left[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-38-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
#### How far is our sample proportion from where the center would be if the null hypothesis were true?

* If our sample proportion is far away and unlikely to be drawn, then we **reject the null hypothesis**.
* If our sample proportion is not far away and reasonably likely to be drawn, then we **fail to reject the null hypothesis**.
]

---

## How far is far enough?

.pull-left[
* We determine how far our sample proportion is from the center in terms of the number of standard errors.

`$$\frac{\hat{p}-\rho}{SE}=\frac{0.05-0.083}{0.028}=-1.18$$`
]

.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-39-1.png" width="504" style="display: block; margin: auto;" />
]

---

## How far is far enough?

.pull-left[
* We determine how far our sample proportion is from the center in terms of the number of standard errors.

`$$\frac{\hat{p}-\rho}{SE}=\frac{0.05-0.083}{0.028}=-1.18$$`

* What proportion of sample proportions are this low or lower?
]

.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-40-1.png" width="504" style="display: block; margin: auto;" />
]

---

## How far is far enough?

.pull-left[
* We determine how far our sample proportion is from the center in terms of the number of standard errors.

`$$\frac{\hat{p}-\rho}{SE}=\frac{0.05-0.083}{0.028}=-1.18$$`

* What proportion of sample proportions are this low or lower?
* We also need to take account of sample proportions this far away in the opposite direction. This is called a **two-tailed test**.
]

.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-41-1.png" width="504" style="display: block; margin: auto;" />
]

---

## How far is far enough?
.pull-left[
* We determine how far our sample proportion is from the center in terms of the number of standard errors.

`$$\frac{\hat{p}-\rho}{SE}=\frac{0.05-0.083}{0.028}=-1.18$$`

* What proportion of sample proportions are this low or lower?
* We also need to take account of sample proportions this far away in the opposite direction. This is called a **two-tailed test**.
* The area in the tails is called the **p-value**.
]

.pull-right[
<img src="module4_slides_statistical_inference_files/figure-html/unnamed-chunk-42-1.png" width="504" style="display: block; margin: auto;" />
]

---

## The p-value is the endgame

.pull-left[
### Interpretation

The p-value tells you the probability of getting a statistic this far away or farther from the assumed true population parameter, assuming the null hypothesis is true. In my case:

> Assuming the null hypothesis is true, there is a 24% probability that I would have gotten a sample proportion (0.05) this far or farther from the true population parameter (0.083).
]

--

.pull-right[
### Calculation
We use the `pt` command to get the area in the lower tail and multiply by two to get both tails:

```r
2*pt(-1.18,99)
```

```
## [1] 0.2408278
```

* The first argument is always the **negative version** of the number of standard errors.
* The second argument is the degrees of freedom, which adjusts for the fact that we use sample standard deviations to get standard errors.
]

---

## The critical value

.left-column[
![strongman game](images/strongman-game.png)
]

.right-column[
* We will reject the null hypothesis if our p-value is low enough.
* The **critical value** is the benchmark for how low our p-value has to be in order to reject.
* We will reject the null hypothesis if the p-value is lower than or equal to the critical value.
* We will fail to reject the null hypothesis if the p-value is higher than the critical value.
* The standard but entirely arbitrary critical value used across the sciences is 0.05 (5%).
* For the Coca-Cola bottle cap case, the p-value is 0.24, so we fail to reject the null hypothesis that Coca-Cola's claim was truthful.
]

---

## The general procedure of hypothesis testing

--

1. State a **null hypothesis**.

--

2. Calculate a **test statistic** that tells you how different your sample is from what you would expect under the null hypothesis. For our purposes, this test statistic is always the number of standard errors above or below the center of the sampling distribution.

--

3. Calculate the **p-value** for the given test statistic.

--

4. Based on the p-value, either **reject** or **fail to reject** the null hypothesis.

---

## Hypothesis tests of relationships

The hypothesis test we are most interested in is whether the association we observe between two variables in our sample holds in the population.

--

* Mean/proportion differences: Are the means/proportions of two groups in the population different? In other words, is the mean/proportion difference non-zero?

--

* Correlation coefficient: Is the correlation coefficient in the population non-zero?

--

### Statistical Significance

.left-column[
![stat sig other](images/statsigother.png)
]

.right-column[
If you reject the null hypothesis of "no association," then the association you observe in the sample is said to be **statistically significant**.

* Don't confuse statistical and substantive significance. In a large sample, even very small substantive associations can be found to be statistically significant. On the flip side, in small samples, very large substantive associations can fail to be statistically significant.
]

---

## Example: Mean differences

Is there a difference in sexual frequency between married and never-married individuals? Formally, my null hypothesis is:

`$$H_0: \mu_M-\mu_N=0$$`

where `\(\mu_M\)` is the population mean sexual frequency of married individuals and `\(\mu_N\)` is the population mean sexual frequency of never-married individuals.

--

```r
tapply(sex$sexf, sex$marital, mean)
```

```
##       Married       Widowed      Divorced     Separated Never married 
##     56.094106      9.222628     41.388720     55.652778     53.617704
```

```r
diff <- 56.094106-53.617704
diff
```

```
## [1] 2.476402
```

In my sample, married individuals have sex about 2.5 more times per year than never-married individuals, on average. Is this difference far enough from zero to reject the null hypothesis?

---

## Example: Mean differences, continued

I calculate the standard error of the sample mean difference, as per the formula in the previous section, which requires the sample SDs and sample sizes of both groups.
```r
tapply(sex$sexf, sex$marital, sd)
```

```
##       Married       Widowed      Divorced     Separated Never married 
##      49.95361      26.61206      55.40738      58.21485      58.81459
```

```r
sd1 <- 49.95361
sd2 <- 58.81459
table(sex$marital)
```

```
## 
##       Married       Widowed      Divorced     Separated Never married 
##          1052           137           328            72           514
```

```r
n1 <- 1052
n2 <- 514
se <- sqrt(sd1^2/n1+sd2^2/n2)
```

---

## Example: Mean differences, continued

I can now calculate how many standard errors my sample mean difference is from zero:

```r
diff/se
```

```
## [1] 0.8208339
```

I then feed the negative version of this number into the `pt` function and multiply by two to get my p-value:

```r
2*pt(-0.8208,n2-1)
```

```
## [1] 0.4121414
```

Assuming that the true sexual frequency difference between married and never-married individuals is zero in the population, there is a 41.2% chance of observing a sample sexual frequency difference of 2.5 times per year or larger in absolute magnitude in a sample of this size. Thus, I **fail to reject** the null hypothesis that there is no difference in the average sexual frequency between married and never-married individuals in the US.

---

## Example: Proportion differences

Is there a difference in support for removing birthright citizenship between those who have served in the military and those who have not?

--

.pull-left[

```r
prop.table(table(politics$brcitizen, politics$military), 2)
```

```
##          
##                  No       Yes
##   Oppose  0.4069860 0.3093682
##   Neither 0.2900238 0.2570806
*##   Favor   0.3029902 0.4335512
```

```r
p1 <- 0.303
p2 <- 0.434
```

```r
diff <- p2-p1
diff
```

```
## [1] 0.131
```
]

--

.pull-right[

```r
table(politics$military)
```

```
## 
##   No  Yes 
## 3779  459
```

```r
n1 <- 3779
n2 <- 459
se <- sqrt(p1*(1-p1)/n1+p2*(1-p2)/n2)
diff/se
```

```
## [1] 5.393601
```

```r
2*pt(-5.393601, n2-1)
```

```
## [1] 1.118225e-07
```
]

---

## 🤔 A p-value of 1.118225e-07?

* What does the value of 1.118225e-07 mean?

--

* The number is so small that R is reporting it using scientific notation:

`$$1.118225 \times 10^{-7}$$`

--

* That means we need to move the decimal place over 7 spaces to the left, so the number is really 0.0000001118225. I would interpret my result as:

--

> Assuming that, in the US population, there is no difference in support for removing birthright citizenship between those who have served in the military and those who have not, there is less than a 0.00002% chance of observing a sample difference in proportions of 13.1% or greater in absolute magnitude in a sample of this size. Thus, I **reject** the null hypothesis that there is no difference in support for removing birthright citizenship between those who have served in the military and those who have not.

---

## Example: Correlation coefficient

Is there a relationship between a person's age and their wages in the US?

--

.pull-left[

```r
r <- cor(earnings$age, earnings$wages)
r
```

```
## [1] 0.2185311
```

```r
n <- nrow(earnings)
se <- sqrt((1-r^2)/(n-2))
r/se
```

```
## [1] 85.46473
```

```r
2*pt(-85.46473, n-2)
```

```
## [1] 0
```
]

--

.pull-right[
Assuming no association between a person's age and their wages in the US, there is almost a 0% chance of observing a correlation coefficient between age and wages of 0.219 or larger in absolute magnitude in a sample of this size. Therefore, I **reject** the null hypothesis that there is no relationship between a person's age and their wages in the US population.
]
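As a sanity check on hand calculations like this one, base R has built-in tests. For example, `cor.test` runs the same t-test of a zero correlation (its confidence interval uses a slightly different method, so expect it to differ a bit from the hand-rolled interval):

```r
# built-in test of H0: the population correlation is zero;
# reports the t-statistic, degrees of freedom, and p-value
cor.test(earnings$age, earnings$wages)
```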