Understanding Data

# Understanding Data
## Sociology 312
### Aaron Gullickson
### University of Oregon
### 2019-08-09

---

background-image: url(images/markus-spiske-666905-unsplash.jpg)
background-size: cover

# What Does Data Look Like?

---

## Most data looks like a spreadsheet

<table>
<caption>Four passengers on the Titanic</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> survival </th>
   <th style="text-align:left;"> sex </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:left;"> agegroup </th>
   <th style="text-align:left;"> pclass </th>
   <th style="text-align:right;"> fare </th>
   <th style="text-align:right;"> family </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Survived </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 24.0000 </td>
   <td style="text-align:left;"> Adult </td>
   <td style="text-align:left;"> First </td>
   <td style="text-align:right;"> 69.3000 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Died </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 24.0000 </td>
   <td style="text-align:left;"> Adult </td>
   <td style="text-align:left;"> Third </td>
   <td style="text-align:right;"> 7.7958 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Survived </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 0.9167 </td>
   <td style="text-align:left;"> Child </td>
   <td style="text-align:left;"> First </td>
   <td style="text-align:right;"> 151.5500 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Died </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 60.0000 </td>
   <td style="text-align:left;"> Adult </td>
   <td style="text-align:left;"> First </td>
   <td style="text-align:right;"> 26.5500 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>

- **Observations** are on the rows. Observations are the units from which you are taking measurements.

- **Variables** are on the columns. Variables measure specific attributes of your observations.

---

## The unit of analysis identifies the kind of observation

The **unit of analysis** simply tells you what your observations actually are. We can collect data on many different types of units.

|  Individual People | Countries | Corporations | Universities |
| --- | ----------- | ------ | 
| ![crowd](images/crowd.jpg) | ![countries](images/countries.jpg) | ![logos](images/logos.jpg) | ![pac12](images/pac12.png)|

---

## What is the unit of analysis in each case?

1. Titanic data on passengers?
2. Cross-national data on CO2 emissions?

![titanic passengers](images/titanic-passengers.jpg)]
]

![countries-co2](images/country-co2.png)]
]

---

## Quantitative variables measure quantities

A quantitative variable measures quantities of something. A quantitative variable is always represented as a number. There are two types of quantitative variables.

A **discrete** variable can only take certain values within a range.

- Number of children ever had
- Number of violent crimes committed
- Number of siblings
- Number of Youtube views
]

A **continuous** variable can take any value within a given range.

- Age
- Height
- GDP per capita
- CO2 emissions

]

???

- The most common case of a discrete variable is a  count of things and thus can only be represented by positive whole numbers.

- In practice, we often only measure these variables to a certain decimal level (half inches for height, whole numbers for age), but we could theoretically measure them more precisely with better instruments.

---

## Categorical variables specify a category

Categorical variables indicate which category an observation belongs to from a mutually exclusive set of categories. There are also two types of categorical variables:

An **ordinal** variable is a categorical variables whose values have a clear ordering.

- Highest degree earned
- passenger class on the Titanic
- level of support with an opinion statement
]

A **nominal** variable is a categorical variable whose values are unordered.

- Gender
- Race
- Political party

]

---

## What type of variables do we have here?

<table>
<caption>four passengers on the Titanic</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> survival </th>
   <th style="text-align:left;"> sex </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:left;"> agegroup </th>
   <th style="text-align:left;"> pclass </th>
   <th style="text-align:right;"> fare </th>
   <th style="text-align:right;"> family </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Survived </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 24.0000 </td>
   <td style="text-align:left;"> Adult </td>
   <td style="text-align:left;"> First </td>
   <td style="text-align:right;"> 69.3000 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Died </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 24.0000 </td>
   <td style="text-align:left;"> Adult </td>
   <td style="text-align:left;"> Third </td>
   <td style="text-align:right;"> 7.7958 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Survived </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 0.9167 </td>
   <td style="text-align:left;"> Child </td>
   <td style="text-align:left;"> First </td>
   <td style="text-align:right;"> 151.5500 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Died </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 60.0000 </td>
   <td style="text-align:left;"> Adult </td>
   <td style="text-align:left;"> First </td>
   <td style="text-align:right;"> 26.5500 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>

--
- **survival:** survival is a nominal variable.

--
- **sex:** sex is a nominal variable.

--
- **age:** age is a continuous variable.

--
- **agegroup:** age group is an ordinal variable.

--
- **pclass:** Passenger class is an ordinal variable.

--
- **fare:** fare paid is a continuous variable (sort of).

--
- **family:** number of family members is a discrete variable.

---

### Think of a dataset measuring something you might care about.

- What are the units of analysis?
- What might a continuous variable in this dataset be?
- What might a discrete variable in this dataset be?
- What might a nominal variable in this dataset be?
- What might an ordinal variable in this dataset be?
]

---

</iframe>

---

layout: false
class: middle, center, inverse
background-image: url(images/isaac-smith-1182057-unsplash.jpg)
background-size: cover

# What Can We Do With Data?

---

## We can look at the distribution of a variable

.pull-left[
<img src="module1_slides_understanding_data_files/figure-html/dist-variable1-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<img src="module1_slides_understanding_data_files/figure-html/dist-variable2-1.png" width="504" style="display: block; margin: auto;" />
]

???

In some cases, I might simply be interested in the **distribution** of a variable. We refer to these kinds of measures as **univariate statistics** where "univariate" is a fancy way to say "one variable."

- How much variability is there in the amount of money that movies make?
- What percent of passengers survived the Titanic disaster?
- What is the average age of voters in the United States?

---

## We can look at association between variables

.pull-left[
<img src="module1_slides_understanding_data_files/figure-html/measure-assoc1-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<img src="module1_slides_understanding_data_files/figure-html/measure-assoc2-1.png" width="504" style="display: block; margin: auto;" />
]

???

We are generally most interested in questions about the relationship or association between two or more variables than in a single variable itself. There are multiple ways to measure this association, depending on the type of variables involved.

- Did the probability of surviving the Titanic depend on passenger class?
- Do the earnings of movies vary by genre?
- Is income inequality in a state related to its crime rate?

---

## We can make statistical inferences

.pull-left[
<img src="module1_slides_understanding_data_files/figure-html/stat-infer1-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<img src="module1_slides_understanding_data_files/figure-html/stat-infer2-1.png" width="504" style="display: block; margin: auto;" />
]

???

- If I told you that in a sample of twenty people, brown-eyed individuals made $5 more an hour than people with other eye color, would you believe that I had found an important source of inequality? 
- Probably not, because with a sample that small, its possible to get unusual results like this just because of the sample you happened to draw. 
- Statistical inference is the technique of quantifying how uncertain we are about our statistical results. It allows us not only to summarize our data, but give a sense of how well we think that data might represent the larger population from which we drew our sample.

---

## We can build models

]

???

In the last module of the course we will expand on what we have learned previously to build **statistical models** that can help us better understand relationships we observe in the data.

Statistical models allow us to examine relationships between multiple variables using more complex mathematical specifications. We use these models to address two types of questions:

- Do I observe a relationship between two variables, even after controlling for potential 
confounders? 
- Does the relationship between two variables differ depending on the context of a third variable?

---

## Models can get more complex

.pull-left[
<img src="module1_slides_understanding_data_files/figure-html/confounders-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<img src="module1_slides_understanding_data_files/figure-html/interactions-1.png" width="504" style="display: block; margin: auto;" />
]

???

Confounders are variables that might account for the relationship between two other variables and thus produce a **spurious** non-causal relationship. Some real questions we might ask:

- Does the gender advantage in surviving the Titanic hold even after accounting for differences in gender and survival by passenger class? 
- Do racial differences in the probability of being shot by the police hold even after accounting for situational factors such as the type of violation, time of day, place of interaction? 
- Are differences in sexual frequency by marital status a product of age differences by marital status?

We can also go beyond simple statements about the association between two variables and begin to ask how the association between two variables might differ depending on the level or value of a third variable. Some real questions we might ask:

- Does the female advantage in survival on the Titanic get smaller or larger by the class of the passenger?
- Does the relationship between age and support for gay marriage differ by political party affiliation? 
- Does the effect of movie runtime on box office returns vary by the genre of the movie?

---

## Two views on using observational data

Most of the data we use in the social sciences is **observational** rather than **experimental**. We observe what actually happens rather than manipulate "treatments" in order to observe a response.

There are two different views of exactly what the use of statistics contributes to observational data analysis:

- We use statistical modeling to try to mimic the stronger causal claims of experimental research. 
- This can be as simple as "controlling" for other variables to mimic random assignment of a treatment, to the use of "natural" experiments.
]

- We use statistical modeling to describe what we observe in the data in a formal, systematic, and replicable way. 
- We show how the data are either consistent or inconsistent with certain views of a social process as derived from theory. 
]