What Can We Do With Data?
We now know what data looks like, but what do social scientists do with data? in the first part of this course, we will learn three fundamental data analysis tasks: analysis of the distribution of a single variable, measuring association, and statistical inference. In the final part of the course, we will build on these fundamentals to learn how to build more complex statistical models.
How is a variable distributed?
Sometimes, we just want to understand what a single variable “looks like.” We may simply be interested in its “average” or we may want to know something else, like how spread out the values of the variable are. In these cases, we calculate univariate ("one variable) statistics on the distribution of a variable. Typically, univariate statistics aren’t as interesting to social scientists as the measures of association discussed below, but even then its often a good idea to look at univariate statistics to understand all of the variables in your research. In some cases, the calculation of a univariate statistics is the important question at hand. For example, when poll researchers try to figure out who is going to win an election, they are very much interested in the univariate distribution of support for each candidate, which gives the proportions of likely voters who intend to vote for each candidate. Here are some other questions we could ask in our data:
- How much variability is there in the amount of money that movies make?
- What percent of passengers survived the Titanic disaster?
- What is the average age of voters in the United States?
Measuring association
Social scientists are often most interested in the relationships, or association, between two or more variables. These associations allow us to test hypotheses about causal relationships between underlying social constructs. For example, we might be interested in whether divorce affected children’s well-being. In this case, we would want to look at the relationship between the categorical variable indicating whether a child’s parents were divorced and some measure of their well-being, such as feelings of stress, academic performance, etc. There are different ways of measuring association depending on the types of variables involved.. Here are some questions about association we could ask in our data:
- Did the probability of surviving the Titanic depend on passenger class? (categorical and categorical)
- Do the earnings of movies vary by genre? (quantitative and categorical)
- Is income inequality in a state related to its crime rate? (quantitative and quantitative)
In the first part of the course, we will learn how basic measures of association between two variables, depending on what kind of variables we are using. In the final part of the course, we will return to this topic when we learn how to build more complex statistical models discussed below.
Making statistical inferences
If I told you that in my sample of twenty people, brown-eyed individuals make $5 more than all other eye colors combined, would you believe me? You probably shouldn’t, because in a sample of twenty people, even when drawn without systematic bias, weird results like this are not unlikely just by random chance. If I told you I observed this phenomenon on a well-drawn sample of 20,000 individuals, you would probably be more likely to believe me.
The underlying concept here is called statistical inference. We often want to draw conclusions about the larger populations from which our samples are drawn. Statistical inference is the technique of quantifying how uncertain we are about whether our data are similar to the population or not. When you hear press reports on political polls with the term “margin of error,” they are referring to statistical inference.
Many introductory statistics course focus most of their attention on statistical inference, partly because it is more abstract and complex. However, statistical inference is always secondary to the basic descriptive measures of univariate, bivariate, and multivariate statistics. Therefore, I spend considerably less time on this topic than in most statistics courses, so that we can focus on the more important stuff.
Building Models
Although our basic measures of association are useful, the most common tool in social science analysis is a statistical model in which the user can specify the relationships between variables by some kind of mathematical functions. In the final module of the course, we will learn how to build basic versions of these models that allow us to look at the relationships between multiple quantitative and categorical variables. This section will build on our prior work in all of the previous modules. We will specifically focus on two uses of statistical models.
First, statistical models will allow us to “control” for other variables when we look at the association between any two variables. Controlling for other variables is important because they may be confounded with the relationship we want to measure. For example, we may be interested in the relationship between marital status (e.g. never married, married, widowed, divorced) and sexual frequency in our GSS data. However, these different groups vary significantly in their age. Never married individuals are much younger than all of the other groups and widowed individuals much older. Given the fact that sexual frequency tends to decline with age (something we will show later in this term), it seems problematic to just compare the average sexual frequency across these groups. This is because age confounds the relationship between marital status and sexual frequency. Statistical models will gives us tools to account for this problem and to get a better estimate of the relationship between marital status and sexual frequency, net of this confounder.
Second, statistical models will allow us to account for how the relationship between two variables might differ depending on the context of a third variable. This is what we call an interaction. For example, lets say we were interested in the relationship between the number of sports played and a student’s popularity (measured by friend nominations) in our Add Health data. Because of gender norms, we might expect that this relationship is different for boys and girls. We can use statistical models to empirically examine whether this is true. This kind of contextualization is an important component of sociological practice.
Observational Data, Experimental Thinking
Much of the data that we use in sociology is observational rather than experimental. In an experimental design, the researcher randomly selects subjects to receive some sort of treatment and then observes how this treatment affects some outcome. Thus, the research engages in systematic manipulation to observe a response. In observational data, the researcher does not directly manipulate the environment but rather just observes and records the social setting as it is.
Experimental data can be more powerful than observational data because the random assignment of a treatment through researcher manipulation strengthens claims of causation. If there is a relationship between treatment and response it can only come through a causal relationship or random chance. In observational data, the relationship between any two variables can be a result of a causal relationship, random chance,spuriousness. Spuriousness occurs when other variables produce the relationship between two other variables rather than them directly causing each other. The example above about marital status and sexual frequency is a simple example. If we note that widows have less sex than other people, we may be tempted to think that something about being widowed reduces someone’s sexual drive or their interactions with others. However, the more obvious explanation is that widows tend to be quite a bit older than other marital status groups and older people have less sex. Age is generating a spurious relationship between widowhood and sexual frequency. This kind of spuriousness is the reason for the frequent claim that “correlation is not causation.”
There are two different philosophical approaches to the statistical analysis of observational data where spuriousness can be a problem. The first approach approaches it as pseudo-experimental data. The goal of this approach is to try to find ways to mimic the experimental design approach with observational data. At a basic level this can include “controlling” for other variables (which we will learn) and can extend to a variety of techniques of causal modeling that are intending to use some feature of the data to recover causation (which we will not learn).
The second approach treats statistical analysis as a way to describe observed data in a formal, systematic, and replicable way. The goal is to establish to what extent the data are consistent with competing theories that seek to understand the outcome in question, rather than to mimic the experimental approach. Although quantitative and qualitative approaches are often seen as philosophically different approaches, this approach to observational data shares many features with more purely qualitative approaches to data analysis. This is the approach that I take in this course.