What Does Data Look Like?

The data that we look at typically take the format of a “spreadsheet” with rows and columns. The table below shows some characteristics of four randomly drawn passengers from the Titanic, in this type of spreadsheet format.

Table 1: Data on four passengers from the Titanic
survival  sex     age      agegroup  pclass  fare      family
Survived  Female  24.0000  Adult     First   69.3000   0
Died      Male    24.0000  Adult     Third   7.7958    0
Survived  Male    0.9167   Child     First   151.5500  3
Died      Male    60.0000  Adult     First   26.5500   0

Even among these four passengers, we can see variation in who survived, which passenger class they traveled in, and their sex and age. We also have a measure of the fare they paid for the trip (in English pounds) and the number of family members traveling with them. To understand how to think about data, we need to understand the concepts of an observation and a variable and the distinction between them.

The observations

The observations are the rows of your dataset. In the Titanic example, the observations are individual passengers on board the Titanic, but observations can take many different forms.

We use the term unit of analysis to describe what kind of observation you have in your dataset. If you are interviewing individual people and recording their responses, then the unit of analysis is an individual person. If you are collecting cross-national data by country, then the unit of analysis is a country. If you are analyzing data on the “best colleges in the US,” then the unit of analysis is a college or university. The most common unit of analysis that we will see in this course is an individual person, but several of our datasets involve other units of analysis. It is important to keep in mind that an observation can be many different kinds of things.

The variables

The variables are the columns of your dataset. Variables measure specific attributes of your observations. If you conduct a survey of individual people and ask them for their age, gender, and education, then these three attributes would be recorded as variables in your dataset. We refer to them as “variables” because they can take different values across the observations. If you were to conduct a survey of individual people and ask your respondents if they are human, then you probably wouldn’t have a proper variable because everyone would likely respond “yes” and there would be no variation (although we can’t necessarily rule out Jedi).
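The row/column distinction can be sketched in a few lines of Python. This is only an illustration: the values are copied from Table 1, while the helper names `column` and `varies` and the extra `human` column are hypothetical and are not part of the Titanic data.

```python
# Table 1 as a list of observations (rows).
# Each dictionary key is a variable (column).
passengers = [
    {"survival": "Survived", "sex": "Female", "age": 24.0,
     "agegroup": "Adult", "pclass": "First", "fare": 69.3, "family": 0},
    {"survival": "Died", "sex": "Male", "age": 24.0,
     "agegroup": "Adult", "pclass": "Third", "fare": 7.7958, "family": 0},
    {"survival": "Survived", "sex": "Male", "age": 0.9167,
     "agegroup": "Child", "pclass": "First", "fare": 151.55, "family": 3},
    {"survival": "Died", "sex": "Male", "age": 60.0,
     "agegroup": "Adult", "pclass": "First", "fare": 26.55, "family": 0},
]


def column(rows, name):
    """Return the values of one variable across all observations."""
    return [row[name] for row in rows]


def varies(rows, name):
    """True if the variable takes more than one distinct value."""
    return len(set(column(rows, name))) > 1


# Hypothetical survey question "are you human?": everyone answers "yes".
with_human = [dict(row, human="yes") for row in passengers]

print(len(passengers))                # 4 observations
print(column(passengers, "pclass"))   # ['First', 'Third', 'First', 'First']
print(varies(passengers, "pclass"))   # True
print(varies(with_human, "human"))    # False: no variation, so not a proper variable
```

Storing each observation as one dictionary keeps all of a passenger's attributes together, and pulling out a single key across the rows gives you one variable.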

There are two major types of variables. Some variables measure quantities of something and thus can be represented by a number. We refer to these as quantitative variables. Other variables indicate a category to which the observation belongs. We refer to these as categorical variables.

Quantitative variables

Quantitative variables measure quantities of something. A person’s height, a worker’s hourly wage, the number of children that a woman has given birth to, a country’s gross domestic product, a US state’s poverty rate, and the percentage of a university’s student body made up of women are all examples of quantitative variables. They can all be represented by a number which indicates how much of the thing the observation has.

There are two important sub-types of quantitative variables. Discrete variables can logically only take certain values within a range, while continuous variables can logically take any value within a range. The most common example of a discrete variable is a count variable. The number of children that a woman has given birth to is an example of a count variable. This number can only take whole-number (integer) values such as 0, 1, 2, 3, and so on. It makes no sense for a respondent to say they have given birth to 2.5 children. Count variables are discrete variables because only whole numbers are logical responses.

A person’s height is an example of a continuous variable. It is true that we only measure height down to a certain level of precision, typically inches in the United States. We might think that if we were to measure a person’s height in inches, it would only take whole number values and would therefore be discrete. But limitations in measurement don’t define whether a variable is continuous or discrete. Rather, the distinction is whether the value could logically be measured to any degree of precision. We often measure height out to half inches, and we could imagine that if we had a precise enough measurement instrument, we could measure a person’s height out to any decimal level that we desired. So, it is perfectly sensible for someone to say they are 69.825467 inches tall, even though we might think they are being a bit tedious.

Note that in both the height and the number of children examples, there are logical limits to the values. You can’t have negative children or negative height. There are no exact upper limits to the values that either can take, but we would likely think we have a data coding error if we saw a report of a 20-foot-tall person or a woman who gave birth to 50 children. This is what I mean by the phrase “within a range” above. Both discrete and continuous variables can be limited in the range of values that they can take. What distinguishes them from each other is what values they can logically take within that limited range.
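As a rough sketch, the difference between the two sub-types can be written as two checks in Python. The function names and the upper limits used here (30 children, 120 inches) are arbitrary assumptions chosen only for illustration. As noted above, there is no exact upper limit for either variable.

```python
def is_plausible_count(value, max_plausible=30):
    """Count variables are discrete: only non-negative whole numbers
    within a plausible range make sense. max_plausible is an assumed cutoff."""
    is_whole = float(value).is_integer()
    return is_whole and 0 <= value <= max_plausible


def is_plausible_height(inches, max_plausible=120.0):
    """Height is continuous: any positive value within a plausible range
    is logical, decimals included. max_plausible is an assumed cutoff."""
    return 0 < inches <= max_plausible


print(is_plausible_count(2))            # True
print(is_plausible_count(2.5))          # False: not a whole number
print(is_plausible_count(50))           # False: likely a coding error
print(is_plausible_height(69.825467))   # True: tedious but sensible
print(is_plausible_height(240))         # False: a 20-foot person
```

Both checks enforce a range, but only the count check also requires a whole number. That extra requirement is what makes the variable discrete.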

Categorical variables

Categorical variables are not represented by numerical quantities but rather by a set of mutually exclusive categories to which observations can belong. The gender, race, political party affiliation, and highest educational degree of a person, the public/private status of a university, and the passenger class of a passenger on the Titanic are all examples of categorical variables.

There are also two sub-types of categorical variables. Ordinal variables are categorical variables where the categories have an explicit ordered structure, while nominal variables are categorical variables where the categories are unordered. Highest educational degree is an example of an ordinal variable because it is ordered such that Graduate Degree > BA Degree > AA Degree > High School Diploma > No degree. Passenger class is also an ordinal variable that starts in Third class (or steerage, think Leonardo DiCaprio) and ends in First class (think Kate Winslet), with a Second class in between.
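One way to make the ordering of an ordinal variable explicit is to list its categories from lowest to highest and compare positions in that list. The sketch below uses the two orderings described above. The names `DEGREE_ORDER`, `PCLASS_ORDER`, `rank`, and `sort_ordinal` are hypothetical and chosen only for this example.

```python
# Categories listed from lowest to highest, following the text.
DEGREE_ORDER = ["No degree", "High School Diploma", "AA Degree",
                "BA Degree", "Graduate Degree"]
PCLASS_ORDER = ["Third", "Second", "First"]


def rank(value, order):
    """Position of a category within its ordering (0 is the lowest)."""
    return order.index(value)


def sort_ordinal(values, order):
    """Sort category labels by their position in the ordering."""
    return sorted(values, key=order.index)


print(rank("Graduate Degree", DEGREE_ORDER) > rank("BA Degree", DEGREE_ORDER))  # True
print(sort_ordinal(["First", "Third", "First", "Second"], PCLASS_ORDER))
# ['Third', 'Second', 'First', 'First']
```

The ranks tell us only which category comes before another. They do not say that the gap between Third and Second class is the same size as the gap between Second and First.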

Race, gender, and political party affiliation are all examples of nominal variables. The categories here have no ordering to them. While some people might have their own political party preferences, these sorts of normative evaluations of categories are irrelevant. For the same reason, even the variable of survival on the Titanic is a nominal variable. We don’t judge the value of life and death.
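Because nominal categories have no order, about all we can do with them is check whether two observations fall in the same category and count how many observations fall in each one. A minimal sketch, using the survival and sex columns of Table 1 (the helper names `frequencies` and `most_common_category` are hypothetical):

```python
from collections import Counter

# Values copied from the survival and sex columns of Table 1.
survival = ["Survived", "Died", "Survived", "Died"]
sex = ["Female", "Male", "Male", "Male"]


def frequencies(values):
    """Count how many observations fall into each category."""
    return dict(Counter(values))


def most_common_category(values):
    """Return the category with the largest count."""
    return Counter(values).most_common(1)[0][0]


print(frequencies(survival))       # {'Survived': 2, 'Died': 2}
print(frequencies(sex))            # {'Female': 1, 'Male': 3}
print(most_common_category(sex))   # Male
```

Nothing in this code ranks “Survived” above “Died” or “Female” above “Male”. The categories are treated only as labels.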