Chapter 9 Correlation Coefficient

We’ve discussed several ways to reduce data, that is, to summarize the key features of many observations using one number or a few numbers.

A histogram visually shows the density in chosen bins.
The average tells us the location of a set of observations. Remember the seesaw analogy.
The SD tells us the scale (or spread or dispersion) of a set of observations. We can describe a list of numbers as being “about [the average] give or take [the SD].”

The correlation coefficient allows us to describe the relationship between two variables.

Before, we compared variables by comparing their histograms, averages, or SDs. The correlation coefficient is our first summary that compares two variables directly (rather than summarizing just one).

9.1 Intuition

The correlation coefficient measures how well two variables “go together.”

“Go together” means “as one goes up, the other goes up [or down].”
“Go together” has linearity built into the meaning. The correlation coefficient does not describe curved relationships.
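To see why the linearity matters, here is a small sketch using made-up numbers (not data from this chapter). In it, y is completely determined by x, but the relationship follows a U-shaped curve rather than a line, so the correlation coefficient comes out at essentially zero.

```r
# made-up illustration: y depends perfectly on x, but along a curve
x <- seq(-3, 3, by = 0.5)
y <- x^2

# the correlation coefficient is essentially zero
cor(x, y)

# the scatterplot shows the clear (but curved) relationship
plot(x, y)
```

A correlation near zero here does not mean the variables are unrelated; it only means there is no linear tendency for one to rise as the other rises.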

The figure below shows some scatterplots and how well I might say these variables go together.

However, I am firmly opposed to any rules that link particular correlation coefficients to strength of relationship.

Imagine the following studies:

A study comparing two measures of the same concept.
A study comparing the effect of a dose of vitamin D in the first hour after birth on lifespan.

A “weak” or “small” correlation in the first study would be impossibly large in the second. The interpretation of the strength of a relationship must be made by a substantive expert in a particular substantive context.

I use two guidelines to interpret a correlation coefficient:

0.9 seems a lot stronger than 0.7, but 0.4 seems barely stronger than 0.2.
Around 0.4 [-0.4], a correlation becomes “easily noticeable” without studying the plot carefully. For smaller datasets, this threshold increases toward 1 [-1]; for larger datasets, the threshold shrinks toward 0.
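One way to explore the second guideline yourself is to simulate datasets of different sizes with roughly the same correlation and compare the scatterplots. The sketch below is only an illustration: the helper simulate_xy(), the seed, and the sample sizes are my own assumptions, not part of any dataset in this chapter.

```r
# hypothetical helper: simulate n points whose correlation is roughly r
simulate_xy <- function(n, r) {
  x <- rnorm(n)
  y <- r * x + sqrt(1 - r^2) * rnorm(n)
  data.frame(x = x, y = y)
}

set.seed(1234)  # arbitrary seed so the sketch is reproducible
small_df <- simulate_xy(n = 20, r = 0.4)
large_df <- simulate_xy(n = 2000, r = 0.4)

# sample correlations vary around 0.4, especially in the small dataset
cor(small_df$x, small_df$y)
cor(large_df$x, large_df$y)

# side-by-side scatterplots
old_par <- par(mfrow = c(1, 2))
plot(small_df$x, small_df$y, main = "n = 20")
plot(large_df$x, large_df$y, main = "n = 2,000")
par(old_par)
```

Try other values of r and n and judge for yourself where the pattern becomes easy to see.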

Exercise 9.1 Guess the correlation coefficient for each scatterplot below.

Solution

dataset	r
Dataset 1	-0.60
Dataset 2	0.45
Dataset 3	0.90
Dataset 4	0.45
Dataset 5	0.55
Dataset 6	0.55
Dataset 7	0.10
Dataset 8	0.85
Dataset 9	0.85
Dataset 10	0.35
Dataset 11	0.60
Dataset 12	0.80

9.2 Computing

Suppose we have the dataset below.

x	y
1	10
3	15
2	12
4	13
5	18

9.2.1 By Hand

We can compute the correlation coefficient \(r\) as follows:

\(r = \text{average of} \left[ (x \text{ in standard units}) \times (y \text{ in standard units}) \right]\)

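Using \(\overline{x}\) to represent the average of \(x\), \(x_i\) to represent the \(i\)th observation of \(x\) (and similarly for \(y\)), and \(n\) to represent the number of observations (5, in this case), we have

\(r = \dfrac{1}{n} \displaystyle \sum_{i = 1}^{n} \left[ \dfrac{x_i - \overline{x}}{\sqrt{\frac{1}{n} \sum_{j = 1}^{n} (x_j - \overline{x})^2}} \times \dfrac{y_i - \overline{y}}{\sqrt{\frac{1}{n} \sum_{j = 1}^{n} (y_j - \overline{y})^2}} \right]\).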

We can implement this formula by creating the little table below and then averaging the final column of products.

x	y	x in SUs	y in SUs	product
1	10	-1.41	-1.32	1.87
3	15	0.00	0.51	0.00
2	12	-0.71	-0.59	0.41
4	13	0.71	-0.22	-0.16
5	18	1.41	1.61	2.28

The average of the final column is 0.88.

9.2.2 With R

In R, we can compute the correlation between x and y using cor(x, y).

Note that dropping missing values is more complicated for pairs of data. If you want to drop missing values from the calculations, then cor(x, y, use = "pairwise.complete.obs") is a good choice.
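As a small sketch of what this option does, the made-up vectors below (x_miss and y_miss are hypothetical, not from any dataset in this chapter) have missing values in different positions. With two vectors, use = "pairwise.complete.obs" keeps only the pairs where both values are observed.

```r
# made-up vectors with missing values in different positions
x_miss <- c(1, 3, NA, 4, 5)
y_miss <- c(10, 15, 12, NA, 18)

# by default, any missing value makes the result NA
cor(x_miss, y_miss)

# keep only the pairs where both values are observed
cor(x_miss, y_miss, use = "pairwise.complete.obs")

# the same calculation, dropping incomplete pairs by hand
keep <- complete.cases(x_miss, y_miss)
cor(x_miss[keep], y_miss[keep])
```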

We can use the code below to find the correlation for the five-observation dataset introduced at the start of this section.

x <- c(1, 3, 2, 4, 5)
y <- c(10, 15, 12, 13, 18)

cor(x, y)

[1] 0.8814089
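The by-hand recipe can also be written out step by step in R, which is a useful check on the little table. This is a sketch: the helper pop_sd() is my own name for an SD that divides by \(n\), as in the formula in the previous section. R's built-in sd() divides by \(n - 1\) instead, but cor() gives the same answer because the divisors cancel in the ratio.

```r
x <- c(1, 3, 2, 4, 5)
y <- c(10, 15, 12, 13, 18)

# hypothetical helper: SD that divides by n, matching the by-hand formula
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

# convert each variable to standard units
x_su <- (x - mean(x)) / pop_sd(x)
y_su <- (y - mean(y)) / pop_sd(y)

# multiply the standard units pair by pair
products <- x_su * y_su

# reproduce the little table, rounded to two decimal places
round(data.frame(x, y, x_su, y_su, products), 2)

# r is the average of the products; compare with cor()
mean(products)
cor(x, y)
```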

Exercise 9.2 Compute the correlation coefficient between each pair of the three variables below. Check your work with R.

x	y	z
2	8	7
4	0	3
5	5	5
6	3	6
4	6	6
3	5	3

9.3 Interpreting

In general, a correlation coefficient is NOT particularly useful. I introduce it for two reasons:

Other people use it.
We use it to obtain more useful quantities.

However, the correlation coefficient \(r\) has a concrete interpretation: If \(x\) is one SD larger, then \(y\) is \(r\) SDs larger on average.

We might also say that “a one SD increase in \(x\) leads to an \(r\) SD increase in \(y\) on average,” but we must take care that “leads to” describes a pattern in the data and does not describe a causal relationship.
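One way to make “on average” concrete is to fit a least-squares line after converting both variables to standard units; the slope of that line equals \(r\). The sketch below uses the small dataset from the Computing section and assumes the least-squares line as the summary of the average pattern. The object names are my own.

```r
x <- c(1, 3, 2, 4, 5)
y <- c(10, 15, 12, 13, 18)

# convert to standard units (scale() uses the n - 1 SD; the slope is unaffected)
x_su <- as.numeric(scale(x))
y_su <- as.numeric(scale(y))

# slope of the least-squares line in standard units
slope_su <- unname(coef(lm(y_su ~ x_su))["x_su"])
slope_su
cor(x, y)  # same number

# back in the original units, the slope is r times SD(y) / SD(x)
slope_units <- unname(coef(lm(y ~ x))["x"])
slope_units
cor(x, y) * sd(y) / sd(x)  # same number
```

As in the text, both slopes describe a pattern in the data, not a causal effect.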

9.4 Example: Clark and Golder (2006)

For a substantive example, consider Clark and Golder’s data.

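# load packages (read_rds(), %>%, group_by(), and ggplot() come from the tidyverse)
library(tidyverse)

# load parties dataset
parties_df <- read_rds("data/parties.rds")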

# compute correlation between enep and eneg for each electoral system
cor_df <- parties_df %>%
  group_by(electoral_system) %>%
  summarize(cor = cor(enep, eneg))

electoral_system	cor
Single-Member District	0.04
Small-Magnitude PR	0.45
Large-Magnitude PR	-0.02

ggplot(parties_df, aes(x = eneg, y = enep)) + 
  geom_point(alpha = 0.5) + 
  facet_wrap(vars(electoral_system)) + 
  geom_label(data = cor_df, aes(x = Inf, y = Inf, label = paste0("cor = ", round(cor, 2))),
             hjust = 1.1, vjust = 1.1) + 
  theme_bw()

As Clark and Golder expect, we get a correlation coefficient near zero in SMD systems. But contrary to their expectation, we also get a correlation coefficient near zero in large-magnitude PR systems.

Exercise 9.3 Interpret the correlation for small-magnitude PR systems above by filling in the following blanks:

A one SD increase in ENEG leads to a _____ SD increase in ENEP, on average.
A _____ unit increase in ENEG leads to a _____ unit increase in ENEP, on average.

Hint

How many units is one SD for ENEG? What about for ENEP? Going from SDs to the original units is like going from feet to yards: you just need to know how many feet are in a yard (or how many SDs are in each original unit).

9.5 Example: Feeling Thermometers

Below, I compute the correlation between feelings toward the Democratic and Republican parties. It makes sense that this correlation should be negative: as respondents’ feelings toward the Democratic party grow warmer, their feelings toward the Republican party should grow cooler. We might also expect this correlation to be stronger among more educated respondents and to change over time.

The example below uses the therms dataset in the pos5737data package available on GitHub.

# get pos5737data (if updated)
devtools::install_github("pos5737/pos5737data")

# load data
data(therms, package = "pos5737data")

# quick look
glimpse(therms)

Rows: 38,100
Columns: 4
$ year                <dbl> 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1…
$ ft_democratic_party <dbl> 80, 50, 40, 60, 85, 50, 70, NA, 60, NA, NA, 70, 8…
$ ft_republican_party <dbl> 50, 50, 60, 60, 60, 50, 40, NA, 60, NA, NA, 40, 8…
$ education           <fct> High School, 8th Grade or Less, High School, High…

# compute correlation between the two feeling thermometers for
#   each year and education level
smry_df <- therms %>%
  # drop observations where education is missing
  drop_na(education) %>%
  # compute correlation for each year-education subset
  group_by(year, education) %>%
  summarize(cor = cor(x = ft_democratic_party, 
                      y = ft_republican_party, 
                      use = "pairwise.complete.obs")) %>%
  # complete the dataset by right-joining a dataset that has all year-education combinations
  right_join(crossing(year = unique(therms$year), 
                      education = unique(therms$education))) %>%
  # add a variable for presidential elections--if the year is evenly divisible by 4
  mutate(election_type = ifelse(test = year %% 4 == 0, 
                                yes = "Presidential Election", 
                                no = "Congressional Election")) %>%
  glimpse()

Rows: 102
Columns: 4
Groups: year [17]
$ year          <dbl> 1978, 1978, 1978, 1978, 1978, 1978, 1980, 1980, 1980, 1…
$ education     <fct> 8th Grade or Less, Some High School, High School, High …
$ cor           <dbl> -0.219932553, -0.153398674, -0.154494861, -0.059715780,…
$ election_type <chr> "Congressional Election", "Congressional Election", "Co…

# plot correlations
ggplot(smry_df, aes(x = year, y = cor, color = election_type)) + 
  geom_point() + geom_line() + 
  facet_wrap(vars(education))

Exercise 9.4 Read the excerpt from Clark, Golder, and Golder on pp. 477-478. Download the gamson dataset from the data page. Compute the correlation coefficient \(r\) between seat and portfolio shares and create a scatterplot of the two. Comment briefly.

Solution

# load data
gamson_df <- read_rds("data/gamson.rds") 

# compute correlation coefficient
cor(x = gamson_df$seat_share, y = gamson_df$portfolio_share)

[1] 0.9423176

# create scatterplot
ggplot(gamson_df, aes(x = seat_share, y = portfolio_share)) + 
  geom_point()

Exercise 9.5 Use devtools::install_github("pos5737/pos5737data") to get the latest version of the pos5737data package. Load the dataset anscombe into R with data(anscombe, package = "pos5737data"). Use glimpse(anscombe) to get a quick look at the data. Note that this one data frame actually contains four different datasets stacked on top of each other and numbered I, II, III, and IV.

Use a group_by() and summarize() workflow to compute a correlation coefficient for each of the four datasets. How do they compare? What do they suggest about the strength of the relationship between \(x\) and \(y\)?
Create a scatterplot of \(x\) and \(y\) with separate panels (e.g., facet_wrap()) for each dataset. How do they compare? How would you describe the strength of the relationship between \(x\) and \(y\) in each panel? Would you say that the correlation coefficient offered a good summary of each dataset?