# Chapter 13 Modeling continuous relationships

Most people are familiar with the concept of correlation, and in this chapter we will provide a more formal understanding of this commonly used and often misunderstood concept.

## 13.1 An example: Hate crimes and income inequality

In 2017, the website FiveThirtyEight.com published a story titled “Higher Rates Of Hate Crimes Are Tied To Income Inequality”, which discussed the relationship between the prevalence of hate crimes and income inequality in the wake of the 2016 Presidential election. The story reported an analysis of hate crime data from the FBI and the Southern Poverty Law Center, on the basis of which the authors reported:

“we found that income inequality was the most significant determinant of population-adjusted hate crimes and hate incidents across the United States”.

The data for this analysis are included in the fivethirtyeight R package, which makes it easy for us to access them. The analysis reported in the story focused on the relationship between income inequality (defined by a quantity called the Gini index) and the prevalence of hate crimes in each state.

### 13.1.1 Quantifying inequality: The Gini index

Before we look at the analysis reported in the story, it’s first useful to understand how the Gini index is used to quantify inequality. The Gini index is usually defined in terms of a curve that describes the relation between income and the proportion of the population that has income at or less than that level, known as a Lorenz curve. However, another way to think of it is more intuitive: It is the relative mean absolute difference between incomes, divided by two (from https://en.wikipedia.org/wiki/Gini_coefficient):

$G = \frac{\displaystyle{\sum_{i=1}^n \sum_{j=1}^n \left| x_i - x_j \right|}}{\displaystyle{2n\sum_{i=1}^n x_i}}$
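
As a minimal sketch of how this formula could be computed in R, the double sum can be built from the matrix of all pairwise differences. The helper name `gini` and the two income vectors are illustrative assumptions (ten people in each case), not values taken from the story's data.

```r
# Gini index as the relative mean absolute difference divided by two.
# `gini` is a hypothetical helper name; the incomes below are made-up examples.
gini <- function(x) {
  n <- length(x)
  # outer() builds the n x n matrix of all pairwise differences x_i - x_j
  abs_diffs <- abs(outer(x, x, "-"))
  sum(abs_diffs) / (2 * n * sum(x))
}

equal_incomes <- rep(40000, 10)          # everyone earns the same amount
one_rich <- c(rep(40000, 9), 40000000)   # nine equal earners and one very high earner

gini(equal_incomes)  # 0: no inequality
gini(one_rich)       # about 0.89
```

With perfectly equal incomes every pairwise difference is zero, so the index is zero; concentrating most of the income in one person pushes the index toward one.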

Figure 13.1 shows the Lorenz curves for several different income distributions. The top left panel (A) shows an example with 10 people where everyone has exactly the same income. The intervals between points are all the same length, indicating that each person earns an identical share of the total income in the population. The top right panel (B) shows an example where income is normally distributed. The bottom left panel shows an example with high inequality; everyone has equal income ($40,000) except for one person, who has an income of $40,000,000. According to the US Census, the United States had a Gini index of 0.469 in 2010, falling roughly halfway between our normally distributed and maximally unequal examples.

## 13.3 Covariance and correlation

One way to quantify the relationship between two variables is the covariance. Remember that variance for a single variable is computed as:

$s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}$

This tells us how far, on average, the observations fall from their mean (in squared units). Covariance tells us whether there is a relation between the deviations of two different variables across observations. It is defined as:

$covariance = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

This value will be far from zero when x and y are both highly deviant from the mean; if they are deviant in the same direction then the covariance is positive, whereas if they are deviant in opposite directions the covariance is negative. Let’s look at a toy example first.

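    # create data for toy example of covariance
    library(tidyverse)  # provides tibble(), mutate(), and %>%
    library(pander)     # provides pander() for printing tables

    df <-
      tibble(x = c(3, 5, 8, 10, 12)) %>%
      # y is x plus rounded random noise, so the values shown below are one realization
      mutate(y = x + round(rnorm(n = 5, mean = 0, sd = 2))) %>%
      mutate(
        y_dev = y - mean(y),
        x_dev = x - mean(x)
      ) %>%
      mutate(crossproduct = y_dev * x_dev)

    pander(df)

| x | y | y_dev | x_dev | crossproduct |
|---|---|-------|-------|--------------|
| 3 | 1 | -6.6 | -4.6 | 30.36 |
| 5 | 3 | -4.6 | -2.6 | 11.96 |
| 8 | 8 | 0.4 | 0.4 | 0.16 |
| 10 | 12 | 4.4 | 2.4 | 10.56 |
| 12 | 14 | 6.4 | 4.4 | 28.16 |

    # compute covariance
    sprintf("sum of cross products = %.2f", sum(df$crossproduct))
    ## [1] "sum of cross products = 81.20"
    covXY <- sum(df$crossproduct) / (nrow(df) - 1)
    sprintf("covariance: %.2f", covXY)
    ## [1] "covariance: 20.30"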

We don’t usually use the covariance to describe relationships between variables, because it varies with the overall level of variance in the data. Instead, we would usually use the correlation coefficient (often referred to as Pearson’s correlation after the statistician Karl Pearson). The correlation is computed by scaling the covariance by the standard deviations of the two variables:

$r = \frac{covariance}{s_xs_y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)s_x s_y}$
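
To illustrate why the scaling matters, the following sketch reuses the x and y values from the toy table above and changes the units of x by a factor of 100. The factor and the variable names are arbitrary choices for illustration; the point is only that the covariance changes with the units while the correlation does not.

```r
# Toy values copied from the x and y columns of the table above
x <- c(3, 5, 8, 10, 12)
y <- c(1, 3, 8, 12, 14)

# Covariance and correlation computed directly from their definitions
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
r_xy <- cov_xy / (sd(x) * sd(y))

# Rescale x, as if it had been recorded in different units
x_scaled <- 100 * x
cov_scaled <- cov(x_scaled, y)
r_scaled <- cor(x_scaled, y)

c(cov = cov_xy, cov_scaled = cov_scaled)  # 20.3 versus 2030
c(r = r_xy, r_scaled = r_scaled)          # identical correlations
```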

    # compute the correlation coefficient
    corXY <- sum(df$crossproduct) / ((nrow(df) - 1) * sd(df$x) * sd(df$y))
    sprintf("correlation coefficient = %.2f", corXY)
    ## [1] "correlation coefficient = 0.99"

We can also compute the correlation value easily using the cor() function in R:

    # compute r using built-in function
    corBuiltin <- cor(df$x, df$y)
    sprintf("correlation coefficient = %.2f", corBuiltin)
    ## [1] "correlation coefficient = 0.99"

The correlation coefficient is useful because it varies between -1 and 1 regardless of the nature of the data. As we saw in the previous chapter on effect sizes, a correlation of 1 indicates a perfect positive linear relationship, a correlation of -1 indicates a perfect negative linear relationship, and a correlation of zero indicates no linear relationship.

We can compute the correlation coefficient for the hate crime data:

    corGiniHC <- cor(
      hateCrimes$gini_index,
      hateCrimes$gini_index,
      method = "spearman")
    corTestSpearman
    ##
    ##  Spearman's rank correlation rho
    ##
    ## data:  hateCrimes$avg_hatecrimes_per_100k_fbi and hateCrimes$gini_index
    ## S = 20000, p-value = 0.8
    ## alternative hypothesis: true rho is not equal to 0
    ## sample estimates:
    ##   rho
    ## 0.033

Now we see that the correlation is no longer significant (and in fact is very near zero), suggesting that the claims of the FiveThirtyEight blog post may have been incorrect due to the effect of the outlier.

### 13.3.3 Bayesian correlation analysis

We can also analyze the FiveThirtyEight data using Bayesian analysis, which has two advantages. First, it provides us with a posterior probability: in this case, the probability that the correlation value exceeds zero. Second, the Bayesian estimate combines the observed evidence with a prior, which has the effect of regularizing the correlation estimate, effectively pulling it towards zero. Here we can compute it using the jzs_cor function from the BayesMed package.

    bayesCor <- jzs_cor(
      hateCrimes$avg_hatecrimes_per_100k_fbi,
      hateCrimes$gini_index
    )
    ## Compiling model graph
    ##    Resolving undeclared variables
    ##    Allocating nodes
    ## Graph information:
    ##    Observed stochastic nodes: 50
    ##    Unobserved stochastic nodes: 4
    ##    Total graph size: 230
    ##
    ## Initializing model
    bayesCor
    ## $Correlation
    ## [1] 0.41
    ##
    ## $BayesFactor
    ## [1] 11
    ##
    ## $PosteriorProbability
    ## [1] 0.92

Notice that the correlation estimated using the Bayesian method is slightly smaller than the one estimated using the standard correlation coefficient, because the estimate combines the evidence with the prior, which effectively shrinks the estimate toward zero. However, notice that the Bayesian analysis is not robust to the outlier, and it still says that there is fairly strong evidence that the correlation is greater than zero.
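
The sensitivity to a single outlier discussed above can be illustrated with a small constructed example. The data below are made up for illustration (they are not the hate crime data): ten points arranged symmetrically so that their correlation is exactly zero, plus one extreme point added afterwards.

```r
# Hypothetical data: y is symmetric around the middle of x, so there is no linear trend
x <- 1:10
y <- c(2, 4, 1, 5, 3, 3, 5, 1, 4, 2)

pearson_without <- cor(x, y)                       # 0 by construction
spearman_without <- cor(x, y, method = "spearman") # 0 by construction

# Add one extreme observation far from the rest of the data
x_out <- c(x, 30)
y_out <- c(y, 30)

pearson_with <- cor(x_out, y_out)                        # about 0.92
spearman_with <- cor(x_out, y_out, method = "spearman")  # about 0.25

round(c(pearson = pearson_with, spearman = spearman_with), 2)
```

Because the rank-based Spearman correlation only sees that the added point is the largest value, not how far away it is, it moves much less than the Pearson correlation does.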

## 13.4 Correlation and causation

When we say that one thing causes another, what do we mean? There is a long history in philosophy of discussion about the meaning of causality, but in statistics one way that we commonly think of causation is in terms of experimental control. That is, if we think that factor X causes factor Y, then manipulating the value of X should also manipulate the value of Y.

In medicine, there is a set of ideas known as Koch’s postulates which have historically been used to determine whether a particular organism causes a disease. The basic idea is that the organism should be present in people with the disease, and not present in those without it – thus, a treatment that eliminates the organism should also eliminate the disease. Further, infecting someone with the organism should cause them to contract the disease. An example of this was seen in the work of Dr. Barry Marshall, who had a hypothesis that stomach ulcers were caused by a bacterium (Helicobacter pylori). To demonstrate this, he infected himself with the bacterium, and soon thereafter developed severe inflammation in his stomach. He then treated himself with an antibiotic, and his stomach soon recovered. He later won the Nobel Prize in Medicine for this work.

Often we would like to test causal hypotheses but we can’t actually do an experiment, either because it’s impossible (“What is the relationship between human carbon emissions and the earth’s climate?”) or unethical (“What are the effects of severe abuse on child brain development?”). However, we can still collect data that might be relevant to those questions. For example, in the latter example, we can potentially collect data from children who have been abused as well as those who have not, and we can then ask whether their brain development differs.

Let’s say that we did such an analysis, and we found that abused children had poorer brain development than non-abused children. Would this demonstrate that abuse causes poorer brain development? No. Whenever we observe a statistical association between two variables, it is certainly possible that one of those two variables causes the other. However, it is also possible that both of the variables are being influenced by a third variable; in this example, it could be that child abuse is associated with family stress, which could also cause poorer brain development through less intellectual engagement, food stress, or many other possible avenues. The point is that a correlation between two variables generally tells us that something is causing something else, but it doesn’t tell us what is causing what. As the statistician Edward Tufte says, “Correlation does not imply causation, but it’s a pretty good hint.”
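
The third-variable scenario can be sketched with simulated data. Everything here is an assumption made for illustration: the generic variables `x`, `y`, and `z`, the sample size, and the simple additive model in which `z` influences both `x` and `y` while neither affects the other.

```r
set.seed(1)
n <- 5000

# Hypothetical third variable that influences both measured variables
z <- rnorm(n)

# x and y each depend on z but have no direct effect on one another
x <- z + rnorm(n)
y <- z + rnorm(n)

# Marginal correlation: clearly positive (0.5 in expectation for this setup)
r_xy <- cor(x, y)

# Remove the linear effect of z from each variable, then correlate what is left
x_resid <- resid(lm(x ~ z))
y_resid <- resid(lm(y ~ z))
r_xy_given_z <- cor(x_resid, y_resid)

round(c(marginal = r_xy, given_z = r_xy_given_z), 2)
```

In this simulation the association between `x` and `y` largely disappears once `z` is accounted for, even though the raw correlation is substantial. With real data we rarely know or measure every such third variable, which is why the raw correlation alone cannot settle the causal question.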

### 13.4.1 Causal graphs

One useful way to describe causal relations between variables is through a causal graph, which shows variables as circles and causal relations between them as arrows. For example, Figure 13.5 shows the causal relationships between study time and two variables that we think should be affected by it: exam grades and exam finishing times.

Figure 13.5: A graph showing causal relationships between three variables: study time, exam grades, and exam finishing time. A green arrow represents a positive relationship (i.e. more study time causes exam grades to increase), and a red arrow represents a negative relationship (i.e. more study time causes faster completion of the exam).

However, in reality the effects on finishing time and grades are not due directly to the amount of time spent studying, but rather to the amount of knowledge that the student gains by studying. We would usually say that knowledge is a latent variable – that is, we can’t measure it directly but we can see it reflected in variables that we can measure (like grades and finishing times). Figure 13.6 shows this.

Figure 13.6: A graph showing the same causal relationships as above, but now also showing the latent variable (knowledge) using a square box.

Here we would say that knowledge mediates the relationship between study time and grades/finishing times. That means that if we were able to hold knowledge constant (for example, by administering a drug that causes immediate forgetting), then the amount of study time should no longer have an effect on grades and finishing times.

Note that if we simply measured exam grades and finishing times we would generally see a negative relationship between them, because people who finish exams the fastest in general get the highest grades. However, if we were to interpret this correlation as a causal relation, this would tell us that in order to get better grades, we should actually finish the exam more quickly! This example shows how tricky the inference of causality from non-experimental data can be.
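
A small simulation can make the causal graph above concrete. This is only a sketch under assumed linear relationships with arbitrary unit effects and noise; in a real study knowledge would be latent, but in a simulation we can generate it and therefore hold it constant statistically.

```r
set.seed(2)
n <- 5000

# Assumed causal chain: study time -> knowledge -> (grades, finishing time)
study_time <- rnorm(n)
knowledge <- study_time + rnorm(n)     # latent mediator
grades <- knowledge + rnorm(n)         # more knowledge, higher grades
finish_time <- -knowledge + rnorm(n)   # more knowledge, faster finish

# Grades and finishing time are negatively correlated,
# although neither one causes the other in this simulation
r_grades_finish <- cor(grades, finish_time)

# Mediation: study time predicts grades on its own...
coef_total <- coef(lm(grades ~ study_time))["study_time"]
# ...but adds almost nothing once knowledge is held constant
coef_direct <- coef(lm(grades ~ study_time + knowledge))["study_time"]

round(c(r_grades_finish = r_grades_finish,
        total = unname(coef_total),
        direct = unname(coef_direct)), 2)
```

The negative correlation between grades and finishing time arises entirely from their shared dependence on knowledge, and the effect of study time on grades vanishes when knowledge is included in the model, which is what it means for knowledge to mediate the relationship.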

Within statistics and machine learning, there is a very active research community that is currently studying the question of when and how we can infer causal relationships from non-experimental data. However, these methods often require strong assumptions, and must generally be used with great caution.